5. Exploratory Data Analysis#

Warning

I decided not to actively pursue my ideas for exploration due to already having a strong set of hypotheses to test and the time constraint. Consider this page just a brain dump, please skip to next section

5.1. Retailers who missed the first collection attempt#

In the Loans dataset, there are a few cases when a retailer is not able to pay back the money they borrowed in the first collection attempt. That does not necessarily mean they will default on their payment, perhaps it’s just a delay. Let’s take a look at the retailer with the most delayed-but-paid loans:

import pandas as pd


try:
    _ = loans_df
except NameError:
    loans_df = pd.read_excel("../../data/Loans_Data.xlsx")

(
    loans_df
    # .query("FIRST_TRIAL_BALANCE < 0 and PAYMENT_STATUS == 'Paid'")
    .groupby("MAIN_SYSTEM_ID")
    .agg(
        nunique_loan_id=("LOAN_ID", "nunique"),
        nunique_loan_id_delayed_payment=(
            "LOAN_ID",
            lambda df: df[(loans_df["FIRST_TRIAL_BALANCE"] < 0) & (loans_df["PAYMENT_STATUS"] == "Paid")].nunique()
            # the above is very inefficient compute but I'm a bit rusty with Pandas
        ),
    )
    .sort_values("nunique_loan_id_delayed_payment", ascending=False)
    .head(5)
)
nunique_loan_id nunique_loan_id_delayed_payment
MAIN_SYSTEM_ID
42714 14 10
58316 19 10
102778 15 9
104839 11 8
24063 20 7

Let’s investigate retailer MAIN_SYSTEM_ID == 58316, first looking at all 10 loans

loans_df.query("MAIN_SYSTEM_ID == 58316").sort_values("LOAN_ISSUANCE_DATE")
INDEX MAIN_SYSTEM_ID RETAILER_ID LOAN_ID LOAN_ISSUANCE_DATE LOAN_AMOUNT INITIAL_COST TOTAL_INITIAL_AMOUNT INITIAL_DATE LOAN_PAYMENT_DATE ... FIRST_TRAIL_DELAYS FIRST_TRIAL_BALANCE SPENT FINAL_COST TOTAL_FINAL_AMOUNT REPAYMENT_ID REPAYMENT_AMOUNT REPAYMENT_UPDATED CUMMULATIVE_OUTSTANDING PAYMENT_STATUS
47308 298520 58316 66776 697510 2022-09-02 00:38:59.309 1000.0 0 1000.0 2022-09-02 2022-09-03 19:27:33.733 ... 1 211.36 88.64 0 88.64 697510 0.0 2022-09-03 19:27:33.733 211.36 Paid
62663 432451 58316 66776 723339 2022-09-08 12:32:13.904 1000.0 0 1000.0 2022-09-08 2022-09-09 19:42:24.713 ... 1 191.46 108.54 0 108.54 723339 0.0 2022-09-09 19:42:24.713 191.46 Paid
7866 594403 58316 66776 750715 2022-09-11 23:38:50.204 1000.0 0 1000.0 2022-09-12 2022-09-12 19:01:41.513 ... 0 230.56 69.44 0 69.44 750715 0.0 2022-09-12 19:01:41.513 230.56 Paid
41078 242794 58316 66776 763590 2022-09-15 14:52:35.497 1000.0 0 1000.0 2022-09-15 2022-09-16 20:07:58.496 ... 1 -84.87 84.87 0 84.87 771485 250.0 2022-09-17 18:28:30.091 165.13 Paid
49219 313904 58316 66776 776182 2022-09-18 20:55:42.708 1000.0 0 1000.0 2022-09-19 2022-09-20 18:41:03.105 ... 1 -730.07 730.07 0 730.07 782955 800.0 2022-09-20 20:25:46.865 69.93 Paid
51662 334727 58316 66776 783557 2022-09-21 02:37:30.731 1000.0 0 1000.0 2022-09-21 2022-09-22 14:58:35.910 ... 1 46.95 103.05 0 103.05 783557 0.0 2022-09-22 14:58:35.910 46.95 Paid
59844 405915 58316 66776 791325 2022-09-23 14:20:37.542 1000.0 0 1000.0 2022-09-23 2022-09-24 16:21:40.566 ... 1 -108.56 108.56 0 108.56 807424 150.0 2022-09-28 17:50:15.576 41.44 Paid
68914 487963 58316 66776 814724 2022-09-29 11:34:20.555 1000.0 0 1000.0 2022-09-29 2022-09-29 19:17:11.349 ... 1 -91.36 91.36 0 91.36 816729 100.0 2022-09-29 20:05:02.673 8.64 Paid
12992 639440 58316 66776 816704 2022-09-29 19:41:44.898 1000.0 0 1000.0 2022-09-29 2022-10-01 17:59:32.282 ... 1 -100.96 100.96 0 100.96 831639 101.0 2022-10-02 18:17:29.442 0.04 Paid
23374 86011 58316 66776 832285 2022-10-02 18:19:08.181 1000.0 0 1000.0 2022-10-02 2022-10-03 19:37:34.605 ... 1 -100.74 100.74 0 100.74 838376 130.0 2022-10-04 11:33:34.686 29.26 Paid
2856 549976 58316 66776 838411 2022-10-04 11:35:42.055 1000.0 0 1000.0 2022-10-05 2022-10-07 18:53:27.332 ... 1 -139.25 139.25 0 139.25 860599 200.0 2022-10-09 15:41:16.183 60.89 Paid
17867 38315 58316 66776 866013 2022-10-09 22:50:35.375 1000.0 0 1000.0 2022-10-10 2022-10-11 17:59:21.048 ... 1 41.78 58.22 0 58.22 866013 0.0 2022-10-11 17:59:21.048 41.78 Paid
14394 8532 58316 66776 873912 2022-10-11 23:30:19.495 1000.0 0 1000.0 2022-10-12 2022-10-13 17:11:45.320 ... 1 -87.47 87.47 0 87.47 891721 100.0 2022-10-16 12:33:27.146 12.53 Paid
61535 422479 58316 66776 894023 2022-10-16 22:35:14.941 1000.0 0 1000.0 2022-10-17 2022-10-18 17:18:31.248 ... 1 -55.42 55.42 0 55.42 901669 100.0 2022-10-18 18:18:48.244 44.58 Paid
45487 282051 58316 66776 902395 2022-10-18 23:01:58.502 1000.0 0 1000.0 2022-10-19 2022-10-20 17:14:49.627 ... 1 -97.98 97.98 0 97.98 915284 200.0 2022-10-21 15:05:55.259 102.02 Paid
20509 60688 58316 66776 916292 2022-10-21 19:43:51.875 1000.0 0 1000.0 2022-10-22 2022-10-23 20:09:05.284 ... 1 866.01 133.99 0 133.99 917815 0.0 2022-10-23 20:09:05.284 866.01 Paid
22819 80971 58316 66776 917815 2022-10-22 10:41:49.732 1000.0 0 1000.0 2022-10-22 2022-10-23 20:09:05.284 ... 1 866.01 0.00 0 0.00 917815 0.0 2022-10-23 20:09:05.284 866.01 Paid
63143 435219 58316 66776 953683 2022-10-29 23:46:11.320 1000.0 0 1000.0 2022-10-30 2022-10-31 18:43:48.993 ... 1 352.39 0.00 0 0.00 953683 0.0 2022-10-31 18:43:48.993 352.39 Paid
32303 163978 58316 66776 964219 2022-10-31 20:58:53.714 1000.0 0 1000.0 2022-11-01 2022-11-02 17:18:39.520 ... 1 25.00 375.00 0 375.00 964219 0.0 2022-11-02 17:18:39.520 25.00 Paid

19 rows × 21 columns

import pandas as pd

try:
    _ = loans_df
except NameError:
    loans_df = pd.read_excel("../../data/Loans_Data.xlsx")

# Are there any LOAN_IDs that have multiple observations, i.e. first collection attempt, second, etc. ?
(
    loans_df.query("FIRST_TRIAL_BALANCE < 0")
    .groupby(["MAIN_SYSTEM_ID", "LOAN_ID"])
    [["REPAYMENT_UPDATED"]]
    .nunique()
    .query("REPAYMENT_UPDATED > 1")
)
REPAYMENT_UPDATED
MAIN_SYSTEM_ID LOAN_ID

?? If FIRST_TRIAL_BALANCE < 0, the retailer missed first collection attempt. Can we assume that retailers are given only a second collection attempt before being marked as PAYMENT_STATUS in ('Unpaid', 'Partially paid')?

I’m starting to believe I was given a snapshot of the Loans dataset instead of time series as for the other 2 datasets. We can check that by looking at the following case: retailer misses the first payment but is able to fully pay on following collection attempt. We can represent that as FIRST_TRIAL_BALANCE < 0 and PAYMENT_STATUS = 'Paid', then check if any LOAN_ID has more than one observation.

import pandas as pd

try:
    _ = loans_df
except NameError:
    loans_df = pd.read_excel("../../data/Loans_Data.xlsx")

(
    loans_df.query("FIRST_TRIAL_BALANCE < 0 and PAYMENT_STATUS == 'Paid'")
    .sort_values(by=["MAIN_SYSTEM_ID", "LOAN_ID"])
    [["MAIN_SYSTEM_ID", "LOAN_ID", "REPAYMENT_ID", "REPAYMENT_AMOUNT", "TOTAL_FINAL_AMOUNT"]]
)
MAIN_SYSTEM_ID LOAN_ID REPAYMENT_ID REPAYMENT_AMOUNT TOTAL_FINAL_AMOUNT
21130 388 697040 710745 1000.0 463.35
30622 484 723923 748868 2500.0 2492.69
9603 1559 817070 819766 5020.0 325.34
68669 1653 779240 787659 1000.0 855.50
19165 2023 722893 745754 1500.0 1487.88
... ... ... ... ... ...
717 155535 850258 863752 700.0 496.52
12557 155535 921816 933511 650.0 900.54
42368 155535 946771 961470 600.0 575.94
3043 155803 866687 879437 3450.0 3449.71
12433 155952 901395 915464 550.0 549.18

686 rows × 5 columns