Exploratory Data Analysis

5. Exploratory Data Analysis#

Warning

I decided not to actively pursue my ideas for exploration due to already having a strong set of hypotheses to test and the time constraint. Consider this page just a brain dump, please skip to next section

5.1. Retailers who missed the first collection attempt#

In the Loans dataset, there are a few cases when a retailer is not able to pay back the money they borrowed in the first collection attempt. That does not necessarily mean they will default on their payment, perhaps it’s just a delay. Let’s take a look at the retailer with the most delayed-but-paid loans:

import pandas as pd


try:
    _ = loans_df
except NameError:
    loans_df = pd.read_excel("../../data/Loans_Data.xlsx")

(
    loans_df
    # .query("FIRST_TRIAL_BALANCE < 0 and PAYMENT_STATUS == 'Paid'")
    .groupby("MAIN_SYSTEM_ID")
    .agg(
        nunique_loan_id=("LOAN_ID", "nunique"),
        nunique_loan_id_delayed_payment=(
            "LOAN_ID",
            lambda df: df[(loans_df["FIRST_TRIAL_BALANCE"] < 0) & (loans_df["PAYMENT_STATUS"] == "Paid")].nunique()
            # the above is very inefficient compute but I'm a bit rusty with Pandas
        ),
    )
    .sort_values("nunique_loan_id_delayed_payment", ascending=False)
    .head(5)
)

	nunique_loan_id	nunique_loan_id_delayed_payment
MAIN_SYSTEM_ID
42714	14	10
58316	19	10
102778	15	9
104839	11	8
24063	20	7

Let’s investigate retailer MAIN_SYSTEM_ID == 58316, first looking at all 10 loans

loans_df.query("MAIN_SYSTEM_ID == 58316").sort_values("LOAN_ISSUANCE_DATE")

	INDEX	MAIN_SYSTEM_ID	RETAILER_ID	LOAN_ID	LOAN_ISSUANCE_DATE	LOAN_AMOUNT	TOTAL_INITIAL_AMOUNT	INITIAL_DATE	LOAN_PAYMENT_DATE	...	FIRST_TRAIL_DELAYS	FIRST_TRIAL_BALANCE	SPENT	TOTAL_FINAL_AMOUNT	REPAYMENT_ID	REPAYMENT_AMOUNT	REPAYMENT_UPDATED	CUMMULATIVE_OUTSTANDING	PAYMENT_STATUS
47308	298520	58316	66776	697510	2022-09-02 00:38:59.309	1000.0	1000.0	2022-09-02	2022-09-03 19:27:33.733	...	1	211.36	88.64	88.64	697510	0.0	2022-09-03 19:27:33.733	211.36	Paid
62663	432451	58316	66776	723339	2022-09-08 12:32:13.904	1000.0	1000.0	2022-09-08	2022-09-09 19:42:24.713	...	1	191.46	108.54	108.54	723339	0.0	2022-09-09 19:42:24.713	191.46	Paid
7866	594403	58316	66776	750715	2022-09-11 23:38:50.204	1000.0	1000.0	2022-09-12	2022-09-12 19:01:41.513	...	0	230.56	69.44	69.44	750715	0.0	2022-09-12 19:01:41.513	230.56	Paid
41078	242794	58316	66776	763590	2022-09-15 14:52:35.497	1000.0	1000.0	2022-09-15	2022-09-16 20:07:58.496	...	1	-84.87	84.87	84.87	771485	250.0	2022-09-17 18:28:30.091	165.13	Paid
49219	313904	58316	66776	776182	2022-09-18 20:55:42.708	1000.0	1000.0	2022-09-19	2022-09-20 18:41:03.105	...	1	-730.07	730.07	730.07	782955	800.0	2022-09-20 20:25:46.865	69.93	Paid
51662	334727	58316	66776	783557	2022-09-21 02:37:30.731	1000.0	1000.0	2022-09-21	2022-09-22 14:58:35.910	...	1	46.95	103.05	103.05	783557	0.0	2022-09-22 14:58:35.910	46.95	Paid
59844	405915	58316	66776	791325	2022-09-23 14:20:37.542	1000.0	1000.0	2022-09-23	2022-09-24 16:21:40.566	...	1	-108.56	108.56	108.56	807424	150.0	2022-09-28 17:50:15.576	41.44	Paid
68914	487963	58316	66776	814724	2022-09-29 11:34:20.555	1000.0	1000.0	2022-09-29	2022-09-29 19:17:11.349	...	1	-91.36	91.36	91.36	816729	100.0	2022-09-29 20:05:02.673	8.64	Paid
12992	639440	58316	66776	816704	2022-09-29 19:41:44.898	1000.0	1000.0	2022-09-29	2022-10-01 17:59:32.282	...	1	-100.96	100.96	100.96	831639	101.0	2022-10-02 18:17:29.442	0.04	Paid
23374	86011	58316	66776	832285	2022-10-02 18:19:08.181	1000.0	1000.0	2022-10-02	2022-10-03 19:37:34.605	...	1	-100.74	100.74	100.74	838376	130.0	2022-10-04 11:33:34.686	29.26	Paid
2856	549976	58316	66776	838411	2022-10-04 11:35:42.055	1000.0	1000.0	2022-10-05	2022-10-07 18:53:27.332	...	1	-139.25	139.25	139.25	860599	200.0	2022-10-09 15:41:16.183	60.89	Paid
17867	38315	58316	66776	866013	2022-10-09 22:50:35.375	1000.0	1000.0	2022-10-10	2022-10-11 17:59:21.048	...	1	41.78	58.22	58.22	866013	0.0	2022-10-11 17:59:21.048	41.78	Paid
14394	8532	58316	66776	873912	2022-10-11 23:30:19.495	1000.0	1000.0	2022-10-12	2022-10-13 17:11:45.320	...	1	-87.47	87.47	87.47	891721	100.0	2022-10-16 12:33:27.146	12.53	Paid
61535	422479	58316	66776	894023	2022-10-16 22:35:14.941	1000.0	1000.0	2022-10-17	2022-10-18 17:18:31.248	...	1	-55.42	55.42	55.42	901669	100.0	2022-10-18 18:18:48.244	44.58	Paid
45487	282051	58316	66776	902395	2022-10-18 23:01:58.502	1000.0	1000.0	2022-10-19	2022-10-20 17:14:49.627	...	1	-97.98	97.98	97.98	915284	200.0	2022-10-21 15:05:55.259	102.02	Paid
20509	60688	58316	66776	916292	2022-10-21 19:43:51.875	1000.0	1000.0	2022-10-22	2022-10-23 20:09:05.284	...	1	866.01	133.99	133.99	917815	0.0	2022-10-23 20:09:05.284	866.01	Paid
22819	80971	58316	66776	917815	2022-10-22 10:41:49.732	1000.0	1000.0	2022-10-22	2022-10-23 20:09:05.284	...	1	866.01	0.00	0.00	917815	0.0	2022-10-23 20:09:05.284	866.01	Paid
63143	435219	58316	66776	953683	2022-10-29 23:46:11.320	1000.0	1000.0	2022-10-30	2022-10-31 18:43:48.993	...	1	352.39	0.00	0.00	953683	0.0	2022-10-31 18:43:48.993	352.39	Paid
32303	163978	58316	66776	964219	2022-10-31 20:58:53.714	1000.0	1000.0	2022-11-01	2022-11-02 17:18:39.520	...	1	25.00	375.00	375.00	964219	0.0	2022-11-02 17:18:39.520	25.00	Paid

19 rows × 21 columns

import pandas as pd

try:
    _ = loans_df
except NameError:
    loans_df = pd.read_excel("../../data/Loans_Data.xlsx")

# Are there any LOAN_IDs that have multiple observations, i.e. first collection attempt, second, etc. ?
(
    loans_df.query("FIRST_TRIAL_BALANCE < 0")
    .groupby(["MAIN_SYSTEM_ID", "LOAN_ID"])
    [["REPAYMENT_UPDATED"]]
    .nunique()
    .query("REPAYMENT_UPDATED > 1")
)

		REPAYMENT_UPDATED
MAIN_SYSTEM_ID	LOAN_ID

?? If FIRST_TRIAL_BALANCE < 0, the retailer missed first collection attempt. Can we assume that retailers are given only a second collection attempt before being marked as PAYMENT_STATUS in ('Unpaid', 'Partially paid')?

I’m starting to believe I was given a snapshot of the Loans dataset instead of time series as for the other 2 datasets. We can check that by looking at the following case: retailer misses the first payment but is able to fully pay on following collection attempt. We can represent that as FIRST_TRIAL_BALANCE < 0 and PAYMENT_STATUS = 'Paid', then check if any LOAN_ID has more than one observation.

import pandas as pd

try:
    _ = loans_df
except NameError:
    loans_df = pd.read_excel("../../data/Loans_Data.xlsx")

(
    loans_df.query("FIRST_TRIAL_BALANCE < 0 and PAYMENT_STATUS == 'Paid'")
    .sort_values(by=["MAIN_SYSTEM_ID", "LOAN_ID"])
    [["MAIN_SYSTEM_ID", "LOAN_ID", "REPAYMENT_ID", "REPAYMENT_AMOUNT", "TOTAL_FINAL_AMOUNT"]]
)

	MAIN_SYSTEM_ID	LOAN_ID	REPAYMENT_ID	REPAYMENT_AMOUNT	TOTAL_FINAL_AMOUNT
21130	388	697040	710745	1000.0	463.35
30622	484	723923	748868	2500.0	2492.69
9603	1559	817070	819766	5020.0	325.34
68669	1653	779240	787659	1000.0	855.50
19165	2023	722893	745754	1500.0	1487.88
...	...	...	...	...	...
717	155535	850258	863752	700.0	496.52
12557	155535	921816	933511	650.0	900.54
42368	155535	946771	961470	600.0	575.94
3043	155803	866687	879437	3450.0	3449.71
12433	155952	901395	915464	550.0	549.18

686 rows × 5 columns

Exploratory Data Analysis

Contents

5. Exploratory Data Analysis#

5.1. Retailers who missed the first collection attempt#