2. Initial Data Discovery#
My understanding of the datasets is that:
Ecommerce Data relates to inventory purchase retailers make on MaxAB
Fintech Data relates to customer-facing sales retailers make using MaxAB payment processing
Loans Data relates to micro-loans retailers request to purchase iventory through MaxAB
2.1. Data points#
Although Retailer Loans dataset contains definition of columns, the other 2 datasets do not. So first order of business is to inspect them and infer what their columns mean.
Loans data#
Show code cell source
import pandas as pd
pd.read_excel(
"../../data/Loans_Data.xlsx"
).iloc[0].sort_index()
CUMMULATIVE_OUTSTANDING 975.39
FINAL_COST 0
FIRST_TRAIL_DELAYS 0
FIRST_TRIAL_BALANCE 975.39
INDEX 524291
INITIAL_COST 0
INITIAL_DATE 2022-09-13 00:00:00
LOAN_AMOUNT 1000.0
LOAN_ID 754724
LOAN_ISSUANCE_DATE 2022-09-13 07:58:24.292000
LOAN_PAYMENT_DATE 2022-09-13 16:39:44.127000
MAIN_SYSTEM_ID 35668
PAYMENT_AMOUNT 1020.0
PAYMENT_STATUS Paid
REPAYMENT_AMOUNT 0.0
REPAYMENT_ID 754724
REPAYMENT_UPDATED 2022-09-13 16:39:44.127000
RETAILER_ID 34019
SPENT 44.61
TOTAL_FINAL_AMOUNT 44.61
TOTAL_INITIAL_AMOUNT 1000.0
Name: 0, dtype: object
Column Description
Column |
Description |
|---|---|
INDEX |
Negligible number |
MAIN_SYSTEM_ID |
the retailer’s common id between e-commerce and fintech |
RETAILER_ID |
the retailer’s id on the fintech system only |
LOAN_ID |
id of the Loan |
LOAN_ISSUANCE_DATE |
the datetime when the loan was requested |
LOAN_AMOUNT |
the pricipal amount the retailer requested in credit (value only shows intention, but amount is considered lent once the retailer uses money for transaction) |
INITIAL_COST |
the interest rate (from retailer’s credit profile) * loan amount (value is also non-binding; is recalculated on consumed credit amounts) |
TOTAL_INITIAL_AMOUNT |
initial principal + initial cost |
INITIAL_DATE |
Due date of the loan |
LOAN_PAYMENT_DATE |
The finalization date of the loan’s first collection order (could be different than the initial date for one of two reasons: -retailer was visited and requested a delay -the operations team was not able to fulfill the order on the correct date) |
PAYMENT_AMOUNT |
The paid amount during the loans’s first collection order |
FIRST_TRAIL_DELAYS |
The number of delays received on the first collection order |
SPENT |
The consumed amount of the loan = total value of transactions fulfilled from the initial credit amount |
FIRST_TRIAL_BALANCE |
The retailer’s balance after the payment amount was logged in the wallet; -If the first trial balance is negative, it means the payment amount did not cover the spent credit amount and the negative amount is now overdue; -If the first trial balance >= 0 then it means the payment amount covered the spent credit amount and no amounts are due for the said loan following the payment date |
FINAL_COST |
the interest rate (from retailer’s credit profile) * spent amount |
TOTAL_FINAL_AMOUNT |
spent amount + final cost (the minimum amount required to collect from the retailer to fulfill due amounts) |
REPAYMENT_ID |
the id of the last cash-in used to repay the loan -if first trial balance >= 0, then it will equal the loan_id -if first trial balance < 0 and there were no attempts to recollect yet, then it will equal the loan_id (incorrect design, loan_id does not have a functional purpose in this column) |
REPAYMENT_AMOUNT |
amount paid during last collection order used to repay the loan -if first trial balance >=0, then = 0 (because loan was already paid) |
REPAYMENT_UPDATED |
the date of the last collection trial |
CUMMULATIVE_OUTSTANDING |
It’s the cummulation of the retailer’s balance from the due date until the date it is paid -If it is positive, it means no amounts are unpaid on the said loan level -If it is negative, it means the retailer’s balance is still negative, and there are overdue amounts on the said loan level |
PAYMENT_STATUS |
the status of the loan and it can take 3 values paid - unpaid - partialy paid |
Fintech data#
Show code cell source
import pandas as pd
pd.read_csv(
"../../data/Retailer_Transactions_Data.csv",
header=0
).iloc[0].sort_index()
AMOUNT 7.0
CREATED_AT 2022-07-20T17:17:27.569
FEES 0.23
ID 21321668
MAIN_SYSTEM_ID 78550
RETAILER_CUT 0.13
STATUS SUCCESSFUL
TOTAL_AMOUNT_INCLUDING_TAX 10.0
TOTAL_AMOUNT_PAID 10.0
UPDATED_AT 2022-07-20T17:17:27.62
WALLET_BALANCE_BEFORE_TRANSACTION 1017.11
Name: 0, dtype: object
From the column definitions in Loans data, we know MAIN_SYSTEM_ID represents the retailer’s identity between Fintech and Ecommerce.
Furthermore, we can already predict that there is strong correlation between
Column Description
Column |
Description |
|---|---|
AMOUNT |
Eq. |
CREATED_AT |
Timestamp when payment processing started |
FEES |
MaxAB fee for providing payment processing service, etc |
ID |
Internal identification of payment processing |
MAIN_SYSTEM_ID |
Retailer internal identification across ecommerce and fintech, table |
RETAILER_CUT |
Estimated profit margin of retailer on the sale (?) |
STATUS |
Latest status of payment processing according to |
TOTAL_AMOUNT_INCLUDING_TAX |
|
TOTAL_AMOUNT_PAID |
The amount paid to the retailer by their customer |
UPDATED_AT |
Timestamp of the current |
WALLET_BALANCE_BEFORE_TRANSACTION |
Retailer balance before the current sale |
Ecommerce data#
Show code cell source
import pandas as pd
pd.read_csv(
"../../data/Ecommerce_orders_Data.csv",
header=0
).iloc[0].sort_index()
DISCOUNT 0.0
MAIN_SYSTEM_ID 40591
ORDER_CREATION_DATE 2022-05-11T18:49:20
ORDER_ID 2561005
ORDER_PRICE 415.25
ORDER_PRICE_AFTER_DISCOUNT 415.25
Name: 0, dtype: object
This dataset is simpler in nature if compared with the other 2, and as expected MAIN_SYSTEM_ID is present here too. Let’s organize column definitions is a table:
Column Description
Column |
Description |
|---|---|
DISCOUNT |
Amount to be deducted from |
MAIN_SYSTEM_ID |
Retailer internal identification across ecommerce and fintech, table |
ORDER_CREATION_DATE |
Timestamp when inventory purchase order is created |
ORDER_ID |
Table index, unique identifier for orders |
ORDER_PRICE |
Invoice value before |
ORDER_PRICE_AFTER_DISCOUNT |
Invoice value after |