2. Initial Data Discovery#

My understanding of the datasets is that:

  • Ecommerce Data relates to inventory purchase retailers make on MaxAB

  • Fintech Data relates to customer-facing sales retailers make using MaxAB payment processing

  • Loans Data relates to micro-loans retailers request to purchase iventory through MaxAB

2.1. Data points#

Although Retailer Loans dataset contains definition of columns, the other 2 datasets do not. So first order of business is to inspect them and infer what their columns mean.

Loans data#

Hide code cell source
import pandas as pd


pd.read_excel(
    "../../data/Loans_Data.xlsx"
).iloc[0].sort_index()
CUMMULATIVE_OUTSTANDING                        975.39
FINAL_COST                                          0
FIRST_TRAIL_DELAYS                                  0
FIRST_TRIAL_BALANCE                            975.39
INDEX                                          524291
INITIAL_COST                                        0
INITIAL_DATE                      2022-09-13 00:00:00
LOAN_AMOUNT                                    1000.0
LOAN_ID                                        754724
LOAN_ISSUANCE_DATE         2022-09-13 07:58:24.292000
LOAN_PAYMENT_DATE          2022-09-13 16:39:44.127000
MAIN_SYSTEM_ID                                  35668
PAYMENT_AMOUNT                                 1020.0
PAYMENT_STATUS                                   Paid
REPAYMENT_AMOUNT                                  0.0
REPAYMENT_ID                                   754724
REPAYMENT_UPDATED          2022-09-13 16:39:44.127000
RETAILER_ID                                     34019
SPENT                                           44.61
TOTAL_FINAL_AMOUNT                              44.61
TOTAL_INITIAL_AMOUNT                           1000.0
Name: 0, dtype: object
Column Description

Column

Description

INDEX

Negligible number

MAIN_SYSTEM_ID

the retailer’s common id between e-commerce and fintech

RETAILER_ID

the retailer’s id on the fintech system only

LOAN_ID

id of the Loan

LOAN_ISSUANCE_DATE

the datetime when the loan was requested

LOAN_AMOUNT

the pricipal amount the retailer requested in credit (value only shows intention, but amount is considered lent once the retailer uses money for transaction)

INITIAL_COST

the interest rate (from retailer’s credit profile) * loan amount (value is also non-binding; is recalculated on consumed credit amounts)

TOTAL_INITIAL_AMOUNT

initial principal + initial cost

INITIAL_DATE

Due date of the loan

LOAN_PAYMENT_DATE

The finalization date of the loan’s first collection order (could be different than the initial date for one of two reasons: -retailer was visited and requested a delay -the operations team was not able to fulfill the order on the correct date)

PAYMENT_AMOUNT

The paid amount during the loans’s first collection order

FIRST_TRAIL_DELAYS

The number of delays received on the first collection order

SPENT

The consumed amount of the loan = total value of transactions fulfilled from the initial credit amount

FIRST_TRIAL_BALANCE

The retailer’s balance after the payment amount was logged in the wallet; -If the first trial balance is negative, it means the payment amount did not cover the spent credit amount and the negative amount is now overdue; -If the first trial balance >= 0 then it means the payment amount covered the spent credit amount and no amounts are due for the said loan following the payment date

FINAL_COST

the interest rate (from retailer’s credit profile) * spent amount

TOTAL_FINAL_AMOUNT

spent amount + final cost (the minimum amount required to collect from the retailer to fulfill due amounts)

REPAYMENT_ID

the id of the last cash-in used to repay the loan -if first trial balance >= 0, then it will equal the loan_id -if first trial balance < 0 and there were no attempts to recollect yet, then it will equal the loan_id (incorrect design, loan_id does not have a functional purpose in this column)

REPAYMENT_AMOUNT

amount paid during last collection order used to repay the loan -if first trial balance >=0, then = 0 (because loan was already paid)

REPAYMENT_UPDATED

the date of the last collection trial

CUMMULATIVE_OUTSTANDING

It’s the cummulation of the retailer’s balance from the due date until the date it is paid -If it is positive, it means no amounts are unpaid on the said loan level -If it is negative, it means the retailer’s balance is still negative, and there are overdue amounts on the said loan level

PAYMENT_STATUS

the status of the loan and it can take 3 values paid - unpaid - partialy paid

Fintech data#

Hide code cell source
import pandas as pd


pd.read_csv(
    "../../data/Retailer_Transactions_Data.csv",
    header=0
).iloc[0].sort_index()
AMOUNT                                                   7.0
CREATED_AT                           2022-07-20T17:17:27.569
FEES                                                    0.23
ID                                                  21321668
MAIN_SYSTEM_ID                                         78550
RETAILER_CUT                                            0.13
STATUS                                            SUCCESSFUL
TOTAL_AMOUNT_INCLUDING_TAX                              10.0
TOTAL_AMOUNT_PAID                                       10.0
UPDATED_AT                            2022-07-20T17:17:27.62
WALLET_BALANCE_BEFORE_TRANSACTION                    1017.11
Name: 0, dtype: object

From the column definitions in Loans data, we know MAIN_SYSTEM_ID represents the retailer’s identity between Fintech and Ecommerce.

Furthermore, we can already predict that there is strong correlation between

Column Description

Column

Description

AMOUNT

Eq. invoice_value before tax and fees

CREATED_AT

Timestamp when payment processing started

FEES

MaxAB fee for providing payment processing service, etc

ID

Internal identification of payment processing

MAIN_SYSTEM_ID

Retailer internal identification across ecommerce and fintech, table foreign_key

RETAILER_CUT

Estimated profit margin of retailer on the sale (?)

STATUS

Latest status of payment processing according to UPDATED_AT

TOTAL_AMOUNT_INCLUDING_TAX

AMOUNT + FEES + TAX

TOTAL_AMOUNT_PAID

The amount paid to the retailer by their customer

UPDATED_AT

Timestamp of the current STATUS update

WALLET_BALANCE_BEFORE_TRANSACTION

Retailer balance before the current sale

Ecommerce data#

Hide code cell source
import pandas as pd


pd.read_csv(
    "../../data/Ecommerce_orders_Data.csv",
    header=0
).iloc[0].sort_index()
DISCOUNT                                      0.0
MAIN_SYSTEM_ID                              40591
ORDER_CREATION_DATE           2022-05-11T18:49:20
ORDER_ID                                  2561005
ORDER_PRICE                                415.25
ORDER_PRICE_AFTER_DISCOUNT                 415.25
Name: 0, dtype: object

This dataset is simpler in nature if compared with the other 2, and as expected MAIN_SYSTEM_ID is present here too. Let’s organize column definitions is a table:

Column Description

Column

Description

DISCOUNT

Amount to be deducted from ORDER_PRICE

MAIN_SYSTEM_ID

Retailer internal identification across ecommerce and fintech, table foreign_key

ORDER_CREATION_DATE

Timestamp when inventory purchase order is created

ORDER_ID

Table index, unique identifier for orders

ORDER_PRICE

Invoice value before DISCOUNT is applied

ORDER_PRICE_AFTER_DISCOUNT

Invoice value after DISCOUNT is applied