3. Data Profile Reports#

In the next 3 sections, I present a full profiler report on the 3 datasets. These reports are generated using Python package ydata-profiling and aim to provide an overview on the distribution of data in each column, aside from useful statistics, correlation analysis, among other things.

3.1. Low signal columns#

These are columns that bring little-to-no information, thus can interfere with ML model training. I’ll remove the following columns from any further step:

  • Loans dataset

    • INITIAL_COST: constant value (zero), currently not charging retailers

    • FINAL_COST: constant value (zero), not currently charging retailers

    • INDEX: another identifier field

    • LOAN_ID: another identifier field

    • REPAYMENT_ID: another identifier field

    • RETAILER_ID: another identifier field

  • Ecommerce dataset

    • ORDER_ID: another identifier field

3.2. Highly correlated columns#

These are data points that capture the same underlying phenomenon, thus behave closely to one another. Using multiple columns with high correlation tends to bias the importance of a given phenomenon captured by the data. I won’t straight away remove those columns from analysis, but the correlation group must be consolidated into 1 representative when it comes to feature engineering and model induction.

  • Loans dataset

    • Retailer identity (as expected):

      • RETAILER_ID

      • MAIN_SYSTEM_ID

    • Loan invoice details:

      • LOAN_AMOUNT

      • TOTAL_INITIAL_AMOUNT

      • INITIAL_COST: currently not charging retailers, but it will be part of the correlation group once we start charging

    • Retailer consumption of loan:

      • SPENT

      • FINAL_COST: currently not charging retailers, but it will be part of the correlation group once we start charging.

      • PAYMENT_AMOUNT

      • TOTAL_FINAL_AMOUNT

      • CUMMULATIVE_OUTSTANDING:

        • FIRST_TRIAL_BALANCE + REPAYMENT_AMOUNT: 99.9% match with 1 decimal point precision

  • Ecommerce dataset

    • ORDER_PRICE_AFTER_DISCOUNT:

      • ORDER_PRICE - DISCOUNT: 99.9% match with 1 decimal point precision, 99.6% with 2 decimal points precision

Although there is strong correlation between columns of “Loan invoice details” and “Retailer consumption of loan”, I choose to keep them separate groups. The reasoning is based on how the credit product is implemented at the moment. We evaluate retailers request for a given amount, but allow them to spend only a fraction of that amount. This means there are 2 related but distinct behaviors to capture: the rationale when requesting funds, and the actual need to spend them.


The sections above were drawn from the following data profile reports:

Code to generate each report:

import pandas as pd
from ydata_profiling import ProfileReport


df = pd.read_excel("data/Loans_Data.xlsx")
profile = ProfileReport(df, explorative=True, interactions=None)
profile.config.html.navbar_show = False
profile.config.html.full_width = True
profile.to_file("credit-risk/discovery/data-profile-reports/loans-profile.html")
import pandas as pd
from ydata_profiling import ProfileReport


df = pd.read_csv("data/Retailer_Transactions_Data.csv", header=0)
profile = ProfileReport(df, explorative=True, interactions=None)
profile.config.html.navbar_show = False
profile.config.html.full_width = True
profile.to_file("credit-risk/discovery/data-profile-reports/fintech-profile.html")
import pandas as pd
from ydata_profiling import ProfileReport


df = pd.read_csv("data/Ecommerce_orders_Data.csv", header=0)
profile = ProfileReport(df, explorative=True, interactions=None)
profile.config.html.navbar_show = False
profile.config.html.full_width = True
profile.to_file("credit-risk/discovery/data-profile-reports/ecommerce-profile.html")

Code for CUMMULATIVE_OUTSTANDING percentage match
import pandas as pd
import numpy as np


df = pd.read_excel("data/Loans_Data.xlsx")

df.assign(
    match=np.isclose(
      df["FIRST_TRIAL_BALANCE"] + df["REPAYMENT_AMOUNT"], df["CUMMULATIVE_OUTSTANDING"],
      rtol=1e-1,
      atol=1e-1
    )
)["match"].value_counts()
Code for ORDER_PRICE_AFTER_DISCOUNT percentage match
import pandas as pd
import numpy as np


df = pd.read_csv("../../data/Ecommerce_orders_Data.csv", header=0)

df.assign(
    match=np.isclose(
      df["ORDER_PRICE"] - df["DISCOUNT"], df["ORDER_PRICE_AFTER_DISCOUNT"],
      rtol=1e-1,
      atol=1e-1
    )
)["match"].value_counts()