Data Profile Reports

3. Data Profile Reports#

In the next 3 sections, I present a full profiler report on the 3 datasets. These reports are generated using Python package ydata-profiling and aim to provide an overview on the distribution of data in each column, aside from useful statistics, correlation analysis, among other things.

3.1. Low signal columns#

These are columns that bring little-to-no information, thus can interfere with ML model training. I’ll remove the following columns from any further step:

Loans dataset
- INITIAL_COST: constant value (zero), currently not charging retailers
- FINAL_COST: constant value (zero), not currently charging retailers
- INDEX: another identifier field
- LOAN_ID: another identifier field
- REPAYMENT_ID: another identifier field
- RETAILER_ID: another identifier field
Ecommerce dataset
- ORDER_ID: another identifier field

3.2. Highly correlated columns#

These are data points that capture the same underlying phenomenon, thus behave closely to one another. Using multiple columns with high correlation tends to bias the importance of a given phenomenon captured by the data. I won’t straight away remove those columns from analysis, but the correlation group must be consolidated into 1 representative when it comes to feature engineering and model induction.

Loans dataset
- Retailer identity (as expected):
  - RETAILER_ID
  - MAIN_SYSTEM_ID
- Loan invoice details:
  - LOAN_AMOUNT
  - TOTAL_INITIAL_AMOUNT
  - INITIAL_COST: currently not charging retailers, but it will be part of the correlation group once we start charging
- Retailer consumption of loan:
  - SPENT
  - FINAL_COST: currently not charging retailers, but it will be part of the correlation group once we start charging.
  - PAYMENT_AMOUNT
  - TOTAL_FINAL_AMOUNT
  - CUMMULATIVE_OUTSTANDING:
    - FIRST_TRIAL_BALANCE + REPAYMENT_AMOUNT: 99.9% match with 1 decimal point precision
Ecommerce dataset
- ORDER_PRICE_AFTER_DISCOUNT:
  - ORDER_PRICE - DISCOUNT: 99.9% match with 1 decimal point precision, 99.6% with 2 decimal points precision

Although there is strong correlation between columns of “Loan invoice details” and “Retailer consumption of loan”, I choose to keep them separate groups. The reasoning is based on how the credit product is implemented at the moment. We evaluate retailers request for a given amount, but allow them to spend only a fraction of that amount. This means there are 2 related but distinct behaviors to capture: the rationale when requesting funds, and the actual need to spend them.

The sections above were drawn from the following data profile reports:

Code to generate each report:

Retailer Loans

import pandas as pd
from ydata_profiling import ProfileReport


df = pd.read_excel("data/Loans_Data.xlsx")
profile = ProfileReport(df, explorative=True, interactions=None)
profile.config.html.navbar_show = False
profile.config.html.full_width = True
profile.to_file("credit-risk/discovery/data-profile-reports/loans-profile.html")

Fintech Transactions

import pandas as pd
from ydata_profiling import ProfileReport


df = pd.read_csv("data/Retailer_Transactions_Data.csv", header=0)
profile = ProfileReport(df, explorative=True, interactions=None)
profile.config.html.navbar_show = False
profile.config.html.full_width = True
profile.to_file("credit-risk/discovery/data-profile-reports/fintech-profile.html")

Ecommerce Transactions

import pandas as pd
from ydata_profiling import ProfileReport


df = pd.read_csv("data/Ecommerce_orders_Data.csv", header=0)
profile = ProfileReport(df, explorative=True, interactions=None)
profile.config.html.navbar_show = False
profile.config.html.full_width = True
profile.to_file("credit-risk/discovery/data-profile-reports/ecommerce-profile.html")

Data Profile Reports

Contents

3. Data Profile Reports#

3.1. Low signal columns#

3.2. Highly correlated columns#