3. Data Profile Reports
In the next 3 sections, I present a full profile report for each of the 3 datasets. The reports are generated with the Python package ydata-profiling and give an overview of how the data in each column is distributed, along with summary statistics, correlation analysis, and other diagnostics.
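As a rough illustration of how such a report can be produced, here is a minimal sketch. The helper names `report_path` and `build_profile`, the `reports` output folder, and the file naming are assumptions made for this example, not the exact setup used for the reports in the next sections. `ProfileReport` and `to_file` are the ydata-profiling calls that do the actual work.

```python
from pathlib import Path


def report_path(dataset_name, out_dir="reports"):
    """Build the HTML output path for a dataset's profile report (hypothetical naming)."""
    return Path(out_dir) / f"{dataset_name}_profile.html"


def build_profile(df, dataset_name, out_dir="reports"):
    """Profile a pandas DataFrame and write the report as a standalone HTML file."""
    # Imported here so report_path() stays usable without ydata-profiling installed.
    from ydata_profiling import ProfileReport

    target = report_path(dataset_name, out_dir)
    target.parent.mkdir(parents=True, exist_ok=True)

    # The report covers per-column distributions, summary statistics and correlations.
    profile = ProfileReport(df, title=f"{dataset_name} profile report")
    profile.to_file(target)
    return target
```

Calling `build_profile(loans_df, "loans")`, where `loans_df` stands for an already loaded DataFrame, would write `reports/loans_profile.html`.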
3.1. Low signal columns
These are columns that carry little to no information and can therefore interfere with ML model training. I'll exclude the following columns from all further steps:
Loans dataset
INITIAL_COST: constant value (zero), currently not charging retailers
FINAL_COST: constant value (zero), not currently charging retailers
INDEX: another identifier field
LOAN_ID: another identifier field
REPAYMENT_ID: another identifier field
RETAILER_ID: another identifier field
Ecommerce dataset
ORDER_ID: another identifier field
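The removal step can be sketched with pandas as follows. The column lists come straight from the lists above. The helper names, the toy DataFrame, its values, and the `LOAN_AMOUNT` column are made up for illustration and are not taken from the real datasets.

```python
import pandas as pd

# Low-signal columns named in the lists above.
LOANS_LOW_SIGNAL = [
    "INITIAL_COST", "FINAL_COST", "INDEX",
    "LOAN_ID", "REPAYMENT_ID", "RETAILER_ID",
]
ECOMMERCE_LOW_SIGNAL = ["ORDER_ID"]


def constant_columns(df):
    """Return the names of columns that hold a single distinct value."""
    return [col for col in df.columns if df[col].nunique(dropna=False) <= 1]


def drop_low_signal(df, columns):
    """Return a copy of df without the given columns."""
    # errors="ignore" skips names that are not present in this frame.
    return df.drop(columns=columns, errors="ignore")


# Toy frame with invented values, only to show the behaviour.
toy_loans = pd.DataFrame({
    "LOAN_ID": [101, 102, 103],
    "INITIAL_COST": [0, 0, 0],
    "FINAL_COST": [0, 0, 0],
    "LOAN_AMOUNT": [500.0, 1200.0, 750.0],  # hypothetical feature column
})

print(constant_columns(toy_loans))
cleaned = drop_low_signal(toy_loans, LOANS_LOW_SIGNAL)
print(list(cleaned.columns))
```

`constant_columns` is a quick way to confirm what the profile report flags as constant, while the identifier fields have to be listed by hand because their values vary from row to row.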