Home Credit Default Risk Modeling

Data Analysis Project

Author

Vilmantas Gėgžna

Published

2023-12-27

Updated

2023-12-29

Home Credit Default Risk project logo. Originally generated with Leonardo.Ai.

Annotation

In this project, an extensive analysis of credit data from Home Credit Group was undertaken. Two distinct models were developed to predict whether a loan applicant repays the loan or faces financial difficulties: one model operates independently of credit history data, and the other incorporates such information. After rigorous testing, the models were successfully deployed on the Google Cloud Platform and are now available via an API. Surprisingly, the results indicate only a marginal improvement in model performance when historical credit data is included.

The homepage of deployed models is currently available at:
https://home-credit-default-prediction-sarhiiybua-ew.a.run.app/

The examples on how to use the API are available in this README file.

Abbreviations

API: application programming interface;
AUC: area under the ROC curve;
BAcc_01: balanced accuracy score normalized to interval [0, 1];
BAcc: balanced accuracy score;
EDA: exploratory data analysis;
F1_neg: F1 score for the negative class;
F1: F1 score (usually for the positive class);
K: thousand;
M: million;
ML: machine learning;
NPV: negative predictive value;
PPV: positive predictive value (precision);
ROC: receiver operating characteristic;
SHAP: SHapley Additive exPlanations;
TNR: true negative rate;
TPR: true positive rate (recall).
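The metrics abbreviated above can all be derived from the four confusion-matrix counts. The following is a minimal sketch, assuming the usual binary-classification definitions. The function name `binary_metrics` and the example counts are hypothetical, and the formula for BAcc_01 (chance-adjusted balanced accuracy, 2 × BAcc − 1 in the binary case) is an assumption about what "normalized to interval [0, 1]" means here.

```python
def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute the abbreviated metrics from confusion-matrix counts.

    Assumes all denominators are non-zero (both classes present and
    both classes predicted at least once).
    """
    tpr = tp / (tp + fn)  # true positive rate (recall)
    tnr = tn / (tn + fp)  # true negative rate
    ppv = tp / (tp + fp)  # positive predictive value (precision)
    npv = tn / (tn + fn)  # negative predictive value
    bacc = (tpr + tnr) / 2  # balanced accuracy
    return {
        "TPR": tpr,
        "TNR": tnr,
        "PPV": ppv,
        "NPV": npv,
        "BAcc": bacc,
        # ASSUMPTION: rescaled so that chance level (0.5) maps to 0
        # and a perfect score maps to 1.
        "BAcc_01": 2 * bacc - 1,
        "F1": 2 * ppv * tpr / (ppv + tpr),  # F1 for the positive class
        "F1_neg": 2 * npv * tnr / (npv + tnr),  # F1 for the negative class
    }


# Made-up counts, for illustration only
example = binary_metrics(tp=30, fp=20, tn=80, fn=10)
```

With imbalanced classes, BAcc and the per-class F1 scores tend to be more informative than plain accuracy, which is presumably why they are listed here.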

1 Plan

In this project, a dataset from Home Credit Group will be analyzed. The main purpose of this analysis is to investigate whether including credit history data significantly improves the performance of models that predict whether a client applying for credit will face financial difficulties. For this purpose, the following plan will be implemented:

EDA on data will be performed.
A predictive model based on data from the application table (no credit history data included) will be created (the first model);
- assumption: this data is always available.
The first model will be deployed.
Data from the remaining tables will be pre-processed (extracted, aggregated) and merged with the applications dataset.
The model based on all currently available data (including credit and loan history) will be created (the second model).
- assumption: the data in tables other than application might be updated rarely, might be costly to acquire, or might even be unavailable for some credit applicants.
The second model will be deployed.
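The pre-processing step of the plan (aggregate a history table to one row per applicant, then merge it with the applications) can be sketched as follows. This is only an illustration on toy data, not the project's actual feature engineering: the values are made up, and the `AMT_CREDIT_SUM` column and the names of the aggregated features are assumptions.

```python
import pandas as pd

# Toy stand-ins for the application and bureau tables (values are made up).
application = pd.DataFrame({"SK_ID_CURR": [1, 2, 3], "TARGET": [0, 1, 0]})
bureau = pd.DataFrame(
    {
        "SK_ID_CURR": [1, 1, 2],
        "AMT_CREDIT_SUM": [1000.0, 500.0, 200.0],  # assumed column name
    }
)

# Aggregate the history table so that each applicant gets exactly one row.
bureau_agg = (
    bureau.groupby("SK_ID_CURR")
    .agg(
        bureau_n_credits=("AMT_CREDIT_SUM", "count"),
        bureau_credit_sum_mean=("AMT_CREDIT_SUM", "mean"),
    )
    .reset_index()
)

# A left join keeps applicants without any credit history;
# their aggregated features become missing values.
merged = application.merge(bureau_agg, on="SK_ID_CURR", how="left")
```

The left join matters for the second model's assumption: applicants with no records in the other tables stay in the dataset instead of being dropped.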

2 Setup

Some preparation steps are described in the README.md file of the project (e.g., here). The Python code that imports the necessary tools:

# Automatically reload certain modules
%reload_ext autoreload
%autoreload 1

# Plotting
%matplotlib inline

# Packages and modules -------------------------------
# Utilities
import os
import warnings
import joblib

# Data frames
import pandas as pd

# EDA and plotting
import seaborn as sns
import matplotlib.pyplot as plt

import sweetviz
import klib

# Data wrangling, maths, feature engineering
import numpy as np

# Patch sklearn with Intel's version
from sklearnex import patch_sklearn

patch_sklearn()  # Run this code before importing from sklearn

# Machine learning
import lightgbm as lgb
from sklearn import set_config

from sklearn.base import clone, BaseEstimator, TransformerMixin

from sklearn.pipeline import Pipeline
from sklearn.compose import (
    ColumnTransformer,
    make_column_selector,
)
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer

from sklearn.model_selection import (
    cross_val_score,
    train_test_split,
    StratifiedKFold,
)

# ML: classification models
from lightgbm import LGBMClassifier

# ML: feature engineering and selection
from feature_engine.selection import (
    DropDuplicateFeatures,
    SmartCorrelatedSelection,
)

# ML: hyperparameter tuning
import optuna

# ML: explainability
import shap

# Display
from IPython.display import display

# Custom functions
import functions.fun_utils as my
import functions.fun_analysis as an
import functions.fun_ml as ml
from functions.utils import (
    ColumnSelector,
    CleanColumnNames,
)

%aimport functions.fun_utils
%aimport functions.fun_analysis
%aimport functions.fun_ml
%aimport functions.utils

# Settings --------------------------------------------
# Default plot options
plt.rc("figure", titleweight="bold")
plt.rc("axes", labelweight="bold", titleweight="bold")
plt.rc("font", weight="normal", size=10)
plt.rc("figure", figsize=(10, 3))

# Pandas options
pd.set_option("display.max_rows", 1000)
pd.set_option("display.max_columns", 300)
pd.set_option("display.max_colwidth", 50)  # Possible option: None
pd.set_option("display.float_format", lambda x: f"{x:.2f}")
pd.set_option("styler.format.thousands", ",")

# Turn off the scientific notation for floating point numbers.
np.set_printoptions(suppress=True)

# Scikit-learn options
set_config(transform_output="pandas")

# Analysis parameters: use Sweetviz for eda?
do_eda = True

# For caching results ---------------------------------
dir_interim = "data/interim/"
os.makedirs(dir_interim, exist_ok=True)

Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)

3 Data

In this project, a dataset from Home Credit Group is investigated.

The target variable TARGET has 2 values:

1 - client with payment difficulties: he/she had late payment more than X days on at least one of the first Y installments of the loan in our sample,
0 - all other cases.

In most cases, the following values have these meanings:

XNA: unknown / not available
XAP: not applicable
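One possible way to handle these placeholders is sketched below on toy data. Treating XNA as a missing value while leaving XAP untouched is an assumption made for illustration, not necessarily the approach used later in this project; the example values are made up.

```python
import numpy as np
import pandas as pd

# Toy data with placeholder codes (values are made up).
toy = pd.DataFrame(
    {
        "CODE_GENDER": ["M", "F", "XNA"],
        "ORGANIZATION_TYPE": ["School", "XNA", "Government"],
    }
)

# Count the placeholder per column before deciding how to treat it.
xna_counts = (toy == "XNA").sum()

# Treat "XNA" (unknown / not available) as missing.
# "XAP" (not applicable) carries a different meaning, so it is not replaced.
cleaned = toy.replace("XNA", np.nan)
```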

More information on the dataset:

Anna Montoya, Kirill Odintsov, and Martin Kotek. Home credit default risk (2018). Kaggle.
- https://kaggle.com/competitions/home-credit-default-risk
- https://www.kaggle.com/c/home-credit-default-risk

The dataset was downloaded as a zip file and extracted into the data folder (data/raw/ and data/info/ directories; more details in the next section).

3.1 Explore Data Files

In this section, the data and metadata files are explored.

The files with metadata and descriptions are stored in the data/info directory; they were acquired from the main source as well as from other sources.

Details:

!echo Files with data description:
!ls data/info/

Files with data description:
HomeCredit_column_descriptions.xlsx
HomeCredit_columns_description.csv
description--Home Credit Default Risk.pdf

The files with datasets are stored in data/raw directory.

!echo Data files:
!ls data/raw/

Data files:
POS_CASH_balance.csv
application_test.csv
application_train.csv
bureau.csv
bureau_balance.csv
credit_card_balance.csv
installments_payments.csv
previous_application.csv
sample_submission.csv

!echo File sizes:
!cd data/raw/ &&\
   du -m * | sed 's/\([0-9]\+\)/\1 MB /'

File sizes:
375 MB  POS_CASH_balance.csv
26 MB   application_test.csv
159 MB  application_train.csv
163 MB  bureau.csv
359 MB  bureau_balance.csv
405 MB  credit_card_balance.csv
690 MB  installments_payments.csv
387 MB  previous_application.csv
1 MB    sample_submission.csv

# NOTE: header line is also included here
!echo Number of lines per file:
!cd data/raw/ &&\
    wc --lines *

Number of lines per file:
  10001359 POS_CASH_balance.csv
     48745 application_test.csv
    307512 application_train.csv
   1716429 bureau.csv
  27299926 bureau_balance.csv
   3840313 credit_card_balance.csv
  13605402 installments_payments.csv
   1670215 previous_application.csv
     48745 sample_submission.csv
  58538646 total
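Because `wc --lines` also counts the header line, the number of data rows in each file is one less than the number shown. For example, application_train.csv has 307512 lines and therefore 307511 data rows. The same check can be sketched in Python. The helper name `count_data_rows` is hypothetical, and the sketch assumes a single header line and no line breaks inside quoted fields.

```python
from pathlib import Path


def count_data_rows(path: str) -> int:
    """Return the number of data rows in a CSV file, excluding the header.

    Assumes one header line and no embedded newlines in quoted fields.
    """
    with Path(path).open("rb") as f:
        n_lines = sum(1 for _ in f)  # iterate lazily, so large files are fine
    return max(n_lines - 1, 0)
```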

The first few rows of application_train.csv, first as raw text and then formatted as a table:

!cd data/raw/ &&\
    head -n 5 application_train.csv

SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,NAME_TYPE_SUITE,NAME_INCOME_TYPE,NAME_EDUCATION_TYPE,NAME_FAMILY_STATUS,NAME_HOUSING_TYPE,REGION_POPULATION_RELATIVE,DAYS_BIRTH,DAYS_EMPLOYED,DAYS_REGISTRATION,DAYS_ID_PUBLISH,OWN_CAR_AGE,FLAG_MOBIL,FLAG_EMP_PHONE,FLAG_WORK_PHONE,FLAG_CONT_MOBILE,FLAG_PHONE,FLAG_EMAIL,OCCUPATION_TYPE,CNT_FAM_MEMBERS,REGION_RATING_CLIENT,REGION_RATING_CLIENT_W_CITY,WEEKDAY_APPR_PROCESS_START,HOUR_APPR_PROCESS_START,REG_REGION_NOT_LIVE_REGION,REG_REGION_NOT_WORK_REGION,LIVE_REGION_NOT_WORK_REGION,REG_CITY_NOT_LIVE_CITY,REG_CITY_NOT_WORK_CITY,LIVE_CITY_NOT_WORK_CITY,ORGANIZATION_TYPE,EXT_SOURCE_1,EXT_SOURCE_2,EXT_SOURCE_3,APARTMENTS_AVG,BASEMENTAREA_AVG,YEARS_BEGINEXPLUATATION_AVG,YEARS_BUILD_AVG,COMMONAREA_AVG,ELEVATORS_AVG,ENTRANCES_AVG,FLOORSMAX_AVG,FLOORSMIN_AVG,LANDAREA_AVG,LIVINGAPARTMENTS_AVG,LIVINGAREA_AVG,NONLIVINGAPARTMENTS_AVG,NONLIVINGAREA_AVG,APARTMENTS_MODE,BASEMENTAREA_MODE,YEARS_BEGINEXPLUATATION_MODE,YEARS_BUILD_MODE,COMMONAREA_MODE,ELEVATORS_MODE,ENTRANCES_MODE,FLOORSMAX_MODE,FLOORSMIN_MODE,LANDAREA_MODE,LIVINGAPARTMENTS_MODE,LIVINGAREA_MODE,NONLIVINGAPARTMENTS_MODE,NONLIVINGAREA_MODE,APARTMENTS_MEDI,BASEMENTAREA_MEDI,YEARS_BEGINEXPLUATATION_MEDI,YEARS_BUILD_MEDI,COMMONAREA_MEDI,ELEVATORS_MEDI,ENTRANCES_MEDI,FLOORSMAX_MEDI,FLOORSMIN_MEDI,LANDAREA_MEDI,LIVINGAPARTMENTS_MEDI,LIVINGAREA_MEDI,NONLIVINGAPARTMENTS_MEDI,NONLIVINGAREA_MEDI,FONDKAPREMONT_MODE,HOUSETYPE_MODE,TOTALAREA_MODE,WALLSMATERIAL_MODE,EMERGENCYSTATE_MODE,OBS_30_CNT_SOCIAL_CIRCLE,DEF_30_CNT_SOCIAL_CIRCLE,OBS_60_CNT_SOCIAL_CIRCLE,DEF_60_CNT_SOCIAL_CIRCLE,DAYS_LAST_PHONE_CHANGE,FLAG_DOCUMENT_2,FLAG_DOCUMENT_3,FLAG_DOCUMENT_4,FLAG_DOCUMENT_5,FLAG_DOCUMENT_6,FLAG_DOCUMENT_7,FLAG_DOCUMENT_8,FLAG_DOCUMENT_9,FLAG_DOCUMENT_10,FLAG_DOCUMENT_11,FLAG_DOCUMENT_12,FLAG_DOCUMENT_13,FLAG_DOCUMENT_14,FLAG_DOCUMENT_15,FLAG_DOCUMENT_16,FLAG_DOCUMENT_17,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
100002,1,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,351000.0,Unaccompanied,Working,Secondary / secondary special,Single / not married,House / apartment,0.018801,-9461,-637,-3648.0,-2120,,1,1,0,1,1,0,Laborers,1.0,2,2,WEDNESDAY,10,0,0,0,0,0,0,Business Entity Type 3,0.08303696739132256,0.2629485927471776,0.13937578009978951,0.0247,0.0369,0.9722,0.6192,0.0143,0.0,0.069,0.0833,0.125,0.0369,0.0202,0.019,0.0,0.0,0.0252,0.0383,0.9722,0.6341,0.0144,0.0,0.069,0.0833,0.125,0.0377,0.022,0.0198,0.0,0.0,0.025,0.0369,0.9722,0.6243,0.0144,0.0,0.069,0.0833,0.125,0.0375,0.0205,0.0193,0.0,0.0,reg oper account,block of flats,0.0149,"Stone, brick",No,2.0,2.0,2.0,2.0,-1134.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
100003,0,Cash loans,F,N,N,0,270000.0,1293502.5,35698.5,1129500.0,Family,State servant,Higher education,Married,House / apartment,0.003540999999999999,-16765,-1188,-1186.0,-291,,1,1,0,1,1,0,Core staff,2.0,1,1,MONDAY,11,0,0,0,0,0,0,School,0.3112673113812225,0.6222457752555098,,0.0959,0.0529,0.9851,0.7959999999999999,0.0605,0.08,0.0345,0.2917,0.3333,0.013,0.0773,0.0549,0.0039,0.0098,0.0924,0.0538,0.9851,0.804,0.0497,0.0806,0.0345,0.2917,0.3333,0.0128,0.079,0.0554,0.0,0.0,0.0968,0.0529,0.9851,0.7987,0.0608,0.08,0.0345,0.2917,0.3333,0.0132,0.0787,0.0558,0.0039,0.01,reg oper account,block of flats,0.0714,Block,No,1.0,0.0,1.0,0.0,-828.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
100004,0,Revolving loans,M,Y,Y,0,67500.0,135000.0,6750.0,135000.0,Unaccompanied,Working,Secondary / secondary special,Single / not married,House / apartment,0.010032,-19046,-225,-4260.0,-2531,26.0,1,1,1,1,1,0,Laborers,1.0,2,2,MONDAY,9,0,0,0,0,0,0,Government,,0.5559120833904428,0.7295666907060153,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,0.0,0.0,-815.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
100006,0,Cash loans,F,N,Y,0,135000.0,312682.5,29686.5,297000.0,Unaccompanied,Working,Secondary / secondary special,Civil marriage,House / apartment,0.008019,-19005,-3039,-9833.0,-2437,,1,1,0,1,0,0,Laborers,2.0,2,2,WEDNESDAY,17,0,0,0,0,0,0,Business Entity Type 3,,0.6504416904014653,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2.0,0.0,2.0,0.0,-617.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,,,,,

!cd data/raw/ &&\
    head -n 5 application_train.csv | csvlook

| SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL |  AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | NAME_TYPE_SUITE | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE           | NAME_FAMILY_STATUS   | NAME_HOUSING_TYPE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | DAYS_ID_PUBLISH | OWN_CAR_AGE | FLAG_MOBIL | FLAG_EMP_PHONE | FLAG_WORK_PHONE | FLAG_CONT_MOBILE | FLAG_PHONE | FLAG_EMAIL | OCCUPATION_TYPE | CNT_FAM_MEMBERS | REGION_RATING_CLIENT | REGION_RATING_CLIENT_W_CITY | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | REG_REGION_NOT_LIVE_REGION | REG_REGION_NOT_WORK_REGION | LIVE_REGION_NOT_WORK_REGION | REG_CITY_NOT_LIVE_CITY | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | ORGANIZATION_TYPE      | EXT_SOURCE_1 | EXT_SOURCE_2 | EXT_SOURCE_3 | APARTMENTS_AVG | BASEMENTAREA_AVG | YEARS_BEGINEXPLUATATION_AVG | YEARS_BUILD_AVG | COMMONAREA_AVG | ELEVATORS_AVG | ENTRANCES_AVG | FLOORSMAX_AVG | FLOORSMIN_AVG | LANDAREA_AVG | LIVINGAPARTMENTS_AVG | LIVINGAREA_AVG | NONLIVINGAPARTMENTS_AVG | NONLIVINGAREA_AVG | APARTMENTS_MODE | BASEMENTAREA_MODE | YEARS_BEGINEXPLUATATION_MODE | YEARS_BUILD_MODE | COMMONAREA_MODE | ELEVATORS_MODE | ENTRANCES_MODE | FLOORSMAX_MODE | FLOORSMIN_MODE | LANDAREA_MODE | LIVINGAPARTMENTS_MODE | LIVINGAREA_MODE | NONLIVINGAPARTMENTS_MODE | NONLIVINGAREA_MODE | APARTMENTS_MEDI | BASEMENTAREA_MEDI | YEARS_BEGINEXPLUATATION_MEDI | YEARS_BUILD_MEDI | COMMONAREA_MEDI | ELEVATORS_MEDI | ENTRANCES_MEDI | FLOORSMAX_MEDI | FLOORSMIN_MEDI | LANDAREA_MEDI | LIVINGAPARTMENTS_MEDI | LIVINGAREA_MEDI | NONLIVINGAPARTMENTS_MEDI | NONLIVINGAREA_MEDI | FONDKAPREMONT_MODE | HOUSETYPE_MODE | TOTALAREA_MODE | WALLSMATERIAL_MODE | EMERGENCYSTATE_MODE | OBS_30_CNT_SOCIAL_CIRCLE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | DAYS_LAST_PHONE_CHANGE | FLAG_DOCUMENT_2 | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_4 | FLAG_DOCUMENT_5 | 
FLAG_DOCUMENT_6 | FLAG_DOCUMENT_7 | FLAG_DOCUMENT_8 | FLAG_DOCUMENT_9 | FLAG_DOCUMENT_10 | FLAG_DOCUMENT_11 | FLAG_DOCUMENT_12 | FLAG_DOCUMENT_13 | FLAG_DOCUMENT_14 | FLAG_DOCUMENT_15 | FLAG_DOCUMENT_16 | FLAG_DOCUMENT_17 | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR |
| ---------- | ------ | ------------------ | ----------- | ------------ | --------------- | ------------ | ---------------- | ----------- | ----------- | --------------- | --------------- | ---------------- | ----------------------------- | -------------------- | ----------------- | -------------------------- | ---------- | ------------- | ----------------- | --------------- | ----------- | ---------- | -------------- | --------------- | ---------------- | ---------- | ---------- | --------------- | --------------- | -------------------- | --------------------------- | -------------------------- | ----------------------- | -------------------------- | -------------------------- | --------------------------- | ---------------------- | ---------------------- | ----------------------- | ---------------------- | ------------ | ------------ | ------------ | -------------- | ---------------- | --------------------------- | --------------- | -------------- | ------------- | ------------- | ------------- | ------------- | ------------ | -------------------- | -------------- | ----------------------- | ----------------- | --------------- | ----------------- | ---------------------------- | ---------------- | --------------- | -------------- | -------------- | -------------- | -------------- | ------------- | --------------------- | --------------- | ------------------------ | ------------------ | --------------- | ----------------- | ---------------------------- | ---------------- | --------------- | -------------- | -------------- | -------------- | -------------- | ------------- | --------------------- | --------------- | ------------------------ | ------------------ | ------------------ | -------------- | -------------- | ------------------ | ------------------- | ------------------------ | ------------------------ | ------------------------ | ------------------------ | ---------------------- | --------------- | --------------- | --------------- | --------------- | 
--------------- | --------------- | --------------- | --------------- | ---------------- | ---------------- | ---------------- | ---------------- | ---------------- | ---------------- | ---------------- | ---------------- | ---------------- | ---------------- | ---------------- | ---------------- | -------------------------- | ------------------------- | -------------------------- | ------------------------- | ------------------------- | -------------------------- |
|    100,002 |   True | Cash loans         | M           |        False |            True |        False |          202,500 |   406,597.5 |    24,700.5 |         351,000 | Unaccompanied   | Working          | Secondary / secondary special | Single / not married | House / apartment |                     0.019… |     -9,461 |          -637 |            -3,648 |          -2,120 |             |       True |           True |           False |             True |       True |      False | Laborers        |               1 |                    2 |                           2 |                 0001-01-03 |                      10 |                      False |                      False |                       False |                  False |                  False |                   False | Business Entity Type 3 |       0.083… |       0.263… |       0.139… |         0.025… |           0.037… |                      0.972… |          0.619… |         0.014… |          0.00 |        0.069… |        0.083… |        0.125… |       0.037… |               0.020… |         0.019… |                  0.000… |            0.000… |          0.025… |            0.038… |                       0.972… |           0.634… |          0.014… |         0.000… |         0.069… |         0.083… |         0.125… |        0.038… |                 0.022 |          0.020… |                        0 |                  0 |          0.025… |            0.037… |                       0.972… |           0.624… |          0.014… |           0.00 |         0.069… |         0.083… |         0.125… |        0.038… |                0.020… |          0.019… |                   0.000… |               0.00 | reg oper account   | block of flats |         0.015… | Stone, brick       |               False |                        2 |                        2 |                        2 |                        2 |                 -1,134 |           False |            True |           False |           False |       
    False |           False |           False |           False |            False |            False |            False |            False |            False |            False |            False |            False |            False |            False |            False |            False |                          0 |                         0 |                          0 |                         0 |                         0 |                          1 |
|    100,003 |  False | Cash loans         | F           |        False |           False |        False |          270,000 | 1,293,502.5 |    35,698.5 |       1,129,500 | Family          | State servant    | Higher education              | Married              | House / apartment |                     0.004… |    -16,765 |        -1,188 |            -1,186 |            -291 |             |       True |           True |           False |             True |       True |      False | Core staff      |               2 |                    1 |                           1 |                 0001-01-08 |                      11 |                      False |                      False |                       False |                  False |                  False |                   False | School                 |       0.311… |       0.622… |              |         0.096… |           0.053… |                      0.985… |          0.796… |         0.060… |          0.08 |        0.034… |        0.292… |        0.333… |       0.013… |               0.077… |         0.055… |                  0.004… |            0.010… |          0.092… |            0.054… |                       0.985… |           0.804… |          0.050… |         0.081… |         0.034… |         0.292… |         0.333… |        0.013… |                 0.079 |          0.055… |                        0 |                  0 |          0.097… |            0.053… |                       0.985… |           0.799… |          0.061… |           0.08 |         0.034… |         0.292… |         0.333… |        0.013… |                0.079… |          0.056… |                   0.004… |               0.01 | reg oper account   | block of flats |         0.071… | Block              |               False |                        1 |                        0 |                        1 |                        0 |                   -828 |           False |            True |           False |           False |       
    False |           False |           False |           False |            False |            False |            False |            False |            False |            False |            False |            False |            False |            False |            False |            False |                          0 |                         0 |                          0 |                         0 |                         0 |                          0 |
|    100,004 |  False | Revolving loans    | M           |         True |            True |        False |           67,500 |   135,000.0 |     6,750.0 |         135,000 | Unaccompanied   | Working          | Secondary / secondary special | Single / not married | House / apartment |                     0.010… |    -19,046 |          -225 |            -4,260 |          -2,531 |          26 |       True |           True |            True |             True |       True |      False | Laborers        |               1 |                    2 |                           2 |                 0001-01-08 |                       9 |                      False |                      False |                       False |                  False |                  False |                   False | Government             |              |       0.556… |       0.730… |                |                  |                             |                 |                |               |               |               |               |              |                      |                |                         |                   |                 |                   |                              |                  |                 |                |                |                |                |               |                       |                 |                          |                    |                 |                   |                              |                  |                 |                |                |                |                |               |                       |                 |                          |                    |                    |                |                |                    |                     |                        0 |                        0 |                        0 |                        0 |                   -815 |           False |           False |           False |           False |       
    False |           False |           False |           False |            False |            False |            False |            False |            False |            False |            False |            False |            False |            False |            False |            False |                          0 |                         0 |                          0 |                         0 |                         0 |                          0 |
|    100,006 |  False | Cash loans         | F           |        False |            True |        False |          135,000 |   312,682.5 |    29,686.5 |         297,000 | Unaccompanied   | Working          | Secondary / secondary special | Civil marriage       | House / apartment |                     0.008… |    -19,005 |        -3,039 |            -9,833 |          -2,437 |             |       True |           True |           False |             True |      False |      False | Laborers        |               2 |                    2 |                           2 |                 0001-01-03 |                      17 |                      False |                      False |                       False |                  False |                  False |                   False | Business Entity Type 3 |              |       0.650… |              |                |                  |                             |                 |                |               |               |               |               |              |                      |                |                         |                   |                 |                   |                              |                  |                 |                |                |                |                |               |                       |                 |                          |                    |                 |                   |                              |                  |                 |                |                |                |                |               |                       |                 |                          |                    |                    |                |                |                    |                     |                        2 |                        0 |                        2 |                        0 |                   -617 |           False |            True |           False |           False |       
    False |           False |           False |           False |            False |            False |            False |            False |            False |            False |            False |            False |            False |            False |            False |            False |                            |                           |                            |                           |                           |                            |

3.2 Read Data

Next, the data files are read into the Python environment.

Potentially useful files are:

application_train.csv
bureau.csv
bureau_balance.csv
previous_application.csv
POS_CASH_balance.csv
credit_card_balance.csv
installments_payments.csv

Files to discard from the analysis:

application_test.csv: no target variable.
sample_submission.csv: just an example of a submission file, not relevant for the analysis.

if not os.path.exists("data/interim/raw--application.feather"):
    # Read CSV data, convert to smaller data types
    # and save as Feather files
    application = pd.read_csv("data/raw/application_train.csv").pipe(
        klib.convert_datatypes
    )
    bureau = pd.read_csv("data/raw/bureau.csv").pipe(klib.convert_datatypes)
    bureau_balance = pd.read_csv("data/raw/bureau_balance.csv").pipe(
        klib.convert_datatypes
    )
    previous_application = pd.read_csv("data/raw/previous_application.csv").pipe(
        klib.convert_datatypes
    )
    credit_card_balance = pd.read_csv("data/raw/credit_card_balance.csv").pipe(
        klib.convert_datatypes
    )
    installments_payments = pd.read_csv("data/raw/installments_payments.csv").pipe(
        klib.convert_datatypes
    )
    pos_cash_balance = pd.read_csv("data/raw/POS_CASH_balance.csv").pipe(
        klib.convert_datatypes
    )

    application_test = pd.read_csv("data/raw/application_test.csv").pipe(
        klib.convert_datatypes
    )
    sample_submission = pd.read_csv("data/raw/sample_submission.csv").pipe(
        klib.convert_datatypes
    )
    # Time to read CSV data: 2m 1.1s

    # Use Feather format for quicker loading of data
    application.to_feather("data/interim/raw--application.feather")
    bureau.to_feather("data/interim/raw--bureau.feather")
    bureau_balance.to_feather("data/interim/raw--bureau_balance.feather")
    previous_application.to_feather("data/interim/raw--previous_application.feather")
    credit_card_balance.to_feather("data/interim/raw--credit_card_balance.feather")
    installments_payments.to_feather("data/interim/raw--installments_payments.feather")
    pos_cash_balance.to_feather("data/interim/raw--pos_cash_balance.feather")

    application_test.to_feather("data/interim/raw--application_test.feather")
    sample_submission.to_feather("data/interim/raw--sample_submission.feather")

else:
    # Read cached data
    application = pd.read_feather("data/interim/raw--application.feather")
    bureau = pd.read_feather("data/interim/raw--bureau.feather")
    bureau_balance = pd.read_feather("data/interim/raw--bureau_balance.feather")
    previous_application = pd.read_feather(
        "data/interim/raw--previous_application.feather"
    )
    credit_card_balance = pd.read_feather(
        "data/interim/raw--credit_card_balance.feather"
    )
    installments_payments = pd.read_feather(
        "data/interim/raw--installments_payments.feather"
    )
    pos_cash_balance = pd.read_feather("data/interim/raw--pos_cash_balance.feather")

    application_test = pd.read_feather("data/interim/raw--application_test.feather")
    sample_submission = pd.read_feather("data/interim/raw--sample_submission.feather")
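The read-or-cache logic above repeats the same steps for every table, so it could be factored into a helper. The following is a sketch under stated assumptions, not the code used in this project: `cache_path` and `load_table` are hypothetical names, and, unlike the code above, the cache is checked per table rather than once for all tables. Writing Feather files with pandas requires the pyarrow package.

```python
from pathlib import Path
from typing import Callable, Optional

import pandas as pd


def cache_path(name: str, cache_dir: str = "data/interim") -> Path:
    """Return the Feather cache path for a table, e.g. raw--bureau.feather."""
    return Path(cache_dir) / f"raw--{name}.feather"


def load_table(
    name: str,
    csv_path: str,
    cache_dir: str = "data/interim",
    convert: Optional[Callable[[pd.DataFrame], pd.DataFrame]] = None,
) -> pd.DataFrame:
    """Read a table from its Feather cache, creating the cache if missing."""
    cached = cache_path(name, cache_dir)
    if cached.exists():
        return pd.read_feather(cached)
    df = pd.read_csv(csv_path)
    if convert is not None:
        df = convert(df)  # e.g., klib.convert_datatypes
    cached.parent.mkdir(parents=True, exist_ok=True)
    df.to_feather(cached)  # requires pyarrow
    return df


# Possible usage (not executed here):
# application = load_table(
#     "application", "data/raw/application_train.csv",
#     convert=klib.convert_datatypes,
# )
```

Checking the cache per table means that deleting a single Feather file triggers re-reading only that table, at the cost of one extra existence check per file.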

3.3 Inspect Data

Next, the data files are inspected. The purpose of this step is to get a general idea of the data and to spot potential issues. More detailed data exploration (EDA) will be performed later on training data only.

Table application_train has the largest number of columns (122), and table bureau_balance has the largest number of rows (27.3M). The next sub-sections explore each table in more detail.
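A comparison of table sizes like the one above can be produced with a small helper. This is a minimal sketch on toy data frames; the function name `summarize_shapes` is hypothetical, and in the notebook the dictionary would hold the tables loaded in the previous section.

```python
import pandas as pd


def summarize_shapes(tables: dict) -> pd.DataFrame:
    """Return one row per table with its number of rows and columns."""
    return pd.DataFrame(
        [(name, df.shape[0], df.shape[1]) for name, df in tables.items()],
        columns=["table", "n_rows", "n_columns"],
    ).set_index("table")


# Toy stand-ins for the real tables (contents are made up).
toy_tables = {
    "application": pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]}),
    "bureau_balance": pd.DataFrame({"a": [1, 2, 3, 4]}),
}
shapes = summarize_shapes(toy_tables)

# Tables with the most columns and the most rows
widest = shapes["n_columns"].idxmax()
longest = shapes["n_rows"].idxmax()
```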

3.3.1 Table `application`

application.info(verbose=False)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307511 entries, 0 to 307510
Columns: 122 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR
dtypes: category(16), float32(64), float64(1), int16(2), int32(2), int8(37)
memory usage: 96.5 MB

application.head()

	SK_ID_CURR	TARGET	NAME_CONTRACT_TYPE	CODE_GENDER	FLAG_OWN_CAR	FLAG_OWN_REALTY	AMT_INCOME_TOTAL	AMT_CREDIT	AMT_ANNUITY	AMT_GOODS_PRICE	NAME_TYPE_SUITE	NAME_INCOME_TYPE	NAME_EDUCATION_TYPE	NAME_FAMILY_STATUS	NAME_HOUSING_TYPE	REGION_POPULATION_RELATIVE	DAYS_BIRTH	DAYS_EMPLOYED	DAYS_REGISTRATION	DAYS_ID_PUBLISH	OWN_CAR_AGE	FLAG_MOBIL	FLAG_EMP_PHONE	FLAG_WORK_PHONE	FLAG_CONT_MOBILE	FLAG_PHONE	OCCUPATION_TYPE	CNT_FAM_MEMBERS	REGION_RATING_CLIENT	REGION_RATING_CLIENT_W_CITY	WEEKDAY_APPR_PROCESS_START	HOUR_APPR_PROCESS_START	REG_CITY_NOT_WORK_CITY	LIVE_CITY_NOT_WORK_CITY	ORGANIZATION_TYPE	EXT_SOURCE_1	EXT_SOURCE_2	EXT_SOURCE_3	APARTMENTS_AVG	BASEMENTAREA_AVG	YEARS_BEGINEXPLUATATION_AVG	YEARS_BUILD_AVG	COMMONAREA_AVG	ELEVATORS_AVG	ENTRANCES_AVG	FLOORSMAX_AVG	FLOORSMIN_AVG	LANDAREA_AVG	LIVINGAPARTMENTS_AVG	LIVINGAREA_AVG	NONLIVINGAPARTMENTS_AVG	NONLIVINGAREA_AVG	APARTMENTS_MODE	BASEMENTAREA_MODE	YEARS_BEGINEXPLUATATION_MODE	YEARS_BUILD_MODE	COMMONAREA_MODE	ELEVATORS_MODE	ENTRANCES_MODE	FLOORSMAX_MODE	FLOORSMIN_MODE	LANDAREA_MODE	LIVINGAPARTMENTS_MODE	LIVINGAREA_MODE	NONLIVINGAPARTMENTS_MODE	NONLIVINGAREA_MODE	APARTMENTS_MEDI	BASEMENTAREA_MEDI	YEARS_BEGINEXPLUATATION_MEDI	YEARS_BUILD_MEDI	COMMONAREA_MEDI	ELEVATORS_MEDI	ENTRANCES_MEDI	FLOORSMAX_MEDI	FLOORSMIN_MEDI	LANDAREA_MEDI	LIVINGAPARTMENTS_MEDI	LIVINGAREA_MEDI	NONLIVINGAPARTMENTS_MEDI	NONLIVINGAREA_MEDI	FONDKAPREMONT_MODE	HOUSETYPE_MODE	TOTALAREA_MODE	WALLSMATERIAL_MODE	EMERGENCYSTATE_MODE	OBS_30_CNT_SOCIAL_CIRCLE	DEF_30_CNT_SOCIAL_CIRCLE	OBS_60_CNT_SOCIAL_CIRCLE	DEF_60_CNT_SOCIAL_CIRCLE	DAYS_LAST_PHONE_CHANGE	FLAG_DOCUMENT_3	FLAG_DOCUMENT_8	AMT_REQ_CREDIT_BUREAU_HOUR	AMT_REQ_CREDIT_BUREAU_DAY	AMT_REQ_CREDIT_BUREAU_WEEK	AMT_REQ_CREDIT_BUREAU_MON	AMT_REQ_CREDIT_BUREAU_QRT	AMT_REQ_CREDIT_BUREAU_YEAR
0	100002	1	Cash loans	M	N	Y	202500.00	406597.50	24700.50	351000.00	Unaccompanied	Working	Secondary / secondary special	Single / not married	House / apartment	0.02	-9461	-637	-3648.00	-2120	NaN	1	1	0	1	1	Laborers	1.00	2	2	WEDNESDAY	10	0	0	Business Entity Type 3	0.08	0.26	0.14	0.02	0.04	0.97	0.62	0.01	0.00	0.07	0.08	0.12	0.04	0.02	0.02	0.00	0.00	0.03	0.04	0.97	0.63	0.01	0.00	0.07	0.08	0.12	0.04	0.02	0.02	0.00	0.00	0.03	0.04	0.97	0.62	0.01	0.00	0.07	0.08	0.12	0.04	0.02	0.02	0.00	0.00	reg oper account	block of flats	0.01	Stone, brick	No	2.00	2.00	2.00	2.00	-1134.00	1	0	0.00	0.00	0.00	0.00	0.00	1.00
1	100003	0	Cash loans	F	N	N	270000.00	1293502.50	35698.50	1129500.00	Family	State servant	Higher education	Married	House / apartment	0.00	-16765	-1188	-1186.00	-291	NaN	1	1	0	1	1	Core staff	2.00	1	1	MONDAY	11	0	0	School	0.31	0.62	NaN	0.10	0.05	0.99	0.80	0.06	0.08	0.03	0.29	0.33	0.01	0.08	0.05	0.00	0.01	0.09	0.05	0.99	0.80	0.05	0.08	0.03	0.29	0.33	0.01	0.08	0.06	0.00	0.00	0.10	0.05	0.99	0.80	0.06	0.08	0.03	0.29	0.33	0.01	0.08	0.06	0.00	0.01	reg oper account	block of flats	0.07	Block	No	1.00	0.00	1.00	0.00	-828.00	1	0	0.00	0.00	0.00	0.00	0.00	0.00
2	100004	0	Revolving loans	M	Y	Y	67500.00	135000.00	6750.00	135000.00	Unaccompanied	Working	Secondary / secondary special	Single / not married	House / apartment	0.01	-19046	-225	-4260.00	-2531	26.00	1	1	1	1	1	Laborers	1.00	2	2	MONDAY	9	0	0	Government	NaN	0.56	0.73	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0.00	0.00	0.00	0.00	-815.00	0	0	0.00	0.00	0.00	0.00	0.00	0.00
3	100006	0	Cash loans	F	N	Y	135000.00	312682.50	29686.50	297000.00	Unaccompanied	Working	Secondary / secondary special	Civil marriage	House / apartment	0.01	-19005	-3039	-9833.00	-2437	NaN	1	1	0	1	0	Laborers	2.00	2	2	WEDNESDAY	17	0	0	Business Entity Type 3	NaN	0.65	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2.00	0.00	2.00	0.00	-617.00	1	0	NaN	NaN	NaN	NaN	NaN	NaN
4	100007	0	Cash loans	M	N	Y	121500.00	513000.00	21865.50	513000.00	Unaccompanied	Working	Secondary / secondary special	Single / not married	House / apartment	0.03	-19932	-3038	-4311.00	-3458	NaN	1	1	0	1	0	Core staff	1.00	2	2	THURSDAY	11	1	1	Religion	NaN	0.32	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0.00	0.00	0.00	0.00	-1106.00	0	1	0.00	0.00	0.00	0.00	0.00	0.00

Code

an.col_info(application, style=True)

	column	data_type	memory_size	n_unique	p_unique	n_missing	p_missing	n_dominant	p_dominant	p_dom_excl_na	dominant
1	SK_ID_CURR	int32	1.2 MB	307,511	100.0%	0	0%	1	<0.1%	<0.1%	100002
2	TARGET	int8	307.5 kB	2	<0.1%	0	0%	282,686	91.9%	91.9%	0
3	NAME_CONTRACT_TYPE	category	307.8 kB	2	<0.1%	0	0%	278,232	90.5%	90.5%	Cash loans
4	CODE_GENDER	category	307.8 kB	3	<0.1%	0	0%	202,448	65.8%	65.8%	F
5	FLAG_OWN_CAR	category	307.7 kB	2	<0.1%	0	0%	202,924	66.0%	66.0%	N
6	FLAG_OWN_REALTY	category	307.7 kB	2	<0.1%	0	0%	213,312	69.4%	69.4%	Y
7	CNT_CHILDREN	int8	307.5 kB	15	<0.1%	0	0%	215,371	70.0%	70.0%	0
8	AMT_INCOME_TOTAL	float64	2.5 MB	2,548	0.8%	0	0%	35,750	11.6%	11.6%	135000.0
9	AMT_CREDIT	float32	1.2 MB	5,603	1.8%	0	0%	9,709	3.2%	3.2%	450000.0
10	AMT_ANNUITY	float32	1.2 MB	13,672	4.4%	12	<0.1%	6,385	2.1%	2.1%	9000.0
11	AMT_GOODS_PRICE	float32	1.2 MB	1,002	0.3%	278	0.1%	26,022	8.5%	8.5%	450000.0
12	NAME_TYPE_SUITE	category	308.3 kB	7	<0.1%	1,292	0.4%	248,526	80.8%	81.2%	Unaccompanied
13	NAME_INCOME_TYPE	category	308.4 kB	8	<0.1%	0	0%	158,774	51.6%	51.6%	Working
14	NAME_EDUCATION_TYPE	category	308.1 kB	5	<0.1%	0	0%	218,391	71.0%	71.0%	Secondary / secondary special
15	NAME_FAMILY_STATUS	category	308.1 kB	6	<0.1%	0	0%	196,432	63.9%	63.9%	Married
16	NAME_HOUSING_TYPE	category	308.1 kB	6	<0.1%	0	0%	272,868	88.7%	88.7%	House / apartment
17	REGION_POPULATION_RELATIVE	float32	1.2 MB	81	<0.1%	0	0%	16,408	5.3%	5.3%	0.035792
18	DAYS_BIRTH	int16	615.0 kB	17,460	5.7%	0	0%	43	<0.1%	<0.1%	-13749
19	DAYS_EMPLOYED	int32	1.2 MB	12,574	4.1%	0	0%	55,374	18.0%	18.0%	365243
20	DAYS_REGISTRATION	float32	1.2 MB	15,688	5.1%	0	0%	113	<0.1%	<0.1%	-1.0
21	DAYS_ID_PUBLISH	int16	615.0 kB	6,168	2.0%	0	0%	169	0.1%	0.1%	-4053
22	OWN_CAR_AGE	float32	1.2 MB	62	<0.1%	202,929	66.0%	7,424	2.4%	7.1%	7.0
23	FLAG_MOBIL	int8	307.5 kB	2	<0.1%	0	0%	307,510	>99.9%	>99.9%	1
24	FLAG_EMP_PHONE	int8	307.5 kB	2	<0.1%	0	0%	252,125	82.0%	82.0%	1
25	FLAG_WORK_PHONE	int8	307.5 kB	2	<0.1%	0	0%	246,203	80.1%	80.1%	0
26	FLAG_CONT_MOBILE	int8	307.5 kB	2	<0.1%	0	0%	306,937	99.8%	99.8%	1
27	FLAG_PHONE	int8	307.5 kB	2	<0.1%	0	0%	221,080	71.9%	71.9%	0
28	FLAG_EMAIL	int8	307.5 kB	2	<0.1%	0	0%	290,069	94.3%	94.3%	0
29	OCCUPATION_TYPE	category	309.3 kB	18	<0.1%	96,391	31.3%	55,186	17.9%	26.1%	Laborers
30	CNT_FAM_MEMBERS	float32	1.2 MB	17	<0.1%	2	<0.1%	158,357	51.5%	51.5%	2.0
31	REGION_RATING_CLIENT	int8	307.5 kB	3	<0.1%	0	0%	226,984	73.8%	73.8%	2
32	REGION_RATING_CLIENT_W_CITY	int8	307.5 kB	3	<0.1%	0	0%	229,484	74.6%	74.6%	2
33	WEEKDAY_APPR_PROCESS_START	category	308.3 kB	7	<0.1%	0	0%	53,901	17.5%	17.5%	TUESDAY
34	HOUR_APPR_PROCESS_START	int8	307.5 kB	24	<0.1%	0	0%	37,722	12.3%	12.3%	10
35	REG_REGION_NOT_LIVE_REGION	int8	307.5 kB	2	<0.1%	0	0%	302,854	98.5%	98.5%	0
36	REG_REGION_NOT_WORK_REGION	int8	307.5 kB	2	<0.1%	0	0%	291,899	94.9%	94.9%	0
37	LIVE_REGION_NOT_WORK_REGION	int8	307.5 kB	2	<0.1%	0	0%	295,008	95.9%	95.9%	0
38	REG_CITY_NOT_LIVE_CITY	int8	307.5 kB	2	<0.1%	0	0%	283,472	92.2%	92.2%	0
39	REG_CITY_NOT_WORK_CITY	int8	307.5 kB	2	<0.1%	0	0%	236,644	77.0%	77.0%	0
40	LIVE_CITY_NOT_WORK_CITY	int8	307.5 kB	2	<0.1%	0	0%	252,296	82.0%	82.0%	0
41	ORGANIZATION_TYPE	category	313.6 kB	58	<0.1%	0	0%	67,992	22.1%	22.1%	Business Entity Type 3
42	EXT_SOURCE_1	float32	1.2 MB	114,584	37.3%	173,378	56.4%	5	<0.1%	<0.1%	0.62270665
43	EXT_SOURCE_2	float32	1.2 MB	119,831	39.0%	660	0.2%	721	0.2%	0.2%	0.28589788
44	EXT_SOURCE_3	float32	1.2 MB	814	0.3%	60,965	19.8%	1,460	0.5%	0.6%	0.7463002
45	APARTMENTS_AVG	float32	1.2 MB	2,339	0.8%	156,061	50.7%	6,663	2.2%	4.4%	0.0825
46	BASEMENTAREA_AVG	float32	1.2 MB	3,780	1.2%	179,943	58.5%	14,745	4.8%	11.6%	0.0
47	YEARS_BEGINEXPLUATATION_AVG	float32	1.2 MB	285	0.1%	150,007	48.8%	4,311	1.4%	2.7%	0.9871
48	YEARS_BUILD_AVG	float32	1.2 MB	149	<0.1%	204,488	66.5%	2,999	1.0%	2.9%	0.8232
49	COMMONAREA_AVG	float32	1.2 MB	3,181	1.0%	214,865	69.9%	8,442	2.7%	9.1%	0.0
50	ELEVATORS_AVG	float32	1.2 MB	257	0.1%	163,891	53.3%	85,718	27.9%	59.7%	0.0
51	ENTRANCES_AVG	float32	1.2 MB	285	0.1%	154,828	50.3%	34,007	11.1%	22.3%	0.1379
52	FLOORSMAX_AVG	float32	1.2 MB	403	0.1%	153,020	49.8%	61,875	20.1%	40.1%	0.1667
53	FLOORSMIN_AVG	float32	1.2 MB	305	0.1%	208,642	67.8%	32,875	10.7%	33.3%	0.2083
54	LANDAREA_AVG	float32	1.2 MB	3,527	1.1%	182,590	59.4%	15,600	5.1%	12.5%	0.0
55	LIVINGAPARTMENTS_AVG	float32	1.2 MB	1,868	0.6%	210,199	68.4%	4,272	1.4%	4.4%	0.0504
56	LIVINGAREA_AVG	float32	1.2 MB	5,199	1.7%	154,350	50.2%	284	0.1%	0.2%	0.0
57	NONLIVINGAPARTMENTS_AVG	float32	1.2 MB	386	0.1%	213,514	69.4%	54,549	17.7%	58.0%	0.0
58	NONLIVINGAREA_AVG	float32	1.2 MB	3,290	1.1%	169,682	55.2%	58,735	19.1%	42.6%	0.0
59	APARTMENTS_MODE	float32	1.2 MB	760	0.2%	156,061	50.7%	7,522	2.4%	5.0%	0.084
60	BASEMENTAREA_MODE	float32	1.2 MB	3,841	1.2%	179,943	58.5%	16,598	5.4%	13.0%	0.0
61	YEARS_BEGINEXPLUATATION_MODE	float32	1.2 MB	221	0.1%	150,007	48.8%	4,291	1.4%	2.7%	0.9871
62	YEARS_BUILD_MODE	float32	1.2 MB	154	0.1%	204,488	66.5%	2,960	1.0%	2.9%	0.8301
63	COMMONAREA_MODE	float32	1.2 MB	3,128	1.0%	214,865	69.9%	9,690	3.2%	10.5%	0.0
64	ELEVATORS_MODE	float32	1.2 MB	26	<0.1%	163,891	53.3%	89,498	29.1%	62.3%	0.0
65	ENTRANCES_MODE	float32	1.2 MB	30	<0.1%	154,828	50.3%	36,041	11.7%	23.6%	0.1379
66	FLOORSMAX_MODE	float32	1.2 MB	25	<0.1%	153,020	49.8%	65,550	21.3%	42.4%	0.1667
67	FLOORSMIN_MODE	float32	1.2 MB	25	<0.1%	208,642	67.8%	34,403	11.2%	34.8%	0.2083
68	LANDAREA_MODE	float32	1.2 MB	3,563	1.2%	182,590	59.4%	17,453	5.7%	14.0%	0.0
69	LIVINGAPARTMENTS_MODE	float32	1.2 MB	736	0.2%	210,199	68.4%	4,931	1.6%	5.1%	0.0551
70	LIVINGAREA_MODE	float32	1.2 MB	5,301	1.7%	154,350	50.2%	444	0.1%	0.3%	0.0
71	NONLIVINGAPARTMENTS_MODE	float32	1.2 MB	167	0.1%	213,514	69.4%	59,255	19.3%	63.0%	0.0
72	NONLIVINGAREA_MODE	float32	1.2 MB	3,327	1.1%	169,682	55.2%	67,126	21.8%	48.7%	0.0
73	APARTMENTS_MEDI	float32	1.2 MB	1,148	0.4%	156,061	50.7%	7,109	2.3%	4.7%	0.0833
74	BASEMENTAREA_MEDI	float32	1.2 MB	3,772	1.2%	179,943	58.5%	14,991	4.9%	11.8%	0.0
75	YEARS_BEGINEXPLUATATION_MEDI	float32	1.2 MB	245	0.1%	150,007	48.8%	4,314	1.4%	2.7%	0.9871
76	YEARS_BUILD_MEDI	float32	1.2 MB	151	<0.1%	204,488	66.5%	2,994	1.0%	2.9%	0.8256
77	COMMONAREA_MEDI	float32	1.2 MB	3,202	1.0%	214,865	69.9%	8,691	2.8%	9.4%	0.0
78	ELEVATORS_MEDI	float32	1.2 MB	46	<0.1%	163,891	53.3%	87,026	28.3%	60.6%	0.0
79	ENTRANCES_MEDI	float32	1.2 MB	46	<0.1%	154,828	50.3%	35,535	11.6%	23.3%	0.1379
80	FLOORSMAX_MEDI	float32	1.2 MB	49	<0.1%	153,020	49.8%	63,607	20.7%	41.2%	0.1667
81	FLOORSMIN_MEDI	float32	1.2 MB	47	<0.1%	208,642	67.8%	33,737	11.0%	34.1%	0.2083
82	LANDAREA_MEDI	float32	1.2 MB	3,560	1.2%	182,590	59.4%	15,919	5.2%	12.7%	0.0
83	LIVINGAPARTMENTS_MEDI	float32	1.2 MB	1,097	0.4%	210,199	68.4%	4,500	1.5%	4.6%	0.0513
84	LIVINGAREA_MEDI	float32	1.2 MB	5,281	1.7%	154,350	50.2%	299	0.1%	0.2%	0.0
85	NONLIVINGAPARTMENTS_MEDI	float32	1.2 MB	214	0.1%	213,514	69.4%	56,097	18.2%	59.7%	0.0
86	NONLIVINGAREA_MEDI	float32	1.2 MB	3,323	1.1%	169,682	55.2%	60,954	19.8%	44.2%	0.0
87	FONDKAPREMONT_MODE	category	308.0 kB	4	<0.1%	210,295	68.4%	73,830	24.0%	75.9%	reg oper account
88	HOUSETYPE_MODE	category	307.8 kB	3	<0.1%	154,297	50.2%	150,503	48.9%	98.2%	block of flats
89	TOTALAREA_MODE	float32	1.2 MB	5,116	1.7%	148,431	48.3%	582	0.2%	0.4%	0.0
90	WALLSMATERIAL_MODE	category	308.3 kB	7	<0.1%	156,341	50.8%	66,040	21.5%	43.7%	Panel
91	EMERGENCYSTATE_MODE	category	307.7 kB	2	<0.1%	145,755	47.4%	159,428	51.8%	98.6%	No
92	OBS_30_CNT_SOCIAL_CIRCLE	float32	1.2 MB	33	<0.1%	1,021	0.3%	163,910	53.3%	53.5%	0.0
93	DEF_30_CNT_SOCIAL_CIRCLE	float32	1.2 MB	10	<0.1%	1,021	0.3%	271,324	88.2%	88.5%	0.0
94	OBS_60_CNT_SOCIAL_CIRCLE	float32	1.2 MB	33	<0.1%	1,021	0.3%	164,666	53.5%	53.7%	0.0
95	DEF_60_CNT_SOCIAL_CIRCLE	float32	1.2 MB	9	<0.1%	1,021	0.3%	280,721	91.3%	91.6%	0.0
96	DAYS_LAST_PHONE_CHANGE	float32	1.2 MB	3,773	1.2%	1	<0.1%	37,672	12.3%	12.3%	0.0
97	FLAG_DOCUMENT_2	int8	307.5 kB	2	<0.1%	0	0%	307,498	>99.9%	>99.9%	0
98	FLAG_DOCUMENT_3	int8	307.5 kB	2	<0.1%	0	0%	218,340	71.0%	71.0%	1
99	FLAG_DOCUMENT_4	int8	307.5 kB	2	<0.1%	0	0%	307,486	>99.9%	>99.9%	0
100	FLAG_DOCUMENT_5	int8	307.5 kB	2	<0.1%	0	0%	302,863	98.5%	98.5%	0
101	FLAG_DOCUMENT_6	int8	307.5 kB	2	<0.1%	0	0%	280,433	91.2%	91.2%	0
102	FLAG_DOCUMENT_7	int8	307.5 kB	2	<0.1%	0	0%	307,452	>99.9%	>99.9%	0
103	FLAG_DOCUMENT_8	int8	307.5 kB	2	<0.1%	0	0%	282,487	91.9%	91.9%	0
104	FLAG_DOCUMENT_9	int8	307.5 kB	2	<0.1%	0	0%	306,313	99.6%	99.6%	0
105	FLAG_DOCUMENT_10	int8	307.5 kB	2	<0.1%	0	0%	307,504	>99.9%	>99.9%	0
106	FLAG_DOCUMENT_11	int8	307.5 kB	2	<0.1%	0	0%	306,308	99.6%	99.6%	0
107	FLAG_DOCUMENT_12	int8	307.5 kB	2	<0.1%	0	0%	307,509	>99.9%	>99.9%	0
108	FLAG_DOCUMENT_13	int8	307.5 kB	2	<0.1%	0	0%	306,427	99.6%	99.6%	0
109	FLAG_DOCUMENT_14	int8	307.5 kB	2	<0.1%	0	0%	306,608	99.7%	99.7%	0
110	FLAG_DOCUMENT_15	int8	307.5 kB	2	<0.1%	0	0%	307,139	99.9%	99.9%	0
111	FLAG_DOCUMENT_16	int8	307.5 kB	2	<0.1%	0	0%	304,458	99.0%	99.0%	0
112	FLAG_DOCUMENT_17	int8	307.5 kB	2	<0.1%	0	0%	307,429	>99.9%	>99.9%	0
113	FLAG_DOCUMENT_18	int8	307.5 kB	2	<0.1%	0	0%	305,011	99.2%	99.2%	0
114	FLAG_DOCUMENT_19	int8	307.5 kB	2	<0.1%	0	0%	307,328	99.9%	99.9%	0
115	FLAG_DOCUMENT_20	int8	307.5 kB	2	<0.1%	0	0%	307,355	99.9%	99.9%	0
116	FLAG_DOCUMENT_21	int8	307.5 kB	2	<0.1%	0	0%	307,408	>99.9%	>99.9%	0
117	AMT_REQ_CREDIT_BUREAU_HOUR	float32	1.2 MB	5	<0.1%	41,519	13.5%	264,366	86.0%	99.4%	0.0
118	AMT_REQ_CREDIT_BUREAU_DAY	float32	1.2 MB	9	<0.1%	41,519	13.5%	264,503	86.0%	99.4%	0.0
119	AMT_REQ_CREDIT_BUREAU_WEEK	float32	1.2 MB	9	<0.1%	41,519	13.5%	257,456	83.7%	96.8%	0.0
120	AMT_REQ_CREDIT_BUREAU_MON	float32	1.2 MB	24	<0.1%	41,519	13.5%	222,233	72.3%	83.5%	0.0
121	AMT_REQ_CREDIT_BUREAU_QRT	float32	1.2 MB	11	<0.1%	41,519	13.5%	215,417	70.1%	81.0%	0.0
122	AMT_REQ_CREDIT_BUREAU_YEAR	float32	1.2 MB	25	<0.1%	41,519	13.5%	71,801	23.3%	27.0%	0.0
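
The percentage columns in the summary above can be related to the count columns with simple arithmetic. The sketch below is not the implementation of `an.col_info`; it assumes that `p_dom_excl_na` is the share of the dominant value among non-missing rows, which is consistent with the numbers reported for `OWN_CAR_AGE`.

```python
# Counts copied from the table above (row OWN_CAR_AGE of `application`).
n_rows = 307_511      # number of rows in `application`
n_missing = 202_929   # n_missing
n_dominant = 7_424    # n_dominant (how often the most frequent value, 7.0, occurs)

# Share of missing values among all rows.
p_missing = 100 * n_missing / n_rows

# Share of the dominant value among all rows (missing values included).
p_dominant = 100 * n_dominant / n_rows

# Assumption: share of the dominant value among non-missing rows only.
p_dom_excl_na = 100 * n_dominant / (n_rows - n_missing)

print(f"p_missing = {p_missing:.1f}%")
print(f"p_dominant = {p_dominant:.1f}%")
print(f"p_dom_excl_na = {p_dom_excl_na:.1f}%")
```

For columns with many missing values, `p_dominant` and `p_dom_excl_na` can differ a lot, so both are worth reading together.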

Code

if do_eda:
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")

        report_inspect_1 = sweetviz.analyze(
            [application, "application"],
            pairwise_analysis="off",
        )
        report_inspect_1.show_notebook()

Let’s test the hypothesis that males (M) and females (F) have different repayment abilities. For this purpose, the chi-squared test of independence will be used. The CODE_GENDER and TARGET columns are selected from the application table and treated as nominal variables; the few rows where CODE_GENDER is XNA are excluded. The null hypothesis is that the proportions of 0 (no financial difficulties) and 1 (has financial difficulties) are the same in both groups. The alternative hypothesis is that the proportions are different.

The test results reveal that the difference is statistically significant (p < 0.001), and the row-percentage table (see below) shows that the share of clients with financial difficulties is approximately 3 percentage points higher among males (10.1%) than among females (7.0%).

Note. As making financial decisions based on gender is illegal in many countries, this variable will be excluded from the analysis.

Code

sns.set_theme(style="whitegrid")
crosstab = an.CrossTab(
    "CODE_GENDER", "TARGET", application.query("CODE_GENDER != 'XNA'")
)
crosstab.barplot(normalize="rows", stacked=True)
print(crosstab.chisq_test())
# The percentages of each row add up to 100%
crosstab.row_percentage.style.format("{:.1f}%")

chi-square test, χ²(1, n = 307507) = 920.01, p < 0.001

TARGET	0	1
CODE_GENDER
F	93.0%	7.0%
M	89.9%	10.1%
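
To illustrate how the chi-squared test of independence used above works, here is a minimal by-hand sketch for a 2×2 table. It is not the implementation behind `an.CrossTab`; the function name `chi2_independence_2x2` and the counts in `toy_table` are hypothetical and are not the project's data. No continuity correction is applied, so results may differ slightly from library routines that apply one by default.

```python
import math


def chi2_independence_2x2(table):
    """Pearson chi-squared test of independence for a 2x2 table of counts.

    Returns the test statistic and the p-value (1 degree of freedom,
    no continuity correction).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected

    # For 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts: rows are two groups, columns are TARGET = 0 and TARGET = 1.
toy_table = [[930, 70], [900, 100]]

chi2, p_value = chi2_independence_2x2(toy_table)
n_total = sum(sum(row) for row in toy_table)

# Row percentages: each row adds up to 100%.
row_pct = [[100 * count / sum(row) for count in row] for row in toy_table]

print(f"chi2(1, n = {n_total}) = {chi2:.2f}, p = {p_value:.4f}")
print(row_pct)
```

A small p-value indicates that the observed counts deviate from what independence would predict; the row percentages then show the direction and size of the difference.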

3.3.2 Table `bureau`

Code

bureau.info(verbose=False)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1716428 entries, 0 to 1716427
Columns: 17 entries, SK_ID_CURR to AMT_ANNUITY
dtypes: category(3), float32(2), float64(6), int16(2), int32(3), int8(1)
memory usage: 124.4 MB

Code

bureau.head()

	SK_ID_CURR	SK_ID_BUREAU	CREDIT_ACTIVE	CREDIT_CURRENCY	DAYS_CREDIT	DAYS_CREDIT_ENDDATE	DAYS_ENDDATE_FACT	AMT_CREDIT_MAX_OVERDUE	AMT_CREDIT_SUM	AMT_CREDIT_SUM_DEBT	AMT_CREDIT_SUM_LIMIT	CREDIT_TYPE	DAYS_CREDIT_UPDATE	AMT_ANNUITY
0	215354	5714462	Closed	currency 1	-497	-153.00	-153.00	NaN	91323.00	0.00	NaN	Consumer credit	-131	NaN
1	215354	5714463	Active	currency 1	-208	1075.00	NaN	NaN	225000.00	171342.00	NaN	Credit card	-20	NaN
2	215354	5714464	Active	currency 1	-203	528.00	NaN	NaN	464323.50	NaN	NaN	Consumer credit	-16	NaN
3	215354	5714465	Active	currency 1	-203	NaN	NaN	NaN	90000.00	NaN	NaN	Credit card	-16	NaN
4	215354	5714466	Active	currency 1	-629	1197.00	NaN	77674.50	2700000.00	NaN	NaN	Consumer credit	-21	NaN

Code

an.col_info(bureau, style=True)

	column	data_type	memory_size	n_unique	p_unique	n_missing	p_missing	n_dominant	p_dominant	p_dom_excl_na	dominant
1	SK_ID_CURR	int32	6.9 MB	305,811	17.8%	0	0%	116	<0.1%	<0.1%	120860
2	SK_ID_BUREAU	int32	6.9 MB	1,716,428	100.0%	0	0%	1	<0.1%	<0.1%	5714462
3	CREDIT_ACTIVE	category	1.7 MB	4	<0.1%	0	0%	1,079,273	62.9%	62.9%	Closed
4	CREDIT_CURRENCY	category	1.7 MB	4	<0.1%	0	0%	1,715,020	99.9%	99.9%	currency 1
5	DAYS_CREDIT	int16	3.4 MB	2,923	0.2%	0	0%	1,330	0.1%	0.1%	-364
6	CREDIT_DAY_OVERDUE	int16	3.4 MB	942	0.1%	0	0%	1,712,211	99.8%	99.8%	0
7	DAYS_CREDIT_ENDDATE	float32	6.9 MB	14,096	0.8%	105,553	6.1%	883	0.1%	0.1%	0.0
8	DAYS_ENDDATE_FACT	float32	6.9 MB	2,917	0.2%	633,653	36.9%	811	<0.1%	0.1%	-329.0
9	AMT_CREDIT_MAX_OVERDUE	float64	13.7 MB	68,251	4.0%	1,124,488	65.5%	470,650	27.4%	79.5%	0.0
10	CNT_CREDIT_PROLONG	int8	1.7 MB	10	<0.1%	0	0%	1,707,314	99.5%	99.5%	0
11	AMT_CREDIT_SUM	float64	13.7 MB	236,708	13.8%	13	<0.1%	66,582	3.9%	3.9%	0.0
12	AMT_CREDIT_SUM_DEBT	float64	13.7 MB	226,537	13.2%	257,669	15.0%	1,016,434	59.2%	69.7%	0.0
13	AMT_CREDIT_SUM_LIMIT	float64	13.7 MB	51,726	3.0%	591,780	34.5%	1,050,142	61.2%	93.4%	0.0
14	AMT_CREDIT_SUM_OVERDUE	float64	13.7 MB	1,616	0.1%	0	0%	1,712,270	99.8%	99.8%	0.0
15	CREDIT_TYPE	category	1.7 MB	15	<0.1%	0	0%	1,251,615	72.9%	72.9%	Consumer credit
16	DAYS_CREDIT_UPDATE	int32	6.9 MB	2,982	0.2%	0	0%	18,503	1.1%	1.1%	-7
17	AMT_ANNUITY	float64	13.7 MB	40,321	2.3%	1,226,791	71.5%	256,915	15.0%	52.5%	0.0

Code

if do_eda:
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")

        report_inspect_2 = sweetviz.analyze(
            [bureau, "bureau"],
            pairwise_analysis="off",
        )
        report_inspect_2.show_notebook()

3.3.3 Table `bureau_balance`

Code

bureau_balance.info(verbose=False)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27299925 entries, 0 to 27299924
Columns: 3 entries, SK_ID_BUREAU to STATUS
dtypes: category(1), int32(1), int8(1)
memory usage: 156.2 MB

Code

bureau_balance.head()

	SK_ID_BUREAU	MONTHS_BALANCE	STATUS
0	5715448	0	C
1	5715448	-1	C
2	5715448	-2	C
3	5715448	-3	C
4	5715448	-4	C

Code

an.col_info(bureau_balance, style=True)

	column	data_type	memory_size	n_unique	p_unique	p_missing	n_dominant	p_dominant	p_dom_excl_na	dominant
1	SK_ID_BUREAU	int32	109.2 MB	817,395	3.0%	0%	97	<0.1%	<0.1%	5645521
2	MONTHS_BALANCE	int8	27.3 MB	97	<0.1%	0%	622,601	2.3%	2.3%	-1
3	STATUS	category	27.3 MB	8	<0.1%	0%	13,646,993	50.0%	50.0%	C

Code

if do_eda:
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")

        report_inspect_3 = sweetviz.analyze(
            [bureau_balance, "bureau_balance"],
            pairwise_analysis="off",
        )
        report_inspect_3.show_notebook()

3.3.4 Table `previous_application`

Code

previous_application.info(verbose=False)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1670214 entries, 0 to 1670213
Columns: 37 entries, SK_ID_PREV to NFLAG_INSURED_ON_APPROVAL
dtypes: category(16), float32(10), float64(5), int16(1), int32(3), int8(2)
memory usage: 178.4 MB

Code

previous_application.head()

	SK_ID_PREV	SK_ID_CURR	NAME_CONTRACT_TYPE	AMT_ANNUITY	AMT_APPLICATION	AMT_CREDIT	AMT_DOWN_PAYMENT	AMT_GOODS_PRICE	WEEKDAY_APPR_PROCESS_START	HOUR_APPR_PROCESS_START	FLAG_LAST_APPL_PER_CONTRACT	NFLAG_LAST_APPL_IN_DAY	RATE_DOWN_PAYMENT	RATE_INTEREST_PRIMARY	RATE_INTEREST_PRIVILEGED	NAME_CASH_LOAN_PURPOSE	NAME_CONTRACT_STATUS	DAYS_DECISION	NAME_PAYMENT_TYPE	CODE_REJECT_REASON	NAME_TYPE_SUITE	NAME_CLIENT_TYPE	NAME_GOODS_CATEGORY	NAME_PORTFOLIO	NAME_PRODUCT_TYPE	CHANNEL_TYPE	SELLERPLACE_AREA	NAME_SELLER_INDUSTRY	CNT_PAYMENT	NAME_YIELD_GROUP	PRODUCT_COMBINATION	DAYS_FIRST_DRAWING	DAYS_FIRST_DUE	DAYS_LAST_DUE_1ST_VERSION	DAYS_LAST_DUE	DAYS_TERMINATION	NFLAG_INSURED_ON_APPROVAL
0	2030495	271877	Consumer loans	1730.43	17145.00	17145.00	0.00	17145.00	SATURDAY	15	Y	1	0.00	0.18	0.87	XAP	Approved	-73	Cash through the bank	XAP	NaN	Repeater	Mobile	POS	XNA	Country-wide	35	Connectivity	12.00	middle	POS mobile with interest	365243.00	-42.00	300.00	-42.00	-37.00	0.00
1	2802425	108129	Cash loans	25188.62	607500.00	679671.00	NaN	607500.00	THURSDAY	11	Y	1	NaN	NaN	NaN	XNA	Approved	-164	XNA	XAP	Unaccompanied	Repeater	XNA	Cash	x-sell	Contact center	-1	XNA	36.00	low_action	Cash X-Sell: low	365243.00	-134.00	916.00	365243.00	365243.00	1.00
2	2523466	122040	Cash loans	15060.74	112500.00	136444.50	NaN	112500.00	TUESDAY	11	Y	1	NaN	NaN	NaN	XNA	Approved	-301	Cash through the bank	XAP	Spouse, partner	Repeater	XNA	Cash	x-sell	Credit and cash offices	-1	XNA	12.00	high	Cash X-Sell: high	365243.00	-271.00	59.00	365243.00	365243.00	1.00
3	2819243	176158	Cash loans	47041.33	450000.00	470790.00	NaN	450000.00	MONDAY	7	Y	1	NaN	NaN	NaN	XNA	Approved	-512	Cash through the bank	XAP	NaN	Repeater	XNA	Cash	x-sell	Credit and cash offices	-1	XNA	12.00	middle	Cash X-Sell: middle	365243.00	-482.00	-152.00	-182.00	-177.00	1.00
4	1784265	202054	Cash loans	31924.40	337500.00	404055.00	NaN	337500.00	THURSDAY	9	Y	1	NaN	NaN	NaN	Repairs	Refused	-781	Cash through the bank	HC	NaN	Repeater	XNA	Cash	walk-in	Credit and cash offices	-1	XNA	24.00	high	Cash Street: high	NaN	NaN	NaN	NaN	NaN	NaN

Code

an.col_info(previous_application, style=True)

	column	data_type	memory_size	n_unique	p_unique	n_missing	p_missing	n_dominant	p_dominant	p_dom_excl_na	dominant
1	SK_ID_PREV	int32	6.7 MB	1,670,214	100.0%	0	0%	1	<0.1%	<0.1%	2030495
2	SK_ID_CURR	int32	6.7 MB	338,857	20.3%	0	0%	77	<0.1%	<0.1%	187868
3	NAME_CONTRACT_TYPE	category	1.7 MB	4	<0.1%	0	0%	747,553	44.8%	44.8%	Cash loans
4	AMT_ANNUITY	float64	13.4 MB	357,959	21.4%	372,235	22.3%	31,865	1.9%	2.5%	2250.0
5	AMT_APPLICATION	float64	13.4 MB	93,885	5.6%	0	0%	392,402	23.5%	23.5%	0.0
6	AMT_CREDIT	float64	13.4 MB	86,803	5.2%	1	<0.1%	336,768	20.2%	20.2%	0.0
7	AMT_DOWN_PAYMENT	float64	13.4 MB	29,278	1.8%	895,844	53.6%	369,854	22.1%	47.8%	0.0
8	AMT_GOODS_PRICE	float64	13.4 MB	93,885	5.6%	385,515	23.1%	47,831	2.9%	3.7%	45000.0
9	WEEKDAY_APPR_PROCESS_START	category	1.7 MB	7	<0.1%	0	0%	255,118	15.3%	15.3%	TUESDAY
10	HOUR_APPR_PROCESS_START	int8	1.7 MB	24	<0.1%	0	0%	192,728	11.5%	11.5%	11
11	FLAG_LAST_APPL_PER_CONTRACT	category	1.7 MB	2	<0.1%	0	0%	1,661,739	99.5%	99.5%	Y
12	NFLAG_LAST_APPL_IN_DAY	int8	1.7 MB	2	<0.1%	0	0%	1,664,314	99.6%	99.6%	1
13	RATE_DOWN_PAYMENT	float32	6.7 MB	191,301	11.5%	895,844	53.6%	369,854	22.1%	47.8%	0.0
14	RATE_INTEREST_PRIMARY	float32	6.7 MB	148	<0.1%	1,664,263	99.6%	1,218	0.1%	20.5%	0.18913634
15	RATE_INTEREST_PRIVILEGED	float32	6.7 MB	25	<0.1%	1,664,263	99.6%	1,717	0.1%	28.9%	0.83509517
16	NAME_CASH_LOAN_PURPOSE	category	1.7 MB	25	<0.1%	0	0%	922,661	55.2%	55.2%	XAP
17	NAME_CONTRACT_STATUS	category	1.7 MB	4	<0.1%	0	0%	1,036,781	62.1%	62.1%	Approved
18	DAYS_DECISION	int16	3.3 MB	2,922	0.2%	0	0%	2,444	0.1%	0.1%	-245
19	NAME_PAYMENT_TYPE	category	1.7 MB	4	<0.1%	0	0%	1,033,552	61.9%	61.9%	Cash through the bank
20	CODE_REJECT_REASON	category	1.7 MB	9	<0.1%	0	0%	1,353,093	81.0%	81.0%	XAP
21	NAME_TYPE_SUITE	category	1.7 MB	7	<0.1%	820,405	49.1%	508,970	30.5%	59.9%	Unaccompanied
22	NAME_CLIENT_TYPE	category	1.7 MB	4	<0.1%	0	0%	1,231,261	73.7%	73.7%	Repeater
23	NAME_GOODS_CATEGORY	category	1.7 MB	28	<0.1%	0	0%	950,809	56.9%	56.9%	XNA
24	NAME_PORTFOLIO	category	1.7 MB	5	<0.1%	0	0%	691,011	41.4%	41.4%	POS
25	NAME_PRODUCT_TYPE	category	1.7 MB	3	<0.1%	0	0%	1,063,666	63.7%	63.7%	XNA
26	CHANNEL_TYPE	category	1.7 MB	8	<0.1%	0	0%	719,968	43.1%	43.1%	Credit and cash offices
27	SELLERPLACE_AREA	int32	6.7 MB	2,097	0.1%	0	0%	762,675	45.7%	45.7%	-1
28	NAME_SELLER_INDUSTRY	category	1.7 MB	11	<0.1%	0	0%	855,720	51.2%	51.2%	XNA
29	CNT_PAYMENT	float32	6.7 MB	49	<0.1%	372,230	22.3%	323,049	19.3%	24.9%	12.0
30	NAME_YIELD_GROUP	category	1.7 MB	5	<0.1%	0	0%	517,215	31.0%	31.0%	XNA
31	PRODUCT_COMBINATION	category	1.7 MB	17	<0.1%	346	<0.1%	285,990	17.1%	17.1%	Cash
32	DAYS_FIRST_DRAWING	float32	6.7 MB	2,838	0.2%	673,065	40.3%	934,444	55.9%	93.7%	365243.0
33	DAYS_FIRST_DUE	float32	6.7 MB	2,892	0.2%	673,065	40.3%	40,645	2.4%	4.1%	365243.0
34	DAYS_LAST_DUE_1ST_VERSION	float32	6.7 MB	4,605	0.3%	673,065	40.3%	93,864	5.6%	9.4%	365243.0
35	DAYS_LAST_DUE	float32	6.7 MB	2,873	0.2%	673,065	40.3%	211,221	12.6%	21.2%	365243.0
36	DAYS_TERMINATION	float32	6.7 MB	2,830	0.2%	673,065	40.3%	225,913	13.5%	22.7%	365243.0
37	NFLAG_INSURED_ON_APPROVAL	float32	6.7 MB	2	<0.1%	673,065	40.3%	665,527	39.8%	66.7%	0.0

Code

if do_eda:
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")

        report_inspect_4 = sweetviz.analyze(
            [previous_application, "previous_application"],
            pairwise_analysis="off",
        )
        report_inspect_4.show_notebook()

3.3.5 Table `pos_cash_balance`

Code

pos_cash_balance.info(verbose=False)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10001358 entries, 0 to 10001357
Columns: 8 entries, SK_ID_PREV to SK_DPD_DEF
dtypes: category(1), float32(2), int16(2), int32(2), int8(1)
memory usage: 209.8 MB

Code

pos_cash_balance.head()

	SK_ID_PREV	SK_ID_CURR	MONTHS_BALANCE	CNT_INSTALMENT	CNT_INSTALMENT_FUTURE	NAME_CONTRACT_STATUS
0	1803195	182943	-31	48.00	45.00	Active
1	1715348	367990	-33	36.00	35.00	Active
2	1784872	397406	-32	12.00	9.00	Active
3	1903291	269225	-35	48.00	42.00	Active
4	2341044	334279	-35	36.00	35.00	Active

Code

an.col_info(pos_cash_balance, style=True)

	column	data_type	memory_size	n_unique	p_unique	n_missing	p_missing	n_dominant	p_dominant	p_dom_excl_na	dominant
1	SK_ID_PREV	int32	40.0 MB	936,325	9.4%	0	0%	96	<0.1%	<0.1%	1856103
2	SK_ID_CURR	int32	40.0 MB	337,252	3.4%	0	0%	295	<0.1%	<0.1%	265042
3	MONTHS_BALANCE	int8	10.0 MB	96	<0.1%	0	0%	216,441	2.2%	2.2%	-10
4	CNT_INSTALMENT	float32	40.0 MB	73	<0.1%	26,071	0.3%	2,496,845	25.0%	25.0%	12.0
5	CNT_INSTALMENT_FUTURE	float32	40.0 MB	79	<0.1%	26,087	0.3%	1,185,960	11.9%	11.9%	0.0
6	NAME_CONTRACT_STATUS	category	10.0 MB	9	<0.1%	0	0%	9,151,119	91.5%	91.5%	Active
7	SK_DPD	int16	20.0 MB	3,400	<0.1%	0	0%	9,706,131	97.0%	97.0%	0
8	SK_DPD_DEF	int16	20.0 MB	2,307	<0.1%	0	0%	9,887,389	98.9%	98.9%	0

Code

if do_eda:
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")

        report_inspect_5 = sweetviz.analyze(
            [pos_cash_balance, "pos_cash_balance"],
            pairwise_analysis="off",
        )
        report_inspect_5.show_notebook()

3.3.6 Table `credit_card_balance`

Code

credit_card_balance.info(verbose=False)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3840312 entries, 0 to 3840311
Columns: 23 entries, SK_ID_PREV to SK_DPD_DEF
dtypes: category(1), float32(4), float64(11), int16(3), int32(3), int8(1)
memory usage: 454.1 MB

Code

credit_card_balance.head()

	SK_ID_PREV	SK_ID_CURR	MONTHS_BALANCE	AMT_BALANCE	AMT_CREDIT_LIMIT_ACTUAL	AMT_DRAWINGS_ATM_CURRENT	AMT_DRAWINGS_CURRENT	AMT_DRAWINGS_POS_CURRENT	AMT_INST_MIN_REGULARITY	AMT_PAYMENT_CURRENT	AMT_PAYMENT_TOTAL_CURRENT	AMT_RECEIVABLE_PRINCIPAL	AMT_RECIVABLE	AMT_TOTAL_RECEIVABLE	CNT_DRAWINGS_ATM_CURRENT	CNT_DRAWINGS_CURRENT	CNT_DRAWINGS_POS_CURRENT	CNT_INSTALMENT_MATURE_CUM	NAME_CONTRACT_STATUS
0	2562384	378907	-6	56.97	135000	0.00	877.50	877.50	1700.33	1800.00	1800.00	0.00	0.00	0.00	0.00	1	1.00	35.00	Active
1	2582071	363914	-1	63975.56	45000	2250.00	2250.00	0.00	2250.00	2250.00	2250.00	60175.08	64875.56	64875.56	1.00	1	0.00	69.00	Active
2	1740877	371185	-7	31815.22	450000	0.00	0.00	0.00	2250.00	2250.00	2250.00	26926.42	31460.08	31460.08	0.00	0	0.00	30.00	Active
3	1389973	337855	-4	236572.11	225000	2250.00	2250.00	0.00	11795.76	11925.00	11925.00	224949.29	233048.97	233048.97	1.00	1	0.00	10.00	Active
4	1891521	126868	-1	453919.46	450000	0.00	11547.00	11547.00	22924.89	27000.00	27000.00	443044.40	453919.46	453919.46	0.00	1	1.00	101.00	Active

Code

an.col_info(credit_card_balance, style=True)

	column	data_type	memory_size	n_unique	p_unique	n_missing	p_missing	n_dominant	p_dominant	p_dom_excl_na	dominant
1	SK_ID_PREV	int32	15.4 MB	104,307	2.7%	0	0%	96	<0.1%	<0.1%	2377894
2	SK_ID_CURR	int32	15.4 MB	103,558	2.7%	0	0%	192	<0.1%	<0.1%	186401
3	MONTHS_BALANCE	int8	3.8 MB	96	<0.1%	0	0%	102,115	2.7%	2.7%	-4
4	AMT_BALANCE	float64	30.7 MB	1,347,904	35.1%	0	0%	2,156,420	56.2%	56.2%	0.0
5	AMT_CREDIT_LIMIT_ACTUAL	int32	15.4 MB	181	<0.1%	0	0%	753,823	19.6%	19.6%	0
6	AMT_DRAWINGS_ATM_CURRENT	float64	30.7 MB	2,267	0.1%	749,816	19.5%	2,665,718	69.4%	86.3%	0.0
7	AMT_DRAWINGS_CURRENT	float64	30.7 MB	187,005	4.9%	0	0%	3,223,443	83.9%	83.9%	0.0
8	AMT_DRAWINGS_OTHER_CURRENT	float64	30.7 MB	1,832	<0.1%	749,816	19.5%	3,078,163	80.2%	99.6%	0.0
9	AMT_DRAWINGS_POS_CURRENT	float64	30.7 MB	168,748	4.4%	749,816	19.5%	2,825,595	73.6%	91.4%	0.0
10	AMT_INST_MIN_REGULARITY	float64	30.7 MB	312,266	8.1%	305,236	7.9%	1,928,864	50.2%	54.6%	0.0
11	AMT_PAYMENT_CURRENT	float64	30.7 MB	163,209	4.2%	767,988	20.0%	390,507	10.2%	12.7%	0.0
12	AMT_PAYMENT_TOTAL_CURRENT	float64	30.7 MB	182,957	4.8%	0	0%	2,172,223	56.6%	56.6%	0.0
13	AMT_RECEIVABLE_PRINCIPAL	float64	30.7 MB	1,195,839	31.1%	0	0%	2,296,167	59.8%	59.8%	0.0
14	AMT_RECIVABLE	float64	30.7 MB	1,338,878	34.9%	0	0%	2,113,816	55.0%	55.0%	0.0
15	AMT_TOTAL_RECEIVABLE	float64	30.7 MB	1,339,008	34.9%	0	0%	2,113,643	55.0%	55.0%	0.0
16	CNT_DRAWINGS_ATM_CURRENT	float32	15.4 MB	44	<0.1%	749,816	19.5%	2,665,718	69.4%	86.3%	0.0
17	CNT_DRAWINGS_CURRENT	int16	7.7 MB	129	<0.1%	0	0%	3,229,952	84.1%	84.1%	0
18	CNT_DRAWINGS_OTHER_CURRENT	float32	15.4 MB	11	<0.1%	749,816	19.5%	3,077,688	80.1%	99.6%	0.0
19	CNT_DRAWINGS_POS_CURRENT	float32	15.4 MB	133	<0.1%	749,816	19.5%	2,825,594	73.6%	91.4%	0.0
20	CNT_INSTALMENT_MATURE_CUM	float32	15.4 MB	121	<0.1%	305,236	7.9%	551,467	14.4%	15.6%	0.0
21	NAME_CONTRACT_STATUS	category	3.8 MB	7	<0.1%	0	0%	3,698,436	96.3%	96.3%	Active
22	SK_DPD	int16	7.7 MB	917	<0.1%	0	0%	3,686,957	96.0%	96.0%	0
23	SK_DPD_DEF	int16	7.7 MB	378	<0.1%	0	0%	3,750,972	97.7%	97.7%	0

Code

if do_eda:
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")

        report_inspect_6 = sweetviz.analyze(
            [credit_card_balance, "credit_card_balance"],
            pairwise_analysis="off",
        )
        report_inspect_6.show_notebook()

3.3.7 Table `installments_payments`

Code

installments_payments.info(verbose=False)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13605401 entries, 0 to 13605400
Columns: 8 entries, SK_ID_PREV to AMT_PAYMENT
dtypes: float32(3), float64(2), int16(1), int32(2)
memory usage: 493.1 MB

Code

installments_payments.head()

	SK_ID_PREV	SK_ID_CURR	NUM_INSTALMENT_VERSION	NUM_INSTALMENT_NUMBER	DAYS_INSTALMENT	DAYS_ENTRY_PAYMENT	AMT_INSTALMENT	AMT_PAYMENT
0	1054186	161674	1.00	6	-1180.00	-1187.00	6948.36	6948.36
1	1330831	151639	0.00	34	-2156.00	-2156.00	1716.53	1716.53
2	2085231	193053	2.00	1	-63.00	-63.00	25425.00	25425.00
3	2452527	199697	1.00	3	-2418.00	-2426.00	24350.13	24350.13
4	2714724	167756	1.00	2	-1383.00	-1366.00	2165.04	2160.59

Code

an.col_info(installments_payments, style=True)

	column	data_type	memory_size	n_unique	p_unique	n_missing	p_missing	n_dominant	p_dominant	p_dom_excl_na	dominant
1	SK_ID_PREV	int32	54.4 MB	997,752	7.3%	0	0%	293	<0.1%	<0.1%	2360056
2	SK_ID_CURR	int32	54.4 MB	339,587	2.5%	0	0%	372	<0.1%	<0.1%	145728
3	NUM_INSTALMENT_VERSION	float32	54.4 MB	65	<0.1%	0	0%	8,485,004	62.4%	62.4%	1.0
4	NUM_INSTALMENT_NUMBER	int16	27.2 MB	277	<0.1%	0	0%	1,004,160	7.4%	7.4%	1
5	DAYS_INSTALMENT	float32	54.4 MB	2,922	<0.1%	0	0%	11,512	0.1%	0.1%	-120.0
6	DAYS_ENTRY_PAYMENT	float32	54.4 MB	3,039	<0.1%	2,905	<0.1%	13,103	0.1%	0.1%	-91.0
7	AMT_INSTALMENT	float64	108.8 MB	902,539	6.6%	0	0%	254,062	1.9%	1.9%	9000.0
8	AMT_PAYMENT	float64	108.8 MB	944,235	6.9%	2,905	<0.1%	248,757	1.8%	1.8%	9000.0

Code

if do_eda:
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")

        report_inspect_7 = sweetviz.analyze(
            [installments_payments, "installments_payments"],
            pairwise_analysis="off",
        )
        report_inspect_7.show_notebook()

3.3.8 Tables `application_test` and `sample_submission`

Table application_test contains the same variables as table application, except for the target variable TARGET, which is absent. Table sample_submission contains only a sample of the submission format and no real data. Both tables will be excluded from the analysis.
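
The claim that the two tables differ only by the TARGET column can be checked by comparing their column names. The sketch below uses short hypothetical column lists (`train_columns`, `test_columns`) in place of the real tables; with the actual data frames, `list(application.columns)` and `list(application_test.columns)` could be used instead.

```python
# Hypothetical, shortened column lists standing in for the real tables.
train_columns = ["SK_ID_CURR", "TARGET", "NAME_CONTRACT_TYPE", "CODE_GENDER"]
test_columns = ["SK_ID_CURR", "NAME_CONTRACT_TYPE", "CODE_GENDER"]

# Columns present in one table but not in the other (original order kept).
only_in_train = [col for col in train_columns if col not in set(test_columns)]
only_in_test = [col for col in test_columns if col not in set(train_columns)]

print("Only in train:", only_in_train)
print("Only in test:", only_in_test)
```

If the first list contains only TARGET and the second list is empty, the tables share all other variables.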

Code

application_test.info(verbose=False)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48744 entries, 0 to 48743
Columns: 121 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR
dtypes: category(16), float32(64), float64(1), int16(2), int32(2), int8(36)
memory usage: 15.3 MB

Code

application_test.head()

	SK_ID_CURR	NAME_CONTRACT_TYPE	CODE_GENDER	FLAG_OWN_CAR	FLAG_OWN_REALTY	CNT_CHILDREN	AMT_INCOME_TOTAL	AMT_CREDIT	AMT_ANNUITY	AMT_GOODS_PRICE	NAME_TYPE_SUITE	NAME_INCOME_TYPE	NAME_EDUCATION_TYPE	NAME_FAMILY_STATUS	NAME_HOUSING_TYPE	REGION_POPULATION_RELATIVE	DAYS_BIRTH	DAYS_EMPLOYED	DAYS_REGISTRATION	DAYS_ID_PUBLISH	OWN_CAR_AGE	FLAG_MOBIL	FLAG_EMP_PHONE	FLAG_WORK_PHONE	FLAG_CONT_MOBILE	FLAG_PHONE	FLAG_EMAIL	OCCUPATION_TYPE	CNT_FAM_MEMBERS	REGION_RATING_CLIENT	REGION_RATING_CLIENT_W_CITY	WEEKDAY_APPR_PROCESS_START	HOUR_APPR_PROCESS_START	REG_CITY_NOT_WORK_CITY	LIVE_CITY_NOT_WORK_CITY	ORGANIZATION_TYPE	EXT_SOURCE_1	EXT_SOURCE_2	EXT_SOURCE_3	APARTMENTS_AVG	BASEMENTAREA_AVG	YEARS_BEGINEXPLUATATION_AVG	YEARS_BUILD_AVG	COMMONAREA_AVG	ELEVATORS_AVG	ENTRANCES_AVG	FLOORSMAX_AVG	FLOORSMIN_AVG	LANDAREA_AVG	LIVINGAPARTMENTS_AVG	LIVINGAREA_AVG	NONLIVINGAPARTMENTS_AVG	NONLIVINGAREA_AVG	APARTMENTS_MODE	BASEMENTAREA_MODE	YEARS_BEGINEXPLUATATION_MODE	YEARS_BUILD_MODE	COMMONAREA_MODE	ELEVATORS_MODE	ENTRANCES_MODE	FLOORSMAX_MODE	FLOORSMIN_MODE	LANDAREA_MODE	LIVINGAPARTMENTS_MODE	LIVINGAREA_MODE	NONLIVINGAPARTMENTS_MODE	NONLIVINGAREA_MODE	APARTMENTS_MEDI	BASEMENTAREA_MEDI	YEARS_BEGINEXPLUATATION_MEDI	YEARS_BUILD_MEDI	COMMONAREA_MEDI	ELEVATORS_MEDI	ENTRANCES_MEDI	FLOORSMAX_MEDI	FLOORSMIN_MEDI	LANDAREA_MEDI	LIVINGAPARTMENTS_MEDI	LIVINGAREA_MEDI	NONLIVINGAPARTMENTS_MEDI	NONLIVINGAREA_MEDI	FONDKAPREMONT_MODE	HOUSETYPE_MODE	TOTALAREA_MODE	WALLSMATERIAL_MODE	EMERGENCYSTATE_MODE	DAYS_LAST_PHONE_CHANGE	FLAG_DOCUMENT_3	FLAG_DOCUMENT_8	AMT_REQ_CREDIT_BUREAU_HOUR	AMT_REQ_CREDIT_BUREAU_DAY	AMT_REQ_CREDIT_BUREAU_WEEK	AMT_REQ_CREDIT_BUREAU_MON	AMT_REQ_CREDIT_BUREAU_QRT	AMT_REQ_CREDIT_BUREAU_YEAR
0	100001	Cash loans	F	N	Y	0	135000.00	568800.00	20560.50	450000.00	Unaccompanied	Working	Higher education	Married	House / apartment	0.02	-19241	-2329	-5170.00	-812	NaN	1	1	0	1	0	1	NaN	2.00	2	2	TUESDAY	18	0	0	Kindergarten	0.75	0.79	0.16	0.07	0.06	0.97	NaN	NaN	NaN	0.14	0.12	NaN	NaN	NaN	0.05	NaN	NaN	0.07	0.06	0.97	NaN	NaN	NaN	0.14	0.12	NaN	NaN	NaN	0.05	NaN	NaN	0.07	0.06	0.97	NaN	NaN	NaN	0.14	0.12	NaN	NaN	NaN	0.05	NaN	NaN	NaN	block of flats	0.04	Stone, brick	No	-1740.00	1	0	0.00	0.00	0.00	0.00	0.00	0.00
1	100005	Cash loans	M	N	Y	0	99000.00	222768.00	17370.00	180000.00	Unaccompanied	Working	Secondary / secondary special	Married	House / apartment	0.04	-18064	-4469	-9118.00	-1623	NaN	1	1	0	1	0	0	Low-skill Laborers	2.00	2	2	FRIDAY	9	0	0	Self-employed	0.56	0.29	0.43	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0.00	1	0	0.00	0.00	0.00	0.00	0.00	3.00
2	100013	Cash loans	M	Y	Y	0	202500.00	663264.00	69777.00	630000.00	NaN	Working	Higher education	Married	House / apartment	0.02	-20038	-4458	-2175.00	-3503	5.00	1	1	0	1	0	0	Drivers	2.00	2	2	MONDAY	14	0	0	Transport: type 3	NaN	0.70	0.61	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	-856.00	0	1	0.00	0.00	0.00	0.00	1.00	4.00
3	100028	Cash loans	F	N	Y	2	315000.00	1575000.00	49018.50	1575000.00	Unaccompanied	Working	Secondary / secondary special	Married	House / apartment	0.03	-13976	-1866	-2000.00	-4208	NaN	1	1	0	1	1	0	Sales staff	4.00	2	2	WEDNESDAY	11	0	0	Business Entity Type 3	0.53	0.51	0.61	0.31	0.20	1.00	0.96	0.12	0.32	0.28	0.38	0.04	0.20	0.24	0.37	0.04	0.08	0.31	0.20	1.00	0.96	0.12	0.32	0.28	0.38	0.04	0.21	0.26	0.38	0.04	0.08	0.31	0.20	1.00	0.96	0.12	0.32	0.28	0.38	0.04	0.21	0.24	0.37	0.04	0.08	reg oper account	block of flats	0.37	Panel	No	-1805.00	1	0	0.00	0.00	0.00	0.00	0.00	3.00
4	100038	Cash loans	M	Y	N	1	180000.00	625500.00	32067.00	625500.00	Unaccompanied	Working	Secondary / secondary special	Married	House / apartment	0.01	-13040	-2191	-4000.00	-4262	16.00	1	1	1	1	0	0	NaN	3.00	2	2	FRIDAY	5	1	1	Business Entity Type 3	0.20	0.43	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	-821.00	1	0	NaN	NaN	NaN	NaN	NaN	NaN

Code

sample_submission.info(verbose=False)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48744 entries, 0 to 48743
Columns: 2 entries, SK_ID_CURR to TARGET
dtypes: float32(1), int32(1)
memory usage: 380.9 KB

Code

sample_submission.head()

	SK_ID_CURR	TARGET
0	100001	0.50
1	100005	0.50
2	100013	0.50
3	100028	0.50
4	100038	0.50

3.4 Split to Train, Validation, and Test Sets

To estimate how well the models generalize to unseen data, the data is split into training, validation, and test sets (70%:15%:15%).
Stratification by target is used to ensure that the proportions of target values are the same in all sets.

Code

application_train, application_validation = train_test_split(
    application, test_size=0.3, random_state=42, stratify=application.TARGET
)

application_validation, application_test = train_test_split(
    application_validation,
    test_size=0.5,
    random_state=42,
    stratify=application_validation.TARGET,
)

Code

X_train = application_train.drop(columns=["TARGET"])
y_train = application_train["TARGET"]

X_validation = application_validation.drop(columns=["TARGET"])
y_validation = application_validation["TARGET"]

X_test = application_test.drop(columns=["TARGET"])
y_test = application_test["TARGET"]

The sizes of the sets are (“k” stands for thousands):

Code

print(f"{application_train.shape[0]/1e3:.1f}k rows in training set.")
print(f"{application_validation.shape[0]/1e3: .1f}k rows in validation set.")
print(f"{application_test.shape[0]/1e3: .1f}k rows in test set.")

215.3k rows in training set.
 46.1k rows in validation set.
 46.1k rows in test set.
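
The 70%:15%:15% proportions follow from the two calls above: the first holds out 30% of the rows, and the second splits that hold-out in half. A minimal sketch on synthetic data (the `toy` frame and its 8% positive rate are assumptions made for illustration, not the project data) shows that stratification keeps the positive rate equal in all three parts:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the application table: 10,000 rows, 8% positives
toy = pd.DataFrame({"feature": np.arange(10_000), "TARGET": 0})
toy.loc[:799, "TARGET"] = 1

# Step 1: hold out 30% of the rows; step 2: split the hold-out in half
toy_train, toy_holdout = train_test_split(
    toy, test_size=0.3, random_state=42, stratify=toy["TARGET"]
)
toy_validation, toy_test = train_test_split(
    toy_holdout, test_size=0.5, random_state=42, stratify=toy_holdout["TARGET"]
)

for name, part in [
    ("train", toy_train),
    ("validation", toy_validation),
    ("test", toy_test),
]:
    print(f"{name:>10}: {len(part):>5} rows, positive rate {part['TARGET'].mean():.3f}")
```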

3.5 EDA on Train Set

A more detailed EDA is performed on the training set only. Note that the sweetviz report shows not only the distributions of the variables (as bars) but also the mean of the target variable in each category/interval (as dark blue dots connected by a line).

Code

application_train.head()

	SK_ID_CURR	NAME_CONTRACT_TYPE	CODE_GENDER	FLAG_OWN_CAR	FLAG_OWN_REALTY	CNT_CHILDREN	AMT_INCOME_TOTAL	AMT_CREDIT	AMT_ANNUITY	AMT_GOODS_PRICE	NAME_TYPE_SUITE	NAME_INCOME_TYPE	NAME_EDUCATION_TYPE	NAME_FAMILY_STATUS	NAME_HOUSING_TYPE	REGION_POPULATION_RELATIVE	DAYS_BIRTH	DAYS_EMPLOYED	DAYS_REGISTRATION	DAYS_ID_PUBLISH	OWN_CAR_AGE	FLAG_MOBIL	FLAG_EMP_PHONE	FLAG_CONT_MOBILE	FLAG_PHONE	FLAG_EMAIL	OCCUPATION_TYPE	CNT_FAM_MEMBERS	REGION_RATING_CLIENT	REGION_RATING_CLIENT_W_CITY	WEEKDAY_APPR_PROCESS_START	HOUR_APPR_PROCESS_START	ORGANIZATION_TYPE	EXT_SOURCE_1	EXT_SOURCE_2	EXT_SOURCE_3	APARTMENTS_AVG	BASEMENTAREA_AVG	YEARS_BEGINEXPLUATATION_AVG	YEARS_BUILD_AVG	COMMONAREA_AVG	ELEVATORS_AVG	ENTRANCES_AVG	FLOORSMAX_AVG	FLOORSMIN_AVG	LANDAREA_AVG	LIVINGAPARTMENTS_AVG	LIVINGAREA_AVG	NONLIVINGAPARTMENTS_AVG	NONLIVINGAREA_AVG	APARTMENTS_MODE	BASEMENTAREA_MODE	YEARS_BEGINEXPLUATATION_MODE	YEARS_BUILD_MODE	COMMONAREA_MODE	ELEVATORS_MODE	ENTRANCES_MODE	FLOORSMAX_MODE	FLOORSMIN_MODE	LANDAREA_MODE	LIVINGAPARTMENTS_MODE	LIVINGAREA_MODE	NONLIVINGAPARTMENTS_MODE	NONLIVINGAREA_MODE	APARTMENTS_MEDI	BASEMENTAREA_MEDI	YEARS_BEGINEXPLUATATION_MEDI	YEARS_BUILD_MEDI	COMMONAREA_MEDI	ELEVATORS_MEDI	ENTRANCES_MEDI	FLOORSMAX_MEDI	FLOORSMIN_MEDI	LANDAREA_MEDI	LIVINGAPARTMENTS_MEDI	LIVINGAREA_MEDI	NONLIVINGAPARTMENTS_MEDI	NONLIVINGAREA_MEDI	FONDKAPREMONT_MODE	HOUSETYPE_MODE	TOTALAREA_MODE	WALLSMATERIAL_MODE	EMERGENCYSTATE_MODE	OBS_30_CNT_SOCIAL_CIRCLE	DEF_30_CNT_SOCIAL_CIRCLE	OBS_60_CNT_SOCIAL_CIRCLE	DEF_60_CNT_SOCIAL_CIRCLE	DAYS_LAST_PHONE_CHANGE	FLAG_DOCUMENT_3	FLAG_DOCUMENT_8	FLAG_DOCUMENT_9	AMT_REQ_CREDIT_BUREAU_HOUR	AMT_REQ_CREDIT_BUREAU_DAY	AMT_REQ_CREDIT_BUREAU_WEEK	AMT_REQ_CREDIT_BUREAU_MON	AMT_REQ_CREDIT_BUREAU_QRT	AMT_REQ_CREDIT_BUREAU_YEAR
159703	285133	Cash loans	F	Y	Y	2	405000.00	1971072.00	68643.00	1800000.00	Unaccompanied	Commercial associate	Higher education	Married	House / apartment	0.01	-13587	-1028	-7460.00	-1823	13.00	1	1	1	0	0	Accountants	4.00	3	3	SATURDAY	11	Self-employed	0.68	0.33	0.64	0.12	0.10	0.98	0.78	NaN	0.00	0.24	0.17	0.21	0.00	0.10	0.12	NaN	0.03	0.12	0.10	0.98	0.79	NaN	0.00	0.24	0.17	0.21	0.00	0.11	0.13	NaN	0.03	0.12	0.10	0.98	0.79	NaN	0.00	0.24	0.17	0.21	0.00	0.10	0.13	NaN	0.03	reg oper account	block of flats	0.10	Stone, brick	No	4.00	0.00	4.00	0.00	-2169.00	1	0	0	0.00	0.00	0.00	0.00	0.00	0.00
79269	191894	Cash loans	M	N	Y	0	337500.00	508495.50	38146.50	454500.00	Family	State servant	Higher education	Married	House / apartment	0.01	-17543	-1208	-4054.00	-1090	NaN	1	1	1	0	0	Managers	2.00	2	2	WEDNESDAY	11	Agriculture	NaN	0.62	0.44	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2.00	1.00	2.00	1.00	-659.00	0	1	0	0.00	0.00	0.00	0.00	0.00	6.00
232615	369428	Cash loans	M	N	Y	1	112500.00	110146.50	13068.00	90000.00	Unaccompanied	Commercial associate	Secondary / secondary special	Married	House / apartment	0.01	-11557	-593	-5554.00	-4130	NaN	1	1	1	1	1	Laborers	3.00	2	2	FRIDAY	11	Business Entity Type 3	0.36	0.65	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0.00	0.00	0.00	0.00	-172.00	0	0	1	NaN	NaN	NaN	NaN	NaN	NaN
33420	138717	Cash loans	F	N	Y	2	40500.00	66384.00	3519.00	45000.00	Unaccompanied	Commercial associate	Secondary / secondary special	Married	House / apartment	0.03	-15750	-5376	-5285.00	-5290	NaN	1	1	1	0	0	Sales staff	4.00	2	2	TUESDAY	13	Self-employed	0.39	0.60	0.45	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0.00	0.00	0.00	0.00	-1576.00	1	0	0	0.00	0.00	0.00	1.00	0.00	2.00
88191	202381	Cash loans	M	Y	N	0	225000.00	298512.00	31801.50	270000.00	Unaccompanied	Commercial associate	Secondary / secondary special	Married	House / apartment	0.02	-19912	-1195	-86.00	-3033	11.00	1	1	1	0	0	Drivers	2.00	2	2	FRIDAY	16	Construction	0.74	0.66	0.72	0.30	0.14	1.00	0.99	0.10	0.40	0.17	0.46	0.00	0.00	0.24	0.25	0.00	0.00	0.30	0.14	1.00	0.99	0.10	0.40	0.17	0.46	0.00	0.00	0.26	0.26	0.00	0.00	0.30	0.14	1.00	0.99	0.10	0.40	0.17	0.46	0.00	0.00	0.25	0.25	0.00	0.00	reg oper account	block of flats	0.27	Stone, brick	No	3.00	0.00	3.00	0.00	-624.00	1	0	0	0.00	0.00	0.00	0.00	0.00	0.00

Code

an.col_info(application_train, style=True)

	column	data_type	memory_size	n_unique	p_unique	n_missing	p_missing	n_dominant	p_dominant	p_dom_excl_na	dominant
1	SK_ID_CURR	int32	861.0 kB	215,257	100.0%	0	0%	1	<0.1%	<0.1%	285133
2	TARGET	int8	215.3 kB	2	<0.1%	0	0%	197,880	91.9%	91.9%	0
3	NAME_CONTRACT_TYPE	category	215.5 kB	2	<0.1%	0	0%	194,675	90.4%	90.4%	Cash loans
4	CODE_GENDER	category	215.5 kB	3	<0.1%	0	0%	141,622	65.8%	65.8%	F
5	FLAG_OWN_CAR	category	215.5 kB	2	<0.1%	0	0%	142,086	66.0%	66.0%	N
6	FLAG_OWN_REALTY	category	215.5 kB	2	<0.1%	0	0%	149,412	69.4%	69.4%	Y
7	CNT_CHILDREN	int8	215.3 kB	12	<0.1%	0	0%	150,641	70.0%	70.0%	0
8	AMT_INCOME_TOTAL	float64	1.7 MB	1,949	0.9%	0	0%	24,982	11.6%	11.6%	135000.0
9	AMT_CREDIT	float32	861.0 kB	5,097	2.4%	0	0%	6,823	3.2%	3.2%	450000.0
10	AMT_ANNUITY	float32	861.0 kB	12,801	5.9%	8	<0.1%	4,499	2.1%	2.1%	9000.0
11	AMT_GOODS_PRICE	float32	861.0 kB	828	0.4%	187	0.1%	18,194	8.5%	8.5%	450000.0
12	NAME_TYPE_SUITE	category	216.0 kB	7	<0.1%	901	0.4%	174,089	80.9%	81.2%	Unaccompanied
13	NAME_INCOME_TYPE	category	216.1 kB	8	<0.1%	0	0%	110,984	51.6%	51.6%	Working
14	NAME_EDUCATION_TYPE	category	215.8 kB	5	<0.1%	0	0%	152,993	71.1%	71.1%	Secondary / secondary special
15	NAME_FAMILY_STATUS	category	215.8 kB	6	<0.1%	0	0%	137,457	63.9%	63.9%	Married
16	NAME_HOUSING_TYPE	category	215.9 kB	6	<0.1%	0	0%	191,159	88.8%	88.8%	House / apartment
17	REGION_POPULATION_RELATIVE	float32	861.0 kB	81	<0.1%	0	0%	11,494	5.3%	5.3%	0.035792
18	DAYS_BIRTH	int16	430.5 kB	17,377	8.1%	0	0%	32	<0.1%	<0.1%	-14890
19	DAYS_EMPLOYED	int32	861.0 kB	11,770	5.5%	0	0%	38,756	18.0%	18.0%	365243
20	DAYS_REGISTRATION	float32	861.0 kB	15,249	7.1%	0	0%	79	<0.1%	<0.1%	-7.0
21	DAYS_ID_PUBLISH	int16	430.5 kB	6,122	2.8%	0	0%	119	0.1%	0.1%	-4074
22	OWN_CAR_AGE	float32	861.0 kB	61	<0.1%	142,091	66.0%	5,232	2.4%	7.2%	7.0
23	FLAG_MOBIL	int8	215.3 kB	2	<0.1%	0	0%	215,256	>99.9%	>99.9%	1
24	FLAG_EMP_PHONE	int8	215.3 kB	2	<0.1%	0	0%	176,491	82.0%	82.0%	1
25	FLAG_WORK_PHONE	int8	215.3 kB	2	<0.1%	0	0%	172,406	80.1%	80.1%	0
26	FLAG_CONT_MOBILE	int8	215.3 kB	2	<0.1%	0	0%	214,855	99.8%	99.8%	1
27	FLAG_PHONE	int8	215.3 kB	2	<0.1%	0	0%	154,906	72.0%	72.0%	0
28	FLAG_EMAIL	int8	215.3 kB	2	<0.1%	0	0%	203,006	94.3%	94.3%	0
29	OCCUPATION_TYPE	category	217.1 kB	18	<0.1%	67,480	31.3%	38,591	17.9%	26.1%	Laborers
30	CNT_FAM_MEMBERS	float32	861.0 kB	12	<0.1%	1	<0.1%	110,671	51.4%	51.4%	2.0
31	REGION_RATING_CLIENT	int8	215.3 kB	3	<0.1%	0	0%	158,846	73.8%	73.8%	2
32	REGION_RATING_CLIENT_W_CITY	int8	215.3 kB	3	<0.1%	0	0%	160,564	74.6%	74.6%	2
33	WEEKDAY_APPR_PROCESS_START	category	216.0 kB	7	<0.1%	0	0%	37,826	17.6%	17.6%	TUESDAY
34	HOUR_APPR_PROCESS_START	int8	215.3 kB	24	<0.1%	0	0%	26,465	12.3%	12.3%	10
35	REG_REGION_NOT_LIVE_REGION	int8	215.3 kB	2	<0.1%	0	0%	211,999	98.5%	98.5%	0
36	REG_REGION_NOT_WORK_REGION	int8	215.3 kB	2	<0.1%	0	0%	204,222	94.9%	94.9%	0
37	LIVE_REGION_NOT_WORK_REGION	int8	215.3 kB	2	<0.1%	0	0%	206,386	95.9%	95.9%	0
38	REG_CITY_NOT_LIVE_CITY	int8	215.3 kB	2	<0.1%	0	0%	198,549	92.2%	92.2%	0
39	REG_CITY_NOT_WORK_CITY	int8	215.3 kB	2	<0.1%	0	0%	165,697	77.0%	77.0%	0
40	LIVE_CITY_NOT_WORK_CITY	int8	215.3 kB	2	<0.1%	0	0%	176,518	82.0%	82.0%	0
41	ORGANIZATION_TYPE	category	221.4 kB	58	<0.1%	0	0%	47,582	22.1%	22.1%	Business Entity Type 3
42	EXT_SOURCE_1	float32	861.0 kB	83,961	39.0%	121,373	56.4%	5	<0.1%	<0.1%	0.44398212
43	EXT_SOURCE_2	float32	861.0 kB	102,229	47.5%	464	0.2%	503	0.2%	0.2%	0.28589788
44	EXT_SOURCE_3	float32	861.0 kB	804	0.4%	42,680	19.8%	985	0.5%	0.6%	0.7463002
45	APARTMENTS_AVG	float32	861.0 kB	2,207	1.0%	109,076	50.7%	4,712	2.2%	4.4%	0.0825
46	BASEMENTAREA_AVG	float32	861.0 kB	3,626	1.7%	125,793	58.4%	10,282	4.8%	11.5%	0.0
47	YEARS_BEGINEXPLUATATION_AVG	float32	861.0 kB	260	0.1%	104,910	48.7%	3,073	1.4%	2.8%	0.9871
48	YEARS_BUILD_AVG	float32	861.0 kB	146	0.1%	143,036	66.4%	2,132	1.0%	3.0%	0.8232
49	COMMONAREA_AVG	float32	861.0 kB	2,964	1.4%	150,300	69.8%	5,899	2.7%	9.1%	0.0
50	ELEVATORS_AVG	float32	861.0 kB	241	0.1%	114,570	53.2%	60,109	27.9%	59.7%	0.0
51	ENTRANCES_AVG	float32	861.0 kB	266	0.1%	108,270	50.3%	23,867	11.1%	22.3%	0.1379
52	FLOORSMAX_AVG	float32	861.0 kB	371	0.2%	106,970	49.7%	43,449	20.2%	40.1%	0.1667
53	FLOORSMIN_AVG	float32	861.0 kB	280	0.1%	146,054	67.9%	23,117	10.7%	33.4%	0.2083
54	LANDAREA_AVG	float32	861.0 kB	3,360	1.6%	127,644	59.3%	10,845	5.0%	12.4%	0.0
55	LIVINGAPARTMENTS_AVG	float32	861.0 kB	1,761	0.8%	147,049	68.3%	2,984	1.4%	4.4%	0.0504
56	LIVINGAREA_AVG	float32	861.0 kB	4,983	2.3%	107,990	50.2%	202	0.1%	0.2%	0.0
57	NONLIVINGAPARTMENTS_AVG	float32	861.0 kB	345	0.2%	149,354	69.4%	38,319	17.8%	58.1%	0.0
58	NONLIVINGAREA_AVG	float32	861.0 kB	3,042	1.4%	118,577	55.1%	41,099	19.1%	42.5%	0.0
59	APARTMENTS_MODE	float32	861.0 kB	744	0.3%	109,076	50.7%	5,301	2.5%	5.0%	0.084
60	BASEMENTAREA_MODE	float32	861.0 kB	3,687	1.7%	125,793	58.4%	11,561	5.4%	12.9%	0.0
61	YEARS_BEGINEXPLUATATION_MODE	float32	861.0 kB	210	0.1%	104,910	48.7%	3,039	1.4%	2.8%	0.9871
62	YEARS_BUILD_MODE	float32	861.0 kB	152	0.1%	143,036	66.4%	2,090	1.0%	2.9%	0.8301
63	COMMONAREA_MODE	float32	861.0 kB	2,908	1.4%	150,300	69.8%	6,770	3.1%	10.4%	0.0
64	ELEVATORS_MODE	float32	861.0 kB	26	<0.1%	114,570	53.2%	62,808	29.2%	62.4%	0.0
65	ENTRANCES_MODE	float32	861.0 kB	30	<0.1%	108,270	50.3%	25,310	11.8%	23.7%	0.1379
66	FLOORSMAX_MODE	float32	861.0 kB	25	<0.1%	106,970	49.7%	46,048	21.4%	42.5%	0.1667
67	FLOORSMIN_MODE	float32	861.0 kB	25	<0.1%	146,054	67.9%	24,209	11.2%	35.0%	0.2083
68	LANDAREA_MODE	float32	861.0 kB	3,406	1.6%	127,644	59.3%	12,121	5.6%	13.8%	0.0
69	LIVINGAPARTMENTS_MODE	float32	861.0 kB	715	0.3%	147,049	68.3%	3,447	1.6%	5.1%	0.0551
70	LIVINGAREA_MODE	float32	861.0 kB	5,083	2.4%	107,990	50.2%	310	0.1%	0.3%	0.0
71	NONLIVINGAPARTMENTS_MODE	float32	861.0 kB	148	0.1%	149,354	69.4%	41,574	19.3%	63.1%	0.0
72	NONLIVINGAREA_MODE	float32	861.0 kB	3,090	1.4%	118,577	55.1%	46,933	21.8%	48.5%	0.0
73	APARTMENTS_MEDI	float32	861.0 kB	1,120	0.5%	109,076	50.7%	5,000	2.3%	4.7%	0.0833
74	BASEMENTAREA_MEDI	float32	861.0 kB	3,614	1.7%	125,793	58.4%	10,458	4.9%	11.7%	0.0
75	YEARS_BEGINEXPLUATATION_MEDI	float32	861.0 kB	232	0.1%	104,910	48.7%	3,060	1.4%	2.8%	0.9871
76	YEARS_BUILD_MEDI	float32	861.0 kB	148	0.1%	143,036	66.4%	2,118	1.0%	2.9%	0.8256
77	COMMONAREA_MEDI	float32	861.0 kB	2,982	1.4%	150,300	69.8%	6,068	2.8%	9.3%	0.0
78	ELEVATORS_MEDI	float32	861.0 kB	46	<0.1%	114,570	53.2%	61,040	28.4%	60.6%	0.0
79	ENTRANCES_MEDI	float32	861.0 kB	46	<0.1%	108,270	50.3%	24,940	11.6%	23.3%	0.1379
80	FLOORSMAX_MEDI	float32	861.0 kB	49	<0.1%	106,970	49.7%	44,659	20.7%	41.2%	0.1667
81	FLOORSMIN_MEDI	float32	861.0 kB	47	<0.1%	146,054	67.9%	23,733	11.0%	34.3%	0.2083
82	LANDAREA_MEDI	float32	861.0 kB	3,393	1.6%	127,644	59.3%	11,058	5.1%	12.6%	0.0
83	LIVINGAPARTMENTS_MEDI	float32	861.0 kB	1,063	0.5%	147,049	68.3%	3,142	1.5%	4.6%	0.0513
84	LIVINGAREA_MEDI	float32	861.0 kB	5,067	2.4%	107,990	50.2%	210	0.1%	0.2%	0.0
85	NONLIVINGAPARTMENTS_MEDI	float32	861.0 kB	190	0.1%	149,354	69.4%	39,384	18.3%	59.8%	0.0
86	NONLIVINGAREA_MEDI	float32	861.0 kB	3,083	1.4%	118,577	55.1%	42,610	19.8%	44.1%	0.0
87	FONDKAPREMONT_MODE	category	215.7 kB	4	<0.1%	147,099	68.3%	51,785	24.1%	76.0%	reg oper account
88	HOUSETYPE_MODE	category	215.6 kB	3	<0.1%	107,834	50.1%	105,515	49.0%	98.2%	block of flats
89	TOTALAREA_MODE	float32	861.0 kB	4,896	2.3%	103,833	48.2%	417	0.2%	0.4%	0.0
90	WALLSMATERIAL_MODE	category	216.0 kB	7	<0.1%	109,329	50.8%	46,298	21.5%	43.7%	Panel
91	EMERGENCYSTATE_MODE	category	215.5 kB	2	<0.1%	101,963	47.4%	111,665	51.9%	98.6%	No
92	OBS_30_CNT_SOCIAL_CIRCLE	float32	861.0 kB	32	<0.1%	714	0.3%	114,550	53.2%	53.4%	0.0
93	DEF_30_CNT_SOCIAL_CIRCLE	float32	861.0 kB	10	<0.1%	714	0.3%	189,988	88.3%	88.6%	0.0
94	OBS_60_CNT_SOCIAL_CIRCLE	float32	861.0 kB	32	<0.1%	714	0.3%	115,085	53.5%	53.6%	0.0
95	DEF_60_CNT_SOCIAL_CIRCLE	float32	861.0 kB	9	<0.1%	714	0.3%	196,614	91.3%	91.6%	0.0
96	DAYS_LAST_PHONE_CHANGE	float32	861.0 kB	3,720	1.7%	1	<0.1%	26,201	12.2%	12.2%	0.0
97	FLAG_DOCUMENT_2	int8	215.3 kB	2	<0.1%	0	0%	215,246	>99.9%	>99.9%	0
98	FLAG_DOCUMENT_3	int8	215.3 kB	2	<0.1%	0	0%	152,845	71.0%	71.0%	1
99	FLAG_DOCUMENT_4	int8	215.3 kB	2	<0.1%	0	0%	215,238	>99.9%	>99.9%	0
100	FLAG_DOCUMENT_5	int8	215.3 kB	2	<0.1%	0	0%	212,025	98.5%	98.5%	0
101	FLAG_DOCUMENT_6	int8	215.3 kB	2	<0.1%	0	0%	196,348	91.2%	91.2%	0
102	FLAG_DOCUMENT_7	int8	215.3 kB	2	<0.1%	0	0%	215,221	>99.9%	>99.9%	0
103	FLAG_DOCUMENT_8	int8	215.3 kB	2	<0.1%	0	0%	197,689	91.8%	91.8%	0
104	FLAG_DOCUMENT_9	int8	215.3 kB	2	<0.1%	0	0%	214,440	99.6%	99.6%	0
105	FLAG_DOCUMENT_10	int8	215.3 kB	2	<0.1%	0	0%	215,253	>99.9%	>99.9%	0
106	FLAG_DOCUMENT_11	int8	215.3 kB	2	<0.1%	0	0%	214,448	99.6%	99.6%	0
107	FLAG_DOCUMENT_12	int8	215.3 kB	2	<0.1%	0	0%	215,256	>99.9%	>99.9%	0
108	FLAG_DOCUMENT_13	int8	215.3 kB	2	<0.1%	0	0%	214,541	99.7%	99.7%	0
109	FLAG_DOCUMENT_14	int8	215.3 kB	2	<0.1%	0	0%	214,614	99.7%	99.7%	0
110	FLAG_DOCUMENT_15	int8	215.3 kB	2	<0.1%	0	0%	215,015	99.9%	99.9%	0
111	FLAG_DOCUMENT_16	int8	215.3 kB	2	<0.1%	0	0%	213,089	99.0%	99.0%	0
112	FLAG_DOCUMENT_17	int8	215.3 kB	2	<0.1%	0	0%	215,200	>99.9%	>99.9%	0
113	FLAG_DOCUMENT_18	int8	215.3 kB	2	<0.1%	0	0%	213,525	99.2%	99.2%	0
114	FLAG_DOCUMENT_19	int8	215.3 kB	2	<0.1%	0	0%	215,124	99.9%	99.9%	0
115	FLAG_DOCUMENT_20	int8	215.3 kB	2	<0.1%	0	0%	215,146	99.9%	99.9%	0
116	FLAG_DOCUMENT_21	int8	215.3 kB	2	<0.1%	0	0%	215,187	>99.9%	>99.9%	0
117	AMT_REQ_CREDIT_BUREAU_HOUR	float32	861.0 kB	5	<0.1%	29,081	13.5%	185,061	86.0%	99.4%	0.0
118	AMT_REQ_CREDIT_BUREAU_DAY	float32	861.0 kB	9	<0.1%	29,081	13.5%	185,147	86.0%	99.4%	0.0
119	AMT_REQ_CREDIT_BUREAU_WEEK	float32	861.0 kB	9	<0.1%	29,081	13.5%	180,246	83.7%	96.8%	0.0
120	AMT_REQ_CREDIT_BUREAU_MON	float32	861.0 kB	22	<0.1%	29,081	13.5%	155,679	72.3%	83.6%	0.0
121	AMT_REQ_CREDIT_BUREAU_QRT	float32	861.0 kB	10	<0.1%	29,081	13.5%	150,895	70.1%	81.0%	0.0
122	AMT_REQ_CREDIT_BUREAU_YEAR	float32	861.0 kB	24	<0.1%	29,081	13.5%	50,313	23.4%	27.0%	0.0

Code

if do_eda:
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")

        report_eda_1 = sweetviz.analyze(
            [application_train, "application (train)"],
            target_feat="TARGET",
            pairwise_analysis="off",
        )
        report_eda_1.show_notebook()

The distribution of the number of children (CNT_CHILDREN) is right-skewed, with a few outliers. In the sweetviz report, the trends in the distribution are not clear due to the extreme values. The frequency table below shows that about 70% of clients have no children and fewer than 100 clients have five or more.

Code

application_train.value_counts("CNT_CHILDREN").sort_index()

CNT_CHILDREN
0     150641
1      42945
2      18697
3       2598
4        280
5         67
6         17
7          6
8          2
9          1
10         2
19         1
Name: count, dtype: int64

Total income (AMT_INCOME_TOTAL) is also right-skewed, with a few outliers. Let’s categorize values, draw a frequency table and plot the distribution with the most extreme values removed to see the trends more clearly.

Code

above_1m_count = (
    application_train.assign(
        FLAG_INCOME_TOTAL_ABOVE_1M=lambda df: pd.cut(
            df["AMT_INCOME_TOTAL"],
            bins=[0, 5e5, 1e6, 1.5e6, 2e6, 1e7, np.inf],
            labels=["0-500k", "500k-1M", "1M-1.5M", "1.5M-2M", "2M-10M", "10M+"],
        )
    )
    .value_counts("FLAG_INCOME_TOTAL_ABOVE_1M")
    .sort_index()
)
above_1m_count

FLAG_INCOME_TOTAL_ABOVE_1M
0-500k     213341
500k-1M      1745
1M-1.5M       109
1.5M-2M        29
2M-10M         31
10M+            2
Name: count, dtype: int64
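
Hand-written labels can drift out of sync with the bin edges. One way to avoid this is to derive the labels from the edges. The sketch below is only an illustration: `make_bin_labels` is a hypothetical helper and the `income` values are made up. Note that `pd.cut` uses right-closed intervals by default, so a value equal to an edge falls into the lower bin and a value of exactly 0 would become missing.

```python
import numpy as np
import pandas as pd


def make_bin_labels(edges):
    """Build one label per interval from the bin edges (hypothetical helper)."""

    def fmt(value):
        if np.isinf(value):
            return "inf"
        if value >= 1e6:
            return f"{value / 1e6:g}M"
        if value >= 1e3:
            return f"{value / 1e3:g}k"
        return f"{value:g}"

    return [
        f"{fmt(lo)}+" if np.isinf(hi) else f"{fmt(lo)}-{fmt(hi)}"
        for lo, hi in zip(edges[:-1], edges[1:])
    ]


edges = [0, 5e5, 1e6, 1.5e6, 2e6, 1e7, np.inf]
labels = make_bin_labels(edges)
print(labels)

# Made-up incomes, including values on and just above a bin edge
income = pd.Series([135_000, 500_000, 500_001, 1_200_000, 117_000_000])
binned = pd.cut(income, bins=edges, labels=labels).tolist()
print(binned)
```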

Code

plt.figure(figsize=(10, 3))
plt.hist(application_train["AMT_INCOME_TOTAL"], bins=40, range=(0, 1.5e6), ec="black")
plt.xlabel("AMT_INCOME_TOTAL")
plt.ylabel("Frequency")
plt.title("Distribution of AMT_INCOME_TOTAL (up to 1.5M$)")
plt.show()

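Most values of DAYS_EMPLOYED are negative: they count days backwards from the application date, so a negative value means that the client is employed. The only positive value is 365243 (about 1,000 years), which is clearly a placeholder and will be treated as a missing value.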

Code

application_train.value_counts("DAYS_EMPLOYED").sort_index()

DAYS_EMPLOYED
-17583         1
-17546         1
-17522         1
-17139         1
-16849         1
           ...  
-3             2
-2             2
-1             1
 0             1
 365243    38756
Name: count, Length: 11770, dtype: int64

Code

application_train.eval("DAYS_EMPLOYED > 0").value_counts().sort_index()

DAYS_EMPLOYED
False    176501
True      38756
Name: count, dtype: int64
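
A minimal sketch of this treatment on made-up values (the series below is an assumption for illustration): the placeholder is replaced with NaN first, and only then is the duration converted to positive years, so that the placeholder cannot leak into the derived feature.

```python
import numpy as np
import pandas as pd

SENTINEL = 365243  # placeholder value observed in DAYS_EMPLOYED

# Made-up values: three employed clients and one placeholder
days_employed = pd.Series([-1028, -593, SENTINEL, -5376], name="DAYS_EMPLOYED")

# The placeholder corresponds to roughly 1,000 years, so it cannot be a real duration
print(f"{SENTINEL / 365.25:.1f} years")

# Replace the placeholder with NaN, then derive the duration in (positive) years
days_clean = days_employed.replace(SENTINEL, np.nan)
years_employed = days_clean / -365
print(years_employed.round(2).tolist())
```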

There are many types of organizations (58 categories). The value XNA is a placeholder and should be converted to an explicit missing value:

Code

application_train.value_counts("ORGANIZATION_TYPE").sort_index()

ORGANIZATION_TYPE
Advertising                 289
Agriculture                1730
Bank                       1735
Business Entity Type 1     4214
Business Entity Type 2     7374
Business Entity Type 3    47582
Cleaning                    195
Construction               4704
Culture                     269
Electricity                 674
Emergency                   395
Government                 7324
Hotel                       686
Housing                    2055
Industry: type 1            737
Industry: type 10            75
Industry: type 11          1888
Industry: type 12           258
Industry: type 13            46
Industry: type 2            326
Industry: type 3           2292
Industry: type 4            633
Industry: type 5            393
Industry: type 6             77
Industry: type 7            903
Industry: type 8             17
Industry: type 9           2396
Insurance                   415
Kindergarten               4891
Legal Services              218
Medicine                   7917
Military                   1857
Mobile                      211
Other                     11662
Police                     1608
Postal                     1520
Realtor                     279
Religion                     59
Restaurant                 1285
School                     6296
Security                   2302
Security Ministries        1403
Self-employed             26681
Services                   1089
Telecom                     396
Trade: type 1               237
Trade: type 2              1338
Trade: type 3              2425
Trade: type 4                45
Trade: type 5                34
Trade: type 6               425
Trade: type 7              5450
Transport: type 1           145
Transport: type 2          1529
Transport: type 3           851
Transport: type 4          3749
University                  917
XNA                       38756
Name: count, dtype: int64
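
For a categorical column, one possible way to make such a placeholder explicit is to remove the category: pandas then sets all values of that category to NaN. This is a sketch on made-up values, not necessarily the approach used later in this project.

```python
import pandas as pd

# Made-up values with the "XNA" placeholder
org_type = pd.Series(
    ["Self-employed", "XNA", "School", "XNA", "Bank"], dtype="category"
)

# Removing a category turns every value of that category into NaN
org_type_clean = org_type.cat.remove_categories(["XNA"])

n_missing = int(org_type_clean.isna().sum())
remaining_categories = list(org_type_clean.cat.categories)
print(n_missing)
print(remaining_categories)
```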

4 Modeling (w/o Historical Data)

In this section, a model based only on the data from the main table application is built. The historical credit data is not included here.

4.1 Create Pipelines

A few pre-processing steps are defined in this section. The pipelines are created using sklearn’s Pipeline class.

Some general steps:

Code

# Numeric variables except missing value indicators
select_numeric = make_column_selector(dtype_include="number")

# Categorical variables
select_categorical = make_column_selector(dtype_include="category")

# Create the pipelines
# Use median imputation for numeric variables
numeric_transformer = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="median", add_indicator=True))]
)

# Use one-hot encoding for categorical variables
# and clean column names (remove spaces, special characters, etc.)
categorical_transformer = Pipeline(
    steps=[
        ("onehot", OneHotEncoder(sparse_output=False, handle_unknown="ignore")),
        ("clean_names", CleanColumnNames()),
    ]
)

# Merge pipelines of numeric and categorical variables
pre_processing = ColumnTransformer(
    transformers=[
        ("numeric", numeric_transformer, select_numeric),
        ("categorical", categorical_transformer, select_categorical),
    ],
    verbose_feature_names_out=False,
    remainder="passthrough",
)
pre_processing

ColumnTransformer(remainder='passthrough',
                  transformers=[('numeric',
                                 Pipeline(steps=[('imputer',
                                                  SimpleImputer(add_indicator=True,
                                                                strategy='median'))]),
                                 <sklearn.compose._column_transformer.make_column_selector object at 0x000001E1D640F1D0>),
                                ('categorical',
                                 Pipeline(steps=[('onehot',
                                                  OneHotEncoder(handle_unknown='ignore',
                                                                sparse_output=False)),
                                                 ('clean_names',
                                                  CleanColumnNames())]),
                                 <sklearn.compose._column_transformer.make_column_selector object at 0x000001E1D640E5D0>)],
                  verbose_feature_names_out=False)


Some pre-processing steps specific to the main table application are implemented below. These are:

Data cleaning steps;
Feature engineering steps.

In feature engineering, variables such as the number of non-children in the family, income per family member and others are created.

Some variables that might be considered discriminatory by law (age, sex, and family status) are discarded from the analysis. Some others, whose use might also be considered unethical (e.g., the day of the week and the hour of the day when the application was started), are removed as well.

Code

class PreprocessorForApplications(BaseEstimator, TransformerMixin):
    """Transformer that cleans the `application` table and engineers features."""

    def __init__(self):
        pass

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        education_values = [
            "Lower secondary",
            "Secondary / secondary special",
            "Incomplete higher",
            "Higher education",
            "Academic degree",
        ]

        education_dtype = pd.CategoricalDtype(categories=education_values, ordered=True)

        X = X.assign(
            # Extract features
            FLAG_OWN_CAR=lambda df: (df["FLAG_OWN_CAR"] == "Y").astype("Int8"),
            FLAG_OWN_REALTY=lambda df: (df["FLAG_OWN_REALTY"] == "Y").astype("Int8"),
            FLAG_IS_EMERGENCY=lambda df: (df["EMERGENCYSTATE_MODE"] == "Yes").astype(
                "Int8"
            ),
            NAME_EDUCATION_TYPE=lambda df: df["NAME_EDUCATION_TYPE"].astype(
                education_dtype
            ),
            ord_education_type=lambda df: df["NAME_EDUCATION_TYPE"].cat.codes,
            flag_has_children=lambda df: (df["CNT_CHILDREN"] > 0).astype("Int8"),
            DAYS_EMPLOYED=lambda df: df["DAYS_EMPLOYED"].replace(365243, np.nan),
            years_employed=lambda df: df["DAYS_EMPLOYED"] / -365,
            amt_income_total_per_family_member=lambda df: df["AMT_INCOME_TOTAL"]
            / df["CNT_FAM_MEMBERS"],
            cnt_fam_members_excluding_children=lambda df: df["CNT_FAM_MEMBERS"]
            - df["CNT_CHILDREN"],
            amt_annuity_to_credit_ratio=lambda df: df["AMT_ANNUITY"] / df["AMT_CREDIT"],
            amt_annuity_to_income_ratio=lambda df: df["AMT_ANNUITY"]
            / df["AMT_INCOME_TOTAL"],
            amt_credit_to_income_ratio=lambda df: df["AMT_CREDIT"]
            / df["AMT_INCOME_TOTAL"],
            amt_annuity_to_income_per_family_member=lambda df: df["AMT_ANNUITY"]
            / df["amt_income_total_per_family_member"],
            # Make explicit the missing values: XNA → NaN
            ORGANIZATION_TYPE=lambda df: df["ORGANIZATION_TYPE"].replace("XNA", np.nan),
        )
        return X.drop(
            columns=[
                "SK_ID_CURR",
                # Restricted by legal constraints
                "CODE_GENDER",
                "NAME_FAMILY_STATUS",
                "DAYS_BIRTH",
                # Not useful / Unethical
                "WEEKDAY_APPR_PROCESS_START",
                "HOUR_APPR_PROCESS_START",
                # Almost constant
                "FLAG_MOBIL",
                # Already used/processed
                "EMERGENCYSTATE_MODE",
                "DAYS_EMPLOYED",
            ]
        )

    def get_feature_names_out(self):
        pass
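
The ordinal encoding of education used in the class above relies on an ordered categorical dtype. A minimal sketch on made-up values shows how `.cat.codes` maps the levels to integers:

```python
import pandas as pd

education_values = [
    "Lower secondary",
    "Secondary / secondary special",
    "Incomplete higher",
    "Higher education",
    "Academic degree",
]
education_dtype = pd.CategoricalDtype(categories=education_values, ordered=True)

# Made-up values, including one missing entry
education = pd.Series(
    ["Higher education", "Lower secondary", None, "Academic degree"]
).astype(education_dtype)

# Codes follow the category order; missing values get the code -1
codes = education.cat.codes.tolist()
print(codes)
```

Because -1 is an ordinary integer, a later imputation step will not treat it as missing. NAME_EDUCATION_TYPE has no missing values in the training set, so this should matter little here.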

4.2 Train Full Model

A Light Gradient Boosting Machine (LGBM) is used as the model here because it is fast, reasonably accurate, and robust to outliers, missing values, and some other data issues, so a few pre-processing steps can be skipped.

Code

lgbm_classifier = LGBMClassifier(
    random_state=1, class_weight="balanced", n_jobs=-1, device="gpu"
)

The model creation in this section will consist of the following steps:

application-specific pre-processing steps (see code in the previous section);
general pre-processing steps for each data type (see the pipeline in the previous section);
feature pre-selection removing duplicated and correlated columns;
training the model.
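
The idea behind the feature pre-selection step can be illustrated with plain pandas. The sketch below is a simplified stand-in, not the algorithm of the library classes used in the pipeline: `drop_correlated_keep_max_variance` is a hypothetical helper, the threshold of 0.8 is an assumed value, and the data are made up.

```python
import numpy as np
import pandas as pd


def drop_correlated_keep_max_variance(df, threshold=0.8):
    """Greedy filter (hypothetical helper): visit features from the highest to
    the lowest variance and keep a feature only if its absolute Pearson
    correlation with every already kept feature does not exceed the threshold.
    """
    corr = df.corr().abs()
    order = df.var().sort_values(ascending=False).index
    kept = []
    for col in order:
        if all(corr.loc[col, other] <= threshold for other in kept):
            kept.append(col)
    # Preserve the original column order in the result
    return [col for col in df.columns if col in kept]


x = np.arange(10.0)
toy = pd.DataFrame(
    {
        "x": x,
        "x_scaled": 3 * x,  # perfectly correlated with "x", larger variance
        "z": np.tile([1.0, -1.0], 5),  # only weakly correlated with "x"
        "x_copy": x,  # exact duplicate of "x"
    }
)

# Step 1: drop exact duplicate columns (the first occurrence is kept)
toy_unique = toy.loc[:, ~toy.T.duplicated()]

# Step 2: from each group of correlated columns, keep the highest-variance one
selected = drop_correlated_keep_max_variance(toy_unique)
print(list(toy_unique.columns))
print(selected)
```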

Code

if "models_default_prediction" not in locals():
    models_default_prediction = {}


@my.cache_results(dir_interim + "task-1-applications-only--model-01_lgbm.pickle")
def fit_lgbm_default():
    """Fit a LGBM model."""
    pipeline = Pipeline(
        steps=[
            ("preprocessor_1", PreprocessorForApplications()),
            ("preprocessor_2", clone(pre_processing)),
            ("drop_duplicate_features", DropDuplicateFeatures()),
            (
                "drop_corr_features",
                SmartCorrelatedSelection(selection_method="variance"),
            ),
            ("classifier", clone(lgbm_classifier)),
        ]
    )
    pipeline.fit(X_train, y_train)
    return pipeline


models_default_prediction["LGBM"] = fit_lgbm_default()
models_default_prediction["LGBM"]

[LightGBM] [Info] Number of positive: 17377, number of negative: 197880
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 6162
[LightGBM] [Info] Number of data points in the train set: 215257, number of used features: 185
[LightGBM] [Info] Using GPU Device: NVIDIA GeForce GTX 950M, Vendor: NVIDIA Corporation
[LightGBM] [Info] Compiling OpenCL Kernel with 256 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 8
[LightGBM] [Info] 35 dense feature groups (7.39 MB) transferred to GPU in 0.023343 secs. 1 sparse feature groups
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=-0.000000
[LightGBM] [Info] Start training from score -0.000000
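
The log line `pavg=0.500000` is what one would expect with `class_weight="balanced"`: after re-weighting, both classes carry the same total weight. The sketch below reproduces this with scikit-learn's `compute_class_weight`, used here as a stand-in under the assumption that LightGBM's "balanced" mode applies the same formula, `n_samples / (n_classes * np.bincount(y))`, as its documentation describes.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

n_negative, n_positive = 197_880, 17_377  # class counts from the training log
y = np.concatenate(
    [np.zeros(n_negative, dtype=int), np.ones(n_positive, dtype=int)]
)

# "balanced" weights: n_samples / (n_classes * number of samples in the class)
weight_negative, weight_positive = compute_class_weight(
    "balanced", classes=np.array([0, 1]), y=y
)

# Weighted share of the positive class after re-weighting
weighted_positive_share = (n_positive * weight_positive) / (
    n_positive * weight_positive + n_negative * weight_negative
)

print(f"weights: negative={weight_negative:.3f}, positive={weight_positive:.3f}")
print(f"weighted positive share: {weighted_positive_share:.6f}")
```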

Code

feature_names_all = (
    models_default_prediction["LGBM"].named_steps["classifier"].feature_name_
)
print(f"N features used by LGBM: {len(feature_names_all)}")

N features used by LGBM: 197

4.3 Evaluate Models

The model is evaluated on the validation set. For reference, the results are also calculated on the training set.

The main metric used to rank models here and in the other sections is the ROC AUC score. The values of other metrics are taken into account too.

print("--- Train ---")

ml.classification_scores(
    models_default_prediction,
    X_train,
    y_train,
    color="orange",
    sort_by="ROC_AUC",
)

--- Train ---

Table 4.1. Classification scores for the train set. The rows are sorted by ROC-AUC score. The best values in each column are highlighted.

	n	No_info_rate	Accuracy	BAcc	BAcc_01	F1	F1_neg	TPR	TNR	PPV	NPV	ROC_AUC
LGBM	215257	0.919	0.714	0.721	0.443	0.292	0.821	0.731	0.712	0.182	0.968	0.799

print("--- Validation ---")

ml.classification_scores(
    models_default_prediction,
    X_validation,
    y_validation,
    sort_by="ROC_AUC",
)

--- Validation ---

Table 4.2. Classification scores for the validation set. The rows are sorted by ROC-AUC score. The best values in each column are highlighted.

	n	No_info_rate	Accuracy	BAcc	BAcc_01	F1	F1_neg	TPR	TNR	PPV	NPV	ROC_AUC
LGBM	46127	0.919	0.705	0.688	0.375	0.268	0.815	0.667	0.708	0.167	0.960	0.759

Code

sns.set_style("white")
y_pred_validation_lgbm = models_default_prediction["LGBM"].predict(X_validation)
ml.plot_confusion_matrices(y_validation, y_pred_validation_lgbm, figsize=(13, 3));

4.4 Feature Importance

In this section, feature importance is calculated using both the internal LGBM feature importance and SHAP values. The results indicate that the 5 most important features identified by both methods are:

the ratio of annuity to credit amount (amt_annuity_to_credit_ratio)
EXT_SOURCE_3
EXT_SOURCE_2
EXT_SOURCE_1
length of employment (years_employed).

Note. Feature names in CAPITALS indicate original features from the application table; feature names in lowercase indicate features that were derived, extracted, or pre-processed.

Find the details below.

Code

@my.cache_results(dir_interim + "task-1-applications-only--shap_lgbm_k=all.pickle")
def get_shap_values_lgbm():
    model = "LGBM"
    preproc = Pipeline(steps=models_default_prediction[model].steps[:-1])
    classifier = models_default_prediction[model]["classifier"]
    X_validation_preproc = preproc.transform(X_validation)

    tree_explainer = shap.TreeExplainer(classifier)
    shap_values = tree_explainer.shap_values(X_validation_preproc)
    return shap_values, X_validation_preproc


shap_values_lgbm, data_for_lgbm = get_shap_values_lgbm()

Code

# Mean |SHAP| per feature: average over the per-class arrays, then over samples
vals = np.abs(shap_values_lgbm).mean(0).mean(0)
feature_importance = (
    pd.DataFrame(
        list(zip(data_for_lgbm.columns, vals)),
        columns=["col_name", "importance"],
    )
    .sort_values(by=["importance"], ascending=False)
    .reset_index()
)
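
The aggregation above can be illustrated on a tiny example. This is a sketch that assumes the SHAP output is a list with one array per class, each of shape (n_samples, n_features), which is consistent with the indexing `shap_values_lgbm[1]` used in the plots; other versions of the shap package may return a different layout. The feature names and values are invented.

```python
import numpy as np

feature_names = ["f_a", "f_b", "f_c"]  # hypothetical feature names

# One array per class, each of shape (n_samples, n_features).
# Here the class-1 contributions are assumed to mirror the class-0 ones.
shap_class_0 = np.array([[0.2, -0.1, 0.0], [-0.4, 0.3, 0.0]])
shap_class_1 = -shap_class_0
shap_values = [shap_class_0, shap_class_1]

# np.abs converts the list to an array of shape (n_classes, n_samples, n_features)
abs_values = np.abs(shap_values)

# First mean: over classes; second mean: over samples
importance = abs_values.mean(0).mean(0)

# Rank features from most to least important
order = np.argsort(-importance)
ranked = [feature_names[i] for i in order]
print(dict(zip(feature_names, importance.round(3))))
print(ranked)
```

A feature whose SHAP values are always 0 (here `f_c`) gets importance 0, so a threshold on this value removes features that the model effectively ignores.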

Code

sns.set_style("whitegrid")
lgb.plot_importance(
    models_default_prediction["LGBM"]["classifier"],
    max_num_features=50,
    figsize=(8, 10),
    height=0.8,
    title="LGBM Feature Importance",
);

Code

sns.set_style("whitegrid")
shap.summary_plot(
    shap_values_lgbm[1],
    data_for_lgbm,
    plot_type="bar",
    max_display=110,
    plot_size=(10, 15),
    show=False,
)
plt.tick_params(axis="y", labelsize=8)
plt.title("SHAP Feature Importance", fontsize=12)
plt.xlabel("Mean (|SHAP value|) \n(impact on model output)", fontsize=10);

Code

sns.set_style("whitegrid")
shap.summary_plot(
    shap_values_lgbm[1], data_for_lgbm, max_display=50, plot_size=(10, 9), show=False
)
plt.title("SHAP Feature Importance", fontsize=12)
plt.tick_params(axis="y", labelsize=8)
plt.tick_params(axis="x", labelsize=8)

Code of the table

feature_importance.query("importance > 0.00001").style.format(precision=5)

Table 4.3. SHAP feature importance for the validation set.

	index	col_name	importance
0	21	EXT_SOURCE_2	0.34360
1	22	EXT_SOURCE_3	0.33933
2	68	amt_annuity_to_credit_ratio	0.18039
3	20	EXT_SOURCE_1	0.15745
4	66	years_employed	0.11401
5	65	ord_education_type	0.09472
6	4	AMT_ANNUITY	0.06261
7	82	NAME_CONTRACT_TYPE_Cash_loans	0.05932
8	8	OWN_CAR_AGE	0.04880
9	3	AMT_CREDIT	0.04073
10	80	missingindicator_AMT_REQ_CREDIT_BUREAU_HOUR	0.04033
11	7	DAYS_ID_PUBLISH	0.03977
12	36	DEF_30_CNT_SOCIAL_CIRCLE	0.03616
13	37	DAYS_LAST_PHONE_CHANGE	0.02921
14	39	FLAG_DOCUMENT_3	0.02860
15	62	AMT_REQ_CREDIT_BUREAU_QRT	0.02840
16	15	REGION_RATING_CLIENT	0.02754
17	74	missingindicator_EXT_SOURCE_1	0.02751
18	67	cnt_fam_members_excluding_children	0.02707
19	0	FLAG_OWN_CAR	0.02629
20	97	NAME_INCOME_TYPE_Working	0.02496
21	6	DAYS_REGISTRATION	0.02361
22	115	OCCUPATION_TYPE_Laborers	0.02352
23	111	OCCUPATION_TYPE_Drivers	0.01835
24	69	amt_annuity_to_income_ratio	0.01825
25	63	AMT_REQ_CREDIT_BUREAU_YEAR	0.01549
26	18	REG_CITY_NOT_LIVE_CITY	0.01509
27	194	WALLSMATERIAL_MODE_Panel	0.01506
28	10	FLAG_WORK_PHONE	0.01406
29	110	OCCUPATION_TYPE_Core_staff	0.01404
30	27	YEARS_BEGINEXPLUATATION_MODE	0.01197
31	71	amt_annuity_to_income_per_family_member	0.01172
32	32	FLOORSMAX_MEDI	0.01132
33	168	ORGANIZATION_TYPE_Self_employed	0.01129
34	5	REGION_POPULATION_RELATIVE	0.00989
35	70	amt_credit_to_income_ratio	0.00968
36	107	OCCUPATION_TYPE_Accountants	0.00918
37	131	ORGANIZATION_TYPE_Business_Entity_Type_3	0.00904
38	2	AMT_INCOME_TOTAL	0.00888
39	133	ORGANIZATION_TYPE_Construction	0.00708
40	34	LANDAREA_MEDI	0.00694
41	9	FLAG_EMP_PHONE	0.00664
42	125	OCCUPATION_TYPE_nan	0.00643
43	12	FLAG_PHONE	0.00627
44	94	NAME_INCOME_TYPE_State_servant	0.00561
45	30	NONLIVINGAREA_MODE	0.00554
46	152	ORGANIZATION_TYPE_Industry_type_9	0.00502
47	28	ENTRANCES_MODE	0.00459
48	35	OBS_30_CNT_SOCIAL_CIRCLE	0.00429
49	157	ORGANIZATION_TYPE_Military	0.00425
50	14	CNT_FAM_MEMBERS	0.00404
51	31	COMMONAREA_MEDI	0.00387
52	165	ORGANIZATION_TYPE_School	0.00377
53	23	YEARS_BUILD_AVG	0.00369
54	172	ORGANIZATION_TYPE_Trade_type_2	0.00353
55	116	OCCUPATION_TYPE_Low_skill_Laborers	0.00291
56	26	BASEMENTAREA_MODE	0.00290
57	128	ORGANIZATION_TYPE_Bank	0.00287
58	29	LIVINGAPARTMENTS_MODE	0.00261
59	123	OCCUPATION_TYPE_Security_staff	0.00238
60	52	FLAG_DOCUMENT_16	0.00222
61	24	ELEVATORS_AVG	0.00204
62	84	NAME_TYPE_SUITE_Family	0.00178
63	180	ORGANIZATION_TYPE_Transport_type_3	0.00162
64	49	FLAG_DOCUMENT_13	0.00157
65	19	REG_CITY_NOT_WORK_CITY	0.00152
66	44	FLAG_DOCUMENT_8	0.00144
67	25	NONLIVINGAPARTMENTS_AVG	0.00140
68	33	FLOORSMIN_MEDI	0.00129
69	77	missingindicator_YEARS_BUILD_AVG	0.00128
70	104	NAME_HOUSING_TYPE_Office_apartment	0.00124
71	54	FLAG_DOCUMENT_18	0.00123
72	89	NAME_TYPE_SUITE_Unaccompanied	0.00115
73	61	AMT_REQ_CREDIT_BUREAU_MON	0.00107
74	169	ORGANIZATION_TYPE_Services	0.00098
75	154	ORGANIZATION_TYPE_Kindergarten	0.00087
76	1	FLAG_OWN_REALTY	0.00073
77	102	NAME_HOUSING_TYPE_House_apartment	0.00067
78	92	NAME_INCOME_TYPE_Commercial_associate	0.00063
79	160	ORGANIZATION_TYPE_Police	0.00059
80	105	NAME_HOUSING_TYPE_Rented_apartment	0.00056
81	167	ORGANIZATION_TYPE_Security_Ministries	0.00054
82	51	FLAG_DOCUMENT_15	0.00054
83	90	NAME_TYPE_SUITE_nan	0.00052
84	129	ORGANIZATION_TYPE_Business_Entity_Type_1	0.00047
85	155	ORGANIZATION_TYPE_Legal_Services	0.00047
86	162	ORGANIZATION_TYPE_Realtor	0.00045
87	60	AMT_REQ_CREDIT_BUREAU_WEEK	0.00043
88	186	FONDKAPREMONT_MODE_reg_oper_spec_account	0.00041
89	176	ORGANIZATION_TYPE_Trade_type_6	0.00039
90	181	ORGANIZATION_TYPE_Transport_type_4	0.00033
91	195	WALLSMATERIAL_MODE_Stone_brick	0.00030
92	58	AMT_REQ_CREDIT_BUREAU_HOUR	0.00028
93	177	ORGANIZATION_TYPE_Trade_type_7	0.00025
94	103	NAME_HOUSING_TYPE_Municipal_apartment	0.00024
95	16	REG_REGION_NOT_LIVE_REGION	0.00022
96	47	FLAG_DOCUMENT_11	0.00022
97	88	NAME_TYPE_SUITE_Spouse_partner	0.00021
98	13	FLAG_EMAIL	0.00020
99	122	OCCUPATION_TYPE_Secretaries	0.00019
100	83	NAME_TYPE_SUITE_Children	0.00018
101	159	ORGANIZATION_TYPE_Other	0.00018
102	185	FONDKAPREMONT_MODE_reg_oper_account	0.00018
103	147	ORGANIZATION_TYPE_Industry_type_4	0.00015
104	174	ORGANIZATION_TYPE_Trade_type_4	0.00015
105	156	ORGANIZATION_TYPE_Medicine	0.00015
106	108	OCCUPATION_TYPE_Cleaning_staff	0.00014
107	41	FLAG_DOCUMENT_5	0.00014
108	121	OCCUPATION_TYPE_Sales_staff	0.00012
109	191	WALLSMATERIAL_MODE_Mixed	0.00012
110	183	FONDKAPREMONT_MODE_not_specified	0.00011
111	118	OCCUPATION_TYPE_Medicine_staff	0.00010
112	76	missingindicator_EXT_SOURCE_3	0.00010
113	189	HOUSETYPE_MODE_nan	0.00009
114	124	OCCUPATION_TYPE_Waiters_barmen_staff	0.00009
115	101	NAME_HOUSING_TYPE_Co_op_apartment	0.00008
116	109	OCCUPATION_TYPE_Cooking_staff	0.00007
117	179	ORGANIZATION_TYPE_Transport_type_2	0.00007
118	190	WALLSMATERIAL_MODE_Block	0.00006
119	113	OCCUPATION_TYPE_High_skill_tech_staff	0.00006
120	119	OCCUPATION_TYPE_Private_service_staff	0.00005
121	17	REG_REGION_NOT_WORK_REGION	0.00004
122	117	OCCUPATION_TYPE_Managers	0.00004
123	170	ORGANIZATION_TYPE_Telecom	0.00004
124	127	ORGANIZATION_TYPE_Agriculture	0.00002

4.5 Training Models with Feature Selection

In this section, LGBM models are trained on the training set using subsets of features. These subsets are determined by SHAP values: 7 thresholds of SHAP values are used to select features.

This method is quicker than sequential feature selection (SFS) but might be less accurate, because not all combinations of features are tested.

Although the validation ROC AUC was highest for the full model (0.759), the model with 111 features was chosen for the next steps: its ROC AUC is lower by just 0.001, and most of its other metrics are better than those of the full model.

Find the details below.

Code

def fit_lgbm_on_features(features):
    """Template to fit a LGBM model with a smaller number of features."""
    pipeline = Pipeline(
        steps=[
            ("preprocessor_1", PreprocessorForApplications()),
            ("preprocessor_2", clone(pre_processing)),
            ("drop_duplicate_features", DropDuplicateFeatures()),
            (
                "drop_corr_features",
                SmartCorrelatedSelection(selection_method="variance"),
            ),
            ("selector", ColumnSelector(features)),
            ("classifier", clone(lgbm_classifier)),
        ]
    )
    pipeline.fit(X_train, y_train)
    return pipeline


def fit_lgbm_with_shap_threshold(threshold):
    """Function for feature selection based on SHAP values"""
    features = feature_importance.query(f"importance > {threshold}").col_name.to_list()

    k = len(features)

    return f"LGBM ({k} features)", fit_lgbm_on_features(features)
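
The threshold rule can be checked by hand on a small part of the data. The sketch below is a plain-Python equivalent of the `query` call, applied to the top 10 rows of Table 4.3 only; the helper name `select_features` is illustrative and not part of the project code.

```python
# Top 10 rows of Table 4.3 (SHAP feature importance on the validation set)
importance = {
    "EXT_SOURCE_2": 0.34360,
    "EXT_SOURCE_3": 0.33933,
    "amt_annuity_to_credit_ratio": 0.18039,
    "EXT_SOURCE_1": 0.15745,
    "years_employed": 0.11401,
    "ord_education_type": 0.09472,
    "AMT_ANNUITY": 0.06261,
    "NAME_CONTRACT_TYPE_Cash_loans": 0.05932,
    "OWN_CAR_AGE": 0.04880,
    "AMT_CREDIT": 0.04073,
}


def select_features(importance, threshold):
    """Return the names of features whose importance is strictly above the threshold."""
    return [name for name, value in importance.items() if value > threshold]


for threshold in (0.050, 0.100):
    selected = select_features(importance, threshold)
    print(f"threshold={threshold}: {len(selected)} features")
```

In Table 4.3, exactly 8 features have importance above 0.050, which corresponds to the "LGBM (8 features)" model in the score tables.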

Code

# Restore from file or calculate
file = dir_interim + "task-1-default--lgbm_models_as_dict.pkl"

if os.path.exists(file):
    with open(file, "rb") as f:
        models_default_prediction = joblib.load(f)
else:
    for threshold in [0.00004, 0.0001, 0.0002, 0.001, 0.003, 0.010, 0.050]:
        model_name, model = fit_lgbm_with_shap_threshold(threshold)
        models_default_prediction[model_name] = model

    # Change name of the full model
    models_default_prediction[
        "LGBM (FULL | 197 feat.)"
    ] = models_default_prediction.pop("LGBM")

    with open(file, "wb") as f:
        joblib.dump(models_default_prediction, f)

del file

# Time: 9m 10.8s

The full model was relabeled in the code above to make it easier to distinguish from the other models. The scores of all models:

Code

print("--- Train ---")
ml.classification_scores(
    models_default_prediction,
    X_train,
    y_train,
    sort_by="ROC_AUC",
    color="orange",
)

--- Train ---

	n	No_info_rate	Accuracy	BAcc	BAcc_01	F1	F1_neg	TPR	TNR	PPV	NPV	ROC_AUC
LGBM (123 features)	215257	0.919	0.714	0.722	0.443	0.292	0.821	0.731	0.712	0.182	0.968	0.800
LGBM (FULL | 197 feat.)	215257	0.919	0.714	0.721	0.443	0.292	0.821	0.731	0.712	0.182	0.968	0.799
LGBM (98 features)	215257	0.919	0.714	0.722	0.444	0.292	0.820	0.733	0.712	0.183	0.968	0.799
LGBM (74 features)	215257	0.919	0.713	0.721	0.443	0.292	0.820	0.731	0.712	0.182	0.968	0.799
LGBM (111 features)	215257	0.919	0.714	0.721	0.443	0.292	0.820	0.731	0.712	0.182	0.968	0.799
LGBM (55 features)	215257	0.919	0.713	0.721	0.441	0.291	0.820	0.730	0.711	0.182	0.968	0.799
LGBM (34 features)	215257	0.919	0.711	0.718	0.437	0.289	0.819	0.726	0.710	0.180	0.967	0.796
LGBM (8 features)	215257	0.919	0.705	0.708	0.416	0.280	0.814	0.711	0.704	0.174	0.965	0.784

Code

print("--- Validation ---")
ml.classification_scores(
    models_default_prediction,
    X_validation,
    y_validation,
    sort_by="ROC_AUC",
)

--- Validation ---

	n	No_info_rate	Accuracy	BAcc	BAcc_01	F1	F1_neg	TPR	TNR	PPV	NPV	ROC_AUC
LGBM (FULL | 197 feat.)	46127	0.919	0.705	0.688	0.375	0.268	0.815	0.667	0.708	0.167	0.960	0.759
LGBM (123 features)	46127	0.919	0.704	0.687	0.375	0.267	0.815	0.668	0.707	0.167	0.960	0.759
LGBM (111 features)	46127	0.919	0.705	0.691	0.381	0.269	0.816	0.673	0.708	0.168	0.961	0.758
LGBM (98 features)	46127	0.919	0.704	0.689	0.377	0.268	0.814	0.671	0.707	0.167	0.961	0.758
LGBM (74 features)	46127	0.919	0.704	0.689	0.377	0.268	0.815	0.670	0.707	0.167	0.961	0.758
LGBM (34 features)	46127	0.919	0.703	0.688	0.375	0.267	0.814	0.669	0.706	0.167	0.960	0.758
LGBM (55 features)	46127	0.919	0.703	0.688	0.375	0.267	0.814	0.669	0.706	0.167	0.961	0.758
LGBM (8 features)	46127	0.919	0.698	0.688	0.376	0.265	0.810	0.675	0.700	0.165	0.961	0.754

4.6 Hyperparameter Tuning

The model with 111 features is tuned using the Optuna package (Bayesian optimization).

Code

# Features to use: 111 features
feature_names_to_tune_111 = (
    models_default_prediction["LGBM (111 features)"]
    .named_steps["classifier"]
    .feature_name_
)

# Use 3-fold stratified CV
stratified_kfold = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)
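
A minimal sketch of what a 3-fold stratified split does and how `cross_val_score` picks its metric. The synthetic data and `LogisticRegression` are stand-ins chosen for speed (assumptions), not the project's pipeline. By default, `cross_val_score` uses the estimator's own `score` method, which is accuracy for classifiers; passing scoring="roc_auc" evaluates ROC AUC instead.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced data (about 10% positives); a stand-in for X_train, y_train
X, y = make_classification(
    n_samples=600, n_features=8, weights=[0.9, 0.1], random_state=1
)

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)

# Stratification keeps the number of positives nearly equal across folds
pos_counts = [int(y[test_idx].sum()) for _, test_idx in cv.split(X, y)]
print("Positives per fold:", pos_counts)

model = LogisticRegression(max_iter=1000, class_weight="balanced")

# Default scoring: the classifier's `score` method, i.e. accuracy
acc = cross_val_score(model, X, y, cv=cv)

# Explicit scoring: ROC AUC
auc = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")

print("Mean accuracy:", acc.mean().round(3))
print("Mean ROC AUC: ", auc.mean().round(3))
```

If models are ranked by ROC AUC, passing the same metric to the tuning objective keeps the optimization target and the evaluation metric aligned.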

Code

# Define objective function for Optuna
def objective_1(trial):
    "Objective function for hyperparameter tuning"
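    # LGBM params
    # NOTE: some keys below are LightGBM aliases of the same parameter
    # (lambda_l1/reg_alpha, lambda_l2/reg_lambda, feature_fraction/colsample_bytree,
    # bagging_fraction/subsample, min_data_in_leaf/min_child_samples),
    # so only one value from each pair takes effect.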
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 50, 1000, step=50),
        "max_depth": trial.suggest_int("max_depth", 1, 12),
        "boosting_type": trial.suggest_categorical("boosting_type", ["gbdt"]),
        # Tree Structure and Complexity
        "num_leaves": trial.suggest_int("num_leaves", 2, 256),
        "min_child_samples": trial.suggest_int("min_child_samples", 5, 100),
        # Regularization
        "lambda_l1": trial.suggest_float("lambda_l1", 1e-8, 10.0, log=True),
        "lambda_l2": trial.suggest_float("lambda_l2", 1e-8, 10.0, log=True),
        "reg_alpha": trial.suggest_float("reg_alpha", 0.0, 1.0),
        "reg_lambda": trial.suggest_float("reg_lambda", 0.0, 1.0),
        # Learning Rate and Feature Selection
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        "feature_fraction": trial.suggest_float("feature_fraction", 0.4, 1.0),
        "subsample": trial.suggest_float("subsample", 0.05, 1.0),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.05, 1.0),
        # Other Parameters
        "bagging_fraction": trial.suggest_float("bagging_fraction", 0.4, 1.0),
        "bagging_freq": trial.suggest_int("bagging_freq", 1, 7),
        "min_child_weight": trial.suggest_float(
            "min_child_weight", 1e-3, 1e3, log=True
        ),
        "min_split_gain": trial.suggest_float("min_split_gain", 0.0, 1.0),
        "min_data_in_leaf": trial.suggest_int("min_data_in_leaf", 1, 50),
        "max_delta_step": trial.suggest_int("max_delta_step", 0, 10),
    }

    model = LGBMClassifier(
        objective="binary",
        metric="auc",
        random_state=1,
        class_weight="balanced",
        n_jobs=-1,
        device="gpu",
        **params,
    )

    pipeline_to_tune = Pipeline(
        steps=[
            ("preprocessor_1", PreprocessorForApplications()),
            ("preprocessor_2", clone(pre_processing)),
            ("selector", ColumnSelector(feature_names_to_tune_111)),
            ("classifier", model),
        ]
    )

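    # NOTE: no `scoring` argument is passed, so cross_val_score uses the
    # classifier's default scorer (accuracy), not ROC AUC.
    scores = cross_val_score(
        pipeline_to_tune, X_train, y_train, n_jobs=-1, cv=stratified_kfold
    )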

    return scores.mean()


study_name_1 = "tune--without-credit-history"
storage_name_1 = f"sqlite:///{dir_interim}/optuna--{study_name_1}.db"

study_1 = optuna.create_study(
    study_name=study_name_1,
    storage=storage_name_1,
    load_if_exists=True,
    direction="maximize",
)
study_1.optimize(objective_1, n_trials=100, timeout=3600)
# Time: 61m 42.5s

[I 2023-12-27 23:19:35,808] A new study created in RDB with name: tune--without-credit-history
[I 2023-12-27 23:21:36,524] Trial 0 finished with value: 0.8748890771937772 and parameters: {'n_estimators': 700, 'max_depth': 8, 'boosting_type': 'gbdt', 'num_leaves': 70, 'min_child_samples': 35, 'lambda_l1': 0.025550868350601743, 'lambda_l2': 6.238522519933288e-06, 'reg_alpha': 0.8322445184530112, 'reg_lambda': 0.4781666892844665, 'learning_rate': 0.22386006944324777, 'feature_fraction': 0.605273438694538, 'subsample': 0.2815517349153884, 'colsample_bytree': 0.5105571507272018, 'bagging_fraction': 0.6965379020611673, 'bagging_freq': 5, 'min_child_weight': 0.0659826384593787, 'min_split_gain': 0.04703076905732695, 'min_data_in_leaf': 24, 'max_delta_step': 3}. Best is trial 0 with value: 0.8748890771937772.
[I 2023-12-27 23:23:31,224] Trial 1 finished with value: 0.8719995072472626 and parameters: {'n_estimators': 450, 'max_depth': 11, 'boosting_type': 'gbdt', 'num_leaves': 213, 'min_child_samples': 85, 'lambda_l1': 1.661141373652044e-05, 'lambda_l2': 0.004818453214986181, 'reg_alpha': 0.4467088053506053, 'reg_lambda': 0.8312315360481795, 'learning_rate': 0.14446555019790158, 'feature_fraction': 0.9148147970988095, 'subsample': 0.06025781999414132, 'colsample_bytree': 0.12426740411827782, 'bagging_fraction': 0.819760663190308, 'bagging_freq': 5, 'min_child_weight': 0.13983078649703026, 'min_split_gain': 0.9980604904877477, 'min_data_in_leaf': 39, 'max_delta_step': 8}. Best is trial 0 with value: 0.8748890771937772.
[I 2023-12-27 23:25:13,496] Trial 2 finished with value: 0.8716975552657328 and parameters: {'n_estimators': 1000, 'max_depth': 10, 'boosting_type': 'gbdt', 'num_leaves': 217, 'min_child_samples': 81, 'lambda_l1': 0.009261946937607285, 'lambda_l2': 0.06187411358260759, 'reg_alpha': 0.8360855695430213, 'reg_lambda': 0.924252800361061, 'learning_rate': 0.2656294979420381, 'feature_fraction': 0.465108015073728, 'subsample': 0.45509562950354704, 'colsample_bytree': 0.8388483403850614, 'bagging_fraction': 0.7223922062864684, 'bagging_freq': 6, 'min_child_weight': 4.831699218012537, 'min_split_gain': 0.962711189116782, 'min_data_in_leaf': 35, 'max_delta_step': 2}. Best is trial 0 with value: 0.8748890771937772.
[I 2023-12-27 23:25:33,913] Trial 3 finished with value: 0.6886373087416217 and parameters: {'n_estimators': 300, 'max_depth': 3, 'boosting_type': 'gbdt', 'num_leaves': 248, 'min_child_samples': 33, 'lambda_l1': 0.004828148870358621, 'lambda_l2': 3.7703945012157024e-08, 'reg_alpha': 0.59959488943298, 'reg_lambda': 0.1437503088550186, 'learning_rate': 0.0378365539137563, 'feature_fraction': 0.8254606113930397, 'subsample': 0.8122840147057663, 'colsample_bytree': 0.9718079730031982, 'bagging_fraction': 0.5958803375842794, 'bagging_freq': 7, 'min_child_weight': 27.943197609701645, 'min_split_gain': 0.22052965204241526, 'min_data_in_leaf': 39, 'max_delta_step': 9}. Best is trial 0 with value: 0.8748890771937772.
[I 2023-12-27 23:27:54,032] Trial 4 finished with value: 0.8427321660875374 and parameters: {'n_estimators': 950, 'max_depth': 7, 'boosting_type': 'gbdt', 'num_leaves': 243, 'min_child_samples': 13, 'lambda_l1': 3.167458423762587e-07, 'lambda_l2': 2.0396359964114315e-05, 'reg_alpha': 0.9294582973364317, 'reg_lambda': 0.5563919876289434, 'learning_rate': 0.07876875308940313, 'feature_fraction': 0.7357755986308083, 'subsample': 0.21739585888299084, 'colsample_bytree': 0.7852352864892946, 'bagging_fraction': 0.7029190171696713, 'bagging_freq': 2, 'min_child_weight': 1.5343452877981487, 'min_split_gain': 0.9847878845340966, 'min_data_in_leaf': 4, 'max_delta_step': 6}. Best is trial 0 with value: 0.8748890771937772.
[I 2023-12-27 23:28:43,962] Trial 5 finished with value: 0.7071918616650716 and parameters: {'n_estimators': 850, 'max_depth': 4, 'boosting_type': 'gbdt', 'num_leaves': 144, 'min_child_samples': 26, 'lambda_l1': 0.01415493365081948, 'lambda_l2': 0.006424681506007423, 'reg_alpha': 0.6180900050335616, 'reg_lambda': 0.29405590359524547, 'learning_rate': 0.02219633676489674, 'feature_fraction': 0.9691971625917661, 'subsample': 0.11557394834032685, 'colsample_bytree': 0.062123273241513156, 'bagging_fraction': 0.6974205405994247, 'bagging_freq': 4, 'min_child_weight': 0.003574530682905926, 'min_split_gain': 0.9711972007891173, 'min_data_in_leaf': 49, 'max_delta_step': 4}. Best is trial 0 with value: 0.8748890771937772.
[I 2023-12-27 23:29:58,291] Trial 6 finished with value: 0.7174493746075338 and parameters: {'n_estimators': 1000, 'max_depth': 5, 'boosting_type': 'gbdt', 'num_leaves': 25, 'min_child_samples': 9, 'lambda_l1': 1.1640499253732652, 'lambda_l2': 1.4978615689896486, 'reg_alpha': 0.6703920743197584, 'reg_lambda': 0.6951344076255069, 'learning_rate': 0.03934371100464988, 'feature_fraction': 0.5426727265714434, 'subsample': 0.9003412951794741, 'colsample_bytree': 0.9572435750544385, 'bagging_fraction': 0.7122492104249079, 'bagging_freq': 2, 'min_child_weight': 383.3789145469394, 'min_split_gain': 0.13037245311437773, 'min_data_in_leaf': 15, 'max_delta_step': 0}. Best is trial 0 with value: 0.8748890771937772.
[I 2023-12-27 23:30:36,511] Trial 7 finished with value: 0.6979471073413656 and parameters: {'n_estimators': 750, 'max_depth': 2, 'boosting_type': 'gbdt', 'num_leaves': 25, 'min_child_samples': 45, 'lambda_l1': 0.00036467107284622217, 'lambda_l2': 0.9664719945485747, 'reg_alpha': 0.0037572914004266877, 'reg_lambda': 0.35831306235287763, 'learning_rate': 0.06805395733114952, 'feature_fraction': 0.5522713863361657, 'subsample': 0.28819032440828074, 'colsample_bytree': 0.5917850497085058, 'bagging_fraction': 0.9832808929105473, 'bagging_freq': 4, 'min_child_weight': 0.059396449073026505, 'min_split_gain': 0.8609953778780896, 'min_data_in_leaf': 2, 'max_delta_step': 4}. Best is trial 0 with value: 0.8748890771937772.
[I 2023-12-27 23:32:39,473] Trial 8 finished with value: 0.7632829528249984 and parameters: {'n_estimators': 700, 'max_depth': 9, 'boosting_type': 'gbdt', 'num_leaves': 183, 'min_child_samples': 86, 'lambda_l1': 0.43707893573098267, 'lambda_l2': 1.1063823368943688e-07, 'reg_alpha': 0.37683976035684597, 'reg_lambda': 0.7766568902763252, 'learning_rate': 0.011339287234428277, 'feature_fraction': 0.5522059526815742, 'subsample': 0.7322810968584744, 'colsample_bytree': 0.17684575464900754, 'bagging_fraction': 0.6152408818469279, 'bagging_freq': 1, 'min_child_weight': 0.07350345998766064, 'min_split_gain': 0.6272191527067923, 'min_data_in_leaf': 46, 'max_delta_step': 9}. Best is trial 0 with value: 0.8748890771937772.
[I 2023-12-27 23:32:58,013] Trial 9 finished with value: 0.6852599447300033 and parameters: {'n_estimators': 450, 'max_depth': 2, 'boosting_type': 'gbdt', 'num_leaves': 16, 'min_child_samples': 13, 'lambda_l1': 0.0070758192800633914, 'lambda_l2': 0.13717467880445933, 'reg_alpha': 0.7362306944558158, 'reg_lambda': 0.3272592750320643, 'learning_rate': 0.029383915056277837, 'feature_fraction': 0.8289326333877685, 'subsample': 0.9345982070505822, 'colsample_bytree': 0.2984780648394578, 'bagging_fraction': 0.925810197700263, 'bagging_freq': 1, 'min_child_weight': 49.251384853626384, 'min_split_gain': 0.008451976684397788, 'min_data_in_leaf': 2, 'max_delta_step': 6}. Best is trial 0 with value: 0.8748890771937772.
[I 2023-12-27 23:33:28,532] Trial 10 finished with value: 0.7684488721125219 and parameters: {'n_estimators': 150, 'max_depth': 7, 'boosting_type': 'gbdt', 'num_leaves': 87, 'min_child_samples': 65, 'lambda_l1': 4.891527454057723, 'lambda_l2': 1.7766807872936614e-05, 'reg_alpha': 0.9978156714517683, 'reg_lambda': 0.0016067837261933837, 'learning_rate': 0.2611021234227985, 'feature_fraction': 0.6516022331735631, 'subsample': 0.4763179523459447, 'colsample_bytree': 0.4636511039806581, 'bagging_fraction': 0.46156670236544706, 'bagging_freq': 5, 'min_child_weight': 0.0014304249163018315, 'min_split_gain': 0.3246805866112631, 'min_data_in_leaf': 23, 'max_delta_step': 0}. Best is trial 0 with value: 0.8748890771937772.
[I 2023-12-27 23:35:15,562] Trial 11 finished with value: 0.8782710858654724 and parameters: {'n_estimators': 500, 'max_depth': 12, 'boosting_type': 'gbdt', 'num_leaves': 113, 'min_child_samples': 98, 'lambda_l1': 2.8861819174336533e-06, 'lambda_l2': 0.0007058725506922919, 'reg_alpha': 0.42669822608800045, 'reg_lambda': 0.9783691932791746, 'learning_rate': 0.14222487027417705, 'feature_fraction': 0.9635586125518529, 'subsample': 0.050624589771094415, 'colsample_bytree': 0.342473029527568, 'bagging_fraction': 0.8553943871270485, 'bagging_freq': 5, 'min_child_weight': 0.15274206501441348, 'min_split_gain': 0.4461762335815785, 'min_data_in_leaf': 27, 'max_delta_step': 7}. Best is trial 11 with value: 0.8782710858654724.
[I 2023-12-27 23:36:53,579] Trial 12 finished with value: 0.8738531048568623 and parameters: {'n_estimators': 600, 'max_depth': 12, 'boosting_type': 'gbdt', 'num_leaves': 87, 'min_child_samples': 99, 'lambda_l1': 1.3423524891267378e-07, 'lambda_l2': 0.00013164904456447555, 'reg_alpha': 0.3753630433121057, 'reg_lambda': 0.9897294778233382, 'learning_rate': 0.14170726301672681, 'feature_fraction': 0.9980175031070498, 'subsample': 0.22833986121884964, 'colsample_bytree': 0.39512367125352293, 'bagging_fraction': 0.8724062667064009, 'bagging_freq': 5, 'min_child_weight': 0.2622582690107168, 'min_split_gain': 0.4416845674532798, 'min_data_in_leaf': 22, 'max_delta_step': 3}. Best is trial 11 with value: 0.8782710858654724.
[I 2023-12-27 23:38:25,091] Trial 13 finished with value: 0.8620160976845547 and parameters: {'n_estimators': 600, 'max_depth': 9, 'boosting_type': 'gbdt', 'num_leaves': 84, 'min_child_samples': 57, 'lambda_l1': 4.833402031282412e-06, 'lambda_l2': 2.5977647272347386e-06, 'reg_alpha': 0.5195545278104725, 'reg_lambda': 0.5971811647641649, 'learning_rate': 0.14126902389604182, 'feature_fraction': 0.6877658982489735, 'subsample': 0.056309405751741703, 'colsample_bytree': 0.5861630773127292, 'bagging_fraction': 0.8123145203990759, 'bagging_freq': 7, 'min_child_weight': 0.014161748826676928, 'min_split_gain': 0.523022406210781, 'min_data_in_leaf': 30, 'max_delta_step': 7}. Best is trial 11 with value: 0.8782710858654724.
[I 2023-12-27 23:39:35,072] Trial 14 finished with value: 0.8780248740132749 and parameters: {'n_estimators': 300, 'max_depth': 12, 'boosting_type': 'gbdt', 'num_leaves': 129, 'min_child_samples': 40, 'lambda_l1': 3.159875322258148e-08, 'lambda_l2': 0.0009420741355173971, 'reg_alpha': 0.8079070267854085, 'reg_lambda': 0.47196676315342473, 'learning_rate': 0.28064826534912485, 'feature_fraction': 0.4100906099818725, 'subsample': 0.32040123209157056, 'colsample_bytree': 0.2884983489599452, 'bagging_fraction': 0.823637627022563, 'bagging_freq': 3, 'min_child_weight': 0.01352913024023231, 'min_split_gain': 0.009335610475266924, 'min_data_in_leaf': 14, 'max_delta_step': 2}. Best is trial 11 with value: 0.8782710858654724.
[I 2023-12-27 23:40:07,058] Trial 15 finished with value: 0.7674314856025837 and parameters: {'n_estimators': 100, 'max_depth': 12, 'boosting_type': 'gbdt', 'num_leaves': 135, 'min_child_samples': 64, 'lambda_l1': 1.1093894693363508e-08, 'lambda_l2': 0.0007252146370295743, 'reg_alpha': 0.26404462746108115, 'reg_lambda': 0.674582771693113, 'learning_rate': 0.09954423258559975, 'feature_fraction': 0.4089650219134541, 'subsample': 0.38088460195602175, 'colsample_bytree': 0.26963460056226346, 'bagging_fraction': 0.9894528142903561, 'bagging_freq': 3, 'min_child_weight': 0.015885521142713823, 'min_split_gain': 0.2231871227491393, 'min_data_in_leaf': 10, 'max_delta_step': 10}. Best is trial 11 with value: 0.8782710858654724.
[I 2023-12-27 23:41:25,977] Trial 16 finished with value: 0.8775463754838221 and parameters: {'n_estimators': 300, 'max_depth': 11, 'boosting_type': 'gbdt', 'num_leaves': 165, 'min_child_samples': 49, 'lambda_l1': 2.979801132039705e-06, 'lambda_l2': 0.00025417590238099034, 'reg_alpha': 0.7249036921470404, 'reg_lambda': 0.8202778015863035, 'learning_rate': 0.18006312787180753, 'feature_fraction': 0.7370066028683423, 'subsample': 0.596637826806498, 'colsample_bytree': 0.33624300721567546, 'bagging_fraction': 0.8764736731421038, 'bagging_freq': 3, 'min_child_weight': 0.5791617342788797, 'min_split_gain': 0.3512094898270119, 'min_data_in_leaf': 16, 'max_delta_step': 2}. Best is trial 11 with value: 0.8782710858654724.
[I 2023-12-27 23:42:32,478] Trial 17 finished with value: 0.8545180808335937 and parameters: {'n_estimators': 300, 'max_depth': 10, 'boosting_type': 'gbdt', 'num_leaves': 114, 'min_child_samples': 74, 'lambda_l1': 1.0424205102368452e-08, 'lambda_l2': 8.250065245439023, 'reg_alpha': 0.5771786186662261, 'reg_lambda': 0.46386315662637, 'learning_rate': 0.2743647210971697, 'feature_fraction': 0.43022418099957055, 'subsample': 0.15288268840229252, 'colsample_bytree': 0.212170834584281, 'bagging_fraction': 0.7965514894530638, 'bagging_freq': 3, 'min_child_weight': 0.009215891417021284, 'min_split_gain': 0.13522409806794833, 'min_data_in_leaf': 16, 'max_delta_step': 5}. Best is trial 11 with value: 0.8782710858654724.
[I 2023-12-27 23:43:52,718] Trial 18 finished with value: 0.840214250784558 and parameters: {'n_estimators': 400, 'max_depth': 12, 'boosting_type': 'gbdt', 'num_leaves': 118, 'min_child_samples': 97, 'lambda_l1': 4.028505692985566e-05, 'lambda_l2': 0.0024298720059222328, 'reg_alpha': 0.8068097048036752, 'reg_lambda': 0.658718828376563, 'learning_rate': 0.11360943815373552, 'feature_fraction': 0.480041144602146, 'subsample': 0.3602857201118521, 'colsample_bytree': 0.39457375811046463, 'bagging_fraction': 0.912398175273994, 'bagging_freq': 6, 'min_child_weight': 0.001030133654874284, 'min_split_gain': 0.6525591832264386, 'min_data_in_leaf': 30, 'max_delta_step': 1}. Best is trial 11 with value: 0.8782710858654724.
[I 2023-12-27 23:44:19,977] Trial 19 finished with value: 0.7405938010394489 and parameters: {'n_estimators': 200, 'max_depth': 5, 'boosting_type': 'gbdt', 'num_leaves': 57, 'min_child_samples': 39, 'lambda_l1': 3.2047555533872153e-07, 'lambda_l2': 0.020605679980235606, 'reg_alpha': 0.5068287729814005, 'reg_lambda': 0.9227978766682805, 'learning_rate': 0.2064521033259073, 'feature_fraction': 0.8044307556062196, 'subsample': 0.17588926419955378, 'colsample_bytree': 0.208274754943567, 'bagging_fraction': 0.8068835108366229, 'bagging_freq': 2, 'min_child_weight': 0.709777498318447, 'min_split_gain': 0.3098974642919132, 'min_data_in_leaf': 10, 'max_delta_step': 8}. Best is trial 11 with value: 0.8782710858654724.
[I 2023-12-27 23:44:42,086] Trial 20 finished with value: 0.7612528170842955 and parameters: {'n_estimators': 50, 'max_depth': 10, 'boosting_type': 'gbdt', 'num_leaves': 158, 'min_child_samples': 24, 'lambda_l1': 1.1944409555908412e-06, 'lambda_l2': 0.0006399993856977176, 'reg_alpha': 0.6904445761992442, 'reg_lambda': 0.5190310484924262, 'learning_rate': 0.1782897440083468, 'feature_fraction': 0.9016233224676775, 'subsample': 0.5730556441213499, 'colsample_bytree': 0.06783581520353882, 'bagging_fraction': 0.8596259140100362, 'bagging_freq': 4, 'min_child_weight': 0.026274533589895142, 'min_split_gain': 0.016636042354496304, 'min_data_in_leaf': 29, 'max_delta_step': 6}. Best is trial 11 with value: 0.8782710858654724.
[I 2023-12-27 23:46:01,622] Trial 21 finished with value: 0.8833719634667491 and parameters: {'n_estimators': 300, 'max_depth': 11, 'boosting_type': 'gbdt', 'num_leaves': 177, 'min_child_samples': 45, 'lambda_l1': 3.4991269944278816e-06, 'lambda_l2': 0.0002491284134385983, 'reg_alpha': 0.7480374305092647, 'reg_lambda': 0.8248409723216801, 'learning_rate': 0.1901342759969906, 'feature_fraction': 0.7462514961118765, 'subsample': 0.5970550676337194, 'colsample_bytree': 0.36799990149343387, 'bagging_fraction': 0.9183731050628999, 'bagging_freq': 3, 'min_child_weight': 0.36135045191782317, 'min_split_gain': 0.40707406736393204, 'min_data_in_leaf': 16, 'max_delta_step': 2}. Best is trial 21 with value: 0.8833719634667491.
[I 2023-12-27 23:47:15,553] Trial 22 finished with value: 0.8881755261066901 and parameters: {'n_estimators': 250, 'max_depth': 11, 'boosting_type': 'gbdt', 'num_leaves': 197, 'min_child_samples': 54, 'lambda_l1': 6.487095526319997e-08, 'lambda_l2': 0.0017328406272876602, 'reg_alpha': 0.8954019209806247, 'reg_lambda': 0.7574800975978729, 'learning_rate': 0.28442706550634245, 'feature_fraction': 0.7823651612335393, 'subsample': 0.6274888850559157, 'colsample_bytree': 0.3549984446040611, 'bagging_fraction': 0.9350132824135351, 'bagging_freq': 3, 'min_child_weight': 0.18974615877784037, 'min_split_gain': 0.47269394006479004, 'min_data_in_leaf': 20, 'max_delta_step': 2}. Best is trial 22 with value: 0.8881755261066901.
[I 2023-12-27 23:48:14,237] Trial 23 finished with value: 0.851493793278416 and parameters: {'n_estimators': 200, 'max_depth': 11, 'boosting_type': 'gbdt', 'num_leaves': 190, 'min_child_samples': 57, 'lambda_l1': 9.149873937494441e-08, 'lambda_l2': 9.731946260935053e-05, 'reg_alpha': 0.9412953784916732, 'reg_lambda': 0.7672495740592127, 'learning_rate': 0.18715023155531416, 'feature_fraction': 0.7729667315426292, 'subsample': 0.6467429302022853, 'colsample_bytree': 0.39652678699156313, 'bagging_fraction': 0.9486007346471711, 'bagging_freq': 4, 'min_child_weight': 0.16338460855819342, 'min_split_gain': 0.4593591636130272, 'min_data_in_leaf': 20, 'max_delta_step': 1}. Best is trial 22 with value: 0.8881755261066901.
[I 2023-12-27 23:49:35,797] Trial 24 finished with value: 0.8814626237854742 and parameters: {'n_estimators': 550, 'max_depth': 9, 'boosting_type': 'gbdt', 'num_leaves': 210, 'min_child_samples': 70, 'lambda_l1': 6.206372509221023e-07, 'lambda_l2': 0.009433328937434073, 'reg_alpha': 0.933946630859976, 'reg_lambda': 0.9954225256080727, 'learning_rate': 0.2988681483708137, 'feature_fraction': 0.8673030919587643, 'subsample': 0.5347153032152265, 'colsample_bytree': 0.34481442679746765, 'bagging_fraction': 0.9428166360548743, 'bagging_freq': 2, 'min_child_weight': 0.41137454052279926, 'min_split_gain': 0.552588353202933, 'min_data_in_leaf': 28, 'max_delta_step': 4}. Best is trial 22 with value: 0.8881755261066901.
[I 2023-12-27 23:50:42,099] Trial 25 finished with value: 0.8713862821743685 and parameters: {'n_estimators': 400, 'max_depth': 9, 'boosting_type': 'gbdt', 'num_leaves': 215, 'min_child_samples': 66, 'lambda_l1': 5.606353525411963e-07, 'lambda_l2': 0.01645968116387086, 'reg_alpha': 0.900328990911367, 'reg_lambda': 0.8882478647335285, 'learning_rate': 0.2995296376396346, 'feature_fraction': 0.8653931193758865, 'subsample': 0.6956790691548889, 'colsample_bytree': 0.4605688189731501, 'bagging_fraction': 0.9963554434544024, 'bagging_freq': 2, 'min_child_weight': 1.3790999793509264, 'min_split_gain': 0.5500198236137742, 'min_data_in_leaf': 8, 'max_delta_step': 4}. Best is trial 22 with value: 0.8881755261066901.
[I 2023-12-27 23:51:58,432] Trial 26 finished with value: 0.8757299330182601 and parameters: {'n_estimators': 600, 'max_depth': 8, 'boosting_type': 'gbdt', 'num_leaves': 194, 'min_child_samples': 73, 'lambda_l1': 3.458832037131109e-08, 'lambda_l2': 0.01062829757405875, 'reg_alpha': 0.9090182869071985, 'reg_lambda': 0.8535617462549212, 'learning_rate': 0.2068294241225185, 'feature_fraction': 0.763010580961449, 'subsample': 0.5453704964594214, 'colsample_bytree': 0.24605307868957346, 'bagging_fraction': 0.942644568029684, 'bagging_freq': 1, 'min_child_weight': 0.42312410212590734, 'min_split_gain': 0.616416087546423, 'min_data_in_leaf': 20, 'max_delta_step': 3}. Best is trial 22 with value: 0.8881755261066901.
[I 2023-12-27 23:52:43,882] Trial 27 finished with value: 0.8116483935336053 and parameters: {'n_estimators': 200, 'max_depth': 8, 'boosting_type': 'gbdt', 'num_leaves': 229, 'min_child_samples': 52, 'lambda_l1': 8.577767779820766e-07, 'lambda_l2': 0.0018848346117680019, 'reg_alpha': 0.9868091766180361, 'reg_lambda': 0.9828569897195489, 'learning_rate': 0.21901708186437807, 'feature_fraction': 0.8703601172344868, 'subsample': 0.4981492493361206, 'colsample_bytree': 0.1845769619284794, 'bagging_fraction': 0.9078603509428422, 'bagging_freq': 3, 'min_child_weight': 3.1809205085934575, 'min_split_gain': 0.6988582226953822, 'min_data_in_leaf': 34, 'max_delta_step': 1}. Best is trial 22 with value: 0.8881755261066901.
[I 2023-12-27 23:54:13,932] Trial 28 finished with value: 0.8534635288851998 and parameters: {'n_estimators': 350, 'max_depth': 10, 'boosting_type': 'gbdt', 'num_leaves': 173, 'min_child_samples': 60, 'lambda_l1': 1.145739349972685e-07, 'lambda_l2': 0.00013318361338986002, 'reg_alpha': 0.7665287665052277, 'reg_lambda': 0.7432817893085424, 'learning_rate': 0.10513798925474571, 'feature_fraction': 0.7015728921275299, 'subsample': 0.6058590655055107, 'colsample_bytree': 0.3606173051524219, 'bagging_fraction': 0.9616218937977218, 'bagging_freq': 2, 'min_child_weight': 0.46816099764162866, 'min_split_gain': 0.39738802885664537, 'min_data_in_leaf': 18, 'max_delta_step': 5}. Best is trial 22 with value: 0.8881755261066901.
[I 2023-12-27 23:55:11,491] Trial 29 finished with value: 0.8186632554762338 and parameters: {'n_estimators': 550, 'max_depth': 6, 'boosting_type': 'gbdt', 'num_leaves': 199, 'min_child_samples': 71, 'lambda_l1': 1.9300245738733512e-05, 'lambda_l2': 0.06529332322007117, 'reg_alpha': 0.8540002668256893, 'reg_lambda': 0.8910470873527566, 'learning_rate': 0.2322779764126315, 'feature_fraction': 0.7815730411921928, 'subsample': 0.4178043783507136, 'colsample_bytree': 0.44913002804719304, 'bagging_fraction': 0.997686972664943, 'bagging_freq': 3, 'min_child_weight': 0.047450020650408674, 'min_split_gain': 0.5261641319776578, 'min_data_in_leaf': 27, 'max_delta_step': 3}. Best is trial 22 with value: 0.8881755261066901.
[I 2023-12-27 23:57:55,258] Trial 30 finished with value: 0.8921939752528024 and parameters: {'n_estimators': 700, 'max_depth': 9, 'boosting_type': 'gbdt', 'num_leaves': 152, 'min_child_samples': 33, 'lambda_l1': 0.00011264523436153885, 'lambda_l2': 0.002535426217214707, 'reg_alpha': 0.8531783896338508, 'reg_lambda': 0.8386482461746168, 'learning_rate': 0.22932817006049094, 'feature_fraction': 0.6501957138290259, 'subsample': 0.519408739568785, 'colsample_bytree': 0.5360604044432666, 'bagging_fraction': 0.9107998816579961, 'bagging_freq': 2, 'min_child_weight': 0.23203258978937152, 'min_split_gain': 0.3946794330666563, 'min_data_in_leaf': 25, 'max_delta_step': 3}. Best is trial 30 with value: 0.8921939752528024.
[I 2023-12-28 00:00:16,380] Trial 31 finished with value: 0.8922264966138842 and parameters: {'n_estimators': 700, 'max_depth': 9, 'boosting_type': 'gbdt', 'num_leaves': 156, 'min_child_samples': 29, 'lambda_l1': 0.00014582924208031534, 'lambda_l2': 0.00207070770891591, 'reg_alpha': 0.879099482559331, 'reg_lambda': 0.8226420082257161, 'learning_rate': 0.23972844391778333, 'feature_fraction': 0.6389072863404655, 'subsample': 0.4944031203227879, 'colsample_bytree': 0.5442477245125785, 'bagging_fraction': 0.9100444612104912, 'bagging_freq': 2, 'min_child_weight': 0.26729999120841763, 'min_split_gain': 0.39571056657151543, 'min_data_in_leaf': 24, 'max_delta_step': 3}. Best is trial 31 with value: 0.8922264966138842.
[I 2023-12-28 00:02:49,485] Trial 32 finished with value: 0.8989997991337638 and parameters: {'n_estimators': 800, 'max_depth': 11, 'boosting_type': 'gbdt', 'num_leaves': 149, 'min_child_samples': 26, 'lambda_l1': 9.803860851134294e-05, 'lambda_l2': 0.0030044424373395283, 'reg_alpha': 0.8485579521271137, 'reg_lambda': 0.823632033229195, 'learning_rate': 0.16123569549530453, 'feature_fraction': 0.6545602810054852, 'subsample': 0.5129109164629315, 'colsample_bytree': 0.5255478300943521, 'bagging_fraction': 0.9073804905823786, 'bagging_freq': 1, 'min_child_weight': 0.12693393736639316, 'min_split_gain': 0.3846614060379614, 'min_data_in_leaf': 25, 'max_delta_step': 2}. Best is trial 32 with value: 0.8989997991337638.
[I 2023-12-28 00:05:09,170] Trial 33 finished with value: 0.8958036145465975 and parameters: {'n_estimators': 750, 'max_depth': 10, 'boosting_type': 'gbdt', 'num_leaves': 151, 'min_child_samples': 25, 'lambda_l1': 0.00013239045157530194, 'lambda_l2': 0.0032071894959281267, 'reg_alpha': 0.8662145021497873, 'reg_lambda': 0.7533395165967287, 'learning_rate': 0.23146274771366598, 'feature_fraction': 0.6289110073229573, 'subsample': 0.45122154653169216, 'colsample_bytree': 0.5406432957366871, 'bagging_fraction': 0.8907563452214535, 'bagging_freq': 1, 'min_child_weight': 0.08022134072683336, 'min_split_gain': 0.3830191281331384, 'min_data_in_leaf': 24, 'max_delta_step': 2}. Best is trial 32 with value: 0.8989997991337638.
[I 2023-12-28 00:07:22,684] Trial 34 finished with value: 0.8897318053882265 and parameters: {'n_estimators': 750, 'max_depth': 8, 'boosting_type': 'gbdt', 'num_leaves': 150, 'min_child_samples': 23, 'lambda_l1': 0.0002756081233310541, 'lambda_l2': 0.004038521806636954, 'reg_alpha': 0.8368395098367797, 'reg_lambda': 0.8351955687445882, 'learning_rate': 0.15985743765335148, 'feature_fraction': 0.6189789865249556, 'subsample': 0.4704072511809242, 'colsample_bytree': 0.5594715372860741, 'bagging_fraction': 0.7656943167789485, 'bagging_freq': 1, 'min_child_weight': 0.09350196583068483, 'min_split_gain': 0.3716033548247211, 'min_data_in_leaf': 24, 'max_delta_step': 3}. Best is trial 32 with value: 0.8989997991337638.
[I 2023-12-28 00:09:19,708] Trial 35 finished with value: 0.8992227839303806 and parameters: {'n_estimators': 850, 'max_depth': 10, 'boosting_type': 'gbdt', 'num_leaves': 158, 'min_child_samples': 31, 'lambda_l1': 0.00012594780677761396, 'lambda_l2': 0.04190854332710144, 'reg_alpha': 0.8578304010160532, 'reg_lambda': 0.9082046808688874, 'learning_rate': 0.2269423609657828, 'feature_fraction': 0.6267650752224884, 'subsample': 0.45372996056218884, 'colsample_bytree': 0.6231626398111181, 'bagging_fraction': 0.8942737662563495, 'bagging_freq': 1, 'min_child_weight': 0.03139373537452999, 'min_split_gain': 0.2889146281291505, 'min_data_in_leaf': 34, 'max_delta_step': 1}. Best is trial 35 with value: 0.8992227839303806.
[I 2023-12-28 00:12:12,040] Trial 36 finished with value: 0.8967002186582534 and parameters: {'n_estimators': 850, 'max_depth': 10, 'boosting_type': 'gbdt', 'num_leaves': 133, 'min_child_samples': 17, 'lambda_l1': 0.0012579716529809257, 'lambda_l2': 0.13286685086210484, 'reg_alpha': 0.7970066890973759, 'reg_lambda': 0.9052548375736079, 'learning_rate': 0.12827308333727266, 'feature_fraction': 0.5947904298434539, 'subsample': 0.42544021303549495, 'colsample_bytree': 0.6185053528543994, 'bagging_fraction': 0.8897744944480148, 'bagging_freq': 1, 'min_child_weight': 0.03348700916069735, 'min_split_gain': 0.28462873394322535, 'min_data_in_leaf': 35, 'max_delta_step': 1}. Best is trial 35 with value: 0.8992227839303806.
[I 2023-12-28 00:15:01,246] Trial 37 finished with value: 0.8975457216031145 and parameters: {'n_estimators': 900, 'max_depth': 10, 'boosting_type': 'gbdt', 'num_leaves': 102, 'min_child_samples': 17, 'lambda_l1': 0.0010511873817839884, 'lambda_l2': 0.04511572731653093, 'reg_alpha': 0.8013630825405028, 'reg_lambda': 0.9251516562576017, 'learning_rate': 0.1589772264584597, 'feature_fraction': 0.6038001674080635, 'subsample': 0.42037162655371074, 'colsample_bytree': 0.6502318100866276, 'bagging_fraction': 0.8411653909921795, 'bagging_freq': 1, 'min_child_weight': 0.03848395291096782, 'min_split_gain': 0.27887820629397675, 'min_data_in_leaf': 39, 'max_delta_step': 0}. Best is trial 35 with value: 0.8992227839303806.
[I 2023-12-28 00:17:09,325] Trial 38 finished with value: 0.85945636749193 and parameters: {'n_estimators': 900, 'max_depth': 7, 'boosting_type': 'gbdt', 'num_leaves': 100, 'min_child_samples': 18, 'lambda_l1': 0.0009059956557422825, 'lambda_l2': 0.2372632702197369, 'reg_alpha': 0.7913680828324612, 'reg_lambda': 0.9399333038053995, 'learning_rate': 0.12402294496183226, 'feature_fraction': 0.5967861272024316, 'subsample': 0.39973665639568184, 'colsample_bytree': 0.6506821200780779, 'bagging_fraction': 0.835482168603026, 'bagging_freq': 1, 'min_child_weight': 0.02841779979609808, 'min_split_gain': 0.26635896836483625, 'min_data_in_leaf': 41, 'max_delta_step': 0}. Best is trial 35 with value: 0.8992227839303806.
[I 2023-12-28 00:18:37,903] Trial 39 finished with value: 0.8140501732349567 and parameters: {'n_estimators': 850, 'max_depth': 11, 'boosting_type': 'gbdt', 'num_leaves': 50, 'min_child_samples': 8, 'lambda_l1': 0.0014391679638090364, 'lambda_l2': 0.04193995020208812, 'reg_alpha': 0.6704979008319145, 'reg_lambda': 0.9220114570199825, 'learning_rate': 0.079993010687535, 'feature_fraction': 0.5948035295441012, 'subsample': 0.3540651650868418, 'colsample_bytree': 0.6798399781297993, 'bagging_fraction': 0.752945071168631, 'bagging_freq': 1, 'min_child_weight': 0.007388820142358995, 'min_split_gain': 0.2852848467749537, 'min_data_in_leaf': 35, 'max_delta_step': 1}. Best is trial 35 with value: 0.8992227839303806.
[I 2023-12-28 00:21:16,910] Trial 40 finished with value: 0.9046627973391566 and parameters: {'n_estimators': 950, 'max_depth': 10, 'boosting_type': 'gbdt', 'num_leaves': 134, 'min_child_samples': 16, 'lambda_l1': 0.002865600014675554, 'lambda_l2': 0.2596842570474358, 'reg_alpha': 0.7882191530750605, 'reg_lambda': 0.8807986419485984, 'learning_rate': 0.1590297123957936, 'feature_fraction': 0.6798887463051443, 'subsample': 0.4338721676814267, 'colsample_bytree': 0.7119935931199451, 'bagging_fraction': 0.8456859403262338, 'bagging_freq': 1, 'min_child_weight': 0.004745177681063355, 'min_split_gain': 0.17818587330243982, 'min_data_in_leaf': 40, 'max_delta_step': 0}. Best is trial 40 with value: 0.9046627973391566.

The best set of hyperparameters was found in trial 40, with a cross-validated (CV) ROC AUC of 0.905:

'n_estimators': 950,
'max_depth': 10,
'boosting_type': 'gbdt',
'num_leaves': 134,
'min_child_samples': 16,
'lambda_l1': 0.002865600014675554,
'lambda_l2': 0.2596842570474358,
'reg_alpha': 0.7882191530750605,
'reg_lambda': 0.8807986419485984,
'learning_rate': 0.1590297123957936,
'feature_fraction': 0.6798887463051443,
'subsample': 0.4338721676814267,
'colsample_bytree': 0.7119935931199451,
'bagging_fraction': 0.8456859403262338,
'bagging_freq': 1,
'min_child_weight': 0.004745177681063355,
'min_split_gain': 0.17818587330243982,
'min_data_in_leaf': 40,
'max_delta_step': 0

4.7 Evaluate Tuned Model

The tuned model with the best set of hyperparameters is evaluated on the validation set. For reference, the results are also calculated on the training set.

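The warning messages indicate that some parameters were supplied under two names: LightGBM treats, e.g., `min_child_samples` as an alias of `min_data_in_leaf` and `reg_alpha` as an alias of `lambda_l1`. When both names are set, only one value is used and the other one is ignored. The ignored parameters will not be set.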
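The alias pairs named in the warnings can be removed from a parameter dictionary before it is passed to the classifier. The following is only a minimal sketch: the helper `drop_shadowed_aliases` and the `ALIAS_PAIRS` mapping are hypothetical names introduced for illustration and are not part of the project code; the pairs themselves are taken from the warning messages.

```python
# (native LightGBM name -> scikit-learn style alias), as reported in the warnings
ALIAS_PAIRS = {
    "min_data_in_leaf": "min_child_samples",
    "feature_fraction": "colsample_bytree",
    "lambda_l1": "reg_alpha",
    "lambda_l2": "reg_lambda",
    "bagging_fraction": "subsample",
    "bagging_freq": "subsample_freq",
}


def drop_shadowed_aliases(params):
    """Return a copy of `params` without aliases whose native name is also set."""
    shadowed = {
        alias
        for native, alias in ALIAS_PAIRS.items()
        if native in params and alias in params
    }
    return {key: value for key, value in params.items() if key not in shadowed}


# A subset of the trial 40 parameters
trial_params = {
    "min_child_samples": 16,
    "min_data_in_leaf": 40,
    "reg_alpha": 0.7882191530750605,
    "lambda_l1": 0.002865600014675554,
    "learning_rate": 0.1590297123957936,
}

cleaned = drop_shadowed_aliases(trial_params)
print(sorted(cleaned))
# ['lambda_l1', 'learning_rate', 'min_data_in_leaf']
```

Sampling only one name per alias pair during tuning would also avoid spending trials on values that have no effect on the model.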

The results show that the tuned model performs best on the training set (ROC AUC = 1.000) but worst on the validation set (ROC AUC = 0.716). This indicates overfitting. Feasible options in this situation may include:

tuning the model with fewer features (e.g., 34, as ROC AUC is almost the same as in the model with 111 features);
using a simpler model (e.g., Logistic Regression).

Unfortunately, there is currently not enough time to implement these options, so the untuned model with 111 features will be selected as the best one.
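The degree of overfitting can be summarized as the difference between the training and validation ROC AUC. The sketch below only re-uses values copied from the performance tables of this section; the variable names are chosen for illustration.

```python
# ROC AUC values copied from the train and validation tables of this section
roc_auc = {
    "LGBM (111 feat. | tuned)": {"train": 1.000, "validation": 0.716},
    "LGBM (111 features)": {"train": 0.799, "validation": 0.758},
    "LGBM (34 features)": {"train": 0.796, "validation": 0.758},
}

# A large positive gap suggests that the model memorizes the training data
gaps = {
    name: round(scores["train"] - scores["validation"], 3)
    for name, scores in roc_auc.items()
}

for name, gap in gaps.items():
    print(f"{name}: train - validation gap = {gap:.3f}")
```

The gap of the tuned model (about 0.28) is several times larger than that of the untuned models (about 0.04).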

See the details below.

An example of warnings

[LightGBM] [Warning] min_data_in_leaf is set=40, min_child_samples=16 will be ignored. Current value: min_data_in_leaf=40
[LightGBM] [Warning] feature_fraction is set=0.6798887463051443, colsample_bytree=0.7119935931199451 will be ignored. Current value: feature_fraction=0.6798887463051443
[LightGBM] [Warning] lambda_l1 is set=0.002865600014675554, reg_alpha=0.7882191530750605 will be ignored. Current value: lambda_l1=0.002865600014675554
[LightGBM] [Warning] lambda_l2 is set=0.2596842570474358, reg_lambda=0.8807986419485984 will be ignored. Current value: lambda_l2=0.2596842570474358
[LightGBM] [Warning] bagging_fraction is set=0.8456859403262338, subsample=0.4338721676814267 will be ignored. Current value: bagging_fraction=0.8456859403262338
[LightGBM] [Warning] bagging_freq is set=1, subsample_freq=0 will be ignored. Current value: bagging_freq=1

Code

# The ignored parameters are left commented out
params_tuned_1 = {
    "n_estimators": 950,
    "max_depth": 10,
    "boosting_type": "gbdt",
    "num_leaves": 134,
    # "min_child_samples": 16,
    "lambda_l1": 0.002865600014675554,
    "lambda_l2": 0.2596842570474358,
    # "reg_alpha": 0.7882191530750605,
    # "reg_lambda": 0.8807986419485984,
    "learning_rate": 0.1590297123957936,
    "feature_fraction": 0.6798887463051443,
    # "subsample": 0.4338721676814267,
    # "colsample_bytree": 0.7119935931199451,
    "bagging_fraction": 0.8456859403262338,
    "bagging_freq": 1,
    "min_child_weight": 0.004745177681063355,
    "min_split_gain": 0.17818587330243982,
    "min_data_in_leaf": 40,
    "max_delta_step": 0,
}

model_tuned_1 = LGBMClassifier(
    objective="binary",
    metric="auc",
    random_state=1,
    class_weight="balanced",
    n_jobs=-1,
    device="gpu",
    **params_tuned_1,
)

pipeline_with_tuned_model = Pipeline(
    steps=[
        ("preprocessor_1", PreprocessorForApplications()),
        ("preprocessor_2", clone(pre_processing)),
        ("selector", ColumnSelector(feature_names_to_tune_111)),
        ("classifier", clone(model_tuned_1)),
    ]
)

models_default_prediction["LGBM (111 feat. | tuned)"] = pipeline_with_tuned_model.fit(
    X_train, y_train
)
# Time: 3m 20.7s

Code

performance_train_1 = ml.classification_scores(
    models_default_prediction,
    X_train,
    y_train,
    sort_by="ROC_AUC",
    color="orange",
)

[LightGBM] [Warning] min_data_in_leaf is set=40, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=40
[LightGBM] [Warning] feature_fraction is set=0.6798887463051443, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.6798887463051443
[LightGBM] [Warning] lambda_l1 is set=0.002865600014675554, reg_alpha=0.0 will be ignored. Current value: lambda_l1=0.002865600014675554
[LightGBM] [Warning] lambda_l2 is set=0.2596842570474358, reg_lambda=0.0 will be ignored. Current value: lambda_l2=0.2596842570474358
[LightGBM] [Warning] bagging_fraction is set=0.8456859403262338, subsample=1.0 will be ignored. Current value: bagging_fraction=0.8456859403262338
[LightGBM] [Warning] bagging_freq is set=1, subsample_freq=0 will be ignored. Current value: bagging_freq=1
[LightGBM] [Warning] min_data_in_leaf is set=40, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=40
[LightGBM] [Warning] feature_fraction is set=0.6798887463051443, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.6798887463051443
[LightGBM] [Warning] lambda_l1 is set=0.002865600014675554, reg_alpha=0.0 will be ignored. Current value: lambda_l1=0.002865600014675554
[LightGBM] [Warning] lambda_l2 is set=0.2596842570474358, reg_lambda=0.0 will be ignored. Current value: lambda_l2=0.2596842570474358
[LightGBM] [Warning] bagging_fraction is set=0.8456859403262338, subsample=1.0 will be ignored. Current value: bagging_fraction=0.8456859403262338
[LightGBM] [Warning] bagging_freq is set=1, subsample_freq=0 will be ignored. Current value: bagging_freq=1

Code

performance_validation_1 = ml.classification_scores(
    models_default_prediction,
    X_validation,
    y_validation,
    sort_by="ROC_AUC",
)

[LightGBM] [Warning] min_data_in_leaf is set=40, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=40
[LightGBM] [Warning] feature_fraction is set=0.6798887463051443, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.6798887463051443
[LightGBM] [Warning] lambda_l1 is set=0.002865600014675554, reg_alpha=0.0 will be ignored. Current value: lambda_l1=0.002865600014675554
[LightGBM] [Warning] lambda_l2 is set=0.2596842570474358, reg_lambda=0.0 will be ignored. Current value: lambda_l2=0.2596842570474358
[LightGBM] [Warning] bagging_fraction is set=0.8456859403262338, subsample=1.0 will be ignored. Current value: bagging_fraction=0.8456859403262338
[LightGBM] [Warning] bagging_freq is set=1, subsample_freq=0 will be ignored. Current value: bagging_freq=1
[LightGBM] [Warning] min_data_in_leaf is set=40, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=40
[LightGBM] [Warning] feature_fraction is set=0.6798887463051443, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.6798887463051443
[LightGBM] [Warning] lambda_l1 is set=0.002865600014675554, reg_alpha=0.0 will be ignored. Current value: lambda_l1=0.002865600014675554
[LightGBM] [Warning] lambda_l2 is set=0.2596842570474358, reg_lambda=0.0 will be ignored. Current value: lambda_l2=0.2596842570474358
[LightGBM] [Warning] bagging_fraction is set=0.8456859403262338, subsample=1.0 will be ignored. Current value: bagging_fraction=0.8456859403262338
[LightGBM] [Warning] bagging_freq is set=1, subsample_freq=0 will be ignored. Current value: bagging_freq=1

Code

print("--- Train ---")
performance_train_1

--- Train ---

	n	No_info_rate	Accuracy	BAcc	BAcc_01	F1	F1_neg	TPR	TNR	PPV	NPV	ROC_AUC
LGBM (111 feat. | tuned)	215257	0.919	0.995	0.997	0.995	0.972	0.997	1.000	0.995	0.945	1.000	1.000
LGBM (123 features)	215257	0.919	0.714	0.722	0.443	0.292	0.821	0.731	0.712	0.182	0.968	0.800
LGBM (FULL | 197 feat.)	215257	0.919	0.714	0.721	0.443	0.292	0.821	0.731	0.712	0.182	0.968	0.799
LGBM (98 features)	215257	0.919	0.714	0.722	0.444	0.292	0.820	0.733	0.712	0.183	0.968	0.799
LGBM (74 features)	215257	0.919	0.713	0.721	0.443	0.292	0.820	0.731	0.712	0.182	0.968	0.799
LGBM (111 features)	215257	0.919	0.714	0.721	0.443	0.292	0.820	0.731	0.712	0.182	0.968	0.799
LGBM (55 features)	215257	0.919	0.713	0.721	0.441	0.291	0.820	0.730	0.711	0.182	0.968	0.799
LGBM (34 features)	215257	0.919	0.711	0.718	0.437	0.289	0.819	0.726	0.710	0.180	0.967	0.796
LGBM (8 features)	215257	0.919	0.705	0.708	0.416	0.280	0.814	0.711	0.704	0.174	0.965	0.784

Code

print("--- Validation ---")
performance_validation_1

--- Validation ---

	n	No_info_rate	Accuracy	BAcc	BAcc_01	F1	F1_neg	TPR	TNR	PPV	NPV	ROC_AUC
LGBM (FULL | 197 feat.)	46127	0.919	0.705	0.688	0.375	0.268	0.815	0.667	0.708	0.167	0.960	0.759
LGBM (123 features)	46127	0.919	0.704	0.687	0.375	0.267	0.815	0.668	0.707	0.167	0.960	0.759
LGBM (111 features)	46127	0.919	0.705	0.691	0.381	0.269	0.816	0.673	0.708	0.168	0.961	0.758
LGBM (98 features)	46127	0.919	0.704	0.689	0.377	0.268	0.814	0.671	0.707	0.167	0.961	0.758
LGBM (74 features)	46127	0.919	0.704	0.689	0.377	0.268	0.815	0.670	0.707	0.167	0.961	0.758
LGBM (34 features)	46127	0.919	0.703	0.688	0.375	0.267	0.814	0.669	0.706	0.167	0.960	0.758
LGBM (55 features)	46127	0.919	0.703	0.688	0.375	0.267	0.814	0.669	0.706	0.167	0.961	0.758
LGBM (8 features)	46127	0.919	0.698	0.688	0.376	0.265	0.810	0.675	0.700	0.165	0.961	0.754
LGBM (111 feat. | tuned)	46127	0.919	0.896	0.568	0.136	0.215	0.944	0.178	0.959	0.274	0.930	0.716

4.8 Final Evaluation

After hyperparameter tuning, the trade-off between model complexity and accuracy was reconsidered. Instead of the best-performing model based on 111 features, a much less complex model based on 34 features with comparable performance (validation AUC = 0.758, which differs by less than 0.0005) was chosen as the final model to be deployed.

The final performance of the model based on these features on the test set is AUC = 0.763 (slightly better, which may be related to the fact that the final model was trained on a larger dataset).

Code

features_34 = feature_importance.head(34).col_name.to_list()

pipeline_final_1_with_34_feat = Pipeline(
    steps=[
        ("preprocessor_1", PreprocessorForApplications()),
        ("preprocessor_2", clone(pre_processing)),
        ("selector", ColumnSelector(features_34)),
        ("classifier", clone(lgbm_classifier)),
    ]
)

Code

# For performance evaluation
X_train_validation = pd.concat([X_train, X_validation])
y_train_validation = pd.concat([y_train, y_validation])

file = dir_interim + "task-1-applications-only--lgbm_models_final_1.pkl"

if os.path.exists(file):
    with open(file, "rb") as f:
        models_final_1 = joblib.load(f)
else:
    models_final_1 = {}
    models_final_1["LGBM (34 feat. | final)"] = pipeline_final_1_with_34_feat.fit(
        X_train_validation, y_train_validation
    )
    with open(file, "wb") as f:
        joblib.dump(models_final_1, f)

del file

[LightGBM] [Info] Number of positive: 21101, number of negative: 240283
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 3410
[LightGBM] [Info] Number of data points in the train set: 261384, number of used features: 34
[LightGBM] [Info] Using GPU Device: NVIDIA GeForce GTX 950M, Vendor: NVIDIA Corporation
[LightGBM] [Info] Compiling OpenCL Kernel with 256 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 8
[LightGBM] [Info] 17 dense feature groups (4.99 MB) transferred to GPU in 0.013168 secs. 1 sparse feature groups
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=-0.000000
[LightGBM] [Info] Start training from score -0.000000
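The line `pavg=0.500000` in the log can look surprising because only about 8% of the cases are positive. A possible explanation, assuming that `lgbm_classifier` uses `class_weight="balanced"` (as the tuned model above does; its definition is not shown in this section), is that the class weights equalize the weighted share of both classes. A minimal arithmetic sketch with the counts from the log:

```python
# Class counts reported in the LightGBM log above
n_pos, n_neg = 21101, 240283
n_total = n_pos + n_neg

# "balanced" weights as defined in scikit-learn:
# n_samples / (n_classes * class_count)
w_pos = n_total / (2 * n_pos)
w_neg = n_total / (2 * n_neg)

raw_share = n_pos / n_total
weighted_share = (w_pos * n_pos) / (w_pos * n_pos + w_neg * n_neg)

print(f"weights: positive = {w_pos:.3f}, negative = {w_neg:.3f}")
print(f"positive share: raw = {raw_share:.3f}, weighted = {weighted_share:.3f}")
```

With these weights, each positive case counts roughly as much as 11 negative ones, so the weighted average of the target is 0.5 and the initial score is 0.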

Code

print("--- Train ---")

ml.classification_scores(
    models_final_1,
    X_train_validation,
    y_train_validation,
    color="orange",
    sort_by="ROC_AUC",
)

--- Train ---

	n	No_info_rate	Accuracy	BAcc	BAcc_01	F1	F1_neg	TPR	TNR	PPV	NPV	ROC_AUC
LGBM (34 feat. | final)	261384	0.919	0.710	0.714	0.429	0.286	0.818	0.720	0.709	0.178	0.966	0.791

Code of the figure

sns.set_style("white")
y_pred_train_val_1 = models_final_1["LGBM (34 feat. | final)"].predict(
    X_train_validation
)
ml.plot_confusion_matrices(y_train_validation, y_pred_train_val_1, figsize=(13, 3));

Fig. 4.1. Confusion matrices for the joint **train and validation set**.

Code

print("--- Test ---")

ml.classification_scores(
    models_final_1,
    X_test,
    y_test,
    sort_by="ROC_AUC",
)

--- Test ---

	n	No_info_rate	Accuracy	BAcc	BAcc_01	F1	F1_neg	TPR	TNR	PPV	NPV	ROC_AUC
LGBM (34 feat. | final)	46127	0.919	0.707	0.696	0.392	0.273	0.816	0.683	0.709	0.171	0.962	0.763
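Several columns of the table are linked to each other, which can be used as a sanity check. The sketch below recomputes BAcc, BAcc_01 and PPV from TPR, TNR and the no-information rate of the test-set row. The rescaling `2 * BAcc - 1` is an assumption about how `BAcc_01` is defined (it is not shown in this section), although it agrees with the tabulated values.

```python
# Values copied from the test-set row of the final model
tpr, tnr = 0.683, 0.709
no_info_rate = 0.919  # share of the majority (negative) class
prevalence = 1 - no_info_rate  # share of the positive class

# Balanced accuracy: the mean of recall for both classes
bacc = (tpr + tnr) / 2

# Assumed rescaling: chance level (0.5) maps to 0, perfect score (1) maps to 1
bacc_01 = 2 * bacc - 1

# Precision from recall, specificity and prevalence (Bayes' rule)
ppv = tpr * prevalence / (tpr * prevalence + (1 - tnr) * (1 - prevalence))

print(f"BAcc = {bacc:.3f}, BAcc_01 = {bacc_01:.3f}, PPV = {ppv:.3f}")
# BAcc = 0.696, BAcc_01 = 0.392, PPV = 0.171
```

The low PPV follows mainly from the low prevalence of the positive class: even with a TNR of about 0.71, false positives outnumber true positives.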

Code of the figure

sns.set_style("white")
y_pred_test_1 = models_final_1["LGBM (34 feat. | final)"].predict(X_test)
ml.plot_confusion_matrices(y_test, y_pred_test_1, figsize=(13, 3));

Fig. 4.2. Confusion matrices for the **test set**.

Code

# SHAP values for the final model
@my.cache_results(dir_interim + "task-1-applications-only--shap_lgbm_k=34-final.pickle")
def get_shap_values_lgbm_final_1():
    model = "LGBM (34 feat. | final)"
    preproc = Pipeline(steps=models_final_1[model].steps[:-1])
    classifier = models_final_1[model]["classifier"]
    X_test_preproc = preproc.transform(X_test)

    tree_explainer = shap.TreeExplainer(classifier)
    shap_values = tree_explainer.shap_values(X_test_preproc)
    return shap_values, X_test_preproc


shap_values_lgbm_test_1, data_for_lgbm_test_1 = get_shap_values_lgbm_final_1()

feature_importance_test_1 = (
    pd.DataFrame(
        list(
            zip(
                data_for_lgbm_test_1.columns,
                np.abs(shap_values_lgbm_test_1).mean(0).mean(0),
            )
        ),
        columns=["col_name", "importance"],
    )
    .sort_values(by=["importance"], ascending=False)
    .reset_index()
)
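The expression `np.abs(...).mean(0).mean(0)` above relies on the SHAP values having the shape (classes, samples, features), which is what the indexing `shap_values_lgbm_test_1[1]` used later suggests. The return shape may differ between SHAP versions, so this is an assumption. A small sketch with synthetic numbers (not project data) shows what the two averaging steps do:

```python
import numpy as np

# Synthetic stand-in for SHAP output: 2 classes x 3 samples x 2 features
shap_like = np.array(
    [
        [[-0.2, 0.1], [0.4, -0.3], [-0.6, 0.2]],  # class 0
        [[0.2, -0.1], [-0.4, 0.3], [0.6, -0.2]],  # class 1 (mirrored signs)
    ]
)

# First mean: average over classes; second mean: average over samples
importance = np.abs(shap_like).mean(0).mean(0)

# For a binary model with mirrored values, one class gives the same result
importance_class_1 = np.abs(shap_like[1]).mean(0)

print(importance)          # one value per feature
print(importance_class_1)
```

In this toy case both approaches return one importance value per feature, and the values coincide because the absolute SHAP values of the two classes are identical.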

Code

sns.set_style("whitegrid")
lgb.plot_importance(
    models_final_1["LGBM (34 feat. | final)"]["classifier"],
    max_num_features=50,
    figsize=(8, 8),
    height=0.8,
    title="LGBM Feature Importance (Final Model)",
);

Code

sns.set_style("whitegrid")
shap.summary_plot(
    shap_values_lgbm_test_1[1],
    data_for_lgbm_test_1,
    plot_type="bar",
    max_display=50,
    plot_size=(10, 5),
    show=False,
)
plt.tick_params(axis="y", labelsize=8)
plt.title("SHAP Feature Importance (Final Model)", fontsize=12)
plt.xlabel("Mean (|SHAP value|) \n(impact on model output)", fontsize=10);

Code

sns.set_style("whitegrid")
shap.summary_plot(
    shap_values_lgbm_test_1[1],
    data_for_lgbm_test_1,
    max_display=50,
    plot_size=(10, 5),
    show=False,
)
plt.title("SHAP Feature Importance (Final Model)", fontsize=12)
plt.tick_params(axis="y", labelsize=8)
plt.tick_params(axis="x", labelsize=8)

4.9 Model for Deployment (w/o Historical Data)

Code

# For deployment
X_all = pd.concat([X_train, X_validation, X_test], axis=0)
y_all = pd.concat([y_train, y_validation, y_test], axis=0)

pipeline_to_deploy_1 = clone(pipeline_final_1_with_34_feat)
pipeline_to_deploy_1 = pipeline_to_deploy_1.fit(X_all, y_all)

For simplicity, the model will be deployed without the pre-processing pipeline.

Code

# Extract and save classifier
classifier_to_deploy_1 = pipeline_to_deploy_1.named_steps["classifier"]

with open("models/classifier-1--without_credit_history.pickle", "wb") as f:
    joblib.dump(classifier_to_deploy_1, f)

5 Feature Engineering

In this section, data from the tables with historical credit data will be prepared for modeling. First, each subsection will show the steps used to pre-process each dataset to get it ready for merging. Then, the datasets will be merged into a single joint dataset.

Note.

In this section, all the features from the application dataset will be used again even though some of them were not used in the previous model.
The feature selection will be performed after all the features are created and merged into a single dataset.

The main strategy to aggregate features was:

for numeric features, mean, median, standard deviation, max, min and range were calculated. On rare occasions, other statistics were calculated too.
for categorical features, the frequency either of each category or of the biggest categories was calculated.

As each dataset was different and had different features, the pre-processing steps were adapted each time.
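The aggregation strategy can be illustrated on a toy example. The sketch below uses made-up values and only three columns named as in the `bureau` table; it is not the project's aggregation code, just a minimal demonstration of named aggregation per client.

```python
import pandas as pd

# Toy data (made-up values) with column names borrowed from `bureau`
toy = pd.DataFrame(
    {
        "SK_ID_CURR": [1, 1, 1, 2],
        "DAYS_CREDIT": [-500, -200, -100, -50],
        "CREDIT_ACTIVE": ["Closed", "Active", "Active", "Closed"],
    }
)

toy_aggregated = toy.groupby("SK_ID_CURR").agg(
    # Numeric feature: summary statistics per client
    n_credits_total=("DAYS_CREDIT", "count"),
    days_credit_mean=("DAYS_CREDIT", "mean"),
    days_credit_median=("DAYS_CREDIT", "median"),
    days_credit_std=("DAYS_CREDIT", "std"),
    days_credit_min=("DAYS_CREDIT", "min"),
    days_credit_max=("DAYS_CREDIT", "max"),
    days_credit_range=("DAYS_CREDIT", lambda x: x.max() - x.min()),
    # Categorical feature: frequency of selected categories per client
    n_credits_active=("CREDIT_ACTIVE", lambda x: (x == "Active").sum()),
    n_credits_closed=("CREDIT_ACTIVE", lambda x: (x == "Closed").sum()),
)

print(toy_aggregated.T)
```

Note that the standard deviation is missing (NaN) for clients with a single record, as in the case of client 2 here, so the aggregated tables contain missing values that later modeling steps have to handle.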

5.1 Table `bureau`

Code

bureau.head()

	SK_ID_CURR	SK_ID_BUREAU	CREDIT_ACTIVE	CREDIT_CURRENCY	DAYS_CREDIT	DAYS_CREDIT_ENDDATE	DAYS_ENDDATE_FACT	AMT_CREDIT_MAX_OVERDUE	AMT_CREDIT_SUM	AMT_CREDIT_SUM_DEBT	AMT_CREDIT_SUM_LIMIT	CREDIT_TYPE	DAYS_CREDIT_UPDATE	AMT_ANNUITY
0	215354	5714462	Closed	currency 1	-497	-153.00	-153.00	NaN	91323.00	0.00	NaN	Consumer credit	-131	NaN
1	215354	5714463	Active	currency 1	-208	1075.00	NaN	NaN	225000.00	171342.00	NaN	Credit card	-20	NaN
2	215354	5714464	Active	currency 1	-203	528.00	NaN	NaN	464323.50	NaN	NaN	Consumer credit	-16	NaN
3	215354	5714465	Active	currency 1	-203	NaN	NaN	NaN	90000.00	NaN	NaN	Credit card	-16	NaN
4	215354	5714466	Active	currency 1	-629	1197.00	NaN	77674.50	2700000.00	NaN	NaN	Consumer credit	-21	NaN

Code

file = dir_interim + "aggregated--bureau_aggregated.feather"

if os.path.exists(file):
    bureau_aggregated = pd.read_feather(file)

else:
    bureau_aggregated = (
        bureau.assign(
            CREDIT_TYPE=lambda df: df["CREDIT_TYPE"].apply(
                lambda x: x
                if x
                in [
                    "Consumer credit",
                    "Credit card",
                    "Car loan",
                    "Mortgage",
                    "Microloan",
                ]
                else "Other"
            ),
        )
        .groupby("SK_ID_CURR")
        .agg(
            n_credits_total=("SK_ID_BUREAU", "count"),
            n_credits_active=("CREDIT_ACTIVE", lambda x: (x == "Active").sum()),
            n_credits_closed=("CREDIT_ACTIVE", lambda x: (x == "Closed").sum()),
            n_credits_bad_debt=("CREDIT_ACTIVE", lambda x: (x == "Bad debt").sum()),
            n_credits_sold=("CREDIT_ACTIVE", lambda x: (x == "Sold").sum()),
            mode_credit_currency=(
                "CREDIT_CURRENCY",
                lambda x: x.mode().iloc[0] if not x.empty else None,
            ),
            n_different_currencies=("CREDIT_CURRENCY", "nunique"),
            n_currency_1=("CREDIT_CURRENCY", lambda x: (x == "currency 1").sum()),
            n_currency_2=("CREDIT_CURRENCY", lambda x: (x == "currency 2").sum()),
            n_currency_3=("CREDIT_CURRENCY", lambda x: (x == "currency 3").sum()),
            n_currency_4=("CREDIT_CURRENCY", lambda x: (x == "currency 4").sum()),
            days_credit_min=("DAYS_CREDIT", "min"),
            days_credit_max=("DAYS_CREDIT", "max"),
            days_credit_mean=("DAYS_CREDIT", "mean"),
            days_credit_std=("DAYS_CREDIT", "std"),
            days_credit_median=("DAYS_CREDIT", "median"),
            days_credit_range=("DAYS_CREDIT", lambda x: x.max() - x.min()),
            days_credit_overdue_min=("CREDIT_DAY_OVERDUE", "min"),
            days_credit_overdue_max=("CREDIT_DAY_OVERDUE", "max"),
            days_credit_overdue_mean=("CREDIT_DAY_OVERDUE", "mean"),
            days_credit_overdue_std=("CREDIT_DAY_OVERDUE", "std"),
            days_credit_overdue_median=("CREDIT_DAY_OVERDUE", "median"),
            days_credit_overdue_range=(
                "CREDIT_DAY_OVERDUE",
                lambda x: x.max() - x.min(),
            ),
            days_credit_enddate_min=("DAYS_CREDIT_ENDDATE", "min"),
            days_credit_enddate_max=("DAYS_CREDIT_ENDDATE", "max"),
            days_credit_enddate_mean=("DAYS_CREDIT_ENDDATE", "mean"),
            days_credit_enddate_std=("DAYS_CREDIT_ENDDATE", "std"),
            days_credit_enddate_median=("DAYS_CREDIT_ENDDATE", "median"),
            days_credit_enddate_range=(
                "DAYS_CREDIT_ENDDATE",
                lambda x: x.max() - x.min(),
            ),
            days_enddate_fact_min=("DAYS_ENDDATE_FACT", "min"),
            days_enddate_fact_max=("DAYS_ENDDATE_FACT", "max"),
            days_enddate_fact_mean=("DAYS_ENDDATE_FACT", "mean"),
            days_enddate_fact_std=("DAYS_ENDDATE_FACT", "std"),
            days_enddate_fact_median=("DAYS_ENDDATE_FACT", "median"),
            days_enddate_fact_range=("DAYS_ENDDATE_FACT", lambda x: x.max() - x.min()),
            amt_credit_max_overdue_min=("AMT_CREDIT_MAX_OVERDUE", "min"),
            amt_credit_max_overdue_max=("AMT_CREDIT_MAX_OVERDUE", "max"),
            amt_credit_max_overdue_mean=("AMT_CREDIT_MAX_OVERDUE", "mean"),
            amt_credit_max_overdue_std=("AMT_CREDIT_MAX_OVERDUE", "std"),
            amt_credit_max_overdue_median=("AMT_CREDIT_MAX_OVERDUE", "median"),
            amt_credit_max_overdue_range=(
                "AMT_CREDIT_MAX_OVERDUE",
                lambda x: x.max() - x.min(),
            ),
            cnt_credit_prolong_min=("CNT_CREDIT_PROLONG", "min"),
            cnt_credit_prolong_max=("CNT_CREDIT_PROLONG", "max"),
            cnt_credit_prolong_mean=("CNT_CREDIT_PROLONG", "mean"),
            cnt_credit_prolong_std=("CNT_CREDIT_PROLONG", "std"),
            cnt_credit_prolong_median=("CNT_CREDIT_PROLONG", "median"),
            cnt_credit_prolong_range=(
                "CNT_CREDIT_PROLONG",
                lambda x: x.max() - x.min(),
            ),
            cnt_credit_prolong_sum=("CNT_CREDIT_PROLONG", "sum"),
            amt_credit_sum_min=("AMT_CREDIT_SUM", "min"),
            amt_credit_sum_max=("AMT_CREDIT_SUM", "max"),
            amt_credit_sum_mean=("AMT_CREDIT_SUM", "mean"),
            amt_credit_sum_std=("AMT_CREDIT_SUM", "std"),
            amt_credit_sum_median=("AMT_CREDIT_SUM", "median"),
            amt_credit_sum_range=("AMT_CREDIT_SUM", lambda x: x.max() - x.min()),
            amt_credit_sum_sum=("AMT_CREDIT_SUM", "sum"),
            amt_credit_sum_debt_min=("AMT_CREDIT_SUM_DEBT", "min"),
            amt_credit_sum_debt_max=("AMT_CREDIT_SUM_DEBT", "max"),
            amt_credit_sum_debt_mean=("AMT_CREDIT_SUM_DEBT", "mean"),
            amt_credit_sum_debt_std=("AMT_CREDIT_SUM_DEBT", "std"),
            amt_credit_sum_debt_median=("AMT_CREDIT_SUM_DEBT", "median"),
            amt_credit_sum_debt_range=(
                "AMT_CREDIT_SUM_DEBT",
                lambda x: x.max() - x.min(),
            ),
            amt_credit_sum_debt_sum=("AMT_CREDIT_SUM_DEBT", "sum"),
            amt_credit_sum_limit_min=("AMT_CREDIT_SUM_LIMIT", "min"),
            amt_credit_sum_limit_max=("AMT_CREDIT_SUM_LIMIT", "max"),
            amt_credit_sum_limit_mean=("AMT_CREDIT_SUM_LIMIT", "mean"),
            amt_credit_sum_limit_std=("AMT_CREDIT_SUM_LIMIT", "std"),
            amt_credit_sum_limit_median=("AMT_CREDIT_SUM_LIMIT", "median"),
            amt_credit_sum_limit_range=(
                "AMT_CREDIT_SUM_LIMIT",
                lambda x: x.max() - x.min(),
            ),
            amt_credit_sum_limit_sum=("AMT_CREDIT_SUM_LIMIT", "sum"),
            amt_credit_sum_overdue_min=("AMT_CREDIT_SUM_OVERDUE", "min"),
            amt_credit_sum_overdue_max=("AMT_CREDIT_SUM_OVERDUE", "max"),
            amt_credit_sum_overdue_mean=("AMT_CREDIT_SUM_OVERDUE", "mean"),
            amt_credit_sum_overdue_std=("AMT_CREDIT_SUM_OVERDUE", "std"),
            amt_credit_sum_overdue_median=("AMT_CREDIT_SUM_OVERDUE", "median"),
            amt_credit_sum_overdue_range=(
                "AMT_CREDIT_SUM_OVERDUE",
                lambda x: x.max() - x.min(),
            ),
            amt_credit_sum_overdue_sum=("AMT_CREDIT_SUM_OVERDUE", "sum"),
            mode_credit_type=(
                "CREDIT_TYPE",
                lambda x: x.mode().iloc[0] if not x.empty else None,
            ),
            n_different_credit_types=("CREDIT_TYPE", "nunique"),
            n_consumer_credits=(
                "CREDIT_TYPE",
                lambda x: (x == "Consumer credit").sum(),
            ),
            n_credit_card_credits=("CREDIT_TYPE", lambda x: (x == "Credit card").sum()),
            n_car_loans=("CREDIT_TYPE", lambda x: (x == "Car loan").sum()),
            n_mortgages=("CREDIT_TYPE", lambda x: (x == "Mortgage").sum()),
            n_microloans=("CREDIT_TYPE", lambda x: (x == "Microloan").sum()),
            n_other_type_credit=("CREDIT_TYPE", lambda x: (x == "Other").sum()),
            days_credit_update_min=("DAYS_CREDIT_UPDATE", "min"),
            days_credit_update_max=("DAYS_CREDIT_UPDATE", "max"),
            days_credit_update_mean=("DAYS_CREDIT_UPDATE", "mean"),
            days_credit_update_std=("DAYS_CREDIT_UPDATE", "std"),
            days_credit_update_median=("DAYS_CREDIT_UPDATE", "median"),
            days_credit_update_range=(
                "DAYS_CREDIT_UPDATE",
                lambda x: x.max() - x.min(),
            ),
            amt_annuity_min=("AMT_ANNUITY", "min"),
            amt_annuity_max=("AMT_ANNUITY", "max"),
            amt_annuity_mean=("AMT_ANNUITY", "mean"),
            amt_annuity_std=("AMT_ANNUITY", "std"),
            amt_annuity_median=("AMT_ANNUITY", "median"),
            amt_annuity_range=("AMT_ANNUITY", lambda x: x.max() - x.min()),
        )
        .reset_index()
    )

    bureau_aggregated.to_feather(file)

del file
# Time: 17m 16.8s
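
The timing comment above reports about 17 minutes for this aggregation. A likely contributor is the large number of Python-level lambdas, each called once per client. As a minimal sketch (not the method used in this project), the category counts could instead be computed with a single cross-tabulation. The toy table and its values are assumptions made only for illustration:

```python
import pandas as pd

# Toy stand-in for the `bureau` table (values are illustrative only).
toy = pd.DataFrame(
    {
        "SK_ID_CURR": [1, 1, 1, 2, 2],
        "CREDIT_ACTIVE": ["Active", "Closed", "Closed", "Active", "Sold"],
    }
)

# Lambda-based counting, in the style of the aggregation above.
slow = toy.groupby("SK_ID_CURR").agg(
    n_credits_active=("CREDIT_ACTIVE", lambda x: (x == "Active").sum()),
    n_credits_closed=("CREDIT_ACTIVE", lambda x: (x == "Closed").sum()),
)

# Vectorized counting: one cross-tabulation instead of one lambda call per group.
counts = pd.crosstab(toy["SK_ID_CURR"], toy["CREDIT_ACTIVE"])
fast = counts.rename(
    columns={"Active": "n_credits_active", "Closed": "n_credits_closed"}
)[["n_credits_active", "n_credits_closed"]]
```

Both approaches give the same counts on the toy data. Whether the cross-tabulation noticeably shortens the run time on the full table would need to be measured.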

Code

bureau_aggregated.shape

(305811, 97)

Code

bureau_aggregated.head()

	SK_ID_CURR	n_credits_total	n_credits_active	n_credits_closed	mode_credit_currency	n_different_currencies	n_currency_1	days_credit_min	days_credit_max	days_credit_mean	days_credit_std	days_credit_median	days_credit_range	days_credit_enddate_min	days_credit_enddate_max	days_credit_enddate_mean	days_credit_enddate_std	days_credit_enddate_median	days_credit_enddate_range	days_enddate_fact_min	days_enddate_fact_max	days_enddate_fact_mean	days_enddate_fact_std	days_enddate_fact_median	days_enddate_fact_range	amt_credit_max_overdue_min	amt_credit_max_overdue_max	amt_credit_max_overdue_mean	amt_credit_max_overdue_std	amt_credit_max_overdue_median	amt_credit_max_overdue_range	amt_credit_sum_min	amt_credit_sum_max	amt_credit_sum_mean	amt_credit_sum_std	amt_credit_sum_median	amt_credit_sum_range	amt_credit_sum_sum	amt_credit_sum_debt_max	amt_credit_sum_debt_mean	amt_credit_sum_debt_std	amt_credit_sum_debt_median	amt_credit_sum_debt_range	amt_credit_sum_debt_sum	amt_credit_sum_limit_max	amt_credit_sum_limit_mean	amt_credit_sum_limit_std	amt_credit_sum_limit_range	amt_credit_sum_limit_sum	mode_credit_type	n_different_credit_types	n_consumer_credits	n_credit_card_credits	days_credit_update_min	days_credit_update_max	days_credit_update_mean	days_credit_update_std	days_credit_update_median	days_credit_update_range	amt_annuity_min	amt_annuity_max	amt_annuity_mean	amt_annuity_std	amt_annuity_median	amt_annuity_range
0	100001	7	3	4	currency 1	1	7	-1572	-49	-735.00	489.94	-857.00	1523	-1329	1778	82.43	1032.86	-179.00	3107	-1328	-544	-825.50	369.08	-715.00	784	NaN	NaN	NaN	NaN	NaN	NaN	85500.00	378000.00	207623.57	122544.54	168345.00	292500.00	1453365.00	373239.00	85240.93	137485.63	0.00	373239.00	596686.50	0.00	0.00	0.00	0.00	0.00	Consumer credit	1	7	0	-155	-6	-93.14	77.20	-155.00	149	0.00	10822.50	3545.36	4800.61	0.00	10822.50
1	100002	8	2	6	currency 1	1	8	-1437	-103	-874.00	431.45	-1042.50	1334	-1072	780	-349.00	767.49	-424.50	1852	-1185	-36	-697.50	515.99	-939.00	1149	0.00	5043.65	1681.03	2363.25	40.50	5043.65	0.00	450000.00	108131.95	146075.56	54130.50	450000.00	865055.56	245781.00	49156.20	109916.60	0.00	245781.00	245781.00	31988.56	7997.14	15994.28	31988.56	31988.56	Consumer credit	2	4	4	-1185	-7	-499.88	518.52	-402.50	1178	0.00	0.00	0.00	0.00	0.00	0.00
2	100003	4	1	3	currency 1	1	4	-2586	-606	-1400.75	909.83	-1205.50	1980	-2434	1216	-544.50	1492.77	-480.00	3650	-2131	-540	-1097.33	896.10	-621.00	1591	0.00	0.00	0.00	0.00	0.00	0.00	22248.00	810000.00	254350.12	372269.47	92576.25	787752.00	1017400.50	0.00	0.00	0.00	0.00	0.00	0.00	810000.00	202500.00	405000.00	810000.00	810000.00	Consumer credit	2	2	2	-2131	-43	-816.00	908.05	-545.00	2088	NaN	NaN	NaN	NaN	NaN	NaN
3	100004	2	0	2	currency 1	1	2	-1326	-408	-867.00	649.12	-867.00	918	-595	-382	-488.50	150.61	-488.50	213	-683	-382	-532.50	212.84	-532.50	301	0.00	0.00	0.00	NaN	0.00	0.00	94500.00	94537.80	94518.90	26.73	94518.90	37.80	189037.80	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	Consumer credit	1	2	0	-682	-382	-532.00	212.13	-532.00	300	NaN	NaN	NaN	NaN	NaN	NaN
4	100005	3	2	1	currency 1	1	3	-373	-62	-190.67	162.30	-137.00	311	-128	1324	439.33	776.27	122.00	1452	-123	-123	-123.00	<NA>	-123.00	0	0.00	0.00	0.00	NaN	0.00	0.00	29826.00	568800.00	219042.00	303238.43	58500.00	538974.00	657126.00	543087.00	189469.50	306503.34	25321.50	543087.00	568408.50	0.00	0.00	0.00	0.00	0.00	Consumer credit	2	2	1	-121	-11	-54.33	58.59	-31.00	110	0.00	4261.50	1420.50	2460.38	0.00	4261.50

Code

an.col_info(bureau_aggregated, style=True)

	column	data_type	memory_size	n_unique	p_unique	n_missing	p_missing	n_dominant	p_dominant	p_dom_excl_na	dominant
1	SK_ID_CURR	int64	2.4 MB	305,811	100.0%	0	0%	1	<0.1%	<0.1%	100001
2	n_credits_total	int64	2.4 MB	64	<0.1%	0	0%	41,520	13.6%	13.6%	1
3	n_credits_active	int64	2.4 MB	23	<0.1%	0	0%	85,488	28.0%	28.0%	1
4	n_credits_closed	int64	2.4 MB	57	<0.1%	0	0%	61,695	20.2%	20.2%	1
5	n_credits_bad_debt	int64	2.4 MB	2	<0.1%	0	0%	305,790	>99.9%	>99.9%	0
6	n_credits_sold	int64	2.4 MB	8	<0.1%	0	0%	299,790	98.0%	98.0%	0
7	mode_credit_currency	object	20.5 MB	3	<0.1%	0	0%	305,759	>99.9%	>99.9%	currency 1
8	n_different_currencies	int64	2.4 MB	3	<0.1%	0	0%	304,739	99.6%	99.6%	1
9	n_currency_1	int64	2.4 MB	65	<0.1%	0	0%	41,555	13.6%	13.6%	1
10	n_currency_2	int64	2.4 MB	7	<0.1%	0	0%	304,841	99.7%	99.7%	0
11	n_currency_3	int64	2.4 MB	4	<0.1%	0	0%	305,654	99.9%	99.9%	0
12	n_currency_4	int64	2.4 MB	2	<0.1%	0	0%	305,801	>99.9%	>99.9%	0
13	days_credit_min	int64	2.4 MB	2,922	1.0%	0	0%	323	0.1%	0.1%	-2871
14	days_credit_max	int64	2.4 MB	2,923	1.0%	0	0%	751	0.2%	0.2%	-91
15	days_credit_mean	float64	2.4 MB	69,801	22.8%	0	0%	94	<0.1%	<0.1%	-441.0
16	days_credit_std	float64	2.4 MB	219,744	71.9%	41,520	13.6%	2,219	0.7%	0.8%	0.0
17	days_credit_median	float64	2.4 MB	5,774	1.9%	0	0%	195	0.1%	0.1%	-911.0
18	days_credit_range	int64	2.4 MB	2,917	1.0%	0	0%	43,739	14.3%	14.3%	0
19	days_credit_overdue_min	int64	2.4 MB	95	<0.1%	0	0%	305,650	99.9%	99.9%	0
20	days_credit_overdue_max	int64	2.4 MB	917	0.3%	0	0%	301,947	98.7%	98.7%	0
21	days_credit_overdue_mean	float64	2.4 MB	1,657	0.5%	0	0%	301,947	98.7%	98.7%	0.0
22	days_credit_overdue_std	float64	2.4 MB	2,150	0.7%	41,520	13.6%	260,586	85.2%	98.6%	0.0
23	days_credit_overdue_median	float64	2.4 MB	227	0.1%	0	0%	305,305	99.8%	99.8%	0.0
24	days_credit_overdue_range	int64	2.4 MB	895	0.3%	0	0%	302,106	98.8%	98.8%	0
25	days_credit_enddate_min	Int64	2.8 MB	7,154	2.3%	2,585	0.8%	191	0.1%	0.1%	-2359
26	days_credit_enddate_max	Int64	2.8 MB	13,537	4.4%	2,585	0.8%	279	0.1%	0.1%	31060
27	days_credit_enddate_mean	Float64	2.8 MB	108,600	35.5%	2,585	0.8%	70	<0.1%	<0.1%	-99.0
28	days_credit_enddate_std	Float64	2.8 MB	219,344	71.7%	46,899	15.3%	2,242	0.7%	0.9%	0.0
29	days_credit_enddate_median	Float64	2.8 MB	15,834	5.2%	2,585	0.8%	181	0.1%	0.1%	0.0
30	days_credit_enddate_range	Int64	2.8 MB	19,727	6.5%	2,585	0.8%	46,556	15.2%	15.4%	0
31	days_enddate_fact_min	Int64	2.8 MB	2,917	1.0%	37,656	12.3%	191	0.1%	0.1%	-2353
32	days_enddate_fact_max	Int64	2.8 MB	2,816	0.9%	37,656	12.3%	559	0.2%	0.2%	-35
33	days_enddate_fact_mean	Float64	2.8 MB	45,759	15.0%	37,656	12.3%	106	<0.1%	<0.1%	-448.0
34	days_enddate_fact_std	Float64	2.8 MB	154,869	50.6%	99,195	32.4%	1,591	0.5%	0.8%	0.0
35	days_enddate_fact_median	Float64	2.8 MB	5,425	1.8%	37,656	12.3%	199	0.1%	0.1%	-525.0
36	days_enddate_fact_range	Int64	2.8 MB	2,818	0.9%	37,656	12.3%	63,130	20.6%	23.5%	0
37	amt_credit_max_overdue_min	float64	2.4 MB	15,665	5.1%	92,840	30.4%	192,372	62.9%	90.3%	0.0
38	amt_credit_max_overdue_max	float64	2.4 MB	50,443	16.5%	92,840	30.4%	132,669	43.4%	62.3%	0.0
39	amt_credit_max_overdue_mean	float64	2.4 MB	62,613	20.5%	92,840	30.4%	132,669	43.4%	62.3%	0.0
40	amt_credit_max_overdue_std	float64	2.4 MB	56,744	18.6%	169,242	55.3%	71,930	23.5%	52.7%	0.0
41	amt_credit_max_overdue_median	float64	2.4 MB	33,054	10.8%	92,840	30.4%	166,468	54.4%	78.2%	0.0
42	amt_credit_max_overdue_range	float64	2.4 MB	42,074	13.8%	92,840	30.4%	148,332	48.5%	69.6%	0.0
43	cnt_credit_prolong_min	int64	2.4 MB	7	<0.1%	0	0%	305,499	99.9%	99.9%	0
44	cnt_credit_prolong_max	int64	2.4 MB	10	<0.1%	0	0%	297,015	97.1%	97.1%	0
45	cnt_credit_prolong_mean	float64	2.4 MB	111	<0.1%	0	0%	297,015	97.1%	97.1%	0.0
46	cnt_credit_prolong_std	float64	2.4 MB	262	0.1%	41,520	13.6%	255,807	83.6%	96.8%	0.0
47	cnt_credit_prolong_median	float64	2.4 MB	9	<0.1%	0	0%	304,960	99.7%	99.7%	0.0
48	cnt_credit_prolong_range	int64	2.4 MB	10	<0.1%	0	0%	297,327	97.2%	97.2%	0
49	cnt_credit_prolong_sum	int64	2.4 MB	10	<0.1%	0	0%	297,015	97.1%	97.1%	0
50	amt_credit_sum_min	float64	2.4 MB	61,581	20.1%	2	<0.1%	50,710	16.6%	16.6%	0.0
51	amt_credit_sum_max	float64	2.4 MB	73,784	24.1%	2	<0.1%	10,288	3.4%	3.4%	450000.0
52	amt_credit_sum_mean	float64	2.4 MB	241,361	78.9%	2	<0.1%	1,534	0.5%	0.5%	225000.0
53	amt_credit_sum_std	float64	2.4 MB	243,565	79.6%	41,521	13.6%	1,924	0.6%	0.7%	0.0
54	amt_credit_sum_median	float64	2.4 MB	114,504	37.4%	2	<0.1%	8,249	2.7%	2.7%	225000.0
55	amt_credit_sum_range	float64	2.4 MB	144,282	47.2%	2	<0.1%	43,443	14.2%	14.2%	0.0
56	amt_credit_sum_sum	float64	2.4 MB	236,430	77.3%	0	0%	1,513	0.5%	0.5%	225000.0
57	amt_credit_sum_debt_min	float64	2.4 MB	31,581	10.3%	8,372	2.7%	259,741	84.9%	87.3%	0.0
58	amt_credit_sum_debt_max	float64	2.4 MB	157,148	51.4%	8,372	2.7%	81,812	26.8%	27.5%	0.0
59	amt_credit_sum_debt_mean	float64	2.4 MB	195,128	63.8%	8,372	2.7%	80,654	26.4%	27.1%	0.0
60	amt_credit_sum_debt_std	float64	2.4 MB	191,562	62.6%	56,661	18.5%	52,345	17.1%	21.0%	0.0
61	amt_credit_sum_debt_median	float64	2.4 MB	73,411	24.0%	8,372	2.7%	201,540	65.9%	67.8%	0.0
62	amt_credit_sum_debt_range	float64	2.4 MB	150,306	49.1%	8,372	2.7%	100,634	32.9%	33.8%	0.0
63	amt_credit_sum_debt_sum	float64	2.4 MB	176,861	57.8%	0	0%	89,026	29.1%	29.1%	0.0
64	amt_credit_sum_limit_min	float64	2.4 MB	3,544	1.2%	25,308	8.3%	276,325	90.4%	98.5%	0.0
65	amt_credit_sum_limit_max	float64	2.4 MB	39,697	13.0%	25,308	8.3%	224,051	73.3%	79.9%	0.0
66	amt_credit_sum_limit_mean	float64	2.4 MB	44,905	14.7%	25,308	8.3%	223,992	73.2%	79.9%	0.0
67	amt_credit_sum_limit_std	float64	2.4 MB	43,893	14.4%	84,010	27.5%	168,698	55.2%	76.1%	0.0
68	amt_credit_sum_limit_median	float64	2.4 MB	9,814	3.2%	25,308	8.3%	268,175	87.7%	95.6%	0.0
69	amt_credit_sum_limit_range	float64	2.4 MB	37,439	12.2%	25,308	8.3%	227,400	74.4%	81.1%	0.0
70	amt_credit_sum_limit_sum	float64	2.4 MB	42,987	14.1%	0	0%	249,300	81.5%	81.5%	0.0
71	amt_credit_sum_overdue_min	float64	2.4 MB	116	<0.1%	0	0%	305,649	99.9%	99.9%	0.0
72	amt_credit_sum_overdue_max	float64	2.4 MB	1,350	0.4%	0	0%	302,010	98.8%	98.8%	0.0
73	amt_credit_sum_overdue_mean	float64	2.4 MB	2,081	0.7%	0	0%	302,010	98.8%	98.8%	0.0
74	amt_credit_sum_overdue_std	float64	2.4 MB	2,399	0.8%	41,520	13.6%	260,648	85.2%	98.6%	0.0
75	amt_credit_sum_overdue_median	float64	2.4 MB	295	0.1%	0	0%	305,321	99.8%	99.8%	0.0
76	amt_credit_sum_overdue_range	float64	2.4 MB	1,312	0.4%	0	0%	302,168	98.8%	98.8%	0.0
77	amt_credit_sum_overdue_sum	float64	2.4 MB	1,369	0.4%	0	0%	302,010	98.8%	98.8%	0.0
78	mode_credit_type	object	21.8 MB	6	<0.1%	0	0%	266,665	87.2%	87.2%	Consumer credit
79	n_different_credit_types	int64	2.4 MB	5	<0.1%	0	0%	166,664	54.5%	54.5%	2
80	n_consumer_credits	int64	2.4 MB	57	<0.1%	0	0%	55,195	18.0%	18.0%	1
81	n_credit_card_credits	int64	2.4 MB	22	<0.1%	0	0%	105,846	34.6%	34.6%	0
82	n_car_loans	int64	2.4 MB	11	<0.1%	0	0%	283,015	92.5%	92.5%	0
83	n_mortgages	int64	2.4 MB	8	<0.1%	0	0%	288,957	94.5%	94.5%	0
84	n_microloans	int64	2.4 MB	37	<0.1%	0	0%	301,246	98.5%	98.5%	0
85	n_other_type_credit	int64	2.4 MB	10	<0.1%	0	0%	302,296	98.9%	98.9%	0
86	days_credit_update_min	int64	2.4 MB	2,971	1.0%	0	0%	925	0.3%	0.3%	-18
87	days_credit_update_max	int64	2.4 MB	2,694	0.9%	0	0%	12,598	4.1%	4.1%	-7
88	days_credit_update_mean	float64	2.4 MB	59,481	19.5%	0	0%	872	0.3%	0.3%	-14.0
89	days_credit_update_std	float64	2.4 MB	215,765	70.6%	41,520	13.6%	2,983	1.0%	1.1%	0.0
90	days_credit_update_median	float64	2.4 MB	4,968	1.6%	0	0%	1,841	0.6%	0.6%	-18.0
91	days_credit_update_range	int64	2.4 MB	2,951	1.0%	0	0%	44,503	14.6%	14.6%	0
92	amt_annuity_min	float64	2.4 MB	15,274	5.0%	187,587	61.3%	83,584	27.3%	70.7%	0.0
93	amt_annuity_max	float64	2.4 MB	30,558	10.0%	187,587	61.3%	28,057	9.2%	23.7%	0.0
94	amt_annuity_mean	float64	2.4 MB	58,097	19.0%	187,587	61.3%	28,057	9.2%	23.7%	0.0
95	amt_annuity_std	float64	2.4 MB	58,197	19.0%	213,412	69.8%	25,973	8.5%	28.1%	0.0
96	amt_annuity_median	float64	2.4 MB	27,073	8.9%	187,587	61.3%	53,165	17.4%	45.0%	0.0
97	amt_annuity_range	float64	2.4 MB	27,081	8.9%	187,587	61.3%	51,798	16.9%	43.8%	0.0
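
Two patterns in the table above follow from how pandas aggregates small or empty groups. First, `days_credit_std` has 41,520 missing values, which matches the 41,520 clients with `n_credits_total` equal to 1: the sample standard deviation is undefined for a single observation. Second, `amt_credit_sum_sum` has no missing values although `amt_credit_sum_min` has 2, because `sum` returns 0 for a group that contains only missing values. The sketch below reproduces both effects on toy data (all values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Client 1 has two credits, client 2 has one, client 3 has one with a missing amount.
toy = pd.DataFrame(
    {
        "SK_ID_CURR": [1, 1, 2, 3],
        "AMT_CREDIT_SUM": [100.0, 300.0, 50.0, np.nan],
    }
)

stats = toy.groupby("SK_ID_CURR").agg(
    n_credits_total=("AMT_CREDIT_SUM", "size"),
    amt_credit_sum_min=("AMT_CREDIT_SUM", "min"),
    amt_credit_sum_std=("AMT_CREDIT_SUM", "std"),  # NaN for single-row groups
    amt_credit_sum_sum=("AMT_CREDIT_SUM", "sum"),  # 0 for all-missing groups
)

# Optional variant: keep an all-missing group as NaN instead of 0.
strict_sum = toy.groupby("SK_ID_CURR")["AMT_CREDIT_SUM"].sum(min_count=1)
```

If a sum of 0 should not be confused with "no data", `min_count=1` is one way to keep the distinction. Whether that matters for the models in this project is not assessed here.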

5.2 Table `bureau_balance`

Code

bureau_balance.head()

	SK_ID_BUREAU	MONTHS_BALANCE	STATUS
0	5715448	0	C
1	5715448	-1	C
2	5715448	-2	C
3	5715448	-3	C
4	5715448	-4	C

The following steps are applied to table `bureau_balance`:

Add the client identifier SK_ID_CURR (taken from table `bureau`);
Remove rows whose SK_ID_BUREAU values are not present in table `bureau`.

Code

bureau_balance_relevant = pd.merge(
    bureau[["SK_ID_CURR", "SK_ID_BUREAU"]].drop_duplicates(),
    bureau_balance,
    on="SK_ID_BUREAU",
    how="inner",
)

Code

n_bureau_total = bureau["SK_ID_BUREAU"].nunique()
n_bureau_only = len(set(bureau["SK_ID_BUREAU"]) - set(bureau_balance["SK_ID_BUREAU"]))
n_balance_total = bureau_balance["SK_ID_BUREAU"].nunique()
n_balance_only = len(set(bureau_balance["SK_ID_BUREAU"]) - set(bureau["SK_ID_BUREAU"]))
n_common = len(
    set(bureau_balance["SK_ID_BUREAU"]).intersection(set(bureau["SK_ID_BUREAU"]))
)

Code

print(
    "Number of unique SK_ID_BUREAU values:\n",
    f"{n_bureau_total:8.0f} - in `bureau` table (total);\n",
    f"{n_bureau_only:8.0f} - in `bureau` but not in `bureau_balance`;\n",
    f"{n_balance_total:8.0f} - in `bureau_balance` table (total);\n",
    f"{n_balance_only:8.0f} - in `bureau_balance` but not in `bureau`;\n",
    f"{n_common:8.0f} - common in `bureau` and `bureau_balance`.\n",
)

Number of unique SK_ID_BUREAU values:
  1716428 - in `bureau` table (total);
   942074 - in `bureau` but not in `bureau_balance`;
   817395 - in `bureau_balance` table (total);
    43041 - in `bureau_balance` but not in `bureau`;
   774354 - common in `bureau` and `bureau_balance`.
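
The counts above are internally consistent: 1716428 - 942074 = 774354 and 817395 - 43041 = 774354. As an alternative to set arithmetic, the same three groups can be obtained from one outer merge with `indicator=True`. This is only a sketch on toy identifiers (the ID values below are made up):

```python
import pandas as pd

# Toy ID sets (illustrative values, not taken from the data).
bureau_ids = pd.DataFrame({"SK_ID_BUREAU": [1, 2, 3, 4]})
balance_ids = pd.DataFrame({"SK_ID_BUREAU": [3, 4, 4, 5]})

# Outer merge of the unique IDs; `_merge` records where each ID was found.
overlap = pd.merge(
    bureau_ids.drop_duplicates(),
    balance_ids.drop_duplicates(),
    on="SK_ID_BUREAU",
    how="outer",
    indicator=True,
)

# `_merge` takes the values "left_only", "right_only" and "both".
overlap_counts = overlap["_merge"].value_counts()
```

Here `left_only` corresponds to IDs found only in the first table, `right_only` to IDs found only in the second, and `both` to the common IDs.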

Code

print("In `application`:")
print(application[["SK_ID_CURR"]].nunique())

print("\nIn `bureau`:")
print(bureau[["SK_ID_CURR", "SK_ID_BUREAU"]].nunique())

print("\nIn `bureau_balance_relevant`:")
print(bureau_balance_relevant[["SK_ID_CURR"]].nunique())

In `application`:
SK_ID_CURR    307511
dtype: int64

In `bureau`:
SK_ID_CURR       305811
SK_ID_BUREAU    1716428
dtype: int64

In `bureau_balance_relevant`:
SK_ID_CURR    134542
dtype: int64

Code

file = dir_interim + "aggregated--bureau_balance_aggregated.feather"

if os.path.exists(file):
    bureau_balance_aggregated = pd.read_feather(file)
else:
    bureau_balance_aggregated_1 = (
        bureau_balance_relevant.groupby("SK_ID_CURR")
        .agg(
            bureau_months_balance_min=("MONTHS_BALANCE", "min"),
            bureau_months_balance_max=("MONTHS_BALANCE", "max"),
        )
        .reset_index()
    )

    bureau_balance_aggregated_2 = (
        bureau_balance_relevant
        # Remove non-numeric status C and X
        .query("STATUS not in ['C', 'X']")
        # drop unused categories and convert STATUS to numeric
        .assign(STATUS=lambda df: df["STATUS"].astype(str).astype(int))
        .groupby("SK_ID_CURR")
        .agg(
            bureau_dpd_status_min=("STATUS", "min"),
            bureau_dpd_status_max=("STATUS", "max"),
            bureau_dpd_status_mean=("STATUS", "mean"),
            bureau_dpd_status_std=("STATUS", "std"),
            bureau_dpd_status_median=("STATUS", "median"),
            bureau_dpd_status_range=("STATUS", lambda x: x.max() - x.min()),
        )
        .reset_index()
    )

    # Merge both parts; the inner join keeps only clients with at least one
    # numeric STATUS value (clients with only 'C'/'X' records are dropped)
    bureau_balance_aggregated = pd.merge(
        bureau_balance_aggregated_1,
        bureau_balance_aggregated_2,
        on="SK_ID_CURR",
        how="inner",
    )

    bureau_balance_aggregated.to_feather(file)

del file

Code

bureau_balance_aggregated.shape

(130773, 9)
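
The aggregated table has 130,773 rows, while `bureau_balance_relevant` contains 134,542 unique SK_ID_CURR values. The difference is presumably explained by the code above: clients whose STATUS values are all "C" or "X" are removed by the query and then do not survive the inner merge. The sketch below shows this effect on toy data and compares it with a left merge. The toy values and the left-merge variant are assumptions for illustration, not part of the project code:

```python
import pandas as pd

# Client 1 has numeric statuses; client 2 has only "C" and "X".
toy = pd.DataFrame(
    {
        "SK_ID_CURR": [1, 1, 2, 2],
        "MONTHS_BALANCE": [0, -1, 0, -1],
        "STATUS": ["0", "1", "C", "X"],
    }
)

months = (
    toy.groupby("SK_ID_CURR")
    .agg(bureau_months_balance_min=("MONTHS_BALANCE", "min"))
    .reset_index()
)

dpd = (
    toy.query("STATUS not in ['C', 'X']")
    .assign(STATUS=lambda df: df["STATUS"].astype(int))
    .groupby("SK_ID_CURR")
    .agg(bureau_dpd_status_max=("STATUS", "max"))
    .reset_index()
)

# Inner merge: client 2 disappears. Left merge: client 2 is kept with NaN.
inner = pd.merge(months, dpd, on="SK_ID_CURR", how="inner")
left = pd.merge(months, dpd, on="SK_ID_CURR", how="left")
```

With a left merge, the month-range features would remain available for every client, at the cost of missing values in the DPD status columns.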

Code

bureau_balance_aggregated.head()

	SK_ID_CURR	bureau_months_balance_min	bureau_months_balance_max	bureau_dpd_status_max	bureau_dpd_status_mean	bureau_dpd_status_std	bureau_dpd_status_range
0	100001	-51	0	1	0.03	0.18	1
1	100002	-47	0	1	0.38	0.49	1
2	100005	-12	0	0	0.00	0.00	0
3	100010	-90	-2	0	0.00	0.00	0
4	100013	-68	0	1	0.08	0.28	1

Code

an.col_info(bureau_balance_aggregated, style=True)

	column	data_type	memory_size	n_unique	p_unique	n_missing	p_missing	n_dominant	p_dominant	p_dom_excl_na	dominant
1	SK_ID_CURR	int32	523.1 kB	130,773	100.0%	0	0%	1	<0.1%	<0.1%	100001
2	bureau_months_balance_min	int8	130.8 kB	97	0.1%	0	0%	3,241	2.5%	2.5%	-95
3	bureau_months_balance_max	int8	130.8 kB	91	0.1%	0	0%	126,223	96.5%	96.5%	0
4	bureau_dpd_status_min	int32	523.1 kB	6	<0.1%	0	0%	130,695	99.9%	99.9%	0
5	bureau_dpd_status_max	int32	523.1 kB	6	<0.1%	0	0%	82,251	62.9%	62.9%	0
6	bureau_dpd_status_mean	float64	1.0 MB	5,801	4.4%	0	0%	82,251	62.9%	62.9%	0.0
7	bureau_dpd_status_std	float64	1.0 MB	16,075	12.3%	1,188	0.9%	81,108	62.0%	62.6%	0.0
8	bureau_dpd_status_median	float64	1.0 MB	11	<0.1%	0	0%	129,117	98.7%	98.7%	0.0
9	bureau_dpd_status_range	int32	523.1 kB	6	<0.1%	0	0%	82,296	62.9%	62.9%	0

5.3 Table `previous_application`

Code

previous_application.head()

	SK_ID_PREV	SK_ID_CURR	NAME_CONTRACT_TYPE	AMT_ANNUITY	AMT_APPLICATION	AMT_CREDIT	AMT_DOWN_PAYMENT	AMT_GOODS_PRICE	WEEKDAY_APPR_PROCESS_START	HOUR_APPR_PROCESS_START	FLAG_LAST_APPL_PER_CONTRACT	NFLAG_LAST_APPL_IN_DAY	RATE_DOWN_PAYMENT	RATE_INTEREST_PRIMARY	RATE_INTEREST_PRIVILEGED	NAME_CASH_LOAN_PURPOSE	NAME_CONTRACT_STATUS	DAYS_DECISION	NAME_PAYMENT_TYPE	CODE_REJECT_REASON	NAME_TYPE_SUITE	NAME_CLIENT_TYPE	NAME_GOODS_CATEGORY	NAME_PORTFOLIO	NAME_PRODUCT_TYPE	CHANNEL_TYPE	SELLERPLACE_AREA	NAME_SELLER_INDUSTRY	CNT_PAYMENT	NAME_YIELD_GROUP	PRODUCT_COMBINATION	DAYS_FIRST_DRAWING	DAYS_FIRST_DUE	DAYS_LAST_DUE_1ST_VERSION	DAYS_LAST_DUE	DAYS_TERMINATION	NFLAG_INSURED_ON_APPROVAL
0	2030495	271877	Consumer loans	1730.43	17145.00	17145.00	0.00	17145.00	SATURDAY	15	Y	1	0.00	0.18	0.87	XAP	Approved	-73	Cash through the bank	XAP	NaN	Repeater	Mobile	POS	XNA	Country-wide	35	Connectivity	12.00	middle	POS mobile with interest	365243.00	-42.00	300.00	-42.00	-37.00	0.00
1	2802425	108129	Cash loans	25188.62	607500.00	679671.00	NaN	607500.00	THURSDAY	11	Y	1	NaN	NaN	NaN	XNA	Approved	-164	XNA	XAP	Unaccompanied	Repeater	XNA	Cash	x-sell	Contact center	-1	XNA	36.00	low_action	Cash X-Sell: low	365243.00	-134.00	916.00	365243.00	365243.00	1.00
2	2523466	122040	Cash loans	15060.74	112500.00	136444.50	NaN	112500.00	TUESDAY	11	Y	1	NaN	NaN	NaN	XNA	Approved	-301	Cash through the bank	XAP	Spouse, partner	Repeater	XNA	Cash	x-sell	Credit and cash offices	-1	XNA	12.00	high	Cash X-Sell: high	365243.00	-271.00	59.00	365243.00	365243.00	1.00
3	2819243	176158	Cash loans	47041.33	450000.00	470790.00	NaN	450000.00	MONDAY	7	Y	1	NaN	NaN	NaN	XNA	Approved	-512	Cash through the bank	XAP	NaN	Repeater	XNA	Cash	x-sell	Credit and cash offices	-1	XNA	12.00	middle	Cash X-Sell: middle	365243.00	-482.00	-152.00	-182.00	-177.00	1.00
4	1784265	202054	Cash loans	31924.40	337500.00	404055.00	NaN	337500.00	THURSDAY	9	Y	1	NaN	NaN	NaN	Repairs	Refused	-781	Cash through the bank	HC	NaN	Repeater	XNA	Cash	walk-in	Credit and cash offices	-1	XNA	24.00	high	Cash Street: high	NaN	NaN	NaN	NaN	NaN	NaN

Code

previous_application[["SK_ID_PREV", "SK_ID_CURR"]].nunique()

SK_ID_PREV    1670214
SK_ID_CURR     338857
dtype: int64
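
The aggregations in this section repeat the same statistics (min, max, mean, std, median, range) for many columns. A minimal sketch of how such a specification could be generated programmatically is given below. The helper names `value_range` and `make_agg_spec`, as well as the toy data, are hypothetical and not part of the project code:

```python
import pandas as pd


def value_range(x):
    """Difference between the largest and the smallest value."""
    return x.max() - x.min()


def make_agg_spec(columns, stats=("min", "max", "mean", "std", "median")):
    """Build keyword arguments for named aggregation (hypothetical helper)."""
    spec = {}
    for col in columns:
        prefix = col.lower()
        for stat in stats:
            spec[f"{prefix}_{stat}"] = (col, stat)
        spec[f"{prefix}_range"] = (col, value_range)
    return spec


# Toy data (illustrative values only).
toy = pd.DataFrame(
    {
        "SK_ID_CURR": [1, 1, 2],
        "AMT_ANNUITY": [10.0, 30.0, 5.0],
        "AMT_CREDIT": [100.0, 200.0, 50.0],
    }
)

spec = make_agg_spec(["AMT_ANNUITY", "AMT_CREDIT"])
aggregated = toy.groupby("SK_ID_CURR").agg(**spec).reset_index()
```

The generated column names follow the same pattern as in the hand-written code (for example, `amt_annuity_min` and `amt_annuity_range`). Category counts and other special cases would still need to be listed explicitly.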

Code

file = dir_interim + "aggregated--previous_application_aggregated.feather"

if os.path.exists(file):
    previous_application_aggregated = pd.read_feather(file)

else:
    previous_application_aggregated = (
        previous_application.groupby("SK_ID_CURR")
        .agg(
            n_different_loans=("NAME_CONTRACT_TYPE", "nunique"),
            n_cash_loans=("NAME_CONTRACT_TYPE", lambda x: (x == "Cash loans").sum()),
            n_consumer_loans=(
                "NAME_CONTRACT_TYPE",
                lambda x: (x == "Consumer loans").sum(),
            ),
            n_revolving_loans=(
                "NAME_CONTRACT_TYPE",
                lambda x: (x == "Revolving loans").sum(),
            ),
            amt_annuity_min=("AMT_ANNUITY", "min"),
            amt_annuity_max=("AMT_ANNUITY", "max"),
            amt_annuity_mean=("AMT_ANNUITY", "mean"),
            amt_annuity_std=("AMT_ANNUITY", "std"),
            amt_annuity_median=("AMT_ANNUITY", "median"),
            amt_annuity_range=("AMT_ANNUITY", lambda x: x.max() - x.min()),
            amt_application_min=("AMT_APPLICATION", "min"),
            amt_application_max=("AMT_APPLICATION", "max"),
            amt_application_mean=("AMT_APPLICATION", "mean"),
            amt_application_std=("AMT_APPLICATION", "std"),
            amt_application_median=("AMT_APPLICATION", "median"),
            amt_application_range=("AMT_APPLICATION", lambda x: x.max() - x.min()),
            amt_credit_min=("AMT_CREDIT", "min"),
            amt_credit_max=("AMT_CREDIT", "max"),
            amt_credit_mean=("AMT_CREDIT", "mean"),
            amt_credit_std=("AMT_CREDIT", "std"),
            amt_credit_median=("AMT_CREDIT", "median"),
            amt_credit_range=("AMT_CREDIT", lambda x: x.max() - x.min()),
            amt_down_payment_min=("AMT_DOWN_PAYMENT", "min"),
            amt_down_payment_max=("AMT_DOWN_PAYMENT", "max"),
            amt_down_payment_mean=("AMT_DOWN_PAYMENT", "mean"),
            amt_down_payment_std=("AMT_DOWN_PAYMENT", "std"),
            amt_down_payment_median=("AMT_DOWN_PAYMENT", "median"),
            amt_down_payment_range=("AMT_DOWN_PAYMENT", lambda x: x.max() - x.min()),
            amt_goods_price_min=("AMT_GOODS_PRICE", "min"),
            amt_goods_price_max=("AMT_GOODS_PRICE", "max"),
            amt_goods_price_mean=("AMT_GOODS_PRICE", "mean"),
            amt_goods_price_std=("AMT_GOODS_PRICE", "std"),
            amt_goods_price_median=("AMT_GOODS_PRICE", "median"),
            amt_goods_price_range=("AMT_GOODS_PRICE", lambda x: x.max() - x.min()),
            rate_down_payment_min=("RATE_DOWN_PAYMENT", "min"),
            rate_down_payment_max=("RATE_DOWN_PAYMENT", "max"),
            rate_down_payment_mean=("RATE_DOWN_PAYMENT", "mean"),
            rate_down_payment_std=("RATE_DOWN_PAYMENT", "std"),
            rate_down_payment_median=("RATE_DOWN_PAYMENT", "median"),
            rate_down_payment_range=("RATE_DOWN_PAYMENT", lambda x: x.max() - x.min()),
            # Many missing values
            rate_interest_primary_min=("RATE_INTEREST_PRIMARY", "min"),
            rate_interest_primary_max=("RATE_INTEREST_PRIMARY", "max"),
            rate_interest_primary_mean=("RATE_INTEREST_PRIMARY", "mean"),
            rate_interest_primary_std=("RATE_INTEREST_PRIMARY", "std"),
            rate_interest_primary_median=("RATE_INTEREST_PRIMARY", "median"),
            rate_interest_primary_range=(
                "RATE_INTEREST_PRIMARY",
                lambda x: x.max() - x.min(),
            ),
            rate_interest_primary_count=("RATE_INTEREST_PRIMARY", "count"),
            # Many missing values
            rate_interest_privileged_min=("RATE_INTEREST_PRIVILEGED", "min"),
            rate_interest_privileged_max=("RATE_INTEREST_PRIVILEGED", "max"),
            rate_interest_privileged_mean=("RATE_INTEREST_PRIVILEGED", "mean"),
            rate_interest_privileged_std=("RATE_INTEREST_PRIVILEGED", "std"),
            rate_interest_privileged_median=("RATE_INTEREST_PRIVILEGED", "median"),
            rate_interest_privileged_range=(
                "RATE_INTEREST_PRIVILEGED",
                lambda x: x.max() - x.min(),
            ),
            rate_interest_privileged_count=("RATE_INTEREST_PRIVILEGED", "count"),
            # NOTE: identical to n_different_loans above (same column, same statistic)
            n_different_contract_types=("NAME_CONTRACT_TYPE", "nunique"),
            n_contract_status_approved=(
                "NAME_CONTRACT_STATUS",
                lambda x: (x == "Approved").sum(),
            ),
            n_contract_status_canceled=(
                "NAME_CONTRACT_STATUS",
                lambda x: (x == "Canceled").sum(),
            ),
            n_contract_status_refused=(
                "NAME_CONTRACT_STATUS",
                lambda x: (x == "Refused").sum(),
            ),
            n_contract_status_unused_offer=(
                "NAME_CONTRACT_STATUS",
                lambda x: (x == "Unused offer").sum(),
            ),
            days_decision_min=("DAYS_DECISION", "min"),
            days_decision_max=("DAYS_DECISION", "max"),
            days_decision_mean=("DAYS_DECISION", "mean"),
            days_decision_std=("DAYS_DECISION", "std"),
            days_decision_median=("DAYS_DECISION", "median"),
            days_decision_range=("DAYS_DECISION", lambda x: x.max() - x.min()),
            n_payment_type_cash_through_bank=(
                "NAME_PAYMENT_TYPE",
                lambda x: (x == "Cash through the bank").sum(),
            ),
            n_payment_type_cash_from_account=(
                "NAME_PAYMENT_TYPE",
                lambda x: (x == "Non-cash from your account").sum(),
            ),
            n_payment_type_not_available=(
                "NAME_PAYMENT_TYPE",
                lambda x: (x == "XNA").sum(),
            ),
            n_reject_reason_not_applicable=(
                "CODE_REJECT_REASON",
                lambda x: (x == "XAP").sum(),
            ),
            n_reject_reason_hc=("CODE_REJECT_REASON", lambda x: (x == "HC").sum()),
            n_reject_reason_limit=(
                "CODE_REJECT_REASON",
                lambda x: (x == "LIMIT").sum(),
            ),
            # NOTE: despite the "scoc" suffix, this counts the reject code "SCO"
            n_reject_reason_scoc=("CODE_REJECT_REASON", lambda x: (x == "SCO").sum()),
            n_reject_reason_client=(
                "CODE_REJECT_REASON",
                lambda x: (x == "CLIENT").sum(),
            ),
            n_reject_reason_scofr=(
                "CODE_REJECT_REASON",
                lambda x: (x == "SCOFR").sum(),
            ),
            n_client_type_new=("NAME_CLIENT_TYPE", lambda x: (x == "New").sum()),
            n_client_type_repeater=(
                "NAME_CLIENT_TYPE",
                lambda x: (x == "Repeater").sum(),
            ),
            n_client_type_refreshed=(
                "NAME_CLIENT_TYPE",
                lambda x: (x == "Refreshed").sum(),
            ),
            n_portfolio_pos=("NAME_PORTFOLIO", lambda x: (x == "POS").sum()),
            n_portfolio_cash=("NAME_PORTFOLIO", lambda x: (x == "Cash").sum()),
            n_portfolio_cards=("NAME_PORTFOLIO", lambda x: (x == "Cards").sum()),
            n_product_type_xsell=("NAME_PRODUCT_TYPE", lambda x: (x == "x-sell").sum()),
            n_product_type_walk_in=(
                "NAME_PRODUCT_TYPE",
                lambda x: (x == "walk-in").sum(),
            ),
            n_different_channels=("CHANNEL_TYPE", "nunique"),
            n_channel_type_credit_and_cash=(
                "CHANNEL_TYPE",
                lambda x: (x == "Credit and cash offices").sum(),
            ),
            n_channel_type_countrywide=(
                "CHANNEL_TYPE",
                lambda x: (x == "Country-wide").sum(),
            ),
            n_channel_type_stone=("CHANNEL_TYPE", lambda x: (x == "Stone").sum()),
            n_channel_type_regional_and_local=(
                "CHANNEL_TYPE",
                lambda x: (x == "Regional / Local").sum(),
            ),
            n_channel_type_contact_center=(
                "CHANNEL_TYPE",
                lambda x: (x == "Contact center").sum(),
            ),
            n_channel_type_ap_minus=(
                "CHANNEL_TYPE",
                lambda x: (x == "AP+ (Cash loan)").sum(),
            ),
            n_channel_type_channel_corporate_sales=(
                "CHANNEL_TYPE",
                lambda x: (x == "Channel of corporate sales").sum(),
            ),
            n_channel_type_car_dealer=(
                "CHANNEL_TYPE",
                lambda x: (x == "Car dealer").sum(),
            ),
            n_cnt_payment_0=("CNT_PAYMENT", lambda x: (x == 0).sum()),
            cnt_payment_min=("CNT_PAYMENT", "min"),
            cnt_payment_max=("CNT_PAYMENT", "max"),
            cnt_payment_mean=("CNT_PAYMENT", "mean"),
            cnt_payment_std=("CNT_PAYMENT", "std"),
            cnt_payment_median=("CNT_PAYMENT", "median"),
            cnt_payment_range=("CNT_PAYMENT", lambda x: x.max() - x.min()),
            n_yield_group_low_action=(
                "NAME_YIELD_GROUP",
                lambda x: (x == "low_action").sum(),
            ),
            n_yield_group_low_normal=(
                "NAME_YIELD_GROUP",
                lambda x: (x == "low_normal").sum(),
            ),
            n_yield_group_middle=("NAME_YIELD_GROUP", lambda x: (x == "middle").sum()),
            n_yield_group_high=("NAME_YIELD_GROUP", lambda x: (x == "high").sum()),
            days_first_draw_min=("DAYS_FIRST_DRAWING", "min"),
            days_first_draw_max=("DAYS_FIRST_DRAWING", "max"),
            days_first_draw_mean=("DAYS_FIRST_DRAWING", "mean"),
            days_first_draw_std=("DAYS_FIRST_DRAWING", "std"),
            days_first_draw_median=("DAYS_FIRST_DRAWING", "median"),
            days_first_draw_range=("DAYS_FIRST_DRAWING", lambda x: x.max() - x.min()),
            days_last_due_1st_version_min=("DAYS_LAST_DUE_1ST_VERSION", "min"),
            days_last_due_1st_version_max=("DAYS_LAST_DUE_1ST_VERSION", "max"),
            days_last_due_1st_version_mean=("DAYS_LAST_DUE_1ST_VERSION", "mean"),
            days_last_due_1st_version_std=("DAYS_LAST_DUE_1ST_VERSION", "std"),
            days_last_due_1st_version_median=("DAYS_LAST_DUE_1ST_VERSION", "median"),
            days_last_due_1st_version_range=(
                "DAYS_LAST_DUE_1ST_VERSION",
                lambda x: x.max() - x.min(),
            ),
            days_last_due_min=("DAYS_LAST_DUE", "min"),
            days_last_due_max=("DAYS_LAST_DUE", "max"),
            days_last_due_mean=("DAYS_LAST_DUE", "mean"),
            days_last_due_std=("DAYS_LAST_DUE", "std"),
            days_last_due_median=("DAYS_LAST_DUE", "median"),
            days_last_due_range=("DAYS_LAST_DUE", lambda x: x.max() - x.min()),
            days_termination_min=("DAYS_TERMINATION", "min"),
            days_termination_max=("DAYS_TERMINATION", "max"),
            days_termination_mean=("DAYS_TERMINATION", "mean"),
            days_termination_std=("DAYS_TERMINATION", "std"),
            days_termination_median=("DAYS_TERMINATION", "median"),
            days_termination_range=("DAYS_TERMINATION", lambda x: x.max() - x.min()),
            n_nflag_insured_on_approval_sum=("NFLAG_INSURED_ON_APPROVAL", "sum"),
            n_nflag_insured_on_approval_mean=("NFLAG_INSURED_ON_APPROVAL", "mean"),
            # Output: bool (True if any previous application was insured)
            n_nflag_insured_on_approval_any=(
                "NFLAG_INSURED_ON_APPROVAL",
                lambda x: x.any(),
            ),
        )
        .reset_index()
    )

    previous_application_aggregated.to_feather(file)

del file
# Time: 37m 56.3s
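
The aggregation above counts each category with a separate Python lambda per group, which probably accounts for much of the long run time. As a minimal sketch of an alternative (not the approach used in this project), `pd.crosstab` produces all per-client category counts of a column in one pass. The toy data frame and its values are made up for illustration.

```python
import pandas as pd

# Toy stand-in for `previous_application` (values are illustrative only)
toy = pd.DataFrame(
    {
        "SK_ID_CURR": [1, 1, 1, 2, 2],
        "NAME_CONTRACT_STATUS": [
            "Approved",
            "Refused",
            "Refused",
            "Approved",
            "Canceled",
        ],
    }
)

# One pass over the column: rows are clients, columns are categories,
# cells are counts (absent combinations are filled with 0)
status_counts = pd.crosstab(toy["SK_ID_CURR"], toy["NAME_CONTRACT_STATUS"])

# The lambda-based equivalent for a single category, as in the code above
lambda_counts = toy.groupby("SK_ID_CURR").agg(
    n_contract_status_refused=(
        "NAME_CONTRACT_STATUS",
        lambda x: (x == "Refused").sum(),
    ),
)
```

The columns of `status_counts` would still need renaming (for example, to `n_contract_status_refused`) before merging with the numeric aggregates.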

Code

previous_application_aggregated.shape

(338857, 130)

Code

previous_application_aggregated.head()

	SK_ID_CURR	n_different_loans	n_cash_loans	n_consumer_loans	amt_annuity_min	amt_annuity_max	amt_annuity_mean	amt_annuity_std	amt_annuity_median	amt_annuity_range	amt_application_min	amt_application_max	amt_application_mean	amt_application_std	amt_application_median	amt_application_range	amt_credit_min	amt_credit_max	amt_credit_mean	amt_credit_std	amt_credit_median	amt_credit_range	amt_down_payment_min	amt_down_payment_max	amt_down_payment_mean	amt_down_payment_std	amt_down_payment_median	amt_down_payment_range	amt_goods_price_min	amt_goods_price_max	amt_goods_price_mean	amt_goods_price_std	amt_goods_price_median	amt_goods_price_range	rate_down_payment_min	rate_down_payment_max	rate_down_payment_mean	rate_down_payment_std	rate_down_payment_median	rate_down_payment_range	rate_interest_primary_min	rate_interest_primary_max	rate_interest_primary_mean	rate_interest_primary_std	rate_interest_primary_median	rate_interest_primary_range	rate_interest_privileged_min	rate_interest_privileged_max	rate_interest_privileged_mean	rate_interest_privileged_std	rate_interest_privileged_median	rate_interest_privileged_range	n_different_contract_types	n_contract_status_approved	n_contract_status_canceled	days_decision_min	days_decision_max	days_decision_mean	days_decision_std	days_decision_median	days_decision_range	n_payment_type_cash_through_bank	n_payment_type_not_available	n_reject_reason_not_applicable	n_client_type_new	n_client_type_repeater	n_client_type_refreshed	n_portfolio_pos	n_portfolio_cash	n_product_type_xsell	n_different_channels	n_channel_type_credit_and_cash	n_channel_type_countrywide	n_channel_type_stone	n_channel_type_regional_and_local	cnt_payment_min	cnt_payment_max	cnt_payment_mean	cnt_payment_std	cnt_payment_median	cnt_payment_range	n_yield_group_low_normal	n_yield_group_middle	n_yield_group_high	days_first_draw_min	days_first_draw_max	days_first_draw_mean	days_first_draw_std	days_first_draw_median	days_last_due_1st_version_min	days_last_due_1st_version_max	days_last_due_1st_version_mean	days_last_due_1st_version_std	days_last_due_1st_version_median	days_last_due_1st_version_range	days_last_due_min	days_last_due_max	days_last_due_mean	days_last_due_std	days_last_due_median	days_last_due_range	days_termination_min	days_termination_max	days_termination_mean	days_termination_std	days_termination_median	days_termination_range	n_nflag_insured_on_approval_sum	n_nflag_insured_on_approval_mean	n_nflag_insured_on_approval_any
0	100001	1	0	1	3951.00	3951.00	3951.00	NaN	3951.00	0.00	24835.50	24835.50	24835.50	NaN	24835.50	0.00	23787.00	23787.00	23787.00	NaN	23787.00	0.00	2520.00	2520.00	2520.00	NaN	2520.00	0.00	24835.50	24835.50	24835.50	NaN	24835.50	0.00	0.10	0.10	0.10	NaN	0.10	0.00	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1	1	0	-1740	-1740	-1740.00	NaN	-1740.00	0	1	0	1	0	0	1	1	0	0	1	0	1	0	0	8.00	8.00	8.00	NaN	8.00	0.00	0	0	1	365243.00	365243.00	365243.00	NaN	365243.00	-1499.00	-1499.00	-1499.00	NaN	-1499.00	0.00	-1619.00	-1619.00	-1619.00	NaN	-1619.00	0.00	-1612.00	-1612.00	-1612.00	NaN	-1612.00	0.00	0.00	0.00	False
1	100002	1	0	1	9251.77	9251.77	9251.77	NaN	9251.77	0.00	179055.00	179055.00	179055.00	NaN	179055.00	0.00	179055.00	179055.00	179055.00	NaN	179055.00	0.00	0.00	0.00	0.00	NaN	0.00	0.00	179055.00	179055.00	179055.00	NaN	179055.00	0.00	0.00	0.00	0.00	NaN	0.00	0.00	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1	1	0	-606	-606	-606.00	NaN	-606.00	0	0	1	1	1	0	0	1	0	0	1	0	0	1	0	24.00	24.00	24.00	NaN	24.00	0.00	1	0	0	365243.00	365243.00	365243.00	NaN	365243.00	125.00	125.00	125.00	NaN	125.00	0.00	-25.00	-25.00	-25.00	NaN	-25.00	0.00	-17.00	-17.00	-17.00	NaN	-17.00	0.00	0.00	0.00	False
2	100003	2	1	2	6737.31	98356.99	56553.99	46332.56	64567.67	91619.68	68809.50	900000.00	435436.50	424161.62	337500.00	831190.50	68053.50	1035882.00	484191.00	497949.86	348637.50	967828.50	0.00	6885.00	3442.50	4868.43	3442.50	6885.00	68809.50	900000.00	435436.50	424161.62	337500.00	831190.50	0.00	0.10	0.05	0.07	0.05	0.10	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2	3	0	-2341	-746	-1305.00	898.14	-828.00	1595	2	1	3	0	1	2	2	1	1	3	1	1	1	0	6.00	12.00	10.00	3.46	12.00	6.00	1	2	0	365243.00	365243.00	365243.00	0.00	365243.00	-1980.00	-386.00	-1004.33	854.97	-647.00	1594.00	-1980.00	-536.00	-1054.33	803.57	-647.00	1444.00	-1976.00	-527.00	-1047.33	806.20	-639.00	1449.00	2.00	0.67	True
3	100004	1	0	1	5357.25	5357.25	5357.25	NaN	5357.25	0.00	24282.00	24282.00	24282.00	NaN	24282.00	0.00	20106.00	20106.00	20106.00	NaN	20106.00	0.00	4860.00	4860.00	4860.00	NaN	4860.00	0.00	24282.00	24282.00	24282.00	NaN	24282.00	0.00	0.21	0.21	0.21	NaN	0.21	0.00	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1	1	0	-815	-815	-815.00	NaN	-815.00	0	1	0	1	1	0	0	1	0	0	1	0	0	0	1	4.00	4.00	4.00	NaN	4.00	0.00	0	1	0	365243.00	365243.00	365243.00	NaN	365243.00	-694.00	-694.00	-694.00	NaN	-694.00	0.00	-724.00	-724.00	-724.00	NaN	-724.00	0.00	-714.00	-714.00	-714.00	NaN	-714.00	0.00	0.00	0.00	False
4	100005	2	1	1	4813.20	4813.20	4813.20	NaN	4813.20	0.00	0.00	44617.50	22308.75	31549.34	22308.75	44617.50	0.00	40153.50	20076.75	28392.81	20076.75	40153.50	4464.00	4464.00	4464.00	NaN	4464.00	0.00	44617.50	44617.50	44617.50	NaN	44617.50	0.00	0.11	0.11	0.11	NaN	0.11	0.00	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2	1	1	-757	-315	-536.00	312.54	-536.00	442	1	1	2	1	1	0	1	0	0	2	1	1	0	0	12.00	12.00	12.00	NaN	12.00	0.00	0	0	1	365243.00	365243.00	365243.00	NaN	365243.00	-376.00	-376.00	-376.00	NaN	-376.00	0.00	-466.00	-466.00	-466.00	NaN	-466.00	0.00	-460.00	-460.00	-460.00	NaN	-460.00	0.00	0.00	0.00	False

Code

an.col_info(previous_application_aggregated, style=True)

	column	data_type	memory_size	n_unique	p_unique	n_missing	p_missing	n_dominant	p_dominant	p_dom_excl_na	dominant
1	SK_ID_CURR	int32	1.4 MB	338,857	100.0%	0	0%	1	<0.1%	<0.1%	100001
2	n_different_loans	int64	2.7 MB	4	<0.1%	0	0%	129,371	38.2%	38.2%	2
3	n_cash_loans	int64	2.7 MB	60	<0.1%	0	0%	138,032	40.7%	40.7%	0
4	n_consumer_loans	int64	2.7 MB	37	<0.1%	0	0%	130,690	38.6%	38.6%	1
5	n_revolving_loans	int64	2.7 MB	28	<0.1%	0	0%	215,412	63.6%	63.6%	0
6	amt_annuity_min	float64	2.7 MB	159,918	47.2%	480	0.1%	27,542	8.1%	8.1%	2250.0
7	amt_annuity_max	float64	2.7 MB	164,390	48.5%	480	0.1%	3,835	1.1%	1.1%	22500.0
8	amt_annuity_mean	float64	2.7 MB	311,139	91.8%	480	0.1%	567	0.2%	0.2%	2250.0
9	amt_annuity_std	float64	2.7 MB	262,010	77.3%	73,917	21.8%	465	0.1%	0.2%	0.0
10	amt_annuity_median	float64	2.7 MB	241,854	71.4%	480	0.1%	2,229	0.7%	0.7%	11250.0
11	amt_annuity_range	float64	2.7 MB	234,308	69.1%	480	0.1%	73,902	21.8%	21.8%	0.0
12	amt_application_min	float64	2.7 MB	39,315	11.6%	0	0%	162,024	47.8%	47.8%	0.0
13	amt_application_max	float64	2.7 MB	53,054	15.7%	0	0%	15,919	4.7%	4.7%	450000.0
14	amt_application_mean	float64	2.7 MB	218,595	64.5%	0	0%	1,105	0.3%	0.3%	0.0
15	amt_application_std	float64	2.7 MB	246,788	72.8%	60,458	17.8%	1,810	0.5%	0.7%	0.0
16	amt_application_median	float64	2.7 MB	85,476	25.2%	0	0%	18,425	5.4%	5.4%	0.0
17	amt_application_range	float64	2.7 MB	72,070	21.3%	0	0%	62,268	18.4%	18.4%	0.0
18	amt_credit_min	float64	2.7 MB	40,983	12.1%	0	0%	136,261	40.2%	40.2%	0.0
19	amt_credit_max	float64	2.7 MB	62,833	18.5%	0	0%	7,875	2.3%	2.3%	450000.0
20	amt_credit_mean	float64	2.7 MB	239,733	70.7%	0	0%	456	0.1%	0.1%	45000.0
21	amt_credit_std	float64	2.7 MB	256,680	75.7%	60,458	17.8%	528	0.2%	0.2%	0.0
22	amt_credit_median	float64	2.7 MB	95,228	28.1%	0	0%	14,039	4.1%	4.1%	0.0
23	amt_credit_range	float64	2.7 MB	97,356	28.7%	0	0%	60,986	18.0%	18.0%	0.0
24	amt_down_payment_min	float64	2.7 MB	13,418	4.0%	20,104	5.9%	209,795	61.9%	65.8%	0.0
25	amt_down_payment_max	float64	2.7 MB	23,434	6.9%	20,104	5.9%	91,088	26.9%	28.6%	0.0
26	amt_down_payment_mean	float64	2.7 MB	59,904	17.7%	20,104	5.9%	91,087	26.9%	28.6%	0.0
27	amt_down_payment_std	float64	2.7 MB	88,724	26.2%	146,201	43.1%	33,106	9.8%	17.2%	0.0
28	amt_down_payment_median	float64	2.7 MB	26,428	7.8%	20,104	5.9%	126,088	37.2%	39.6%	0.0
29	amt_down_payment_range	float64	2.7 MB	23,245	6.9%	20,104	5.9%	159,203	47.0%	49.9%	0.0
30	amt_goods_price_min	float64	2.7 MB	50,839	15.0%	1,064	0.3%	20,036	5.9%	5.9%	45000.0
31	amt_goods_price_max	float64	2.7 MB	53,050	15.7%	1,064	0.3%	15,923	4.7%	4.7%	450000.0
32	amt_goods_price_mean	float64	2.7 MB	211,422	62.4%	1,064	0.3%	1,315	0.4%	0.4%	135000.0
33	amt_goods_price_std	float64	2.7 MB	227,746	67.2%	74,570	22.0%	2,191	0.6%	0.8%	0.0
34	amt_goods_price_median	float64	2.7 MB	91,665	27.1%	1,064	0.3%	7,493	2.2%	2.2%	135000.0
35	amt_goods_price_range	float64	2.7 MB	110,829	32.7%	1,064	0.3%	75,697	22.3%	22.4%	0.0
36	rate_down_payment_min	float32	1.4 MB	69,639	20.6%	20,104	5.9%	209,795	61.9%	65.8%	0.0
37	rate_down_payment_max	float32	1.4 MB	125,138	36.9%	20,104	5.9%	91,088	26.9%	28.6%	0.0
38	rate_down_payment_mean	float32	1.4 MB	184,996	54.6%	20,104	5.9%	91,087	26.9%	28.6%	0.0
39	rate_down_payment_std	float32	1.4 MB	141,668	41.8%	146,201	43.1%	32,914	9.7%	17.1%	0.0
40	rate_down_payment_median	float32	1.4 MB	134,571	39.7%	20,104	5.9%	126,088	37.2%	39.6%	0.0
41	rate_down_payment_range	float32	1.4 MB	113,325	33.4%	20,104	5.9%	159,011	46.9%	49.9%	0.0
42	rate_interest_primary_min	float32	1.4 MB	146	<0.1%	333,136	98.3%	1,161	0.3%	20.3%	0.18913634
43	rate_interest_primary_max	float32	1.4 MB	146	<0.1%	333,136	98.3%	1,185	0.3%	20.7%	0.18913634
44	rate_interest_primary_mean	float32	1.4 MB	210	0.1%	333,136	98.3%	1,146	0.3%	20.0%	0.18913634
45	rate_interest_primary_std	float32	1.4 MB	61	<0.1%	338,639	99.9%	65	<0.1%	29.8%	0.0
46	rate_interest_primary_median	float32	1.4 MB	202	0.1%	333,136	98.3%	1,148	0.3%	20.1%	0.18913634
47	rate_interest_primary_range	float32	1.4 MB	55	<0.1%	333,136	98.3%	5,568	1.6%	97.3%	0.0
48	rate_interest_primary_count	int64	2.7 MB	5	<0.1%	0	0%	333,136	98.3%	98.3%	0
49	rate_interest_privileged_min	float32	1.4 MB	25	<0.1%	333,136	98.3%	1,642	0.5%	28.7%	0.83509517
50	rate_interest_privileged_max	float32	1.4 MB	25	<0.1%	333,136	98.3%	1,669	0.5%	29.2%	0.83509517
51	rate_interest_privileged_mean	float32	1.4 MB	52	<0.1%	333,136	98.3%	1,622	0.5%	28.4%	0.83509517
52	rate_interest_privileged_std	float32	1.4 MB	26	<0.1%	338,639	99.9%	97	<0.1%	44.5%	0.0
53	rate_interest_privileged_median	float32	1.4 MB	45	<0.1%	333,136	98.3%	1,624	0.5%	28.4%	0.83509517
54	rate_interest_privileged_range	float32	1.4 MB	21	<0.1%	333,136	98.3%	5,600	1.7%	97.9%	0.0
55	rate_interest_privileged_count	int64	2.7 MB	5	<0.1%	0	0%	333,136	98.3%	98.3%	0
56	n_different_contract_types	int64	2.7 MB	4	<0.1%	0	0%	129,371	38.2%	38.2%	2
57	n_contract_status_approved	int64	2.7 MB	26	<0.1%	0	0%	88,369	26.1%	26.1%	1
58	n_contract_status_canceled	int64	2.7 MB	40	<0.1%	0	0%	206,163	60.8%	60.8%	0
59	n_contract_status_refused	int64	2.7 MB	47	<0.1%	0	0%	220,580	65.1%	65.1%	0
60	n_contract_status_unused_offer	int64	2.7 MB	13	<0.1%	0	0%	316,778	93.5%	93.5%	0
61	days_decision_min	int16	677.7 kB	2,921	0.9%	0	0%	218	0.1%	0.1%	-476
62	days_decision_max	int16	677.7 kB	2,922	0.9%	0	0%	1,005	0.3%	0.3%	-7
63	days_decision_mean	float64	2.7 MB	65,447	19.3%	0	0%	174	0.1%	0.1%	-355.0
64	days_decision_std	float64	2.7 MB	213,594	63.0%	60,458	17.8%	6,599	1.9%	2.4%	0.0
65	days_decision_median	float64	2.7 MB	5,723	1.7%	0	0%	413	0.1%	0.1%	-364.0
66	days_decision_range	int16	677.7 kB	2,920	0.9%	0	0%	67,057	19.8%	19.8%	0
67	n_payment_type_cash_through_bank	int64	2.7 MB	48	<0.1%	0	0%	91,254	26.9%	26.9%	1
68	n_payment_type_cash_from_account	int64	2.7 MB	1	<0.1%	0	0%	338,857	100.0%	100.0%	0
69	n_payment_type_not_available	int64	2.7 MB	49	<0.1%	0	0%	117,205	34.6%	34.6%	0
70	n_reject_reason_not_applicable	int64	2.7 MB	48	<0.1%	0	0%	72,626	21.4%	21.4%	1
71	n_reject_reason_hc	int64	2.7 MB	38	<0.1%	0	0%	260,046	76.7%	76.7%	0
72	n_reject_reason_limit	int64	2.7 MB	23	<0.1%	0	0%	305,796	90.2%	90.2%	0
73	n_reject_reason_scoc	int64	2.7 MB	21	<0.1%	0	0%	313,863	92.6%	92.6%	0
74	n_reject_reason_client	int64	2.7 MB	13	<0.1%	0	0%	316,778	93.5%	93.5%	0
75	n_reject_reason_scofr	int64	2.7 MB	19	<0.1%	0	0%	330,820	97.6%	97.6%	0
76	n_client_type_new	int64	2.7 MB	20	<0.1%	0	0%	254,343	75.1%	75.1%	1
77	n_client_type_repeater	int64	2.7 MB	66	<0.1%	0	0%	81,285	24.0%	24.0%	0
78	n_client_type_refreshed	int64	2.7 MB	25	<0.1%	0	0%	249,259	73.6%	73.6%	0
79	n_portfolio_pos	int64	2.7 MB	35	<0.1%	0	0%	136,271	40.2%	40.2%	1
80	n_portfolio_cash	int64	2.7 MB	39	<0.1%	0	0%	164,386	48.5%	48.5%	0
81	n_portfolio_cards	int64	2.7 MB	22	<0.1%	0	0%	222,739	65.7%	65.7%	0
82	n_product_type_xsell	int64	2.7 MB	35	<0.1%	0	0%	161,249	47.6%	47.6%	0
83	n_product_type_walk_in	int64	2.7 MB	34	<0.1%	0	0%	253,278	74.7%	74.7%	0
84	n_different_channels	int64	2.7 MB	7	<0.1%	0	0%	131,033	38.7%	38.7%	2
85	n_channel_type_credit_and_cash	int64	2.7 MB	58	<0.1%	0	0%	158,669	46.8%	46.8%	0
86	n_channel_type_countrywide	int64	2.7 MB	36	<0.1%	0	0%	112,423	33.2%	33.2%	1
87	n_channel_type_stone	int64	2.7 MB	25	<0.1%	0	0%	202,971	59.9%	59.9%	0
88	n_channel_type_regional_and_local	int64	2.7 MB	20	<0.1%	0	0%	263,217	77.7%	77.7%	0
89	n_channel_type_contact_center	int64	2.7 MB	23	<0.1%	0	0%	290,870	85.8%	85.8%	0
90	n_channel_type_ap_minus	int64	2.7 MB	36	<0.1%	0	0%	312,654	92.3%	92.3%	0
91	n_channel_type_channel_corporate_sales	int64	2.7 MB	23	<0.1%	0	0%	336,352	99.3%	99.3%	0
92	n_channel_type_car_dealer	int64	2.7 MB	6	<0.1%	0	0%	338,506	99.9%	99.9%	0
93	n_cnt_payment_0	int64	2.7 MB	22	<0.1%	0	0%	222,739	65.7%	65.7%	0
94	cnt_payment_min	float32	1.4 MB	33	<0.1%	478	0.1%	116,118	34.3%	34.3%	0.0
95	cnt_payment_max	float32	1.4 MB	44	<0.1%	478	0.1%	87,946	26.0%	26.0%	12.0
96	cnt_payment_mean	float32	1.4 MB	3,000	0.9%	478	0.1%	41,964	12.4%	12.4%	12.0
97	cnt_payment_std	float32	1.4 MB	19,288	5.7%	73,917	21.8%	16,773	4.9%	6.3%	0.0
98	cnt_payment_median	float32	1.4 MB	92	<0.1%	478	0.1%	90,394	26.7%	26.7%	12.0
99	cnt_payment_range	float32	1.4 MB	69	<0.1%	478	0.1%	90,212	26.6%	26.7%	0.0
100	n_yield_group_low_action	int64	2.7 MB	24	<0.1%	0	0%	271,402	80.1%	80.1%	0
101	n_yield_group_low_normal	int64	2.7 MB	26	<0.1%	0	0%	156,824	46.3%	46.3%	0
102	n_yield_group_middle	int64	2.7 MB	29	<0.1%	0	0%	131,496	38.8%	38.8%	0
103	n_yield_group_high	int64	2.7 MB	31	<0.1%	0	0%	149,675	44.2%	44.2%	0
104	days_first_draw_min	float32	1.4 MB	2,838	0.8%	1,517	0.4%	274,837	81.1%	81.5%	365243.0
105	days_first_draw_max	float32	1.4 MB	1,118	0.3%	1,517	0.4%	334,702	98.8%	99.2%	365243.0
106	days_first_draw_mean	float32	1.4 MB	17,379	5.1%	1,517	0.4%	274,837	81.1%	81.5%	365243.0
107	days_first_draw_std	float64	2.7 MB	17,027	5.0%	93,408	27.6%	185,577	54.8%	75.6%	0.0
108	days_first_draw_median	float32	1.4 MB	3,264	1.0%	1,517	0.4%	322,224	95.1%	95.5%	365243.0
109	days_first_draw_range	float32	1.4 MB	2,844	0.8%	1,517	0.4%	277,468	81.9%	82.3%	0.0
110	days_last_due_1st_version_min	float32	1.4 MB	4,222	1.2%	1,517	0.4%	2,920	0.9%	0.9%	365243.0
111	days_last_due_1st_version_max	float32	1.4 MB	4,560	1.3%	1,517	0.4%	93,314	27.5%	27.7%	365243.0
112	days_last_due_1st_version_mean	float32	1.4 MB	65,986	19.5%	1,517	0.4%	2,920	0.9%	0.9%	365243.0
113	days_last_due_1st_version_std	float64	2.7 MB	169,901	50.1%	93,408	27.6%	76	<0.1%	<0.1%	241.83051916579925
114	days_last_due_1st_version_median	float32	1.4 MB	11,486	3.4%	1,517	0.4%	2,966	0.9%	0.9%	365243.0
115	days_last_due_1st_version_range	float32	1.4 MB	8,111	2.4%	1,517	0.4%	91,940	27.1%	27.3%	0.0
116	days_last_due_min	float32	1.4 MB	2,873	0.8%	1,517	0.4%	23,418	6.9%	6.9%	365243.0
117	days_last_due_max	float32	1.4 MB	2,793	0.8%	1,517	0.4%	164,234	48.5%	48.7%	365243.0
118	days_last_due_mean	float32	1.4 MB	66,778	19.7%	1,517	0.4%	23,418	6.9%	6.9%	365243.0
119	days_last_due_std	float64	2.7 MB	161,247	47.6%	93,408	27.6%	5,082	1.5%	2.1%	0.0
120	days_last_due_median	float32	1.4 MB	8,067	2.4%	1,517	0.4%	34,523	10.2%	10.2%	365243.0
121	days_last_due_range	float32	1.4 MB	5,628	1.7%	1,517	0.4%	96,973	28.6%	28.7%	0.0
122	days_termination_min	float32	1.4 MB	2,830	0.8%	1,517	0.4%	25,760	7.6%	7.6%	365243.0
123	days_termination_max	float32	1.4 MB	2,733	0.8%	1,517	0.4%	174,661	51.5%	51.8%	365243.0
124	days_termination_mean	float32	1.4 MB	65,778	19.4%	1,517	0.4%	25,760	7.6%	7.6%	365243.0
125	days_termination_std	float64	2.7 MB	153,047	45.2%	93,408	27.6%	5,733	1.7%	2.3%	0.0
126	days_termination_median	float32	1.4 MB	7,915	2.3%	1,517	0.4%	37,945	11.2%	11.2%	365243.0
127	days_termination_range	float32	1.4 MB	5,298	1.6%	1,517	0.4%	97,624	28.8%	28.9%	0.0
128	n_nflag_insured_on_approval_sum	float32	1.4 MB	19	<0.1%	0	0%	158,702	46.8%	46.8%	0.0
129	n_nflag_insured_on_approval_mean	float32	1.4 MB	113	<0.1%	1,517	0.4%	157,185	46.4%	46.6%	0.0
130	n_nflag_insured_on_approval_any	bool	338.9 kB	2	<0.1%	0	0%	180,155	53.2%	53.2%	True
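
In the summary above, the dominant value of many `days_*` aggregates is 365243.0, roughly 1000 years expressed in days. This suggests it acts as a placeholder rather than a real day offset. The sketch below assumes that interpretation (it is not stated in this section) and shows how such a sentinel distorts `max` unless it is replaced with `NaN` before aggregation. The toy values and the name `SENTINEL` are illustrative.

```python
import numpy as np
import pandas as pd

# Toy stand-in for a few `previous_application` rows (illustrative values only)
toy = pd.DataFrame(
    {
        "SK_ID_CURR": [1, 1, 2],
        "DAYS_TERMINATION": [-1612.0, 365243.0, -17.0],
    }
)

SENTINEL = 365243  # assumed placeholder for "no date available"

# Replace the sentinel with NaN so that it is ignored by aggregations
cleaned = toy.assign(
    DAYS_TERMINATION=toy["DAYS_TERMINATION"].replace(SENTINEL, np.nan)
)

raw_max = toy.groupby("SK_ID_CURR")["DAYS_TERMINATION"].max()
clean_max = cleaned.groupby("SK_ID_CURR")["DAYS_TERMINATION"].max()
```

With the sentinel left in place, the maximum for client 1 is 365243.0; after replacement it is -1612.0. Tree-based models can often isolate such a value on their own, so whether to clean it is a design choice.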

5.4 Table `installments_payments`

Code

installments_payments.head()

	SK_ID_PREV	SK_ID_CURR	NUM_INSTALMENT_VERSION	NUM_INSTALMENT_NUMBER	DAYS_INSTALMENT	DAYS_ENTRY_PAYMENT	AMT_INSTALMENT	AMT_PAYMENT
0	1054186	161674	1.00	6	-1180.00	-1187.00	6948.36	6948.36
1	1330831	151639	0.00	34	-2156.00	-2156.00	1716.53	1716.53
2	2085231	193053	2.00	1	-63.00	-63.00	25425.00	25425.00
3	2452527	199697	1.00	3	-2418.00	-2426.00	24350.13	24350.13
4	2714724	167756	1.00	2	-1383.00	-1366.00	2165.04	2160.59

Code

file = dir_interim + "aggregated--installments_payments_aggregated.feather"

if os.path.exists(file):
    installments_payments_aggregated = pd.read_feather(file)

else:
    installments_payments_aggregated = (
        installments_payments.assign(
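            # Positive: paid before the due date; negative: paid late
            diff_days_installment_payment=lambda df: df["DAYS_INSTALMENT"]
            - df["DAYS_ENTRY_PAYMENT"],
            # Positive: underpaid; negative: overpaid
            diff_amt_installment_payment=lambda df: df["AMT_INSTALMENT"]
            - df["AMT_PAYMENT"],
            # Ratio of the amount due to the amount paid (1.0 = paid exactly)
            diff_percent_installment_payment=lambda df: np.where(
                df["AMT_PAYMENT"] == 0,  # To avoid infinite values
                np.nan,
                df["AMT_INSTALMENT"] / df["AMT_PAYMENT"],
            ),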
        )
        .groupby("SK_ID_CURR")
        .agg(
            n_installments_total=("SK_ID_PREV", "count"),
            n_installments_late=(
                "diff_days_installment_payment",
                lambda x: (x < 0).sum(),
            ),
            n_installments_early=(
                "diff_days_installment_payment",
                lambda x: (x > 0).sum(),
            ),
            n_installments_on_time=(
                "diff_days_installment_payment",
                lambda x: (x == 0).sum(),
            ),
            percent_installments_late=(
                "diff_days_installment_payment",
                lambda x: (x < 0).mean(),
            ),
            percent_installments_early=(
                "diff_days_installment_payment",
                lambda x: (x > 0).mean(),
            ),
            percent_installments_on_time=(
                "diff_days_installment_payment",
                lambda x: (x == 0).mean(),
            ),
            n_installments_late_7=(
                "diff_days_installment_payment",
                lambda x: (x < -7).sum(),
            ),
            n_installments_late_30=(
                "diff_days_installment_payment",
                lambda x: (x < -30).sum(),
            ),
            n_installments_late_60=(
                "diff_days_installment_payment",
                lambda x: (x < -60).sum(),
            ),
            any_installments_late_7=(
                "diff_days_installment_payment",
                lambda x: (x < -7).any().astype("int8"),
            ),
            any_installments_late_30=(
                "diff_days_installment_payment",
                lambda x: (x < -30).any().astype("int8"),
            ),
            any_installments_late_60=(
                "diff_days_installment_payment",
                lambda x: (x < -60).any().astype("int8"),
            ),
            percent_installments_late_7=(
                "diff_days_installment_payment",
                lambda x: (x < -7).mean(),
            ),
            percent_installments_late_30=(
                "diff_days_installment_payment",
                lambda x: (x < -30).mean(),
            ),
            percent_installments_late_60=(
                "diff_days_installment_payment",
                lambda x: (x < -60).mean(),
            ),
            diff_days_installment_payment_min=("diff_days_installment_payment", "min"),
            diff_days_installment_payment_max=("diff_days_installment_payment", "max"),
            diff_days_installment_payment_mean=(
                "diff_days_installment_payment",
                "mean",
            ),
            diff_days_installment_payment_std=("diff_days_installment_payment", "std"),
            diff_days_installment_payment_median=(
                "diff_days_installment_payment",
                "median",
            ),
            diff_days_installment_payment_range=(
                "diff_days_installment_payment",
                lambda x: x.max() - x.min(),
            ),
            diff_days_installment_payment_sum=("diff_days_installment_payment", "sum"),
            diff_days_installment_payment_sum_late_only=(
                "diff_days_installment_payment",
                lambda x: x[x < 0].sum(),
            ),
            diff_amt_installment_payment_min=("diff_amt_installment_payment", "min"),
            diff_amt_installment_payment_max=("diff_amt_installment_payment", "max"),
            diff_amt_installment_payment_mean=("diff_amt_installment_payment", "mean"),
            diff_amt_installment_payment_std=("diff_amt_installment_payment", "std"),
            diff_amt_installment_payment_median=(
                "diff_amt_installment_payment",
                "median",
            ),
            diff_amt_installment_payment_range=(
                "diff_amt_installment_payment",
                lambda x: x.max() - x.min(),
            ),
            diff_percent_installment_payment_min=(
                "diff_percent_installment_payment",
                "min",
            ),
            diff_percent_installment_payment_max=(
                "diff_percent_installment_payment",
                "max",
            ),
            diff_percent_installment_payment_mean=(
                "diff_percent_installment_payment",
                "mean",
            ),
            diff_percent_installment_payment_std=(
                "diff_percent_installment_payment",
                "std",
            ),
            diff_percent_installment_payment_median=(
                "diff_percent_installment_payment",
                "median",
            ),
            diff_percent_installment_payment_range=(
                "diff_percent_installment_payment",
                lambda x: x.max() - x.min(),
            ),
        )
    ).reset_index()

    installments_payments_aggregated.to_feather(file)

del file
# Time: 13m 13.4s
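
To make the sign convention of `diff_days_installment_payment` concrete, here is a small worked example on made-up data for a single client. It mirrors a few of the aggregates defined above: a positive difference means the payment was entered before the due date, and a negative one means it was late.

```python
import pandas as pd

# Toy installments for one client (illustrative values only)
toy = pd.DataFrame(
    {
        "DAYS_INSTALMENT": [-100.0, -70.0, -40.0, -10.0],  # due dates
        "DAYS_ENTRY_PAYMENT": [-105.0, -70.0, -30.0, -5.0],  # payment dates
    }
)

# Differences: 5 (early), 0 (on time), -10 (late), -5 (late)
diff = toy["DAYS_INSTALMENT"] - toy["DAYS_ENTRY_PAYMENT"]

summary = {
    "n_installments_early": int((diff > 0).sum()),
    "n_installments_on_time": int((diff == 0).sum()),
    "n_installments_late": int((diff < 0).sum()),
    "n_installments_late_7": int((diff < -7).sum()),  # more than 7 days late
    "percent_installments_late": float((diff < 0).mean()),  # a 0-1 fraction
    "sum_late_only": float(diff[diff < 0].sum()),  # total days of delay
}
```

Note that the `percent_*` features are fractions in the range from 0 to 1, not values from 0 to 100.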

Code

installments_payments_aggregated.shape

(339587, 37)

Code

installments_payments_aggregated.head()

	SK_ID_CURR	n_installments_total	n_installments_late	n_installments_early	n_installments_on_time	percent_installments_late	percent_installments_early	percent_installments_on_time	n_installments_late_7	any_installments_late_7	percent_installments_late_7	diff_days_installment_payment_min	diff_days_installment_payment_max	diff_days_installment_payment_mean	diff_days_installment_payment_std	diff_days_installment_payment_median	diff_days_installment_payment_range	diff_days_installment_payment_sum	diff_days_installment_payment_sum_late_only	diff_percent_installment_payment_min	diff_percent_installment_payment_max	diff_percent_installment_payment_mean	diff_percent_installment_payment_median
0	100001	7	1	4	2	0.14	0.57	0.29	1	1	0.14	-11.00	36.00	7.29	14.63	6.00	47.00	51.00	-11.00	1.00	1.00	1.00	1.00
1	100002	19	0	19	0	0.00	1.00	0.00	0	0	0.00	12.00	31.00	20.42	4.93	19.00	19.00	388.00	0.00	1.00	1.00	1.00	1.00
2	100003	25	0	25	0	0.00	1.00	0.00	0	0	0.00	1.00	14.00	7.16	3.73	6.00	13.00	179.00	0.00	1.00	1.00	1.00	1.00
3	100004	3	0	3	0	0.00	1.00	0.00	0	0	0.00	3.00	11.00	7.67	4.16	9.00	8.00	23.00	0.00	1.00	1.00	1.00	1.00
4	100005	9	1	8	0	0.11	0.89	0.00	0	0	0.00	-1.00	37.00	23.56	13.51	29.00	38.00	212.00	-1.00	1.00	1.00	1.00	1.00

Code

an.col_info(installments_payments_aggregated, style=True)

	column	data_type	memory_size	n_unique	p_unique	n_missing	p_missing	n_dominant	p_dominant	p_dom_excl_na	dominant
1	SK_ID_CURR	int32	1.4 MB	339,587	100.0%	0	0%	1	<0.1%	<0.1%	100001
2	n_installments_total	int64	2.7 MB	323	0.1%	0	0%	14,187	4.2%	4.2%	12
3	n_installments_late	int64	2.7 MB	103	<0.1%	0	0%	159,742	47.0%	47.0%	0
4	n_installments_early	int64	2.7 MB	227	0.1%	0	0%	15,226	4.5%	4.5%	6
5	n_installments_on_time	int64	2.7 MB	150	<0.1%	0	0%	146,945	43.3%	43.3%	0
6	percent_installments_late	float64	2.7 MB	5,237	1.5%	0	0%	159,742	47.0%	47.0%	0.0
7	percent_installments_early	float64	2.7 MB	9,175	2.7%	0	0%	107,885	31.8%	31.8%	1.0
8	percent_installments_on_time	float64	2.7 MB	9,266	2.7%	0	0%	146,945	43.3%	43.3%	0.0
9	n_installments_late_7	int64	2.7 MB	61	<0.1%	0	0%	246,891	72.7%	72.7%	0
10	n_installments_late_30	int64	2.7 MB	47	<0.1%	0	0%	318,171	93.7%	93.7%	0
11	n_installments_late_60	int64	2.7 MB	42	<0.1%	0	0%	329,789	97.1%	97.1%	0
12	any_installments_late_7	int8	339.6 kB	2	<0.1%	0	0%	246,891	72.7%	72.7%	0
13	any_installments_late_30	int8	339.6 kB	2	<0.1%	0	0%	318,171	93.7%	93.7%	0
14	any_installments_late_60	int8	339.6 kB	2	<0.1%	0	0%	329,789	97.1%	97.1%	0
15	percent_installments_late_7	float64	2.7 MB	3,033	0.9%	0	0%	246,891	72.7%	72.7%	0.0
16	percent_installments_late_30	float64	2.7 MB	1,061	0.3%	0	0%	318,171	93.7%	93.7%	0.0
17	percent_installments_late_60	float64	2.7 MB	762	0.2%	0	0%	329,789	97.1%	97.1%	0.0
18	diff_days_installment_payment_min	float32	1.4 MB	1,736	0.5%	9	<0.1%	51,800	15.3%	15.3%	0.0
19	diff_days_installment_payment_max	float32	1.4 MB	455	0.1%	9	<0.1%	25,531	7.5%	7.5%	30.0
20	diff_days_installment_payment_mean	float32	1.4 MB	68,196	20.1%	9	<0.1%	1,295	0.4%	0.4%	9.0
21	diff_days_installment_payment_std	float32	1.4 MB	252,379	74.3%	977	0.3%	586	0.2%	0.2%	0.0
22	diff_days_installment_payment_median	float32	1.4 MB	351	0.1%	9	<0.1%	35,844	10.6%	10.6%	0.0
23	diff_days_installment_payment_range	float32	1.4 MB	1,760	0.5%	9	<0.1%	9,072	2.7%	2.7%	30.0
24	diff_days_installment_payment_sum	float32	1.4 MB	5,110	1.5%	0	0%	885	0.3%	0.3%	77.0
25	diff_days_installment_payment_sum_late_only	float32	1.4 MB	2,155	0.6%	0	0%	159,742	47.0%	47.0%	0.0
26	diff_amt_installment_payment_min	float64	2.7 MB	41,900	12.3%	9	<0.1%	294,774	86.8%	86.8%	0.0
27	diff_amt_installment_payment_max	float64	2.7 MB	119,803	35.3%	9	<0.1%	194,218	57.2%	57.2%	0.0
28	diff_amt_installment_payment_mean	float64	2.7 MB	160,502	47.3%	9	<0.1%	171,327	50.5%	50.5%	0.0
29	diff_amt_installment_payment_std	float64	2.7 MB	167,917	49.4%	977	0.3%	170,360	50.2%	50.3%	0.0
30	diff_amt_installment_payment_median	float64	2.7 MB	10,695	3.1%	9	<0.1%	326,204	96.1%	96.1%	0.0
31	diff_amt_installment_payment_range	float64	2.7 MB	145,084	42.7%	9	<0.1%	171,328	50.5%	50.5%	0.0
32	diff_percent_installment_payment_min	float64	2.7 MB	42,855	12.6%	9	<0.1%	294,774	86.8%	86.8%	1.0
33	diff_percent_installment_payment_max	float64	2.7 MB	135,199	39.8%	9	<0.1%	194,452	57.3%	57.3%	1.0
34	diff_percent_installment_payment_mean	float64	2.7 MB	145,164	42.7%	9	<0.1%	171,537	50.5%	50.5%	1.0
35	diff_percent_installment_payment_std	float64	2.7 MB	167,620	49.4%	977	0.3%	170,571	50.2%	50.4%	0.0
36	diff_percent_installment_payment_median	float64	2.7 MB	13,240	3.9%	9	<0.1%	326,204	96.1%	96.1%	1.0
37	diff_percent_installment_payment_range	float64	2.7 MB	159,045	46.8%	9	<0.1%	171,539	50.5%	50.5%	0.0

5.5 Table `pos_cash_balance`

Code

pos_cash_balance.head()

	SK_ID_PREV	SK_ID_CURR	MONTHS_BALANCE	CNT_INSTALMENT	CNT_INSTALMENT_FUTURE	NAME_CONTRACT_STATUS
0	1803195	182943	-31	48.00	45.00	Active
1	1715348	367990	-33	36.00	35.00	Active
2	1784872	397406	-32	12.00	9.00	Active
3	1903291	269225	-35	48.00	42.00	Active
4	2341044	334279	-35	36.00	35.00	Active

Code

file = dir_interim + "aggregated--pos_cash_balance_aggregated.feather"

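# Reuse the cached aggregation if it exists; otherwise compute and cache it.
if os.path.exists(file):
    pos_cash_balance_aggregated = pd.read_feather(file)
else:
    pos_cash_balance_aggregated = (
        pos_cash_balance.assign(
            # Credit term minus the number of installments left to pay
            cnt_installments_diff=lambda df: df["CNT_INSTALMENT"]
            - df["CNT_INSTALMENT_FUTURE"]
        )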
        .groupby("SK_ID_CURR")
        .agg(
            n_previous_pos_applications=("SK_ID_PREV", "count"),
            n_previous_pos_applications_active=(
                "NAME_CONTRACT_STATUS",
                lambda x: (x == "Active").sum(),
            ),
            n_previous_pos_applications_signed=(
                "NAME_CONTRACT_STATUS",
                lambda x: (x == "Signed").sum(),
            ),
            n_previous_pos_applications_completed=(
                "NAME_CONTRACT_STATUS",
                lambda x: (x == "Completed").sum(),
            ),
            cnt_installment_min=("CNT_INSTALMENT", "min"),
            cnt_installment_max=("CNT_INSTALMENT", "max"),
            cnt_installment_mean=("CNT_INSTALMENT", "mean"),
            cnt_installment_std=("CNT_INSTALMENT", "std"),
            cnt_installment_median=("CNT_INSTALMENT", "median"),
            cnt_installment_range=("CNT_INSTALMENT", lambda x: x.max() - x.min()),
            cnt_installment_future_min=("CNT_INSTALMENT_FUTURE", "min"),
            cnt_installment_future_max=("CNT_INSTALMENT_FUTURE", "max"),
            cnt_installment_future_mean=("CNT_INSTALMENT_FUTURE", "mean"),
            cnt_installment_future_std=("CNT_INSTALMENT_FUTURE", "std"),
            cnt_installment_future_median=("CNT_INSTALMENT_FUTURE", "median"),
            cnt_installment_future_range=(
                "CNT_INSTALMENT_FUTURE",
                lambda x: x.max() - x.min(),
            ),
            cnt_installments_diff_min=("cnt_installments_diff", "min"),
            cnt_installments_diff_max=("cnt_installments_diff", "max"),
            cnt_installments_diff_mean=("cnt_installments_diff", "mean"),
            cnt_installments_diff_std=("cnt_installments_diff", "std"),
            cnt_installments_diff_median=("cnt_installments_diff", "median"),
            cnt_installments_diff_range=(
                "cnt_installments_diff",
                lambda x: x.max() - x.min(),
            ),
            sk_dpd_pos_applications_min=("SK_DPD", "min"),
            sk_dpd_pos_applications_max=("SK_DPD", "max"),
            sk_dpd_pos_applications_mean=("SK_DPD", "mean"),
            sk_dpd_pos_applications_std=("SK_DPD", "std"),
            sk_dpd_pos_applications_median=("SK_DPD", "median"),
            sk_dpd_pos_applications_range=("SK_DPD", lambda x: x.max() - x.min()),
            sk_dpd_def_pos_applications_min=("SK_DPD_DEF", "min"),
            sk_dpd_def_pos_applications_max=("SK_DPD_DEF", "max"),
            sk_dpd_def_pos_applications_mean=("SK_DPD_DEF", "mean"),
            sk_dpd_def_pos_applications_std=("SK_DPD_DEF", "std"),
            sk_dpd_def_pos_applications_median=("SK_DPD_DEF", "median"),
            sk_dpd_def_pos_applications_range=(
                "SK_DPD_DEF",
                lambda x: x.max() - x.min(),
            ),
        )
        .reset_index()
    )

    pos_cash_balance_aggregated.to_feather(file)

del file
# Time: 4m 50.1s
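Note that `pos_cash_balance` holds one row per contract per month (see `MONTHS_BALANCE`), so `"count"` on `SK_ID_PREV` gives the number of monthly records per client, not the number of distinct previous contracts. The sketch below illustrates the difference on made-up data. The toy values and the names `n_monthly_records`, `n_distinct_contracts`, and `n_records_active` are assumptions for illustration, not features from this project.

```python
import pandas as pd

# Toy data (hypothetical values): one row per contract per month.
toy = pd.DataFrame(
    {
        "SK_ID_CURR": [1, 1, 1, 2, 2],
        "SK_ID_PREV": [10, 10, 11, 20, 20],
        "NAME_CONTRACT_STATUS": [
            "Active", "Active", "Completed", "Active", "Signed",
        ],
    }
)

counts = (
    toy.groupby("SK_ID_CURR")
    .agg(
        # Number of rows (monthly snapshots) per client
        n_monthly_records=("SK_ID_PREV", "count"),
        # Number of distinct previous contracts per client
        n_distinct_contracts=("SK_ID_PREV", "nunique"),
        # Number of monthly snapshots with status "Active"
        n_records_active=(
            "NAME_CONTRACT_STATUS",
            lambda x: (x == "Active").sum(),
        ),
    )
    .reset_index()
)
print(counts)
```

If the number of contracts rather than the number of snapshots is of interest, `"nunique"` may be the more appropriate aggregation.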

Code

pos_cash_balance_aggregated.shape

(337252, 35)
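The load-or-compute pattern used for the aggregated tables can be wrapped in a small helper. This is only a sketch: `load_or_compute` and `build_table` are hypothetical names, and pickle files are used here instead of Feather so that the example runs without extra dependencies.

```python
import os
import tempfile

import pandas as pd


def load_or_compute(path, compute):
    """Return the cached DataFrame at `path`, or compute and cache it."""
    if os.path.exists(path):
        return pd.read_pickle(path)
    result = compute()
    result.to_pickle(path)
    return result


calls = []  # Tracks how many times the expensive step actually runs


def build_table():
    # Stand-in for a slow aggregation (hypothetical toy result)
    calls.append(1)
    return pd.DataFrame({"SK_ID_CURR": [1, 2], "n_records": [3, 2]})


with tempfile.TemporaryDirectory() as tmp_dir:
    cache_file = os.path.join(tmp_dir, "toy_aggregated.pkl")
    first = load_or_compute(cache_file, build_table)  # computes and saves
    second = load_or_compute(cache_file, build_table)  # reads from cache
```

The second call reads the cached file, so the slow step runs only once. A stale cache has to be deleted manually if the aggregation code changes.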

Code

an.col_info(pos_cash_balance_aggregated, style=True)

	column	data_type	memory_size	n_unique	p_unique	n_missing	p_missing	n_dominant	p_dominant	p_dom_excl_na	dominant
1	SK_ID_CURR	int32	1.3 MB	337,252	100.0%	0	0%	1	<0.1%	<0.1%	100001
2	n_previous_pos_applications	int64	2.7 MB	234	0.1%	0	0%	15,747	4.7%	4.7%	13
3	n_previous_pos_applications_active	int64	2.7 MB	217	0.1%	0	0%	18,950	5.6%	5.6%	12
4	n_previous_pos_applications_signed	int64	2.7 MB	32	<0.1%	0	0%	269,536	79.9%	79.9%	0
5	n_previous_pos_applications_completed	int64	2.7 MB	52	<0.1%	0	0%	121,689	36.1%	36.1%	1
6	cnt_installment_min	float32	1.3 MB	58	<0.1%	28	<0.1%	70,190	20.8%	20.8%	6.0
7	cnt_installment_max	float32	1.3 MB	65	<0.1%	28	<0.1%	96,939	28.7%	28.7%	12.0
8	cnt_installment_mean	float32	1.3 MB	45,080	13.4%	28	<0.1%	24,836	7.4%	7.4%	12.0
9	cnt_installment_std	float32	1.3 MB	134,889	40.0%	394	0.1%	81,052	24.0%	24.1%	0.0
10	cnt_installment_median	float32	1.3 MB	106	<0.1%	28	<0.1%	102,288	30.3%	30.3%	12.0
11	cnt_installment_range	float32	1.3 MB	72	<0.1%	28	<0.1%	81,418	24.1%	24.1%	0.0
12	cnt_installment_future_min	float32	1.3 MB	61	<0.1%	28	<0.1%	305,633	90.6%	90.6%	0.0
13	cnt_installment_future_max	float32	1.3 MB	65	<0.1%	28	<0.1%	95,391	28.3%	28.3%	12.0
14	cnt_installment_future_mean	float32	1.3 MB	43,319	12.8%	28	<0.1%	11,968	3.5%	3.5%	6.0
15	cnt_installment_future_std	float32	1.3 MB	145,833	43.2%	394	0.1%	11,689	3.5%	3.5%	2.1602468
16	cnt_installment_future_median	float32	1.3 MB	121	<0.1%	28	<0.1%	36,769	10.9%	10.9%	6.0
17	cnt_installment_future_range	float32	1.3 MB	68	<0.1%	28	<0.1%	86,320	25.6%	25.6%	12.0
18	cnt_installments_diff_min	float32	1.3 MB	63	<0.1%	28	<0.1%	329,680	97.8%	97.8%	0.0
19	cnt_installments_diff_max	float32	1.3 MB	68	<0.1%	28	<0.1%	59,647	17.7%	17.7%	12.0
20	cnt_installments_diff_mean	float32	1.3 MB	26,103	7.7%	28	<0.1%	14,771	4.4%	4.4%	3.0
21	cnt_installments_diff_std	float32	1.3 MB	112,198	33.3%	394	0.1%	12,391	3.7%	3.7%	2.1602468
22	cnt_installments_diff_median	float32	1.3 MB	67	<0.1%	28	<0.1%	49,529	14.7%	14.7%	4.0
23	cnt_installments_diff_range	float32	1.3 MB	89	<0.1%	28	<0.1%	59,158	17.5%	17.5%	12.0
24	sk_dpd_pos_applications_min	int16	674.5 kB	65	<0.1%	0	0%	337,185	>99.9%	>99.9%	0
25	sk_dpd_pos_applications_max	int16	674.5 kB	2,025	0.6%	0	0%	274,268	81.3%	81.3%	0
26	sk_dpd_pos_applications_mean	float64	2.7 MB	11,737	3.5%	0	0%	274,268	81.3%	81.3%	0.0
27	sk_dpd_pos_applications_std	float64	2.7 MB	34,423	10.2%	372	0.1%	273,896	81.2%	81.3%	0.0
28	sk_dpd_pos_applications_median	float64	2.7 MB	1,210	0.4%	0	0%	334,735	99.3%	99.3%	0.0
29	sk_dpd_pos_applications_range	int16	674.5 kB	1,984	0.6%	0	0%	274,268	81.3%	81.3%	0
30	sk_dpd_def_pos_applications_min	int16	674.5 kB	3	<0.1%	0	0%	337,250	>99.9%	>99.9%	0
31	sk_dpd_def_pos_applications_max	int16	674.5 kB	217	0.1%	0	0%	291,303	86.4%	86.4%	0
32	sk_dpd_def_pos_applications_mean	float64	2.7 MB	4,722	1.4%	0	0%	291,303	86.4%	86.4%	0.0
33	sk_dpd_def_pos_applications_std	float64	2.7 MB	20,867	6.2%	372	0.1%	290,931	86.3%	86.4%	0.0
34	sk_dpd_def_pos_applications_median	float64	2.7 MB	80	<0.1%	0	0%	336,957	99.9%	99.9%	0.0
35	sk_dpd_def_pos_applications_range	int16	674.5 kB	216	0.1%	0	0%	291,303	86.4%	86.4%	0
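In the summary above, the `*_std` columns have more missing values than the other statistics of the same variable (e.g., 394 vs. 28 for `cnt_installment_*`, 372 vs. 0 for `sk_dpd_pos_applications_*`). A likely explanation is that pandas computes the sample standard deviation (`ddof=1`) by default, which is undefined for a group with a single observation. A minimal sketch on made-up data (the toy values and the name `cnt_installment_std_pop` are assumptions):

```python
import pandas as pd

# Client 1 has two records, client 2 has only one (hypothetical values).
toy = pd.DataFrame(
    {"SK_ID_CURR": [1, 1, 2], "CNT_INSTALMENT": [12.0, 24.0, 6.0]}
)

stats = toy.groupby("SK_ID_CURR").agg(
    cnt_installment_mean=("CNT_INSTALMENT", "mean"),
    # Sample standard deviation (ddof=1): NaN for single-record groups
    cnt_installment_std=("CNT_INSTALMENT", "std"),
    # Population standard deviation (ddof=0): 0.0 for single-record groups
    cnt_installment_std_pop=("CNT_INSTALMENT", lambda x: x.std(ddof=0)),
)
print(stats)
```

Whether such values should stay missing or be set to zero is a modeling decision. Tree-based models that handle missing values natively may not need any change.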

5.6 Table `credit_card_balance`

Code

credit_card_balance.head()

	SK_ID_PREV	SK_ID_CURR	MONTHS_BALANCE	AMT_BALANCE	AMT_CREDIT_LIMIT_ACTUAL	AMT_DRAWINGS_ATM_CURRENT	AMT_DRAWINGS_CURRENT	AMT_DRAWINGS_POS_CURRENT	AMT_INST_MIN_REGULARITY	AMT_PAYMENT_CURRENT	AMT_PAYMENT_TOTAL_CURRENT	AMT_RECEIVABLE_PRINCIPAL	AMT_RECIVABLE	AMT_TOTAL_RECEIVABLE	CNT_DRAWINGS_ATM_CURRENT	CNT_DRAWINGS_CURRENT	CNT_DRAWINGS_POS_CURRENT	CNT_INSTALMENT_MATURE_CUM	NAME_CONTRACT_STATUS
0	2562384	378907	-6	56.97	135000	0.00	877.50	877.50	1700.33	1800.00	1800.00	0.00	0.00	0.00	0.00	1	1.00	35.00	Active
1	2582071	363914	-1	63975.56	45000	2250.00	2250.00	0.00	2250.00	2250.00	2250.00	60175.08	64875.56	64875.56	1.00	1	0.00	69.00	Active
2	1740877	371185	-7	31815.22	450000	0.00	0.00	0.00	2250.00	2250.00	2250.00	26926.42	31460.08	31460.08	0.00	0	0.00	30.00	Active
3	1389973	337855	-4	236572.11	225000	2250.00	2250.00	0.00	11795.76	11925.00	11925.00	224949.29	233048.97	233048.97	1.00	1	0.00	10.00	Active
4	1891521	126868	-1	453919.46	450000	0.00	11547.00	11547.00	22924.89	27000.00	27000.00	443044.40	453919.46	453919.46	0.00	1	1.00	101.00	Active

Code

file = dir_interim + "aggregated--credit_card_balance_aggregated.feather"

if os.path.exists(file):
    credit_card_balance_aggregated = pd.read_feather(file)
else:
    credit_card_balance_aggregated = (
        credit_card_balance.groupby("SK_ID_CURR")
        .agg(
            n_previous_credit_card_applications=("SK_ID_PREV", "count"),
            n_previous_credit_card_applications_completed=(
                "NAME_CONTRACT_STATUS",
                lambda x: (x == "Completed").sum(),
            ),
            n_previous_credit_card_applications_active=(
                "NAME_CONTRACT_STATUS",
                lambda x: (x == "Active").sum(),
            ),
            n_previous_credit_card_applications_signed=(
                "NAME_CONTRACT_STATUS",
                lambda x: (x == "Signed").sum(),
            ),
            n_contracts_credit_card_active=(
                "NAME_CONTRACT_STATUS",
                lambda x: (x == "Active").sum(),
            ),
            n_contracts_credit_card_completed=(
                "NAME_CONTRACT_STATUS",
                lambda x: (x == "Completed").sum(),
            ),
            n_contracts_credit_card_signed=(
                "NAME_CONTRACT_STATUS",
                lambda x: (x == "Signed").sum(),
            ),
            amt_balance_credit_card_min=("AMT_BALANCE", "min"),
            amt_balance_credit_card_max=("AMT_BALANCE", "max"),
            amt_balance_credit_card_mean=("AMT_BALANCE", "mean"),
            amt_balance_credit_card_std=("AMT_BALANCE", "std"),
            amt_balance_credit_card_median=("AMT_BALANCE", "median"),
            amt_balance_credit_card_range=("AMT_BALANCE", lambda x: x.max() - x.min()),
            amt_credit_limit_actual_min=("AMT_CREDIT_LIMIT_ACTUAL", "min"),
            amt_credit_limit_actual_max=("AMT_CREDIT_LIMIT_ACTUAL", "max"),
            amt_credit_limit_actual_mean=("AMT_CREDIT_LIMIT_ACTUAL", "mean"),
            amt_credit_limit_actual_std=("AMT_CREDIT_LIMIT_ACTUAL", "std"),
            amt_credit_limit_actual_median=("AMT_CREDIT_LIMIT_ACTUAL", "median"),
            amt_credit_limit_actual_range=(
                "AMT_CREDIT_LIMIT_ACTUAL",
                lambda x: x.max() - x.min(),
            ),
            amt_drawings_atm_current_min=("AMT_DRAWINGS_ATM_CURRENT", "min"),
            amt_drawings_atm_current_max=("AMT_DRAWINGS_ATM_CURRENT", "max"),
            amt_drawings_atm_current_mean=("AMT_DRAWINGS_ATM_CURRENT", "mean"),
            amt_drawings_atm_current_std=("AMT_DRAWINGS_ATM_CURRENT", "std"),
            amt_drawings_atm_current_median=("AMT_DRAWINGS_ATM_CURRENT", "median"),
            amt_drawings_atm_current_range=(
                "AMT_DRAWINGS_ATM_CURRENT",
                lambda x: x.max() - x.min(),
            ),
            amt_drawings_current_min=("AMT_DRAWINGS_CURRENT", "min"),
            amt_drawings_current_max=("AMT_DRAWINGS_CURRENT", "max"),
            amt_drawings_current_mean=("AMT_DRAWINGS_CURRENT", "mean"),
            amt_drawings_current_std=("AMT_DRAWINGS_CURRENT", "std"),
            amt_drawings_current_median=("AMT_DRAWINGS_CURRENT", "median"),
            amt_drawings_current_range=(
                "AMT_DRAWINGS_CURRENT",
                lambda x: x.max() - x.min(),
            ),
            amt_drawings_other_current_min=("AMT_DRAWINGS_OTHER_CURRENT", "min"),
            amt_drawings_other_current_max=("AMT_DRAWINGS_OTHER_CURRENT", "max"),
            amt_drawings_other_current_mean=("AMT_DRAWINGS_OTHER_CURRENT", "mean"),
            amt_drawings_other_current_std=("AMT_DRAWINGS_OTHER_CURRENT", "std"),
            amt_drawings_other_current_median=("AMT_DRAWINGS_OTHER_CURRENT", "median"),
            amt_drawings_other_current_range=(
                "AMT_DRAWINGS_OTHER_CURRENT",
                lambda x: x.max() - x.min(),
            ),
            amt_drawings_pos_current_min=("AMT_DRAWINGS_POS_CURRENT", "min"),
            amt_drawings_pos_current_max=("AMT_DRAWINGS_POS_CURRENT", "max"),
            amt_drawings_pos_current_mean=("AMT_DRAWINGS_POS_CURRENT", "mean"),
            amt_drawings_pos_current_std=("AMT_DRAWINGS_POS_CURRENT", "std"),
            amt_drawings_pos_current_median=("AMT_DRAWINGS_POS_CURRENT", "median"),
            amt_drawings_pos_current_range=(
                "AMT_DRAWINGS_POS_CURRENT",
                lambda x: x.max() - x.min(),
            ),
            amt_inst_min_regularity_min=("AMT_INST_MIN_REGULARITY", "min"),
            amt_inst_min_regularity_max=("AMT_INST_MIN_REGULARITY", "max"),
            amt_inst_min_regularity_mean=("AMT_INST_MIN_REGULARITY", "mean"),
            amt_inst_min_regularity_std=("AMT_INST_MIN_REGULARITY", "std"),
            amt_inst_min_regularity_median=("AMT_INST_MIN_REGULARITY", "median"),
            amt_inst_min_regularity_range=(
                "AMT_INST_MIN_REGULARITY",
                lambda x: x.max() - x.min(),
            ),
            amt_payment_current_min=("AMT_PAYMENT_CURRENT", "min"),
            amt_payment_current_max=("AMT_PAYMENT_CURRENT", "max"),
            amt_payment_current_mean=("AMT_PAYMENT_CURRENT", "mean"),
            amt_payment_current_std=("AMT_PAYMENT_CURRENT", "std"),
            amt_payment_current_median=("AMT_PAYMENT_CURRENT", "median"),
            amt_payment_current_range=(
                "AMT_PAYMENT_CURRENT",
                lambda x: x.max() - x.min(),
            ),
            amt_payment_total_current_min=("AMT_PAYMENT_TOTAL_CURRENT", "min"),
            amt_payment_total_current_max=("AMT_PAYMENT_TOTAL_CURRENT", "max"),
            amt_payment_total_current_mean=("AMT_PAYMENT_TOTAL_CURRENT", "mean"),
            amt_payment_total_current_std=("AMT_PAYMENT_TOTAL_CURRENT", "std"),
            amt_payment_total_current_median=("AMT_PAYMENT_TOTAL_CURRENT", "median"),
            amt_payment_total_current_range=(
                "AMT_PAYMENT_TOTAL_CURRENT",
                lambda x: x.max() - x.min(),
            ),
            amt_receivable_principal_min=("AMT_RECEIVABLE_PRINCIPAL", "min"),
            amt_receivable_principal_max=("AMT_RECEIVABLE_PRINCIPAL", "max"),
            amt_receivable_principal_mean=("AMT_RECEIVABLE_PRINCIPAL", "mean"),
            amt_receivable_principal_std=("AMT_RECEIVABLE_PRINCIPAL", "std"),
            amt_receivable_principal_median=("AMT_RECEIVABLE_PRINCIPAL", "median"),
            amt_receivable_principal_range=(
                "AMT_RECEIVABLE_PRINCIPAL",
                lambda x: x.max() - x.min(),
            ),
            amt_receivable_min=("AMT_RECIVABLE", "min"),
            amt_receivable_max=("AMT_RECIVABLE", "max"),
            amt_receivable_mean=("AMT_RECIVABLE", "mean"),
            amt_receivable_std=("AMT_RECIVABLE", "std"),
            amt_receivable_median=("AMT_RECIVABLE", "median"),
            amt_receivable_range=("AMT_RECIVABLE", lambda x: x.max() - x.min()),
            amt_total_receivable_min=("AMT_TOTAL_RECEIVABLE", "min"),
            amt_total_receivable_max=("AMT_TOTAL_RECEIVABLE", "max"),
            amt_total_receivable_mean=("AMT_TOTAL_RECEIVABLE", "mean"),
            amt_total_receivable_std=("AMT_TOTAL_RECEIVABLE", "std"),
            amt_total_receivable_median=("AMT_TOTAL_RECEIVABLE", "median"),
            amt_total_receivable_range=(
                "AMT_TOTAL_RECEIVABLE",
                lambda x: x.max() - x.min(),
            ),
            cnt_drawings_atm_current_min=("CNT_DRAWINGS_ATM_CURRENT", "min"),
            cnt_drawings_atm_current_max=("CNT_DRAWINGS_ATM_CURRENT", "max"),
            cnt_drawings_atm_current_mean=("CNT_DRAWINGS_ATM_CURRENT", "mean"),
            cnt_drawings_atm_current_std=("CNT_DRAWINGS_ATM_CURRENT", "std"),
            cnt_drawings_atm_current_median=("CNT_DRAWINGS_ATM_CURRENT", "median"),
            cnt_drawings_atm_current_range=(
                "CNT_DRAWINGS_ATM_CURRENT",
                lambda x: x.max() - x.min(),
            ),
            cnt_drawings_current_min=("CNT_DRAWINGS_CURRENT", "min"),
            cnt_drawings_current_max=("CNT_DRAWINGS_CURRENT", "max"),
            cnt_drawings_current_mean=("CNT_DRAWINGS_CURRENT", "mean"),
            cnt_drawings_current_std=("CNT_DRAWINGS_CURRENT", "std"),
            cnt_drawings_current_median=("CNT_DRAWINGS_CURRENT", "median"),
            cnt_drawings_current_range=(
                "CNT_DRAWINGS_CURRENT",
                lambda x: x.max() - x.min(),
            ),
            cnt_drawings_other_current_min=("CNT_DRAWINGS_OTHER_CURRENT", "min"),
            cnt_drawings_other_current_max=("CNT_DRAWINGS_OTHER_CURRENT", "max"),
            cnt_drawings_other_current_mean=("CNT_DRAWINGS_OTHER_CURRENT", "mean"),
            cnt_drawings_other_current_std=("CNT_DRAWINGS_OTHER_CURRENT", "std"),
            cnt_drawings_other_current_median=("CNT_DRAWINGS_OTHER_CURRENT", "median"),
            cnt_drawings_other_current_range=(
                "CNT_DRAWINGS_OTHER_CURRENT",
                lambda x: x.max() - x.min(),
            ),
            cnt_drawings_pos_current_min=("CNT_DRAWINGS_POS_CURRENT", "min"),
            cnt_drawings_pos_current_max=("CNT_DRAWINGS_POS_CURRENT", "max"),
            cnt_drawings_pos_current_mean=("CNT_DRAWINGS_POS_CURRENT", "mean"),
            cnt_drawings_pos_current_std=("CNT_DRAWINGS_POS_CURRENT", "std"),
            cnt_drawings_pos_current_median=("CNT_DRAWINGS_POS_CURRENT", "median"),
            cnt_drawings_pos_current_range=(
                "CNT_DRAWINGS_POS_CURRENT",
                lambda x: x.max() - x.min(),
            ),
            cnt_installment_mature_cum_min=("CNT_INSTALMENT_MATURE_CUM", "min"),
            cnt_installment_mature_cum_max=("CNT_INSTALMENT_MATURE_CUM", "max"),
            cnt_installment_mature_cum_mean=("CNT_INSTALMENT_MATURE_CUM", "mean"),
            cnt_installment_mature_cum_std=("CNT_INSTALMENT_MATURE_CUM", "std"),
            cnt_installment_mature_cum_median=("CNT_INSTALMENT_MATURE_CUM", "median"),
            cnt_installment_mature_cum_range=(
                "CNT_INSTALMENT_MATURE_CUM",
                lambda x: x.max() - x.min(),
            ),
            sk_dpd_credit_card_min=("SK_DPD", "min"),
            sk_dpd_credit_card_max=("SK_DPD", "max"),
            sk_dpd_credit_card_mean=("SK_DPD", "mean"),
            sk_dpd_credit_card_std=("SK_DPD", "std"),
            sk_dpd_credit_card_median=("SK_DPD", "median"),
            sk_dpd_credit_card_range=("SK_DPD", lambda x: x.max() - x.min()),
            sk_dpd_def_credit_card_min=("SK_DPD_DEF", "min"),
            sk_dpd_def_credit_card_max=("SK_DPD_DEF", "max"),
            sk_dpd_def_credit_card_mean=("SK_DPD_DEF", "mean"),
            sk_dpd_def_credit_card_std=("SK_DPD_DEF", "std"),
            sk_dpd_def_credit_card_median=("SK_DPD_DEF", "median"),
            sk_dpd_def_credit_card_range=("SK_DPD_DEF", lambda x: x.max() - x.min()),
        )
        .reset_index()
        .pipe(klib.convert_datatypes)
    )

    credit_card_balance_aggregated.to_feather(file)

del file
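The aggregation above repeats the same six statistics for every numeric column, and in such a long hand-written specification a typo in a single lambda is easy to miss. One possible alternative is to generate the specification programmatically. The sketch below is an illustration only: `value_range`, `STATS`, and `make_agg_spec` are hypothetical helpers, the data are made up, and the generated feature names (lower-cased column name plus suffix) would not match every hand-written name used in this project.

```python
import pandas as pd


def value_range(x: pd.Series) -> float:
    """Difference between the largest and the smallest value."""
    return x.max() - x.min()


# Statistic suffix -> aggregation function (defined once, reused everywhere)
STATS = {
    "min": "min",
    "max": "max",
    "mean": "mean",
    "std": "std",
    "median": "median",
    "range": value_range,
}


def make_agg_spec(columns):
    """Build keyword arguments for named aggregation in `.agg()`."""
    return {
        f"{col.lower()}_{suffix}": pd.NamedAgg(column=col, aggfunc=func)
        for col in columns
        for suffix, func in STATS.items()
    }


# Toy data (hypothetical values)
toy = pd.DataFrame(
    {
        "SK_ID_CURR": [1, 1, 2, 2],
        "AMT_BALANCE": [0.0, 100.0, 50.0, 50.0],
        "AMT_DRAWINGS_ATM_CURRENT": [10.0, 30.0, 5.0, 5.0],
    }
)

spec = make_agg_spec(["AMT_BALANCE", "AMT_DRAWINGS_ATM_CURRENT"])
toy_aggregated = toy.groupby("SK_ID_CURR").agg(**spec).reset_index()
print(toy_aggregated.columns.tolist())
```

Because the range function is written only once, every `*_range` feature is guaranteed to use the same definition.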

Code

credit_card_balance_aggregated.shape

(103558, 122)

Code

credit_card_balance_aggregated.head()

	SK_ID_CURR	n_previous_credit_card_applications	n_previous_credit_card_applications_completed	n_previous_credit_card_applications_active	n_contracts_credit_card_active	n_contracts_credit_card_completed	amt_balance_credit_card_max	amt_balance_credit_card_mean	amt_balance_credit_card_std	amt_balance_credit_card_range	amt_credit_limit_actual_min	amt_credit_limit_actual_max	amt_credit_limit_actual_mean	amt_credit_limit_actual_std	amt_credit_limit_actual_median	amt_credit_limit_actual_range	amt_drawings_atm_current_min	amt_drawings_atm_current_max	amt_drawings_atm_current_mean	amt_drawings_atm_current_std	amt_drawings_atm_current_median	amt_drawings_atm_current_range	amt_drawings_current_max	amt_drawings_current_mean	amt_drawings_current_std	amt_drawings_current_range	amt_drawings_other_current_min	amt_drawings_other_current_max	amt_drawings_other_current_mean	amt_drawings_other_current_std	amt_drawings_other_current_median	amt_drawings_other_current_range	amt_drawings_pos_current_min	amt_drawings_pos_current_max	amt_drawings_pos_current_mean	amt_drawings_pos_current_std	amt_drawings_pos_current_median	amt_drawings_pos_current_range	amt_inst_min_regularity_max	amt_inst_min_regularity_mean	amt_inst_min_regularity_std	amt_inst_min_regularity_range	amt_payment_current_min	amt_payment_current_max	amt_payment_current_mean	amt_payment_current_std	amt_payment_current_median	amt_payment_current_range	amt_payment_total_current_max	amt_payment_total_current_mean	amt_payment_total_current_std	amt_payment_total_current_range	amt_receivable_principal_max	amt_receivable_principal_mean	amt_receivable_principal_std	amt_receivable_principal_range	amt_receivable_min	amt_receivable_max	amt_receivable_mean	amt_receivable_std	amt_receivable_range	amt_total_receivable_min	amt_total_receivable_max	amt_total_receivable_mean	amt_total_receivable_std	amt_total_receivable_range	cnt_drawings_atm_current_min	cnt_drawings_atm_current_max	cnt_drawings_atm_current_mean	cnt_drawings_atm_current_std	cnt_drawings_atm_current_median	cnt_drawings_atm_current_range	cnt_drawings_current_max	cnt_drawings_current_mean	cnt_drawings_current_std	cnt_drawings_current_range	cnt_drawings_other_current_min	cnt_drawings_other_current_max	cnt_drawings_other_current_mean	cnt_drawings_other_current_std	cnt_drawings_other_current_median	cnt_drawings_other_current_range	cnt_drawings_pos_current_min	cnt_drawings_pos_current_max	cnt_drawings_pos_current_mean	cnt_drawings_pos_current_std	cnt_drawings_pos_current_median	cnt_drawings_pos_current_range	cnt_installment_mature_cum_min	cnt_installment_mature_cum_max	cnt_installment_mature_cum_mean	cnt_installment_mature_cum_std	cnt_installment_mature_cum_median	cnt_installment_mature_cum_range	sk_dpd_credit_card_max	sk_dpd_credit_card_mean	sk_dpd_credit_card_std	sk_dpd_credit_card_range	sk_dpd_def_credit_card_max	sk_dpd_def_credit_card_mean	sk_dpd_def_credit_card_std	sk_dpd_def_credit_card_range
0	100006	6	0	6	6	0	0.00	0.00	0.00	0.00	270000	270000	270000.00	0.00	270000.00	0	NaN	NaN	NaN	NaN	NaN	NaN	0.00	0.00	0.00	0.00	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0.00	0.00	0.00	0.00	NaN	NaN	NaN	NaN	NaN	NaN	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	NaN	NaN	NaN	NaN	NaN	NaN	0	0.00	0.00	0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0.00	0.00	0.00	0.00	0.00	0.00	0	0.00	0.00	0	0	0.00	0.00	0
1	100011	74	0	74	74	0	189000.00	54482.11	68127.24	189000.00	90000	180000	164189.19	34482.74	180000.00	90000	0.00	180000.00	2432.43	20924.57	0.00	0.00	180000.00	2432.43	20924.57	180000.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	9000.00	3956.22	4487.75	9000.00	0.00	55485.00	4843.06	7279.60	563.36	55485.00	55485.00	4520.07	7473.87	55485.00	180000.00	52402.09	65758.82	180000.00	-563.36	189000.00	54433.18	68166.97	189563.36	-563.36	189000.00	54433.18	68166.97	189563.36	0.00	4.00	0.05	0.46	0.00	4.00	4	0.05	0.46	4	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	1.00	33.00	25.77	10.29	33.00	32.00	0	0.00	0.00	0	0	0.00	0.00	0
2	100013	96	0	96	96	0	161420.22	18159.92	43237.41	161420.22	45000	157500	131718.75	47531.59	157500.00	112500	0.00	157500.00	6350.00	28722.27	0.00	0.00	157500.00	5953.12	27843.37	157500.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	7875.00	1454.54	3028.41	7875.00	0.00	153675.00	7168.35	21626.14	274.32	153675.00	153675.00	6817.17	21730.66	153675.00	157500.00	17255.56	41279.75	157500.00	-274.32	161420.22	18101.08	43262.03	161694.54	-274.32	161420.22	18101.08	43262.03	161694.54	0.00	7.00	0.26	1.19	0.00	7.00	7	0.24	1.15	7	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	1.00	22.00	18.72	5.85	22.00	21.00	1	0.01	0.10	1	1	0.01	0.10	1
3	100021	17	10	7	7	10	0.00	0.00	0.00	0.00	675000	675000	675000.00	0.00	675000.00	0	NaN	NaN	NaN	NaN	NaN	NaN	0.00	0.00	0.00	0.00	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0.00	0.00	0.00	0.00	NaN	NaN	NaN	NaN	NaN	NaN	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	NaN	NaN	NaN	NaN	NaN	NaN	0	0.00	0.00	0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0.00	0.00	0.00	0.00	0.00	0.00	0	0.00	0.00	0	0	0.00	0.00	0
4	100023	8	0	8	8	0	0.00	0.00	0.00	0.00	45000	225000	135000.00	96214.05	135000.00	180000	NaN	NaN	NaN	NaN	NaN	NaN	0.00	0.00	0.00	0.00	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0.00	0.00	0.00	0.00	NaN	NaN	NaN	NaN	NaN	NaN	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	NaN	NaN	NaN	NaN	NaN	NaN	0	0.00	0.00	0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0.00	0.00	0.00	0.00	0.00	0.00	0	0.00	0.00	0	0	0.00	0.00	0
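Several clients in the output above (e.g., `SK_ID_CURR` 100006) have `NaN` in all statistics of some variables. This is the expected pandas behavior when every source value in a group is missing: aggregations such as `"min"`, `"max"`, and `"mean"` skip missing values and return `NaN` if nothing is left. A minimal sketch on made-up data (the toy values and the names `amt_min`, `amt_mean`, `amt_range`, and `n_observed` are assumptions):

```python
import numpy as np
import pandas as pd

# Client 1 has no recorded ATM drawings at all (hypothetical values).
toy = pd.DataFrame(
    {
        "SK_ID_CURR": [1, 1, 2, 2],
        "AMT_DRAWINGS_ATM_CURRENT": [np.nan, np.nan, 0.0, 2250.0],
    }
)

agg = toy.groupby("SK_ID_CURR").agg(
    amt_min=("AMT_DRAWINGS_ATM_CURRENT", "min"),
    amt_mean=("AMT_DRAWINGS_ATM_CURRENT", "mean"),
    amt_range=("AMT_DRAWINGS_ATM_CURRENT", lambda x: x.max() - x.min()),
    # "count" ignores missing values, so it shows how much data was observed
    n_observed=("AMT_DRAWINGS_ATM_CURRENT", "count"),
)
print(agg)
```

A count of non-missing values per group, as in `n_observed`, could help to tell "no activity recorded" apart from "activity with zero amount".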

Code

an.col_info(credit_card_balance_aggregated, style=True)

	column	data_type	memory_size	n_unique	p_unique	n_missing	p_missing	n_dominant	p_dominant	p_dom_excl_na	dominant
1	SK_ID_CURR	int32	414.2 kB	103,558	100.0%	0	0%	1	<0.1%	<0.1%	100006
2	n_previous_credit_card_applications	int16	207.1 kB	132	0.1%	0	0%	7,606	7.3%	7.3%	96
3	n_previous_credit_card_applications_completed	int8	103.6 kB	44	<0.1%	0	0%	90,377	87.3%	87.3%	0
4	n_previous_credit_card_applications_active	int16	207.1 kB	104	0.1%	0	0%	6,697	6.5%	6.5%	96
5	n_previous_credit_card_applications_signed	int8	103.6 kB	45	<0.1%	0	0%	98,629	95.2%	95.2%	0
6	n_contracts_credit_card_active	int16	207.1 kB	104	0.1%	0	0%	6,697	6.5%	6.5%	96
7	n_contracts_credit_card_completed	int8	103.6 kB	44	<0.1%	0	0%	90,377	87.3%	87.3%	0
8	n_contracts_credit_card_signed	int8	103.6 kB	45	<0.1%	0	0%	98,629	95.2%	95.2%	0
9	amt_balance_credit_card_min	float64	828.5 kB	13,320	12.9%	0	0%	88,898	85.8%	85.8%	0.0
10	amt_balance_credit_card_max	float64	828.5 kB	66,374	64.1%	0	0%	33,355	32.2%	32.2%	0.0
11	amt_balance_credit_card_mean	float64	828.5 kB	70,080	67.7%	0	0%	33,325	32.2%	32.2%	0.0
12	amt_balance_credit_card_std	float64	828.5 kB	69,961	67.6%	692	0.7%	32,822	31.7%	31.9%	0.0
13	amt_balance_credit_card_median	float64	828.5 kB	45,779	44.2%	0	0%	57,018	55.1%	55.1%	0.0
14	amt_balance_credit_card_range	float64	828.5 kB	66,657	64.4%	0	0%	33,514	32.4%	32.4%	0.0
15	amt_credit_limit_actual_min	int32	414.2 kB	180	0.2%	0	0%	26,273	25.4%	25.4%	45000
16	amt_credit_limit_actual_max	int32	414.2 kB	54	0.1%	0	0%	14,894	14.4%	14.4%	135000
17	amt_credit_limit_actual_mean	float64	828.5 kB	13,036	12.6%	0	0%	5,575	5.4%	5.4%	45000.0
18	amt_credit_limit_actual_std	float64	828.5 kB	26,234	25.3%	692	0.7%	43,431	41.9%	42.2%	0.0
19	amt_credit_limit_actual_median	float32	414.2 kB	191	0.2%	0	0%	12,884	12.4%	12.4%	0.0
20	amt_credit_limit_actual_range	int32	414.2 kB	185	0.2%	0	0%	44,123	42.6%	42.6%	0
21	amt_drawings_atm_current_min	float64	828.5 kB	144	0.1%	31,364	30.3%	71,267	68.8%	98.7%	0.0
22	amt_drawings_atm_current_max	float64	828.5 kB	1,370	1.3%	31,364	30.3%	12,162	11.7%	16.8%	0.0
23	amt_drawings_atm_current_mean	float64	828.5 kB	24,822	24.0%	31,364	30.3%	12,162	11.7%	16.8%	0.0
24	amt_drawings_atm_current_std	float64	828.5 kB	50,080	48.4%	31,866	30.8%	11,947	11.5%	16.7%	0.0
25	amt_drawings_atm_current_median	float64	828.5 kB	458	0.4%	31,364	30.3%	61,890	59.8%	85.7%	0.0
26	amt_drawings_atm_current_range	float32	414.2 kB	1	<0.1%	31,364	30.3%	72,194	69.7%	100.0%	0.0
27	amt_drawings_current_min	float64	828.5 kB	2,363	2.3%	0	0%	100,558	97.1%	97.1%	0.0
28	amt_drawings_current_max	float64	828.5 kB	28,333	27.4%	0	0%	33,294	32.2%	32.2%	0.0
29	amt_drawings_current_mean	float64	828.5 kB	57,397	55.4%	0	0%	33,293	32.1%	32.1%	0.0
30	amt_drawings_current_std	float64	828.5 kB	65,572	63.3%	692	0.7%	32,811	31.7%	31.9%	0.0
31	amt_drawings_current_median	float64	828.5 kB	15,653	15.1%	0	0%	81,097	78.3%	78.3%	0.0
32	amt_drawings_current_range	float64	828.5 kB	28,388	27.4%	0	0%	33,503	32.4%	32.4%	0.0
33	amt_drawings_other_current_min	float32	414.2 kB	7	<0.1%	31,364	30.3%	72,188	69.7%	>99.9%	0.0
34	amt_drawings_other_current_max	float64	828.5 kB	1,482	1.4%	31,364	30.3%	65,693	63.4%	91.0%	0.0
35	amt_drawings_other_current_mean	float64	828.5 kB	4,397	4.2%	31,364	30.3%	65,693	63.4%	91.0%	0.0
36	amt_drawings_other_current_std	float64	828.5 kB	5,323	5.1%	31,866	30.8%	65,195	63.0%	90.9%	0.0
37	amt_drawings_other_current_median	float64	828.5 kB	45	<0.1%	31,364	30.3%	72,137	69.7%	99.9%	0.0
38	amt_drawings_other_current_range	float64	828.5 kB	1,481	1.4%	31,364	30.3%	65,697	63.4%	91.0%	0.0
39	amt_drawings_pos_current_min	float64	828.5 kB	2,906	2.8%	31,364	30.3%	68,950	66.6%	95.5%	0.0
40	amt_drawings_pos_current_max	float64	828.5 kB	33,877	32.7%	31,364	30.3%	31,370	30.3%	43.5%	0.0
41	amt_drawings_pos_current_mean	float64	828.5 kB	39,808	38.4%	31,364	30.3%	31,370	30.3%	43.5%	0.0
42	amt_drawings_pos_current_std	float64	828.5 kB	40,122	38.7%	31,866	30.8%	31,159	30.1%	43.5%	0.0
43	amt_drawings_pos_current_median	float64	828.5 kB	14,228	13.7%	31,364	30.3%	56,554	54.6%	78.3%	0.0
44	amt_drawings_pos_current_range	float64	828.5 kB	33,748	32.6%	31,364	30.3%	31,661	30.6%	43.9%	0.0
45	amt_inst_min_regularity_min	float64	828.5 kB	2,652	2.6%	0	0%	98,268	94.9%	94.9%	0.0
46	amt_inst_min_regularity_max	float64	828.5 kB	37,619	36.3%	0	0%	33,662	32.5%	32.5%	0.0
47	amt_inst_min_regularity_mean	float64	828.5 kB	67,591	65.3%	0	0%	33,662	32.5%	32.5%	0.0
48	amt_inst_min_regularity_std	float64	828.5 kB	67,817	65.5%	692	0.7%	33,542	32.4%	32.6%	0.0
49	amt_inst_min_regularity_median	float64	828.5 kB	28,060	27.1%	0	0%	57,678	55.7%	55.7%	0.0
50	amt_inst_min_regularity_range	float64	828.5 kB	38,181	36.9%	0	0%	34,234	33.1%	33.1%	0.0
51	amt_payment_current_min	float64	828.5 kB	11,528	11.1%	31,438	30.4%	45,218	43.7%	62.7%	0.0
52	amt_payment_current_max	float64	828.5 kB	29,790	28.8%	31,438	30.4%	1,552	1.5%	2.2%	22500.0
53	amt_payment_current_mean	float64	828.5 kB	66,748	64.5%	31,438	30.4%	143	0.1%	0.2%	0.0
54	amt_payment_current_std	float64	828.5 kB	69,558	67.2%	31,956	30.9%	672	0.6%	0.9%	0.0
55	amt_payment_current_median	float64	828.5 kB	25,981	25.1%	31,438	30.4%	4,491	4.3%	6.2%	9000.0
56	amt_payment_current_range	float64	828.5 kB	35,713	34.5%	31,438	30.4%	1,190	1.1%	1.7%	0.0
57	amt_payment_total_current_min	float64	828.5 kB	1,754	1.7%	0	0%	100,571	97.1%	97.1%	0.0
58	amt_payment_total_current_max	float64	828.5 kB	35,265	34.1%	0	0%	31,936	30.8%	30.8%	0.0
59	amt_payment_total_current_mean	float64	828.5 kB	67,932	65.6%	0	0%	31,936	30.8%	30.8%	0.0
60	amt_payment_total_current_std	float64	828.5 kB	70,720	68.3%	692	0.7%	31,367	30.3%	30.5%	0.0
61	amt_payment_total_current_median	float64	828.5 kB	21,251	20.5%	0	0%	52,198	50.4%	50.4%	0.0
62	amt_payment_total_current_range	float64	828.5 kB	35,845	34.6%	0	0%	32,059	31.0%	31.0%	0.0
63	amt_receivable_principal_min	float64	828.5 kB	9,864	9.5%	0	0%	90,893	87.8%	87.8%	0.0
64	amt_receivable_principal_max	float64	828.5 kB	54,361	52.5%	0	0%	34,174	33.0%	33.0%	0.0
65	amt_receivable_principal_mean	float64	828.5 kB	68,980	66.6%	0	0%	34,137	33.0%	33.0%	0.0
66	amt_receivable_principal_std	float64	828.5 kB	69,008	66.6%	692	0.7%	33,638	32.5%	32.7%	0.0
67	amt_receivable_principal_median	float64	828.5 kB	42,298	40.8%	0	0%	60,279	58.2%	58.2%	0.0
68	amt_receivable_principal_range	float64	828.5 kB	56,103	54.2%	0	0%	34,330	33.2%	33.2%	0.0
69	amt_receivable_min	float64	828.5 kB	22,497	21.7%	0	0%	75,273	72.7%	72.7%	0.0
70	amt_receivable_max	float64	828.5 kB	66,001	63.7%	0	0%	33,593	32.4%	32.4%	0.0
71	amt_receivable_mean	float64	828.5 kB	70,224	67.8%	0	0%	33,034	31.9%	31.9%	0.0
72	amt_receivable_std	float64	828.5 kB	70,144	67.7%	692	0.7%	32,520	31.4%	31.6%	0.0
73	amt_receivable_median	float64	828.5 kB	44,334	42.8%	0	0%	58,685	56.7%	56.7%	0.0
74	amt_receivable_range	float64	828.5 kB	67,801	65.5%	0	0%	33,212	32.1%	32.1%	0.0
75	amt_total_receivable_min	float64	828.5 kB	22,496	21.7%	0	0%	75,274	72.7%	72.7%	0.0
76	amt_total_receivable_max	float64	828.5 kB	66,012	63.7%	0	0%	33,592	32.4%	32.4%	0.0
77	amt_total_receivable_mean	float64	828.5 kB	70,224	67.8%	0	0%	33,034	31.9%	31.9%	0.0
78	amt_total_receivable_std	float64	828.5 kB	70,145	67.7%	692	0.7%	32,520	31.4%	31.6%	0.0
79	amt_total_receivable_median	float64	828.5 kB	44,334	42.8%	0	0%	58,685	56.7%	56.7%	0.0
80	amt_total_receivable_range	float64	828.5 kB	67,802	65.5%	0	0%	33,212	32.1%	32.1%	0.0
81	cnt_drawings_atm_current_min	float32	414.2 kB	23	<0.1%	31,364	30.3%	71,268	68.8%	98.7%	0.0
82	cnt_drawings_atm_current_max	float32	414.2 kB	44	<0.1%	31,364	30.3%	12,162	11.7%	16.8%	0.0
83	cnt_drawings_atm_current_mean	float32	414.2 kB	3,544	3.4%	31,364	30.3%	12,162	11.7%	16.8%	0.0
84	cnt_drawings_atm_current_std	float32	414.2 kB	23,861	23.0%	31,866	30.8%	11,970	11.6%	16.7%	0.0
85	cnt_drawings_atm_current_median	float32	414.2 kB	38	<0.1%	31,364	30.3%	61,890	59.8%	85.7%	0.0
86	cnt_drawings_atm_current_range	float32	414.2 kB	44	<0.1%	31,364	30.3%	12,472	12.0%	17.3%	0.0
87	cnt_drawings_current_min	int16	207.1 kB	45	<0.1%	0	0%	100,581	97.1%	97.1%	0
88	cnt_drawings_current_max	int16	207.1 kB	123	0.1%	0	0%	33,866	32.7%	32.7%	0
89	cnt_drawings_current_mean	float32	414.2 kB	7,111	6.9%	0	0%	33,866	32.7%	32.7%	0.0
90	cnt_drawings_current_std	float32	414.2 kB	38,027	36.7%	692	0.7%	33,390	32.2%	32.5%	0.0
91	cnt_drawings_current_median	float32	414.2 kB	121	0.1%	0	0%	81,288	78.5%	78.5%	0.0
92	cnt_drawings_current_range	int16	207.1 kB	123	0.1%	0	0%	34,082	32.9%	32.9%	0
93	cnt_drawings_other_current_min	float32	414.2 kB	3	<0.1%	31,364	30.3%	72,188	69.7%	>99.9%	0.0
94	cnt_drawings_other_current_max	float32	414.2 kB	11	<0.1%	31,364	30.3%	65,675	63.4%	91.0%	0.0
95	cnt_drawings_other_current_mean	float32	414.2 kB	470	0.5%	31,364	30.3%	65,675	63.4%	91.0%	0.0
96	cnt_drawings_other_current_std	float32	414.2 kB	958	0.9%	31,866	30.8%	65,178	62.9%	90.9%	0.0
97	cnt_drawings_other_current_median	float32	414.2 kB	5	<0.1%	31,364	30.3%	72,137	69.7%	99.9%	0.0
98	cnt_drawings_other_current_range	float32	414.2 kB	11	<0.1%	31,364	30.3%	65,680	63.4%	91.0%	0.0
99	cnt_drawings_pos_current_min	float32	414.2 kB	48	<0.1%	31,364	30.3%	68,950	66.6%	95.5%	0.0
100	cnt_drawings_pos_current_max	float32	414.2 kB	128	0.1%	31,364	30.3%	31,370	30.3%	43.5%	0.0
101	cnt_drawings_pos_current_mean	float32	414.2 kB	5,492	5.3%	31,364	30.3%	31,370	30.3%	43.5%	0.0
102	cnt_drawings_pos_current_std	float32	414.2 kB	21,472	20.7%	31,866	30.8%	31,173	30.1%	43.5%	0.0
103	cnt_drawings_pos_current_median	float32	414.2 kB	123	0.1%	31,364	30.3%	56,554	54.6%	78.3%	0.0
104	cnt_drawings_pos_current_range	float32	414.2 kB	124	0.1%	31,364	30.3%	31,675	30.6%	43.9%	0.0
105	cnt_installment_mature_cum_min	float32	414.2 kB	30	<0.1%	0	0%	66,905	64.6%	64.6%	0.0
106	cnt_installment_mature_cum_max	float32	414.2 kB	121	0.1%	0	0%	33,312	32.2%	32.2%	0.0
107	cnt_installment_mature_cum_mean	float32	414.2 kB	15,471	14.9%	0	0%	33,312	32.2%	32.2%	0.0
108	cnt_installment_mature_cum_std	float32	414.2 kB	19,210	18.5%	692	0.7%	33,274	32.1%	32.3%	0.0
109	cnt_installment_mature_cum_median	float32	414.2 kB	148	0.1%	0	0%	35,237	34.0%	34.0%	0.0
110	cnt_installment_mature_cum_range	float32	414.2 kB	96	0.1%	0	0%	33,966	32.8%	32.8%	0.0
111	sk_dpd_credit_card_min	int16	207.1 kB	2	<0.1%	0	0%	103,557	>99.9%	>99.9%	0
112	sk_dpd_credit_card_max	int16	207.1 kB	438	0.4%	0	0%	82,898	80.0%	80.0%	0
113	sk_dpd_credit_card_mean	float32	414.2 kB	3,945	3.8%	0	0%	82,898	80.0%	80.0%	0.0
114	sk_dpd_credit_card_std	float32	414.2 kB	5,159	5.0%	692	0.7%	82,206	79.4%	79.9%	0.0
115	sk_dpd_credit_card_median	float32	414.2 kB	292	0.3%	0	0%	102,677	99.1%	99.1%	0.0
116	sk_dpd_credit_card_range	int16	207.1 kB	438	0.4%	0	0%	82,898	80.0%	80.0%	0
117	sk_dpd_def_credit_card_min	int16	207.1 kB	2	<0.1%	0	0%	103,557	>99.9%	>99.9%	0
118	sk_dpd_def_credit_card_max	int16	207.1 kB	62	0.1%	0	0%	86,529	83.6%	83.6%	0
119	sk_dpd_def_credit_card_mean	float32	414.2 kB	1,629	1.6%	0	0%	86,529	83.6%	83.6%	0.0
120	sk_dpd_def_credit_card_std	float32	414.2 kB	2,275	2.2%	692	0.7%	85,837	82.9%	83.4%	0.0
121	sk_dpd_def_credit_card_median	float32	414.2 kB	26	<0.1%	0	0%	103,494	99.9%	99.9%	0.0
122	sk_dpd_def_credit_card_range	int16	207.1 kB	62	0.1%	0	0%	86,529	83.6%	83.6%	0
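
The summary columns in the table above appear to be produced by the project's own col_info helper, whose implementation is not shown here. As a rough sketch only, and assuming that p_dom_excl_na is the share of the dominant value among non-missing rows (consistent with, e.g., cnt_drawings_atm_current_min: 68.8% / (100% - 30.3%) ≈ 98.7%), similar statistics could be computed as below. The function summarize_columns and the toy data are hypothetical and not part of the project code.

```python
import numpy as np
import pandas as pd


def summarize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column summary loosely mirroring the table columns (hypothetical helper)."""
    n_rows = len(df)
    rows = []
    for name in df.columns:
        col = df[name]
        # Frequencies of non-missing values, most frequent first
        counts = col.value_counts(dropna=True)
        n_missing = int(col.isna().sum())
        n_present = n_rows - n_missing
        n_dominant = int(counts.iloc[0]) if len(counts) else 0
        rows.append(
            {
                "column": name,
                "data_type": str(col.dtype),
                "n_unique": int(col.nunique(dropna=True)),
                "n_missing": n_missing,
                "p_missing": n_missing / n_rows,
                "n_dominant": n_dominant,
                "p_dominant": n_dominant / n_rows,
                # Assumption: dominant share among non-missing rows only
                "p_dom_excl_na": n_dominant / n_present if n_present else np.nan,
                "dominant": counts.index[0] if len(counts) else None,
            }
        )
    return pd.DataFrame(rows)


# Toy data: 3 zeros, one other value, 4 missing values
toy = pd.DataFrame({"x": [0.0, 0.0, 0.0, 5.0, np.nan, np.nan, np.nan, np.nan]})
summary = summarize_columns(toy)
print(summary)
```

Whether the original helper counts missing values as a separate unique value is not known; the sketch excludes them.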

5.7 Merge and Further Pre-Process Tables

The tables listed below should be merged on the SK_ID_CURR variable: the application table (application_train at first) serves as the base table, and each aggregated table is left-joined to it:

application_train
bureau_aggregated
bureau_balance_aggregated
previous_application_aggregated
installments_payments_aggregated
pos_cash_balance_aggregated
credit_card_balance_aggregated

First, the aggregated data will be merged with the application_train table and inspected.
In Section 6.1, the features that are either redundant (duplicated or correlated with other features) or problematic (e.g., constant or almost constant) will be identified based on the training set only, and the code for the required pre-processing (e.g., removing the unnecessary columns) will be written.
In Section 6.2, the training, validation, and test datasets will be created: the datasets with aggregated features will be merged with the application_train, application_validation, and application_test datasets, respectively, and the pre-processing steps defined in the previous sections will be applied.

Code

def merge_credit_history(to, on="SK_ID_CURR"):
    """Left-join all aggregated credit history tables to the `to` DataFrame."""
    merged = (
        to.merge(bureau_aggregated, on=on, how="left", suffixes=("", "_bureau"))
        .merge(
            bureau_balance_aggregated,
            on=on,
            how="left",
            suffixes=("", "_bureau_balance"),
        )
        .merge(
            previous_application_aggregated,
            on=on,
            how="left",
            suffixes=("", "_previous_application"),
        )
        .merge(
            installments_payments_aggregated,
            on=on,
            how="left",
            suffixes=("", "_installments_payments"),
        )
        .merge(
            pos_cash_balance_aggregated,
            on=on,
            how="left",
            suffixes=("", "_pos_cash_balance"),
        )
        .merge(
            credit_card_balance_aggregated,
            on=on,
            how="left",
            suffixes=("", "_credit_card_balance"),
        )
    )
    return merged
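
A left join keeps every row of the application table only if each aggregated table has at most one row per SK_ID_CURR. The following minimal sketch on toy data (the two small DataFrames are hypothetical) shows how the validate and indicator arguments of pandas merge could be used to check this assumption and to count applicants without credit history. It is an illustration, not part of the project pipeline.

```python
import pandas as pd

# Hypothetical toy tables: three applicants, two of them with bureau history
applications = pd.DataFrame(
    {"SK_ID_CURR": [1, 2, 3], "AMT_CREDIT": [100.0, 200.0, 300.0]}
)
bureau_agg = pd.DataFrame({"SK_ID_CURR": [1, 3], "n_credits_total": [4, 2]})

merged = applications.merge(
    bureau_agg,
    on="SK_ID_CURR",
    how="left",
    validate="one_to_one",  # raises MergeError if keys are duplicated on either side
    indicator=True,  # adds a "_merge" column: "both" or "left_only"
)

# Applicants with no matching row in the aggregated table get NaN features
n_without_history = int((merged["_merge"] == "left_only").sum())
print(merged)
print("Applicants without bureau history:", n_without_history)
```

Unmatched rows receive missing values, which is why count-like aggregated features such as n_credits_total end up with a floating-point data type and a non-zero share of missing values after merging.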

Code

def preprocess_credit_data(df):
    """Engineer features, mark missing values, and drop unused columns."""
    education_values = [
        "Lower secondary",
        "Secondary / secondary special",
        "Incomplete higher",
        "Higher education",
        "Academic degree",
    ]

    education_dtype = pd.CategoricalDtype(categories=education_values, ordered=True)

    return (
        df.rename(
            columns={"n_nflag_insured_on_approval_any": "any_nflag_insured_on_approval"}
        )
        .assign(
            # Feature engineering
            any_nflag_insured_on_approval=lambda df: (
                df["any_nflag_insured_on_approval"] == "True"
            ).astype("Int8"),
            FLAG_OWN_CAR=lambda df: (df["FLAG_OWN_CAR"] == "Y").astype("Int8"),
            FLAG_OWN_REALTY=lambda df: (df["FLAG_OWN_REALTY"] == "Y").astype("Int8"),
            FLAG_IS_EMERGENCY=lambda df: (df["EMERGENCYSTATE_MODE"] == "Yes").astype(
                "Int8"
            ),
            NAME_EDUCATION_TYPE=lambda df: df["NAME_EDUCATION_TYPE"].astype(
                education_dtype
            ),
            ord_education_type=lambda df: df["NAME_EDUCATION_TYPE"].cat.codes,
            flag_has_children=lambda df: (df["CNT_CHILDREN"] > 0).astype("Int8"),
            DAYS_EMPLOYED=lambda df: df["DAYS_EMPLOYED"].replace(365243, np.nan),
            years_employed=lambda df: df["DAYS_EMPLOYED"] / -365,
            amt_income_total_per_family_member=lambda df: df["AMT_INCOME_TOTAL"]
            / df["CNT_FAM_MEMBERS"],
            cnt_fam_members_excluding_children=lambda df: df["CNT_FAM_MEMBERS"]
            - df["CNT_CHILDREN"],
            amt_annuity_to_credit_ratio=lambda df: df["AMT_ANNUITY"] / df["AMT_CREDIT"],
            amt_annuity_to_income_ratio=lambda df: df["AMT_ANNUITY"]
            / df["AMT_INCOME_TOTAL"],
            amt_credit_to_income_ratio=lambda df: df["AMT_CREDIT"]
            / df["AMT_INCOME_TOTAL"],
            amt_annuity_to_income_per_family_member=lambda df: df["AMT_ANNUITY"]
            / df["amt_income_total_per_family_member"],
            # Make explicit the missing values: XNA → NaN
            ORGANIZATION_TYPE=lambda df: df["ORGANIZATION_TYPE"].replace("XNA", np.nan),
        )
        .drop(
            columns=[
                "SK_ID_CURR",
                # Restricted by legal constraints
                "CODE_GENDER",
                "NAME_FAMILY_STATUS",
                "DAYS_BIRTH",
                # Not useful, unethical
                "WEEKDAY_APPR_PROCESS_START",
                "HOUR_APPR_PROCESS_START",
                # Already used/processed
                "EMERGENCYSTATE_MODE",
                "DAYS_EMPLOYED",
            ]
        )
    )
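
Two of the steps above may be easier to follow on a small example: replacing the DAYS_EMPLOYED placeholder value 365243 with a missing value before deriving years_employed, and turning the ordered education categories into integer codes. The sketch below uses hypothetical toy data and a shortened category list. Note that cat.codes encodes missing values as -1.

```python
import numpy as np
import pandas as pd

# Shortened, hypothetical version of the ordered education scale
education_dtype = pd.CategoricalDtype(
    categories=["Lower secondary", "Secondary / secondary special", "Higher education"],
    ordered=True,
)

toy = pd.DataFrame(
    {
        "DAYS_EMPLOYED": [-730, 365243, -3650],
        "NAME_EDUCATION_TYPE": ["Higher education", "Lower secondary", None],
    }
)

# Keyword arguments of assign() are evaluated in order, so later lambdas
# already see the columns replaced by earlier ones.
out = toy.assign(
    DAYS_EMPLOYED=lambda d: d["DAYS_EMPLOYED"].replace(365243, np.nan),
    years_employed=lambda d: d["DAYS_EMPLOYED"] / -365,
    NAME_EDUCATION_TYPE=lambda d: d["NAME_EDUCATION_TYPE"].astype(education_dtype),
    ord_education_type=lambda d: d["NAME_EDUCATION_TYPE"].cat.codes,
)
print(out)
```

If the placeholder were not replaced first, the second row would yield a large negative number of years instead of a missing value.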

Code

file = dir_interim + "merged--credits_train--01.feather"

if os.path.exists(file):
    credits_train = pd.read_feather(file)
else:
    credits_train = (
        merge_credit_history(to=application_train)
        .pipe(klib.convert_datatypes)
        .pipe(preprocess_credit_data)
    )
    credits_train.to_feather(file)

del file

# Time: 1m 1.4s
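
The cell above follows a compute-once, then load-from-disk pattern. As a minimal sketch, the same idea could be wrapped in a small helper. The name load_or_compute is hypothetical, and pickle is used here instead of Feather only so that the example runs without extra dependencies; passing read=pd.read_feather and write=pd.DataFrame.to_feather should give the Feather variant.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_or_compute(path, compute, read=pd.read_pickle, write=pd.DataFrame.to_pickle):
    """Return the cached DataFrame at `path`, or compute, cache, and return it."""
    path = Path(path)
    if path.exists():
        return read(path)
    result = compute()
    write(result, path)
    return result


calls = []


def build():
    # Stand-in for the slow merge and pre-processing step
    calls.append(1)
    return pd.DataFrame({"a": [1, 2, 3]})


with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "toy.pkl"
    first = load_or_compute(cache_file, build)  # computes and writes the file
    second = load_or_compute(cache_file, build)  # reads the cached file

print("Number of times build() ran:", len(calls))
```

A cached file is not refreshed when the upstream code changes, so it has to be deleted manually after the merging or pre-processing functions are modified.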

5.8 Inspect Training Set

Code

credits_train.shape

(215257, 548)

Code

credits_train.head()

	NAME_CONTRACT_TYPE	FLAG_OWN_CAR	FLAG_OWN_REALTY	CNT_CHILDREN	AMT_INCOME_TOTAL	AMT_CREDIT	AMT_ANNUITY	AMT_GOODS_PRICE	NAME_TYPE_SUITE	NAME_INCOME_TYPE	NAME_EDUCATION_TYPE	NAME_HOUSING_TYPE	REGION_POPULATION_RELATIVE	DAYS_REGISTRATION	DAYS_ID_PUBLISH	OWN_CAR_AGE	FLAG_MOBIL	FLAG_EMP_PHONE	FLAG_CONT_MOBILE	FLAG_PHONE	FLAG_EMAIL	OCCUPATION_TYPE	CNT_FAM_MEMBERS	REGION_RATING_CLIENT	REGION_RATING_CLIENT_W_CITY	ORGANIZATION_TYPE	EXT_SOURCE_1	EXT_SOURCE_2	EXT_SOURCE_3	APARTMENTS_AVG	BASEMENTAREA_AVG	YEARS_BEGINEXPLUATATION_AVG	YEARS_BUILD_AVG	COMMONAREA_AVG	ELEVATORS_AVG	ENTRANCES_AVG	FLOORSMAX_AVG	FLOORSMIN_AVG	LANDAREA_AVG	LIVINGAPARTMENTS_AVG	LIVINGAREA_AVG	NONLIVINGAPARTMENTS_AVG	NONLIVINGAREA_AVG	APARTMENTS_MODE	BASEMENTAREA_MODE	YEARS_BEGINEXPLUATATION_MODE	YEARS_BUILD_MODE	COMMONAREA_MODE	ELEVATORS_MODE	ENTRANCES_MODE	FLOORSMAX_MODE	FLOORSMIN_MODE	LANDAREA_MODE	LIVINGAPARTMENTS_MODE	LIVINGAREA_MODE	NONLIVINGAPARTMENTS_MODE	NONLIVINGAREA_MODE	APARTMENTS_MEDI	BASEMENTAREA_MEDI	YEARS_BEGINEXPLUATATION_MEDI	YEARS_BUILD_MEDI	COMMONAREA_MEDI	ELEVATORS_MEDI	ENTRANCES_MEDI	FLOORSMAX_MEDI	FLOORSMIN_MEDI	LANDAREA_MEDI	LIVINGAPARTMENTS_MEDI	LIVINGAREA_MEDI	NONLIVINGAPARTMENTS_MEDI	NONLIVINGAREA_MEDI	FONDKAPREMONT_MODE	HOUSETYPE_MODE	TOTALAREA_MODE	WALLSMATERIAL_MODE	OBS_30_CNT_SOCIAL_CIRCLE	DEF_30_CNT_SOCIAL_CIRCLE	OBS_60_CNT_SOCIAL_CIRCLE	DEF_60_CNT_SOCIAL_CIRCLE	DAYS_LAST_PHONE_CHANGE	FLAG_DOCUMENT_3	FLAG_DOCUMENT_8	FLAG_DOCUMENT_9	AMT_REQ_CREDIT_BUREAU_HOUR	AMT_REQ_CREDIT_BUREAU_DAY	AMT_REQ_CREDIT_BUREAU_WEEK	AMT_REQ_CREDIT_BUREAU_MON	AMT_REQ_CREDIT_BUREAU_QRT	AMT_REQ_CREDIT_BUREAU_YEAR	n_credits_total	n_credits_active	n_credits_closed	n_credits_bad_debt	n_credits_sold	mode_credit_currency	n_different_currencies	n_currency_1	n_currency_2	n_currency_3	n_currency_4	days_credit_min	days_credit_max	days_credit_mean	days_credit_std	days_credit_median	days_credit_range	days_credit_overdue_min	days_credit_overdue_max	days_credit_overdue_mean	days_credit_overdue_std	
days_credit_overdue_median	days_credit_overdue_range	days_credit_enddate_min	days_credit_enddate_max	days_credit_enddate_mean	days_credit_enddate_std	days_credit_enddate_median	days_credit_enddate_range	days_enddate_fact_min	days_enddate_fact_max	days_enddate_fact_mean	days_enddate_fact_std	days_enddate_fact_median	days_enddate_fact_range	amt_credit_max_overdue_min	...	cnt_installment_future_range	cnt_installments_diff_max	cnt_installments_diff_mean	cnt_installments_diff_std	cnt_installments_diff_median	cnt_installments_diff_range	n_previous_credit_card_applications	n_previous_credit_card_applications_completed	n_previous_credit_card_applications_active	n_previous_credit_card_applications_signed	n_contracts_credit_card_active	n_contracts_credit_card_completed	n_contracts_credit_card_signed	amt_balance_credit_card_min	amt_balance_credit_card_max	amt_balance_credit_card_mean	amt_balance_credit_card_std	amt_balance_credit_card_median	amt_balance_credit_card_range	amt_credit_limit_actual_min	amt_credit_limit_actual_max	amt_credit_limit_actual_mean	amt_credit_limit_actual_std	amt_credit_limit_actual_median	amt_credit_limit_actual_range	amt_drawings_atm_current_min	amt_drawings_atm_current_max	amt_drawings_atm_current_mean	amt_drawings_atm_current_std	amt_drawings_atm_current_median	amt_drawings_atm_current_range	amt_drawings_current_min	amt_drawings_current_max	amt_drawings_current_mean	amt_drawings_current_std	amt_drawings_current_median	amt_drawings_current_range	amt_drawings_other_current_min	amt_drawings_other_current_max	amt_drawings_other_current_mean	amt_drawings_other_current_std	amt_drawings_other_current_median	amt_drawings_other_current_range	amt_drawings_pos_current_min	amt_drawings_pos_current_max	amt_drawings_pos_current_mean	amt_drawings_pos_current_std	amt_drawings_pos_current_median	amt_drawings_pos_current_range	amt_inst_min_regularity_min	amt_inst_min_regularity_max	amt_inst_min_regularity_mean	amt_inst_min_regularity_std	
amt_inst_min_regularity_median	amt_inst_min_regularity_range	amt_payment_current_min	amt_payment_current_max	amt_payment_current_mean	amt_payment_current_std	amt_payment_current_median	amt_payment_current_range	amt_payment_total_current_min	amt_payment_total_current_max	amt_payment_total_current_mean	amt_payment_total_current_std	amt_payment_total_current_median	amt_payment_total_current_range	amt_receivable_principal_min	amt_receivable_principal_max	amt_receivable_principal_mean	amt_receivable_principal_std	amt_receivable_principal_median	amt_receivable_principal_range	amt_receivable_min	amt_receivable_max	amt_receivable_mean	amt_receivable_std	amt_receivable_median	amt_receivable_range	amt_total_receivable_min	amt_total_receivable_max	amt_total_receivable_mean	amt_total_receivable_std	amt_total_receivable_median	amt_total_receivable_range	cnt_drawings_atm_current_min	cnt_drawings_atm_current_max	cnt_drawings_atm_current_mean	cnt_drawings_atm_current_std	cnt_drawings_atm_current_median	cnt_drawings_atm_current_range	cnt_drawings_current_min	cnt_drawings_current_max	cnt_drawings_current_mean	cnt_drawings_current_std	cnt_drawings_current_median	cnt_drawings_current_range	cnt_drawings_other_current_min	cnt_drawings_other_current_max	cnt_drawings_other_current_mean	cnt_drawings_other_current_std	cnt_drawings_other_current_median	cnt_drawings_other_current_range	cnt_drawings_pos_current_min	cnt_drawings_pos_current_max	cnt_drawings_pos_current_mean	cnt_drawings_pos_current_std	cnt_drawings_pos_current_median	cnt_drawings_pos_current_range	cnt_installment_mature_cum_min	cnt_installment_mature_cum_max	cnt_installment_mature_cum_mean	cnt_installment_mature_cum_std	cnt_installment_mature_cum_median	cnt_installment_mature_cum_range	sk_dpd_credit_card_min	sk_dpd_credit_card_max	sk_dpd_credit_card_mean	sk_dpd_credit_card_std	sk_dpd_credit_card_median	sk_dpd_credit_card_range	sk_dpd_def_credit_card_min	sk_dpd_def_credit_card_max	sk_dpd_def_credit_card_mean	
sk_dpd_def_credit_card_std	sk_dpd_def_credit_card_median	sk_dpd_def_credit_card_range	ord_education_type	flag_has_children	years_employed	amt_income_total_per_family_member	cnt_fam_members_excluding_children	amt_annuity_to_credit_ratio	amt_annuity_to_income_ratio	amt_credit_to_income_ratio	amt_annuity_to_income_per_family_member
0	Cash loans	1	1	2	405000.00	1971072.00	68643.00	1800000.00	Unaccompanied	Commercial associate	Higher education	House / apartment	0.01	-7460.00	-1823	13.00	1	1	1	0	0	Accountants	4.00	3	3	Self-employed	0.68	0.33	0.64	0.12	0.10	0.98	0.78	NaN	0.00	0.24	0.17	0.21	0.00	0.10	0.12	NaN	0.03	0.12	0.10	0.98	0.79	NaN	0.00	0.24	0.17	0.21	0.00	0.11	0.13	NaN	0.03	0.12	0.10	0.98	0.79	NaN	0.00	0.24	0.17	0.21	0.00	0.10	0.13	NaN	0.03	reg oper account	block of flats	0.10	Stone, brick	4.00	0.00	4.00	0.00	-2169.00	1	0	0	0.00	0.00	0.00	0.00	0.00	0.00	4.00	2.00	2.00	0.00	0.00	currency 1	1.00	4.00	0.00	0.00	0.00	-1239.00	-145.00	-846.75	489.28	-1001.50	1094.00	0.00	0.00	0.00	0.00	0.00	0.00	-746	934	51.00	698.62	8.00	1680	-746	-362	-554.00	271.53	-554.00	384	0.00	...	12.00	12.00	6.00	3.89	6.00	12.00	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	3	1	2.82	101250.00	2.00	0.03	0.17	4.87	0.68
1	Cash loans	0	1	0	337500.00	508495.50	38146.50	454500.00	Family	State servant	Higher education	House / apartment	0.01	-4054.00	-1090	NaN	1	1	1	0	0	Managers	2.00	2	2	Agriculture	NaN	0.62	0.44	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2.00	1.00	2.00	1.00	-659.00	0	1	0	0.00	0.00	0.00	0.00	0.00	6.00	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	NaN	...	24.00	11.00	5.41	3.51	5.50	11.00	11.00	0.00	11.00	0.00	11.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	765000.00	765000.00	765000.00	0.00	765000.00	0.00	NaN	NaN	NaN	NaN	NaN	NaN	0.00	0.00	0.00	0.00	0.00	0.00	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0.00	0.00	0.00	0.00	0.00	0.00	NaN	NaN	NaN	NaN	NaN	NaN	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	NaN	NaN	NaN	NaN	NaN	NaN	0.00	0.00	0.00	0.00	0.00	0.00	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	3	0	3.31	168750.00	2.00	0.08	0.11	1.51	0.23
2	Cash loans	0	1	1	112500.00	110146.50	13068.00	90000.00	Unaccompanied	Commercial associate	Secondary / secondary special	House / apartment	0.01	-5554.00	-4130	NaN	1	1	1	1	1	Laborers	3.00	2	2	Business Entity Type 3	0.36	0.65	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0.00	0.00	0.00	0.00	-172.00	0	0	1	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	NaN	...	60.00	10.00	3.23	2.82	2.50	10.00	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1	1	1.62	37500.00	2.00	0.12	0.12	0.98	0.35
3	Cash loans	0	1	2	40500.00	66384.00	3519.00	45000.00	Unaccompanied	Commercial associate	Secondary / secondary special	House / apartment	0.03	-5285.00	-5290	NaN	1	1	1	0	0	Sales staff	4.00	2	2	Self-employed	0.39	0.60	0.45	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0.00	0.00	0.00	0.00	-1576.00	1	0	0	0.00	0.00	0.00	1.00	0.00	2.00	5.00	3.00	2.00	0.00	0.00	currency 1	1.00	5.00	0.00	0.00	0.00	-1345.00	-325.00	-728.00	398.50	-545.00	1020.00	0.00	0.00	0.00	0.00	0.00	0.00	-679	30905	6060.20	13897.16	41.00	31584	-649	-518	-583.50	92.63	-583.50	131	NaN	...	24.00	24.00	8.53	7.40	6.00	24.00	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1	1	14.73	10125.00	2.00	0.05	0.09	1.64	0.35
4	Cash loans	1	0	0	225000.00	298512.00	31801.50	270000.00	Unaccompanied	Commercial associate	Secondary / secondary special	House / apartment	0.02	-86.00	-3033	11.00	1	1	1	0	0	Drivers	2.00	2	2	Construction	0.74	0.66	0.72	0.30	0.14	1.00	0.99	0.10	0.40	0.17	0.46	0.00	0.00	0.24	0.25	0.00	0.00	0.30	0.14	1.00	0.99	0.10	0.40	0.17	0.46	0.00	0.00	0.26	0.26	0.00	0.00	0.30	0.14	1.00	0.99	0.10	0.40	0.17	0.46	0.00	0.00	0.25	0.25	0.00	0.00	reg oper account	block of flats	0.27	Stone, brick	3.00	0.00	3.00	0.00	-624.00	1	0	0	0.00	0.00	0.00	0.00	0.00	0.00	3.00	1.00	2.00	0.00	0.00	currency 1	1.00	3.00	0.00	0.00	0.00	-2861.00	-965.00	-1644.00	1056.31	-1106.00	1896.00	0.00	0.00	0.00	0.00	0.00	0.00	-2526	703	-569.67	1719.64	114.00	3229	-2501	-723	-1612.00	1257.24	-1612.00	1778	41400.00	...	10.00	5.00	2.50	1.87	2.50	5.00	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1	0	3.27	112500.00	2.00	0.11	0.14	1.33	0.28

5 rows × 548 columns

Info on all columns:

Column info (whole dataset)

credits_train_col_info = an.col_info(credits_train)
credits_train_col_info.pipe(an.style_col_info)

Table 5.1. All columns of the merged dataset.

	column	data_type	memory_size	n_unique	p_unique	n_missing	p_missing	n_dominant	p_dominant	p_dom_excl_na	dominant
1	TARGET	int8	215.3 kB	2	<0.1%	0	0%	197,880	91.9%	91.9%	0
2	NAME_CONTRACT_TYPE	category	215.5 kB	2	<0.1%	0	0%	194,675	90.4%	90.4%	Cash loans
3	FLAG_OWN_CAR	Int8	430.5 kB	2	<0.1%	0	0%	142,086	66.0%	66.0%	0
4	FLAG_OWN_REALTY	Int8	430.5 kB	2	<0.1%	0	0%	149,412	69.4%	69.4%	1
5	CNT_CHILDREN	int8	215.3 kB	12	<0.1%	0	0%	150,641	70.0%	70.0%	0
6	AMT_INCOME_TOTAL	float64	1.7 MB	1,949	0.9%	0	0%	24,982	11.6%	11.6%	135000.0
7	AMT_CREDIT	float32	861.0 kB	5,097	2.4%	0	0%	6,823	3.2%	3.2%	450000.0
8	AMT_ANNUITY	float32	861.0 kB	12,801	5.9%	8	<0.1%	4,499	2.1%	2.1%	9000.0
9	AMT_GOODS_PRICE	float32	861.0 kB	828	0.4%	187	0.1%	18,194	8.5%	8.5%	450000.0
10	NAME_TYPE_SUITE	category	216.0 kB	7	<0.1%	901	0.4%	174,089	80.9%	81.2%	Unaccompanied
11	NAME_INCOME_TYPE	category	216.1 kB	8	<0.1%	0	0%	110,984	51.6%	51.6%	Working
12	NAME_EDUCATION_TYPE	category	215.8 kB	5	<0.1%	0	0%	152,993	71.1%	71.1%	Secondary / secondary special
13	NAME_HOUSING_TYPE	category	215.9 kB	6	<0.1%	0	0%	191,159	88.8%	88.8%	House / apartment
14	REGION_POPULATION_RELATIVE	float32	861.0 kB	81	<0.1%	0	0%	11,494	5.3%	5.3%	0.035792
15	DAYS_REGISTRATION	float32	861.0 kB	15,249	7.1%	0	0%	79	<0.1%	<0.1%	-7.0
16	DAYS_ID_PUBLISH	int16	430.5 kB	6,122	2.8%	0	0%	119	0.1%	0.1%	-4074
17	OWN_CAR_AGE	float32	861.0 kB	61	<0.1%	142,091	66.0%	5,232	2.4%	7.2%	7.0
18	FLAG_MOBIL	int8	215.3 kB	2	<0.1%	0	0%	215,256	>99.9%	>99.9%	1
19	FLAG_EMP_PHONE	int8	215.3 kB	2	<0.1%	0	0%	176,491	82.0%	82.0%	1
20	FLAG_WORK_PHONE	int8	215.3 kB	2	<0.1%	0	0%	172,406	80.1%	80.1%	0
21	FLAG_CONT_MOBILE	int8	215.3 kB	2	<0.1%	0	0%	214,855	99.8%	99.8%	1
22	FLAG_PHONE	int8	215.3 kB	2	<0.1%	0	0%	154,906	72.0%	72.0%	0
23	FLAG_EMAIL	int8	215.3 kB	2	<0.1%	0	0%	203,006	94.3%	94.3%	0
24	OCCUPATION_TYPE	category	217.1 kB	18	<0.1%	67,480	31.3%	38,591	17.9%	26.1%	Laborers
25	CNT_FAM_MEMBERS	float32	861.0 kB	12	<0.1%	1	<0.1%	110,671	51.4%	51.4%	2.0
26	REGION_RATING_CLIENT	int8	215.3 kB	3	<0.1%	0	0%	158,846	73.8%	73.8%	2
27	REGION_RATING_CLIENT_W_CITY	int8	215.3 kB	3	<0.1%	0	0%	160,564	74.6%	74.6%	2
28	REG_REGION_NOT_LIVE_REGION	int8	215.3 kB	2	<0.1%	0	0%	211,999	98.5%	98.5%	0
29	REG_REGION_NOT_WORK_REGION	int8	215.3 kB	2	<0.1%	0	0%	204,222	94.9%	94.9%	0
30	LIVE_REGION_NOT_WORK_REGION	int8	215.3 kB	2	<0.1%	0	0%	206,386	95.9%	95.9%	0
31	REG_CITY_NOT_LIVE_CITY	int8	215.3 kB	2	<0.1%	0	0%	198,549	92.2%	92.2%	0
32	REG_CITY_NOT_WORK_CITY	int8	215.3 kB	2	<0.1%	0	0%	165,697	77.0%	77.0%	0
33	LIVE_CITY_NOT_WORK_CITY	int8	215.3 kB	2	<0.1%	0	0%	176,518	82.0%	82.0%	0
34	ORGANIZATION_TYPE	category	221.3 kB	57	<0.1%	38,756	18.0%	47,582	22.1%	27.0%	Business Entity Type 3
35	EXT_SOURCE_1	float32	861.0 kB	83,961	39.0%	121,373	56.4%	5	<0.1%	<0.1%	0.44398212
36	EXT_SOURCE_2	float32	861.0 kB	102,229	47.5%	464	0.2%	503	0.2%	0.2%	0.28589788
37	EXT_SOURCE_3	float32	861.0 kB	804	0.4%	42,680	19.8%	985	0.5%	0.6%	0.7463002
38	APARTMENTS_AVG	float32	861.0 kB	2,207	1.0%	109,076	50.7%	4,712	2.2%	4.4%	0.0825
39	BASEMENTAREA_AVG	float32	861.0 kB	3,626	1.7%	125,793	58.4%	10,282	4.8%	11.5%	0.0
40	YEARS_BEGINEXPLUATATION_AVG	float32	861.0 kB	260	0.1%	104,910	48.7%	3,073	1.4%	2.8%	0.9871
41	YEARS_BUILD_AVG	float32	861.0 kB	146	0.1%	143,036	66.4%	2,132	1.0%	3.0%	0.8232
42	COMMONAREA_AVG	float32	861.0 kB	2,964	1.4%	150,300	69.8%	5,899	2.7%	9.1%	0.0
43	ELEVATORS_AVG	float32	861.0 kB	241	0.1%	114,570	53.2%	60,109	27.9%	59.7%	0.0
44	ENTRANCES_AVG	float32	861.0 kB	266	0.1%	108,270	50.3%	23,867	11.1%	22.3%	0.1379
45	FLOORSMAX_AVG	float32	861.0 kB	371	0.2%	106,970	49.7%	43,449	20.2%	40.1%	0.1667
46	FLOORSMIN_AVG	float32	861.0 kB	280	0.1%	146,054	67.9%	23,117	10.7%	33.4%	0.2083
47	LANDAREA_AVG	float32	861.0 kB	3,360	1.6%	127,644	59.3%	10,845	5.0%	12.4%	0.0
48	LIVINGAPARTMENTS_AVG	float32	861.0 kB	1,761	0.8%	147,049	68.3%	2,984	1.4%	4.4%	0.0504
49	LIVINGAREA_AVG	float32	861.0 kB	4,983	2.3%	107,990	50.2%	202	0.1%	0.2%	0.0
50	NONLIVINGAPARTMENTS_AVG	float32	861.0 kB	345	0.2%	149,354	69.4%	38,319	17.8%	58.1%	0.0
51	NONLIVINGAREA_AVG	float32	861.0 kB	3,042	1.4%	118,577	55.1%	41,099	19.1%	42.5%	0.0
52	APARTMENTS_MODE	float32	861.0 kB	744	0.3%	109,076	50.7%	5,301	2.5%	5.0%	0.084
53	BASEMENTAREA_MODE	float32	861.0 kB	3,687	1.7%	125,793	58.4%	11,561	5.4%	12.9%	0.0
54	YEARS_BEGINEXPLUATATION_MODE	float32	861.0 kB	210	0.1%	104,910	48.7%	3,039	1.4%	2.8%	0.9871
55	YEARS_BUILD_MODE	float32	861.0 kB	152	0.1%	143,036	66.4%	2,090	1.0%	2.9%	0.8301
56	COMMONAREA_MODE	float32	861.0 kB	2,908	1.4%	150,300	69.8%	6,770	3.1%	10.4%	0.0
57	ELEVATORS_MODE	float32	861.0 kB	26	<0.1%	114,570	53.2%	62,808	29.2%	62.4%	0.0
58	ENTRANCES_MODE	float32	861.0 kB	30	<0.1%	108,270	50.3%	25,310	11.8%	23.7%	0.1379
59	FLOORSMAX_MODE	float32	861.0 kB	25	<0.1%	106,970	49.7%	46,048	21.4%	42.5%	0.1667
60	FLOORSMIN_MODE	float32	861.0 kB	25	<0.1%	146,054	67.9%	24,209	11.2%	35.0%	0.2083
61	LANDAREA_MODE	float32	861.0 kB	3,406	1.6%	127,644	59.3%	12,121	5.6%	13.8%	0.0
62	LIVINGAPARTMENTS_MODE	float32	861.0 kB	715	0.3%	147,049	68.3%	3,447	1.6%	5.1%	0.0551
63	LIVINGAREA_MODE	float32	861.0 kB	5,083	2.4%	107,990	50.2%	310	0.1%	0.3%	0.0
64	NONLIVINGAPARTMENTS_MODE	float32	861.0 kB	148	0.1%	149,354	69.4%	41,574	19.3%	63.1%	0.0
65	NONLIVINGAREA_MODE	float32	861.0 kB	3,090	1.4%	118,577	55.1%	46,933	21.8%	48.5%	0.0
66	APARTMENTS_MEDI	float32	861.0 kB	1,120	0.5%	109,076	50.7%	5,000	2.3%	4.7%	0.0833
67	BASEMENTAREA_MEDI	float32	861.0 kB	3,614	1.7%	125,793	58.4%	10,458	4.9%	11.7%	0.0
68	YEARS_BEGINEXPLUATATION_MEDI	float32	861.0 kB	232	0.1%	104,910	48.7%	3,060	1.4%	2.8%	0.9871
69	YEARS_BUILD_MEDI	float32	861.0 kB	148	0.1%	143,036	66.4%	2,118	1.0%	2.9%	0.8256
70	COMMONAREA_MEDI	float32	861.0 kB	2,982	1.4%	150,300	69.8%	6,068	2.8%	9.3%	0.0
71	ELEVATORS_MEDI	float32	861.0 kB	46	<0.1%	114,570	53.2%	61,040	28.4%	60.6%	0.0
72	ENTRANCES_MEDI	float32	861.0 kB	46	<0.1%	108,270	50.3%	24,940	11.6%	23.3%	0.1379
73	FLOORSMAX_MEDI	float32	861.0 kB	49	<0.1%	106,970	49.7%	44,659	20.7%	41.2%	0.1667
74	FLOORSMIN_MEDI	float32	861.0 kB	47	<0.1%	146,054	67.9%	23,733	11.0%	34.3%	0.2083
75	LANDAREA_MEDI	float32	861.0 kB	3,393	1.6%	127,644	59.3%	11,058	5.1%	12.6%	0.0
76	LIVINGAPARTMENTS_MEDI	float32	861.0 kB	1,063	0.5%	147,049	68.3%	3,142	1.5%	4.6%	0.0513
77	LIVINGAREA_MEDI	float32	861.0 kB	5,067	2.4%	107,990	50.2%	210	0.1%	0.2%	0.0
78	NONLIVINGAPARTMENTS_MEDI	float32	861.0 kB	190	0.1%	149,354	69.4%	39,384	18.3%	59.8%	0.0
79	NONLIVINGAREA_MEDI	float32	861.0 kB	3,083	1.4%	118,577	55.1%	42,610	19.8%	44.1%	0.0
80	FONDKAPREMONT_MODE	category	215.7 kB	4	<0.1%	147,099	68.3%	51,785	24.1%	76.0%	reg oper account
81	HOUSETYPE_MODE	category	215.6 kB	3	<0.1%	107,834	50.1%	105,515	49.0%	98.2%	block of flats
82	TOTALAREA_MODE	float32	861.0 kB	4,896	2.3%	103,833	48.2%	417	0.2%	0.4%	0.0
83	WALLSMATERIAL_MODE	category	216.0 kB	7	<0.1%	109,329	50.8%	46,298	21.5%	43.7%	Panel
84	OBS_30_CNT_SOCIAL_CIRCLE	float32	861.0 kB	32	<0.1%	714	0.3%	114,550	53.2%	53.4%	0.0
85	DEF_30_CNT_SOCIAL_CIRCLE	float32	861.0 kB	10	<0.1%	714	0.3%	189,988	88.3%	88.6%	0.0
86	OBS_60_CNT_SOCIAL_CIRCLE	float32	861.0 kB	32	<0.1%	714	0.3%	115,085	53.5%	53.6%	0.0
87	DEF_60_CNT_SOCIAL_CIRCLE	float32	861.0 kB	9	<0.1%	714	0.3%	196,614	91.3%	91.6%	0.0
88	DAYS_LAST_PHONE_CHANGE	float32	861.0 kB	3,720	1.7%	1	<0.1%	26,201	12.2%	12.2%	0.0
89	FLAG_DOCUMENT_2	int8	215.3 kB	2	<0.1%	0	0%	215,246	>99.9%	>99.9%	0
90	FLAG_DOCUMENT_3	int8	215.3 kB	2	<0.1%	0	0%	152,845	71.0%	71.0%	1
91	FLAG_DOCUMENT_4	int8	215.3 kB	2	<0.1%	0	0%	215,238	>99.9%	>99.9%	0
92	FLAG_DOCUMENT_5	int8	215.3 kB	2	<0.1%	0	0%	212,025	98.5%	98.5%	0
93	FLAG_DOCUMENT_6	int8	215.3 kB	2	<0.1%	0	0%	196,348	91.2%	91.2%	0
94	FLAG_DOCUMENT_7	int8	215.3 kB	2	<0.1%	0	0%	215,221	>99.9%	>99.9%	0
95	FLAG_DOCUMENT_8	int8	215.3 kB	2	<0.1%	0	0%	197,689	91.8%	91.8%	0
96	FLAG_DOCUMENT_9	int8	215.3 kB	2	<0.1%	0	0%	214,440	99.6%	99.6%	0
97	FLAG_DOCUMENT_10	int8	215.3 kB	2	<0.1%	0	0%	215,253	>99.9%	>99.9%	0
98	FLAG_DOCUMENT_11	int8	215.3 kB	2	<0.1%	0	0%	214,448	99.6%	99.6%	0
99	FLAG_DOCUMENT_12	int8	215.3 kB	2	<0.1%	0	0%	215,256	>99.9%	>99.9%	0
100	FLAG_DOCUMENT_13	int8	215.3 kB	2	<0.1%	0	0%	214,541	99.7%	99.7%	0
101	FLAG_DOCUMENT_14	int8	215.3 kB	2	<0.1%	0	0%	214,614	99.7%	99.7%	0
102	FLAG_DOCUMENT_15	int8	215.3 kB	2	<0.1%	0	0%	215,015	99.9%	99.9%	0
103	FLAG_DOCUMENT_16	int8	215.3 kB	2	<0.1%	0	0%	213,089	99.0%	99.0%	0
104	FLAG_DOCUMENT_17	int8	215.3 kB	2	<0.1%	0	0%	215,200	>99.9%	>99.9%	0
105	FLAG_DOCUMENT_18	int8	215.3 kB	2	<0.1%	0	0%	213,525	99.2%	99.2%	0
106	FLAG_DOCUMENT_19	int8	215.3 kB	2	<0.1%	0	0%	215,124	99.9%	99.9%	0
107	FLAG_DOCUMENT_20	int8	215.3 kB	2	<0.1%	0	0%	215,146	99.9%	99.9%	0
108	FLAG_DOCUMENT_21	int8	215.3 kB	2	<0.1%	0	0%	215,187	>99.9%	>99.9%	0
109	AMT_REQ_CREDIT_BUREAU_HOUR	float32	861.0 kB	5	<0.1%	29,081	13.5%	185,061	86.0%	99.4%	0.0
110	AMT_REQ_CREDIT_BUREAU_DAY	float32	861.0 kB	9	<0.1%	29,081	13.5%	185,147	86.0%	99.4%	0.0
111	AMT_REQ_CREDIT_BUREAU_WEEK	float32	861.0 kB	9	<0.1%	29,081	13.5%	180,246	83.7%	96.8%	0.0
112	AMT_REQ_CREDIT_BUREAU_MON	float32	861.0 kB	22	<0.1%	29,081	13.5%	155,679	72.3%	83.6%	0.0
113	AMT_REQ_CREDIT_BUREAU_QRT	float32	861.0 kB	10	<0.1%	29,081	13.5%	150,895	70.1%	81.0%	0.0
114	AMT_REQ_CREDIT_BUREAU_YEAR	float32	861.0 kB	24	<0.1%	29,081	13.5%	50,313	23.4%	27.0%	0.0
115	n_credits_total	float32	861.0 kB	57	<0.1%	30,836	14.3%	25,129	11.7%	13.6%	1.0
116	n_credits_active	float32	861.0 kB	22	<0.1%	30,836	14.3%	51,735	24.0%	28.1%	1.0
117	n_credits_closed	float32	861.0 kB	52	<0.1%	30,836	14.3%	37,807	17.6%	20.5%	1.0
118	n_credits_bad_debt	float32	861.0 kB	2	<0.1%	30,836	14.3%	184,408	85.7%	>99.9%	0.0
119	n_credits_sold	float32	861.0 kB	7	<0.1%	30,836	14.3%	180,711	84.0%	98.0%	0.0
120	mode_credit_currency	category	215.6 kB	3	<0.1%	30,836	14.3%	184,386	85.7%	>99.9%	currency 1
121	n_different_currencies	float32	861.0 kB	3	<0.1%	30,836	14.3%	183,765	85.4%	99.6%	1.0
122	n_currency_1	float32	861.0 kB	58	<0.1%	30,836	14.3%	25,155	11.7%	13.6%	1.0
123	n_currency_2	float32	861.0 kB	7	<0.1%	30,836	14.3%	183,835	85.4%	99.7%	0.0
124	n_currency_3	float32	861.0 kB	4	<0.1%	30,836	14.3%	184,319	85.6%	99.9%	0.0
125	n_currency_4	float32	861.0 kB	2	<0.1%	30,836	14.3%	184,414	85.7%	>99.9%	0.0
126	days_credit_min	float32	861.0 kB	2,921	1.4%	30,836	14.3%	205	0.1%	0.1%	-2919.0
127	days_credit_max	float32	861.0 kB	2,922	1.4%	30,836	14.3%	480	0.2%	0.3%	-91.0
128	days_credit_mean	float32	861.0 kB	53,697	24.9%	30,836	14.3%	61	<0.1%	<0.1%	-441.0
129	days_credit_std	float32	861.0 kB	133,052	61.8%	55,965	26.0%	1,383	0.6%	0.9%	0.0
130	days_credit_median	float32	861.0 kB	5,711	2.7%	30,836	14.3%	118	0.1%	0.1%	-561.0
131	days_credit_range	float32	861.0 kB	2,913	1.4%	30,836	14.3%	26,512	12.3%	14.4%	0.0
132	days_credit_overdue_min	float32	861.0 kB	69	<0.1%	30,836	14.3%	184,320	85.6%	99.9%	0.0
133	days_credit_overdue_max	float32	861.0 kB	671	0.3%	30,836	14.3%	182,056	84.6%	98.7%	0.0
134	days_credit_overdue_mean	float32	861.0 kB	1,195	0.6%	30,836	14.3%	182,056	84.6%	98.7%	0.0
135	days_credit_overdue_std	float32	861.0 kB	1,441	0.7%	55,965	26.0%	157,026	72.9%	98.6%	0.0
136	days_credit_overdue_median	float32	861.0 kB	168	0.1%	30,836	14.3%	184,119	85.5%	99.8%	0.0
137	days_credit_overdue_range	float32	861.0 kB	655	0.3%	30,836	14.3%	182,155	84.6%	98.8%	0.0
138	days_credit_enddate_min	Int32	1.1 MB	6,266	2.9%	32,432	15.1%	119	0.1%	0.1%	-2359
139	days_credit_enddate_max	Int32	1.1 MB	12,274	5.7%	32,432	15.1%	187	0.1%	0.1%	31060
140	days_credit_enddate_mean	Float64	1.9 MB	77,581	36.0%	32,432	15.1%	46	<0.1%	<0.1%	-99.0
141	days_credit_enddate_std	Float64	1.9 MB	134,001	62.3%	59,197	27.5%	1,369	0.6%	0.9%	0.0
142	days_credit_enddate_median	Float32	1.1 MB	13,238	6.1%	32,432	15.1%	113	0.1%	0.1%	0.0
143	days_credit_enddate_range	Int32	1.1 MB	17,383	8.1%	32,432	15.1%	28,134	13.1%	15.4%	0
144	days_enddate_fact_min	Int32	1.1 MB	2,901	1.3%	53,870	25.0%	122	0.1%	0.1%	-2450
145	days_enddate_fact_max	Int16	645.8 kB	2,793	1.3%	53,870	25.0%	340	0.2%	0.2%	-84
146	days_enddate_fact_mean	Float32	1.1 MB	35,685	16.6%	53,870	25.0%	71	<0.1%	<0.1%	-795.0
147	days_enddate_fact_std	Float32	1.1 MB	93,662	43.5%	91,572	42.5%	921	0.4%	0.7%	0.0
148	days_enddate_fact_median	Float32	1.1 MB	5,341	2.5%	53,870	25.0%	135	0.1%	0.1%	-919.0
149	days_enddate_fact_range	Int32	1.1 MB	2,796	1.3%	53,870	25.0%	38,623	17.9%	23.9%	0
150	amt_credit_max_overdue_min	float64	1.7 MB	9,923	4.6%	86,638	40.2%	116,256	54.0%	90.4%	0.0
151	amt_credit_max_overdue_max	float64	1.7 MB	32,871	15.3%	86,638	40.2%	79,549	37.0%	61.8%	0.0
152	amt_credit_max_overdue_mean	float64	1.7 MB	39,837	18.5%	86,638	40.2%	79,549	37.0%	61.8%	0.0
153	amt_credit_max_overdue_std	float64	1.7 MB	35,648	16.6%	132,328	61.5%	43,267	20.1%	52.2%	0.0
154	amt_credit_max_overdue_median	float64	1.7 MB	21,151	9.8%	86,638	40.2%	100,477	46.7%	78.1%	0.0
155	amt_credit_max_overdue_range	float64	1.7 MB	27,267	12.7%	86,638	40.2%	88,957	41.3%	69.2%	0.0
156	cnt_credit_prolong_min	float32	861.0 kB	6	<0.1%	30,836	14.3%	184,215	85.6%	99.9%	0.0
157	cnt_credit_prolong_max	float32	861.0 kB	9	<0.1%	30,836	14.3%	178,412	82.9%	96.7%	0.0
158	cnt_credit_prolong_mean	float32	861.0 kB	100	<0.1%	30,836	14.3%	178,412	82.9%	96.7%	0.0
159	cnt_credit_prolong_std	float32	861.0 kB	167	0.1%	55,965	26.0%	153,489	71.3%	96.4%	0.0
160	cnt_credit_prolong_median	float32	861.0 kB	8	<0.1%	30,836	14.3%	183,844	85.4%	99.7%	0.0
161	cnt_credit_prolong_range	float32	861.0 kB	9	<0.1%	30,836	14.3%	178,618	83.0%	96.9%	0.0
162	cnt_credit_prolong_sum	float32	861.0 kB	10	<0.1%	30,836	14.3%	178,412	82.9%	96.7%	0.0
163	amt_credit_sum_min	float64	1.7 MB	44,136	20.5%	30,836	14.3%	30,083	14.0%	16.3%	0.0
164	amt_credit_sum_max	float64	1.7 MB	49,429	23.0%	30,836	14.3%	6,293	2.9%	3.4%	450000.0
165	amt_credit_sum_mean	float64	1.7 MB	150,070	69.7%	30,836	14.3%	943	0.4%	0.5%	225000.0
166	amt_credit_sum_std	float64	1.7 MB	148,439	69.0%	55,965	26.0%	1,156	0.5%	0.7%	0.0
167	amt_credit_sum_median	float64	1.7 MB	77,800	36.1%	30,836	14.3%	5,011	2.3%	2.7%	225000.0
168	amt_credit_sum_range	float64	1.7 MB	94,343	43.8%	30,836	14.3%	26,285	12.2%	14.3%	0.0
169	amt_credit_sum_sum	float64	1.7 MB	147,742	68.6%	30,836	14.3%	924	0.4%	0.5%	225000.0
170	amt_credit_sum_debt_min	float64	1.7 MB	20,754	9.6%	36,039	16.7%	155,688	72.3%	86.9%	0.0
171	amt_credit_sum_debt_max	float64	1.7 MB	104,430	48.5%	36,039	16.7%	49,345	22.9%	27.5%	0.0
172	amt_credit_sum_debt_mean	float64	1.7 MB	121,544	56.5%	36,039	16.7%	48,543	22.6%	27.1%	0.0
173	amt_credit_sum_debt_std	float64	1.7 MB	116,314	54.0%	65,302	30.3%	31,435	14.6%	21.0%	0.0
174	amt_credit_sum_debt_median	float64	1.7 MB	48,592	22.6%	36,039	16.7%	120,818	56.1%	67.4%	0.0
175	amt_credit_sum_debt_range	float64	1.7 MB	98,760	45.9%	36,039	16.7%	60,698	28.2%	33.9%	0.0
176	amt_credit_sum_debt_sum	float64	1.7 MB	113,811	52.9%	30,836	14.3%	53,746	25.0%	29.1%	0.0
177	amt_credit_sum_limit_min	float64	1.7 MB	2,121	1.0%	45,585	21.2%	167,209	77.7%	98.5%	0.0
178	amt_credit_sum_limit_max	float64	1.7 MB	24,324	11.3%	45,585	21.2%	135,642	63.0%	79.9%	0.0
179	amt_credit_sum_limit_mean	float64	1.7 MB	27,475	12.8%	45,585	21.2%	135,599	63.0%	79.9%	0.0
180	amt_credit_sum_limit_std	float64	1.7 MB	26,937	12.5%	80,896	37.6%	102,265	47.5%	76.1%	0.0
181	amt_credit_sum_limit_median	float64	1.7 MB	5,916	2.7%	45,585	21.2%	162,356	75.4%	95.7%	0.0
182	amt_credit_sum_limit_range	float64	1.7 MB	22,987	10.7%	45,585	21.2%	137,576	63.9%	81.1%	0.0
183	amt_credit_sum_limit_sum	float64	1.7 MB	26,367	12.2%	30,836	14.3%	150,348	69.8%	81.5%	0.0
184	amt_credit_sum_overdue_min	float32	861.0 kB	81	<0.1%	30,836	14.3%	184,318	85.6%	99.9%	0.0
185	amt_credit_sum_overdue_max	float64	1.7 MB	918	0.4%	30,836	14.3%	182,090	84.6%	98.7%	0.0
186	amt_credit_sum_overdue_mean	float64	1.7 MB	1,424	0.7%	30,836	14.3%	182,090	84.6%	98.7%	0.0
187	amt_credit_sum_overdue_std	float64	1.7 MB	1,618	0.8%	55,965	26.0%	157,060	73.0%	98.6%	0.0
188	amt_credit_sum_overdue_median	float64	1.7 MB	200	0.1%	30,836	14.3%	184,121	85.5%	99.8%	0.0
189	amt_credit_sum_overdue_range	float64	1.7 MB	895	0.4%	30,836	14.3%	182,189	84.6%	98.8%	0.0
190	amt_credit_sum_overdue_sum	float64	1.7 MB	930	0.4%	30,836	14.3%	182,090	84.6%	98.7%	0.0
191	mode_credit_type	category	215.8 kB	6	<0.1%	30,836	14.3%	160,802	74.7%	87.2%	Consumer credit
192	n_different_credit_types	float32	861.0 kB	5	<0.1%	30,836	14.3%	100,733	46.8%	54.6%	2.0
193	n_consumer_credits	float32	861.0 kB	51	<0.1%	30,836	14.3%	33,496	15.6%	18.2%	1.0
194	n_credit_card_credits	float32	861.0 kB	22	<0.1%	30,836	14.3%	63,863	29.7%	34.6%	0.0
195	n_car_loans	float32	861.0 kB	9	<0.1%	30,836	14.3%	170,683	79.3%	92.6%	0.0
196	n_mortgages	float32	861.0 kB	7	<0.1%	30,836	14.3%	174,434	81.0%	94.6%	0.0
197	n_microloans	float32	861.0 kB	28	<0.1%	30,836	14.3%	181,975	84.5%	98.7%	0.0
198	n_other_type_credit	float32	861.0 kB	9	<0.1%	30,836	14.3%	182,373	84.7%	98.9%	0.0
199	days_credit_update_min	float32	861.0 kB	2,949	1.4%	30,836	14.3%	549	0.3%	0.3%	-19.0
200	days_credit_update_max	float32	861.0 kB	2,585	1.2%	30,836	14.3%	7,529	3.5%	4.1%	-7.0
201	days_credit_update_mean	float32	861.0 kB	46,055	21.4%	30,836	14.3%	512	0.2%	0.3%	-12.0
202	days_credit_update_std	float64	1.7 MB	131,798	61.2%	55,965	26.0%	1,885	0.9%	1.2%	0.0
203	days_credit_update_median	float32	861.0 kB	4,779	2.2%	30,836	14.3%	1,055	0.5%	0.6%	-22.0
204	days_credit_update_range	float32	861.0 kB	2,925	1.4%	30,836	14.3%	27,014	12.5%	14.6%	0.0
205	amt_annuity_min	float64	1.7 MB	9,921	4.6%	159,480	74.1%	36,975	17.2%	66.3%	0.0
206	amt_annuity_max	float64	1.7 MB	18,638	8.7%	159,480	74.1%	13,781	6.4%	24.7%	0.0
207	amt_annuity_mean	float64	1.7 MB	29,816	13.9%	159,480	74.1%	13,781	6.4%	24.7%	0.0
208	amt_annuity_std	float64	1.7 MB	25,917	12.0%	171,585	79.7%	15,071	7.0%	34.5%	0.0
209	amt_annuity_median	float64	1.7 MB	16,441	7.6%	159,480	74.1%	23,785	11.0%	42.6%	0.0
210	amt_annuity_range	float64	1.7 MB	15,462	7.2%	159,480	74.1%	27,176	12.6%	48.7%	0.0
211	bureau_months_balance_min	float32	861.0 kB	97	<0.1%	152,586	70.9%	1,508	0.7%	2.4%	-95.0
212	bureau_months_balance_max	float32	861.0 kB	89	<0.1%	152,586	70.9%	59,695	27.7%	95.3%	0.0
213	bureau_dpd_status_min	float32	861.0 kB	6	<0.1%	152,586	70.9%	62,638	29.1%	99.9%	0.0
214	bureau_dpd_status_max	float32	861.0 kB	6	<0.1%	152,586	70.9%	41,042	19.1%	65.5%	0.0
215	bureau_dpd_status_mean	float32	861.0 kB	3,772	1.8%	152,586	70.9%	41,042	19.1%	65.5%	0.0
216	bureau_dpd_status_std	float32	861.0 kB	7,016	3.3%	153,149	71.1%	40,500	18.8%	65.2%	0.0
217	bureau_dpd_status_median	float32	861.0 kB	11	<0.1%	152,586	70.9%	61,726	28.7%	98.5%	0.0
218	bureau_dpd_status_range	float32	861.0 kB	6	<0.1%	152,586	70.9%	41,063	19.1%	65.5%	0.0
219	n_different_loans	float32	861.0 kB	4	<0.1%	11,456	5.3%	77,974	36.2%	38.3%	2.0
220	n_cash_loans	float32	861.0 kB	55	<0.1%	11,456	5.3%	83,697	38.9%	41.1%	0.0
221	n_consumer_loans	float32	861.0 kB	36	<0.1%	11,456	5.3%	78,331	36.4%	38.4%	1.0
222	n_revolving_loans	float32	861.0 kB	25	<0.1%	11,456	5.3%	130,792	60.8%	64.2%	0.0
223	amt_annuity_min_previous_application	float64	1.7 MB	113,816	52.9%	11,752	5.5%	16,017	7.4%	7.9%	2250.0
224	amt_annuity_max_previous_application	float64	1.7 MB	110,598	51.4%	11,752	5.5%	2,363	1.1%	1.2%	22500.0
225	amt_annuity_mean_previous_application	float64	1.7 MB	191,798	89.1%	11,752	5.5%	367	0.2%	0.2%	2250.0
226	amt_annuity_std_previous_application	float64	1.7 MB	157,678	73.3%	56,274	26.1%	296	0.1%	0.2%	0.0
227	amt_annuity_median_previous_application	float64	1.7 MB	157,063	73.0%	11,752	5.5%	1,357	0.6%	0.7%	11250.0
228	amt_annuity_range_previous_application	float64	1.7 MB	146,639	68.1%	11,752	5.5%	44,818	20.8%	22.0%	0.0
229	amt_application_min	float64	1.7 MB	29,672	13.8%	11,456	5.3%	95,786	44.5%	47.0%	0.0
230	amt_application_max	float64	1.7 MB	39,568	18.4%	11,456	5.3%	9,541	4.4%	4.7%	450000.0
231	amt_application_mean	float64	1.7 MB	142,462	66.2%	11,456	5.3%	736	0.3%	0.4%	0.0
232	amt_application_std	float64	1.7 MB	150,921	70.1%	48,154	22.4%	1,132	0.5%	0.7%	0.0
233	amt_application_median	float64	1.7 MB	63,472	29.5%	11,456	5.3%	10,838	5.0%	5.3%	0.0
234	amt_application_range	float64	1.7 MB	51,986	24.2%	11,456	5.3%	37,830	17.6%	18.6%	0.0
235	amt_credit_min	float64	1.7 MB	33,220	15.4%	11,456	5.3%	79,660	37.0%	39.1%	0.0
236	amt_credit_max	float64	1.7 MB	49,618	23.1%	11,456	5.3%	4,696	2.2%	2.3%	450000.0
237	amt_credit_mean	float64	1.7 MB	156,814	72.8%	11,456	5.3%	293	0.1%	0.1%	45000.0
238	amt_credit_std	float64	1.7 MB	157,015	72.9%	48,154	22.4%	340	0.2%	0.2%	0.0
239	amt_credit_median	float64	1.7 MB	73,966	34.4%	11,456	5.3%	8,095	3.8%	4.0%	0.0
240	amt_credit_range	float64	1.7 MB	71,950	33.4%	11,456	5.3%	37,038	17.2%	18.2%	0.0
241	amt_down_payment_min	float64	1.7 MB	10,194	4.7%	23,703	11.0%	125,181	58.2%	65.4%	0.0
242	amt_down_payment_max	float64	1.7 MB	17,607	8.2%	23,703	11.0%	53,725	25.0%	28.0%	0.0
243	amt_down_payment_mean	float64	1.7 MB	42,577	19.8%	23,703	11.0%	53,725	25.0%	28.0%	0.0
244	amt_down_payment_std	float64	1.7 MB	57,310	26.6%	99,327	46.1%	19,374	9.0%	16.7%	0.0
245	amt_down_payment_median	float64	1.7 MB	19,734	9.2%	23,703	11.0%	74,539	34.6%	38.9%	0.0
246	amt_down_payment_range	float64	1.7 MB	17,144	8.0%	23,703	11.0%	94,998	44.1%	49.6%	0.0
247	amt_goods_price_min	float64	1.7 MB	39,170	18.2%	12,169	5.7%	11,596	5.4%	5.7%	45000.0
248	amt_goods_price_max	float64	1.7 MB	39,563	18.4%	12,169	5.7%	9,543	4.4%	4.7%	450000.0
249	amt_goods_price_mean	float64	1.7 MB	138,760	64.5%	12,169	5.7%	777	0.4%	0.4%	135000.0
250	amt_goods_price_std	float64	1.7 MB	140,074	65.1%	56,728	26.4%	1,360	0.6%	0.9%	0.0
251	amt_goods_price_median	float64	1.7 MB	67,080	31.2%	12,169	5.7%	4,499	2.1%	2.2%	135000.0
252	amt_goods_price_range	float64	1.7 MB	79,283	36.8%	12,169	5.7%	45,919	21.3%	22.6%	0.0
253	rate_down_payment_min	float32	861.0 kB	46,257	21.5%	23,703	11.0%	125,181	58.2%	65.4%	0.0
254	rate_down_payment_max	float32	861.0 kB	84,883	39.4%	23,703	11.0%	53,725	25.0%	28.0%	0.0
255	rate_down_payment_mean	float32	861.0 kB	116,968	54.3%	23,703	11.0%	53,725	25.0%	28.0%	0.0
256	rate_down_payment_std	float32	861.0 kB	88,115	40.9%	99,327	46.1%	19,263	8.9%	16.6%	0.0
257	rate_down_payment_median	float32	861.0 kB	87,629	40.7%	23,703	11.0%	74,539	34.6%	38.9%	0.0
258	rate_down_payment_range	float32	861.0 kB	73,615	34.2%	23,703	11.0%	94,887	44.1%	49.5%	0.0
259	rate_interest_primary_min	float32	861.0 kB	119	0.1%	212,016	98.5%	666	0.3%	20.5%	0.18913634
260	rate_interest_primary_max	float32	861.0 kB	119	0.1%	212,016	98.5%	674	0.3%	20.8%	0.18913634
261	rate_interest_primary_mean	float32	861.0 kB	160	0.1%	212,016	98.5%	655	0.3%	20.2%	0.18913634
262	rate_interest_primary_std	float32	861.0 kB	39	<0.1%	215,139	99.9%	37	<0.1%	31.4%	0.0
263	rate_interest_primary_median	float32	861.0 kB	157	0.1%	212,016	98.5%	655	0.3%	20.2%	0.18913634
264	rate_interest_primary_range	float32	861.0 kB	37	<0.1%	212,016	98.5%	3,160	1.5%	97.5%	0.0
265	rate_interest_primary_count	float32	861.0 kB	4	<0.1%	11,456	5.3%	200,560	93.2%	98.4%	0.0
266	rate_interest_privileged_min	float32	861.0 kB	21	<0.1%	212,016	98.5%	892	0.4%	27.5%	0.83509517
267	rate_interest_privileged_max	float32	861.0 kB	21	<0.1%	212,016	98.5%	906	0.4%	28.0%	0.83509517
268	rate_interest_privileged_mean	float32	861.0 kB	42	<0.1%	212,016	98.5%	881	0.4%	27.2%	0.83509517
269	rate_interest_privileged_std	float32	861.0 kB	21	<0.1%	215,139	99.9%	50	<0.1%	42.4%	0.0
270	rate_interest_privileged_median	float32	861.0 kB	40	<0.1%	212,016	98.5%	881	0.4%	27.2%	0.83509517
271	rate_interest_privileged_range	float32	861.0 kB	20	<0.1%	212,016	98.5%	3,173	1.5%	97.9%	0.0
272	rate_interest_privileged_count	float32	861.0 kB	4	<0.1%	11,456	5.3%	200,560	93.2%	98.4%	0.0
273	n_different_contract_types	float32	861.0 kB	4	<0.1%	11,456	5.3%	77,974	36.2%	38.3%	2.0
274	n_contract_status_approved	float32	861.0 kB	25	<0.1%	11,456	5.3%	53,519	24.9%	26.3%	1.0
275	n_contract_status_canceled	float32	861.0 kB	36	<0.1%	11,456	5.3%	126,281	58.7%	62.0%	0.0
276	n_contract_status_refused	float32	861.0 kB	44	<0.1%	11,456	5.3%	133,394	62.0%	65.5%	0.0
277	n_contract_status_unused_offer	float32	861.0 kB	11	<0.1%	11,456	5.3%	190,553	88.5%	93.5%	0.0
278	days_decision_min	float32	861.0 kB	2,921	1.4%	11,456	5.3%	136	0.1%	0.1%	-476.0
279	days_decision_max	float32	861.0 kB	2,921	1.4%	11,456	5.3%	598	0.3%	0.3%	-7.0
280	days_decision_mean	float32	861.0 kB	50,330	23.4%	11,456	5.3%	109	0.1%	0.1%	-351.0
281	days_decision_std	float32	861.0 kB	129,009	59.9%	48,154	22.4%	3,867	1.8%	2.3%	0.0
282	days_decision_median	float32	861.0 kB	5,656	2.6%	11,456	5.3%	255	0.1%	0.1%	-364.0
283	days_decision_range	float32	861.0 kB	2,919	1.4%	11,456	5.3%	40,565	18.8%	19.9%	0.0
284	n_payment_type_cash_through_bank	float32	861.0 kB	44	<0.1%	11,456	5.3%	54,943	25.5%	27.0%	1.0
285	n_payment_type_cash_from_account	float32	861.0 kB	1	<0.1%	11,456	5.3%	203,801	94.7%	100.0%	0.0
286	n_payment_type_not_available	float32	861.0 kB	46	<0.1%	11,456	5.3%	71,796	33.4%	35.2%	0.0
287	n_reject_reason_not_applicable	float32	861.0 kB	44	<0.1%	11,456	5.3%	44,154	20.5%	21.7%	1.0
288	n_reject_reason_hc	float32	861.0 kB	36	<0.1%	11,456	5.3%	157,346	73.1%	77.2%	0.0
289	n_reject_reason_limit	float32	861.0 kB	22	<0.1%	11,456	5.3%	183,819	85.4%	90.2%	0.0
290	n_reject_reason_scoc	float32	861.0 kB	20	<0.1%	11,456	5.3%	188,558	87.6%	92.5%	0.0
291	n_reject_reason_client	float32	861.0 kB	11	<0.1%	11,456	5.3%	190,553	88.5%	93.5%	0.0
292	n_reject_reason_scofr	float32	861.0 kB	16	<0.1%	11,456	5.3%	199,055	92.5%	97.7%	0.0
293	n_client_type_new	float32	861.0 kB	14	<0.1%	11,456	5.3%	154,064	71.6%	75.6%	1.0
294	n_client_type_repeater	float32	861.0 kB	61	<0.1%	11,456	5.3%	49,122	22.8%	24.1%	0.0
295	n_client_type_refreshed	float32	861.0 kB	23	<0.1%	11,456	5.3%	150,108	69.7%	73.7%	0.0
296	n_portfolio_pos	float32	861.0 kB	32	<0.1%	11,456	5.3%	81,754	38.0%	40.1%	1.0
297	n_portfolio_cash	float32	861.0 kB	39	<0.1%	11,456	5.3%	99,269	46.1%	48.7%	0.0
298	n_portfolio_cards	float32	861.0 kB	21	<0.1%	11,456	5.3%	135,213	62.8%	66.3%	0.0
299	n_product_type_xsell	float32	861.0 kB	33	<0.1%	11,456	5.3%	97,659	45.4%	47.9%	0.0
300	n_product_type_walk_in	float32	861.0 kB	28	<0.1%	11,456	5.3%	152,783	71.0%	75.0%	0.0
301	n_different_channels	float32	861.0 kB	7	<0.1%	11,456	5.3%	79,085	36.7%	38.8%	2.0
302	n_channel_type_credit_and_cash	float32	861.0 kB	52	<0.1%	11,456	5.3%	96,482	44.8%	47.3%	0.0
303	n_channel_type_countrywide	float32	861.0 kB	34	<0.1%	11,456	5.3%	67,466	31.3%	33.1%	1.0
304	n_channel_type_stone	float32	861.0 kB	22	<0.1%	11,456	5.3%	121,683	56.5%	59.7%	0.0
305	n_channel_type_regional_and_local	float32	861.0 kB	19	<0.1%	11,456	5.3%	158,328	73.6%	77.7%	0.0
306	n_channel_type_contact_center	float32	861.0 kB	19	<0.1%	11,456	5.3%	175,621	81.6%	86.2%	0.0
307	n_channel_type_ap_minus	float32	861.0 kB	33	<0.1%	11,456	5.3%	187,751	87.2%	92.1%	0.0
308	n_channel_type_channel_corporate_sales	float32	861.0 kB	20	<0.1%	11,456	5.3%	202,289	94.0%	99.3%	0.0
309	n_channel_type_car_dealer	float32	861.0 kB	6	<0.1%	11,456	5.3%	203,580	94.6%	99.9%	0.0
310	n_cnt_payment_0	float32	861.0 kB	21	<0.1%	11,456	5.3%	135,213	62.8%	66.3%	0.0
311	cnt_payment_min	float32	861.0 kB	31	<0.1%	11,752	5.5%	68,588	31.9%	33.7%	0.0
312	cnt_payment_max	float32	861.0 kB	39	<0.1%	11,752	5.5%	52,776	24.5%	25.9%	12.0
313	cnt_payment_mean	float32	861.0 kB	2,495	1.2%	11,752	5.5%	25,110	11.7%	12.3%	12.0
314	cnt_payment_std	float32	861.0 kB	14,394	6.7%	56,274	26.1%	10,117	4.7%	6.4%	0.0
315	cnt_payment_median	float32	861.0 kB	87	<0.1%	11,752	5.5%	53,998	25.1%	26.5%	12.0
316	cnt_payment_range	float32	861.0 kB	69	<0.1%	11,752	5.5%	54,639	25.4%	26.8%	0.0
317	n_yield_group_low_action	float32	861.0 kB	22	<0.1%	11,456	5.3%	163,415	75.9%	80.2%	0.0
318	n_yield_group_low_normal	float32	861.0 kB	23	<0.1%	11,456	5.3%	94,724	44.0%	46.5%	0.0
319	n_yield_group_middle	float32	861.0 kB	25	<0.1%	11,456	5.3%	80,043	37.2%	39.3%	0.0
320	n_yield_group_high	float32	861.0 kB	30	<0.1%	11,456	5.3%	89,153	41.4%	43.7%	0.0
321	days_first_draw_min	float32	861.0 kB	2,718	1.3%	12,377	5.7%	165,404	76.8%	81.5%	365243.0
322	days_first_draw_max	float32	861.0 kB	939	0.4%	12,377	5.7%	201,133	93.4%	99.1%	365243.0
323	days_first_draw_mean	float32	861.0 kB	14,131	6.6%	12,377	5.7%	165,404	76.8%	81.5%	365243.0
324	days_first_draw_std	float64	1.7 MB	13,562	6.3%	67,931	31.6%	111,591	51.8%	75.7%	0.0
325	days_first_draw_median	float32	861.0 kB	2,812	1.3%	12,377	5.7%	193,631	90.0%	95.4%	365243.0
326	days_first_draw_range	float32	861.0 kB	2,723	1.3%	12,377	5.7%	167,145	77.6%	82.4%	0.0
327	days_last_due_1st_version_min	float32	861.0 kB	4,081	1.9%	12,377	5.7%	1,911	0.9%	0.9%	365243.0
328	days_last_due_1st_version_max	float32	861.0 kB	4,521	2.1%	12,377	5.7%	55,263	25.7%	27.2%	365243.0
329	days_last_due_1st_version_mean	float32	861.0 kB	51,499	23.9%	12,377	5.7%	1,911	0.9%	0.9%	365243.0
330	days_last_due_1st_version_std	float64	1.7 MB	104,185	48.4%	67,931	31.6%	50	<0.1%	<0.1%	241.83051916579925
331	days_last_due_1st_version_median	float32	861.0 kB	10,719	5.0%	12,377	5.7%	1,937	0.9%	1.0%	365243.0
332	days_last_due_1st_version_range	float32	861.0 kB	7,864	3.7%	12,377	5.7%	55,584	25.8%	27.4%	0.0
333	days_last_due_min	float32	861.0 kB	2,859	1.3%	12,377	5.7%	14,374	6.7%	7.1%	365243.0
334	days_last_due_max	float32	861.0 kB	2,761	1.3%	12,377	5.7%	98,527	45.8%	48.6%	365243.0
335	days_last_due_mean	float32	861.0 kB	51,645	24.0%	12,377	5.7%	14,374	6.7%	7.1%	365243.0
336	days_last_due_std	float64	1.7 MB	99,434	46.2%	67,931	31.6%	3,105	1.4%	2.1%	0.0
337	days_last_due_median	float32	861.0 kB	7,906	3.7%	12,377	5.7%	21,138	9.8%	10.4%	365243.0
338	days_last_due_range	float32	861.0 kB	5,592	2.6%	12,377	5.7%	58,659	27.3%	28.9%	0.0
339	days_termination_min	float32	861.0 kB	2,797	1.3%	12,377	5.7%	15,833	7.4%	7.8%	365243.0
340	days_termination_max	float32	861.0 kB	2,683	1.2%	12,377	5.7%	105,005	48.8%	51.8%	365243.0
341	days_termination_mean	float32	861.0 kB	51,017	23.7%	12,377	5.7%	15,833	7.4%	7.8%	365243.0
342	days_termination_std	float64	1.7 MB	95,145	44.2%	67,931	31.6%	3,494	1.6%	2.4%	0.0
343	days_termination_median	float32	861.0 kB	7,716	3.6%	12,377	5.7%	23,269	10.8%	11.5%	365243.0
344	days_termination_range	float32	861.0 kB	5,101	2.4%	12,377	5.7%	59,048	27.4%	29.1%	0.0
345	n_nflag_insured_on_approval_sum	float32	861.0 kB	19	<0.1%	11,456	5.3%	96,596	44.9%	47.4%	0.0
346	n_nflag_insured_on_approval_mean	float32	861.0 kB	102	<0.1%	12,377	5.7%	95,675	44.4%	47.2%	0.0
347	any_nflag_insured_on_approval	Int8	430.5 kB	1	<0.1%	0	0%	215,257	100.0%	100.0%	0
348	n_installments_total	float32	861.0 kB	310	0.1%	11,034	5.1%	8,624	4.0%	4.2%	12.0
349	n_installments_late	float32	861.0 kB	99	<0.1%	11,034	5.1%	95,670	44.4%	46.8%	0.0
350	n_installments_early	float32	861.0 kB	215	0.1%	11,034	5.1%	9,335	4.3%	4.6%	6.0
351	n_installments_on_time	float32	861.0 kB	140	0.1%	11,034	5.1%	88,381	41.1%	43.3%	0.0
352	percent_installments_late	float32	861.0 kB	4,464	2.1%	11,034	5.1%	95,670	44.4%	46.8%	0.0
353	percent_installments_early	float32	861.0 kB	7,892	3.7%	11,034	5.1%	64,688	30.1%	31.7%	1.0
354	percent_installments_on_time	float32	861.0 kB	7,944	3.7%	11,034	5.1%	88,381	41.1%	43.3%	0.0
355	n_installments_late_7	float32	861.0 kB	59	<0.1%	11,034	5.1%	147,558	68.5%	72.3%	0.0
356	n_installments_late_30	float32	861.0 kB	42	<0.1%	11,034	5.1%	190,963	88.7%	93.5%	0.0
357	n_installments_late_60	float32	861.0 kB	39	<0.1%	11,034	5.1%	198,146	92.1%	97.0%	0.0
358	any_installments_late_7	float32	861.0 kB	2	<0.1%	11,034	5.1%	147,558	68.5%	72.3%	0.0
359	any_installments_late_30	float32	861.0 kB	2	<0.1%	11,034	5.1%	190,963	88.7%	93.5%	0.0
360	any_installments_late_60	float32	861.0 kB	2	<0.1%	11,034	5.1%	198,146	92.1%	97.0%	0.0
361	percent_installments_late_7	float32	861.0 kB	2,595	1.2%	11,034	5.1%	147,558	68.5%	72.3%	0.0
362	percent_installments_late_30	float32	861.0 kB	894	0.4%	11,034	5.1%	190,963	88.7%	93.5%	0.0
363	percent_installments_late_60	float32	861.0 kB	629	0.3%	11,034	5.1%	198,146	92.1%	97.0%	0.0
364	diff_days_installment_payment_min	float32	861.0 kB	1,465	0.7%	11,037	5.1%	30,953	14.4%	15.2%	0.0
365	diff_days_installment_payment_max	float32	861.0 kB	409	0.2%	11,037	5.1%	15,321	7.1%	7.5%	30.0
366	diff_days_installment_payment_mean	float32	861.0 kB	50,246	23.3%	11,037	5.1%	761	0.4%	0.4%	9.0
367	diff_days_installment_payment_std	float32	861.0 kB	159,834	74.3%	11,500	5.3%	341	0.2%	0.2%	0.0
368	diff_days_installment_payment_median	float32	861.0 kB	320	0.1%	11,037	5.1%	21,620	10.0%	10.6%	0.0
369	diff_days_installment_payment_range	float32	861.0 kB	1,465	0.7%	11,037	5.1%	5,349	2.5%	2.6%	30.0
370	diff_days_installment_payment_sum	float32	861.0 kB	4,383	2.0%	11,034	5.1%	540	0.3%	0.3%	66.0
371	diff_days_installment_payment_sum_late_only	float32	861.0 kB	1,815	0.8%	11,034	5.1%	95,670	44.4%	46.8%	0.0
372	diff_amt_installment_payment_min	float64	1.7 MB	25,190	11.7%	11,037	5.1%	177,973	82.7%	87.1%	0.0
373	diff_amt_installment_payment_max	float64	1.7 MB	75,445	35.0%	11,037	5.1%	116,518	54.1%	57.1%	0.0
374	diff_amt_installment_payment_mean	float64	1.7 MB	97,257	45.2%	11,037	5.1%	103,060	47.9%	50.5%	0.0
375	diff_amt_installment_payment_std	float64	1.7 MB	101,021	46.9%	11,500	5.3%	102,599	47.7%	50.4%	0.0
376	diff_amt_installment_payment_median	float64	1.7 MB	6,855	3.2%	11,037	5.1%	195,960	91.0%	96.0%	0.0
377	diff_amt_installment_payment_range	float64	1.7 MB	90,195	41.9%	11,037	5.1%	103,062	47.9%	50.5%	0.0
378	diff_percent_installment_payment_min	float32	861.0 kB	25,589	11.9%	11,037	5.1%	177,973	82.7%	87.1%	1.0
379	diff_percent_installment_payment_max	float64	1.7 MB	83,143	38.6%	11,037	5.1%	116,664	54.2%	57.1%	1.0
380	diff_percent_installment_payment_mean	float64	1.7 MB	87,934	40.9%	11,037	5.1%	103,191	47.9%	50.5%	1.0
381	diff_percent_installment_payment_std	float64	1.7 MB	100,863	46.9%	11,500	5.3%	102,727	47.7%	50.4%	0.0
382	diff_percent_installment_payment_median	float32	861.0 kB	7,969	3.7%	11,037	5.1%	195,960	91.0%	96.0%	1.0
383	diff_percent_installment_payment_range	float64	1.7 MB	97,055	45.1%	11,037	5.1%	103,190	47.9%	50.5%	0.0
384	n_previous_pos_applications	float32	861.0 kB	221	0.1%	12,570	5.8%	9,559	4.4%	4.7%	13.0
385	n_previous_pos_applications_active	float32	861.0 kB	207	0.1%	12,570	5.8%	11,535	5.4%	5.7%	12.0
386	n_previous_pos_applications_signed	float32	861.0 kB	31	<0.1%	12,570	5.8%	162,017	75.3%	79.9%	0.0
387	n_previous_pos_applications_completed	float32	861.0 kB	45	<0.1%	12,570	5.8%	73,226	34.0%	36.1%	1.0
388	cnt_installment_min	float32	861.0 kB	53	<0.1%	12,588	5.8%	42,362	19.7%	20.9%	6.0
389	cnt_installment_max	float32	861.0 kB	54	<0.1%	12,588	5.8%	57,934	26.9%	28.6%	12.0
390	cnt_installment_mean	float32	861.0 kB	34,036	15.8%	12,588	5.8%	15,121	7.0%	7.5%	12.0
391	cnt_installment_std	float32	861.0 kB	86,454	40.2%	12,828	6.0%	49,452	23.0%	24.4%	0.0
392	cnt_installment_median	float32	861.0 kB	103	<0.1%	12,588	5.8%	61,162	28.4%	30.2%	12.0
393	cnt_installment_range	float32	861.0 kB	69	<0.1%	12,588	5.8%	49,692	23.1%	24.5%	0.0
394	cnt_installment_future_min	float32	861.0 kB	61	<0.1%	12,588	5.8%	183,466	85.2%	90.5%	0.0
395	cnt_installment_future_max	float32	861.0 kB	61	<0.1%	12,588	5.8%	56,961	26.5%	28.1%	12.0
396	cnt_installment_future_mean	float32	861.0 kB	33,098	15.4%	12,588	5.8%	7,294	3.4%	3.6%	6.0
397	cnt_installment_future_std	float32	861.0 kB	94,015	43.7%	12,828	6.0%	7,063	3.3%	3.5%	2.1602468
398	cnt_installment_future_median	float32	861.0 kB	121	0.1%	12,588	5.8%	22,039	10.2%	10.9%	6.0
399	cnt_installment_future_range	float32	861.0 kB	65	<0.1%	12,588	5.8%	51,476	23.9%	25.4%	12.0
400	cnt_installments_diff_min	float32	861.0 kB	58	<0.1%	12,588	5.8%	198,083	92.0%	97.7%	0.0
401	cnt_installments_diff_max	float32	861.0 kB	65	<0.1%	12,588	5.8%	36,048	16.7%	17.8%	12.0
402	cnt_installments_diff_mean	float32	861.0 kB	20,290	9.4%	12,588	5.8%	9,014	4.2%	4.4%	3.0
403	cnt_installments_diff_std	float32	861.0 kB	73,650	34.2%	12,828	6.0%	7,541	3.5%	3.7%	2.1602468
404	cnt_installments_diff_median	float32	861.0 kB	64	<0.1%	12,588	5.8%	29,837	13.9%	14.7%	4.0
405	cnt_installments_diff_range	float32	861.0 kB	82	<0.1%	12,588	5.8%	35,742	16.6%	17.6%	12.0
406	sk_dpd_pos_applications_min	float32	861.0 kB	44	<0.1%	12,570	5.8%	202,642	94.1%	>99.9%	0.0
407	sk_dpd_pos_applications_max	float32	861.0 kB	1,595	0.7%	12,570	5.8%	164,332	76.3%	81.1%	0.0
408	sk_dpd_pos_applications_mean	float32	861.0 kB	8,594	4.0%	12,570	5.8%	164,332	76.3%	81.1%	0.0
409	sk_dpd_pos_applications_std	float32	861.0 kB	20,325	9.4%	12,819	6.0%	164,083	76.2%	81.1%	0.0
410	sk_dpd_pos_applications_median	float32	861.0 kB	856	0.4%	12,570	5.8%	201,113	93.4%	99.2%	0.0
411	sk_dpd_pos_applications_range	float32	861.0 kB	1,566	0.7%	12,570	5.8%	164,332	76.3%	81.1%	0.0
412	sk_dpd_def_pos_applications_min	float32	861.0 kB	3	<0.1%	12,570	5.8%	202,685	94.2%	>99.9%	0.0
413	sk_dpd_def_pos_applications_max	float32	861.0 kB	173	0.1%	12,570	5.8%	174,617	81.1%	86.2%	0.0
414	sk_dpd_def_pos_applications_mean	float32	861.0 kB	3,858	1.8%	12,570	5.8%	174,617	81.1%	86.2%	0.0
415	sk_dpd_def_pos_applications_std	float32	861.0 kB	12,093	5.6%	12,819	6.0%	174,368	81.0%	86.1%	0.0
416	sk_dpd_def_pos_applications_median	float32	861.0 kB	61	<0.1%	12,570	5.8%	202,489	94.1%	99.9%	0.0
417	sk_dpd_def_pos_applications_range	float32	861.0 kB	172	0.1%	12,570	5.8%	174,617	81.1%	86.2%	0.0
418	n_previous_credit_card_applications	float32	861.0 kB	126	0.1%	154,158	71.6%	4,332	2.0%	7.1%	96.0
419	n_previous_credit_card_applications_completed	float32	861.0 kB	40	<0.1%	154,158	71.6%	53,625	24.9%	87.8%	0.0
420	n_previous_credit_card_applications_active	float32	861.0 kB	102	<0.1%	154,158	71.6%	3,810	1.8%	6.2%	96.0
421	n_previous_credit_card_applications_signed	float32	861.0 kB	37	<0.1%	154,158	71.6%	58,091	27.0%	95.1%	0.0
422	n_contracts_credit_card_active	float32	861.0 kB	102	<0.1%	154,158	71.6%	3,810	1.8%	6.2%	96.0
423	n_contracts_credit_card_completed	float32	861.0 kB	40	<0.1%	154,158	71.6%	53,625	24.9%	87.8%	0.0
424	n_contracts_credit_card_signed	float32	861.0 kB	37	<0.1%	154,158	71.6%	58,091	27.0%	95.1%	0.0
425	amt_balance_credit_card_min	float64	1.7 MB	8,310	3.9%	154,158	71.6%	52,144	24.2%	85.3%	0.0
426	amt_balance_credit_card_max	float64	1.7 MB	40,175	18.7%	154,158	71.6%	19,232	8.9%	31.5%	0.0
427	amt_balance_credit_card_mean	float64	1.7 MB	41,818	19.4%	154,158	71.6%	19,214	8.9%	31.4%	0.0
428	amt_balance_credit_card_std	float64	1.7 MB	41,728	19.4%	154,590	71.8%	18,904	8.8%	31.2%	0.0
429	amt_balance_credit_card_median	float64	1.7 MB	27,685	12.9%	154,158	71.6%	33,027	15.3%	54.1%	0.0
430	amt_balance_credit_card_range	float64	1.7 MB	40,268	18.7%	154,158	71.6%	19,336	9.0%	31.6%	0.0
431	amt_credit_limit_actual_min	float32	861.0 kB	150	0.1%	154,158	71.6%	15,769	7.3%	25.8%	45000.0
432	amt_credit_limit_actual_max	float32	861.0 kB	52	<0.1%	154,158	71.6%	8,852	4.1%	14.5%	135000.0
433	amt_credit_limit_actual_mean	float64	1.7 MB	9,366	4.4%	154,158	71.6%	3,297	1.5%	5.4%	45000.0
434	amt_credit_limit_actual_std	float64	1.7 MB	17,158	8.0%	154,590	71.8%	25,868	12.0%	42.6%	0.0
435	amt_credit_limit_actual_median	float32	861.0 kB	151	0.1%	154,158	71.6%	7,600	3.5%	12.4%	0.0
436	amt_credit_limit_actual_range	float32	861.0 kB	147	0.1%	154,158	71.6%	26,300	12.2%	43.0%	0.0
437	amt_drawings_atm_current_min	float32	861.0 kB	114	0.1%	172,254	80.0%	42,401	19.7%	98.6%	0.0
438	amt_drawings_atm_current_max	float64	1.7 MB	1,131	0.5%	172,254	80.0%	6,929	3.2%	16.1%	0.0
439	amt_drawings_atm_current_mean	float64	1.7 MB	17,404	8.1%	172,254	80.0%	6,929	3.2%	16.1%	0.0
440	amt_drawings_atm_current_std	float64	1.7 MB	30,960	14.4%	172,561	80.2%	6,804	3.2%	15.9%	0.0
441	amt_drawings_atm_current_median	float64	1.7 MB	378	0.2%	172,254	80.0%	36,581	17.0%	85.1%	0.0
442	amt_drawings_atm_current_range	float32	861.0 kB	1	<0.1%	172,254	80.0%	43,003	20.0%	100.0%	0.0
443	amt_drawings_current_min	float64	1.7 MB	1,475	0.7%	154,158	71.6%	59,264	27.5%	97.0%	0.0
444	amt_drawings_current_max	float64	1.7 MB	17,325	8.0%	154,158	71.6%	19,196	8.9%	31.4%	0.0
445	amt_drawings_current_mean	float64	1.7 MB	35,095	16.3%	154,158	71.6%	19,196	8.9%	31.4%	0.0
446	amt_drawings_current_std	float64	1.7 MB	39,419	18.3%	154,590	71.8%	18,901	8.8%	31.2%	0.0
447	amt_drawings_current_median	float64	1.7 MB	9,561	4.4%	154,158	71.6%	47,512	22.1%	77.8%	0.0
448	amt_drawings_current_range	float64	1.7 MB	17,342	8.1%	154,158	71.6%	19,333	9.0%	31.6%	0.0
449	amt_drawings_other_current_min	float32	861.0 kB	4	<0.1%	172,254	80.0%	43,000	20.0%	>99.9%	0.0
450	amt_drawings_other_current_max	float64	1.7 MB	1,084	0.5%	172,254	80.0%	38,999	18.1%	90.7%	0.0
451	amt_drawings_other_current_mean	float64	1.7 MB	2,925	1.4%	172,254	80.0%	38,999	18.1%	90.7%	0.0
452	amt_drawings_other_current_std	float64	1.7 MB	3,439	1.6%	172,561	80.2%	38,694	18.0%	90.6%	0.0
453	amt_drawings_other_current_median	float64	1.7 MB	33	<0.1%	172,254	80.0%	42,965	20.0%	99.9%	0.0
454	amt_drawings_other_current_range	float64	1.7 MB	1,083	0.5%	172,254	80.0%	39,001	18.1%	90.7%	0.0
455	amt_drawings_pos_current_min	float64	1.7 MB	1,772	0.8%	172,254	80.0%	41,083	19.1%	95.5%	0.0
456	amt_drawings_pos_current_max	float64	1.7 MB	20,726	9.6%	172,254	80.0%	19,027	8.8%	44.2%	0.0
457	amt_drawings_pos_current_mean	float64	1.7 MB	23,516	10.9%	172,254	80.0%	19,027	8.8%	44.2%	0.0
458	amt_drawings_pos_current_std	float64	1.7 MB	23,623	11.0%	172,561	80.2%	18,898	8.8%	44.3%	0.0
459	amt_drawings_pos_current_median	float64	1.7 MB	8,634	4.0%	172,254	80.0%	33,721	15.7%	78.4%	0.0
460	amt_drawings_pos_current_range	float64	1.7 MB	20,626	9.6%	172,254	80.0%	19,205	8.9%	44.7%	0.0
461	amt_inst_min_regularity_min	float64	1.7 MB	1,664	0.8%	154,158	71.6%	57,788	26.8%	94.6%	0.0
462	amt_inst_min_regularity_max	float64	1.7 MB	22,887	10.6%	154,158	71.6%	19,437	9.0%	31.8%	0.0
463	amt_inst_min_regularity_mean	float64	1.7 MB	40,398	18.8%	154,158	71.6%	19,437	9.0%	31.8%	0.0
464	amt_inst_min_regularity_std	float64	1.7 MB	40,484	18.8%	154,590	71.8%	19,359	9.0%	31.9%	0.0
465	amt_inst_min_regularity_median	float64	1.7 MB	16,994	7.9%	154,158	71.6%	33,468	15.5%	54.8%	0.0
466	amt_inst_min_regularity_range	float64	1.7 MB	23,219	10.8%	154,158	71.6%	19,791	9.2%	32.4%	0.0
467	amt_payment_current_min	float64	1.7 MB	7,398	3.4%	172,336	80.1%	26,925	12.5%	62.7%	0.0
468	amt_payment_current_max	float64	1.7 MB	19,208	8.9%	172,336	80.1%	907	0.4%	2.1%	22500.0
469	amt_payment_current_mean	float64	1.7 MB	40,261	18.7%	172,336	80.1%	83	<0.1%	0.2%	0.0
470	amt_payment_current_std	float64	1.7 MB	41,555	19.3%	172,647	80.2%	371	0.2%	0.9%	0.0
471	amt_payment_current_median	float64	1.7 MB	17,066	7.9%	172,336	80.1%	2,689	1.2%	6.3%	9000.0
472	amt_payment_current_range	float64	1.7 MB	22,545	10.5%	172,336	80.1%	682	0.3%	1.6%	0.0
473	amt_payment_total_current_min	float64	1.7 MB	1,131	0.5%	154,158	71.6%	59,285	27.5%	97.0%	0.0
474	amt_payment_total_current_max	float64	1.7 MB	22,332	10.4%	154,158	71.6%	18,441	8.6%	30.2%	0.0
475	amt_payment_total_current_mean	float64	1.7 MB	40,916	19.0%	154,158	71.6%	18,441	8.6%	30.2%	0.0
476	amt_payment_total_current_std	float64	1.7 MB	42,215	19.6%	154,590	71.8%	18,090	8.4%	29.8%	0.0
477	amt_payment_total_current_median	float64	1.7 MB	13,261	6.2%	154,158	71.6%	30,408	14.1%	49.8%	0.0
478	amt_payment_total_current_range	float64	1.7 MB	22,686	10.5%	154,158	71.6%	18,522	8.6%	30.3%	0.0
479	amt_receivable_principal_min	float64	1.7 MB	6,082	2.8%	154,158	71.6%	53,385	24.8%	87.4%	0.0
480	amt_receivable_principal_max	float64	1.7 MB	33,039	15.3%	154,158	71.6%	19,707	9.2%	32.3%	0.0
481	amt_receivable_principal_mean	float64	1.7 MB	41,189	19.1%	154,158	71.6%	19,683	9.1%	32.2%	0.0
482	amt_receivable_principal_std	float64	1.7 MB	41,193	19.1%	154,590	71.8%	19,378	9.0%	31.9%	0.0
483	amt_receivable_principal_median	float64	1.7 MB	25,587	11.9%	154,158	71.6%	34,981	16.3%	57.3%	0.0
484	amt_receivable_principal_range	float64	1.7 MB	33,975	15.8%	154,158	71.6%	19,810	9.2%	32.4%	0.0
485	amt_receivable_min	float64	1.7 MB	14,658	6.8%	154,158	71.6%	43,946	20.4%	71.9%	0.0
486	amt_receivable_max	float64	1.7 MB	39,955	18.6%	154,158	71.6%	19,362	9.0%	31.7%	0.0
487	amt_receivable_mean	float64	1.7 MB	41,873	19.5%	154,158	71.6%	19,064	8.9%	31.2%	0.0
488	amt_receivable_std	float64	1.7 MB	41,816	19.4%	154,590	71.8%	18,748	8.7%	30.9%	0.0
489	amt_receivable_median	float64	1.7 MB	26,844	12.5%	154,158	71.6%	33,993	15.8%	55.6%	0.0
490	amt_receivable_range	float64	1.7 MB	40,943	19.0%	154,158	71.6%	19,180	8.9%	31.4%	0.0
491	amt_total_receivable_min	float64	1.7 MB	14,657	6.8%	154,158	71.6%	43,947	20.4%	71.9%	0.0
492	amt_total_receivable_max	float64	1.7 MB	39,959	18.6%	154,158	71.6%	19,361	9.0%	31.7%	0.0
493	amt_total_receivable_mean	float64	1.7 MB	41,873	19.5%	154,158	71.6%	19,064	8.9%	31.2%	0.0
494	amt_total_receivable_std	float64	1.7 MB	41,817	19.4%	154,590	71.8%	18,748	8.7%	30.9%	0.0
495	amt_total_receivable_median	float64	1.7 MB	26,843	12.5%	154,158	71.6%	33,993	15.8%	55.6%	0.0
496	amt_total_receivable_range	float64	1.7 MB	40,943	19.0%	154,158	71.6%	19,180	8.9%	31.4%	0.0
497	cnt_drawings_atm_current_min	float32	861.0 kB	19	<0.1%	172,254	80.0%	42,402	19.7%	98.6%	0.0
498	cnt_drawings_atm_current_max	float32	861.0 kB	43	<0.1%	172,254	80.0%	6,929	3.2%	16.1%	0.0
499	cnt_drawings_atm_current_mean	float32	861.0 kB	3,073	1.4%	172,254	80.0%	6,929	3.2%	16.1%	0.0
500	cnt_drawings_atm_current_std	float32	861.0 kB	16,770	7.8%	172,561	80.2%	6,817	3.2%	16.0%	0.0
501	cnt_drawings_atm_current_median	float32	861.0 kB	33	<0.1%	172,254	80.0%	36,581	17.0%	85.1%	0.0
502	cnt_drawings_atm_current_range	float32	861.0 kB	43	<0.1%	172,254	80.0%	7,124	3.3%	16.6%	0.0
503	cnt_drawings_current_min	float32	861.0 kB	39	<0.1%	154,158	71.6%	59,278	27.5%	97.0%	0.0
504	cnt_drawings_current_max	float32	861.0 kB	114	0.1%	154,158	71.6%	19,499	9.1%	31.9%	0.0
505	cnt_drawings_current_mean	float32	861.0 kB	5,724	2.7%	154,158	71.6%	19,499	9.1%	31.9%	0.0
506	cnt_drawings_current_std	float32	861.0 kB	25,425	11.8%	154,590	71.8%	19,208	8.9%	31.7%	0.0
507	cnt_drawings_current_median	float32	861.0 kB	113	0.1%	154,158	71.6%	47,629	22.1%	78.0%	0.0
508	cnt_drawings_current_range	float32	861.0 kB	114	0.1%	154,158	71.6%	19,640	9.1%	32.1%	0.0
509	cnt_drawings_other_current_min	float32	861.0 kB	3	<0.1%	172,254	80.0%	43,000	20.0%	>99.9%	0.0
510	cnt_drawings_other_current_max	float32	861.0 kB	11	<0.1%	172,254	80.0%	38,987	18.1%	90.7%	0.0
511	cnt_drawings_other_current_mean	float32	861.0 kB	382	0.2%	172,254	80.0%	38,987	18.1%	90.7%	0.0
512	cnt_drawings_other_current_std	float32	861.0 kB	724	0.3%	172,561	80.2%	38,683	18.0%	90.6%	0.0
513	cnt_drawings_other_current_median	float32	861.0 kB	4	<0.1%	172,254	80.0%	42,965	20.0%	99.9%	0.0
514	cnt_drawings_other_current_range	float32	861.0 kB	11	<0.1%	172,254	80.0%	38,990	18.1%	90.7%	0.0
515	cnt_drawings_pos_current_min	float32	861.0 kB	40	<0.1%	172,254	80.0%	41,083	19.1%	95.5%	0.0
516	cnt_drawings_pos_current_max	float32	861.0 kB	116	0.1%	172,254	80.0%	19,027	8.8%	44.2%	0.0
517	cnt_drawings_pos_current_mean	float32	861.0 kB	4,240	2.0%	172,254	80.0%	19,027	8.8%	44.2%	0.0
518	cnt_drawings_pos_current_std	float32	861.0 kB	13,887	6.5%	172,561	80.2%	18,908	8.8%	44.3%	0.0
519	cnt_drawings_pos_current_median	float32	861.0 kB	113	0.1%	172,254	80.0%	33,721	15.7%	78.4%	0.0
520	cnt_drawings_pos_current_range	float32	861.0 kB	116	0.1%	172,254	80.0%	19,215	8.9%	44.7%	0.0
521	cnt_installment_mature_cum_min	float32	861.0 kB	28	<0.1%	154,158	71.6%	38,853	18.0%	63.6%	0.0
522	cnt_installment_mature_cum_max	float32	861.0 kB	120	0.1%	154,158	71.6%	19,249	8.9%	31.5%	0.0
523	cnt_installment_mature_cum_mean	float32	861.0 kB	11,238	5.2%	154,158	71.6%	19,249	8.9%	31.5%	0.0
524	cnt_installment_mature_cum_std	float32	861.0 kB	12,965	6.0%	154,590	71.8%	19,175	8.9%	31.6%	0.0
525	cnt_installment_mature_cum_median	float32	861.0 kB	144	0.1%	154,158	71.6%	20,299	9.4%	33.2%	0.0
526	cnt_installment_mature_cum_range	float32	861.0 kB	96	<0.1%	154,158	71.6%	19,607	9.1%	32.1%	0.0
527	sk_dpd_credit_card_min	float32	861.0 kB	1	<0.1%	154,158	71.6%	61,099	28.4%	100.0%	0.0
528	sk_dpd_credit_card_max	float32	861.0 kB	353	0.2%	154,158	71.6%	48,474	22.5%	79.3%	0.0
529	sk_dpd_credit_card_mean	float32	861.0 kB	2,882	1.3%	154,158	71.6%	48,474	22.5%	79.3%	0.0
530	sk_dpd_credit_card_std	float32	861.0 kB	3,641	1.7%	154,590	71.8%	48,042	22.3%	79.2%	0.0
531	sk_dpd_credit_card_median	float32	861.0 kB	222	0.1%	154,158	71.6%	60,546	28.1%	99.1%	0.0
532	sk_dpd_credit_card_range	float32	861.0 kB	353	0.2%	154,158	71.6%	48,474	22.5%	79.3%	0.0
533	sk_dpd_def_credit_card_min	float32	861.0 kB	1	<0.1%	154,158	71.6%	61,099	28.4%	100.0%	0.0
534	sk_dpd_def_credit_card_max	float32	861.0 kB	47	<0.1%	154,158	71.6%	50,652	23.5%	82.9%	0.0
535	sk_dpd_def_credit_card_mean	float32	861.0 kB	1,328	0.6%	154,158	71.6%	50,652	23.5%	82.9%	0.0
536	sk_dpd_def_credit_card_std	float32	861.0 kB	1,757	0.8%	154,590	71.8%	50,220	23.3%	82.8%	0.0
537	sk_dpd_def_credit_card_median	float32	861.0 kB	16	<0.1%	154,158	71.6%	61,061	28.4%	99.9%	0.0
538	sk_dpd_def_credit_card_range	float32	861.0 kB	47	<0.1%	154,158	71.6%	50,652	23.5%	82.9%	0.0
539	FLAG_IS_EMERGENCY	Int8	430.5 kB	2	<0.1%	0	0%	213,628	99.2%	99.2%	0
540	ord_education_type	int8	215.3 kB	5	<0.1%	0	0%	152,993	71.1%	71.1%	1
541	flag_has_children	Int8	430.5 kB	2	<0.1%	0	0%	150,641	70.0%	70.0%	0
542	years_employed	float64	1.7 MB	11,769	5.5%	38,756	18.0%	112	0.1%	0.1%	0.6273972602739726
543	amt_income_total_per_family_member	float64	1.7 MB	2,362	1.1%	1	<0.1%	17,111	7.9%	7.9%	67500.0
544	cnt_fam_members_excluding_children	float32	861.0 kB	2	<0.1%	1	<0.1%	158,301	73.5%	73.5%	2.0
545	amt_annuity_to_credit_ratio	float32	861.0 kB	33,148	15.4%	8	<0.1%	20,556	9.5%	9.5%	0.05
546	amt_annuity_to_income_ratio	float64	1.7 MB	71,916	33.4%	8	<0.1%	2,049	1.0%	1.0%	0.1
547	amt_credit_to_income_ratio	float64	1.7 MB	39,372	18.3%	0	0%	3,691	1.7%	1.7%	2.0
548	amt_annuity_to_income_per_family_member	float64	1.7 MB	88,172	41.0%	9	<0.1%	1,500	0.7%	0.7%	0.3

6 Further Pre-Processing

In this chapter, further data pre-processing and pre-selection of features are performed to prepare the data for modeling.

6.1 Identify Redundant and Problematic Features

The purpose of this section is to identify two sets of variables:

A set of variables from the merged data table that should be kept and included in the pre-processing (“before pre-processing” set);
A set of variables that should be kept after pre-processing and used for creating a predictive model (“after pre-processing” set).

To achieve this, problematic, duplicated, and correlated columns are identified first, and the remaining columns (the complement of that set) are kept.

In this section, only the training set is used.

6.1.1 Steps Before Pre-Processing

Columns are considered problematic and are excluded before further pre-processing if they:

have only one unique value or only missing values;
have at least 90% of missing values;
have a single value (excluding missing ones) that is present in about 99.9% of cases or more (the code below uses a threshold of 99.85%, which is displayed as 99.9% after rounding).

problematic_columns = credits_train_col_info.query(
    "n_unique <= 1 or p_missing >= 90.00 or p_dom_excl_na >= 99.85"
)

print(f"N columns to remove: {problematic_columns.shape[0]}")
problematic_columns.pipe(an.style_col_info)

N columns to remove: 45

Table 6.1. Info on the problematic columns to remove before preprocessing.

	column	data_type	memory_size	n_unique	p_unique	n_missing	p_missing	n_dominant	p_dominant	p_dom_excl_na	dominant
18	FLAG_MOBIL	int8	215.3 kB	2	<0.1%	0	0%	215,256	>99.9%	>99.9%	1
89	FLAG_DOCUMENT_2	int8	215.3 kB	2	<0.1%	0	0%	215,246	>99.9%	>99.9%	0
91	FLAG_DOCUMENT_4	int8	215.3 kB	2	<0.1%	0	0%	215,238	>99.9%	>99.9%	0
94	FLAG_DOCUMENT_7	int8	215.3 kB	2	<0.1%	0	0%	215,221	>99.9%	>99.9%	0
97	FLAG_DOCUMENT_10	int8	215.3 kB	2	<0.1%	0	0%	215,253	>99.9%	>99.9%	0
99	FLAG_DOCUMENT_12	int8	215.3 kB	2	<0.1%	0	0%	215,256	>99.9%	>99.9%	0
102	FLAG_DOCUMENT_15	int8	215.3 kB	2	<0.1%	0	0%	215,015	99.9%	99.9%	0
104	FLAG_DOCUMENT_17	int8	215.3 kB	2	<0.1%	0	0%	215,200	>99.9%	>99.9%	0
106	FLAG_DOCUMENT_19	int8	215.3 kB	2	<0.1%	0	0%	215,124	99.9%	99.9%	0
107	FLAG_DOCUMENT_20	int8	215.3 kB	2	<0.1%	0	0%	215,146	99.9%	99.9%	0
108	FLAG_DOCUMENT_21	int8	215.3 kB	2	<0.1%	0	0%	215,187	>99.9%	>99.9%	0
118	n_credits_bad_debt	float32	861.0 kB	2	<0.1%	30,836	14.3%	184,408	85.7%	>99.9%	0.0
120	mode_credit_currency	category	215.6 kB	3	<0.1%	30,836	14.3%	184,386	85.7%	>99.9%	currency 1
124	n_currency_3	float32	861.0 kB	4	<0.1%	30,836	14.3%	184,319	85.6%	99.9%	0.0
125	n_currency_4	float32	861.0 kB	2	<0.1%	30,836	14.3%	184,414	85.7%	>99.9%	0.0
132	days_credit_overdue_min	float32	861.0 kB	69	<0.1%	30,836	14.3%	184,320	85.6%	99.9%	0.0
156	cnt_credit_prolong_min	float32	861.0 kB	6	<0.1%	30,836	14.3%	184,215	85.6%	99.9%	0.0
184	amt_credit_sum_overdue_min	float32	861.0 kB	81	<0.1%	30,836	14.3%	184,318	85.6%	99.9%	0.0
213	bureau_dpd_status_min	float32	861.0 kB	6	<0.1%	152,586	70.9%	62,638	29.1%	99.9%	0.0
259	rate_interest_primary_min	float32	861.0 kB	119	0.1%	212,016	98.5%	666	0.3%	20.5%	0.18913634
260	rate_interest_primary_max	float32	861.0 kB	119	0.1%	212,016	98.5%	674	0.3%	20.8%	0.18913634
261	rate_interest_primary_mean	float32	861.0 kB	160	0.1%	212,016	98.5%	655	0.3%	20.2%	0.18913634
262	rate_interest_primary_std	float32	861.0 kB	39	<0.1%	215,139	99.9%	37	<0.1%	31.4%	0.0
263	rate_interest_primary_median	float32	861.0 kB	157	0.1%	212,016	98.5%	655	0.3%	20.2%	0.18913634
264	rate_interest_primary_range	float32	861.0 kB	37	<0.1%	212,016	98.5%	3,160	1.5%	97.5%	0.0
266	rate_interest_privileged_min	float32	861.0 kB	21	<0.1%	212,016	98.5%	892	0.4%	27.5%	0.83509517
267	rate_interest_privileged_max	float32	861.0 kB	21	<0.1%	212,016	98.5%	906	0.4%	28.0%	0.83509517
268	rate_interest_privileged_mean	float32	861.0 kB	42	<0.1%	212,016	98.5%	881	0.4%	27.2%	0.83509517
269	rate_interest_privileged_std	float32	861.0 kB	21	<0.1%	215,139	99.9%	50	<0.1%	42.4%	0.0
270	rate_interest_privileged_median	float32	861.0 kB	40	<0.1%	212,016	98.5%	881	0.4%	27.2%	0.83509517
271	rate_interest_privileged_range	float32	861.0 kB	20	<0.1%	212,016	98.5%	3,173	1.5%	97.9%	0.0
285	n_payment_type_cash_from_account	float32	861.0 kB	1	<0.1%	11,456	5.3%	203,801	94.7%	100.0%	0.0
309	n_channel_type_car_dealer	float32	861.0 kB	6	<0.1%	11,456	5.3%	203,580	94.6%	99.9%	0.0
347	any_nflag_insured_on_approval	Int8	430.5 kB	1	<0.1%	0	0%	215,257	100.0%	100.0%	0
406	sk_dpd_pos_applications_min	float32	861.0 kB	44	<0.1%	12,570	5.8%	202,642	94.1%	>99.9%	0.0
412	sk_dpd_def_pos_applications_min	float32	861.0 kB	3	<0.1%	12,570	5.8%	202,685	94.2%	>99.9%	0.0
416	sk_dpd_def_pos_applications_median	float32	861.0 kB	61	<0.1%	12,570	5.8%	202,489	94.1%	99.9%	0.0
442	amt_drawings_atm_current_range	float32	861.0 kB	1	<0.1%	172,254	80.0%	43,003	20.0%	100.0%	0.0
449	amt_drawings_other_current_min	float32	861.0 kB	4	<0.1%	172,254	80.0%	43,000	20.0%	>99.9%	0.0
453	amt_drawings_other_current_median	float64	1.7 MB	33	<0.1%	172,254	80.0%	42,965	20.0%	99.9%	0.0
509	cnt_drawings_other_current_min	float32	861.0 kB	3	<0.1%	172,254	80.0%	43,000	20.0%	>99.9%	0.0
513	cnt_drawings_other_current_median	float32	861.0 kB	4	<0.1%	172,254	80.0%	42,965	20.0%	99.9%	0.0
527	sk_dpd_credit_card_min	float32	861.0 kB	1	<0.1%	154,158	71.6%	61,099	28.4%	100.0%	0.0
533	sk_dpd_def_credit_card_min	float32	861.0 kB	1	<0.1%	154,158	71.6%	61,099	28.4%	100.0%	0.0
537	sk_dpd_def_credit_card_median	float32	861.0 kB	16	<0.1%	154,158	71.6%	61,061	28.4%	99.9%	0.0
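
The statistics used in the filter above (`n_unique`, `p_missing`, `p_dom_excl_na`) come from the project's own `an.col_info` helper, whose implementation is not shown in this section. As a rough illustration only, the sketch below computes similar statistics with plain pandas on a toy table and applies the same query. The function `col_info_sketch` and the toy data are hypothetical and are not part of the original analysis.

```python
import pandas as pd


def col_info_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column summary, similar in spirit to the statistics used above.

    Hypothetical stand-in for the project's `an.col_info` helper.
    """
    rows = []
    for name in df.columns:
        col = df[name]
        counts = col.value_counts(dropna=True)  # sorted, most frequent first
        n_non_missing = int(col.notna().sum())
        n_dominant = int(counts.iloc[0]) if len(counts) else 0
        rows.append(
            {
                "column": name,
                "n_unique": int(col.nunique(dropna=True)),
                # Share of missing values, in percent.
                "p_missing": 100 * col.isna().mean(),
                # Share of the most frequent value among non-missing values.
                "p_dom_excl_na": (
                    100 * n_dominant / n_non_missing
                    if n_non_missing
                    else float("nan")
                ),
            }
        )
    return pd.DataFrame(rows)


# Toy data: three problematic columns and one informative column.
toy = pd.DataFrame(
    {
        "constant": [1] * 1000,
        "mostly_missing": [0.5, 1.5] * 25 + [None] * 950,
        "near_constant": [0] * 999 + [1],
        "informative": list(range(1000)),
    }
)

info = col_info_sketch(toy)
problematic = info.query(
    "n_unique <= 1 or p_missing >= 90.00 or p_dom_excl_na >= 99.85"
)
print(problematic["column"].tolist())
```

In this toy case each of the first three columns is caught by a different criterion (one unique value, 95% missing, 99.9% dominant value), while `informative` passes all three checks.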

# Create list of columns to keep
cols_to_keep_1 = list(
    set(credits_train.columns) - set(problematic_columns.column) - set(["TARGET"])
)

The next steps are to:

remove the identified problematic columns;
drop duplicated columns;
use the SmartCorrelatedSelection algorithm to identify groups of correlated variables and to keep only a single variable (the one with the highest variance) from each group.

pipeline_selection_before_preprec = Pipeline(
    steps=[
        ("column_selector_1", ColumnSelector(cols_to_keep_1)),
        ("drop_duplicate_features", DropDuplicateFeatures()),
        (
            "drop_corr_features",
            SmartCorrelatedSelection(selection_method="variance"),
        ),
    ]
)

pipeline_selection_before_preprec.fit(credits_train)
# Time: 5m 36.1s

Pipeline(steps=[('column_selector_1',
                 ColumnSelector(keep=['days_credit_update_max',
                                      'cnt_installment_mature_cum_min',
                                      'cnt_drawings_current_min',
                                      'sk_dpd_pos_applications_mean',
                                      'ord_education_type',
                                      'REG_REGION_NOT_WORK_REGION',
                                      'amt_goods_price_mean', 'FLOORSMIN_MEDI',
                                      'cnt_installment_future_std',
                                      'n_previous_pos_applications',
                                      'diff_percent_installment_p...
                                      'amt_credit_sum_std',
                                      'amt_credit_sum_debt_sum',
                                      'DAYS_ID_PUBLISH', 'FLAG_DOCUMENT_11',
                                      'LIVINGAPARTMENTS_MODE',
                                      'amt_payment_total_current_std',
                                      'cnt_payment_min',
                                      'sk_dpd_def_pos_applications_max',
                                      'n_channel_type_regional_and_local', ...])),
                ('drop_duplicate_features', DropDuplicateFeatures()),
                ('drop_corr_features',
                 SmartCorrelatedSelection(selection_method='variance'))])
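
To make the last two pipeline steps more tangible, here is a minimal pandas/NumPy sketch of the underlying idea: exact duplicate columns are dropped first, then columns are grouped by absolute Pearson correlation and only the highest-variance column of each group is kept. This is a simplified illustration, not the feature-engine implementation. The helper `select_by_variance`, the toy data, and the 0.8 correlation threshold are assumptions made for this example.

```python
import numpy as np
import pandas as pd


def select_by_variance(df: pd.DataFrame, threshold: float = 0.8) -> list:
    """Greedy sketch: group correlated columns, keep the highest-variance one.

    Hypothetical helper; the threshold value is an assumption.
    """
    corr = df.corr(method="pearson").abs()
    variances = df.var()
    examined = set()
    to_drop = set()
    for col in df.columns:
        if col in examined:
            continue
        examined.add(col)
        # Columns not yet assigned to a group that correlate with `col`.
        partners = [
            other
            for other in df.columns
            if other not in examined and corr.loc[col, other] > threshold
        ]
        if partners:
            group = [col, *partners]
            examined.update(partners)
            keep = variances[group].idxmax()
            to_drop.update(c for c in group if c != keep)
    return [c for c in df.columns if c not in to_drop]


rng = np.random.default_rng(0)
x = rng.normal(size=500)
toy = pd.DataFrame(
    {
        "x": x,
        "x_scaled": 10 * x + rng.normal(scale=0.1, size=500),
        "x_copy": x.copy(),
        "noise": rng.normal(size=500),
    }
)

# Step 1: drop exact duplicate columns (the first occurrence is kept).
deduplicated = toy.loc[:, ~toy.T.duplicated()]

# Step 2: keep one representative per group of correlated columns.
selected = select_by_variance(deduplicated)
print(selected)
```

Here `x_copy` is removed as a duplicate of `x`, and `x` is then dropped in favor of `x_scaled`, which carries nearly the same information but has a larger variance. Note that a variance-based choice depends on the scale of the variables, so it tends to favor features measured in larger units.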

df_before_preproc = pipeline_selection_before_preprec.transform(credits_train)
df_before_preproc.shape

(215257, 251)

df_before_preproc = df_before_preproc.sort_index(axis=1)
before_preproc_col_info = an.col_info(df_before_preproc)
before_preproc_col_info.pipe(an.style_col_info)

Table 6.2. Info on the selected columns before preprocessing.

	column	data_type	memory_size	n_unique	p_unique	n_missing	p_missing	n_dominant	p_dominant	p_dom_excl_na	dominant
1	AMT_ANNUITY	float32	861.0 kB	12,801	5.9%	8	<0.1%	4,499	2.1%	2.1%	9000.0
2	AMT_CREDIT	float32	861.0 kB	5,097	2.4%	0	0%	6,823	3.2%	3.2%	450000.0
3	AMT_INCOME_TOTAL	float64	1.7 MB	1,949	0.9%	0	0%	24,982	11.6%	11.6%	135000.0
4	AMT_REQ_CREDIT_BUREAU_DAY	float32	861.0 kB	9	<0.1%	29,081	13.5%	185,147	86.0%	99.4%	0.0
5	AMT_REQ_CREDIT_BUREAU_HOUR	float32	861.0 kB	5	<0.1%	29,081	13.5%	185,061	86.0%	99.4%	0.0
6	AMT_REQ_CREDIT_BUREAU_MON	float32	861.0 kB	22	<0.1%	29,081	13.5%	155,679	72.3%	83.6%	0.0
7	AMT_REQ_CREDIT_BUREAU_QRT	float32	861.0 kB	10	<0.1%	29,081	13.5%	150,895	70.1%	81.0%	0.0
8	AMT_REQ_CREDIT_BUREAU_WEEK	float32	861.0 kB	9	<0.1%	29,081	13.5%	180,246	83.7%	96.8%	0.0
9	AMT_REQ_CREDIT_BUREAU_YEAR	float32	861.0 kB	24	<0.1%	29,081	13.5%	50,313	23.4%	27.0%	0.0
10	BASEMENTAREA_MODE	float32	861.0 kB	3,687	1.7%	125,793	58.4%	11,561	5.4%	12.9%	0.0
11	CNT_FAM_MEMBERS	float32	861.0 kB	12	<0.1%	1	<0.1%	110,671	51.4%	51.4%	2.0
12	COMMONAREA_MEDI	float32	861.0 kB	2,982	1.4%	150,300	69.8%	6,068	2.8%	9.3%	0.0
13	DAYS_ID_PUBLISH	int16	430.5 kB	6,122	2.8%	0	0%	119	0.1%	0.1%	-4074
14	DAYS_LAST_PHONE_CHANGE	float32	861.0 kB	3,720	1.7%	1	<0.1%	26,201	12.2%	12.2%	0.0
15	DAYS_REGISTRATION	float32	861.0 kB	15,249	7.1%	0	0%	79	<0.1%	<0.1%	-7.0
16	DEF_30_CNT_SOCIAL_CIRCLE	float32	861.0 kB	10	<0.1%	714	0.3%	189,988	88.3%	88.6%	0.0
17	ELEVATORS_AVG	float32	861.0 kB	241	0.1%	114,570	53.2%	60,109	27.9%	59.7%	0.0
18	ELEVATORS_MEDI	float32	861.0 kB	46	<0.1%	114,570	53.2%	61,040	28.4%	60.6%	0.0
19	ENTRANCES_MODE	float32	861.0 kB	30	<0.1%	108,270	50.3%	25,310	11.8%	23.7%	0.1379
20	EXT_SOURCE_1	float32	861.0 kB	83,961	39.0%	121,373	56.4%	5	<0.1%	<0.1%	0.44398212
21	EXT_SOURCE_2	float32	861.0 kB	102,229	47.5%	464	0.2%	503	0.2%	0.2%	0.28589788
22	EXT_SOURCE_3	float32	861.0 kB	804	0.4%	42,680	19.8%	985	0.5%	0.6%	0.7463002
23	FLAG_CONT_MOBILE	int8	215.3 kB	2	<0.1%	0	0%	214,855	99.8%	99.8%	1
24	FLAG_DOCUMENT_11	int8	215.3 kB	2	<0.1%	0	0%	214,448	99.6%	99.6%	0
25	FLAG_DOCUMENT_13	int8	215.3 kB	2	<0.1%	0	0%	214,541	99.7%	99.7%	0
26	FLAG_DOCUMENT_14	int8	215.3 kB	2	<0.1%	0	0%	214,614	99.7%	99.7%	0
27	FLAG_DOCUMENT_16	int8	215.3 kB	2	<0.1%	0	0%	213,089	99.0%	99.0%	0
28	FLAG_DOCUMENT_18	int8	215.3 kB	2	<0.1%	0	0%	213,525	99.2%	99.2%	0
29	FLAG_DOCUMENT_3	int8	215.3 kB	2	<0.1%	0	0%	152,845	71.0%	71.0%	1
30	FLAG_DOCUMENT_5	int8	215.3 kB	2	<0.1%	0	0%	212,025	98.5%	98.5%	0
31	FLAG_DOCUMENT_6	int8	215.3 kB	2	<0.1%	0	0%	196,348	91.2%	91.2%	0
32	FLAG_DOCUMENT_8	int8	215.3 kB	2	<0.1%	0	0%	197,689	91.8%	91.8%	0
33	FLAG_DOCUMENT_9	int8	215.3 kB	2	<0.1%	0	0%	214,440	99.6%	99.6%	0
34	FLAG_EMAIL	int8	215.3 kB	2	<0.1%	0	0%	203,006	94.3%	94.3%	0
35	FLAG_EMP_PHONE	int8	215.3 kB	2	<0.1%	0	0%	176,491	82.0%	82.0%	1
36	FLAG_IS_EMERGENCY	Int8	430.5 kB	2	<0.1%	0	0%	213,628	99.2%	99.2%	0
37	FLAG_OWN_CAR	Int8	430.5 kB	2	<0.1%	0	0%	142,086	66.0%	66.0%	0
38	FLAG_OWN_REALTY	Int8	430.5 kB	2	<0.1%	0	0%	149,412	69.4%	69.4%	1
39	FLAG_PHONE	int8	215.3 kB	2	<0.1%	0	0%	154,906	72.0%	72.0%	0
40	FLAG_WORK_PHONE	int8	215.3 kB	2	<0.1%	0	0%	172,406	80.1%	80.1%	0
41	FLOORSMAX_MEDI	float32	861.0 kB	49	<0.1%	106,970	49.7%	44,659	20.7%	41.2%	0.1667
42	FLOORSMIN_MEDI	float32	861.0 kB	47	<0.1%	146,054	67.9%	23,733	11.0%	34.3%	0.2083
43	FONDKAPREMONT_MODE	category	215.7 kB	4	<0.1%	147,099	68.3%	51,785	24.1%	76.0%	reg oper account
44	HOUSETYPE_MODE	category	215.6 kB	3	<0.1%	107,834	50.1%	105,515	49.0%	98.2%	block of flats
45	LANDAREA_MEDI	float32	861.0 kB	3,393	1.6%	127,644	59.3%	11,058	5.1%	12.6%	0.0
46	NAME_CONTRACT_TYPE	category	215.5 kB	2	<0.1%	0	0%	194,675	90.4%	90.4%	Cash loans
47	NAME_EDUCATION_TYPE	category	215.8 kB	5	<0.1%	0	0%	152,993	71.1%	71.1%	Secondary / secondary special
48	NAME_HOUSING_TYPE	category	215.9 kB	6	<0.1%	0	0%	191,159	88.8%	88.8%	House / apartment
49	NAME_INCOME_TYPE	category	216.1 kB	8	<0.1%	0	0%	110,984	51.6%	51.6%	Working
50	NAME_TYPE_SUITE	category	216.0 kB	7	<0.1%	901	0.4%	174,089	80.9%	81.2%	Unaccompanied
51	NONLIVINGAPARTMENTS_AVG	float32	861.0 kB	345	0.2%	149,354	69.4%	38,319	17.8%	58.1%	0.0
52	NONLIVINGAREA_MODE	float32	861.0 kB	3,090	1.4%	118,577	55.1%	46,933	21.8%	48.5%	0.0
53	OBS_30_CNT_SOCIAL_CIRCLE	float32	861.0 kB	32	<0.1%	714	0.3%	114,550	53.2%	53.4%	0.0
54	OCCUPATION_TYPE	category	217.1 kB	18	<0.1%	67,480	31.3%	38,591	17.9%	26.1%	Laborers
55	ORGANIZATION_TYPE	category	221.3 kB	57	<0.1%	38,756	18.0%	47,582	22.1%	27.0%	Business Entity Type 3
56	OWN_CAR_AGE	float32	861.0 kB	61	<0.1%	142,091	66.0%	5,232	2.4%	7.2%	7.0
57	REGION_POPULATION_RELATIVE	float32	861.0 kB	81	<0.1%	0	0%	11,494	5.3%	5.3%	0.035792
58	REGION_RATING_CLIENT	int8	215.3 kB	3	<0.1%	0	0%	158,846	73.8%	73.8%	2
59	REG_CITY_NOT_LIVE_CITY	int8	215.3 kB	2	<0.1%	0	0%	198,549	92.2%	92.2%	0
60	REG_CITY_NOT_WORK_CITY	int8	215.3 kB	2	<0.1%	0	0%	165,697	77.0%	77.0%	0
61	REG_REGION_NOT_LIVE_REGION	int8	215.3 kB	2	<0.1%	0	0%	211,999	98.5%	98.5%	0
62	REG_REGION_NOT_WORK_REGION	int8	215.3 kB	2	<0.1%	0	0%	204,222	94.9%	94.9%	0
63	WALLSMATERIAL_MODE	category	216.0 kB	7	<0.1%	109,329	50.8%	46,298	21.5%	43.7%	Panel
64	YEARS_BEGINEXPLUATATION_MODE	float32	861.0 kB	210	0.1%	104,910	48.7%	3,039	1.4%	2.8%	0.9871
65	YEARS_BUILD_AVG	float32	861.0 kB	146	0.1%	143,036	66.4%	2,132	1.0%	3.0%	0.8232
66	amt_annuity_max	float64	1.7 MB	18,638	8.7%	159,480	74.1%	13,781	6.4%	24.7%	0.0
67	amt_annuity_max_previous_application	float64	1.7 MB	110,598	51.4%	11,752	5.5%	2,363	1.1%	1.2%	22500.0
68	amt_annuity_median	float64	1.7 MB	16,441	7.6%	159,480	74.1%	23,785	11.0%	42.6%	0.0
69	amt_annuity_median_previous_application	float64	1.7 MB	157,063	73.0%	11,752	5.5%	1,357	0.6%	0.7%	11250.0
70	amt_annuity_min	float64	1.7 MB	9,921	4.6%	159,480	74.1%	36,975	17.2%	66.3%	0.0
71	amt_annuity_min_previous_application	float64	1.7 MB	113,816	52.9%	11,752	5.5%	16,017	7.4%	7.9%	2250.0
72	amt_annuity_to_credit_ratio	float32	861.0 kB	33,148	15.4%	8	<0.1%	20,556	9.5%	9.5%	0.05
73	amt_annuity_to_income_per_family_member	float64	1.7 MB	88,172	41.0%	9	<0.1%	1,500	0.7%	0.7%	0.3
74	amt_annuity_to_income_ratio	float64	1.7 MB	71,916	33.4%	8	<0.1%	2,049	1.0%	1.0%	0.1
75	amt_balance_credit_card_max	float64	1.7 MB	40,175	18.7%	154,158	71.6%	19,232	8.9%	31.5%	0.0
76	amt_balance_credit_card_median	float64	1.7 MB	27,685	12.9%	154,158	71.6%	33,027	15.3%	54.1%	0.0
77	amt_balance_credit_card_min	float64	1.7 MB	8,310	3.9%	154,158	71.6%	52,144	24.2%	85.3%	0.0
78	amt_credit_limit_actual_median	float32	861.0 kB	151	0.1%	154,158	71.6%	7,600	3.5%	12.4%	0.0
79	amt_credit_limit_actual_range	float32	861.0 kB	147	0.1%	154,158	71.6%	26,300	12.2%	43.0%	0.0
80	amt_credit_max	float64	1.7 MB	49,618	23.1%	11,456	5.3%	4,696	2.2%	2.3%	450000.0
81	amt_credit_max_overdue_max	float64	1.7 MB	32,871	15.3%	86,638	40.2%	79,549	37.0%	61.8%	0.0
82	amt_credit_max_overdue_range	float64	1.7 MB	27,267	12.7%	86,638	40.2%	88,957	41.3%	69.2%	0.0
83	amt_credit_median	float64	1.7 MB	73,966	34.4%	11,456	5.3%	8,095	3.8%	4.0%	0.0
84	amt_credit_min	float64	1.7 MB	33,220	15.4%	11,456	5.3%	79,660	37.0%	39.1%	0.0
85	amt_credit_range	float64	1.7 MB	71,950	33.4%	11,456	5.3%	37,038	17.2%	18.2%	0.0
86	amt_credit_sum_debt_mean	float64	1.7 MB	121,544	56.5%	36,039	16.7%	48,543	22.6%	27.1%	0.0
87	amt_credit_sum_debt_median	float64	1.7 MB	48,592	22.6%	36,039	16.7%	120,818	56.1%	67.4%	0.0
88	amt_credit_sum_debt_sum	float64	1.7 MB	113,811	52.9%	30,836	14.3%	53,746	25.0%	29.1%	0.0
89	amt_credit_sum_limit_min	float64	1.7 MB	2,121	1.0%	45,585	21.2%	167,209	77.7%	98.5%	0.0
90	amt_credit_sum_limit_std	float64	1.7 MB	26,937	12.5%	80,896	37.6%	102,265	47.5%	76.1%	0.0
91	amt_credit_sum_limit_sum	float64	1.7 MB	26,367	12.2%	30,836	14.3%	150,348	69.8%	81.5%	0.0
92	amt_credit_sum_median	float64	1.7 MB	77,800	36.1%	30,836	14.3%	5,011	2.3%	2.7%	225000.0
93	amt_credit_sum_overdue_std	float64	1.7 MB	1,618	0.8%	55,965	26.0%	157,060	73.0%	98.6%	0.0
94	amt_credit_sum_overdue_sum	float64	1.7 MB	930	0.4%	30,836	14.3%	182,090	84.6%	98.7%	0.0
95	amt_credit_sum_std	float64	1.7 MB	148,439	69.0%	55,965	26.0%	1,156	0.5%	0.7%	0.0
96	amt_credit_sum_sum	float64	1.7 MB	147,742	68.6%	30,836	14.3%	924	0.4%	0.5%	225000.0
97	amt_credit_to_income_ratio	float64	1.7 MB	39,372	18.3%	0	0%	3,691	1.7%	1.7%	2.0
98	amt_down_payment_max	float64	1.7 MB	17,607	8.2%	23,703	11.0%	53,725	25.0%	28.0%	0.0
99	amt_down_payment_mean	float64	1.7 MB	42,577	19.8%	23,703	11.0%	53,725	25.0%	28.0%	0.0
100	amt_drawings_atm_current_max	float64	1.7 MB	1,131	0.5%	172,254	80.0%	6,929	3.2%	16.1%	0.0
101	amt_drawings_atm_current_median	float64	1.7 MB	378	0.2%	172,254	80.0%	36,581	17.0%	85.1%	0.0
102	amt_drawings_atm_current_min	float32	861.0 kB	114	0.1%	172,254	80.0%	42,401	19.7%	98.6%	0.0
103	amt_drawings_current_max	float64	1.7 MB	17,325	8.0%	154,158	71.6%	19,196	8.9%	31.4%	0.0
104	amt_drawings_current_mean	float64	1.7 MB	35,095	16.3%	154,158	71.6%	19,196	8.9%	31.4%	0.0
105	amt_drawings_current_min	float64	1.7 MB	1,475	0.7%	154,158	71.6%	59,264	27.5%	97.0%	0.0
106	amt_drawings_other_current_max	float64	1.7 MB	1,084	0.5%	172,254	80.0%	38,999	18.1%	90.7%	0.0
107	amt_drawings_pos_current_max	float64	1.7 MB	20,726	9.6%	172,254	80.0%	19,027	8.8%	44.2%	0.0
108	amt_drawings_pos_current_mean	float64	1.7 MB	23,516	10.9%	172,254	80.0%	19,027	8.8%	44.2%	0.0
109	amt_drawings_pos_current_min	float64	1.7 MB	1,772	0.8%	172,254	80.0%	41,083	19.1%	95.5%	0.0
110	amt_goods_price_min	float64	1.7 MB	39,170	18.2%	12,169	5.7%	11,596	5.4%	5.7%	45000.0
111	amt_inst_min_regularity_min	float64	1.7 MB	1,664	0.8%	154,158	71.6%	57,788	26.8%	94.6%	0.0
112	amt_payment_current_median	float64	1.7 MB	17,066	7.9%	172,336	80.1%	2,689	1.2%	6.3%	9000.0
113	amt_payment_current_min	float64	1.7 MB	7,398	3.4%	172,336	80.1%	26,925	12.5%	62.7%	0.0
114	amt_payment_current_range	float64	1.7 MB	22,545	10.5%	172,336	80.1%	682	0.3%	1.6%	0.0
115	amt_payment_total_current_min	float64	1.7 MB	1,131	0.5%	154,158	71.6%	59,285	27.5%	97.0%	0.0
116	any_installments_late_30	float32	861.0 kB	2	<0.1%	11,034	5.1%	190,963	88.7%	93.5%	0.0
117	any_installments_late_60	float32	861.0 kB	2	<0.1%	11,034	5.1%	198,146	92.1%	97.0%	0.0
118	any_installments_late_7	float32	861.0 kB	2	<0.1%	11,034	5.1%	147,558	68.5%	72.3%	0.0
119	bureau_dpd_status_max	float32	861.0 kB	6	<0.1%	152,586	70.9%	41,042	19.1%	65.5%	0.0
120	bureau_dpd_status_median	float32	861.0 kB	11	<0.1%	152,586	70.9%	61,726	28.7%	98.5%	0.0
121	bureau_months_balance_max	float32	861.0 kB	89	<0.1%	152,586	70.9%	59,695	27.7%	95.3%	0.0
122	cnt_credit_prolong_mean	float32	861.0 kB	100	<0.1%	30,836	14.3%	178,412	82.9%	96.7%	0.0
123	cnt_credit_prolong_sum	float32	861.0 kB	10	<0.1%	30,836	14.3%	178,412	82.9%	96.7%	0.0
124	cnt_drawings_atm_current_max	float32	861.0 kB	43	<0.1%	172,254	80.0%	6,929	3.2%	16.1%	0.0
125	cnt_drawings_atm_current_std	float32	861.0 kB	16,770	7.8%	172,561	80.2%	6,817	3.2%	16.0%	0.0
126	cnt_drawings_current_min	float32	861.0 kB	39	<0.1%	154,158	71.6%	59,278	27.5%	97.0%	0.0
127	cnt_drawings_other_current_max	float32	861.0 kB	11	<0.1%	172,254	80.0%	38,987	18.1%	90.7%	0.0
128	cnt_drawings_pos_current_max	float32	861.0 kB	116	0.1%	172,254	80.0%	19,027	8.8%	44.2%	0.0
129	cnt_drawings_pos_current_median	float32	861.0 kB	113	0.1%	172,254	80.0%	33,721	15.7%	78.4%	0.0
130	cnt_drawings_pos_current_min	float32	861.0 kB	40	<0.1%	172,254	80.0%	41,083	19.1%	95.5%	0.0
131	cnt_fam_members_excluding_children	float32	861.0 kB	2	<0.1%	1	<0.1%	158,301	73.5%	73.5%	2.0
132	cnt_installment_future_min	float32	861.0 kB	61	<0.1%	12,588	5.8%	183,466	85.2%	90.5%	0.0
133	cnt_installment_mature_cum_max	float32	861.0 kB	120	0.1%	154,158	71.6%	19,249	8.9%	31.5%	0.0
134	cnt_installment_mature_cum_min	float32	861.0 kB	28	<0.1%	154,158	71.6%	38,853	18.0%	63.6%	0.0
135	cnt_installment_median	float32	861.0 kB	103	<0.1%	12,588	5.8%	61,162	28.4%	30.2%	12.0
136	cnt_installment_min	float32	861.0 kB	53	<0.1%	12,588	5.8%	42,362	19.7%	20.9%	6.0
137	cnt_installment_range	float32	861.0 kB	69	<0.1%	12,588	5.8%	49,692	23.1%	24.5%	0.0
138	cnt_installments_diff_mean	float32	861.0 kB	20,290	9.4%	12,588	5.8%	9,014	4.2%	4.4%	3.0
139	cnt_installments_diff_min	float32	861.0 kB	58	<0.1%	12,588	5.8%	198,083	92.0%	97.7%	0.0
140	cnt_installments_diff_range	float32	861.0 kB	82	<0.1%	12,588	5.8%	35,742	16.6%	17.6%	12.0
141	cnt_payment_median	float32	861.0 kB	87	<0.1%	11,752	5.5%	53,998	25.1%	26.5%	12.0
142	cnt_payment_min	float32	861.0 kB	31	<0.1%	11,752	5.5%	68,588	31.9%	33.7%	0.0
143	cnt_payment_range	float32	861.0 kB	69	<0.1%	11,752	5.5%	54,639	25.4%	26.8%	0.0
144	days_credit_enddate_max	Int32	1.1 MB	12,274	5.7%	32,432	15.1%	187	0.1%	0.1%	31060
145	days_credit_enddate_min	Int32	1.1 MB	6,266	2.9%	32,432	15.1%	119	0.1%	0.1%	-2359
146	days_credit_enddate_std	Float64	1.9 MB	134,001	62.3%	59,197	27.5%	1,369	0.6%	0.9%	0.0
147	days_credit_max	float32	861.0 kB	2,922	1.4%	30,836	14.3%	480	0.2%	0.3%	-91.0
148	days_credit_median	float32	861.0 kB	5,711	2.7%	30,836	14.3%	118	0.1%	0.1%	-561.0
149	days_credit_overdue_max	float32	861.0 kB	671	0.3%	30,836	14.3%	182,056	84.6%	98.7%	0.0
150	days_credit_overdue_mean	float32	861.0 kB	1,195	0.6%	30,836	14.3%	182,056	84.6%	98.7%	0.0
151	days_credit_overdue_median	float32	861.0 kB	168	0.1%	30,836	14.3%	184,119	85.5%	99.8%	0.0
152	days_credit_range	float32	861.0 kB	2,913	1.4%	30,836	14.3%	26,512	12.3%	14.4%	0.0
153	days_credit_std	float32	861.0 kB	133,052	61.8%	55,965	26.0%	1,383	0.6%	0.9%	0.0
154	days_credit_update_max	float32	861.0 kB	2,585	1.2%	30,836	14.3%	7,529	3.5%	4.1%	-7.0
155	days_credit_update_median	float32	861.0 kB	4,779	2.2%	30,836	14.3%	1,055	0.5%	0.6%	-22.0
156	days_credit_update_range	float32	861.0 kB	2,925	1.4%	30,836	14.3%	27,014	12.5%	14.6%	0.0
157	days_decision_max	float32	861.0 kB	2,921	1.4%	11,456	5.3%	598	0.3%	0.3%	-7.0
158	days_decision_median	float32	861.0 kB	5,656	2.6%	11,456	5.3%	255	0.1%	0.1%	-364.0
159	days_decision_range	float32	861.0 kB	2,919	1.4%	11,456	5.3%	40,565	18.8%	19.9%	0.0
160	days_enddate_fact_max	Int16	645.8 kB	2,793	1.3%	53,870	25.0%	340	0.2%	0.2%	-84
161	days_enddate_fact_median	Float32	1.1 MB	5,341	2.5%	53,870	25.0%	135	0.1%	0.1%	-919.0
162	days_enddate_fact_range	Int32	1.1 MB	2,796	1.3%	53,870	25.0%	38,623	17.9%	23.9%	0
163	days_first_draw_min	float32	861.0 kB	2,718	1.3%	12,377	5.7%	165,404	76.8%	81.5%	365243.0
164	days_last_due_1st_version_max	float32	861.0 kB	4,521	2.1%	12,377	5.7%	55,263	25.7%	27.2%	365243.0
165	days_last_due_1st_version_mean	float32	861.0 kB	51,499	23.9%	12,377	5.7%	1,911	0.9%	0.9%	365243.0
166	days_last_due_1st_version_median	float32	861.0 kB	10,719	5.0%	12,377	5.7%	1,937	0.9%	1.0%	365243.0
167	days_last_due_1st_version_min	float32	861.0 kB	4,081	1.9%	12,377	5.7%	1,911	0.9%	0.9%	365243.0
168	days_last_due_max	float32	861.0 kB	2,761	1.3%	12,377	5.7%	98,527	45.8%	48.6%	365243.0
169	days_last_due_range	float32	861.0 kB	5,592	2.6%	12,377	5.7%	58,659	27.3%	28.9%	0.0
170	days_termination_median	float32	861.0 kB	7,716	3.6%	12,377	5.7%	23,269	10.8%	11.5%	365243.0
171	days_termination_min	float32	861.0 kB	2,797	1.3%	12,377	5.7%	15,833	7.4%	7.8%	365243.0
172	diff_amt_installment_payment_max	float64	1.7 MB	75,445	35.0%	11,037	5.1%	116,518	54.1%	57.1%	0.0
173	diff_amt_installment_payment_mean	float64	1.7 MB	97,257	45.2%	11,037	5.1%	103,060	47.9%	50.5%	0.0
174	diff_amt_installment_payment_median	float64	1.7 MB	6,855	3.2%	11,037	5.1%	195,960	91.0%	96.0%	0.0
175	diff_amt_installment_payment_range	float64	1.7 MB	90,195	41.9%	11,037	5.1%	103,062	47.9%	50.5%	0.0
176	diff_days_installment_payment_max	float32	861.0 kB	409	0.2%	11,037	5.1%	15,321	7.1%	7.5%	30.0
177	diff_days_installment_payment_mean	float32	861.0 kB	50,246	23.3%	11,037	5.1%	761	0.4%	0.4%	9.0
178	diff_days_installment_payment_median	float32	861.0 kB	320	0.1%	11,037	5.1%	21,620	10.0%	10.6%	0.0
179	diff_days_installment_payment_range	float32	861.0 kB	1,465	0.7%	11,037	5.1%	5,349	2.5%	2.6%	30.0
180	diff_days_installment_payment_sum	float32	861.0 kB	4,383	2.0%	11,034	5.1%	540	0.3%	0.3%	66.0
181	diff_days_installment_payment_sum_late_only	float32	861.0 kB	1,815	0.8%	11,034	5.1%	95,670	44.4%	46.8%	0.0
182	diff_percent_installment_payment_mean	float64	1.7 MB	87,934	40.9%	11,037	5.1%	103,191	47.9%	50.5%	1.0
183	diff_percent_installment_payment_median	float32	861.0 kB	7,969	3.7%	11,037	5.1%	195,960	91.0%	96.0%	1.0
184	diff_percent_installment_payment_min	float32	861.0 kB	25,589	11.9%	11,037	5.1%	177,973	82.7%	87.1%	1.0
185	diff_percent_installment_payment_range	float64	1.7 MB	97,055	45.1%	11,037	5.1%	103,190	47.9%	50.5%	0.0
186	mode_credit_type	category	215.8 kB	6	<0.1%	30,836	14.3%	160,802	74.7%	87.2%	Consumer credit
187	n_car_loans	float32	861.0 kB	9	<0.1%	30,836	14.3%	170,683	79.3%	92.6%	0.0
188	n_cash_loans	float32	861.0 kB	55	<0.1%	11,456	5.3%	83,697	38.9%	41.1%	0.0
189	n_channel_type_ap_minus	float32	861.0 kB	33	<0.1%	11,456	5.3%	187,751	87.2%	92.1%	0.0
190	n_channel_type_channel_corporate_sales	float32	861.0 kB	20	<0.1%	11,456	5.3%	202,289	94.0%	99.3%	0.0
191	n_channel_type_contact_center	float32	861.0 kB	19	<0.1%	11,456	5.3%	175,621	81.6%	86.2%	0.0
192	n_channel_type_countrywide	float32	861.0 kB	34	<0.1%	11,456	5.3%	67,466	31.3%	33.1%	1.0
193	n_channel_type_credit_and_cash	float32	861.0 kB	52	<0.1%	11,456	5.3%	96,482	44.8%	47.3%	0.0
194	n_channel_type_regional_and_local	float32	861.0 kB	19	<0.1%	11,456	5.3%	158,328	73.6%	77.7%	0.0
195	n_channel_type_stone	float32	861.0 kB	22	<0.1%	11,456	5.3%	121,683	56.5%	59.7%	0.0
196	n_client_type_new	float32	861.0 kB	14	<0.1%	11,456	5.3%	154,064	71.6%	75.6%	1.0
197	n_client_type_refreshed	float32	861.0 kB	23	<0.1%	11,456	5.3%	150,108	69.7%	73.7%	0.0
198	n_client_type_repeater	float32	861.0 kB	61	<0.1%	11,456	5.3%	49,122	22.8%	24.1%	0.0
199	n_consumer_loans	float32	861.0 kB	36	<0.1%	11,456	5.3%	78,331	36.4%	38.4%	1.0
200	n_contract_status_refused	float32	861.0 kB	44	<0.1%	11,456	5.3%	133,394	62.0%	65.5%	0.0
201	n_contract_status_unused_offer	float32	861.0 kB	11	<0.1%	11,456	5.3%	190,553	88.5%	93.5%	0.0
202	n_contracts_credit_card_completed	float32	861.0 kB	40	<0.1%	154,158	71.6%	53,625	24.9%	87.8%	0.0
203	n_credit_card_credits	float32	861.0 kB	22	<0.1%	30,836	14.3%	63,863	29.7%	34.6%	0.0
204	n_credits_active	float32	861.0 kB	22	<0.1%	30,836	14.3%	51,735	24.0%	28.1%	1.0
205	n_credits_sold	float32	861.0 kB	7	<0.1%	30,836	14.3%	180,711	84.0%	98.0%	0.0
206	n_credits_total	float32	861.0 kB	57	<0.1%	30,836	14.3%	25,129	11.7%	13.6%	1.0
207	n_currency_2	float32	861.0 kB	7	<0.1%	30,836	14.3%	183,835	85.4%	99.7%	0.0
208	n_different_channels	float32	861.0 kB	7	<0.1%	11,456	5.3%	79,085	36.7%	38.8%	2.0
209	n_different_contract_types	float32	861.0 kB	4	<0.1%	11,456	5.3%	77,974	36.2%	38.3%	2.0
210	n_different_credit_types	float32	861.0 kB	5	<0.1%	30,836	14.3%	100,733	46.8%	54.6%	2.0
211	n_different_currencies	float32	861.0 kB	3	<0.1%	30,836	14.3%	183,765	85.4%	99.6%	1.0
212	n_installments_late	float32	861.0 kB	99	<0.1%	11,034	5.1%	95,670	44.4%	46.8%	0.0
213	n_installments_late_30	float32	861.0 kB	42	<0.1%	11,034	5.1%	190,963	88.7%	93.5%	0.0
214	n_installments_late_7	float32	861.0 kB	59	<0.1%	11,034	5.1%	147,558	68.5%	72.3%	0.0
215	n_installments_total	float32	861.0 kB	310	0.1%	11,034	5.1%	8,624	4.0%	4.2%	12.0
216	n_microloans	float32	861.0 kB	28	<0.1%	30,836	14.3%	181,975	84.5%	98.7%	0.0
217	n_mortgages	float32	861.0 kB	7	<0.1%	30,836	14.3%	174,434	81.0%	94.6%	0.0
218	n_nflag_insured_on_approval_mean	float32	861.0 kB	102	<0.1%	12,377	5.7%	95,675	44.4%	47.2%	0.0
219	n_nflag_insured_on_approval_sum	float32	861.0 kB	19	<0.1%	11,456	5.3%	96,596	44.9%	47.4%	0.0
220	n_other_type_credit	float32	861.0 kB	9	<0.1%	30,836	14.3%	182,373	84.7%	98.9%	0.0
221	n_payment_type_cash_through_bank	float32	861.0 kB	44	<0.1%	11,456	5.3%	54,943	25.5%	27.0%	1.0
222	n_payment_type_not_available	float32	861.0 kB	46	<0.1%	11,456	5.3%	71,796	33.4%	35.2%	0.0
223	n_previous_credit_card_applications	float32	861.0 kB	126	0.1%	154,158	71.6%	4,332	2.0%	7.1%	96.0
224	n_previous_credit_card_applications_signed	float32	861.0 kB	37	<0.1%	154,158	71.6%	58,091	27.0%	95.1%	0.0
225	n_previous_pos_applications	float32	861.0 kB	221	0.1%	12,570	5.8%	9,559	4.4%	4.7%	13.0
226	n_previous_pos_applications_completed	float32	861.0 kB	45	<0.1%	12,570	5.8%	73,226	34.0%	36.1%	1.0
227	n_previous_pos_applications_signed	float32	861.0 kB	31	<0.1%	12,570	5.8%	162,017	75.3%	79.9%	0.0
228	n_product_type_walk_in	float32	861.0 kB	28	<0.1%	11,456	5.3%	152,783	71.0%	75.0%	0.0
229	n_reject_reason_limit	float32	861.0 kB	22	<0.1%	11,456	5.3%	183,819	85.4%	90.2%	0.0
230	n_reject_reason_scoc	float32	861.0 kB	20	<0.1%	11,456	5.3%	188,558	87.6%	92.5%	0.0
231	n_reject_reason_scofr	float32	861.0 kB	16	<0.1%	11,456	5.3%	199,055	92.5%	97.7%	0.0
232	n_revolving_loans	float32	861.0 kB	25	<0.1%	11,456	5.3%	130,792	60.8%	64.2%	0.0
233	n_yield_group_high	float32	861.0 kB	30	<0.1%	11,456	5.3%	89,153	41.4%	43.7%	0.0
234	n_yield_group_low_action	float32	861.0 kB	22	<0.1%	11,456	5.3%	163,415	75.9%	80.2%	0.0
235	n_yield_group_low_normal	float32	861.0 kB	23	<0.1%	11,456	5.3%	94,724	44.0%	46.5%	0.0
236	n_yield_group_middle	float32	861.0 kB	25	<0.1%	11,456	5.3%	80,043	37.2%	39.3%	0.0
237	ord_education_type	int8	215.3 kB	5	<0.1%	0	0%	152,993	71.1%	71.1%	1
238	percent_installments_early	float32	861.0 kB	7,892	3.7%	11,034	5.1%	64,688	30.1%	31.7%	1.0
239	percent_installments_late	float32	861.0 kB	4,464	2.1%	11,034	5.1%	95,670	44.4%	46.8%	0.0
240	percent_installments_late_30	float32	861.0 kB	894	0.4%	11,034	5.1%	190,963	88.7%	93.5%	0.0
241	percent_installments_late_60	float32	861.0 kB	629	0.3%	11,034	5.1%	198,146	92.1%	97.0%	0.0
242	percent_installments_late_7	float32	861.0 kB	2,595	1.2%	11,034	5.1%	147,558	68.5%	72.3%	0.0
243	rate_down_payment_max	float32	861.0 kB	84,883	39.4%	23,703	11.0%	53,725	25.0%	28.0%	0.0
244	rate_down_payment_range	float32	861.0 kB	73,615	34.2%	23,703	11.0%	94,887	44.1%	49.5%	0.0
245	rate_interest_privileged_count	float32	861.0 kB	4	<0.1%	11,456	5.3%	200,560	93.2%	98.4%	0.0
246	sk_dpd_credit_card_max	float32	861.0 kB	353	0.2%	154,158	71.6%	48,474	22.5%	79.3%	0.0
247	sk_dpd_credit_card_median	float32	861.0 kB	222	0.1%	154,158	71.6%	60,546	28.1%	99.1%	0.0
248	sk_dpd_def_credit_card_max	float32	861.0 kB	47	<0.1%	154,158	71.6%	50,652	23.5%	82.9%	0.0
249	sk_dpd_def_pos_applications_max	float32	861.0 kB	173	0.1%	12,570	5.8%	174,617	81.1%	86.2%	0.0
250	sk_dpd_pos_applications_max	float32	861.0 kB	1,595	0.7%	12,570	5.8%	164,332	76.3%	81.1%	0.0
251	years_employed	float64	1.7 MB	11,769	5.5%	38,756	18.0%	112	0.1%	0.1%	0.6273972602739726

Code

# Save to file
file_path = dir_interim + "colnames--cols_to_include_in_preprocessing.csv"
before_preproc_col_info.column.to_csv(file_path, index=False)

# Read from file (to check)
cols_to_include_in_preprocessing = pd.read_csv(file_path).column.tolist()
del file_path

6.1.2 Pre-Processing

Next, data will be pre-processed in the following pipeline:

Keep only the columns selected in the previous step (all other columns are removed).
Apply different pre-processing steps to different data types:
1. For numeric data, use SimpleImputer to impute missing values and to create missing-value indicators;
2. For categorical data, use OneHotEncoder and then convert the resulting column names to snake case;
3. Leave other types of data (if any) unchanged.

Code

pipeline_pre_processing = Pipeline(
    steps=[
        ("selector", ColumnSelector(cols_to_include_in_preprocessing)),
        ("preprocessor", clone(pre_processing)),
    ]
)

pipeline_pre_processing
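
`ColumnSelector` is a project-specific transformer whose definition is not shown in this section. A minimal stand-in, assuming it simply keeps the listed columns of a pandas DataFrame, could look like this (the class name `ColumnSelectorSketch` and the toy data are hypothetical):

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class ColumnSelectorSketch(BaseEstimator, TransformerMixin):
    """Keep only the listed columns of a DataFrame (hypothetical stand-in)."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        # Nothing is learned from the data; the column list is fixed up front
        return self

    def transform(self, X):
        # Return the columns in the order they were listed
        return X[list(self.columns)]


# Toy data (hypothetical)
toy = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

pipe = Pipeline(steps=[("selector", ColumnSelectorSketch(["c", "a"]))])
selected = pipe.fit_transform(toy)

print(selected.columns.tolist())
```

Because the selector stores its argument unchanged in `__init__`, it is compatible with `clone()` and can be reused inside other pipelines.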

Code

credits_train_transformed = pipeline_pre_processing.fit_transform(credits_train)

Let’s look at the transformed data:

Code

credits_train_transformed.shape

(215257, 580)

Code

credits_train_transformed.head()

	AMT_ANNUITY	AMT_CREDIT	AMT_INCOME_TOTAL	AMT_REQ_CREDIT_BUREAU_MON	AMT_REQ_CREDIT_BUREAU_YEAR	BASEMENTAREA_MODE	CNT_FAM_MEMBERS	COMMONAREA_MEDI	DAYS_ID_PUBLISH	DAYS_LAST_PHONE_CHANGE	DAYS_REGISTRATION	DEF_30_CNT_SOCIAL_CIRCLE	ELEVATORS_AVG	ELEVATORS_MEDI	ENTRANCES_MODE	EXT_SOURCE_1	EXT_SOURCE_2	EXT_SOURCE_3	FLAG_CONT_MOBILE	FLAG_DOCUMENT_3	FLAG_DOCUMENT_8	FLAG_DOCUMENT_9	FLAG_EMAIL	FLAG_EMP_PHONE	FLAG_OWN_CAR	FLAG_OWN_REALTY	FLAG_PHONE	FLOORSMAX_MEDI	FLOORSMIN_MEDI	LANDAREA_MEDI	NONLIVINGAREA_MODE	OBS_30_CNT_SOCIAL_CIRCLE	OWN_CAR_AGE	REGION_POPULATION_RELATIVE	REGION_RATING_CLIENT	YEARS_BEGINEXPLUATATION_MODE	YEARS_BUILD_AVG	amt_annuity_max	amt_annuity_max_previous_application	amt_annuity_median	amt_annuity_median_previous_application	amt_annuity_min_previous_application	amt_annuity_to_credit_ratio	amt_annuity_to_income_per_family_member	amt_annuity_to_income_ratio	amt_balance_credit_card_max	amt_credit_limit_actual_median	amt_credit_limit_actual_range	amt_credit_max	amt_credit_max_overdue_max	amt_credit_median	amt_credit_min	amt_credit_range	amt_credit_sum_debt_mean	amt_credit_sum_debt_median	amt_credit_sum_debt_sum	amt_credit_sum_median	amt_credit_sum_std	amt_credit_sum_sum	amt_credit_to_income_ratio	amt_down_payment_max	amt_down_payment_mean	amt_drawings_atm_current_max	amt_drawings_current_max	amt_drawings_current_mean	amt_drawings_pos_current_max	amt_drawings_pos_current_mean	amt_goods_price_min	amt_payment_current_median	amt_payment_current_range	any_installments_late_7	bureau_dpd_status_max	cnt_drawings_atm_current_max	cnt_drawings_atm_current_std	cnt_drawings_pos_current_max	cnt_fam_members_excluding_children	cnt_installment_mature_cum_max	cnt_installment_median	cnt_installment_min	cnt_installment_range	cnt_installments_diff_mean	cnt_installments_diff_range	cnt_payment_median	cnt_payment_min	cnt_payment_range	days_credit_enddate_max	days_credit_enddate_min	days_credit_enddate_std	days_credit_max	days_credit_median	days_credit_range	days_credit_std	days_credit_update_max	days_credit_update_median	days_credit_update_range	days_decision_max	days_decision_median	days_decision_range	days_enddate_fact_max	...	missingindicator_rate_down_payment_max	missingindicator_rate_down_payment_range	missingindicator_sk_dpd_credit_card_max	missingindicator_sk_dpd_credit_card_median	missingindicator_sk_dpd_def_credit_card_max	FONDKAPREMONT_MODE_reg_oper_account	FONDKAPREMONT_MODE_nan	HOUSETYPE_MODE_block_of_flats	HOUSETYPE_MODE_nan	NAME_CONTRACT_TYPE_Cash_loans	NAME_EDUCATION_TYPE_Higher_education	NAME_EDUCATION_TYPE_Secondary_secondary_special	NAME_HOUSING_TYPE_House_apartment	NAME_INCOME_TYPE_Commercial_associate	NAME_INCOME_TYPE_State_servant	NAME_TYPE_SUITE_Family	NAME_TYPE_SUITE_Unaccompanied	OCCUPATION_TYPE_Accountants	OCCUPATION_TYPE_Drivers	OCCUPATION_TYPE_Laborers	OCCUPATION_TYPE_Managers	OCCUPATION_TYPE_Sales_staff	ORGANIZATION_TYPE_Agriculture	ORGANIZATION_TYPE_Business_Entity_Type_3	ORGANIZATION_TYPE_Construction	ORGANIZATION_TYPE_Self_employed	WALLSMATERIAL_MODE_Stone_brick	WALLSMATERIAL_MODE_nan	mode_credit_type_Consumer_credit	mode_credit_type_nan
0	68643.00	1971072.00	405000.00	0.00	0.00	0.10	4.00	0.02	-1823.00	-2169.00	-7460.00	0.00	0.00	0.00	0.24	0.68	0.33	0.64	1.00	1.00	0.00	0.00	0.00	1.00	1.00	1.00	0.00	0.17	0.21	0.00	0.03	4.00	13.00	0.01	3.00	0.98	0.78	45459.00	5920.02	27009.00	5920.02	5920.02	0.03	0.68	0.17	97790.49	157500.00	45000.00	51034.50	0.00	51034.50	51034.50	0.00	297855.00	161358.75	1191420.00	346479.75	522819.33	2141271.18	4.87	5175.00	5175.00	90000.00	69750.00	3498.70	6300.00	303.43	51610.50	5850.00	63000.00	0.00	1.00	3.00	0.75	1.00	2.00	7.00	12.00	12.00	0.00	6.00	12.00	12.00	12.00	0.00	934.00	-746.00	698.62	-145.00	-1001.50	1094.00	489.28	-7.00	-189.50	734.00	-2169.00	-2169.00	0.00	-362.00	...	0.00	0.00	1.00	1.00	1.00	1.00	0.00	1.00	0.00	1.00	1.00	0.00	1.00	1.00	0.00	0.00	1.00	1.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	1.00	1.00	0.00	1.00	0.00
1	38146.50	508495.50	337500.00	0.00	6.00	0.07	2.00	0.02	-1090.00	-659.00	-4054.00	1.00	0.00	0.00	0.14	0.51	0.62	0.44	1.00	0.00	1.00	0.00	0.00	1.00	0.00	1.00	0.00	0.17	0.21	0.05	0.00	2.00	9.00	0.01	2.00	0.98	0.76	12500.01	38443.23	3942.00	38250.00	28879.88	0.08	0.23	0.11	0.00	765000.00	0.00	765000.00	0.00	404878.50	0.00	765000.00	44370.00	0.00	169746.66	133852.50	183202.89	964161.00	1.51	5853.24	3375.00	90000.00	0.00	0.00	6300.00	303.43	337500.00	5850.00	63000.00	0.00	0.00	3.00	0.75	1.00	2.00	0.00	12.00	11.00	13.00	5.41	11.00	12.00	0.00	24.00	911.00	-1267.00	1014.06	-300.00	-957.00	1262.00	621.29	-19.00	-360.00	904.00	-330.00	-361.00	329.00	-345.00	...	1.00	1.00	0.00	0.00	0.00	0.00	1.00	0.00	1.00	1.00	1.00	0.00	1.00	0.00	1.00	1.00	0.00	0.00	0.00	0.00	1.00	0.00	1.00	0.00	0.00	0.00	0.00	1.00	0.00	1.00
2	13068.00	110146.50	112500.00	0.00	1.00	0.07	3.00	0.02	-4130.00	-172.00	-5554.00	0.00	0.00	0.00	0.14	0.36	0.65	0.54	1.00	0.00	0.00	1.00	1.00	1.00	0.00	1.00	1.00	0.17	0.21	0.05	0.00	0.00	9.00	0.01	2.00	0.98	0.76	12500.01	29840.31	3942.00	10251.99	7074.85	0.12	0.35	0.12	97790.49	157500.00	45000.00	808650.00	0.00	40045.50	0.00	808650.00	44370.00	0.00	169746.66	133852.50	183202.89	964161.00	0.98	24750.00	11407.50	90000.00	69750.00	3498.70	6300.00	303.43	37800.00	5850.00	63000.00	0.00	0.00	3.00	0.75	1.00	2.00	7.00	8.00	2.00	58.00	3.23	10.00	10.00	4.00	56.00	911.00	-1267.00	1014.06	-300.00	-957.00	1262.00	621.29	-19.00	-360.00	904.00	-121.00	-172.00	2606.00	-345.00	...	0.00	0.00	1.00	1.00	1.00	0.00	1.00	0.00	1.00	1.00	0.00	1.00	1.00	1.00	0.00	0.00	1.00	0.00	0.00	1.00	0.00	0.00	0.00	1.00	0.00	0.00	0.00	1.00	0.00	1.00
3	3519.00	66384.00	40500.00	1.00	2.00	0.07	4.00	0.02	-5290.00	-1576.00	-5285.00	0.00	0.00	0.00	0.14	0.39	0.60	0.45	1.00	1.00	0.00	0.00	0.00	1.00	0.00	1.00	0.00	0.17	0.21	0.05	0.00	0.00	9.00	0.03	2.00	0.98	0.76	14647.50	33316.83	4387.50	10444.18	8532.81	0.05	0.35	0.09	97790.49	157500.00	45000.00	593460.00	0.00	102568.50	43321.50	550138.50	69847.88	46305.00	279391.50	136719.00	88112.62	800424.00	1.64	6268.50	3134.25	90000.00	69750.00	3498.70	6300.00	303.43	36540.00	5850.00	63000.00	1.00	1.00	3.00	0.75	1.00	2.00	7.00	24.00	6.00	18.00	8.53	24.00	15.00	6.00	24.00	30905.00	-679.00	13897.16	-325.00	-545.00	1020.00	398.50	-14.00	-20.00	629.00	-575.00	-1190.00	2293.00	-518.00	...	0.00	0.00	1.00	1.00	1.00	0.00	1.00	0.00	1.00	1.00	0.00	1.00	1.00	1.00	0.00	0.00	1.00	0.00	0.00	0.00	0.00	1.00	0.00	0.00	0.00	1.00	0.00	1.00	1.00	0.00
4	31801.50	298512.00	225000.00	0.00	0.00	0.14	2.00	0.10	-3033.00	-624.00	-86.00	0.00	0.40	0.40	0.17	0.74	0.66	0.72	1.00	1.00	0.00	0.00	0.00	1.00	1.00	0.00	0.00	0.46	0.00	0.00	0.00	3.00	11.00	0.02	2.00	1.00	0.99	12500.01	18041.58	3942.00	18041.58	18041.58	0.11	0.28	0.14	97790.49	157500.00	45000.00	162405.00	41400.00	162405.00	162405.00	0.00	9328.50	0.00	27985.50	120690.00	70766.58	435690.00	1.33	18045.00	18045.00	90000.00	69750.00	3498.70	6300.00	303.43	180450.00	5850.00	63000.00	0.00	0.00	3.00	0.75	1.00	2.00	7.00	10.00	5.00	5.00	2.50	5.00	10.00	10.00	0.00	703.00	-2526.00	1719.64	-965.00	-1106.00	1896.00	1056.31	-50.00	-696.00	2445.00	-624.00	-624.00	0.00	-723.00	...	0.00	0.00	1.00	1.00	1.00	1.00	0.00	1.00	0.00	1.00	0.00	1.00	1.00	1.00	0.00	0.00	1.00	0.00	1.00	0.00	0.00	0.00	0.00	0.00	1.00	0.00	1.00	0.00	1.00	0.00

5 rows × 580 columns

Code

df_transformed_col_info = an.col_info(credits_train_transformed)

Column info (pre-processed data)

df_transformed_col_info.pipe(an.style_col_info)

Table 6.3. Info on all columns after preprocessing.

	column	data_type	memory_size	n_unique	p_unique	p_missing	n_dominant	p_dominant	p_dom_excl_na	dominant
1	AMT_ANNUITY	float64	1.7 MB	12,801	5.9%	0%	4,499	2.1%	2.1%	9000.0
2	AMT_CREDIT	float64	1.7 MB	5,097	2.4%	0%	6,823	3.2%	3.2%	450000.0
3	AMT_INCOME_TOTAL	float64	1.7 MB	1,949	0.9%	0%	24,982	11.6%	11.6%	135000.0
4	AMT_REQ_CREDIT_BUREAU_DAY	float64	1.7 MB	9	<0.1%	0%	214,228	99.5%	99.5%	0.0
5	AMT_REQ_CREDIT_BUREAU_HOUR	float64	1.7 MB	5	<0.1%	0%	214,142	99.5%	99.5%	0.0
6	AMT_REQ_CREDIT_BUREAU_MON	float64	1.7 MB	22	<0.1%	0%	184,760	85.8%	85.8%	0.0
7	AMT_REQ_CREDIT_BUREAU_QRT	float64	1.7 MB	10	<0.1%	0%	179,976	83.6%	83.6%	0.0
8	AMT_REQ_CREDIT_BUREAU_WEEK	float64	1.7 MB	9	<0.1%	0%	209,327	97.2%	97.2%	0.0
9	AMT_REQ_CREDIT_BUREAU_YEAR	float64	1.7 MB	24	<0.1%	0%	73,441	34.1%	34.1%	1.0
10	BASEMENTAREA_MODE	float64	1.7 MB	3,687	1.7%	0%	125,860	58.5%	58.5%	0.07460000365972519
11	CNT_FAM_MEMBERS	float64	1.7 MB	12	<0.1%	0%	110,672	51.4%	51.4%	2.0
12	COMMONAREA_MEDI	float64	1.7 MB	2,982	1.4%	0%	150,382	69.9%	69.9%	0.020899999886751175
13	DAYS_ID_PUBLISH	float64	1.7 MB	6,122	2.8%	0%	119	0.1%	0.1%	-4074.0
14	DAYS_LAST_PHONE_CHANGE	float64	1.7 MB	3,720	1.7%	0%	26,201	12.2%	12.2%	0.0
15	DAYS_REGISTRATION	float64	1.7 MB	15,249	7.1%	0%	79	<0.1%	<0.1%	-7.0
16	DEF_30_CNT_SOCIAL_CIRCLE	float64	1.7 MB	10	<0.1%	0%	190,702	88.6%	88.6%	0.0
17	ELEVATORS_AVG	float64	1.7 MB	241	0.1%	0%	174,679	81.1%	81.1%	0.0
18	ELEVATORS_MEDI	float64	1.7 MB	46	<0.1%	0%	175,610	81.6%	81.6%	0.0
19	ENTRANCES_MODE	float64	1.7 MB	30	<0.1%	0%	133,580	62.1%	62.1%	0.1378999948501587
20	EXT_SOURCE_1	float64	1.7 MB	83,962	39.0%	0%	121,373	56.4%	56.4%	0.5052886605262756
21	EXT_SOURCE_2	float64	1.7 MB	102,229	47.5%	0%	503	0.2%	0.2%	0.2858978807926178
22	EXT_SOURCE_3	float64	1.7 MB	804	0.4%	0%	43,202	20.1%	20.1%	0.5352762341499329
23	FLAG_CONT_MOBILE	float64	1.7 MB	2	<0.1%	0%	214,855	99.8%	99.8%	1.0
24	FLAG_DOCUMENT_11	float64	1.7 MB	2	<0.1%	0%	214,448	99.6%	99.6%	0.0
25	FLAG_DOCUMENT_13	float64	1.7 MB	2	<0.1%	0%	214,541	99.7%	99.7%	0.0
26	FLAG_DOCUMENT_14	float64	1.7 MB	2	<0.1%	0%	214,614	99.7%	99.7%	0.0
27	FLAG_DOCUMENT_15	float64	1.7 MB	2	<0.1%	0%	215,015	99.9%	99.9%	0.0
28	FLAG_DOCUMENT_16	float64	1.7 MB	2	<0.1%	0%	213,089	99.0%	99.0%	0.0
29	FLAG_DOCUMENT_18	float64	1.7 MB	2	<0.1%	0%	213,525	99.2%	99.2%	0.0
30	FLAG_DOCUMENT_3	float64	1.7 MB	2	<0.1%	0%	152,845	71.0%	71.0%	1.0
31	FLAG_DOCUMENT_5	float64	1.7 MB	2	<0.1%	0%	212,025	98.5%	98.5%	0.0
32	FLAG_DOCUMENT_6	float64	1.7 MB	2	<0.1%	0%	196,348	91.2%	91.2%	0.0
33	FLAG_DOCUMENT_8	float64	1.7 MB	2	<0.1%	0%	197,689	91.8%	91.8%	0.0
34	FLAG_DOCUMENT_9	float64	1.7 MB	2	<0.1%	0%	214,440	99.6%	99.6%	0.0
35	FLAG_EMAIL	float64	1.7 MB	2	<0.1%	0%	203,006	94.3%	94.3%	0.0
36	FLAG_EMP_PHONE	float64	1.7 MB	2	<0.1%	0%	176,491	82.0%	82.0%	1.0
37	FLAG_OWN_CAR	float64	1.7 MB	2	<0.1%	0%	142,086	66.0%	66.0%	0.0
38	FLAG_OWN_REALTY	float64	1.7 MB	2	<0.1%	0%	149,412	69.4%	69.4%	1.0
39	FLAG_PHONE	float64	1.7 MB	2	<0.1%	0%	154,906	72.0%	72.0%	0.0
40	FLAG_WORK_PHONE	float64	1.7 MB	2	<0.1%	0%	172,406	80.1%	80.1%	0.0
41	FLOORSMAX_MEDI	float64	1.7 MB	49	<0.1%	0%	151,629	70.4%	70.4%	0.16670000553131104
42	FLOORSMIN_MEDI	float64	1.7 MB	47	<0.1%	0%	169,787	78.9%	78.9%	0.20829999446868896
43	LANDAREA_MEDI	float64	1.7 MB	3,393	1.6%	0%	127,718	59.3%	59.3%	0.048700001090765
44	NONLIVINGAPARTMENTS_AVG	float64	1.7 MB	345	0.2%	0%	187,673	87.2%	87.2%	0.0
45	NONLIVINGAREA_MODE	float64	1.7 MB	3,090	1.4%	0%	118,905	55.2%	55.2%	0.0010999999940395355
46	OBS_30_CNT_SOCIAL_CIRCLE	float64	1.7 MB	32	<0.1%	0%	115,264	53.5%	53.5%	0.0
47	OWN_CAR_AGE	float64	1.7 MB	61	<0.1%	0%	145,584	67.6%	67.6%	9.0
48	REGION_POPULATION_RELATIVE	float64	1.7 MB	81	<0.1%	0%	11,494	5.3%	5.3%	0.03579200059175491
49	REGION_RATING_CLIENT	float64	1.7 MB	3	<0.1%	0%	158,846	73.8%	73.8%	2.0
50	REG_CITY_NOT_LIVE_CITY	float64	1.7 MB	2	<0.1%	0%	198,549	92.2%	92.2%	0.0
51	REG_CITY_NOT_WORK_CITY	float64	1.7 MB	2	<0.1%	0%	165,697	77.0%	77.0%	0.0
52	REG_REGION_NOT_LIVE_REGION	float64	1.7 MB	2	<0.1%	0%	211,999	98.5%	98.5%	0.0
53	REG_REGION_NOT_WORK_REGION	float64	1.7 MB	2	<0.1%	0%	204,222	94.9%	94.9%	0.0
54	YEARS_BEGINEXPLUATATION_MODE	float64	1.7 MB	210	0.1%	0%	107,681	50.0%	50.0%	0.9815999865531921
55	YEARS_BUILD_AVG	float64	1.7 MB	146	0.1%	0%	144,837	67.3%	67.3%	0.7552000284194946
56	amt_annuity_max	float64	1.7 MB	18,638	8.7%	0%	159,516	74.1%	74.1%	12500.01
57	amt_annuity_max_previous_application	float64	1.7 MB	110,598	51.4%	0%	11,756	5.5%	5.5%	17954.865
58	amt_annuity_median	float64	1.7 MB	16,441	7.6%	0%	159,485	74.1%	74.1%	3942.0
59	amt_annuity_median_previous_application	float64	1.7 MB	157,063	73.0%	0%	11,753	5.5%	5.5%	10773.157500000001
60	amt_annuity_min	float64	1.7 MB	9,921	4.6%	0%	196,455	91.3%	91.3%	0.0
61	amt_annuity_min_previous_application	float64	1.7 MB	113,816	52.9%	0%	16,017	7.4%	7.4%	2250.0
62	amt_annuity_to_credit_ratio	float64	1.7 MB	33,148	15.4%	0%	20,564	9.6%	9.6%	0.05000000074505806
63	amt_annuity_to_income_per_family_member	float64	1.7 MB	88,172	41.0%	0%	1,500	0.7%	0.7%	0.3
64	amt_annuity_to_income_ratio	float64	1.7 MB	71,916	33.4%	0%	2,049	1.0%	1.0%	0.1
65	amt_balance_credit_card_max	float64	1.7 MB	40,175	18.7%	0%	154,159	71.6%	71.6%	97790.49
66	amt_balance_credit_card_median	float64	1.7 MB	27,685	12.9%	0%	187,185	87.0%	87.0%	0.0
67	amt_balance_credit_card_min	float64	1.7 MB	8,310	3.9%	0%	206,302	95.8%	95.8%	0.0
68	amt_credit_limit_actual_median	float64	1.7 MB	151	0.1%	0%	155,593	72.3%	72.3%	157500.0
69	amt_credit_limit_actual_range	float64	1.7 MB	147	0.1%	0%	157,689	73.3%	73.3%	45000.0
70	amt_credit_max	float64	1.7 MB	49,618	23.1%	0%	14,581	6.8%	6.8%	225000.0
71	amt_credit_max_overdue_max	float64	1.7 MB	32,871	15.3%	0%	166,187	77.2%	77.2%	0.0
72	amt_credit_max_overdue_range	float64	1.7 MB	27,267	12.7%	0%	175,595	81.6%	81.6%	0.0
73	amt_credit_median	float64	1.7 MB	73,966	34.4%	0%	11,457	5.3%	5.3%	83054.25
74	amt_credit_min	float64	1.7 MB	33,220	15.4%	0%	79,660	37.0%	37.0%	0.0
75	amt_credit_range	float64	1.7 MB	71,950	33.4%	0%	37,038	17.2%	17.2%	0.0
76	amt_credit_sum_debt_mean	float64	1.7 MB	121,544	56.5%	0%	48,543	22.6%	22.6%	0.0
77	amt_credit_sum_debt_median	float64	1.7 MB	48,592	22.6%	0%	156,857	72.9%	72.9%	0.0
78	amt_credit_sum_debt_sum	float64	1.7 MB	113,811	52.9%	0%	53,746	25.0%	25.0%	0.0
79	amt_credit_sum_limit_min	float64	1.7 MB	2,121	1.0%	0%	212,794	98.9%	98.9%	0.0
80	amt_credit_sum_limit_std	float64	1.7 MB	26,937	12.5%	0%	183,161	85.1%	85.1%	0.0
81	amt_credit_sum_limit_sum	float64	1.7 MB	26,367	12.2%	0%	181,184	84.2%	84.2%	0.0
82	amt_credit_sum_median	float64	1.7 MB	77,800	36.1%	0%	30,841	14.3%	14.3%	133852.5
83	amt_credit_sum_overdue_std	float64	1.7 MB	1,618	0.8%	0%	213,025	99.0%	99.0%	0.0
84	amt_credit_sum_overdue_sum	float64	1.7 MB	930	0.4%	0%	212,926	98.9%	98.9%	0.0
85	amt_credit_sum_std	float64	1.7 MB	148,440	69.0%	0%	55,965	26.0%	26.0%	183202.88926385253
86	amt_credit_sum_sum	float64	1.7 MB	147,742	68.6%	0%	30,837	14.3%	14.3%	964161.0
87	amt_credit_to_income_ratio	float64	1.7 MB	39,372	18.3%	0%	3,691	1.7%	1.7%	2.0
88	amt_down_payment_max	float64	1.7 MB	17,608	8.2%	0%	53,725	25.0%	25.0%	0.0
89	amt_down_payment_mean	float64	1.7 MB	42,577	19.8%	0%	53,725	25.0%	25.0%	0.0
90	amt_drawings_atm_current_max	float64	1.7 MB	1,131	0.5%	0%	175,102	81.3%	81.3%	90000.0
91	amt_drawings_atm_current_median	float64	1.7 MB	378	0.2%	0%	208,835	97.0%	97.0%	0.0
92	amt_drawings_atm_current_min	float64	1.7 MB	114	0.1%	0%	214,655	99.7%	99.7%	0.0
93	amt_drawings_current_max	float64	1.7 MB	17,325	8.0%	0%	154,198	71.6%	71.6%	69750.0
94	amt_drawings_current_mean	float64	1.7 MB	35,095	16.3%	0%	154,159	71.6%	71.6%	3498.702077922078
95	amt_drawings_current_min	float64	1.7 MB	1,475	0.7%	0%	213,422	99.1%	99.1%	0.0
96	amt_drawings_other_current_max	float64	1.7 MB	1,084	0.5%	0%	211,253	98.1%	98.1%	0.0
97	amt_drawings_pos_current_max	float64	1.7 MB	20,726	9.6%	0%	172,260	80.0%	80.0%	6300.0
98	amt_drawings_pos_current_mean	float64	1.7 MB	23,516	10.9%	0%	172,255	80.0%	80.0%	303.42857142857144
99	amt_drawings_pos_current_min	float64	1.7 MB	1,772	0.8%	0%	213,337	99.1%	99.1%	0.0
100	amt_goods_price_min	float64	1.7 MB	39,171	18.2%	0%	12,169	5.7%	5.7%	45735.75
101	amt_inst_min_regularity_min	float64	1.7 MB	1,664	0.8%	0%	211,946	98.5%	98.5%	0.0
102	amt_payment_current_median	float64	1.7 MB	17,066	7.9%	0%	172,523	80.1%	80.1%	5850.0
103	amt_payment_current_min	float64	1.7 MB	7,398	3.4%	0%	199,261	92.6%	92.6%	0.0
104	amt_payment_current_range	float64	1.7 MB	22,545	10.5%	0%	172,454	80.1%	80.1%	63000.0
105	amt_payment_total_current_min	float64	1.7 MB	1,131	0.5%	0%	213,443	99.2%	99.2%	0.0
106	any_installments_late_30	float64	1.7 MB	2	<0.1%	0%	201,997	93.8%	93.8%	0.0
107	any_installments_late_60	float64	1.7 MB	2	<0.1%	0%	209,180	97.2%	97.2%	0.0
108	any_installments_late_7	float64	1.7 MB	2	<0.1%	0%	158,592	73.7%	73.7%	0.0
109	bureau_dpd_status_max	float64	1.7 MB	6	<0.1%	0%	193,628	90.0%	90.0%	0.0
110	bureau_dpd_status_median	float64	1.7 MB	11	<0.1%	0%	214,312	99.6%	99.6%	0.0
111	bureau_months_balance_max	float64	1.7 MB	89	<0.1%	0%	212,281	98.6%	98.6%	0.0
112	cnt_credit_prolong_mean	float64	1.7 MB	100	<0.1%	0%	209,248	97.2%	97.2%	0.0
113	cnt_credit_prolong_sum	float64	1.7 MB	10	<0.1%	0%	209,248	97.2%	97.2%	0.0
114	cnt_drawings_atm_current_max	float64	1.7 MB	43	<0.1%	0%	178,554	82.9%	82.9%	3.0
115	cnt_drawings_atm_current_std	float64	1.7 MB	16,771	7.8%	0%	172,561	80.2%	80.2%	0.7457481920719147
116	cnt_drawings_current_min	float64	1.7 MB	39	<0.1%	0%	213,436	99.2%	99.2%	0.0
117	cnt_drawings_other_current_max	float64	1.7 MB	11	<0.1%	0%	211,241	98.1%	98.1%	0.0
118	cnt_drawings_pos_current_max	float64	1.7 MB	116	0.1%	0%	176,434	82.0%	82.0%	1.0
119	cnt_drawings_pos_current_median	float64	1.7 MB	113	0.1%	0%	205,975	95.7%	95.7%	0.0
120	cnt_drawings_pos_current_min	float64	1.7 MB	40	<0.1%	0%	213,337	99.1%	99.1%	0.0
121	cnt_fam_members_excluding_children	float64	1.7 MB	2	<0.1%	0%	158,302	73.5%	73.5%	2.0
122	cnt_installment_future_min	float64	1.7 MB	61	<0.1%	0%	196,054	91.1%	91.1%	0.0
123	cnt_installment_mature_cum_max	float64	1.7 MB	120	0.1%	0%	156,329	72.6%	72.6%	7.0
124	cnt_installment_mature_cum_min	float64	1.7 MB	28	<0.1%	0%	193,011	89.7%	89.7%	0.0
125	cnt_installment_median	float64	1.7 MB	103	<0.1%	0%	73,750	34.3%	34.3%	12.0
126	cnt_installment_min	float64	1.7 MB	53	<0.1%	0%	54,950	25.5%	25.5%	6.0
127	cnt_installment_range	float64	1.7 MB	69	<0.1%	0%	49,692	23.1%	23.1%	0.0
128	cnt_installments_diff_mean	float64	1.7 MB	20,290	9.4%	0%	19,490	9.1%	9.1%	5.0
129	cnt_installments_diff_min	float64	1.7 MB	58	<0.1%	0%	210,671	97.9%	97.9%	0.0
130	cnt_installments_diff_range	float64	1.7 MB	82	<0.1%	0%	48,330	22.5%	22.5%	12.0
131	cnt_payment_median	float64	1.7 MB	87	<0.1%	0%	65,750	30.5%	30.5%	12.0
132	cnt_payment_min	float64	1.7 MB	31	<0.1%	0%	68,588	31.9%	31.9%	0.0
133	cnt_payment_range	float64	1.7 MB	69	<0.1%	0%	54,639	25.4%	25.4%	0.0
134	days_credit_enddate_max	float64	1.7 MB	12,274	5.7%	0%	32,491	15.1%	15.1%	911.0
135	days_credit_enddate_min	float64	1.7 MB	6,266	2.9%	0%	32,492	15.1%	15.1%	-1267.0
136	days_credit_enddate_std	float64	1.7 MB	134,002	62.3%	0%	59,197	27.5%	27.5%	1014.057521898929
137	days_credit_max	float64	1.7 MB	2,922	1.4%	0%	31,067	14.4%	14.4%	-300.0
138	days_credit_median	float64	1.7 MB	5,711	2.7%	0%	30,932	14.4%	14.4%	-957.0
139	days_credit_overdue_max	float64	1.7 MB	671	0.3%	0%	212,892	98.9%	98.9%	0.0
140	days_credit_overdue_mean	float64	1.7 MB	1,195	0.6%	0%	212,892	98.9%	98.9%	0.0
141	days_credit_overdue_median	float64	1.7 MB	168	0.1%	0%	214,955	99.9%	99.9%	0.0
142	days_credit_range	float64	1.7 MB	2,913	1.4%	0%	30,890	14.4%	14.4%	1262.0
143	days_credit_std	float64	1.7 MB	133,053	61.8%	0%	55,965	26.0%	26.0%	621.2873840332031
144	days_credit_update_max	float64	1.7 MB	2,585	1.2%	0%	34,359	16.0%	16.0%	-19.0
145	days_credit_update_median	float64	1.7 MB	4,779	2.2%	0%	30,948	14.4%	14.4%	-360.0
146	days_credit_update_range	float64	1.7 MB	2,925	1.4%	0%	30,911	14.4%	14.4%	904.0
147	days_decision_max	float64	1.7 MB	2,921	1.4%	0%	11,697	5.4%	5.4%	-299.0
148	days_decision_median	float64	1.7 MB	5,656	2.6%	0%	11,546	5.4%	5.4%	-647.0
149	days_decision_range	float64	1.7 MB	2,919	1.4%	0%	40,565	18.8%	18.8%	0.0
150	days_enddate_fact_max	float64	1.7 MB	2,793	1.3%	0%	54,020	25.1%	25.1%	-345.0
151	days_enddate_fact_median	float64	1.7 MB	5,341	2.5%	0%	53,910	25.0%	25.0%	-872.5
152	days_enddate_fact_range	float64	1.7 MB	2,796	1.3%	0%	53,924	25.1%	25.1%	821.0
153	days_first_draw_min	float64	1.7 MB	2,718	1.3%	0%	177,781	82.6%	82.6%	365243.0
154	days_last_due_1st_version_max	float64	1.7 MB	4,521	2.1%	0%	55,263	25.7%	25.7%	365243.0
155	days_last_due_1st_version_mean	float64	1.7 MB	51,499	23.9%	0%	12,398	5.8%	5.8%	-207.5
156	days_last_due_1st_version_median	float64	1.7 MB	10,719	5.0%	0%	12,497	5.8%	5.8%	-325.0
157	days_last_due_1st_version_min	float64	1.7 MB	4,081	1.9%	0%	12,430	5.8%	5.8%	-1089.0
158	days_last_due_max	float64	1.7 MB	2,761	1.3%	0%	98,527	45.8%	45.8%	365243.0
159	days_last_due_range	float64	1.7 MB	5,592	2.6%	0%	58,659	27.3%	27.3%	0.0
160	days_termination_median	float64	1.7 MB	7,716	3.6%	0%	23,269	10.8%	10.8%	365243.0
161	days_termination_min	float64	1.7 MB	2,797	1.3%	0%	15,833	7.4%	7.4%	365243.0
162	diff_amt_installment_payment_max	float64	1.7 MB	75,445	35.0%	0%	127,555	59.3%	59.3%	0.0
163	diff_amt_installment_payment_mean	float64	1.7 MB	97,257	45.2%	0%	114,097	53.0%	53.0%	0.0
164	diff_amt_installment_payment_median	float64	1.7 MB	6,855	3.2%	0%	206,997	96.2%	96.2%	0.0
165	diff_amt_installment_payment_range	float64	1.7 MB	90,195	41.9%	0%	114,099	53.0%	53.0%	0.0
166	diff_days_installment_payment_max	float64	1.7 MB	409	0.2%	0%	18,396	8.5%	8.5%	31.0
167	diff_days_installment_payment_mean	float64	1.7 MB	50,247	23.3%	0%	11,037	5.1%	5.1%	9.524199962615967
168	diff_days_installment_payment_median	float64	1.7 MB	320	0.1%	0%	21,620	10.0%	10.0%	0.0
169	diff_days_installment_payment_range	float64	1.7 MB	1,465	0.7%	0%	14,802	6.9%	6.9%	37.0
170	diff_days_installment_payment_sum	float64	1.7 MB	4,383	2.0%	0%	11,369	5.3%	5.3%	240.0
171	diff_days_installment_payment_sum_late_only	float64	1.7 MB	1,815	0.8%	0%	95,670	44.4%	44.4%	0.0
172	diff_percent_installment_payment_mean	float64	1.7 MB	87,934	40.9%	0%	114,228	53.1%	53.1%	1.0
173	diff_percent_installment_payment_median	float64	1.7 MB	7,969	3.7%	0%	206,997	96.2%	96.2%	1.0
174	diff_percent_installment_payment_min	float64	1.7 MB	25,589	11.9%	0%	189,010	87.8%	87.8%	1.0
175	diff_percent_installment_payment_range	float64	1.7 MB	97,055	45.1%	0%	114,227	53.1%	53.1%	0.0
176	flag_emergency_state	float64	1.7 MB	2	<0.1%	0%	213,628	99.2%	99.2%	0.0
177	n_car_loans	float64	1.7 MB	9	<0.1%	0%	201,519	93.6%	93.6%	0.0
178	n_cash_loans	float64	1.7 MB	55	<0.1%	0%	83,697	38.9%	38.9%	0.0
179	n_channel_type_ap_minus	float64	1.7 MB	33	<0.1%	0%	199,207	92.5%	92.5%	0.0
180	n_channel_type_car_dealer	float64	1.7 MB	6	<0.1%	0%	215,036	99.9%	99.9%	0.0
181	n_channel_type_channel_corporate_sales	float64	1.7 MB	20	<0.1%	0%	213,745	99.3%	99.3%	0.0
182	n_channel_type_contact_center	float64	1.7 MB	19	<0.1%	0%	187,077	86.9%	86.9%	0.0
183	n_channel_type_countrywide	float64	1.7 MB	34	<0.1%	0%	78,922	36.7%	36.7%	1.0
184	n_channel_type_credit_and_cash	float64	1.7 MB	52	<0.1%	0%	96,482	44.8%	44.8%	0.0
185	n_channel_type_regional_and_local	float64	1.7 MB	19	<0.1%	0%	169,784	78.9%	78.9%	0.0
186	n_channel_type_stone	float64	1.7 MB	22	<0.1%	0%	133,139	61.9%	61.9%	0.0
187	n_client_type_new	float64	1.7 MB	14	<0.1%	0%	165,520	76.9%	76.9%	1.0
188	n_client_type_refreshed	float64	1.7 MB	23	<0.1%	0%	161,564	75.1%	75.1%	0.0
189	n_client_type_repeater	float64	1.7 MB	61	<0.1%	0%	49,122	22.8%	22.8%	0.0
190	n_consumer_loans	float64	1.7 MB	36	<0.1%	0%	78,331	36.4%	36.4%	1.0
191	n_contract_status_refused	float64	1.7 MB	44	<0.1%	0%	144,850	67.3%	67.3%	0.0
192	n_contract_status_unused_offer	float64	1.7 MB	11	<0.1%	0%	202,009	93.8%	93.8%	0.0
193	n_contracts_credit_card_completed	float64	1.7 MB	40	<0.1%	0%	207,783	96.5%	96.5%	0.0
194	n_credit_card_credits	float64	1.7 MB	22	<0.1%	0%	91,194	42.4%	42.4%	1.0
195	n_credits_active	float64	1.7 MB	22	<0.1%	0%	71,863	33.4%	33.4%	2.0
196	n_credits_sold	float64	1.7 MB	7	<0.1%	0%	211,547	98.3%	98.3%	0.0
197	n_credits_total	float64	1.7 MB	57	<0.1%	0%	51,153	23.8%	23.8%	4.0
198	n_currency_2	float64	1.7 MB	7	<0.1%	0%	214,671	99.7%	99.7%	0.0
199	n_different_channels	float64	1.7 MB	7	<0.1%	0%	90,541	42.1%	42.1%	2.0
200	n_different_contract_types	float64	1.7 MB	4	<0.1%	0%	89,430	41.5%	41.5%	2.0
201	n_different_credit_types	float64	1.7 MB	5	<0.1%	0%	131,569	61.1%	61.1%	2.0
202	n_different_currencies	float64	1.7 MB	3	<0.1%	0%	214,601	99.7%	99.7%	1.0
203	n_installments_late	float64	1.7 MB	99	<0.1%	0%	95,670	44.4%	44.4%	0.0
204	n_installments_late_30	float64	1.7 MB	42	<0.1%	0%	201,997	93.8%	93.8%	0.0
205	n_installments_late_7	float64	1.7 MB	59	<0.1%	0%	158,592	73.7%	73.7%	0.0
206	n_installments_total	float64	1.7 MB	310	0.1%	0%	14,007	6.5%	6.5%	25.0
207	n_microloans	float64	1.7 MB	28	<0.1%	0%	212,811	98.9%	98.9%	0.0
208	n_mortgages	float64	1.7 MB	7	<0.1%	0%	205,270	95.4%	95.4%	0.0
209	n_nflag_insured_on_approval_mean	float64	1.7 MB	102	<0.1%	0%	95,675	44.4%	44.4%	0.0
210	n_nflag_insured_on_approval_sum	float64	1.7 MB	19	<0.1%	0%	96,596	44.9%	44.9%	0.0
211	n_other_type_credit	float64	1.7 MB	9	<0.1%	0%	213,209	99.0%	99.0%	0.0
212	n_payment_type_cash_through_bank	float64	1.7 MB	44	<0.1%	0%	54,943	25.5%	25.5%	1.0
213	n_payment_type_not_available	float64	1.7 MB	46	<0.1%	0%	71,796	33.4%	33.4%	0.0
214	n_previous_credit_card_applications	float64	1.7 MB	126	0.1%	0%	155,013	72.0%	72.0%	21.0
215	n_previous_credit_card_applications_signed	float64	1.7 MB	37	<0.1%	0%	212,249	98.6%	98.6%	0.0
216	n_previous_pos_applications	float64	1.7 MB	221	0.1%	0%	16,495	7.7%	7.7%	22.0
217	n_previous_pos_applications_completed	float64	1.7 MB	45	<0.1%	0%	73,226	34.0%	34.0%	1.0
218	n_previous_pos_applications_signed	float64	1.7 MB	31	<0.1%	0%	174,587	81.1%	81.1%	0.0
219	n_product_type_walk_in	float64	1.7 MB	28	<0.1%	0%	164,239	76.3%	76.3%	0.0
220	n_reject_reason_limit	float64	1.7 MB	22	<0.1%	0%	195,275	90.7%	90.7%	0.0
221	n_reject_reason_scoc	float64	1.7 MB	20	<0.1%	0%	200,014	92.9%	92.9%	0.0
222	n_reject_reason_scofr	float64	1.7 MB	16	<0.1%	0%	210,511	97.8%	97.8%	0.0
223	n_revolving_loans	float64	1.7 MB	25	<0.1%	0%	142,248	66.1%	66.1%	0.0
224	n_yield_group_high	float64	1.7 MB	30	<0.1%	0%	89,153	41.4%	41.4%	0.0
225	n_yield_group_low_action	float64	1.7 MB	22	<0.1%	0%	174,871	81.2%	81.2%	0.0
226	n_yield_group_low_normal	float64	1.7 MB	23	<0.1%	0%	94,724	44.0%	44.0%	0.0
227	n_yield_group_middle	float64	1.7 MB	25	<0.1%	0%	80,132	37.2%	37.2%	1.0
228	percent_installments_early	float64	1.7 MB	7,892	3.7%	0%	64,688	30.1%	30.1%	1.0
229	percent_installments_late	float64	1.7 MB	4,464	2.1%	0%	95,670	44.4%	44.4%	0.0
230	percent_installments_late_30	float64	1.7 MB	894	0.4%	0%	201,997	93.8%	93.8%	0.0
231	percent_installments_late_60	float64	1.7 MB	629	0.3%	0%	209,180	97.2%	97.2%	0.0
232	percent_installments_late_7	float64	1.7 MB	2,595	1.2%	0%	158,592	73.7%	73.7%	0.0
233	rate_down_payment_max	float64	1.7 MB	84,884	39.4%	0%	53,725	25.0%	25.0%	0.0
234	rate_down_payment_range	float64	1.7 MB	73,616	34.2%	0%	94,887	44.1%	44.1%	0.0
235	rate_interest_privileged_count	float64	1.7 MB	4	<0.1%	0%	212,016	98.5%	98.5%	0.0
236	sk_dpd_credit_card_max	float64	1.7 MB	353	0.2%	0%	202,632	94.1%	94.1%	0.0
237	sk_dpd_credit_card_median	float64	1.7 MB	222	0.1%	0%	214,704	99.7%	99.7%	0.0
238	sk_dpd_def_credit_card_max	float64	1.7 MB	47	<0.1%	0%	204,810	95.1%	95.1%	0.0
239	sk_dpd_def_pos_applications_max	float64	1.7 MB	173	0.1%	0%	187,187	87.0%	87.0%	0.0
240	sk_dpd_pos_applications_max	float64	1.7 MB	1,595	0.7%	0%	176,902	82.2%	82.2%	0.0
241	years_employed	float64	1.7 MB	11,769	5.5%	0%	38,801	18.0%	18.0%	4.517808219178082
242	missingindicator_AMT_ANNUITY	float64	1.7 MB	2	<0.1%	0%	215,249	>99.9%	>99.9%	0.0
243	missingindicator_AMT_REQ_CREDIT_BUREAU_DAY	float64	1.7 MB	2	<0.1%	0%	186,176	86.5%	86.5%	0.0
244	missingindicator_AMT_REQ_CREDIT_BUREAU_HOUR	float64	1.7 MB	2	<0.1%	0%	186,176	86.5%	86.5%	0.0
245	missingindicator_AMT_REQ_CREDIT_BUREAU_MON	float64	1.7 MB	2	<0.1%	0%	186,176	86.5%	86.5%	0.0
246	missingindicator_AMT_REQ_CREDIT_BUREAU_QRT	float64	1.7 MB	2	<0.1%	0%	186,176	86.5%	86.5%	0.0
247	missingindicator_AMT_REQ_CREDIT_BUREAU_WEEK	float64	1.7 MB	2	<0.1%	0%	186,176	86.5%	86.5%	0.0
248	missingindicator_AMT_REQ_CREDIT_BUREAU_YEAR	float64	1.7 MB	2	<0.1%	0%	186,176	86.5%	86.5%	0.0
249	missingindicator_BASEMENTAREA_MODE	float64	1.7 MB	2	<0.1%	0%	125,793	58.4%	58.4%	1.0
250	missingindicator_CNT_FAM_MEMBERS	float64	1.7 MB	2	<0.1%	0%	215,256	>99.9%	>99.9%	0.0
251	missingindicator_COMMONAREA_MEDI	float64	1.7 MB	2	<0.1%	0%	150,300	69.8%	69.8%	1.0
252	missingindicator_DAYS_LAST_PHONE_CHANGE	float64	1.7 MB	2	<0.1%	0%	215,256	>99.9%	>99.9%	0.0
253	missingindicator_DEF_30_CNT_SOCIAL_CIRCLE	float64	1.7 MB	2	<0.1%	0%	214,543	99.7%	99.7%	0.0
254	missingindicator_ELEVATORS_AVG	float64	1.7 MB	2	<0.1%	0%	114,570	53.2%	53.2%	1.0
255	missingindicator_ELEVATORS_MEDI	float64	1.7 MB	2	<0.1%	0%	114,570	53.2%	53.2%	1.0
256	missingindicator_ENTRANCES_MODE	float64	1.7 MB	2	<0.1%	0%	108,270	50.3%	50.3%	1.0
257	missingindicator_EXT_SOURCE_1	float64	1.7 MB	2	<0.1%	0%	121,373	56.4%	56.4%	1.0
258	missingindicator_EXT_SOURCE_2	float64	1.7 MB	2	<0.1%	0%	214,793	99.8%	99.8%	0.0
259	missingindicator_EXT_SOURCE_3	float64	1.7 MB	2	<0.1%	0%	172,577	80.2%	80.2%	0.0
260	missingindicator_FLOORSMAX_MEDI	float64	1.7 MB	2	<0.1%	0%	108,287	50.3%	50.3%	0.0
261	missingindicator_FLOORSMIN_MEDI	float64	1.7 MB	2	<0.1%	0%	146,054	67.9%	67.9%	1.0
262	missingindicator_LANDAREA_MEDI	float64	1.7 MB	2	<0.1%	0%	127,644	59.3%	59.3%	1.0
263	missingindicator_NONLIVINGAPARTMENTS_AVG	float64	1.7 MB	2	<0.1%	0%	149,354	69.4%	69.4%	1.0
264	missingindicator_NONLIVINGAREA_MODE	float64	1.7 MB	2	<0.1%	0%	118,577	55.1%	55.1%	1.0
265	missingindicator_OBS_30_CNT_SOCIAL_CIRCLE	float64	1.7 MB	2	<0.1%	0%	214,543	99.7%	99.7%	0.0
266	missingindicator_OWN_CAR_AGE	float64	1.7 MB	2	<0.1%	0%	142,091	66.0%	66.0%	1.0
267	missingindicator_YEARS_BEGINEXPLUATATION_MODE	float64	1.7 MB	2	<0.1%	0%	110,347	51.3%	51.3%	0.0
268	missingindicator_YEARS_BUILD_AVG	float64	1.7 MB	2	<0.1%	0%	143,036	66.4%	66.4%	1.0
269	missingindicator_amt_annuity_max	float64	1.7 MB	2	<0.1%	0%	159,480	74.1%	74.1%	1.0
270	missingindicator_amt_annuity_max_previous_application	float64	1.7 MB	2	<0.1%	0%	203,505	94.5%	94.5%	0.0
271	missingindicator_amt_annuity_median	float64	1.7 MB	2	<0.1%	0%	159,480	74.1%	74.1%	1.0
272	missingindicator_amt_annuity_median_previous_application	float64	1.7 MB	2	<0.1%	0%	203,505	94.5%	94.5%	0.0
273	missingindicator_amt_annuity_min	float64	1.7 MB	2	<0.1%	0%	159,480	74.1%	74.1%	1.0
274	missingindicator_amt_annuity_min_previous_application	float64	1.7 MB	2	<0.1%	0%	203,505	94.5%	94.5%	0.0
275	missingindicator_amt_annuity_to_credit_ratio	float64	1.7 MB	2	<0.1%	0%	215,249	>99.9%	>99.9%	0.0
276	missingindicator_amt_annuity_to_income_per_family_member	float64	1.7 MB	2	<0.1%	0%	215,248	>99.9%	>99.9%	0.0
277	missingindicator_amt_annuity_to_income_ratio	float64	1.7 MB	2	<0.1%	0%	215,249	>99.9%	>99.9%	0.0
278	missingindicator_amt_balance_credit_card_max	float64	1.7 MB	2	<0.1%	0%	154,158	71.6%	71.6%	1.0
279	missingindicator_amt_balance_credit_card_median	float64	1.7 MB	2	<0.1%	0%	154,158	71.6%	71.6%	1.0
280	missingindicator_amt_balance_credit_card_min	float64	1.7 MB	2	<0.1%	0%	154,158	71.6%	71.6%	1.0
281	missingindicator_amt_credit_limit_actual_median	float64	1.7 MB	2	<0.1%	0%	154,158	71.6%	71.6%	1.0
282	missingindicator_amt_credit_limit_actual_range	float64	1.7 MB	2	<0.1%	0%	154,158	71.6%	71.6%	1.0
283	missingindicator_amt_credit_max	float64	1.7 MB	2	<0.1%	0%	203,801	94.7%	94.7%	0.0
284	missingindicator_amt_credit_max_overdue_max	float64	1.7 MB	2	<0.1%	0%	128,619	59.8%	59.8%	0.0
285	missingindicator_amt_credit_max_overdue_range	float64	1.7 MB	2	<0.1%	0%	128,619	59.8%	59.8%	0.0
286	missingindicator_amt_credit_median	float64	1.7 MB	2	<0.1%	0%	203,801	94.7%	94.7%	0.0
287	missingindicator_amt_credit_min	float64	1.7 MB	2	<0.1%	0%	203,801	94.7%	94.7%	0.0
288	missingindicator_amt_credit_range	float64	1.7 MB	2	<0.1%	0%	203,801	94.7%	94.7%	0.0
289	missingindicator_amt_credit_sum_debt_mean	float64	1.7 MB	2	<0.1%	0%	179,218	83.3%	83.3%	0.0
290	missingindicator_amt_credit_sum_debt_median	float64	1.7 MB	2	<0.1%	0%	179,218	83.3%	83.3%	0.0
291	missingindicator_amt_credit_sum_debt_sum	float64	1.7 MB	2	<0.1%	0%	184,421	85.7%	85.7%	0.0
292	missingindicator_amt_credit_sum_limit_min	float64	1.7 MB	2	<0.1%	0%	169,672	78.8%	78.8%	0.0
293	missingindicator_amt_credit_sum_limit_std	float64	1.7 MB	2	<0.1%	0%	134,361	62.4%	62.4%	0.0
294	missingindicator_amt_credit_sum_limit_sum	float64	1.7 MB	2	<0.1%	0%	184,421	85.7%	85.7%	0.0
295	missingindicator_amt_credit_sum_median	float64	1.7 MB	2	<0.1%	0%	184,421	85.7%	85.7%	0.0
296	missingindicator_amt_credit_sum_overdue_std	float64	1.7 MB	2	<0.1%	0%	159,292	74.0%	74.0%	0.0
297	missingindicator_amt_credit_sum_overdue_sum	float64	1.7 MB	2	<0.1%	0%	184,421	85.7%	85.7%	0.0
298	missingindicator_amt_credit_sum_std	float64	1.7 MB	2	<0.1%	0%	159,292	74.0%	74.0%	0.0
299	missingindicator_amt_credit_sum_sum	float64	1.7 MB	2	<0.1%	0%	184,421	85.7%	85.7%	0.0
300	missingindicator_amt_down_payment_max	float64	1.7 MB	2	<0.1%	0%	191,554	89.0%	89.0%	0.0
301	missingindicator_amt_down_payment_mean	float64	1.7 MB	2	<0.1%	0%	191,554	89.0%	89.0%	0.0
302	missingindicator_amt_drawings_atm_current_max	float64	1.7 MB	2	<0.1%	0%	172,254	80.0%	80.0%	1.0
303	missingindicator_amt_drawings_atm_current_median	float64	1.7 MB	2	<0.1%	0%	172,254	80.0%	80.0%	1.0
304	missingindicator_amt_drawings_atm_current_min	float64	1.7 MB	2	<0.1%	0%	172,254	80.0%	80.0%	1.0
305	missingindicator_amt_drawings_current_max	float64	1.7 MB	2	<0.1%	0%	154,158	71.6%	71.6%	1.0
306	missingindicator_amt_drawings_current_mean	float64	1.7 MB	2	<0.1%	0%	154,158	71.6%	71.6%	1.0
307	missingindicator_amt_drawings_current_min	float64	1.7 MB	2	<0.1%	0%	154,158	71.6%	71.6%	1.0
308	missingindicator_amt_drawings_other_current_max	float64	1.7 MB	2	<0.1%	0%	172,254	80.0%	80.0%	1.0
309	missingindicator_amt_drawings_pos_current_max	float64	1.7 MB	2	<0.1%	0%	172,254	80.0%	80.0%	1.0
310	missingindicator_amt_drawings_pos_current_mean	float64	1.7 MB	2	<0.1%	0%	172,254	80.0%	80.0%	1.0
311	missingindicator_amt_drawings_pos_current_min	float64	1.7 MB	2	<0.1%	0%	172,254	80.0%	80.0%	1.0
312	missingindicator_amt_goods_price_min	float64	1.7 MB	2	<0.1%	0%	203,088	94.3%	94.3%	0.0
313	missingindicator_amt_inst_min_regularity_min	float64	1.7 MB	2	<0.1%	0%	154,158	71.6%	71.6%	1.0
314	missingindicator_amt_payment_current_median	float64	1.7 MB	2	<0.1%	0%	172,336	80.1%	80.1%	1.0
315	missingindicator_amt_payment_current_min	float64	1.7 MB	2	<0.1%	0%	172,336	80.1%	80.1%	1.0
316	missingindicator_amt_payment_current_range	float64	1.7 MB	2	<0.1%	0%	172,336	80.1%	80.1%	1.0
317	missingindicator_amt_payment_total_current_min	float64	1.7 MB	2	<0.1%	0%	154,158	71.6%	71.6%	1.0
318	missingindicator_any_installments_late_30	float64	1.7 MB	2	<0.1%	0%	204,223	94.9%	94.9%	0.0
319	missingindicator_any_installments_late_60	float64	1.7 MB	2	<0.1%	0%	204,223	94.9%	94.9%	0.0
320	missingindicator_any_installments_late_7	float64	1.7 MB	2	<0.1%	0%	204,223	94.9%	94.9%	0.0
321	missingindicator_bureau_dpd_status_max	float64	1.7 MB	2	<0.1%	0%	152,586	70.9%	70.9%	1.0
322	missingindicator_bureau_dpd_status_median	float64	1.7 MB	2	<0.1%	0%	152,586	70.9%	70.9%	1.0
323	missingindicator_bureau_months_balance_max	float64	1.7 MB	2	<0.1%	0%	152,586	70.9%	70.9%	1.0
324	missingindicator_cnt_credit_prolong_mean	float64	1.7 MB	2	<0.1%	0%	184,421	85.7%	85.7%	0.0
325	missingindicator_cnt_credit_prolong_sum	float64	1.7 MB	2	<0.1%	0%	184,421	85.7%	85.7%	0.0
326	missingindicator_cnt_drawings_atm_current_max	float64	1.7 MB	2	<0.1%	0%	172,254	80.0%	80.0%	1.0
327	missingindicator_cnt_drawings_atm_current_std	float64	1.7 MB	2	<0.1%	0%	172,561	80.2%	80.2%	1.0
328	missingindicator_cnt_drawings_current_min	float64	1.7 MB	2	<0.1%	0%	154,158	71.6%	71.6%	1.0
329	missingindicator_cnt_drawings_other_current_max	float64	1.7 MB	2	<0.1%	0%	172,254	80.0%	80.0%	1.0
330	missingindicator_cnt_drawings_pos_current_max	float64	1.7 MB	2	<0.1%	0%	172,254	80.0%	80.0%	1.0
331	missingindicator_cnt_drawings_pos_current_median	float64	1.7 MB	2	<0.1%	0%	172,254	80.0%	80.0%	1.0
332	missingindicator_cnt_drawings_pos_current_min	float64	1.7 MB	2	<0.1%	0%	172,254	80.0%	80.0%	1.0
333	missingindicator_cnt_fam_members_excluding_children	float64	1.7 MB	2	<0.1%	0%	215,256	>99.9%	>99.9%	0.0
334	missingindicator_cnt_installment_future_min	float64	1.7 MB	2	<0.1%	0%	202,669	94.2%	94.2%	0.0
335	missingindicator_cnt_installment_mature_cum_max	float64	1.7 MB	2	<0.1%	0%	154,158	71.6%	71.6%	1.0
336	missingindicator_cnt_installment_mature_cum_min	float64	1.7 MB	2	<0.1%	0%	154,158	71.6%	71.6%	1.0
337	missingindicator_cnt_installment_median	float64	1.7 MB	2	<0.1%	0%	202,669	94.2%	94.2%	0.0
338	missingindicator_cnt_installment_min	float64	1.7 MB	2	<0.1%	0%	202,669	94.2%	94.2%	0.0
339	missingindicator_cnt_installment_range	float64	1.7 MB	2	<0.1%	0%	202,669	94.2%	94.2%	0.0
340	missingindicator_cnt_installments_diff_mean	float64	1.7 MB	2	<0.1%	0%	202,669	94.2%	94.2%	0.0
341	missingindicator_cnt_installments_diff_min	float64	1.7 MB	2	<0.1%	0%	202,669	94.2%	94.2%	0.0
342	missingindicator_cnt_installments_diff_range	float64	1.7 MB	2	<0.1%	0%	202,669	94.2%	94.2%	0.0
343	missingindicator_cnt_payment_median	float64	1.7 MB	2	<0.1%	0%	203,505	94.5%	94.5%	0.0
344	missingindicator_cnt_payment_min	float64	1.7 MB	2	<0.1%	0%	203,505	94.5%	94.5%	0.0
345	missingindicator_cnt_payment_range	float64	1.7 MB	2	<0.1%	0%	203,505	94.5%	94.5%	0.0
346	missingindicator_days_credit_enddate_max	float64	1.7 MB	2	<0.1%	0%	182,825	84.9%	84.9%	0.0
347	missingindicator_days_credit_enddate_min	float64	1.7 MB	2	<0.1%	0%	182,825	84.9%	84.9%	0.0
348	missingindicator_days_credit_enddate_std	float64	1.7 MB	2	<0.1%	0%	156,060	72.5%	72.5%	0.0
349	missingindicator_days_credit_max	float64	1.7 MB	2	<0.1%	0%	184,421	85.7%	85.7%	0.0
350	missingindicator_days_credit_median	float64	1.7 MB	2	<0.1%	0%	184,421	85.7%	85.7%	0.0
351	missingindicator_days_credit_overdue_max	float64	1.7 MB	2	<0.1%	0%	184,421	85.7%	85.7%	0.0
352	missingindicator_days_credit_overdue_mean	float64	1.7 MB	2	<0.1%	0%	184,421	85.7%	85.7%	0.0
353	missingindicator_days_credit_overdue_median	float64	1.7 MB	2	<0.1%	0%	184,421	85.7%	85.7%	0.0
354	missingindicator_days_credit_range	float64	1.7 MB	2	<0.1%	0%	184,421	85.7%	85.7%	0.0
355	missingindicator_days_credit_std	float64	1.7 MB	2	<0.1%	0%	159,292	74.0%	74.0%	0.0
356	missingindicator_days_credit_update_max	float64	1.7 MB	2	<0.1%	0%	184,421	85.7%	85.7%	0.0
357	missingindicator_days_credit_update_median	float64	1.7 MB	2	<0.1%	0%	184,421	85.7%	85.7%	0.0
358	missingindicator_days_credit_update_range	float64	1.7 MB	2	<0.1%	0%	184,421	85.7%	85.7%	0.0
359	missingindicator_days_decision_max	float64	1.7 MB	2	<0.1%	0%	203,801	94.7%	94.7%	0.0
360	missingindicator_days_decision_median	float64	1.7 MB	2	<0.1%	0%	203,801	94.7%	94.7%	0.0
361	missingindicator_days_decision_range	float64	1.7 MB	2	<0.1%	0%	203,801	94.7%	94.7%	0.0
362	missingindicator_days_enddate_fact_max	float64	1.7 MB	2	<0.1%	0%	161,387	75.0%	75.0%	0.0
363	missingindicator_days_enddate_fact_median	float64	1.7 MB	2	<0.1%	0%	161,387	75.0%	75.0%	0.0
364	missingindicator_days_enddate_fact_range	float64	1.7 MB	2	<0.1%	0%	161,387	75.0%	75.0%	0.0
365	missingindicator_days_first_draw_min	float64	1.7 MB	2	<0.1%	0%	202,880	94.3%	94.3%	0.0
366	missingindicator_days_last_due_1st_version_max	float64	1.7 MB	2	<0.1%	0%	202,880	94.3%	94.3%	0.0
367	missingindicator_days_last_due_1st_version_mean	float64	1.7 MB	2	<0.1%	0%	202,880	94.3%	94.3%	0.0
368	missingindicator_days_last_due_1st_version_median	float64	1.7 MB	2	<0.1%	0%	202,880	94.3%	94.3%	0.0
369	missingindicator_days_last_due_1st_version_min	float64	1.7 MB	2	<0.1%	0%	202,880	94.3%	94.3%	0.0
370	missingindicator_days_last_due_max	float64	1.7 MB	2	<0.1%	0%	202,880	94.3%	94.3%	0.0
371	missingindicator_days_last_due_range	float64	1.7 MB	2	<0.1%	0%	202,880	94.3%	94.3%	0.0
372	missingindicator_days_termination_median	float64	1.7 MB	2	<0.1%	0%	202,880	94.3%	94.3%	0.0
373	missingindicator_days_termination_min	float64	1.7 MB	2	<0.1%	0%	202,880	94.3%	94.3%	0.0
374	missingindicator_diff_amt_installment_payment_max	float64	1.7 MB	2	<0.1%	0%	204,220	94.9%	94.9%	0.0
375	missingindicator_diff_amt_installment_payment_mean	float64	1.7 MB	2	<0.1%	0%	204,220	94.9%	94.9%	0.0
376	missingindicator_diff_amt_installment_payment_median	float64	1.7 MB	2	<0.1%	0%	204,220	94.9%	94.9%	0.0
377	missingindicator_diff_amt_installment_payment_range	float64	1.7 MB	2	<0.1%	0%	204,220	94.9%	94.9%	0.0
378	missingindicator_diff_days_installment_payment_max	float64	1.7 MB	2	<0.1%	0%	204,220	94.9%	94.9%	0.0
379	missingindicator_diff_days_installment_payment_mean	float64	1.7 MB	2	<0.1%	0%	204,220	94.9%	94.9%	0.0
380	missingindicator_diff_days_installment_payment_median	float64	1.7 MB	2	<0.1%	0%	204,220	94.9%	94.9%	0.0
381	missingindicator_diff_days_installment_payment_range	float64	1.7 MB	2	<0.1%	0%	204,220	94.9%	94.9%	0.0
382	missingindicator_diff_days_installment_payment_sum	float64	1.7 MB	2	<0.1%	0%	204,223	94.9%	94.9%	0.0
383	missingindicator_diff_days_installment_payment_sum_late_only	float64	1.7 MB	2	<0.1%	0%	204,223	94.9%	94.9%	0.0
384	missingindicator_diff_percent_installment_payment_mean	float64	1.7 MB	2	<0.1%	0%	204,220	94.9%	94.9%	0.0
385	missingindicator_diff_percent_installment_payment_median	float64	1.7 MB	2	<0.1%	0%	204,220	94.9%	94.9%	0.0
386	missingindicator_diff_percent_installment_payment_min	float64	1.7 MB	2	<0.1%	0%	204,220	94.9%	94.9%	0.0
387	missingindicator_diff_percent_installment_payment_range	float64	1.7 MB	2	<0.1%	0%	204,220	94.9%	94.9%	0.0
388	missingindicator_n_car_loans	float64	1.7 MB	2	<0.1%	0%	184,421	85.7%	85.7%	0.0
389	missingindicator_n_cash_loans	float64	1.7 MB	2	<0.1%	0%	203,801	94.7%	94.7%	0.0
390	missingindicator_n_channel_type_ap_minus	float64	1.7 MB	2	<0.1%	0%	203,801	94.7%	94.7%	0.0
391	missingindicator_n_channel_type_car_dealer	float64	1.7 MB	2	<0.1%	0%	203,801	94.7%	94.7%	0.0
392	missingindicator_n_channel_type_channel_corporate_sales	float64	1.7 MB	2	<0.1%	0%	203,801	94.7%	94.7%	0.0
393	missingindicator_n_channel_type_contact_center	float64	1.7 MB	2	<0.1%	0%	203,801	94.7%	94.7%	0.0
394	missingindicator_n_channel_type_countrywide	float64	1.7 MB	2	<0.1%	0%	203,801	94.7%	94.7%	0.0
395	missingindicator_n_channel_type_credit_and_cash	float64	1.7 MB	2	<0.1%	0%	203,801	94.7%	94.7%	0.0
396	missingindicator_n_channel_type_regional_and_local	float64	1.7 MB	2	<0.1%	0%	203,801	94.7%	94.7%	0.0
397	missingindicator_n_channel_type_stone	float64	1.7 MB	2	<0.1%	0%	203,801	94.7%	94.7%	0.0
398	missingindicator_n_client_type_new	float64	1.7 MB	2	<0.1%	0%	203,801	94.7%	94.7%	0.0
399	missingindicator_n_client_type_refreshed	float64	1.7 MB	2	<0.1%	0%	203,801	94.7%	94.7%	0.0
400	missingindicator_n_client_type_repeater	float64	1.7 MB	2	<0.1%	0%	203,801	94.7%	94.7%	0.0
401	missingindicator_n_consumer_loans	float64	1.7 MB	2	<0.1%	0%	203,801	94.7%	94.7%	0.0
402	missingindicator_n_contract_status_refused	float64	1.7 MB	2	<0.1%	0%	203,801	94.7%	94.7%	0.0
403	missingindicator_n_contract_status_unused_offer	float64	1.7 MB	2	<0.1%	0%	203,801	94.7%	94.7%	0.0
404	missingindicator_n_contracts_credit_card_completed	float64	1.7 MB	2	<0.1%	0%	154,158	71.6%	71.6%	1.0
405	missingindicator_n_credit_card_credits	float64	1.7 MB	2	<0.1%	0%	184,421	85.7%	85.7%	0.0
406	missingindicator_n_credits_active	float64	1.7 MB	2	<0.1%	0%	184,421	85.7%	85.7%	0.0
407	missingindicator_n_credits_sold	float64	1.7 MB	2	<0.1%	0%	184,421	85.7%	85.7%	0.0
408	missingindicator_n_credits_total	float64	1.7 MB	2	<0.1%	0%	184,421	85.7%	85.7%	0.0
409	missingindicator_n_currency_2	float64	1.7 MB	2	<0.1%	0%	184,421	85.7%	85.7%	0.0
410	missingindicator_n_different_channels	float64	1.7 MB	2	<0.1%	0%	203,801	94.7%	94.7%	0.0
411	missingindicator_n_different_contract_types	float64	1.7 MB	2	<0.1%	0%	203,801	94.7%	94.7%	0.0
412	missingindicator_n_different_credit_types	float64	1.7 MB	2	<0.1%	0%	184,421	85.7%	85.7%	0.0
413	missingindicator_n_different_currencies	float64	1.7 MB	2	<0.1%	0%	184,421	85.7%	85.7%	0.0
414	missingindicator_n_installments_late	float64	1.7 MB	2	<0.1%	0%	204,223	94.9%	94.9%	0.0
415	missingindicator_n_installments_late_30	float64	1.7 MB	2	<0.1%	0%	204,223	94.9%	94.9%	0.0
416	missingindicator_n_installments_late_7	float64	1.7 MB	2	<0.1%	0%	204,223	94.9%	94.9%	0.0
417	missingindicator_n_installments_total	float64	1.7 MB	2	<0.1%	0%	204,223	94.9%	94.9%	0.0
418	missingindicator_n_microloans	float64	1.7 MB	2	<0.1%	0%	184,421	85.7%	85.7%	0.0
419	missingindicator_n_mortgages	float64	1.7 MB	2	<0.1%	0%	184,421	85.7%	85.7%	0.0
420	missingindicator_n_nflag_insured_on_approval_mean	float64	1.7 MB	2	<0.1%	0%	202,880	94.3%	94.3%	0.0
421	missingindicator_n_nflag_insured_on_approval_sum	float64	1.7 MB	2	<0.1%	0%	203,801	94.7%	94.7%	0.0
422	missingindicator_n_other_type_credit	float64	1.7 MB	2	<0.1%	0%	184,421	85.7%	85.7%	0.0
423	missingindicator_n_payment_type_cash_through_bank	float64	1.7 MB	2	<0.1%	0%	203,801	94.7%	94.7%	0.0
424	missingindicator_n_payment_type_not_available	float64	1.7 MB	2	<0.1%	0%	203,801	94.7%	94.7%	0.0
425	missingindicator_n_previous_credit_card_applications	float64	1.7 MB	2	<0.1%	0%	154,158	71.6%	71.6%	1.0
426	missingindicator_n_previous_credit_card_applications_signed	float64	1.7 MB	2	<0.1%	0%	154,158	71.6%	71.6%	1.0
427	missingindicator_n_previous_pos_applications	float64	1.7 MB	2	<0.1%	0%	202,687	94.2%	94.2%	0.0
428	missingindicator_n_previous_pos_applications_completed	float64	1.7 MB	2	<0.1%	0%	202,687	94.2%	94.2%	0.0
429	missingindicator_n_previous_pos_applications_signed	float64	1.7 MB	2	<0.1%	0%	202,687	94.2%	94.2%	0.0
430	missingindicator_n_product_type_walk_in	float64	1.7 MB	2	<0.1%	0%	203,801	94.7%	94.7%	0.0
431	missingindicator_n_reject_reason_limit	float64	1.7 MB	2	<0.1%	0%	203,801	94.7%	94.7%	0.0
432	missingindicator_n_reject_reason_scoc	float64	1.7 MB	2	<0.1%	0%	203,801	94.7%	94.7%	0.0
433	missingindicator_n_reject_reason_scofr	float64	1.7 MB	2	<0.1%	0%	203,801	94.7%	94.7%	0.0
434	missingindicator_n_revolving_loans	float64	1.7 MB	2	<0.1%	0%	203,801	94.7%	94.7%	0.0
435	missingindicator_n_yield_group_high	float64	1.7 MB	2	<0.1%	0%	203,801	94.7%	94.7%	0.0
436	missingindicator_n_yield_group_low_action	float64	1.7 MB	2	<0.1%	0%	203,801	94.7%	94.7%	0.0
437	missingindicator_n_yield_group_low_normal	float64	1.7 MB	2	<0.1%	0%	203,801	94.7%	94.7%	0.0
438	missingindicator_n_yield_group_middle	float64	1.7 MB	2	<0.1%	0%	203,801	94.7%	94.7%	0.0
439	missingindicator_percent_installments_early	float64	1.7 MB	2	<0.1%	0%	204,223	94.9%	94.9%	0.0
440	missingindicator_percent_installments_late	float64	1.7 MB	2	<0.1%	0%	204,223	94.9%	94.9%	0.0
441	missingindicator_percent_installments_late_30	float64	1.7 MB	2	<0.1%	0%	204,223	94.9%	94.9%	0.0
442	missingindicator_percent_installments_late_60	float64	1.7 MB	2	<0.1%	0%	204,223	94.9%	94.9%	0.0
443	missingindicator_percent_installments_late_7	float64	1.7 MB	2	<0.1%	0%	204,223	94.9%	94.9%	0.0
444	missingindicator_rate_down_payment_max	float64	1.7 MB	2	<0.1%	0%	191,554	89.0%	89.0%	0.0
445	missingindicator_rate_down_payment_range	float64	1.7 MB	2	<0.1%	0%	191,554	89.0%	89.0%	0.0
446	missingindicator_rate_interest_privileged_count	float64	1.7 MB	2	<0.1%	0%	203,801	94.7%	94.7%	0.0
447	missingindicator_sk_dpd_credit_card_max	float64	1.7 MB	2	<0.1%	0%	154,158	71.6%	71.6%	1.0
448	missingindicator_sk_dpd_credit_card_median	float64	1.7 MB	2	<0.1%	0%	154,158	71.6%	71.6%	1.0
449	missingindicator_sk_dpd_def_credit_card_max	float64	1.7 MB	2	<0.1%	0%	154,158	71.6%	71.6%	1.0
450	missingindicator_sk_dpd_def_pos_applications_max	float64	1.7 MB	2	<0.1%	0%	202,687	94.2%	94.2%	0.0
451	missingindicator_sk_dpd_pos_applications_max	float64	1.7 MB	2	<0.1%	0%	202,687	94.2%	94.2%	0.0
452	missingindicator_years_employed	float64	1.7 MB	2	<0.1%	0%	176,501	82.0%	82.0%	0.0
453	fondkapremont_mode_not_specified	float64	1.7 MB	2	<0.1%	0%	211,294	98.2%	98.2%	0.0
454	fondkapremont_mode_org_spec_account	float64	1.7 MB	2	<0.1%	0%	211,329	98.2%	98.2%	0.0
455	fondkapremont_mode_reg_oper_account	float64	1.7 MB	2	<0.1%	0%	163,472	75.9%	75.9%	0.0
456	fondkapremont_mode_reg_oper_spec_account	float64	1.7 MB	2	<0.1%	0%	206,775	96.1%	96.1%	0.0
457	fondkapremont_mode_nan	float64	1.7 MB	2	<0.1%	0%	147,099	68.3%	68.3%	1.0
458	housetype_mode_block_of_flats	float64	1.7 MB	2	<0.1%	0%	109,742	51.0%	51.0%	0.0
459	housetype_mode_specific_housing	float64	1.7 MB	2	<0.1%	0%	214,216	99.5%	99.5%	0.0
460	housetype_mode_terraced_house	float64	1.7 MB	2	<0.1%	0%	214,390	99.6%	99.6%	0.0
461	housetype_mode_nan	float64	1.7 MB	2	<0.1%	0%	107,834	50.1%	50.1%	1.0
462	name_contract_type_cash_loans	float64	1.7 MB	2	<0.1%	0%	194,675	90.4%	90.4%	1.0
463	name_contract_type_revolving_loans	float64	1.7 MB	2	<0.1%	0%	194,675	90.4%	90.4%	0.0
464	name_housing_type_co_op_apartment	float64	1.7 MB	2	<0.1%	0%	214,466	99.6%	99.6%	0.0
465	name_housing_type_house_apartment	float64	1.7 MB	2	<0.1%	0%	191,159	88.8%	88.8%	1.0
466	name_housing_type_municipal_apartment	float64	1.7 MB	2	<0.1%	0%	207,454	96.4%	96.4%	0.0
467	name_housing_type_office_apartment	float64	1.7 MB	2	<0.1%	0%	213,440	99.2%	99.2%	0.0
468	name_housing_type_rented_apartment	float64	1.7 MB	2	<0.1%	0%	211,900	98.4%	98.4%	0.0
469	name_housing_type_with_parents	float64	1.7 MB	2	<0.1%	0%	204,927	95.2%	95.2%	0.0
470	name_income_type_businessman	float64	1.7 MB	2	<0.1%	0%	215,248	>99.9%	>99.9%	0.0
471	name_income_type_commercial_associate	float64	1.7 MB	2	<0.1%	0%	165,151	76.7%	76.7%	0.0
472	name_income_type_maternity_leave	float64	1.7 MB	2	<0.1%	0%	215,254	>99.9%	>99.9%	0.0
473	name_income_type_pensioner	float64	1.7 MB	2	<0.1%	0%	176,509	82.0%	82.0%	0.0
474	name_income_type_state_servant	float64	1.7 MB	2	<0.1%	0%	199,875	92.9%	92.9%	0.0
475	name_income_type_student	float64	1.7 MB	2	<0.1%	0%	215,248	>99.9%	>99.9%	0.0
476	name_income_type_unemployed	float64	1.7 MB	2	<0.1%	0%	215,241	>99.9%	>99.9%	0.0
477	name_income_type_working	float64	1.7 MB	2	<0.1%	0%	110,984	51.6%	51.6%	1.0
478	name_type_suite_children	float64	1.7 MB	2	<0.1%	0%	212,930	98.9%	98.9%	0.0
479	name_type_suite_family	float64	1.7 MB	2	<0.1%	0%	187,256	87.0%	87.0%	0.0
480	name_type_suite_group_of_people	float64	1.7 MB	2	<0.1%	0%	215,059	99.9%	99.9%	0.0
481	name_type_suite_other_a	float64	1.7 MB	2	<0.1%	0%	214,643	99.7%	99.7%	0.0
482	name_type_suite_other_b	float64	1.7 MB	2	<0.1%	0%	214,009	99.4%	99.4%	0.0
483	name_type_suite_spouse_partner	float64	1.7 MB	2	<0.1%	0%	207,378	96.3%	96.3%	0.0
484	name_type_suite_unaccompanied	float64	1.7 MB	2	<0.1%	0%	174,089	80.9%	80.9%	1.0
485	name_type_suite_nan	float64	1.7 MB	2	<0.1%	0%	214,356	99.6%	99.6%	0.0
486	occupation_type_accountants	float64	1.7 MB	2	<0.1%	0%	208,415	96.8%	96.8%	0.0
487	occupation_type_cleaning_staff	float64	1.7 MB	2	<0.1%	0%	211,947	98.5%	98.5%	0.0
488	occupation_type_cooking_staff	float64	1.7 MB	2	<0.1%	0%	211,079	98.1%	98.1%	0.0
489	occupation_type_core_staff	float64	1.7 MB	2	<0.1%	0%	195,912	91.0%	91.0%	0.0
490	occupation_type_drivers	float64	1.7 MB	2	<0.1%	0%	202,169	93.9%	93.9%	0.0
491	occupation_type_hr_staff	float64	1.7 MB	2	<0.1%	0%	214,884	99.8%	99.8%	0.0
492	occupation_type_high_skill_tech_staff	float64	1.7 MB	2	<0.1%	0%	207,280	96.3%	96.3%	0.0
493	occupation_type_it_staff	float64	1.7 MB	2	<0.1%	0%	214,896	99.8%	99.8%	0.0
494	occupation_type_laborers	float64	1.7 MB	2	<0.1%	0%	176,666	82.1%	82.1%	0.0
495	occupation_type_low_skill_laborers	float64	1.7 MB	2	<0.1%	0%	213,777	99.3%	99.3%	0.0
496	occupation_type_managers	float64	1.7 MB	2	<0.1%	0%	200,272	93.0%	93.0%	0.0
497	occupation_type_medicine_staff	float64	1.7 MB	2	<0.1%	0%	209,207	97.2%	97.2%	0.0
498	occupation_type_private_service_staff	float64	1.7 MB	2	<0.1%	0%	213,406	99.1%	99.1%	0.0
499	occupation_type_realty_agents	float64	1.7 MB	2	<0.1%	0%	214,733	99.8%	99.8%	0.0
500	occupation_type_sales_staff	float64	1.7 MB	2	<0.1%	0%	192,972	89.6%	89.6%	0.0
501	occupation_type_secretaries	float64	1.7 MB	2	<0.1%	0%	214,342	99.6%	99.6%	0.0
502	occupation_type_security_staff	float64	1.7 MB	2	<0.1%	0%	210,559	97.8%	97.8%	0.0
503	occupation_type_waiters_barmen_staff	float64	1.7 MB	2	<0.1%	0%	214,333	99.6%	99.6%	0.0
504	occupation_type_nan	float64	1.7 MB	2	<0.1%	0%	147,777	68.7%	68.7%	0.0
505	organization_type_advertising	float64	1.7 MB	2	<0.1%	0%	214,968	99.9%	99.9%	0.0
506	organization_type_agriculture	float64	1.7 MB	2	<0.1%	0%	213,527	99.2%	99.2%	0.0
507	organization_type_bank	float64	1.7 MB	2	<0.1%	0%	213,522	99.2%	99.2%	0.0
508	organization_type_business_entity_type_1	float64	1.7 MB	2	<0.1%	0%	211,043	98.0%	98.0%	0.0
509	organization_type_business_entity_type_2	float64	1.7 MB	2	<0.1%	0%	207,883	96.6%	96.6%	0.0
510	organization_type_business_entity_type_3	float64	1.7 MB	2	<0.1%	0%	167,675	77.9%	77.9%	0.0
511	organization_type_cleaning	float64	1.7 MB	2	<0.1%	0%	215,062	99.9%	99.9%	0.0
512	organization_type_construction	float64	1.7 MB	2	<0.1%	0%	210,553	97.8%	97.8%	0.0
513	organization_type_culture	float64	1.7 MB	2	<0.1%	0%	214,988	99.9%	99.9%	0.0
514	organization_type_electricity	float64	1.7 MB	2	<0.1%	0%	214,583	99.7%	99.7%	0.0
515	organization_type_emergency	float64	1.7 MB	2	<0.1%	0%	214,862	99.8%	99.8%	0.0
516	organization_type_government	float64	1.7 MB	2	<0.1%	0%	207,933	96.6%	96.6%	0.0
517	organization_type_hotel	float64	1.7 MB	2	<0.1%	0%	214,571	99.7%	99.7%	0.0
518	organization_type_housing	float64	1.7 MB	2	<0.1%	0%	213,202	99.0%	99.0%	0.0
519	organization_type_industry_type_1	float64	1.7 MB	2	<0.1%	0%	214,520	99.7%	99.7%	0.0
520	organization_type_industry_type_10	float64	1.7 MB	2	<0.1%	0%	215,182	>99.9%	>99.9%	0.0
521	organization_type_industry_type_11	float64	1.7 MB	2	<0.1%	0%	213,369	99.1%	99.1%	0.0
522	organization_type_industry_type_12	float64	1.7 MB	2	<0.1%	0%	214,999	99.9%	99.9%	0.0
523	organization_type_industry_type_13	float64	1.7 MB	2	<0.1%	0%	215,211	>99.9%	>99.9%	0.0
524	organization_type_industry_type_2	float64	1.7 MB	2	<0.1%	0%	214,931	99.8%	99.8%	0.0
525	organization_type_industry_type_3	float64	1.7 MB	2	<0.1%	0%	212,965	98.9%	98.9%	0.0
526	organization_type_industry_type_4	float64	1.7 MB	2	<0.1%	0%	214,624	99.7%	99.7%	0.0
527	organization_type_industry_type_5	float64	1.7 MB	2	<0.1%	0%	214,864	99.8%	99.8%	0.0
528	organization_type_industry_type_6	float64	1.7 MB	2	<0.1%	0%	215,180	>99.9%	>99.9%	0.0
529	organization_type_industry_type_7	float64	1.7 MB	2	<0.1%	0%	214,354	99.6%	99.6%	0.0
530	organization_type_industry_type_8	float64	1.7 MB	2	<0.1%	0%	215,240	>99.9%	>99.9%	0.0
531	organization_type_industry_type_9	float64	1.7 MB	2	<0.1%	0%	212,861	98.9%	98.9%	0.0
532	organization_type_insurance	float64	1.7 MB	2	<0.1%	0%	214,842	99.8%	99.8%	0.0
533	organization_type_kindergarten	float64	1.7 MB	2	<0.1%	0%	210,366	97.7%	97.7%	0.0
534	organization_type_legal_services	float64	1.7 MB	2	<0.1%	0%	215,039	99.9%	99.9%	0.0
535	organization_type_medicine	float64	1.7 MB	2	<0.1%	0%	207,340	96.3%	96.3%	0.0
536	organization_type_military	float64	1.7 MB	2	<0.1%	0%	213,400	99.1%	99.1%	0.0
537	organization_type_mobile	float64	1.7 MB	2	<0.1%	0%	215,046	99.9%	99.9%	0.0
538	organization_type_other	float64	1.7 MB	2	<0.1%	0%	203,595	94.6%	94.6%	0.0
539	organization_type_police	float64	1.7 MB	2	<0.1%	0%	213,649	99.3%	99.3%	0.0
540	organization_type_postal	float64	1.7 MB	2	<0.1%	0%	213,737	99.3%	99.3%	0.0
541	organization_type_realtor	float64	1.7 MB	2	<0.1%	0%	214,978	99.9%	99.9%	0.0
542	organization_type_religion	float64	1.7 MB	2	<0.1%	0%	215,198	>99.9%	>99.9%	0.0
543	organization_type_restaurant	float64	1.7 MB	2	<0.1%	0%	213,972	99.4%	99.4%	0.0
544	organization_type_school	float64	1.7 MB	2	<0.1%	0%	208,961	97.1%	97.1%	0.0
545	organization_type_security	float64	1.7 MB	2	<0.1%	0%	212,955	98.9%	98.9%	0.0
546	organization_type_security_ministries	float64	1.7 MB	2	<0.1%	0%	213,854	99.3%	99.3%	0.0
547	organization_type_self_employed	float64	1.7 MB	2	<0.1%	0%	188,576	87.6%	87.6%	0.0
548	organization_type_services	float64	1.7 MB	2	<0.1%	0%	214,168	99.5%	99.5%	0.0
549	organization_type_telecom	float64	1.7 MB	2	<0.1%	0%	214,861	99.8%	99.8%	0.0
550	organization_type_trade_type_1	float64	1.7 MB	2	<0.1%	0%	215,020	99.9%	99.9%	0.0
551	organization_type_trade_type_2	float64	1.7 MB	2	<0.1%	0%	213,919	99.4%	99.4%	0.0
552	organization_type_trade_type_3	float64	1.7 MB	2	<0.1%	0%	212,832	98.9%	98.9%	0.0
553	organization_type_trade_type_4	float64	1.7 MB	2	<0.1%	0%	215,212	>99.9%	>99.9%	0.0
554	organization_type_trade_type_5	float64	1.7 MB	2	<0.1%	0%	215,223	>99.9%	>99.9%	0.0
555	organization_type_trade_type_6	float64	1.7 MB	2	<0.1%	0%	214,832	99.8%	99.8%	0.0
556	organization_type_trade_type_7	float64	1.7 MB	2	<0.1%	0%	209,807	97.5%	97.5%	0.0
557	organization_type_transport_type_1	float64	1.7 MB	2	<0.1%	0%	215,112	99.9%	99.9%	0.0
558	organization_type_transport_type_2	float64	1.7 MB	2	<0.1%	0%	213,728	99.3%	99.3%	0.0
559	organization_type_transport_type_3	float64	1.7 MB	2	<0.1%	0%	214,406	99.6%	99.6%	0.0
560	organization_type_transport_type_4	float64	1.7 MB	2	<0.1%	0%	211,508	98.3%	98.3%	0.0
561	organization_type_university	float64	1.7 MB	2	<0.1%	0%	214,340	99.6%	99.6%	0.0
562	organization_type_xna	float64	1.7 MB	2	<0.1%	0%	176,501	82.0%	82.0%	0.0
563	wallsmaterial_mode_block	float64	1.7 MB	2	<0.1%	0%	208,728	97.0%	97.0%	0.0
564	wallsmaterial_mode_mixed	float64	1.7 MB	2	<0.1%	0%	213,683	99.3%	99.3%	0.0
565	wallsmaterial_mode_monolithic	float64	1.7 MB	2	<0.1%	0%	214,008	99.4%	99.4%	0.0
566	wallsmaterial_mode_others	float64	1.7 MB	2	<0.1%	0%	214,119	99.5%	99.5%	0.0
567	wallsmaterial_mode_panel	float64	1.7 MB	2	<0.1%	0%	168,959	78.5%	78.5%	0.0
568	wallsmaterial_mode_stone_brick	float64	1.7 MB	2	<0.1%	0%	169,849	78.9%	78.9%	0.0
569	wallsmaterial_mode_wooden	float64	1.7 MB	2	<0.1%	0%	211,525	98.3%	98.3%	0.0
570	wallsmaterial_mode_nan	float64	1.7 MB	2	<0.1%	0%	109,329	50.8%	50.8%	1.0
571	mode_credit_type_car_loan	float64	1.7 MB	2	<0.1%	0%	212,121	98.5%	98.5%	0.0
572	mode_credit_type_consumer_credit	float64	1.7 MB	2	<0.1%	0%	160,802	74.7%	74.7%	1.0
573	mode_credit_type_credit_card	float64	1.7 MB	2	<0.1%	0%	196,123	91.1%	91.1%	0.0
574	mode_credit_type_microloan	float64	1.7 MB	2	<0.1%	0%	214,789	99.8%	99.8%	0.0
575	mode_credit_type_mortgage	float64	1.7 MB	2	<0.1%	0%	214,478	99.6%	99.6%	0.0
576	mode_credit_type_other	float64	1.7 MB	2	<0.1%	0%	215,155	>99.9%	>99.9%	0.0
577	mode_credit_type_nan	float64	1.7 MB	2	<0.1%	0%	184,421	85.7%	85.7%	0.0
578	name_education_type_academic_degree	float64	1.7 MB	2	<0.1%	0%	215,153	>99.9%	>99.9%	0.0
579	name_education_type_higher_education	float64	1.7 MB	2	<0.1%	0%	163,003	75.7%	75.7%	0.0
580	name_education_type_incomplete_higher	float64	1.7 MB	2	<0.1%	0%	208,006	96.6%	96.6%	0.0
581	name_education_type_lower_secondary	float64	1.7 MB	2	<0.1%	0%	212,602	98.8%	98.8%	0.0
582	name_education_type_secondary_secondary_special	float64	1.7 MB	2	<0.1%	0%	152,993	71.1%	71.1%	1.0

6.1.3 Steps After Pre-Processing

Next, let’s identify the problematic columns after this step:

Code

problematic_columns_2 = df_processed_col_info.query(
    "n_unique <= 1 or p_missing >= 90.00 or p_dom_excl_na >= 99.85"
)

print(f"N columns to remove: {problematic_columns_2.shape[0]}")
problematic_columns_2.pipe(an.style_col_info)

N columns to remove: 33

Table 6.4. Info on problematic columns to remove after preprocessing

	column	data_type	memory_size	n_unique	p_unique	p_missing	n_dominant	p_dominant	p_dom_excl_na
27	FLAG_DOCUMENT_15	float64	1.7 MB	2	<0.1%	0%	215,015	99.9%	99.9%
141	days_credit_overdue_median	float64	1.7 MB	168	0.1%	0%	214,955	99.9%	99.9%
180	n_channel_type_car_dealer	float64	1.7 MB	6	<0.1%	0%	215,036	99.9%	99.9%
242	missingindicator_AMT_ANNUITY	float64	1.7 MB	2	<0.1%	0%	215,249	>99.9%	>99.9%
250	missingindicator_CNT_FAM_MEMBERS	float64	1.7 MB	2	<0.1%	0%	215,256	>99.9%	>99.9%
252	missingindicator_DAYS_LAST_PHONE_CHANGE	float64	1.7 MB	2	<0.1%	0%	215,256	>99.9%	>99.9%
275	missingindicator_amt_annuity_to_credit_ratio	float64	1.7 MB	2	<0.1%	0%	215,249	>99.9%	>99.9%
276	missingindicator_amt_annuity_to_income_per_family_member	float64	1.7 MB	2	<0.1%	0%	215,248	>99.9%	>99.9%
277	missingindicator_amt_annuity_to_income_ratio	float64	1.7 MB	2	<0.1%	0%	215,249	>99.9%	>99.9%
333	missingindicator_cnt_fam_members_excluding_children	float64	1.7 MB	2	<0.1%	0%	215,256	>99.9%	>99.9%
470	name_income_type_businessman	float64	1.7 MB	2	<0.1%	0%	215,248	>99.9%	>99.9%
472	name_income_type_maternity_leave	float64	1.7 MB	2	<0.1%	0%	215,254	>99.9%	>99.9%
475	name_income_type_student	float64	1.7 MB	2	<0.1%	0%	215,248	>99.9%	>99.9%
476	name_income_type_unemployed	float64	1.7 MB	2	<0.1%	0%	215,241	>99.9%	>99.9%
480	name_type_suite_group_of_people	float64	1.7 MB	2	<0.1%	0%	215,059	99.9%	99.9%
505	organization_type_advertising	float64	1.7 MB	2	<0.1%	0%	214,968	99.9%	99.9%
511	organization_type_cleaning	float64	1.7 MB	2	<0.1%	0%	215,062	99.9%	99.9%
513	organization_type_culture	float64	1.7 MB	2	<0.1%	0%	214,988	99.9%	99.9%
520	organization_type_industry_type_10	float64	1.7 MB	2	<0.1%	0%	215,182	>99.9%	>99.9%
522	organization_type_industry_type_12	float64	1.7 MB	2	<0.1%	0%	214,999	99.9%	99.9%
523	organization_type_industry_type_13	float64	1.7 MB	2	<0.1%	0%	215,211	>99.9%	>99.9%
528	organization_type_industry_type_6	float64	1.7 MB	2	<0.1%	0%	215,180	>99.9%	>99.9%
530	organization_type_industry_type_8	float64	1.7 MB	2	<0.1%	0%	215,240	>99.9%	>99.9%
534	organization_type_legal_services	float64	1.7 MB	2	<0.1%	0%	215,039	99.9%	99.9%
537	organization_type_mobile	float64	1.7 MB	2	<0.1%	0%	215,046	99.9%	99.9%
541	organization_type_realtor	float64	1.7 MB	2	<0.1%	0%	214,978	99.9%	99.9%
542	organization_type_religion	float64	1.7 MB	2	<0.1%	0%	215,198	>99.9%	>99.9%
550	organization_type_trade_type_1	float64	1.7 MB	2	<0.1%	0%	215,020	99.9%	99.9%
553	organization_type_trade_type_4	float64	1.7 MB	2	<0.1%	0%	215,212	>99.9%	>99.9%
554	organization_type_trade_type_5	float64	1.7 MB	2	<0.1%	0%	215,223	>99.9%	>99.9%
557	organization_type_transport_type_1	float64	1.7 MB	2	<0.1%	0%	215,112	99.9%	99.9%
576	mode_credit_type_other	float64	1.7 MB	2	<0.1%	0%	215,155	>99.9%	>99.9%
578	name_education_type_academic_degree	float64	1.7 MB	2	<0.1%	0%	215,153	>99.9%	>99.9%
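
The summary columns used in the filter above (`n_unique`, `p_missing`, `p_dom_excl_na`) come from the project's own `an.col_info` helper, whose implementation is not shown in this section. As a rough illustration only, the sketch below computes comparable statistics with plain pandas on a toy data frame and applies the same `query` thresholds. The function name `col_info_sketch` and the toy data are assumptions made for this example, and the real helper may define these statistics differently.

```python
import numpy as np
import pandas as pd


def col_info_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column summary loosely modelled on Table 6.4 (hypothetical helper)."""
    rows = []
    for name in df.columns:
        s = df[name]
        # Frequency of each non-missing value, most frequent first.
        counts = s.value_counts(dropna=True)
        n_dominant = int(counts.iloc[0]) if len(counts) else 0
        n_non_missing = int(s.notna().sum())
        rows.append(
            {
                "column": name,
                "n_unique": int(s.nunique(dropna=True)),
                "p_missing": 100 * s.isna().mean(),
                "p_dominant": 100 * n_dominant / len(s),
                # Share of the dominant value among non-missing values only.
                "p_dom_excl_na": (
                    100 * n_dominant / n_non_missing if n_non_missing else np.nan
                ),
            }
        )
    return pd.DataFrame(rows)


# Toy data (assumption): one constant, one near-constant,
# one mostly missing and one balanced column.
toy = pd.DataFrame(
    {
        "constant": [1.0] * 1000,
        "rare_flag": [0.0] * 999 + [1.0],
        "mostly_missing": [np.nan] * 950 + [1.0, 2.0] * 25,
        "balanced": [0.0, 1.0] * 500,
    }
)

info = col_info_sketch(toy)
problematic = info.query(
    "n_unique <= 1 or p_missing >= 90.00 or p_dom_excl_na >= 99.85"
)
print(problematic["column"].tolist())
```

In this toy example, each of the first three columns trips a different condition of the filter, while the balanced column passes.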

Next, the problematic columns identified above are excluded, and duplicate and highly correlated features are removed in the same way as before pre-processing:

Code

cols_to_keep_2 = list(
    set(credits_train_transformed.columns) - set(problematic_columns_2.column)
)

pipeline_selection = Pipeline(
    steps=[
        ("column_selector_2", ColumnSelector(cols_to_keep_2)),
        ("drop_duplicate_features", DropDuplicateFeatures()),
        (
            "drop_corr_features",
            SmartCorrelatedSelection(selection_method="variance"),
        ),
    ]
)

pipeline_selection.fit(credits_train_transformed)
# Time: 2m 8.4s

Pipeline(steps=[('column_selector_2',
                 ColumnSelector(keep=['cnt_installment_mature_cum_min',
                                      'days_credit_update_max',
                                      'ORGANIZATION_TYPE_Trade_type_5',
                                      'REG_REGION_NOT_WORK_REGION',
                                      'missingindicator_days_last_due_1st_version_min',
                                      'FLOORSMIN_MEDI',
                                      'ORGANIZATION_TYPE_Business_Entity_Type_1',
                                      'missingindicator_amt_down_payment_max',
                                      'diff_percent_installme...
                                      'missingindicator_diff_days_installment_payment_sum_late_only',
                                      'missingindicator_cnt_drawings_pos_current_median',
                                      'amt_inst_min_regularity_min',
                                      'missingindicator_percent_installments_early',
                                      'missingindicator_amt_credit_sum_limit_sum',
                                      'n_reject_reason_limit', ...])),
                ('drop_duplicate_features', DropDuplicateFeatures()),
                ('drop_corr_features',
                 SmartCorrelatedSelection(selection_method='variance'))])
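
The class names `DropDuplicateFeatures` and `SmartCorrelatedSelection` match selectors from the feature_engine library. As a rough, pandas-only illustration of the idea behind these two steps (not the library's actual implementation), the sketch below drops exact duplicate columns and then, among strongly correlated columns, keeps the one with the largest variance. The 0.8 threshold, the greedy selection order, the function name, and the toy data are all assumptions made for this example.

```python
import numpy as np
import pandas as pd


def drop_duplicates_and_correlated(
    df: pd.DataFrame, threshold: float = 0.8
) -> pd.DataFrame:
    """Illustrative sketch only; not the feature_engine algorithm."""
    # Step 1: drop columns whose values exactly duplicate an earlier column.
    df = df.loc[:, ~df.T.duplicated()]

    # Step 2: visit columns from highest to lowest variance and keep a column
    # only if it is not strongly correlated with any column already kept.
    corr = df.corr().abs()
    order = df.var().sort_values(ascending=False).index
    kept = []
    for col in order:
        if all(corr.loc[col, k] <= threshold for k in kept):
            kept.append(col)

    # Preserve the original column order in the result.
    return df[[c for c in df.columns if c in kept]]


# Toy data (assumption): a duplicate, a correlated and an independent column.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
toy = pd.DataFrame(
    {
        "x": x,
        "x_copy": x,  # exact duplicate of "x"
        "x_scaled": 3 * x + rng.normal(scale=0.1, size=500),  # correlated with "x"
        "noise": rng.normal(size=500),  # unrelated to "x"
    }
)

selected = drop_duplicates_and_correlated(toy)
print(selected.columns.tolist())
```

Here `x_copy` is removed as a duplicate, and `x` is removed in favour of `x_scaled` because the two are almost perfectly correlated and `x_scaled` has the larger variance.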

Code

credits_train_transformed_not_correlated_cols = pipeline_selection.transform(
    credits_train_transformed
).sort_index(axis=1)

Code

credits_train_transformed_not_correlated_cols.shape

(215257, 361)
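
Table 6.4 lists column names in lower case, whereas the transformed data appears to use mixed-case names (for example, `ORGANIZATION_TYPE_Trade_type_5` in the pipeline output), so an exact-match set difference may leave some flagged columns in place. A minimal sketch of a case-insensitive exclusion follows; the helper name `exclude_case_insensitive` and the short name lists are hypothetical and serve only as an illustration.

```python
def exclude_case_insensitive(columns, to_remove):
    """Return columns whose lower-cased name is not in `to_remove` (hypothetical helper)."""
    remove_lower = {name.lower() for name in to_remove}
    return [col for col in columns if col.lower() not in remove_lower]


# Short, illustrative name lists based on names that appear in this section.
columns = [
    "AMT_CREDIT",
    "FLAG_DOCUMENT_15",
    "ORGANIZATION_TYPE_Trade_type_5",
    "days_credit_overdue_median",
]
flagged = [
    "FLAG_DOCUMENT_15",
    "organization_type_trade_type_5",
    "days_credit_overdue_median",
]

# Exact matching misses the name that differs only in letter case.
exact_match = sorted(set(columns) - set(flagged))
# Case-insensitive matching removes it as well.
case_insensitive = exclude_case_insensitive(columns, flagged)

print(exact_match)
print(case_insensitive)
```

Unlike the set difference, the list comprehension also preserves the original column order.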

Code

credits_train_transformed_not_correlated_cols.head()

	AMT_ANNUITY	AMT_CREDIT	AMT_INCOME_TOTAL	AMT_REQ_CREDIT_BUREAU_MON	AMT_REQ_CREDIT_BUREAU_YEAR	BASEMENTAREA_MODE	CNT_FAM_MEMBERS	COMMONAREA_MEDI	DAYS_ID_PUBLISH	DAYS_LAST_PHONE_CHANGE	DAYS_REGISTRATION	DEF_30_CNT_SOCIAL_CIRCLE	ELEVATORS_AVG	ENTRANCES_MODE	EXT_SOURCE_1	EXT_SOURCE_2	EXT_SOURCE_3	FLAG_CONT_MOBILE	FLAG_DOCUMENT_3	FLAG_DOCUMENT_8	FLAG_DOCUMENT_9	FLAG_EMAIL	FLAG_EMP_PHONE	FLAG_OWN_CAR	FLAG_OWN_REALTY	FLAG_PHONE	FLOORSMAX_MEDI	FLOORSMIN_MEDI	FONDKAPREMONT_MODE_reg_oper_account	HOUSETYPE_MODE_nan	LANDAREA_MEDI	NAME_CONTRACT_TYPE_Cash_loans	NAME_HOUSING_TYPE_House_apartment	NAME_INCOME_TYPE_Commercial_associate	NAME_INCOME_TYPE_State_servant	NAME_TYPE_SUITE_Family	NAME_TYPE_SUITE_Unaccompanied	NONLIVINGAREA_MODE	OBS_30_CNT_SOCIAL_CIRCLE	OCCUPATION_TYPE_Accountants	OCCUPATION_TYPE_Drivers	OCCUPATION_TYPE_Laborers	OCCUPATION_TYPE_Managers	OCCUPATION_TYPE_Sales_staff	ORGANIZATION_TYPE_Agriculture	ORGANIZATION_TYPE_Business_Entity_Type_3	ORGANIZATION_TYPE_Construction	ORGANIZATION_TYPE_Self_employed	...	amt_payment_current_range	any_installments_late_7	bureau_dpd_status_max	cnt_drawings_atm_current_max	cnt_drawings_pos_current_max	cnt_fam_members_excluding_children	cnt_installment_mature_cum_max	cnt_installment_median	cnt_installment_min	cnt_installment_range	cnt_installments_diff_range	cnt_payment_median	cnt_payment_min	cnt_payment_range	days_credit_enddate_max	days_credit_enddate_min	days_credit_max	days_credit_median	days_credit_range	days_credit_std	days_credit_update_max	days_credit_update_median	days_credit_update_range	days_decision_max	days_decision_median	days_decision_range	days_enddate_fact_max	days_enddate_fact_median	days_enddate_fact_range	days_first_draw_min	days_last_due_1st_version_max	days_last_due_1st_version_mean	days_last_due_1st_version_median	days_last_due_1st_version_min	days_last_due_max	days_termination_median	days_termination_min	diff_amt_installment_payment_max	diff_amt_installment_payment_mean	diff_amt_installment_payment_range	diff_days_installment_payment_max	diff_days_installment_payment_mean	diff_days_installment_payment_median	diff_days_installment_payment_range	diff_days_installment_payment_sum	diff_days_installment_payment_sum_late_only	diff_percent_installment_payment_mean	diff_percent_installment_payment_median	diff_percent_installment_payment_min	diff_percent_installment_payment_range	missingindicator_EXT_SOURCE_1	missingindicator_EXT_SOURCE_3	missingindicator_YEARS_BUILD_AVG	missingindicator_amt_credit_max_overdue_max	missingindicator_amt_credit_sum_debt_mean	missingindicator_amt_credit_sum_limit_min	missingindicator_amt_credit_sum_limit_std	missingindicator_amt_down_payment_max	missingindicator_bureau_months_balance_max	missingindicator_days_credit_enddate_std	missingindicator_days_enddate_fact_range	mode_credit_type_Consumer_credit	n_car_loans	n_channel_type_contact_center	n_channel_type_countrywide	n_channel_type_regional_and_local	n_client_type_new	n_client_type_refreshed	n_client_type_repeater	n_consumer_loans	n_contract_status_refused	n_credit_card_credits	n_credits_active	n_credits_total	n_different_channels	n_different_contract_types	n_different_credit_types	n_different_currencies	n_installments_late	n_installments_late_7	n_installments_total	n_nflag_insured_on_approval_mean	n_nflag_insured_on_approval_sum	n_payment_type_cash_through_bank	n_payment_type_not_available	n_previous_credit_card_applications	n_previous_pos_applications	n_previous_pos_applications_completed	n_product_type_walk_in	n_revolving_loans	n_yield_group_high	n_yield_group_low_action	n_yield_group_low_normal	n_yield_group_middle	ord_education_type	percent_installments_early	percent_installments_late	percent_installments_late_7	rate_down_payment_max	rate_down_payment_range	years_employed
0	68643.00	1971072.00	405000.00	0.00	0.00	0.10	4.00	0.02	-1823.00	-2169.00	-7460.00	0.00	0.00	0.24	0.68	0.33	0.64	1.00	1.00	0.00	0.00	0.00	1.00	1.00	1.00	0.00	0.17	0.21	1.00	0.00	0.00	1.00	1.00	1.00	0.00	0.00	1.00	0.03	4.00	1.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	1.00	...	63000.00	0.00	1.00	3.00	1.00	2.00	7.00	12.00	12.00	0.00	12.00	12.00	12.00	0.00	934.00	-746.00	-145.00	-1001.50	1094.00	489.28	-7.00	-189.50	734.00	-2169.00	-2169.00	0.00	-362.00	-554.00	384.00	365243.00	-1808.00	-1808.00	-1808.00	-1808.00	-1808.00	-1805.00	-1805.00	0.00	0.00	0.00	25.00	12.75	13.50	24.00	153.00	0.00	1.00	1.00	1.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	1.00	1.00	0.00	1.00	0.00	1.00	0.00	0.00	1.00	0.00	1.00	2.00	4.00	1.00	1.00	3.00	1.00	0.00	0.00	12.00	0.00	0.00	1.00	0.00	21.00	13.00	1.00	0.00	0.00	1.00	0.00	0.00	0.00	3.00	1.00	0.00	0.00	0.10	0.00	2.82
1	38146.50	508495.50	337500.00	0.00	6.00	0.07	2.00	0.02	-1090.00	-659.00	-4054.00	1.00	0.00	0.14	0.51	0.62	0.44	1.00	0.00	1.00	0.00	0.00	1.00	0.00	1.00	0.00	0.17	0.21	0.00	1.00	0.05	1.00	1.00	0.00	1.00	1.00	0.00	0.00	2.00	0.00	0.00	0.00	1.00	0.00	1.00	0.00	0.00	0.00	...	63000.00	0.00	0.00	3.00	1.00	2.00	0.00	12.00	11.00	13.00	11.00	12.00	0.00	24.00	911.00	-1267.00	-300.00	-957.00	1262.00	621.29	-19.00	-360.00	904.00	-330.00	-361.00	329.00	-345.00	-872.50	821.00	365243.00	365243.00	121778.00	61.00	30.00	365243.00	365243.00	-325.00	0.00	0.00	0.00	7.00	3.24	3.00	6.00	68.00	0.00	1.00	1.00	1.00	0.00	1.00	0.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	0.00	0.00	0.00	0.00	0.00	1.00	0.00	4.00	0.00	0.00	1.00	2.00	4.00	1.00	2.00	2.00	1.00	0.00	0.00	21.00	0.67	2.00	2.00	3.00	11.00	22.00	1.00	1.00	2.00	0.00	0.00	1.00	1.00	3.00	1.00	0.00	0.00	0.11	0.00	3.31
2	13068.00	110146.50	112500.00	0.00	1.00	0.07	3.00	0.02	-4130.00	-172.00	-5554.00	0.00	0.00	0.14	0.36	0.65	0.54	1.00	0.00	0.00	1.00	1.00	1.00	0.00	1.00	1.00	0.17	0.21	0.00	1.00	0.05	1.00	1.00	1.00	0.00	0.00	1.00	0.00	0.00	0.00	0.00	1.00	0.00	0.00	0.00	1.00	0.00	0.00	...	63000.00	0.00	0.00	3.00	1.00	2.00	7.00	8.00	2.00	58.00	10.00	10.00	4.00	56.00	911.00	-1267.00	-300.00	-957.00	1262.00	621.29	-19.00	-360.00	904.00	-121.00	-172.00	2606.00	-345.00	-872.50	821.00	365243.00	1628.00	-301.50	-204.00	-2426.00	-112.00	-229.00	-2420.00	0.00	-15000.00	285159.69	27.00	16.19	16.00	23.00	340.00	0.00	0.95	1.00	0.09	0.91	0.00	1.00	1.00	1.00	1.00	1.00	1.00	0.00	1.00	1.00	1.00	0.00	0.00	0.00	1.00	2.00	1.00	1.00	5.00	3.00	1.00	1.00	2.00	4.00	3.00	2.00	2.00	1.00	0.00	0.00	21.00	0.75	3.00	5.00	2.00	21.00	26.00	3.00	1.00	0.00	0.00	1.00	3.00	1.00	1.00	1.00	0.00	0.00	0.47	0.47	1.62
3	3519.00	66384.00	40500.00	1.00	2.00	0.07	4.00	0.02	-5290.00	-1576.00	-5285.00	0.00	0.00	0.14	0.39	0.60	0.45	1.00	1.00	0.00	0.00	0.00	1.00	0.00	1.00	0.00	0.17	0.21	0.00	1.00	0.05	1.00	1.00	1.00	0.00	0.00	1.00	0.00	0.00	0.00	0.00	0.00	0.00	1.00	0.00	0.00	0.00	1.00	...	63000.00	1.00	1.00	3.00	1.00	2.00	7.00	24.00	6.00	18.00	24.00	15.00	6.00	24.00	30905.00	-679.00	-325.00	-545.00	1020.00	398.50	-14.00	-20.00	629.00	-575.00	-1190.00	2293.00	-518.00	-583.50	131.00	365243.00	-84.00	-1387.67	-1392.00	-2687.00	-84.00	-1388.00	-2683.00	9004.50	243.42	9004.50	20.00	4.76	5.00	30.00	176.00	-11.00	121.18	1.00	1.00	4446.67	0.00	0.00	1.00	1.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	1.00	0.00	1.00	2.00	0.00	0.00	1.00	3.00	2.00	1.00	1.00	3.00	5.00	3.00	2.00	2.00	1.00	2.00	1.00	37.00	0.67	2.00	2.00	2.00	21.00	38.00	1.00	0.00	0.00	1.00	0.00	1.00	2.00	1.00	0.78	0.05	0.03	0.10	0.10	14.73
4	31801.50	298512.00	225000.00	0.00	0.00	0.14	2.00	0.10	-3033.00	-624.00	-86.00	0.00	0.40	0.17	0.74	0.66	0.72	1.00	1.00	0.00	0.00	0.00	1.00	1.00	0.00	0.00	0.46	0.00	1.00	0.00	0.00	1.00	1.00	1.00	0.00	0.00	1.00	0.00	3.00	0.00	1.00	0.00	0.00	0.00	0.00	0.00	1.00	0.00	...	63000.00	0.00	0.00	3.00	1.00	2.00	7.00	10.00	5.00	5.00	5.00	10.00	10.00	0.00	703.00	-2526.00	-965.00	-1106.00	1896.00	1056.31	-50.00	-696.00	2445.00	-624.00	-624.00	0.00	-723.00	-1612.00	1778.00	365243.00	-323.00	-323.00	-323.00	-323.00	-473.00	-467.00	-467.00	0.00	0.00	0.00	14.00	6.40	4.00	14.00	32.00	0.00	1.00	1.00	1.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	1.00	0.00	0.00	1.00	0.00	0.00	0.00	1.00	1.00	0.00	0.00	1.00	0.00	1.00	1.00	3.00	1.00	1.00	2.00	1.00	0.00	0.00	5.00	0.00	0.00	1.00	0.00	21.00	6.00	1.00	0.00	0.00	0.00	0.00	1.00	0.00	1.00	0.80	0.00	0.00	0.11	0.00	3.27

5 rows × 361 columns

credits_train_transformed_not_correlated_col_info = an.col_info(
    credits_train_transformed_not_correlated_cols
)

credits_train_transformed_not_correlated_col_info.pipe(an.style_col_info)

Table 6.5. Info on the final set of columns after preprocessing.

	column	data_type	memory_size	n_unique	p_unique	p_missing	n_dominant	p_dominant	p_dom_excl_na	dominant
1	AMT_ANNUITY	float64	1.7 MB	12,801	5.9%	0%	4,499	2.1%	2.1%	9000.0
2	AMT_CREDIT	float64	1.7 MB	5,097	2.4%	0%	6,823	3.2%	3.2%	450000.0
3	AMT_INCOME_TOTAL	float64	1.7 MB	1,949	0.9%	0%	24,982	11.6%	11.6%	135000.0
4	AMT_REQ_CREDIT_BUREAU_DAY	float64	1.7 MB	9	<0.1%	0%	214,228	99.5%	99.5%	0.0
5	AMT_REQ_CREDIT_BUREAU_HOUR	float64	1.7 MB	5	<0.1%	0%	214,142	99.5%	99.5%	0.0
6	AMT_REQ_CREDIT_BUREAU_MON	float64	1.7 MB	22	<0.1%	0%	184,760	85.8%	85.8%	0.0
7	AMT_REQ_CREDIT_BUREAU_QRT	float64	1.7 MB	10	<0.1%	0%	179,976	83.6%	83.6%	0.0
8	AMT_REQ_CREDIT_BUREAU_WEEK	float64	1.7 MB	9	<0.1%	0%	209,327	97.2%	97.2%	0.0
9	AMT_REQ_CREDIT_BUREAU_YEAR	float64	1.7 MB	24	<0.1%	0%	73,441	34.1%	34.1%	1.0
10	BASEMENTAREA_MODE	float64	1.7 MB	3,687	1.7%	0%	125,860	58.5%	58.5%	0.07460000365972519
11	CNT_FAM_MEMBERS	float64	1.7 MB	12	<0.1%	0%	110,672	51.4%	51.4%	2.0
12	COMMONAREA_MEDI	float64	1.7 MB	2,982	1.4%	0%	150,382	69.9%	69.9%	0.020899999886751175
13	DAYS_ID_PUBLISH	float64	1.7 MB	6,122	2.8%	0%	119	0.1%	0.1%	-4074.0
14	DAYS_LAST_PHONE_CHANGE	float64	1.7 MB	3,720	1.7%	0%	26,201	12.2%	12.2%	0.0
15	DAYS_REGISTRATION	float64	1.7 MB	15,249	7.1%	0%	79	<0.1%	<0.1%	-7.0
16	DEF_30_CNT_SOCIAL_CIRCLE	float64	1.7 MB	10	<0.1%	0%	190,702	88.6%	88.6%	0.0
17	ELEVATORS_AVG	float64	1.7 MB	241	0.1%	0%	174,679	81.1%	81.1%	0.0
18	ENTRANCES_MODE	float64	1.7 MB	30	<0.1%	0%	133,580	62.1%	62.1%	0.1378999948501587
19	EXT_SOURCE_1	float64	1.7 MB	83,962	39.0%	0%	121,373	56.4%	56.4%	0.5052886605262756
20	EXT_SOURCE_2	float64	1.7 MB	102,229	47.5%	0%	503	0.2%	0.2%	0.2858978807926178
21	EXT_SOURCE_3	float64	1.7 MB	804	0.4%	0%	43,202	20.1%	20.1%	0.5352762341499329
22	FLAG_CONT_MOBILE	float64	1.7 MB	2	<0.1%	0%	214,855	99.8%	99.8%	1.0
23	FLAG_DOCUMENT_11	float64	1.7 MB	2	<0.1%	0%	214,448	99.6%	99.6%	0.0
24	FLAG_DOCUMENT_13	float64	1.7 MB	2	<0.1%	0%	214,541	99.7%	99.7%	0.0
25	FLAG_DOCUMENT_14	float64	1.7 MB	2	<0.1%	0%	214,614	99.7%	99.7%	0.0
26	FLAG_DOCUMENT_16	float64	1.7 MB	2	<0.1%	0%	213,089	99.0%	99.0%	0.0
27	FLAG_DOCUMENT_18	float64	1.7 MB	2	<0.1%	0%	213,525	99.2%	99.2%	0.0
28	FLAG_DOCUMENT_3	float64	1.7 MB	2	<0.1%	0%	152,845	71.0%	71.0%	1.0
29	FLAG_DOCUMENT_5	float64	1.7 MB	2	<0.1%	0%	212,025	98.5%	98.5%	0.0
30	FLAG_DOCUMENT_6	float64	1.7 MB	2	<0.1%	0%	196,348	91.2%	91.2%	0.0
31	FLAG_DOCUMENT_8	float64	1.7 MB	2	<0.1%	0%	197,689	91.8%	91.8%	0.0
32	FLAG_DOCUMENT_9	float64	1.7 MB	2	<0.1%	0%	214,440	99.6%	99.6%	0.0
33	FLAG_EMAIL	float64	1.7 MB	2	<0.1%	0%	203,006	94.3%	94.3%	0.0
34	FLAG_EMP_PHONE	float64	1.7 MB	2	<0.1%	0%	176,491	82.0%	82.0%	1.0
35	FLAG_IS_EMERGENCY	float64	1.7 MB	2	<0.1%	0%	213,628	99.2%	99.2%	0.0
36	FLAG_OWN_CAR	float64	1.7 MB	2	<0.1%	0%	142,086	66.0%	66.0%	0.0
37	FLAG_OWN_REALTY	float64	1.7 MB	2	<0.1%	0%	149,412	69.4%	69.4%	1.0
38	FLAG_PHONE	float64	1.7 MB	2	<0.1%	0%	154,906	72.0%	72.0%	0.0
39	FLAG_WORK_PHONE	float64	1.7 MB	2	<0.1%	0%	172,406	80.1%	80.1%	0.0
40	FLOORSMAX_MEDI	float64	1.7 MB	49	<0.1%	0%	151,629	70.4%	70.4%	0.16670000553131104
41	FLOORSMIN_MEDI	float64	1.7 MB	47	<0.1%	0%	169,787	78.9%	78.9%	0.20829999446868896
42	FONDKAPREMONT_MODE_not_specified	float64	1.7 MB	2	<0.1%	0%	211,294	98.2%	98.2%	0.0
43	FONDKAPREMONT_MODE_org_spec_account	float64	1.7 MB	2	<0.1%	0%	211,329	98.2%	98.2%	0.0
44	FONDKAPREMONT_MODE_reg_oper_account	float64	1.7 MB	2	<0.1%	0%	163,472	75.9%	75.9%	0.0
45	FONDKAPREMONT_MODE_reg_oper_spec_account	float64	1.7 MB	2	<0.1%	0%	206,775	96.1%	96.1%	0.0
46	HOUSETYPE_MODE_nan	float64	1.7 MB	2	<0.1%	0%	107,834	50.1%	50.1%	1.0
47	HOUSETYPE_MODE_specific_housing	float64	1.7 MB	2	<0.1%	0%	214,216	99.5%	99.5%	0.0
48	HOUSETYPE_MODE_terraced_house	float64	1.7 MB	2	<0.1%	0%	214,390	99.6%	99.6%	0.0
49	LANDAREA_MEDI	float64	1.7 MB	3,393	1.6%	0%	127,718	59.3%	59.3%	0.048700001090765
50	NAME_CONTRACT_TYPE_Cash_loans	float64	1.7 MB	2	<0.1%	0%	194,675	90.4%	90.4%	1.0
51	NAME_EDUCATION_TYPE_Academic_degree	float64	1.7 MB	2	<0.1%	0%	215,153	>99.9%	>99.9%	0.0
52	NAME_EDUCATION_TYPE_Incomplete_higher	float64	1.7 MB	2	<0.1%	0%	208,006	96.6%	96.6%	0.0
53	NAME_EDUCATION_TYPE_Lower_secondary	float64	1.7 MB	2	<0.1%	0%	212,602	98.8%	98.8%	0.0
54	NAME_HOUSING_TYPE_Co_op_apartment	float64	1.7 MB	2	<0.1%	0%	214,466	99.6%	99.6%	0.0
55	NAME_HOUSING_TYPE_House_apartment	float64	1.7 MB	2	<0.1%	0%	191,159	88.8%	88.8%	1.0
56	NAME_HOUSING_TYPE_Municipal_apartment	float64	1.7 MB	2	<0.1%	0%	207,454	96.4%	96.4%	0.0
57	NAME_HOUSING_TYPE_Office_apartment	float64	1.7 MB	2	<0.1%	0%	213,440	99.2%	99.2%	0.0
58	NAME_HOUSING_TYPE_Rented_apartment	float64	1.7 MB	2	<0.1%	0%	211,900	98.4%	98.4%	0.0
59	NAME_HOUSING_TYPE_With_parents	float64	1.7 MB	2	<0.1%	0%	204,927	95.2%	95.2%	0.0
60	NAME_INCOME_TYPE_Businessman	float64	1.7 MB	2	<0.1%	0%	215,248	>99.9%	>99.9%	0.0
61	NAME_INCOME_TYPE_Commercial_associate	float64	1.7 MB	2	<0.1%	0%	165,151	76.7%	76.7%	0.0
62	NAME_INCOME_TYPE_Maternity_leave	float64	1.7 MB	2	<0.1%	0%	215,254	>99.9%	>99.9%	0.0
63	NAME_INCOME_TYPE_State_servant	float64	1.7 MB	2	<0.1%	0%	199,875	92.9%	92.9%	0.0
64	NAME_INCOME_TYPE_Student	float64	1.7 MB	2	<0.1%	0%	215,248	>99.9%	>99.9%	0.0
65	NAME_INCOME_TYPE_Unemployed	float64	1.7 MB	2	<0.1%	0%	215,241	>99.9%	>99.9%	0.0
66	NAME_INCOME_TYPE_Working	float64	1.7 MB	2	<0.1%	0%	110,984	51.6%	51.6%	1.0
67	NAME_TYPE_SUITE_Children	float64	1.7 MB	2	<0.1%	0%	212,930	98.9%	98.9%	0.0
68	NAME_TYPE_SUITE_Family	float64	1.7 MB	2	<0.1%	0%	187,256	87.0%	87.0%	0.0
69	NAME_TYPE_SUITE_Group_of_people	float64	1.7 MB	2	<0.1%	0%	215,059	99.9%	99.9%	0.0
70	NAME_TYPE_SUITE_Other_A	float64	1.7 MB	2	<0.1%	0%	214,643	99.7%	99.7%	0.0
71	NAME_TYPE_SUITE_Other_B	float64	1.7 MB	2	<0.1%	0%	214,009	99.4%	99.4%	0.0
72	NAME_TYPE_SUITE_Spouse_partner	float64	1.7 MB	2	<0.1%	0%	207,378	96.3%	96.3%	0.0
73	NAME_TYPE_SUITE_Unaccompanied	float64	1.7 MB	2	<0.1%	0%	174,089	80.9%	80.9%	1.0
74	NAME_TYPE_SUITE_nan	float64	1.7 MB	2	<0.1%	0%	214,356	99.6%	99.6%	0.0
75	NONLIVINGAPARTMENTS_AVG	float64	1.7 MB	345	0.2%	0%	187,673	87.2%	87.2%	0.0
76	NONLIVINGAREA_MODE	float64	1.7 MB	3,090	1.4%	0%	118,905	55.2%	55.2%	0.0010999999940395355
77	OBS_30_CNT_SOCIAL_CIRCLE	float64	1.7 MB	32	<0.1%	0%	115,264	53.5%	53.5%	0.0
78	OCCUPATION_TYPE_Accountants	float64	1.7 MB	2	<0.1%	0%	208,415	96.8%	96.8%	0.0
79	OCCUPATION_TYPE_Cleaning_staff	float64	1.7 MB	2	<0.1%	0%	211,947	98.5%	98.5%	0.0
80	OCCUPATION_TYPE_Cooking_staff	float64	1.7 MB	2	<0.1%	0%	211,079	98.1%	98.1%	0.0
81	OCCUPATION_TYPE_Core_staff	float64	1.7 MB	2	<0.1%	0%	195,912	91.0%	91.0%	0.0
82	OCCUPATION_TYPE_Drivers	float64	1.7 MB	2	<0.1%	0%	202,169	93.9%	93.9%	0.0
83	OCCUPATION_TYPE_HR_staff	float64	1.7 MB	2	<0.1%	0%	214,884	99.8%	99.8%	0.0
84	OCCUPATION_TYPE_High_skill_tech_staff	float64	1.7 MB	2	<0.1%	0%	207,280	96.3%	96.3%	0.0
85	OCCUPATION_TYPE_IT_staff	float64	1.7 MB	2	<0.1%	0%	214,896	99.8%	99.8%	0.0
86	OCCUPATION_TYPE_Laborers	float64	1.7 MB	2	<0.1%	0%	176,666	82.1%	82.1%	0.0
87	OCCUPATION_TYPE_Low_skill_Laborers	float64	1.7 MB	2	<0.1%	0%	213,777	99.3%	99.3%	0.0
88	OCCUPATION_TYPE_Managers	float64	1.7 MB	2	<0.1%	0%	200,272	93.0%	93.0%	0.0
89	OCCUPATION_TYPE_Medicine_staff	float64	1.7 MB	2	<0.1%	0%	209,207	97.2%	97.2%	0.0
90	OCCUPATION_TYPE_Private_service_staff	float64	1.7 MB	2	<0.1%	0%	213,406	99.1%	99.1%	0.0
91	OCCUPATION_TYPE_Realty_agents	float64	1.7 MB	2	<0.1%	0%	214,733	99.8%	99.8%	0.0
92	OCCUPATION_TYPE_Sales_staff	float64	1.7 MB	2	<0.1%	0%	192,972	89.6%	89.6%	0.0
93	OCCUPATION_TYPE_Secretaries	float64	1.7 MB	2	<0.1%	0%	214,342	99.6%	99.6%	0.0
94	OCCUPATION_TYPE_Security_staff	float64	1.7 MB	2	<0.1%	0%	210,559	97.8%	97.8%	0.0
95	OCCUPATION_TYPE_Waiters_barmen_staff	float64	1.7 MB	2	<0.1%	0%	214,333	99.6%	99.6%	0.0
96	OCCUPATION_TYPE_nan	float64	1.7 MB	2	<0.1%	0%	147,777	68.7%	68.7%	0.0
97	ORGANIZATION_TYPE_Advertising	float64	1.7 MB	2	<0.1%	0%	214,968	99.9%	99.9%	0.0
98	ORGANIZATION_TYPE_Agriculture	float64	1.7 MB	2	<0.1%	0%	213,527	99.2%	99.2%	0.0
99	ORGANIZATION_TYPE_Bank	float64	1.7 MB	2	<0.1%	0%	213,522	99.2%	99.2%	0.0
100	ORGANIZATION_TYPE_Business_Entity_Type_1	float64	1.7 MB	2	<0.1%	0%	211,043	98.0%	98.0%	0.0
101	ORGANIZATION_TYPE_Business_Entity_Type_2	float64	1.7 MB	2	<0.1%	0%	207,883	96.6%	96.6%	0.0
102	ORGANIZATION_TYPE_Business_Entity_Type_3	float64	1.7 MB	2	<0.1%	0%	167,675	77.9%	77.9%	0.0
103	ORGANIZATION_TYPE_Cleaning	float64	1.7 MB	2	<0.1%	0%	215,062	99.9%	99.9%	0.0
104	ORGANIZATION_TYPE_Construction	float64	1.7 MB	2	<0.1%	0%	210,553	97.8%	97.8%	0.0
105	ORGANIZATION_TYPE_Culture	float64	1.7 MB	2	<0.1%	0%	214,988	99.9%	99.9%	0.0
106	ORGANIZATION_TYPE_Electricity	float64	1.7 MB	2	<0.1%	0%	214,583	99.7%	99.7%	0.0
107	ORGANIZATION_TYPE_Emergency	float64	1.7 MB	2	<0.1%	0%	214,862	99.8%	99.8%	0.0
108	ORGANIZATION_TYPE_Government	float64	1.7 MB	2	<0.1%	0%	207,933	96.6%	96.6%	0.0
109	ORGANIZATION_TYPE_Hotel	float64	1.7 MB	2	<0.1%	0%	214,571	99.7%	99.7%	0.0
110	ORGANIZATION_TYPE_Housing	float64	1.7 MB	2	<0.1%	0%	213,202	99.0%	99.0%	0.0
111	ORGANIZATION_TYPE_Industry_type_1	float64	1.7 MB	2	<0.1%	0%	214,520	99.7%	99.7%	0.0
112	ORGANIZATION_TYPE_Industry_type_10	float64	1.7 MB	2	<0.1%	0%	215,182	>99.9%	>99.9%	0.0
113	ORGANIZATION_TYPE_Industry_type_11	float64	1.7 MB	2	<0.1%	0%	213,369	99.1%	99.1%	0.0
114	ORGANIZATION_TYPE_Industry_type_12	float64	1.7 MB	2	<0.1%	0%	214,999	99.9%	99.9%	0.0
115	ORGANIZATION_TYPE_Industry_type_13	float64	1.7 MB	2	<0.1%	0%	215,211	>99.9%	>99.9%	0.0
116	ORGANIZATION_TYPE_Industry_type_2	float64	1.7 MB	2	<0.1%	0%	214,931	99.8%	99.8%	0.0
117	ORGANIZATION_TYPE_Industry_type_3	float64	1.7 MB	2	<0.1%	0%	212,965	98.9%	98.9%	0.0
118	ORGANIZATION_TYPE_Industry_type_4	float64	1.7 MB	2	<0.1%	0%	214,624	99.7%	99.7%	0.0
119	ORGANIZATION_TYPE_Industry_type_5	float64	1.7 MB	2	<0.1%	0%	214,864	99.8%	99.8%	0.0
120	ORGANIZATION_TYPE_Industry_type_6	float64	1.7 MB	2	<0.1%	0%	215,180	>99.9%	>99.9%	0.0
121	ORGANIZATION_TYPE_Industry_type_7	float64	1.7 MB	2	<0.1%	0%	214,354	99.6%	99.6%	0.0
122	ORGANIZATION_TYPE_Industry_type_8	float64	1.7 MB	2	<0.1%	0%	215,240	>99.9%	>99.9%	0.0
123	ORGANIZATION_TYPE_Industry_type_9	float64	1.7 MB	2	<0.1%	0%	212,861	98.9%	98.9%	0.0
124	ORGANIZATION_TYPE_Insurance	float64	1.7 MB	2	<0.1%	0%	214,842	99.8%	99.8%	0.0
125	ORGANIZATION_TYPE_Kindergarten	float64	1.7 MB	2	<0.1%	0%	210,366	97.7%	97.7%	0.0
126	ORGANIZATION_TYPE_Legal_Services	float64	1.7 MB	2	<0.1%	0%	215,039	99.9%	99.9%	0.0
127	ORGANIZATION_TYPE_Medicine	float64	1.7 MB	2	<0.1%	0%	207,340	96.3%	96.3%	0.0
128	ORGANIZATION_TYPE_Military	float64	1.7 MB	2	<0.1%	0%	213,400	99.1%	99.1%	0.0
129	ORGANIZATION_TYPE_Mobile	float64	1.7 MB	2	<0.1%	0%	215,046	99.9%	99.9%	0.0
130	ORGANIZATION_TYPE_Other	float64	1.7 MB	2	<0.1%	0%	203,595	94.6%	94.6%	0.0
131	ORGANIZATION_TYPE_Police	float64	1.7 MB	2	<0.1%	0%	213,649	99.3%	99.3%	0.0
132	ORGANIZATION_TYPE_Postal	float64	1.7 MB	2	<0.1%	0%	213,737	99.3%	99.3%	0.0
133	ORGANIZATION_TYPE_Realtor	float64	1.7 MB	2	<0.1%	0%	214,978	99.9%	99.9%	0.0
134	ORGANIZATION_TYPE_Religion	float64	1.7 MB	2	<0.1%	0%	215,198	>99.9%	>99.9%	0.0
135	ORGANIZATION_TYPE_Restaurant	float64	1.7 MB	2	<0.1%	0%	213,972	99.4%	99.4%	0.0
136	ORGANIZATION_TYPE_School	float64	1.7 MB	2	<0.1%	0%	208,961	97.1%	97.1%	0.0
137	ORGANIZATION_TYPE_Security	float64	1.7 MB	2	<0.1%	0%	212,955	98.9%	98.9%	0.0
138	ORGANIZATION_TYPE_Security_Ministries	float64	1.7 MB	2	<0.1%	0%	213,854	99.3%	99.3%	0.0
139	ORGANIZATION_TYPE_Self_employed	float64	1.7 MB	2	<0.1%	0%	188,576	87.6%	87.6%	0.0
140	ORGANIZATION_TYPE_Services	float64	1.7 MB	2	<0.1%	0%	214,168	99.5%	99.5%	0.0
141	ORGANIZATION_TYPE_Telecom	float64	1.7 MB	2	<0.1%	0%	214,861	99.8%	99.8%	0.0
142	ORGANIZATION_TYPE_Trade_type_1	float64	1.7 MB	2	<0.1%	0%	215,020	99.9%	99.9%	0.0
143	ORGANIZATION_TYPE_Trade_type_2	float64	1.7 MB	2	<0.1%	0%	213,919	99.4%	99.4%	0.0
144	ORGANIZATION_TYPE_Trade_type_3	float64	1.7 MB	2	<0.1%	0%	212,832	98.9%	98.9%	0.0
145	ORGANIZATION_TYPE_Trade_type_4	float64	1.7 MB	2	<0.1%	0%	215,212	>99.9%	>99.9%	0.0
146	ORGANIZATION_TYPE_Trade_type_5	float64	1.7 MB	2	<0.1%	0%	215,223	>99.9%	>99.9%	0.0
147	ORGANIZATION_TYPE_Trade_type_6	float64	1.7 MB	2	<0.1%	0%	214,832	99.8%	99.8%	0.0
148	ORGANIZATION_TYPE_Trade_type_7	float64	1.7 MB	2	<0.1%	0%	209,807	97.5%	97.5%	0.0
149	ORGANIZATION_TYPE_Transport_type_1	float64	1.7 MB	2	<0.1%	0%	215,112	99.9%	99.9%	0.0
150	ORGANIZATION_TYPE_Transport_type_2	float64	1.7 MB	2	<0.1%	0%	213,728	99.3%	99.3%	0.0
151	ORGANIZATION_TYPE_Transport_type_3	float64	1.7 MB	2	<0.1%	0%	214,406	99.6%	99.6%	0.0
152	ORGANIZATION_TYPE_Transport_type_4	float64	1.7 MB	2	<0.1%	0%	211,508	98.3%	98.3%	0.0
153	ORGANIZATION_TYPE_University	float64	1.7 MB	2	<0.1%	0%	214,340	99.6%	99.6%	0.0
154	OWN_CAR_AGE	float64	1.7 MB	61	<0.1%	0%	145,584	67.6%	67.6%	9.0
155	REGION_POPULATION_RELATIVE	float64	1.7 MB	81	<0.1%	0%	11,494	5.3%	5.3%	0.03579200059175491
156	REGION_RATING_CLIENT	float64	1.7 MB	3	<0.1%	0%	158,846	73.8%	73.8%	2.0
157	REG_CITY_NOT_LIVE_CITY	float64	1.7 MB	2	<0.1%	0%	198,549	92.2%	92.2%	0.0
158	REG_CITY_NOT_WORK_CITY	float64	1.7 MB	2	<0.1%	0%	165,697	77.0%	77.0%	0.0
159	REG_REGION_NOT_LIVE_REGION	float64	1.7 MB	2	<0.1%	0%	211,999	98.5%	98.5%	0.0
160	REG_REGION_NOT_WORK_REGION	float64	1.7 MB	2	<0.1%	0%	204,222	94.9%	94.9%	0.0
161	WALLSMATERIAL_MODE_Block	float64	1.7 MB	2	<0.1%	0%	208,728	97.0%	97.0%	0.0
162	WALLSMATERIAL_MODE_Mixed	float64	1.7 MB	2	<0.1%	0%	213,683	99.3%	99.3%	0.0
163	WALLSMATERIAL_MODE_Monolithic	float64	1.7 MB	2	<0.1%	0%	214,008	99.4%	99.4%	0.0
164	WALLSMATERIAL_MODE_Others	float64	1.7 MB	2	<0.1%	0%	214,119	99.5%	99.5%	0.0
165	WALLSMATERIAL_MODE_Panel	float64	1.7 MB	2	<0.1%	0%	168,959	78.5%	78.5%	0.0
166	WALLSMATERIAL_MODE_Stone_brick	float64	1.7 MB	2	<0.1%	0%	169,849	78.9%	78.9%	0.0
167	WALLSMATERIAL_MODE_Wooden	float64	1.7 MB	2	<0.1%	0%	211,525	98.3%	98.3%	0.0
168	YEARS_BEGINEXPLUATATION_MODE	float64	1.7 MB	210	0.1%	0%	107,681	50.0%	50.0%	0.9815999865531921
169	YEARS_BUILD_AVG	float64	1.7 MB	146	0.1%	0%	144,837	67.3%	67.3%	0.7552000284194946
170	amt_annuity_max	float64	1.7 MB	18,638	8.7%	0%	159,516	74.1%	74.1%	12500.01
171	amt_annuity_median	float64	1.7 MB	16,441	7.6%	0%	159,485	74.1%	74.1%	3942.0
172	amt_annuity_median_previous_application	float64	1.7 MB	157,063	73.0%	0%	11,753	5.5%	5.5%	10773.157500000001
173	amt_annuity_min	float64	1.7 MB	9,921	4.6%	0%	196,455	91.3%	91.3%	0.0
174	amt_annuity_min_previous_application	float64	1.7 MB	113,816	52.9%	0%	16,017	7.4%	7.4%	2250.0
175	amt_annuity_to_credit_ratio	float64	1.7 MB	33,148	15.4%	0%	20,564	9.6%	9.6%	0.05000000074505806
176	amt_annuity_to_income_per_family_member	float64	1.7 MB	88,172	41.0%	0%	1,500	0.7%	0.7%	0.3
177	amt_annuity_to_income_ratio	float64	1.7 MB	71,916	33.4%	0%	2,049	1.0%	1.0%	0.1
178	amt_balance_credit_card_max	float64	1.7 MB	40,175	18.7%	0%	154,159	71.6%	71.6%	97790.49
179	amt_balance_credit_card_median	float64	1.7 MB	27,685	12.9%	0%	187,185	87.0%	87.0%	0.0
180	amt_balance_credit_card_min	float64	1.7 MB	8,310	3.9%	0%	206,302	95.8%	95.8%	0.0
181	amt_credit_limit_actual_median	float64	1.7 MB	151	0.1%	0%	155,593	72.3%	72.3%	157500.0
182	amt_credit_limit_actual_range	float64	1.7 MB	147	0.1%	0%	157,689	73.3%	73.3%	45000.0
183	amt_credit_max	float64	1.7 MB	49,618	23.1%	0%	14,581	6.8%	6.8%	225000.0
184	amt_credit_max_overdue_max	float64	1.7 MB	32,871	15.3%	0%	166,187	77.2%	77.2%	0.0
185	amt_credit_max_overdue_range	float64	1.7 MB	27,267	12.7%	0%	175,595	81.6%	81.6%	0.0
186	amt_credit_median	float64	1.7 MB	73,966	34.4%	0%	11,457	5.3%	5.3%	83054.25
187	amt_credit_min	float64	1.7 MB	33,220	15.4%	0%	79,660	37.0%	37.0%	0.0
188	amt_credit_range	float64	1.7 MB	71,950	33.4%	0%	37,038	17.2%	17.2%	0.0
189	amt_credit_sum_debt_mean	float64	1.7 MB	121,544	56.5%	0%	48,543	22.6%	22.6%	0.0
190	amt_credit_sum_debt_sum	float64	1.7 MB	113,811	52.9%	0%	53,746	25.0%	25.0%	0.0
191	amt_credit_sum_limit_min	float64	1.7 MB	2,121	1.0%	0%	212,794	98.9%	98.9%	0.0
192	amt_credit_sum_limit_sum	float64	1.7 MB	26,367	12.2%	0%	181,184	84.2%	84.2%	0.0
193	amt_credit_sum_median	float64	1.7 MB	77,800	36.1%	0%	30,841	14.3%	14.3%	133852.5
194	amt_credit_sum_overdue_sum	float64	1.7 MB	930	0.4%	0%	212,926	98.9%	98.9%	0.0
195	amt_credit_sum_std	float64	1.7 MB	148,440	69.0%	0%	55,965	26.0%	26.0%	183202.88926385253
196	amt_credit_sum_sum	float64	1.7 MB	147,742	68.6%	0%	30,837	14.3%	14.3%	964161.0
197	amt_credit_to_income_ratio	float64	1.7 MB	39,372	18.3%	0%	3,691	1.7%	1.7%	2.0
198	amt_down_payment_max	float64	1.7 MB	17,608	8.2%	0%	53,725	25.0%	25.0%	0.0
199	amt_drawings_atm_current_max	float64	1.7 MB	1,131	0.5%	0%	175,102	81.3%	81.3%	90000.0
200	amt_drawings_atm_current_median	float64	1.7 MB	378	0.2%	0%	208,835	97.0%	97.0%	0.0
201	amt_drawings_atm_current_min	float64	1.7 MB	114	0.1%	0%	214,655	99.7%	99.7%	0.0
202	amt_drawings_current_mean	float64	1.7 MB	35,095	16.3%	0%	154,159	71.6%	71.6%	3498.702077922078
203	amt_drawings_current_min	float64	1.7 MB	1,475	0.7%	0%	213,422	99.1%	99.1%	0.0
204	amt_drawings_other_current_max	float64	1.7 MB	1,084	0.5%	0%	211,253	98.1%	98.1%	0.0
205	amt_drawings_pos_current_max	float64	1.7 MB	20,726	9.6%	0%	172,260	80.0%	80.0%	6300.0
206	amt_drawings_pos_current_mean	float64	1.7 MB	23,516	10.9%	0%	172,255	80.0%	80.0%	303.42857142857144
207	amt_drawings_pos_current_min	float64	1.7 MB	1,772	0.8%	0%	213,337	99.1%	99.1%	0.0
208	amt_goods_price_min	float64	1.7 MB	39,171	18.2%	0%	12,169	5.7%	5.7%	45735.75
209	amt_inst_min_regularity_min	float64	1.7 MB	1,664	0.8%	0%	211,946	98.5%	98.5%	0.0
210	amt_payment_current_median	float64	1.7 MB	17,066	7.9%	0%	172,523	80.1%	80.1%	5850.0
211	amt_payment_current_min	float64	1.7 MB	7,398	3.4%	0%	199,261	92.6%	92.6%	0.0
212	amt_payment_current_range	float64	1.7 MB	22,545	10.5%	0%	172,454	80.1%	80.1%	63000.0
213	amt_payment_total_current_min	float64	1.7 MB	1,131	0.5%	0%	213,443	99.2%	99.2%	0.0
214	any_installments_late_30	float64	1.7 MB	2	<0.1%	0%	201,997	93.8%	93.8%	0.0
215	any_installments_late_60	float64	1.7 MB	2	<0.1%	0%	209,180	97.2%	97.2%	0.0
216	any_installments_late_7	float64	1.7 MB	2	<0.1%	0%	158,592	73.7%	73.7%	0.0
217	bureau_dpd_status_max	float64	1.7 MB	6	<0.1%	0%	193,628	90.0%	90.0%	0.0
218	bureau_dpd_status_median	float64	1.7 MB	11	<0.1%	0%	214,312	99.6%	99.6%	0.0
219	bureau_months_balance_max	float64	1.7 MB	89	<0.1%	0%	212,281	98.6%	98.6%	0.0
220	cnt_credit_prolong_mean	float64	1.7 MB	100	<0.1%	0%	209,248	97.2%	97.2%	0.0
221	cnt_credit_prolong_sum	float64	1.7 MB	10	<0.1%	0%	209,248	97.2%	97.2%	0.0
222	cnt_drawings_atm_current_max	float64	1.7 MB	43	<0.1%	0%	178,554	82.9%	82.9%	3.0
223	cnt_drawings_current_min	float64	1.7 MB	39	<0.1%	0%	213,436	99.2%	99.2%	0.0
224	cnt_drawings_other_current_max	float64	1.7 MB	11	<0.1%	0%	211,241	98.1%	98.1%	0.0
225	cnt_drawings_pos_current_max	float64	1.7 MB	116	0.1%	0%	176,434	82.0%	82.0%	1.0
226	cnt_drawings_pos_current_median	float64	1.7 MB	113	0.1%	0%	205,975	95.7%	95.7%	0.0
227	cnt_drawings_pos_current_min	float64	1.7 MB	40	<0.1%	0%	213,337	99.1%	99.1%	0.0
228	cnt_fam_members_excluding_children	float64	1.7 MB	2	<0.1%	0%	158,302	73.5%	73.5%	2.0
229	cnt_installment_future_min	float64	1.7 MB	61	<0.1%	0%	196,054	91.1%	91.1%	0.0
230	cnt_installment_mature_cum_max	float64	1.7 MB	120	0.1%	0%	156,329	72.6%	72.6%	7.0
231	cnt_installment_mature_cum_min	float64	1.7 MB	28	<0.1%	0%	193,011	89.7%	89.7%	0.0
232	cnt_installment_median	float64	1.7 MB	103	<0.1%	0%	73,750	34.3%	34.3%	12.0
233	cnt_installment_min	float64	1.7 MB	53	<0.1%	0%	54,950	25.5%	25.5%	6.0
234	cnt_installment_range	float64	1.7 MB	69	<0.1%	0%	49,692	23.1%	23.1%	0.0
235	cnt_installments_diff_min	float64	1.7 MB	58	<0.1%	0%	210,671	97.9%	97.9%	0.0
236	cnt_installments_diff_range	float64	1.7 MB	82	<0.1%	0%	48,330	22.5%	22.5%	12.0
237	cnt_payment_median	float64	1.7 MB	87	<0.1%	0%	65,750	30.5%	30.5%	12.0
238	cnt_payment_min	float64	1.7 MB	31	<0.1%	0%	68,588	31.9%	31.9%	0.0
239	cnt_payment_range	float64	1.7 MB	69	<0.1%	0%	54,639	25.4%	25.4%	0.0
240	days_credit_enddate_max	float64	1.7 MB	12,274	5.7%	0%	32,491	15.1%	15.1%	911.0
241	days_credit_enddate_min	float64	1.7 MB	6,266	2.9%	0%	32,492	15.1%	15.1%	-1267.0
242	days_credit_max	float64	1.7 MB	2,922	1.4%	0%	31,067	14.4%	14.4%	-300.0
243	days_credit_median	float64	1.7 MB	5,711	2.7%	0%	30,932	14.4%	14.4%	-957.0
244	days_credit_overdue_max	float64	1.7 MB	671	0.3%	0%	212,892	98.9%	98.9%	0.0
245	days_credit_overdue_mean	float64	1.7 MB	1,195	0.6%	0%	212,892	98.9%	98.9%	0.0
246	days_credit_range	float64	1.7 MB	2,913	1.4%	0%	30,890	14.4%	14.4%	1262.0
247	days_credit_std	float64	1.7 MB	133,053	61.8%	0%	55,965	26.0%	26.0%	621.2873840332031
248	days_credit_update_max	float64	1.7 MB	2,585	1.2%	0%	34,359	16.0%	16.0%	-19.0
249	days_credit_update_median	float64	1.7 MB	4,779	2.2%	0%	30,948	14.4%	14.4%	-360.0
250	days_credit_update_range	float64	1.7 MB	2,925	1.4%	0%	30,911	14.4%	14.4%	904.0
251	days_decision_max	float64	1.7 MB	2,921	1.4%	0%	11,697	5.4%	5.4%	-299.0
252	days_decision_median	float64	1.7 MB	5,656	2.6%	0%	11,546	5.4%	5.4%	-647.0
253	days_decision_range	float64	1.7 MB	2,919	1.4%	0%	40,565	18.8%	18.8%	0.0
254	days_enddate_fact_max	float64	1.7 MB	2,793	1.3%	0%	54,020	25.1%	25.1%	-345.0
255	days_enddate_fact_median	float64	1.7 MB	5,341	2.5%	0%	53,910	25.0%	25.0%	-872.5
256	days_enddate_fact_range	float64	1.7 MB	2,796	1.3%	0%	53,924	25.1%	25.1%	821.0
257	days_first_draw_min	float64	1.7 MB	2,718	1.3%	0%	177,781	82.6%	82.6%	365243.0
258	days_last_due_1st_version_max	float64	1.7 MB	4,521	2.1%	0%	55,263	25.7%	25.7%	365243.0
259	days_last_due_1st_version_mean	float64	1.7 MB	51,499	23.9%	0%	12,398	5.8%	5.8%	-207.5
260	days_last_due_1st_version_median	float64	1.7 MB	10,719	5.0%	0%	12,497	5.8%	5.8%	-325.0
261	days_last_due_1st_version_min	float64	1.7 MB	4,081	1.9%	0%	12,430	5.8%	5.8%	-1089.0
262	days_last_due_max	float64	1.7 MB	2,761	1.3%	0%	98,527	45.8%	45.8%	365243.0
263	days_termination_median	float64	1.7 MB	7,716	3.6%	0%	23,269	10.8%	10.8%	365243.0
264	days_termination_min	float64	1.7 MB	2,797	1.3%	0%	15,833	7.4%	7.4%	365243.0
265	diff_amt_installment_payment_max	float64	1.7 MB	75,445	35.0%	0%	127,555	59.3%	59.3%	0.0
266	diff_amt_installment_payment_mean	float64	1.7 MB	97,257	45.2%	0%	114,097	53.0%	53.0%	0.0
267	diff_amt_installment_payment_median	float64	1.7 MB	6,855	3.2%	0%	206,997	96.2%	96.2%	0.0
268	diff_amt_installment_payment_range	float64	1.7 MB	90,195	41.9%	0%	114,099	53.0%	53.0%	0.0
269	diff_days_installment_payment_max	float64	1.7 MB	409	0.2%	0%	18,396	8.5%	8.5%	31.0
270	diff_days_installment_payment_mean	float64	1.7 MB	50,247	23.3%	0%	11,037	5.1%	5.1%	9.524199962615967
271	diff_days_installment_payment_median	float64	1.7 MB	320	0.1%	0%	21,620	10.0%	10.0%	0.0
272	diff_days_installment_payment_range	float64	1.7 MB	1,465	0.7%	0%	14,802	6.9%	6.9%	37.0
273	diff_days_installment_payment_sum	float64	1.7 MB	4,383	2.0%	0%	11,369	5.3%	5.3%	240.0
274	diff_days_installment_payment_sum_late_only	float64	1.7 MB	1,815	0.8%	0%	95,670	44.4%	44.4%	0.0
275	diff_percent_installment_payment_mean	float64	1.7 MB	87,934	40.9%	0%	114,228	53.1%	53.1%	1.0
276	diff_percent_installment_payment_median	float64	1.7 MB	7,969	3.7%	0%	206,997	96.2%	96.2%	1.0
277	diff_percent_installment_payment_min	float64	1.7 MB	25,589	11.9%	0%	189,010	87.8%	87.8%	1.0
278	diff_percent_installment_payment_range	float64	1.7 MB	97,055	45.1%	0%	114,227	53.1%	53.1%	0.0
279	missingindicator_DEF_30_CNT_SOCIAL_CIRCLE	float64	1.7 MB	2	<0.1%	0%	214,543	99.7%	99.7%	0.0
280	missingindicator_EXT_SOURCE_1	float64	1.7 MB	2	<0.1%	0%	121,373	56.4%	56.4%	1.0
281	missingindicator_EXT_SOURCE_2	float64	1.7 MB	2	<0.1%	0%	214,793	99.8%	99.8%	0.0
282	missingindicator_EXT_SOURCE_3	float64	1.7 MB	2	<0.1%	0%	172,577	80.2%	80.2%	0.0
283	missingindicator_YEARS_BUILD_AVG	float64	1.7 MB	2	<0.1%	0%	143,036	66.4%	66.4%	1.0
284	missingindicator_amt_credit_max_overdue_max	float64	1.7 MB	2	<0.1%	0%	128,619	59.8%	59.8%	0.0
285	missingindicator_amt_credit_sum_debt_mean	float64	1.7 MB	2	<0.1%	0%	179,218	83.3%	83.3%	0.0
286	missingindicator_amt_credit_sum_limit_min	float64	1.7 MB	2	<0.1%	0%	169,672	78.8%	78.8%	0.0
287	missingindicator_amt_credit_sum_limit_std	float64	1.7 MB	2	<0.1%	0%	134,361	62.4%	62.4%	0.0
288	missingindicator_amt_down_payment_max	float64	1.7 MB	2	<0.1%	0%	191,554	89.0%	89.0%	0.0
289	missingindicator_bureau_months_balance_max	float64	1.7 MB	2	<0.1%	0%	152,586	70.9%	70.9%	1.0
290	missingindicator_cnt_installment_range	float64	1.7 MB	2	<0.1%	0%	202,669	94.2%	94.2%	0.0
291	missingindicator_days_credit_enddate_std	float64	1.7 MB	2	<0.1%	0%	156,060	72.5%	72.5%	0.0
292	missingindicator_days_enddate_fact_range	float64	1.7 MB	2	<0.1%	0%	161,387	75.0%	75.0%	0.0
293	mode_credit_type_Car_loan	float64	1.7 MB	2	<0.1%	0%	212,121	98.5%	98.5%	0.0
294	mode_credit_type_Consumer_credit	float64	1.7 MB	2	<0.1%	0%	160,802	74.7%	74.7%	1.0
295	mode_credit_type_Credit_card	float64	1.7 MB	2	<0.1%	0%	196,123	91.1%	91.1%	0.0
296	mode_credit_type_Microloan	float64	1.7 MB	2	<0.1%	0%	214,789	99.8%	99.8%	0.0
297	mode_credit_type_Mortgage	float64	1.7 MB	2	<0.1%	0%	214,478	99.6%	99.6%	0.0
298	mode_credit_type_Other	float64	1.7 MB	2	<0.1%	0%	215,155	>99.9%	>99.9%	0.0
299	n_car_loans	float64	1.7 MB	9	<0.1%	0%	201,519	93.6%	93.6%	0.0
300	n_channel_type_ap_minus	float64	1.7 MB	33	<0.1%	0%	199,207	92.5%	92.5%	0.0
301	n_channel_type_channel_corporate_sales	float64	1.7 MB	20	<0.1%	0%	213,745	99.3%	99.3%	0.0
302	n_channel_type_contact_center	float64	1.7 MB	19	<0.1%	0%	187,077	86.9%	86.9%	0.0
303	n_channel_type_countrywide	float64	1.7 MB	34	<0.1%	0%	78,922	36.7%	36.7%	1.0
304	n_channel_type_regional_and_local	float64	1.7 MB	19	<0.1%	0%	169,784	78.9%	78.9%	0.0
305	n_channel_type_stone	float64	1.7 MB	22	<0.1%	0%	133,139	61.9%	61.9%	0.0
306	n_client_type_new	float64	1.7 MB	14	<0.1%	0%	165,520	76.9%	76.9%	1.0
307	n_client_type_refreshed	float64	1.7 MB	23	<0.1%	0%	161,564	75.1%	75.1%	0.0
308	n_client_type_repeater	float64	1.7 MB	61	<0.1%	0%	49,122	22.8%	22.8%	0.0
309	n_consumer_loans	float64	1.7 MB	36	<0.1%	0%	78,331	36.4%	36.4%	1.0
310	n_contract_status_refused	float64	1.7 MB	44	<0.1%	0%	144,850	67.3%	67.3%	0.0
311	n_contract_status_unused_offer	float64	1.7 MB	11	<0.1%	0%	202,009	93.8%	93.8%	0.0
312	n_contracts_credit_card_completed	float64	1.7 MB	40	<0.1%	0%	207,783	96.5%	96.5%	0.0
313	n_credit_card_credits	float64	1.7 MB	22	<0.1%	0%	91,194	42.4%	42.4%	1.0
314	n_credits_active	float64	1.7 MB	22	<0.1%	0%	71,863	33.4%	33.4%	2.0
315	n_credits_sold	float64	1.7 MB	7	<0.1%	0%	211,547	98.3%	98.3%	0.0
316	n_credits_total	float64	1.7 MB	57	<0.1%	0%	51,153	23.8%	23.8%	4.0
317	n_currency_2	float64	1.7 MB	7	<0.1%	0%	214,671	99.7%	99.7%	0.0
318	n_different_channels	float64	1.7 MB	7	<0.1%	0%	90,541	42.1%	42.1%	2.0
319	n_different_contract_types	float64	1.7 MB	4	<0.1%	0%	89,430	41.5%	41.5%	2.0
320	n_different_credit_types	float64	1.7 MB	5	<0.1%	0%	131,569	61.1%	61.1%	2.0
321	n_different_currencies	float64	1.7 MB	3	<0.1%	0%	214,601	99.7%	99.7%	1.0
322	n_installments_late	float64	1.7 MB	99	<0.1%	0%	95,670	44.4%	44.4%	0.0
323	n_installments_late_30	float64	1.7 MB	42	<0.1%	0%	201,997	93.8%	93.8%	0.0
324	n_installments_late_7	float64	1.7 MB	59	<0.1%	0%	158,592	73.7%	73.7%	0.0
325	n_installments_total	float64	1.7 MB	310	0.1%	0%	14,007	6.5%	6.5%	25.0
326	n_microloans	float64	1.7 MB	28	<0.1%	0%	212,811	98.9%	98.9%	0.0
327	n_mortgages	float64	1.7 MB	7	<0.1%	0%	205,270	95.4%	95.4%	0.0
328	n_nflag_insured_on_approval_mean	float64	1.7 MB	102	<0.1%	0%	95,675	44.4%	44.4%	0.0
329	n_nflag_insured_on_approval_sum	float64	1.7 MB	19	<0.1%	0%	96,596	44.9%	44.9%	0.0
330	n_other_type_credit	float64	1.7 MB	9	<0.1%	0%	213,209	99.0%	99.0%	0.0
331	n_payment_type_cash_through_bank	float64	1.7 MB	44	<0.1%	0%	54,943	25.5%	25.5%	1.0
332	n_payment_type_not_available	float64	1.7 MB	46	<0.1%	0%	71,796	33.4%	33.4%	0.0
333	n_previous_credit_card_applications	float64	1.7 MB	126	0.1%	0%	155,013	72.0%	72.0%	21.0
334	n_previous_credit_card_applications_signed	float64	1.7 MB	37	<0.1%	0%	212,249	98.6%	98.6%	0.0
335	n_previous_pos_applications	float64	1.7 MB	221	0.1%	0%	16,495	7.7%	7.7%	22.0
336	n_previous_pos_applications_completed	float64	1.7 MB	45	<0.1%	0%	73,226	34.0%	34.0%	1.0
337	n_previous_pos_applications_signed	float64	1.7 MB	31	<0.1%	0%	174,587	81.1%	81.1%	0.0
338	n_product_type_walk_in	float64	1.7 MB	28	<0.1%	0%	164,239	76.3%	76.3%	0.0
339	n_reject_reason_limit	float64	1.7 MB	22	<0.1%	0%	195,275	90.7%	90.7%	0.0
340	n_reject_reason_scoc	float64	1.7 MB	20	<0.1%	0%	200,014	92.9%	92.9%	0.0
341	n_reject_reason_scofr	float64	1.7 MB	16	<0.1%	0%	210,511	97.8%	97.8%	0.0
342	n_revolving_loans	float64	1.7 MB	25	<0.1%	0%	142,248	66.1%	66.1%	0.0
343	n_yield_group_high	float64	1.7 MB	30	<0.1%	0%	89,153	41.4%	41.4%	0.0
344	n_yield_group_low_action	float64	1.7 MB	22	<0.1%	0%	174,871	81.2%	81.2%	0.0
345	n_yield_group_low_normal	float64	1.7 MB	23	<0.1%	0%	94,724	44.0%	44.0%	0.0
346	n_yield_group_middle	float64	1.7 MB	25	<0.1%	0%	80,132	37.2%	37.2%	1.0
347	ord_education_type	float64	1.7 MB	5	<0.1%	0%	152,993	71.1%	71.1%	1.0
348	percent_installments_early	float64	1.7 MB	7,892	3.7%	0%	64,688	30.1%	30.1%	1.0
349	percent_installments_late	float64	1.7 MB	4,464	2.1%	0%	95,670	44.4%	44.4%	0.0
350	percent_installments_late_30	float64	1.7 MB	894	0.4%	0%	201,997	93.8%	93.8%	0.0
351	percent_installments_late_60	float64	1.7 MB	629	0.3%	0%	209,180	97.2%	97.2%	0.0
352	percent_installments_late_7	float64	1.7 MB	2,595	1.2%	0%	158,592	73.7%	73.7%	0.0
353	rate_down_payment_max	float64	1.7 MB	84,884	39.4%	0%	53,725	25.0%	25.0%	0.0
354	rate_down_payment_range	float64	1.7 MB	73,616	34.2%	0%	94,887	44.1%	44.1%	0.0
355	rate_interest_privileged_count	float64	1.7 MB	4	<0.1%	0%	212,016	98.5%	98.5%	0.0
356	sk_dpd_credit_card_max	float64	1.7 MB	353	0.2%	0%	202,632	94.1%	94.1%	0.0
357	sk_dpd_credit_card_median	float64	1.7 MB	222	0.1%	0%	214,704	99.7%	99.7%	0.0
358	sk_dpd_def_credit_card_max	float64	1.7 MB	47	<0.1%	0%	204,810	95.1%	95.1%	0.0
359	sk_dpd_def_pos_applications_max	float64	1.7 MB	173	0.1%	0%	187,187	87.0%	87.0%	0.0
360	sk_dpd_pos_applications_max	float64	1.7 MB	1,595	0.7%	0%	176,902	82.2%	82.2%	0.0
361	years_employed	float64	1.7 MB	11,769	5.5%	0%	38,801	18.0%	18.0%	4.517808219178082

Code

# Save to file
file_path = dir_interim + "colnames--cols_to_keep_after_preprocessing.csv"
credits_train_transformed_not_correlated_col_info.column.to_csv(file_path, index=False)

# Load from file (to check)
cols_to_keep_after_preprocessing = pd.read_csv(file_path).column.tolist()
del file_path

Code

# Clean up a bit
del (
    credits_train,
    credits_train_transformed,
    credits_train_transformed_not_correlated_cols,
)

6.2 Train, Validation, and Test Sets

In this section, the training, validation, and test sets will be created by merging datasets and applying the pre-processing steps created in the previous sections. The results will be cached to avoid repeating the same steps in the future.

Code

file = dir_interim + "merged-selected--credit_train.feather"

if os.path.exists(file):
    credit_train = pd.read_feather(file)
else:
    credit_train = (
        merge_credit_history(to=application_train)
        .pipe(klib.convert_datatypes)
        .pipe(preprocess_credit_data)
        .loc[:, cols_to_include_in_preprocessing + ["TARGET"]]
    )
    credit_train.to_feather(file)

X_credit_train = credit_train.drop(columns=["TARGET"])
y_credit_train = credit_train["TARGET"]

del file

Code

X_credit_train.shape

(215257, 251)

Code

file = dir_interim + "merged-selected--credit_validation.feather"

if os.path.exists(file):
    credit_validation = pd.read_feather(file)
else:
    credit_validation = (
        merge_credit_history(to=application_validation)
        .pipe(klib.convert_datatypes)
        .pipe(preprocess_credit_data)
        .loc[:, cols_to_include_in_preprocessing + ["TARGET"]]
    )
    credit_validation.to_feather(file)

X_credit_validation = credit_validation.drop(columns=["TARGET"])
y_credit_validation = credit_validation["TARGET"]

del file

Code

X_credit_validation.shape

(46127, 251)

Code

file = dir_interim + "merged-selected--credit_test.feather"

if os.path.exists(file):
    credit_test = pd.read_feather(file)
else:
    credit_test = (
        merge_credit_history(to=application_test)
        .pipe(klib.convert_datatypes)
        .pipe(preprocess_credit_data)
        .loc[:, cols_to_include_in_preprocessing + ["TARGET"]]
    )
    credit_test.to_feather(file)

X_credit_test = credit_test.drop(columns=["TARGET"])
y_credit_test = credit_test["TARGET"]

del file

Code

X_credit_test.shape

(46127, 251)
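
The three cells above share one cache-or-compute pattern: load the file if it exists, otherwise build the data and save it. A minimal sketch of that pattern as a reusable helper is given below. The name `load_or_compute` is hypothetical, and `pickle` is used here only to keep the sketch dependency-free, whereas the cells above use the feather format.

```python
import os
import pickle
import tempfile


def load_or_compute(path, compute):
    """Return the cached result at `path`, or compute, cache, and return it.

    Hypothetical helper (not part of the project code);
    `compute` is any zero-argument callable.
    """
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    result = compute()
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result


# Toy computation that records how many times it actually runs
calls = []


def expensive_step():
    calls.append(1)
    return {"rows": 3, "cols": 2}


cache_file = os.path.join(tempfile.mkdtemp(), "toy-cache.pickle")
first = load_or_compute(cache_file, expensive_step)  # computed and saved
second = load_or_compute(cache_file, expensive_step)  # read from the cache
```

With such a helper, each of the three datasets could be prepared by passing a function that performs the merge and pre-processing, which would keep the caching logic in one place.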

7 Modeling (w/ Historical Data)

In this section, models based on application data and historical data will be trained and evaluated.

The steps are similar to those in the section Modeling (w/o Historical Data), so most of them will not be commented on.

7.1 Train Full Model

Let’s start with the model that employs all 361 features that are left after the feature filtering step.

Code

if "models" not in locals():
    models = {}


@my.cache_results(dir_interim + "task-2-w-credit-history--01_lgbm.pickle")
def fit_lgbm_extended():
    """Fit an LGBM model on application and credit history data."""
    pipeline = Pipeline(
        steps=[
            ("selector_1", ColumnSelector(cols_to_include_in_preprocessing)),
            ("preprocessor_2", clone(pre_processing)),
            ("selector_2", ColumnSelector(cols_to_keep_after_preprocessing)),
            ("classifier", clone(lgbm_classifier)),
        ]
    )
    pipeline.fit(X_credit_train, y_credit_train)
    return pipeline


# Time: (1-2 minutes)
models["LGBM (FULL | 361 feat.)"] = fit_lgbm_extended()
models["LGBM (FULL | 361 feat.)"]

7.2 Evaluate Models

The validation performance of the model that uses historical credit data is slightly better (ROC AUC = 0.778) than that of the best model that does not use historical data (ROC AUC = 0.759). However, the difference is small (only 0.019).

Code

print("--- Train ---")

ml.classification_scores(
    models,
    X_credit_train,
    y_credit_train,
    color="orange",
    sort_by="ROC_AUC",
)

--- Train ---

	n	No_info_rate	Accuracy	BAcc	BAcc_01	F1	F1_neg	TPR	TNR	PPV	NPV	ROC_AUC
LGBM (FULL | 361 feat.)	215257	0.919	0.737	0.748	0.495	0.318	0.837	0.760	0.735	0.201	0.972	0.827

Code

print("--- Validation ---")

ml.classification_scores(
    models,
    X_credit_validation,
    y_credit_validation,
    sort_by="ROC_AUC",
)

--- Validation ---

	n	No_info_rate	Accuracy	BAcc	BAcc_01	F1	F1_neg	TPR	TNR	PPV	NPV	ROC_AUC
LGBM (FULL | 361 feat.)	46127	0.919	0.724	0.704	0.409	0.285	0.829	0.680	0.728	0.180	0.963	0.778
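
To make the columns of the table above easier to read, the sketch below computes them from the four cells of a binary confusion matrix. The function name and the toy counts are illustrative and are not taken from the project code. Two definitions are assumptions: No_info_rate is taken to be the share of the majority class, and BAcc_01 is taken to be (BAcc - 0.5) / 0.5. The latter agrees with the reported values up to rounding: TPR = 0.680 and TNR = 0.728 give BAcc = 0.704 and BAcc_01 of about 0.408, against the reported 0.409.

```python
def scores_from_confusion(tn, fp, fn, tp):
    """Compute classification scores from binary confusion-matrix counts.

    Illustrative sketch: assumes every denominator is non-zero.
    """
    n = tn + fp + fn + tp
    tpr = tp / (tp + fn)  # true positive rate (recall)
    tnr = tn / (tn + fp)  # true negative rate
    ppv = tp / (tp + fp)  # positive predictive value (precision)
    npv = tn / (tn + fn)  # negative predictive value
    bacc = (tpr + tnr) / 2  # balanced accuracy
    return {
        # Assumption: accuracy of always predicting the majority class
        "No_info_rate": max(tn + fp, tp + fn) / n,
        "Accuracy": (tp + tn) / n,
        "BAcc": bacc,
        # Assumption: rescaling so that 0 means chance level and 1 is perfect
        "BAcc_01": (bacc - 0.5) / 0.5,
        "F1": 2 * ppv * tpr / (ppv + tpr),
        "F1_neg": 2 * npv * tnr / (npv + tnr),
        "TPR": tpr,
        "TNR": tnr,
        "PPV": ppv,
        "NPV": npv,
    }


# Toy counts: 100 cases, 10 of them positive
scores = scores_from_confusion(tn=70, fp=20, fn=4, tp=6)
```

On imbalanced data such as this, Accuracy can fall below No_info_rate while BAcc is still well above 0.5, which is why the balanced metrics and ROC AUC are the more informative columns.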

Code

sns.set_style("white")
y_pred_validation_lgbm = models["LGBM (FULL | 361 feat.)"].predict(X_credit_validation)
ml.plot_confusion_matrices(y_credit_validation, y_pred_validation_lgbm, figsize=(13, 4));

7.3 Feature Importance

Feature importance analysis revealed that the 6 most important features come from, or are derived from, the application table only. Only the 7th most important feature is based on historical data.

Note. Feature names in CAPITALS are original features from the application table. Feature names in lowercase are derived or extracted features, built either from the application table or from the credit history tables.

Find the details below.

Code

@my.cache_results(dir_interim + "task-2--shap_lgbm_k=all.pickle")
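def get_shap_values_lgbm_extended():
    """Compute SHAP values of the full LGBM model on the validation set."""
    model = "LGBM (FULL | 361 feat.)"
    # All pipeline steps except the final classifier
    preproc = Pipeline(steps=models[model].steps[:-1])
    classifier = models[model]["classifier"]
    # TreeExplainer needs the data in the form the classifier itself receives
    X_validation_preproc = preproc.transform(X_credit_validation)

    tree_explainer = shap.TreeExplainer(classifier)
    shap_values = tree_explainer.shap_values(X_validation_preproc)
    return shap_values, X_validation_preproc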


shap_values_lgbm_ext, data_for_lgbm_ext = get_shap_values_lgbm_extended()

Code

# Mean |SHAP| per feature: average over classes (axis 0), then over samples
vals = np.abs(shap_values_lgbm_ext).mean(0).mean(0)
feature_importance_ext = (
    pd.DataFrame(
        list(zip(data_for_lgbm_ext.columns, vals)),
        columns=["col_name", "importance"],
    )
    .sort_values(by=["importance"], ascending=False)
    .reset_index()
)
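
The ranking above boils down to averaging absolute SHAP values per feature. A dependency-free sketch of the same idea on made-up numbers follows. The function name and the toy values are hypothetical, and the input layout (one samples-by-features table per class) is an assumption about what `shap_values` returned in this project.

```python
def mean_abs_shap(shap_per_class, feature_names):
    """Rank features by mean |SHAP| over all classes and samples.

    Hypothetical helper; `shap_per_class` is a list (one item per class)
    of row lists, each row holding one SHAP value per feature.
    """
    n_rows = sum(len(rows) for rows in shap_per_class)
    totals = [0.0] * len(feature_names)
    for rows in shap_per_class:
        for row in rows:
            for j, value in enumerate(row):
                totals[j] += abs(value)
    importance = [
        (name, total / n_rows) for name, total in zip(feature_names, totals)
    ]
    # Most important feature first
    return sorted(importance, key=lambda item: item[1], reverse=True)


# Made-up SHAP values: 2 classes x 2 samples x 3 features.
# In this toy input the class-1 values mirror the class-0 values.
toy_shap = [
    [[-0.5, 0.25, 0.0], [0.5, -0.25, 0.25]],  # class 0
    [[0.5, -0.25, 0.0], [-0.5, 0.25, -0.25]],  # class 1
]
ranking = mean_abs_shap(toy_shap, ["feat_a", "feat_b", "feat_c"])
```

Because every class contributes the same number of rows, averaging over classes and then over samples gives the same result as this single average over all rows.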

Code

sns.set_style("whitegrid")
lgb.plot_importance(
    models["LGBM (FULL | 361 feat.)"]["classifier"],
    max_num_features=50,
    figsize=(10, 10),
    height=0.8,
    title="LGBM Feature Importance",
);

Code

sns.set_style("whitegrid")
shap.summary_plot(
    shap_values_lgbm_ext[1],
    data_for_lgbm_ext,
    plot_type="bar",
    max_display=110,
    plot_size=(10, 15),
    show=False,
)
plt.tick_params(axis="y", labelsize=8)
plt.title("SHAP Feature Importance", fontsize=12)
plt.xlabel("Mean (|SHAP value|) \n(impact on model output)", fontsize=10);

Code

sns.set_style("whitegrid")
shap.summary_plot(
    shap_values_lgbm_ext[1],
    data_for_lgbm_ext,
    max_display=50,
    plot_size=(10, 9),
    show=False,
)
plt.title("SHAP Feature Importance", fontsize=12)
plt.tick_params(axis="y", labelsize=8)
plt.tick_params(axis="x", labelsize=8)

feature_importance_ext.style.format(precision=6)

Table 7.1. Mean absolute SHAP values (feature importance) of the features for the LGBM model.

	index	col_name	importance
0	19	EXT_SOURCE_2	0.334729
1	20	EXT_SOURCE_3	0.283458
2	18	EXT_SOURCE_1	0.152720
3	174	amt_annuity_to_credit_ratio	0.109900
4	360	years_employed	0.104907
5	346	ord_education_type	0.078970
6	348	percent_installments_late	0.078802
7	0	AMT_ANNUITY	0.076940
8	188	amt_credit_sum_debt_mean	0.058354
9	309	n_contract_status_refused	0.056706
10	197	amt_down_payment_max	0.050801
11	236	cnt_payment_median	0.050325
12	334	n_previous_pos_applications	0.049682
13	171	amt_annuity_median_previous_application	0.049286
14	233	cnt_installment_range	0.045852
15	265	diff_amt_installment_payment_mean	0.044301
16	347	percent_installments_early	0.043852
17	153	OWN_CAR_AGE	0.041045
18	239	days_credit_enddate_max	0.040445
19	342	n_yield_group_high	0.040265
20	259	days_last_due_1st_version_median	0.039270
21	238	cnt_payment_range	0.039066
22	253	days_enddate_fact_max	0.039035
23	183	amt_credit_max_overdue_max	0.036707
24	343	n_yield_group_low_action	0.036122
25	227	cnt_fam_members_excluding_children	0.035529
26	257	days_last_due_1st_version_max	0.034028
27	49	NAME_CONTRACT_TYPE_Cash_loans	0.030754
28	241	days_credit_max	0.030463
29	65	NAME_INCOME_TYPE_Working	0.028907
30	15	DEF_30_CNT_SOCIAL_CIRCLE	0.028700
31	176	amt_annuity_to_income_ratio	0.028551
32	1	AMT_CREDIT	0.028165
33	27	FLAG_DOCUMENT_3	0.027166
34	286	missingindicator_amt_credit_sum_limit_std	0.026530
35	201	amt_drawings_current_mean	0.026340
36	242	days_credit_median	0.024028
37	85	OCCUPATION_TYPE_Laborers	0.023403
38	177	amt_balance_credit_card_max	0.023300
39	279	missingindicator_EXT_SOURCE_1	0.023174
40	12	DAYS_ID_PUBLISH	0.023152
41	155	REGION_RATING_CLIENT	0.023097
42	13	DAYS_LAST_PHONE_CHANGE	0.022974
43	261	days_last_due_max	0.022880
44	324	n_installments_total	0.022650
45	173	amt_annuity_min_previous_application	0.021348
46	14	DAYS_REGISTRATION	0.021207
47	344	n_yield_group_low_normal	0.019307
48	351	percent_installments_late_7	0.018375
49	6	AMT_REQ_CREDIT_BUREAU_QRT	0.018232
50	272	diff_days_installment_payment_sum	0.017944
51	252	days_decision_range	0.017601
52	235	cnt_installments_diff_range	0.017023
53	192	amt_credit_sum_median	0.016875
54	81	OCCUPATION_TYPE_Drivers	0.016874
55	156	REG_CITY_NOT_LIVE_CITY	0.015121
56	263	days_termination_min	0.015035
57	352	rate_down_payment_max	0.014277
58	274	diff_percent_installment_payment_mean	0.014275
59	101	ORGANIZATION_TYPE_Business_Entity_Type_3	0.013745
60	326	n_mortgages	0.013138
61	358	sk_dpd_def_pos_applications_max	0.012471
62	196	amt_credit_to_income_ratio	0.012435
63	35	FLAG_OWN_CAR	0.012376
64	313	n_credits_active	0.012074
65	138	ORGANIZATION_TYPE_Self_employed	0.011357
66	167	YEARS_BEGINEXPLUATATION_MODE	0.009927
67	195	amt_credit_sum_sum	0.009862
68	260	days_last_due_1st_version_min	0.009566
69	33	FLAG_EMP_PHONE	0.009426
70	80	OCCUPATION_TYPE_Core_staff	0.009391
71	255	days_enddate_fact_range	0.008932
72	194	amt_credit_sum_std	0.008921
73	248	days_credit_update_median	0.008893
74	178	amt_balance_credit_card_median	0.008372
75	164	WALLSMATERIAL_MODE_Panel	0.008108
76	283	missingindicator_amt_credit_max_overdue_max	0.008048
77	154	REGION_POPULATION_RELATIVE	0.007998
78	182	amt_credit_max	0.007632
79	175	amt_annuity_to_income_per_family_member	0.007558
80	337	n_product_type_walk_in	0.007203
81	273	diff_days_installment_payment_sum_late_only	0.007141
82	325	n_microloans	0.007115
83	221	cnt_drawings_atm_current_max	0.007105
84	207	amt_goods_price_min	0.006913
85	103	ORGANIZATION_TYPE_Construction	0.006912
86	191	amt_credit_sum_limit_sum	0.006807
87	38	FLAG_WORK_PHONE	0.006806
88	335	n_previous_pos_applications_completed	0.006764
89	288	missingindicator_bureau_months_balance_max	0.006737
90	232	cnt_installment_min	0.006655
91	251	days_decision_median	0.006493
92	37	FLAG_PHONE	0.006251
93	209	amt_payment_current_median	0.006012
94	39	FLOORSMAX_MEDI	0.006009
95	290	missingindicator_days_credit_enddate_std	0.005797
96	77	OCCUPATION_TYPE_Accountants	0.005781
97	199	amt_drawings_atm_current_median	0.005756
98	269	diff_days_installment_payment_mean	0.005724
99	353	rate_down_payment_range	0.005708
100	180	amt_credit_limit_actual_median	0.005691
101	179	amt_balance_credit_card_min	0.005593
102	189	amt_credit_sum_debt_sum	0.005370
103	338	n_reject_reason_limit	0.005277
104	249	days_credit_update_range	0.005114
105	276	diff_percent_installment_payment_min	0.004991
106	271	diff_days_installment_payment_range	0.004699
107	267	diff_amt_installment_payment_range	0.004486
108	245	days_credit_range	0.004452
109	169	amt_annuity_max	0.004424
110	250	days_decision_max	0.004413
111	258	days_last_due_1st_version_mean	0.004406
112	281	missingindicator_EXT_SOURCE_3	0.004364
113	247	days_credit_update_max	0.004330
114	16	ELEVATORS_AVG	0.004316
115	211	amt_payment_current_range	0.004173
116	186	amt_credit_min	0.004173
117	2	AMT_INCOME_TOTAL	0.004114
118	300	n_channel_type_channel_corporate_sales	0.004060
119	315	n_credits_total	0.004046
120	135	ORGANIZATION_TYPE_School	0.003835
121	305	n_client_type_new	0.003779
122	327	n_nflag_insured_on_approval_mean	0.003658
123	291	missingindicator_days_enddate_fact_range	0.003585
124	268	diff_days_installment_payment_max	0.003481
125	187	amt_credit_range	0.003452
126	240	days_credit_enddate_min	0.003372
127	184	amt_credit_max_overdue_range	0.003139
128	185	amt_credit_median	0.002895
129	262	days_termination_median	0.002754
130	62	NAME_INCOME_TYPE_State_servant	0.002689
131	229	cnt_installment_mature_cum_max	0.002672
132	264	diff_amt_installment_payment_max	0.002671
133	17	ENTRANCES_MODE	0.002638
134	86	OCCUPATION_TYPE_Low_skill_Laborers	0.002633
135	224	cnt_drawings_pos_current_max	0.002612
136	321	n_installments_late	0.002576
137	168	YEARS_BUILD_AVG	0.002534
138	9	BASEMENTAREA_MODE	0.002169
139	48	LANDAREA_MEDI	0.002018
140	11	COMMONAREA_MEDI	0.001994
141	299	n_channel_type_ap_minus	0.001951
142	60	NAME_INCOME_TYPE_Commercial_associate	0.001926
143	256	days_first_draw_min	0.001872
144	76	OBS_30_CNT_SOCIAL_CIRCLE	0.001866
145	204	amt_drawings_pos_current_max	0.001863
146	172	amt_annuity_min	0.001810
147	127	ORGANIZATION_TYPE_Military	0.001797
148	340	n_reject_reason_scofr	0.001793
149	301	n_channel_type_contact_center	0.001719
150	98	ORGANIZATION_TYPE_Bank	0.001705
151	270	diff_days_installment_payment_median	0.001700
152	93	OCCUPATION_TYPE_Security_staff	0.001660
153	124	ORGANIZATION_TYPE_Kindergarten	0.001577
154	8	AMT_REQ_CREDIT_BUREAU_YEAR	0.001543
155	10	CNT_FAM_MEMBERS	0.001543
156	284	missingindicator_amt_credit_sum_debt_mean	0.001454
157	304	n_channel_type_stone	0.001438
158	285	missingindicator_amt_credit_sum_limit_min	0.001342
159	193	amt_credit_sum_overdue_sum	0.001334
160	266	diff_amt_installment_payment_median	0.001291
161	75	NONLIVINGAREA_MODE	0.001231
162	298	n_car_loans	0.001193
163	277	diff_percent_installment_payment_range	0.001146
164	228	cnt_installment_future_min	0.001089
165	150	ORGANIZATION_TYPE_Transport_type_3	0.001076
166	122	ORGANIZATION_TYPE_Industry_type_9	0.001053
167	142	ORGANIZATION_TYPE_Trade_type_2	0.000975
168	246	days_credit_std	0.000886
169	345	n_yield_group_middle	0.000860
170	306	n_client_type_refreshed	0.000839
171	331	n_payment_type_not_available	0.000825
172	198	amt_drawings_atm_current_max	0.000804
173	355	sk_dpd_credit_card_max	0.000761
174	231	cnt_installment_median	0.000751
175	40	FLOORSMIN_MEDI	0.000716
176	312	n_credit_card_credits	0.000665
177	359	sk_dpd_pos_applications_max	0.000651
178	311	n_contracts_credit_card_completed	0.000636
179	210	amt_payment_current_min	0.000606
180	170	amt_annuity_median	0.000594
181	275	diff_percent_installment_payment_median	0.000547
182	293	mode_credit_type_Consumer_credit	0.000520
183	130	ORGANIZATION_TYPE_Police	0.000480
184	23	FLAG_DOCUMENT_13	0.000465
185	36	FLAG_OWN_REALTY	0.000456
186	165	WALLSMATERIAL_MODE_Stone_brick	0.000455
187	74	NONLIVINGAPARTMENTS_AVG	0.000438
188	26	FLAG_DOCUMENT_18	0.000433
189	5	AMT_REQ_CREDIT_BUREAU_MON	0.000426
190	303	n_channel_type_regional_and_local	0.000414
191	296	mode_credit_type_Mortgage	0.000363
192	57	NAME_HOUSING_TYPE_Rented_apartment	0.000362
193	219	cnt_credit_prolong_mean	0.000323
194	330	n_payment_type_cash_through_bank	0.000318
195	254	days_enddate_fact_median	0.000284
196	237	cnt_payment_min	0.000279
197	56	NAME_HOUSING_TYPE_Office_apartment	0.000269
198	339	n_reject_reason_scoc	0.000251
199	206	amt_drawings_pos_current_min	0.000246
200	350	percent_installments_late_60	0.000242
201	132	ORGANIZATION_TYPE_Realtor	0.000232
202	349	percent_installments_late_30	0.000226
203	310	n_contract_status_unused_offer	0.000216
204	181	amt_credit_limit_actual_range	0.000209
205	217	bureau_dpd_status_median	0.000203
206	30	FLAG_DOCUMENT_8	0.000196
207	212	amt_payment_total_current_min	0.000173
208	50	NAME_EDUCATION_TYPE_Academic_degree	0.000171
209	157	REG_CITY_NOT_WORK_CITY	0.000165
210	243	days_credit_overdue_max	0.000164
211	95	OCCUPATION_TYPE_nan	0.000152
212	203	amt_drawings_other_current_max	0.000141
213	341	n_revolving_loans	0.000127
214	308	n_consumer_loans	0.000127
215	125	ORGANIZATION_TYPE_Legal_Services	0.000101
216	218	bureau_months_balance_max	0.000094
217	190	amt_credit_sum_limit_min	0.000090
218	323	n_installments_late_7	0.000089
219	147	ORGANIZATION_TYPE_Trade_type_7	0.000087
220	91	OCCUPATION_TYPE_Sales_staff	0.000079
221	356	sk_dpd_credit_card_median	0.000079
222	34	FLAG_IS_EMERGENCY	0.000077
223	67	NAME_TYPE_SUITE_Family	0.000075
224	230	cnt_installment_mature_cum_min	0.000069
225	7	AMT_REQ_CREDIT_BUREAU_WEEK	0.000064
226	161	WALLSMATERIAL_MODE_Mixed	0.000057
227	79	OCCUPATION_TYPE_Cooking_staff	0.000056
228	149	ORGANIZATION_TYPE_Transport_type_2	0.000033
229	333	n_previous_credit_card_applications_signed	0.000027
230	21	FLAG_CONT_MOBILE	0.000000
231	31	FLAG_DOCUMENT_9	0.000000
232	320	n_different_currencies	0.000000
233	282	missingindicator_YEARS_BUILD_AVG	0.000000
234	47	HOUSETYPE_MODE_terraced_house	0.000000
235	332	n_previous_credit_card_applications	0.000000
236	319	n_different_credit_types	0.000000
237	280	missingindicator_EXT_SOURCE_2	0.000000
238	322	n_installments_late_30	0.000000
239	51	NAME_EDUCATION_TYPE_Incomplete_higher	0.000000
240	287	missingindicator_amt_down_payment_max	0.000000
241	354	rate_interest_privileged_count	0.000000
242	278	missingindicator_DEF_30_CNT_SOCIAL_CIRCLE	0.000000
243	357	sk_dpd_def_credit_card_max	0.000000
244	329	n_other_type_credit	0.000000
245	32	FLAG_EMAIL	0.000000
246	52	NAME_EDUCATION_TYPE_Lower_secondary	0.000000
247	42	FONDKAPREMONT_MODE_org_spec_account	0.000000
248	317	n_different_channels	0.000000
249	318	n_different_contract_types	0.000000
250	3	AMT_REQ_CREDIT_BUREAU_DAY	0.000000
251	307	n_client_type_repeater	0.000000
252	41	FONDKAPREMONT_MODE_not_specified	0.000000
253	302	n_channel_type_countrywide	0.000000
254	336	n_previous_pos_applications_signed	0.000000
255	43	FONDKAPREMONT_MODE_reg_oper_account	0.000000
256	53	NAME_HOUSING_TYPE_Co_op_apartment	0.000000
257	24	FLAG_DOCUMENT_14	0.000000
258	297	mode_credit_type_Other	0.000000
259	28	FLAG_DOCUMENT_5	0.000000
260	29	FLAG_DOCUMENT_6	0.000000
261	314	n_credits_sold	0.000000
262	295	mode_credit_type_Microloan	0.000000
263	294	mode_credit_type_Credit_card	0.000000
264	316	n_currency_2	0.000000
265	22	FLAG_DOCUMENT_11	0.000000
266	292	mode_credit_type_Car_loan	0.000000
267	44	FONDKAPREMONT_MODE_reg_oper_spec_account	0.000000
268	45	HOUSETYPE_MODE_nan	0.000000
269	289	missingindicator_cnt_installment_range	0.000000
270	25	FLAG_DOCUMENT_16	0.000000
271	46	HOUSETYPE_MODE_specific_housing	0.000000
272	328	n_nflag_insured_on_approval_sum	0.000000
273	128	ORGANIZATION_TYPE_Mobile	0.000000
274	54	NAME_HOUSING_TYPE_House_apartment	0.000000
275	115	ORGANIZATION_TYPE_Industry_type_2	0.000000
276	121	ORGANIZATION_TYPE_Industry_type_8	0.000000
277	166	WALLSMATERIAL_MODE_Wooden	0.000000
278	120	ORGANIZATION_TYPE_Industry_type_7	0.000000
279	119	ORGANIZATION_TYPE_Industry_type_6	0.000000
280	118	ORGANIZATION_TYPE_Industry_type_5	0.000000
281	117	ORGANIZATION_TYPE_Industry_type_4	0.000000
282	4	AMT_REQ_CREDIT_BUREAU_HOUR	0.000000
283	116	ORGANIZATION_TYPE_Industry_type_3	0.000000
284	114	ORGANIZATION_TYPE_Industry_type_13	0.000000
285	162	WALLSMATERIAL_MODE_Monolithic	0.000000
286	113	ORGANIZATION_TYPE_Industry_type_12	0.000000
287	112	ORGANIZATION_TYPE_Industry_type_11	0.000000
288	111	ORGANIZATION_TYPE_Industry_type_10	0.000000
289	110	ORGANIZATION_TYPE_Industry_type_1	0.000000
290	109	ORGANIZATION_TYPE_Housing	0.000000
291	108	ORGANIZATION_TYPE_Hotel	0.000000
292	107	ORGANIZATION_TYPE_Government	0.000000
293	106	ORGANIZATION_TYPE_Emergency	0.000000
294	163	WALLSMATERIAL_MODE_Others	0.000000
295	160	WALLSMATERIAL_MODE_Block	0.000000
296	104	ORGANIZATION_TYPE_Culture	0.000000
297	141	ORGANIZATION_TYPE_Trade_type_1	0.000000
298	131	ORGANIZATION_TYPE_Postal	0.000000
299	133	ORGANIZATION_TYPE_Religion	0.000000
300	134	ORGANIZATION_TYPE_Restaurant	0.000000
301	136	ORGANIZATION_TYPE_Security	0.000000
302	137	ORGANIZATION_TYPE_Security_Ministries	0.000000
303	126	ORGANIZATION_TYPE_Medicine	0.000000
304	139	ORGANIZATION_TYPE_Services	0.000000
305	140	ORGANIZATION_TYPE_Telecom	0.000000
306	143	ORGANIZATION_TYPE_Trade_type_3	0.000000
307	159	REG_REGION_NOT_WORK_REGION	0.000000
308	144	ORGANIZATION_TYPE_Trade_type_4	0.000000
309	145	ORGANIZATION_TYPE_Trade_type_5	0.000000
310	146	ORGANIZATION_TYPE_Trade_type_6	0.000000
311	148	ORGANIZATION_TYPE_Transport_type_1	0.000000
312	151	ORGANIZATION_TYPE_Transport_type_4	0.000000
313	152	ORGANIZATION_TYPE_University	0.000000
314	123	ORGANIZATION_TYPE_Insurance	0.000000
315	158	REG_REGION_NOT_LIVE_REGION	0.000000
316	105	ORGANIZATION_TYPE_Electricity	0.000000
317	102	ORGANIZATION_TYPE_Cleaning	0.000000
318	55	NAME_HOUSING_TYPE_Municipal_apartment	0.000000
319	71	NAME_TYPE_SUITE_Spouse_partner	0.000000
320	87	OCCUPATION_TYPE_Managers	0.000000
321	84	OCCUPATION_TYPE_IT_staff	0.000000
322	83	OCCUPATION_TYPE_High_skill_tech_staff	0.000000
323	82	OCCUPATION_TYPE_HR_staff	0.000000
324	78	OCCUPATION_TYPE_Cleaning_staff	0.000000
325	129	ORGANIZATION_TYPE_Other	0.000000
326	73	NAME_TYPE_SUITE_nan	0.000000
327	72	NAME_TYPE_SUITE_Unaccompanied	0.000000
328	70	NAME_TYPE_SUITE_Other_B	0.000000
329	88	OCCUPATION_TYPE_Medicine_staff	0.000000
330	69	NAME_TYPE_SUITE_Other_A	0.000000
331	68	NAME_TYPE_SUITE_Group_of_people	0.000000
332	66	NAME_TYPE_SUITE_Children	0.000000
333	64	NAME_INCOME_TYPE_Unemployed	0.000000
334	63	NAME_INCOME_TYPE_Student	0.000000
335	61	NAME_INCOME_TYPE_Maternity_leave	0.000000
336	59	NAME_INCOME_TYPE_Businessman	0.000000
337	58	NAME_HOUSING_TYPE_With_parents	0.000000
338	234	cnt_installments_diff_min	0.000000
339	89	OCCUPATION_TYPE_Private_service_staff	0.000000
340	100	ORGANIZATION_TYPE_Business_Entity_Type_2	0.000000
341	92	OCCUPATION_TYPE_Secretaries	0.000000
342	99	ORGANIZATION_TYPE_Business_Entity_Type_1	0.000000
343	97	ORGANIZATION_TYPE_Agriculture	0.000000
344	96	ORGANIZATION_TYPE_Advertising	0.000000
345	200	amt_drawings_atm_current_min	0.000000
346	202	amt_drawings_current_min	0.000000
347	205	amt_drawings_pos_current_mean	0.000000
348	94	OCCUPATION_TYPE_Waiters_barmen_staff	0.000000
349	208	amt_inst_min_regularity_min	0.000000
350	213	any_installments_late_30	0.000000
351	90	OCCUPATION_TYPE_Realty_agents	0.000000
352	214	any_installments_late_60	0.000000
353	215	any_installments_late_7	0.000000
354	216	bureau_dpd_status_max	0.000000
355	220	cnt_credit_prolong_sum	0.000000
356	222	cnt_drawings_current_min	0.000000
357	223	cnt_drawings_other_current_max	0.000000
358	225	cnt_drawings_pos_current_median	0.000000
359	226	cnt_drawings_pos_current_min	0.000000
360	244	days_credit_overdue_mean	0.000000

7.4 Training Models with Feature Selection

In this section, models are trained on smaller subsets of features. The subsets are defined by thresholds on the SHAP-based feature importance: only features with importance above the threshold are kept. The threshold values were chosen arbitrarily so that, in most cases, each step reduces the number of features by 20-40.

The model with 216 features (SHAP > 0.0001) shows the best validation performance in terms of ROC AUC (0.779) as well as several other metrics (e.g., BAcc and F1).
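The selection rule itself can be illustrated on a handful of rows. The sketch below is a toy example only: the importance values are copied from a few rows of the table above, and `select_features` is a hypothetical helper that mirrors the "importance > threshold" rule, not a function from the project code.

```python
# Toy illustration of threshold-based feature selection.
# Importance values are copied from a few rows of the SHAP importance table;
# `select_features` is a hypothetical helper, not part of the project code.
importance = {
    "amt_annuity_max": 0.004424,
    "AMT_INCOME_TOTAL": 0.004114,
    "n_credits_total": 0.004046,
    "YEARS_BUILD_AVG": 0.002534,
    "n_car_loans": 0.001193,
    "FLAG_OWN_REALTY": 0.000456,
    "n_consumer_loans": 0.000127,
    "FLAG_EMAIL": 0.000000,
}


def select_features(importance, threshold):
    """Return names of features whose importance is strictly above the threshold."""
    return [name for name, value in importance.items() if value > threshold]


for threshold in [0.0001, 0.001, 0.004]:
    kept = select_features(importance, threshold)
    print(f"Threshold: {threshold} | Number of features: {len(kept)}")
```

Because the comparison is strict, features with zero importance are always dropped, and raising the threshold can only shrink the selected set.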

Code

def fit_lgbm_ext_on_features(features):
    """Template to fit a LGBM model with a smaller number of features."""
    pipeline = Pipeline(
        steps=[
            ("selector_1", ColumnSelector(cols_to_include_in_preprocessing)),
            ("preprocessor_2", clone(pre_processing)),
            ("selector_2", ColumnSelector(features)),
            ("classifier", clone(lgbm_classifier)),
        ]
    )
    pipeline.fit(X_credit_train, y_credit_train)
    return pipeline


def get_feature_names_by_shap_thereshold(threshold):
    """Return names of features with SHAP importance above the threshold."""
    return feature_importance_ext.query(f"importance > {threshold}").col_name.to_list()


def n_features_by_shap_threshold(thresholds):
    """Print the number of features selected at each SHAP threshold."""
    for threshold in thresholds:
        n_features = len(get_feature_names_by_shap_thereshold(threshold))
        print(f"Threshold: {threshold} | Number of features: {n_features}")


def fit_lgbm_ext_with_shap_threshold(threshold):
    """Fit a LGBM model on features selected by a SHAP threshold.

    Returns a tuple: (model name, fitted pipeline).
    """
    features = get_feature_names_by_shap_thereshold(threshold)
    return f"LGBM ({len(features)} features)", fit_lgbm_ext_on_features(features)

Code

thresholds = [
    0.0001,
    0.0005,
    0.0010,
    0.0020,
    0.0040,
    0.0050,
    0.0070,
    0.0100,
    0.0200,
    0.0300,
    0.0400,
    0.0500,
    0.1000,
]

Code

n_features_by_shap_threshold(thresholds)

Threshold: 0.0001 | Number of features: 216
Threshold: 0.0005 | Number of features: 183
Threshold: 0.001 | Number of features: 167
Threshold: 0.002 | Number of features: 140
Threshold: 0.004 | Number of features: 120
Threshold: 0.005 | Number of features: 105
Threshold: 0.007 | Number of features: 84
Threshold: 0.01 | Number of features: 66
Threshold: 0.02 | Number of features: 47
Threshold: 0.03 | Number of features: 29
Threshold: 0.04 | Number of features: 20
Threshold: 0.05 | Number of features: 12
Threshold: 0.1 | Number of features: 5

Code

# Restore from file or calculate
file = dir_interim + "task-2-w-credit-history--lgbm_molels_as_dict.pkl"

if os.path.exists(file):
    with open(file, "rb") as f:
        models = joblib.load(f)
else:
    for threshold in thresholds:
        model_name, model = fit_lgbm_ext_with_shap_threshold(threshold)
        models[model_name] = model

    with open(file, "wb") as f:
        joblib.dump(models, f)

del file
# Time: 5m 7.1s

Code

print("--- Train ---")
ml.classification_scores(
    models,
    X_credit_train,
    y_credit_train,
    sort_by="ROC_AUC",
    color="orange",
)

--- Train ---

	n	No_info_rate	Accuracy	BAcc	BAcc_01	F1	F1_neg	TPR	TNR	PPV	NPV	ROC_AUC
LGBM (216 features)	215257	0.919	0.737	0.747	0.493	0.317	0.837	0.759	0.735	0.201	0.972	0.827
LGBM (167 features)	215257	0.919	0.737	0.747	0.495	0.318	0.837	0.759	0.735	0.201	0.972	0.827
LGBM (FULL | 361 feat.)	215257	0.919	0.737	0.748	0.495	0.318	0.837	0.760	0.735	0.201	0.972	0.827
LGBM (183 features)	215257	0.919	0.737	0.748	0.496	0.318	0.837	0.761	0.735	0.201	0.972	0.827
LGBM (140 features)	215257	0.919	0.737	0.748	0.496	0.319	0.837	0.761	0.735	0.201	0.972	0.827
LGBM (120 features)	215257	0.919	0.737	0.746	0.492	0.317	0.837	0.757	0.735	0.201	0.972	0.826
LGBM (105 features)	215257	0.919	0.736	0.746	0.493	0.317	0.836	0.759	0.734	0.200	0.972	0.826
LGBM (84 features)	215257	0.919	0.734	0.744	0.488	0.315	0.835	0.755	0.733	0.199	0.971	0.824
LGBM (66 features)	215257	0.919	0.733	0.743	0.486	0.313	0.834	0.756	0.731	0.198	0.971	0.823
LGBM (47 features)	215257	0.919	0.730	0.741	0.481	0.311	0.832	0.753	0.728	0.196	0.971	0.819
LGBM (29 features)	215257	0.919	0.724	0.734	0.468	0.304	0.828	0.746	0.722	0.191	0.970	0.812
LGBM (20 features)	215257	0.919	0.720	0.731	0.462	0.300	0.825	0.744	0.718	0.188	0.970	0.807
LGBM (12 features)	215257	0.919	0.711	0.719	0.437	0.289	0.819	0.728	0.710	0.180	0.967	0.795
LGBM (5 features)	215257	0.919	0.701	0.701	0.403	0.275	0.812	0.702	0.701	0.171	0.964	0.777

Code

print("--- Validation ---")
ml.classification_scores(
    models,
    X_credit_validation,
    y_credit_validation,
    sort_by="ROC_AUC",
)

--- Validation ---

	n	No_info_rate	Accuracy	BAcc	BAcc_01	F1	F1_neg	TPR	TNR	PPV	NPV	ROC_AUC
LGBM (216 features)	46127	0.919	0.726	0.709	0.419	0.289	0.830	0.689	0.730	0.183	0.964	0.779
LGBM (183 features)	46127	0.919	0.724	0.705	0.411	0.286	0.829	0.682	0.728	0.181	0.963	0.779
LGBM (167 features)	46127	0.919	0.727	0.706	0.412	0.287	0.831	0.681	0.731	0.182	0.963	0.778
LGBM (140 features)	46127	0.919	0.724	0.706	0.411	0.286	0.829	0.684	0.728	0.181	0.963	0.778
LGBM (FULL | 361 feat.)	46127	0.919	0.724	0.704	0.409	0.285	0.829	0.680	0.728	0.180	0.963	0.778
LGBM (120 features)	46127	0.919	0.726	0.706	0.413	0.287	0.831	0.683	0.730	0.182	0.963	0.777
LGBM (105 features)	46127	0.919	0.725	0.704	0.408	0.285	0.829	0.679	0.729	0.180	0.963	0.777
LGBM (66 features)	46127	0.919	0.723	0.708	0.415	0.287	0.828	0.689	0.726	0.181	0.964	0.777
LGBM (84 features)	46127	0.919	0.722	0.706	0.412	0.285	0.828	0.687	0.726	0.180	0.963	0.776
LGBM (47 features)	46127	0.919	0.720	0.703	0.407	0.283	0.826	0.683	0.723	0.178	0.963	0.774
LGBM (29 features)	46127	0.919	0.714	0.700	0.399	0.278	0.822	0.682	0.717	0.175	0.963	0.770
LGBM (20 features)	46127	0.919	0.710	0.695	0.391	0.274	0.819	0.677	0.713	0.172	0.962	0.767
LGBM (12 features)	46127	0.919	0.705	0.693	0.386	0.271	0.815	0.679	0.707	0.169	0.962	0.760
LGBM (5 features)	46127	0.919	0.696	0.685	0.370	0.263	0.809	0.671	0.699	0.164	0.960	0.748
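One way to read the two tables together is to look at the difference between train and validation ROC AUC. The sketch below only re-uses values copied from the tables above (a subset of the models, for brevity); the gap is a rough indicator of overfitting, not a metric computed in the original analysis.

```python
# ROC AUC values copied from the train and validation tables: (train, validation).
roc_auc = {
    "LGBM (FULL | 361 feat.)": (0.827, 0.778),
    "LGBM (216 features)": (0.827, 0.779),
    "LGBM (120 features)": (0.826, 0.777),
    "LGBM (66 features)": (0.823, 0.777),
    "LGBM (29 features)": (0.812, 0.770),
    "LGBM (5 features)": (0.777, 0.748),
}

# Train-validation gap per model, rounded to the precision of the tables.
gaps = {
    name: round(train_auc - validation_auc, 3)
    for name, (train_auc, validation_auc) in roc_auc.items()
}

for name, gap in gaps.items():
    train_auc, validation_auc = roc_auc[name]
    print(f"{name:<24} train={train_auc:.3f} validation={validation_auc:.3f} gap={gap:.3f}")
```

In these numbers the gap narrows from about 0.049 for the largest models to 0.029 for the 5-feature model, while validation ROC AUC stays within 0.776-0.779 for all models with 66 or more features.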

7.5 Tune Hyperparameters

In this section, the best-performing model based on 216 features will be tuned. To tune hyperparameters, the Optuna package is used.
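One detail worth checking in a LightGBM search space is parameter aliases. According to the LightGBM parameter documentation, several names used in the objective function of this section refer to the same setting (for example, `subsample` is an alias of `bagging_fraction`), so sampling both gives two values for one parameter; in the tuning log, trial 10 records `subsample` ≈ 0.08 and `bagging_fraction` ≈ 0.43. The sketch below is a plain-Python check, where `find_alias_collisions` is a hypothetical helper and the alias list covers only the groups relevant here.

```python
# Alias groups from the LightGBM parameter documentation (primary name first).
# Only the groups relevant to this search space are listed.
ALIAS_GROUPS = [
    ("lambda_l1", "reg_alpha"),
    ("lambda_l2", "reg_lambda"),
    ("feature_fraction", "colsample_bytree"),
    ("bagging_fraction", "subsample"),
    ("min_data_in_leaf", "min_child_samples"),
]

# Parameter names sampled in the objective function of this section.
tuned_params = [
    "n_estimators", "max_depth", "boosting_type", "num_leaves",
    "min_child_samples", "lambda_l1", "lambda_l2", "reg_alpha", "reg_lambda",
    "learning_rate", "feature_fraction", "subsample", "colsample_bytree",
    "bagging_fraction", "bagging_freq", "min_child_weight", "min_split_gain",
    "min_data_in_leaf", "max_delta_step",
]


def find_alias_collisions(param_names, alias_groups):
    """Return alias groups in which every listed name is present in param_names."""
    names = set(param_names)
    return [group for group in alias_groups if all(name in names for name in group)]


collisions = find_alias_collisions(tuned_params, ALIAS_GROUPS)
for primary, alias in collisions:
    print(f"{alias!r} is an alias of {primary!r}: both are sampled")
```

Which of the two values takes effect depends on how LightGBM resolves aliases, so keeping a single name per setting would make the search space smaller and the reported best parameters easier to interpret.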

Code

# Use the subset of the selected features
file_path = dir_interim + "colnames--cols_to_include_in_preprocessing.csv"
cols_to_include_in_preprocessing = pd.read_csv(file_path).column.tolist()
features_to_tune = feature_importance_ext.query(
    "importance > 0.0001"
).col_name.to_list()
del file_path

# Use 3-fold stratified CV
stratified_kfold = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)

NameError: name 'feature_importance_ext' is not defined

Code

# Define objective function for optuna
def objective(trial):
    """Objective function for hyperparameter tuning."""
    # LGBM params
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 50, 1000, step=50),
        "max_depth": trial.suggest_int("max_depth", 1, 12),
        "boosting_type": trial.suggest_categorical("boosting_type", ["gbdt"]),
        # Tree Structure and Complexity
        "num_leaves": trial.suggest_int("num_leaves", 2, 256),
        "min_child_samples": trial.suggest_int("min_child_samples", 5, 100),
        # Regularization
        "lambda_l1": trial.suggest_float("lambda_l1", 1e-8, 10.0, log=True),
        "lambda_l2": trial.suggest_float("lambda_l2", 1e-8, 10.0, log=True),
        "reg_alpha": trial.suggest_float("reg_alpha", 0.0, 1.0),
        "reg_lambda": trial.suggest_float("reg_lambda", 0.0, 1.0),
        # Learning Rate and Feature Selection
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        "feature_fraction": trial.suggest_float("feature_fraction", 0.4, 1.0),
        "subsample": trial.suggest_float("subsample", 0.05, 1.0),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.05, 1.0),
        # Other Parameters
        "bagging_fraction": trial.suggest_float("bagging_fraction", 0.4, 1.0),
        "bagging_freq": trial.suggest_int("bagging_freq", 1, 7),
        "min_child_weight": trial.suggest_float(
            "min_child_weight", 1e-3, 1e3, log=True
        ),
        "min_split_gain": trial.suggest_float("min_split_gain", 0.0, 1.0),
        "min_data_in_leaf": trial.suggest_int("min_data_in_leaf", 1, 50),
        "max_delta_step": trial.suggest_int("max_delta_step", 0, 10),
    }

    model = LGBMClassifier(
        objective="binary",
        metric="auc",
        random_state=1,
        class_weight="balanced",
        n_jobs=-1,
        device="gpu",
        **params,
    )

    pipeline_to_tune = Pipeline(
        steps=[
            ("selector_1", ColumnSelector(cols_to_include_in_preprocessing)),
            ("preprocessor_2", clone(pre_processing)),
            ("selector_2", ColumnSelector(features_to_tune)),
            ("classifier", model),
        ]
    )

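    # NOTE: no `scoring` is passed, so cross_val_score uses the pipeline's
    # default scorer (accuracy); pass scoring="roc_auc" to optimize ROC AUC.
    scores = cross_val_score(
        pipeline_to_tune, X_credit_train, y_credit_train, n_jobs=-1, cv=stratified_kfold
    )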

    return scores.mean()


study_name = "tune-w-credit-history"
storage_name = f"sqlite:///{dir_interim}/optuna--{study_name}.db"

study = optuna.create_study(
    study_name=study_name,
    storage=storage_name,
    load_if_exists=True,
    direction="maximize",
)
study.optimize(objective, n_trials=100, timeout=3600)
# Time 62m 20.0s

[I 2023-12-27 19:20:13,041] A new study created in RDB with name: tune-w-credit-history
[I 2023-12-27 19:21:14,358] Trial 0 finished with value: 0.7133844678540423 and parameters: {'n_estimators': 800, 'max_depth': 1, 'boosting_type': 'gbdt', 'num_leaves': 68, 'min_child_samples': 29, 'lambda_l1': 0.0001520382569789408, 'lambda_l2': 1.0837050089743732e-05, 'reg_alpha': 0.06205900866526515, 'reg_lambda': 0.9973509941432356, 'learning_rate': 0.17385115867266585, 'feature_fraction': 0.457613838544846, 'subsample': 0.7641858502243166, 'colsample_bytree': 0.4691247981564376, 'bagging_fraction': 0.9776044061943601, 'bagging_freq': 3, 'min_child_weight': 0.2632724059022625, 'min_split_gain': 0.3144179175251397, 'min_data_in_leaf': 20, 'max_delta_step': 6}. Best is trial 0 with value: 0.7133844678540423.
[I 2023-12-27 19:21:54,405] Trial 1 finished with value: 0.7188662922677773 and parameters: {'n_estimators': 300, 'max_depth': 2, 'boosting_type': 'gbdt', 'num_leaves': 111, 'min_child_samples': 72, 'lambda_l1': 0.0010433150020650284, 'lambda_l2': 0.000733681979055871, 'reg_alpha': 0.35178729517365503, 'reg_lambda': 0.03423976796711303, 'learning_rate': 0.2903376154019435, 'feature_fraction': 0.6147235927730519, 'subsample': 0.6121461820293557, 'colsample_bytree': 0.8644863075835109, 'bagging_fraction': 0.6308830635202857, 'bagging_freq': 2, 'min_child_weight': 0.2432421435462729, 'min_split_gain': 0.223990834897291, 'min_data_in_leaf': 40, 'max_delta_step': 2}. Best is trial 1 with value: 0.7188662922677773.
[I 2023-12-27 19:22:23,909] Trial 2 finished with value: 0.7126876338496727 and parameters: {'n_estimators': 150, 'max_depth': 2, 'boosting_type': 'gbdt', 'num_leaves': 139, 'min_child_samples': 39, 'lambda_l1': 0.4651706361118245, 'lambda_l2': 0.0113567650942596, 'reg_alpha': 0.28596868511468554, 'reg_lambda': 0.5617766235501069, 'learning_rate': 0.2862326501066601, 'feature_fraction': 0.7760128851639144, 'subsample': 0.48788054317390844, 'colsample_bytree': 0.2983688315154828, 'bagging_fraction': 0.7507872036234606, 'bagging_freq': 7, 'min_child_weight': 814.1423255572674, 'min_split_gain': 0.12315860127817202, 'min_data_in_leaf': 26, 'max_delta_step': 1}. Best is trial 1 with value: 0.7188662922677773.
[I 2023-12-27 19:23:52,349] Trial 3 finished with value: 0.8479259574144425 and parameters: {'n_estimators': 1000, 'max_depth': 5, 'boosting_type': 'gbdt', 'num_leaves': 112, 'min_child_samples': 96, 'lambda_l1': 0.00017306252387348225, 'lambda_l2': 1.977539005227824e-05, 'reg_alpha': 0.4283787945872213, 'reg_lambda': 0.9893280264194702, 'learning_rate': 0.17128861427623032, 'feature_fraction': 0.6845818343036181, 'subsample': 0.17203387975679538, 'colsample_bytree': 0.7980915047057495, 'bagging_fraction': 0.8435277923028028, 'bagging_freq': 4, 'min_child_weight': 0.004024480124332665, 'min_split_gain': 0.4465134798497602, 'min_data_in_leaf': 38, 'max_delta_step': 3}. Best is trial 3 with value: 0.8479259574144425.
[I 2023-12-27 19:25:11,579] Trial 4 finished with value: 0.7387123283474638 and parameters: {'n_estimators': 500, 'max_depth': 6, 'boosting_type': 'gbdt', 'num_leaves': 223, 'min_child_samples': 21, 'lambda_l1': 1.036153287921386e-07, 'lambda_l2': 1.6295147516128015e-07, 'reg_alpha': 0.6652264420514161, 'reg_lambda': 0.18748862643684328, 'learning_rate': 0.01347760446234664, 'feature_fraction': 0.8228363548713917, 'subsample': 0.5904533867022965, 'colsample_bytree': 0.9358815070918077, 'bagging_fraction': 0.5846974106187401, 'bagging_freq': 5, 'min_child_weight': 0.7074335671273558, 'min_split_gain': 0.29439756447372956, 'min_data_in_leaf': 17, 'max_delta_step': 9}. Best is trial 3 with value: 0.8479259574144425.
[I 2023-12-27 19:25:39,516] Trial 5 finished with value: 0.6993268474558821 and parameters: {'n_estimators': 150, 'max_depth': 2, 'boosting_type': 'gbdt', 'num_leaves': 216, 'min_child_samples': 14, 'lambda_l1': 9.750757318137843, 'lambda_l2': 0.34479627430862014, 'reg_alpha': 0.8213573723009959, 'reg_lambda': 0.9754288921619616, 'learning_rate': 0.08283760652331447, 'feature_fraction': 0.782639149015379, 'subsample': 0.43314082646508134, 'colsample_bytree': 0.2156419154071122, 'bagging_fraction': 0.9768194695171522, 'bagging_freq': 3, 'min_child_weight': 0.005020398733771443, 'min_split_gain': 0.710369095642509, 'min_data_in_leaf': 19, 'max_delta_step': 6}. Best is trial 3 with value: 0.8479259574144425.
[I 2023-12-27 19:26:50,629] Trial 6 finished with value: 0.8316849025663076 and parameters: {'n_estimators': 200, 'max_depth': 9, 'boosting_type': 'gbdt', 'num_leaves': 120, 'min_child_samples': 100, 'lambda_l1': 4.911477023170647e-08, 'lambda_l2': 1.3489936875313564e-05, 'reg_alpha': 0.673138884928222, 'reg_lambda': 0.003695339233514061, 'learning_rate': 0.11318013505175774, 'feature_fraction': 0.9208947900557085, 'subsample': 0.27865211549049074, 'colsample_bytree': 0.7964060768332551, 'bagging_fraction': 0.6686535176120247, 'bagging_freq': 1, 'min_child_weight': 4.080570526996787, 'min_split_gain': 0.9121238054632345, 'min_data_in_leaf': 37, 'max_delta_step': 6}. Best is trial 3 with value: 0.8479259574144425.
[I 2023-12-27 19:27:54,315] Trial 7 finished with value: 0.7522496444555378 and parameters: {'n_estimators': 600, 'max_depth': 5, 'boosting_type': 'gbdt', 'num_leaves': 244, 'min_child_samples': 43, 'lambda_l1': 2.534823760275499e-08, 'lambda_l2': 2.7995860787697047, 'reg_alpha': 0.28650632087067907, 'reg_lambda': 0.6249923343905277, 'learning_rate': 0.039342315509374066, 'feature_fraction': 0.7293660504039947, 'subsample': 0.6741250410202273, 'colsample_bytree': 0.16784925739666362, 'bagging_fraction': 0.5496868541785603, 'bagging_freq': 3, 'min_child_weight': 6.446646717695389, 'min_split_gain': 0.9250331900530822, 'min_data_in_leaf': 49, 'max_delta_step': 2}. Best is trial 3 with value: 0.8479259574144425.
[I 2023-12-27 19:28:21,006] Trial 8 finished with value: 0.7072801486649237 and parameters: {'n_estimators': 50, 'max_depth': 3, 'boosting_type': 'gbdt', 'num_leaves': 59, 'min_child_samples': 6, 'lambda_l1': 2.05655116866198e-08, 'lambda_l2': 4.0069835388076785e-08, 'reg_alpha': 0.26041375594498495, 'reg_lambda': 0.38611463427103676, 'learning_rate': 0.2781185770603419, 'feature_fraction': 0.741937028506221, 'subsample': 0.4462685297259511, 'colsample_bytree': 0.5284777436645074, 'bagging_fraction': 0.9732383128478747, 'bagging_freq': 1, 'min_child_weight': 2.319742366056286, 'min_split_gain': 0.4202994519219667, 'min_data_in_leaf': 15, 'max_delta_step': 2}. Best is trial 3 with value: 0.8479259574144425.
[I 2023-12-27 19:28:49,372] Trial 9 finished with value: 0.6756110093515885 and parameters: {'n_estimators': 50, 'max_depth': 4, 'boosting_type': 'gbdt', 'num_leaves': 11, 'min_child_samples': 41, 'lambda_l1': 7.36483774066561e-07, 'lambda_l2': 0.0004694219095532689, 'reg_alpha': 0.8791483544217241, 'reg_lambda': 0.42223959207093187, 'learning_rate': 0.013380954064159494, 'feature_fraction': 0.48647182501042135, 'subsample': 0.33239085856469064, 'colsample_bytree': 0.4521145383014534, 'bagging_fraction': 0.6908504039445515, 'bagging_freq': 6, 'min_child_weight': 82.2433055046293, 'min_split_gain': 0.28892951590992133, 'min_data_in_leaf': 25, 'max_delta_step': 1}. Best is trial 3 with value: 0.8479259574144425.
[I 2023-12-27 19:36:23,915] Trial 10 finished with value: 0.9043887034768833 and parameters: {'n_estimators': 1000, 'max_depth': 12, 'boosting_type': 'gbdt', 'num_leaves': 171, 'min_child_samples': 71, 'lambda_l1': 1.6894405269684752e-05, 'lambda_l2': 2.5479861382375376e-06, 'reg_alpha': 0.5305753243887574, 'reg_lambda': 0.7834468951359224, 'learning_rate': 0.05897985586531902, 'feature_fraction': 0.608096917733464, 'subsample': 0.0823584299376674, 'colsample_bytree': 0.6951910076532127, 'bagging_fraction': 0.4298117967213491, 'bagging_freq': 5, 'min_child_weight': 0.0011663326986012653, 'min_split_gain': 0.553012365836588, 'min_data_in_leaf': 1, 'max_delta_step': 4}. Best is trial 10 with value: 0.9043887034768833.
[I 2023-12-27 19:44:20,709] Trial 11 finished with value: 0.9002541073988085 and parameters: {'n_estimators': 1000, 'max_depth': 11, 'boosting_type': 'gbdt', 'num_leaves': 173, 'min_child_samples': 78, 'lambda_l1': 1.0708866586763564e-05, 'lambda_l2': 2.4683340848716906e-06, 'reg_alpha': 0.5302837150243924, 'reg_lambda': 0.8033717158930774, 'learning_rate': 0.05047074984874552, 'feature_fraction': 0.6196066363480937, 'subsample': 0.08527296183385108, 'colsample_bytree': 0.7184725298338295, 'bagging_fraction': 0.43735780133140306, 'bagging_freq': 5, 'min_child_weight': 0.0010552040603009237, 'min_split_gain': 0.5642018881288399, 'min_data_in_leaf': 2, 'max_delta_step': 4}. Best is trial 10 with value: 0.9043887034768833.
[I 2023-12-27 19:53:01,431] Trial 12 finished with value: 0.9002541115424737 and parameters: {'n_estimators': 1000, 'max_depth': 12, 'boosting_type': 'gbdt', 'num_leaves': 173, 'min_child_samples': 70, 'lambda_l1': 1.165530319451606e-05, 'lambda_l2': 6.557722320701904e-07, 'reg_alpha': 0.55443498043106, 'reg_lambda': 0.750406064876292, 'learning_rate': 0.04803394354033783, 'feature_fraction': 0.5418748467431972, 'subsample': 0.0527277396464233, 'colsample_bytree': 0.6409664835561325, 'bagging_fraction': 0.4057583105199048, 'bagging_freq': 5, 'min_child_weight': 0.0012744723872042992, 'min_split_gain': 0.6235370931106026, 'min_data_in_leaf': 1, 'max_delta_step': 4}. Best is trial 10 with value: 0.9043887034768833.
[I 2023-12-27 19:57:59,299] Trial 13 finished with value: 0.8785080092391984 and parameters: {'n_estimators': 800, 'max_depth': 12, 'boosting_type': 'gbdt', 'num_leaves': 179, 'min_child_samples': 59, 'lambda_l1': 3.870504370488863e-06, 'lambda_l2': 3.5632798838577234e-07, 'reg_alpha': 0.5552302283068307, 'reg_lambda': 0.7441647781250288, 'learning_rate': 0.0331882192882924, 'feature_fraction': 0.5379323013978335, 'subsample': 0.05782808696150672, 'colsample_bytree': 0.6668916697472603, 'bagging_fraction': 0.4042138039242215, 'bagging_freq': 5, 'min_child_weight': 0.02749957725080116, 'min_split_gain': 0.6358682934280196, 'min_data_in_leaf': 2, 'max_delta_step': 8}. Best is trial 10 with value: 0.9043887034768833.
[I 2023-12-27 20:02:08,027] Trial 14 finished with value: 0.8543276011471223 and parameters: {'n_estimators': 850, 'max_depth': 9, 'boosting_type': 'gbdt', 'num_leaves': 197, 'min_child_samples': 65, 'lambda_l1': 5.857325396819327e-06, 'lambda_l2': 1.1540765814883997e-08, 'reg_alpha': 0.9449558716340772, 'reg_lambda': 0.7643735518759827, 'learning_rate': 0.0264377319047264, 'feature_fraction': 0.4138809416134508, 'subsample': 0.9785014763384117, 'colsample_bytree': 0.6217631862933801, 'bagging_fraction': 0.5004604633571963, 'bagging_freq': 7, 'min_child_weight': 0.029231874860064476, 'min_split_gain': 0.7346524444201824, 'min_data_in_leaf': 9, 'max_delta_step': 4}. Best is trial 10 with value: 0.9043887034768833.
[I 2023-12-27 20:05:28,515] Trial 15 finished with value: 0.8773558974807143 and parameters: {'n_estimators': 650, 'max_depth': 9, 'boosting_type': 'gbdt', 'num_leaves': 159, 'min_child_samples': 84, 'lambda_l1': 0.0010947979254749935, 'lambda_l2': 6.566180247402054e-07, 'reg_alpha': 0.700542554170593, 'reg_lambda': 0.6367861929705528, 'learning_rate': 0.06963364500865421, 'feature_fraction': 0.5454571723060675, 'subsample': 0.21789311730283067, 'colsample_bytree': 0.9417867015781536, 'bagging_fraction': 0.4613056213776964, 'bagging_freq': 6, 'min_child_weight': 0.0010050650360104857, 'min_split_gain': 0.4781261442568392, 'min_data_in_leaf': 8, 'max_delta_step': 5}. Best is trial 10 with value: 0.9043887034768833.
[I 2023-12-27 20:09:52,587] Trial 16 finished with value: 0.8996594719728325 and parameters: {'n_estimators': 900, 'max_depth': 11, 'boosting_type': 'gbdt', 'num_leaves': 150, 'min_child_samples': 56, 'lambda_l1': 5.2101850657853695e-05, 'lambda_l2': 1.0270011883222986e-08, 'reg_alpha': 0.47817116755622174, 'reg_lambda': 0.8455458639986815, 'learning_rate': 0.05796270751653336, 'feature_fraction': 0.6127662875632117, 'subsample': 0.1571666236334549, 'colsample_bytree': 0.5457412302228843, 'bagging_fraction': 0.5123096232048837, 'bagging_freq': 4, 'min_child_weight': 0.03226715486140934, 'min_split_gain': 0.5688182774542483, 'min_data_in_leaf': 9, 'max_delta_step': 0}. Best is trial 10 with value: 0.9043887034768833.
[I 2023-12-27 20:11:43,889] Trial 17 finished with value: 0.7970983517618615 and parameters: {'n_estimators': 400, 'max_depth': 8, 'boosting_type': 'gbdt', 'num_leaves': 249, 'min_child_samples': 88, 'lambda_l1': 8.485857943906983e-07, 'lambda_l2': 2.332740208848319e-06, 'reg_alpha': 0.9978838168722097, 'reg_lambda': 0.6798468844976686, 'learning_rate': 0.028667289635909032, 'feature_fraction': 0.40793738816795067, 'subsample': 0.0673086918921274, 'colsample_bytree': 0.9946989885817998, 'bagging_fraction': 0.4246976859779345, 'bagging_freq': 6, 'min_child_weight': 0.004197888784056201, 'min_split_gain': 0.7772416365414636, 'min_data_in_leaf': 1, 'max_delta_step': 8}. Best is trial 10 with value: 0.9043887034768833.
[I 2023-12-27 20:14:34,560] Trial 18 finished with value: 0.8411340787793012 and parameters: {'n_estimators': 700, 'max_depth': 12, 'boosting_type': 'gbdt', 'num_leaves': 79, 'min_child_samples': 70, 'lambda_l1': 0.0064640587661589505, 'lambda_l2': 7.546095424976181e-05, 'reg_alpha': 0.6179256374168793, 'reg_lambda': 0.8653554387092988, 'learning_rate': 0.04692359560723053, 'feature_fraction': 0.540231361055241, 'subsample': 0.3039159708340914, 'colsample_bytree': 0.7111036760812757, 'bagging_fraction': 0.4943364397262184, 'bagging_freq': 5, 'min_child_weight': 0.012808373527934225, 'min_split_gain': 0.6459614317370587, 'min_data_in_leaf': 11, 'max_delta_step': 4}. Best is trial 10 with value: 0.9043887034768833.
[I 2023-12-27 20:18:39,592] Trial 19 finished with value: 0.9003656058507529 and parameters: {'n_estimators': 950, 'max_depth': 10, 'boosting_type': 'gbdt', 'num_leaves': 195, 'min_child_samples': 51, 'lambda_l1': 1.9740443034087055e-05, 'lambda_l2': 1.538492352140492e-07, 'reg_alpha': 0.7820676589881548, 'reg_lambda': 0.485391596878385, 'learning_rate': 0.08338420033598173, 'feature_fraction': 0.6750014906044565, 'subsample': 0.18379862862364665, 'colsample_bytree': 0.3922667871442633, 'bagging_fraction': 0.4002258795457228, 'bagging_freq': 4, 'min_child_weight': 0.0929726219315795, 'min_split_gain': 0.8462828519856542, 'min_data_in_leaf': 30, 'max_delta_step': 7}. Best is trial 10 with value: 0.9043887034768833.
[I 2023-12-27 20:22:31,723] Trial 20 finished with value: 0.9029578514479404 and parameters: {'n_estimators': 900, 'max_depth': 10, 'boosting_type': 'gbdt', 'num_leaves': 209, 'min_child_samples': 47, 'lambda_l1': 3.351553254308848e-07, 'lambda_l2': 6.380193421921823e-08, 'reg_alpha': 0.7908982979418974, 'reg_lambda': 0.4871542957987469, 'learning_rate': 0.08670301703025042, 'feature_fraction': 0.6613735657699136, 'subsample': 0.3683340397111962, 'colsample_bytree': 0.3697916753974209, 'bagging_fraction': 0.5700116779255459, 'bagging_freq': 4, 'min_child_weight': 0.09342824353373518, 'min_split_gain': 0.8503928290277616, 'min_data_in_leaf': 27, 'max_delta_step': 10}. Best is trial 10 with value: 0.9043887034768833.

Best is trial 10 with CV AUC value: 0.9044

Trial 10 finished with value: 0.9043887034768833 and parameters:

'n_estimators': 1000,
'max_depth': 12,
'boosting_type': 'gbdt',
'num_leaves': 171,
'min_child_samples': 71,
'lambda_l1': 1.6894405269684752e-05,
'lambda_l2': 2.5479861382375376e-06,
'reg_alpha': 0.5305753243887574,
'reg_lambda': 0.7834468951359224,
'learning_rate': 0.05897985586531902,
'feature_fraction': 0.608096917733464,
'subsample': 0.0823584299376674,
'colsample_bytree': 0.6951910076532127,
'bagging_fraction': 0.4298117967213491,
'bagging_freq': 5,
'min_child_weight': 0.0011663326986012653,
'min_split_gain': 0.553012365836588,
'min_data_in_leaf': 1,
'max_delta_step': 4

7.6 Evaluate Tuned Model

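As with the tuned model in Section 4.7, this tuned model overfits: its ROC AUC is 1.000 on the training set but only 0.746 on the validation set (see the tables below), which is lower than for any of the untuned models. This should be addressed by limiting model complexity. Unfortunately, there was not enough time to do this in this project.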

Code

params_tuned_2 = {
    "n_estimators": 1000,
    "max_depth": 12,
    "boosting_type": "gbdt",
    "num_leaves": 171,
    "min_child_samples": 71,
    "lambda_l1": 1.6894405269684752e-05,
    "lambda_l2": 2.5479861382375376e-06,
    "reg_alpha": 0.5305753243887574,
    "reg_lambda": 0.7834468951359224,
    "learning_rate": 0.05897985586531902,
    "feature_fraction": 0.608096917733464,
    "subsample": 0.0823584299376674,
    "colsample_bytree": 0.6951910076532127,
    "bagging_fraction": 0.4298117967213491,
    "bagging_freq": 5,
    "min_child_weight": 0.0011663326986012653,
    "min_split_gain": 0.553012365836588,
    "min_data_in_leaf": 1,
    "max_delta_step": 4,
}

model_tuned_2 = LGBMClassifier(
    objective="binary",
    metric="auc",
    random_state=1,
    class_weight="balanced",
    n_jobs=-1,
    device="gpu",
    **params_tuned_2
)

pipeline_tuned_2 = Pipeline(
    steps=[
        ("selector_1", ColumnSelector(cols_to_include_in_preprocessing)),
        ("preprocessor_2", clone(pre_processing)),
        ("selector_2", ColumnSelector(features_to_tune)),
        ("classifier", model_tuned_2),
    ]
)

pipeline_tuned_2.fit(X_credit_train, y_credit_train)

[LightGBM] [Warning] min_data_in_leaf is set=1, min_child_samples=71 will be ignored. Current value: min_data_in_leaf=1
[LightGBM] [Warning] feature_fraction is set=0.608096917733464, colsample_bytree=0.6951910076532127 will be ignored. Current value: feature_fraction=0.608096917733464
[LightGBM] [Warning] lambda_l1 is set=1.6894405269684752e-05, reg_alpha=0.5305753243887574 will be ignored. Current value: lambda_l1=1.6894405269684752e-05
[LightGBM] [Warning] lambda_l2 is set=2.5479861382375376e-06, reg_lambda=0.7834468951359224 will be ignored. Current value: lambda_l2=2.5479861382375376e-06
[LightGBM] [Warning] bagging_fraction is set=0.4298117967213491, subsample=0.0823584299376674 will be ignored. Current value: bagging_fraction=0.4298117967213491
[LightGBM] [Warning] bagging_freq is set=5, subsample_freq=0 will be ignored. Current value: bagging_freq=5
[LightGBM] [Warning] min_data_in_leaf is set=1, min_child_samples=71 will be ignored. Current value: min_data_in_leaf=1
[LightGBM] [Warning] feature_fraction is set=0.608096917733464, colsample_bytree=0.6951910076532127 will be ignored. Current value: feature_fraction=0.608096917733464
[LightGBM] [Warning] lambda_l1 is set=1.6894405269684752e-05, reg_alpha=0.5305753243887574 will be ignored. Current value: lambda_l1=1.6894405269684752e-05
[LightGBM] [Warning] lambda_l2 is set=2.5479861382375376e-06, reg_lambda=0.7834468951359224 will be ignored. Current value: lambda_l2=2.5479861382375376e-06
[LightGBM] [Warning] bagging_fraction is set=0.4298117967213491, subsample=0.0823584299376674 will be ignored. Current value: bagging_fraction=0.4298117967213491
[LightGBM] [Warning] bagging_freq is set=5, subsample_freq=0 will be ignored. Current value: bagging_freq=5
[LightGBM] [Info] Number of positive: 17377, number of negative: 197880
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 27981
[LightGBM] [Info] Number of data points in the train set: 215257, number of used features: 216
[LightGBM] [Info] Using GPU Device: NVIDIA GeForce GTX 950M, Vendor: NVIDIA Corporation
[LightGBM] [Info] Compiling OpenCL Kernel with 256 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 8
[LightGBM] [Info] 104 dense feature groups (21.35 MB) transferred to GPU in 0.051113 secs. 1 sparse feature groups
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=-0.000000
[LightGBM] [Info] Start training from score -0.000000
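
The warnings above indicate that params_tuned_2 sets several options under two names at once (a LightGBM-native name and its scikit-learn-style alias), and that in each such pair the alias value was ignored. In particular, the model was effectively trained with min_data_in_leaf=1 rather than min_child_samples=71, which may have contributed to the overfitting. A minimal sketch of how such conflicts could be detected before fitting is given below. The helper names (find_alias_conflicts, drop_ignored_aliases) and the shortened example dictionary are illustrative assumptions and are not part of the original analysis. The precedence rule is taken from the log above and may differ in other LightGBM versions.

```python
# Pairs of (LightGBM-native name, scikit-learn-style alias).
# In the log above, the native name wins and the alias is ignored.
ALIAS_PAIRS = [
    ("min_data_in_leaf", "min_child_samples"),
    ("feature_fraction", "colsample_bytree"),
    ("lambda_l1", "reg_alpha"),
    ("lambda_l2", "reg_lambda"),
    ("bagging_fraction", "subsample"),
    ("bagging_freq", "subsample_freq"),
]


def find_alias_conflicts(params):
    """Return (native, alias) pairs where both names are present in params."""
    return [(n, a) for n, a in ALIAS_PAIRS if n in params and a in params]


def drop_ignored_aliases(params):
    """Return a copy of params without the aliases that would be ignored."""
    ignored = {alias for _, alias in find_alias_conflicts(params)}
    return {k: v for k, v in params.items() if k not in ignored}


# Shortened copy of params_tuned_2 (assumption: a subset is enough to illustrate).
params_example = {
    "num_leaves": 171,
    "min_child_samples": 71,
    "min_data_in_leaf": 1,
    "reg_alpha": 0.5305753243887574,
    "lambda_l1": 1.6894405269684752e-05,
    "subsample": 0.0823584299376674,
    "bagging_fraction": 0.4298117967213491,
    "bagging_freq": 5,
}

conflicts = find_alias_conflicts(params_example)
effective = drop_ignored_aliases(params_example)
print("Conflicting pairs:", conflicts)
print("Effective parameters:", effective)
```

Sampling only one name per option during tuning would avoid spending trials on values that never reach the model.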

Code

models["LGBM (216 feat. | tuned)"] = pipeline_tuned_2

Code

performance_train_2 = ml.classification_scores(
    models,
    X_credit_train,
    y_credit_train,
    color="orange",
    sort_by="ROC_AUC",
)

[LightGBM] [Warning] min_data_in_leaf is set=1, min_child_samples=71 will be ignored. Current value: min_data_in_leaf=1
[LightGBM] [Warning] feature_fraction is set=0.608096917733464, colsample_bytree=0.6951910076532127 will be ignored. Current value: feature_fraction=0.608096917733464
[LightGBM] [Warning] lambda_l1 is set=1.6894405269684752e-05, reg_alpha=0.5305753243887574 will be ignored. Current value: lambda_l1=1.6894405269684752e-05
[LightGBM] [Warning] lambda_l2 is set=2.5479861382375376e-06, reg_lambda=0.7834468951359224 will be ignored. Current value: lambda_l2=2.5479861382375376e-06
[LightGBM] [Warning] bagging_fraction is set=0.4298117967213491, subsample=0.0823584299376674 will be ignored. Current value: bagging_fraction=0.4298117967213491
[LightGBM] [Warning] bagging_freq is set=5, subsample_freq=0 will be ignored. Current value: bagging_freq=5
[LightGBM] [Warning] min_data_in_leaf is set=1, min_child_samples=71 will be ignored. Current value: min_data_in_leaf=1
[LightGBM] [Warning] feature_fraction is set=0.608096917733464, colsample_bytree=0.6951910076532127 will be ignored. Current value: feature_fraction=0.608096917733464
[LightGBM] [Warning] lambda_l1 is set=1.6894405269684752e-05, reg_alpha=0.5305753243887574 will be ignored. Current value: lambda_l1=1.6894405269684752e-05
[LightGBM] [Warning] lambda_l2 is set=2.5479861382375376e-06, reg_lambda=0.7834468951359224 will be ignored. Current value: lambda_l2=2.5479861382375376e-06
[LightGBM] [Warning] bagging_fraction is set=0.4298117967213491, subsample=0.0823584299376674 will be ignored. Current value: bagging_fraction=0.4298117967213491
[LightGBM] [Warning] bagging_freq is set=5, subsample_freq=0 will be ignored. Current value: bagging_freq=5

Code

performance_validation_2 = ml.classification_scores(
    models,
    X_credit_validation,
    y_credit_validation,
    sort_by="ROC_AUC",
)

[LightGBM] [Warning] min_data_in_leaf is set=1, min_child_samples=71 will be ignored. Current value: min_data_in_leaf=1
[LightGBM] [Warning] feature_fraction is set=0.608096917733464, colsample_bytree=0.6951910076532127 will be ignored. Current value: feature_fraction=0.608096917733464
[LightGBM] [Warning] lambda_l1 is set=1.6894405269684752e-05, reg_alpha=0.5305753243887574 will be ignored. Current value: lambda_l1=1.6894405269684752e-05
[LightGBM] [Warning] lambda_l2 is set=2.5479861382375376e-06, reg_lambda=0.7834468951359224 will be ignored. Current value: lambda_l2=2.5479861382375376e-06
[LightGBM] [Warning] bagging_fraction is set=0.4298117967213491, subsample=0.0823584299376674 will be ignored. Current value: bagging_fraction=0.4298117967213491
[LightGBM] [Warning] bagging_freq is set=5, subsample_freq=0 will be ignored. Current value: bagging_freq=5
[LightGBM] [Warning] min_data_in_leaf is set=1, min_child_samples=71 will be ignored. Current value: min_data_in_leaf=1
[LightGBM] [Warning] feature_fraction is set=0.608096917733464, colsample_bytree=0.6951910076532127 will be ignored. Current value: feature_fraction=0.608096917733464
[LightGBM] [Warning] lambda_l1 is set=1.6894405269684752e-05, reg_alpha=0.5305753243887574 will be ignored. Current value: lambda_l1=1.6894405269684752e-05
[LightGBM] [Warning] lambda_l2 is set=2.5479861382375376e-06, reg_lambda=0.7834468951359224 will be ignored. Current value: lambda_l2=2.5479861382375376e-06
[LightGBM] [Warning] bagging_fraction is set=0.4298117967213491, subsample=0.0823584299376674 will be ignored. Current value: bagging_fraction=0.4298117967213491
[LightGBM] [Warning] bagging_freq is set=5, subsample_freq=0 will be ignored. Current value: bagging_freq=5

Code

print("--- Train ---")
performance_train_2

--- Train ---

	n	No_info_rate	Accuracy	BAcc	BAcc_01	F1	F1_neg	TPR	TNR	PPV	NPV	ROC_AUC
LGBM (216 feat. | tuned)	215257	0.919	0.991	0.995	0.990	0.947	0.995	1.000	0.990	0.900	1.000	1.000
LGBM (216 features)	215257	0.919	0.737	0.747	0.493	0.317	0.837	0.759	0.735	0.201	0.972	0.827
LGBM (167 features)	215257	0.919	0.737	0.747	0.495	0.318	0.837	0.759	0.735	0.201	0.972	0.827
LGBM (FULL | 361 feat.)	215257	0.919	0.737	0.748	0.495	0.318	0.837	0.760	0.735	0.201	0.972	0.827
LGBM (183 features)	215257	0.919	0.737	0.748	0.496	0.318	0.837	0.761	0.735	0.201	0.972	0.827
LGBM (140 features)	215257	0.919	0.737	0.748	0.496	0.319	0.837	0.761	0.735	0.201	0.972	0.827
LGBM (120 features)	215257	0.919	0.737	0.746	0.492	0.317	0.837	0.757	0.735	0.201	0.972	0.826
LGBM (105 features)	215257	0.919	0.736	0.746	0.493	0.317	0.836	0.759	0.734	0.200	0.972	0.826
LGBM (84 features)	215257	0.919	0.734	0.744	0.488	0.315	0.835	0.755	0.733	0.199	0.971	0.824
LGBM (66 features)	215257	0.919	0.733	0.743	0.486	0.313	0.834	0.756	0.731	0.198	0.971	0.823
LGBM (47 features)	215257	0.919	0.730	0.741	0.481	0.311	0.832	0.753	0.728	0.196	0.971	0.819
LGBM (29 features)	215257	0.919	0.724	0.734	0.468	0.304	0.828	0.746	0.722	0.191	0.970	0.812
LGBM (20 features)	215257	0.919	0.720	0.731	0.462	0.300	0.825	0.744	0.718	0.188	0.970	0.807
LGBM (12 features)	215257	0.919	0.711	0.719	0.437	0.289	0.819	0.728	0.710	0.180	0.967	0.795
LGBM (5 features)	215257	0.919	0.701	0.701	0.403	0.275	0.812	0.702	0.701	0.171	0.964	0.777

Code

print("--- Validation ---")
performance_validation_2

--- Validation ---

	n	No_info_rate	Accuracy	BAcc	BAcc_01	F1	F1_neg	TPR	TNR	PPV	NPV	ROC_AUC
LGBM (216 features)	46127	0.919	0.726	0.709	0.419	0.289	0.830	0.689	0.730	0.183	0.964	0.779
LGBM (183 features)	46127	0.919	0.724	0.705	0.411	0.286	0.829	0.682	0.728	0.181	0.963	0.779
LGBM (167 features)	46127	0.919	0.727	0.706	0.412	0.287	0.831	0.681	0.731	0.182	0.963	0.778
LGBM (140 features)	46127	0.919	0.724	0.706	0.411	0.286	0.829	0.684	0.728	0.181	0.963	0.778
LGBM (FULL | 361 feat.)	46127	0.919	0.724	0.704	0.409	0.285	0.829	0.680	0.728	0.180	0.963	0.778
LGBM (120 features)	46127	0.919	0.726	0.706	0.413	0.287	0.831	0.683	0.730	0.182	0.963	0.777
LGBM (105 features)	46127	0.919	0.725	0.704	0.408	0.285	0.829	0.679	0.729	0.180	0.963	0.777
LGBM (66 features)	46127	0.919	0.723	0.708	0.415	0.287	0.828	0.689	0.726	0.181	0.964	0.777
LGBM (84 features)	46127	0.919	0.722	0.706	0.412	0.285	0.828	0.687	0.726	0.180	0.963	0.776
LGBM (47 features)	46127	0.919	0.720	0.703	0.407	0.283	0.826	0.683	0.723	0.178	0.963	0.774
LGBM (29 features)	46127	0.919	0.714	0.700	0.399	0.278	0.822	0.682	0.717	0.175	0.963	0.770
LGBM (20 features)	46127	0.919	0.710	0.695	0.391	0.274	0.819	0.677	0.713	0.172	0.962	0.767
LGBM (12 features)	46127	0.919	0.705	0.693	0.386	0.271	0.815	0.679	0.707	0.169	0.962	0.760
LGBM (5 features)	46127	0.919	0.696	0.685	0.370	0.263	0.809	0.671	0.699	0.164	0.960	0.748
LGBM (216 feat. | tuned)	46127	0.919	0.893	0.596	0.193	0.268	0.942	0.243	0.950	0.298	0.935	0.746
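
The degree of overfitting can be summarized as the difference between the train and validation ROC AUC. The sketch below uses values copied from the two tables above for three of the models. The function name auc_gap is an illustrative assumption and is not part of the original analysis.

```python
# ROC AUC values copied from the train and validation tables above.
auc_scores = {
    "LGBM (216 feat. | tuned)": {"train": 1.000, "validation": 0.746},
    "LGBM (216 features)": {"train": 0.827, "validation": 0.779},
    "LGBM (47 features)": {"train": 0.819, "validation": 0.774},
}


def auc_gap(scores):
    """Return train-minus-validation AUC per model, rounded to 3 decimals."""
    return {
        name: round(s["train"] - s["validation"], 3) for name, s in scores.items()
    }


gaps = auc_gap(auc_scores)

# Print models from the largest gap (most overfitted) to the smallest.
for name, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: train - validation AUC = {gap:.3f}")
```

The gap of the tuned model is roughly five times larger than that of the untuned models, which is consistent with the overfitting noted at the start of this section.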

7.7 Final Evaluation

After hyperparameter tuning, the trade-off between model complexity and performance was reconsidered. Instead of the best-performing model based on 216 features (validation AUC = 0.779), a much simpler model based on 47 features with comparable performance (validation AUC = 0.774, only 0.005 lower) was chosen as the final model to be deployed.

On the test set, the final model based on these 47 features achieves AUC = 0.777. The best model built without credit history data had AUC = 0.763, so adding this type of data improves performance only slightly (by 0.014).

Code

features_47 = feature_importance_ext.head(47).col_name.to_list()

pipeline_final_2_with_47_feat = Pipeline(
    steps=[
        ("selector_1", ColumnSelector(cols_to_include_in_preprocessing)),
        ("preprocessor_2", clone(pre_processing)),
        ("selector_2", ColumnSelector(features_47)),
        ("classifier", clone(lgbm_classifier)),
    ]
)

Code

# For evaluation
X_credit_train_validation = pd.concat([X_credit_train, X_credit_validation])
y_credit_train_validation = pd.concat([y_credit_train, y_credit_validation])

models_final_2 = {}
models_final_2["LGBM (47 feat. | final)"] = pipeline_final_2_with_47_feat.fit(
    X_credit_train_validation, y_credit_train_validation
)

[LightGBM] [Info] Number of positive: 21101, number of negative: 240283
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 7853
[LightGBM] [Info] Number of data points in the train set: 261384, number of used features: 47
[LightGBM] [Info] Using GPU Device: NVIDIA GeForce GTX 950M, Vendor: NVIDIA Corporation
[LightGBM] [Info] Compiling OpenCL Kernel with 256 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 8
[LightGBM] [Info] 34 dense feature groups (8.97 MB) transferred to GPU in 0.026298 secs. 1 sparse feature groups
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=-0.000000
[LightGBM] [Info] Start training from score -0.000000

Code

print("--- Train ---")

ml.classification_scores(
    models_final_2,
    X_credit_train_validation,
    y_credit_train_validation,
    color="orange",
    sort_by="ROC_AUC",
)

--- Train ---

	n	No_info_rate	Accuracy	BAcc	BAcc_01	F1	F1_neg	TPR	TNR	PPV	NPV	ROC_AUC
LGBM (47 feat. | final)	261384	0.919	0.726	0.734	0.469	0.305	0.829	0.744	0.724	0.192	0.970	0.813

Code of the figure

sns.set_style("white")
y_pred_train_val_2 = models_final_2["LGBM (47 feat. | final)"].predict(
    X_credit_train_validation
)
ml.plot_confusion_matrices(y_credit_train_validation, y_pred_train_val_2, figsize=(13, 3));

Fig. 7.1. Confusion matrices for the joint **train and validation set**.

Code

print("--- Test ---")

ml.classification_scores(
    models_final_2,
    X_credit_test,
    y_credit_test,
    sort_by="ROC_AUC",
)

--- Test ---

	n	No_info_rate	Accuracy	BAcc	BAcc_01	F1	F1_neg	TPR	TNR	PPV	NPV	ROC_AUC
LGBM (47 feat. | final)	46127	0.919	0.722	0.710	0.420	0.288	0.827	0.697	0.724	0.181	0.964	0.777

Code of the figure

sns.set_style("white")
y_pred_test_2 = models_final_2["LGBM (47 feat. | final)"].predict(X_credit_test)
ml.plot_confusion_matrices(y_credit_test, y_pred_test_2, figsize=(13, 3));

Fig. 7.2. Confusion matrices for the **test set**.

Code

# SHAP values for the final model
@my.cache_results(dir_interim + "task-2--shap_lgbm_k=47-final.pickle")
def get_shap_values_lgbm_final_2():
    model = "LGBM (47 feat. | final)"
    preproc = Pipeline(steps=models_final_2[model].steps[:-1])
    classifier = models_final_2[model]["classifier"]
    X_test_preproc = preproc.transform(X_credit_test)

    tree_explainer = shap.TreeExplainer(classifier)
    shap_values = tree_explainer.shap_values(X_test_preproc)
    return shap_values, X_test_preproc


shap_values_lgbm_test_2, data_for_lgbm_test_2 = get_shap_values_lgbm_final_2()

feature_importance_test_2 = (
    pd.DataFrame(
        list(
            zip(
                data_for_lgbm_test_2.columns,
                np.abs(shap_values_lgbm_test_2).mean(0).mean(0),
            )
        ),
        columns=["col_name", "importance"],
    )
    .sort_values(by=["importance"], ascending=False)
    .reset_index()
)

LightGBM binary classifier with TreeExplainer shap values output has changed to a list of ndarray

Code

sns.set_style("whitegrid")
lgb.plot_importance(
    models_final_2["LGBM (47 feat. | final)"]["classifier"],
    max_num_features=50,
    figsize=(8, 9),
    height=0.8,
    title="LGBM Feature Importance (Final Model)",
);

Code

sns.set_style("whitegrid")
shap.summary_plot(
    shap_values_lgbm_test_2[1],
    data_for_lgbm_test_2,
    plot_type="bar",
    max_display=50,
    plot_size=(10, 6),
    show=False,
)
plt.tick_params(axis="y", labelsize=8)
plt.title("SHAP Feature Importance (Final Model)", fontsize=12)
plt.xlabel("Mean (|SHAP value|) \n(impact on model output)", fontsize=10);

Code

sns.set_style("whitegrid")
shap.summary_plot(
    shap_values_lgbm_test_2[1],
    data_for_lgbm_test_2,
    max_display=50,
    plot_size=(10, 6),
    show=False,
)
plt.title("SHAP Feature Importance (Final Model)", fontsize=12)
plt.tick_params(axis="y", labelsize=8)
plt.tick_params(axis="x", labelsize=8)

7.8 Model for Deployment (w/ Historical Data)

Merge all data to train the final model:

Code

# For deployment
X_credit_all = pd.concat([X_credit_train, X_credit_validation, X_credit_test], axis=0)
y_credit_all = pd.concat([y_credit_train, y_credit_validation, y_credit_test], axis=0)

pipeline_to_deploy_2 = clone(pipeline_final_2_with_47_feat)
pipeline_to_deploy_2 = pipeline_to_deploy_2.fit(X_credit_all, y_credit_all)

[LightGBM] [Info] Number of positive: 24825, number of negative: 282686
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 7851
[LightGBM] [Info] Number of data points in the train set: 307511, number of used features: 47
[LightGBM] [Info] Using GPU Device: NVIDIA GeForce GTX 950M, Vendor: NVIDIA Corporation
[LightGBM] [Info] Compiling OpenCL Kernel with 256 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 8
[LightGBM] [Info] 34 dense feature groups (10.56 MB) transferred to GPU in 0.029807 secs. 1 sparse feature groups
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=0.000000
[LightGBM] [Info] Start training from score 0.000000

For simplicity, the model will be deployed without the pre-processing pipeline.

Code

# Extract and save classifier
classifier_to_deploy_2 = pipeline_to_deploy_2.named_steps["classifier"]

with open("models/classifier-2--with_credit_history.pickle", "wb") as f:
    joblib.dump(classifier_to_deploy_2, f)
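
Because the classifier is saved without the pre-processing steps, whoever calls it has to supply already pre-processed feature values in the column order used during training. A minimal sketch of a guard that could sit in front of the classifier is shown below. The function order_features, the feature names, and the payload are hypothetical and are not taken from the deployed service.

```python
def order_features(payload, expected_features):
    """Return payload values in the classifier's expected column order.

    Raises ValueError when features are missing or unexpected.
    """
    missing = [name for name in expected_features if name not in payload]
    unexpected = [name for name in payload if name not in expected_features]
    if missing or unexpected:
        raise ValueError(f"missing={missing}, unexpected={unexpected}")
    return [payload[name] for name in expected_features]


# Hypothetical feature names and values; the real model expects 47 features.
expected = ["feature_a", "feature_b", "feature_c"]
request_payload = {"feature_c": 0.3, "feature_a": 1.0, "feature_b": 0.0}

# The values come back in training order, regardless of the payload order.
row = order_features(request_payload, expected)
print(row)
```

With a fitted LGBMClassifier, the expected names could presumably be read from its feature_name_ attribute instead of being hard-coded.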

8 Final Remarks

In binary classification, the default threshold of 0.5 was used.
Threshold adjustment (e.g., via ROC curve analysis) might be beneficial.
Hyperparameter tuning was not effective in this analysis: the tuned models overfitted. Overfitting should be addressed by limiting model complexity. Unfortunately, there was not enough time to do this.
Only the LGBM model was used. Trying other model types (e.g., logistic regression, Naive Bayes, or neural networks) might be beneficial too.
In some cases, more self-explanatory variable names could be used.
There is a lot of repeated code between the modeling parts with and without historical data. This could be improved, but it would have required much more time.
Some results and lines of code could be described in more detail.
Not all functions from the functions subfolder were used. They should be treated as a separate module.
The requirements.txt file was created with the pip freeze > requirements.txt command. This is not the best way to create this file, as it lists more packages than needed.
Consulting a domain expert would help to understand the data better and to improve the model.
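
The remark on threshold adjustment can be illustrated with Youden's J statistic (J = TPR - FPR), one common criterion in ROC curve analysis. The sketch below is a plain-Python illustration on made-up toy data, not the procedure used in this project. The function name youden_threshold is an assumption.

```python
def youden_threshold(y_true, y_score):
    """Return (threshold, J) maximizing J = TPR - FPR over observed scores."""
    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("Both classes must be present in y_true.")

    best_threshold, best_j = None, float("-inf")
    for threshold in sorted(set(y_score), reverse=True):
        # Predict the positive class when the score is at or above the threshold.
        tp = sum(1 for y, s in zip(y_true, y_score) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, y_score) if s >= threshold and y == 0)
        j = tp / n_pos - fp / n_neg
        if j > best_j:
            best_threshold, best_j = threshold, j
    return best_threshold, best_j


# Toy labels and predicted probabilities (made up for illustration only).
y_true_toy = [0, 0, 0, 0, 1, 0, 1, 1]
y_score_toy = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.80]

threshold, j_stat = youden_threshold(y_true_toy, y_score_toy)
print(f"Selected threshold: {threshold}, Youden's J: {j_stat:.2f}")
```

This loop is quadratic in the number of samples, so for real data a vectorized routine such as sklearn.metrics.roc_curve would be preferable. The threshold should be chosen on the validation set, so that the test set remains an unbiased estimate of performance.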
