Home Credit Default Risk Modeling

Data Analysis Project

Author

Vilmantas Gėgžna

Published

2023-12-27

Updated

2023-12-29

Home Credit Default Risk project logo. Originally generated with Leonardo.Ai.

Annotation

In this project, an extensive analysis of credit data from Home Credit Group was undertaken. Two distinct models were developed to predict whether a loan applicant will repay the loan or face financial difficulties: one model operates independently of credit history data, and the other incorporates it. After rigorous testing, the models were deployed on the Google Cloud Platform and are now available via an API. Surprisingly, the results indicate only a marginal improvement in model performance when historical credit data is included.

The homepage of deployed models is currently available at:
https://home-credit-default-prediction-sarhiiybua-ew.a.run.app/

Examples of how to use the API are available in this README file.

Abbreviations

  • API: application programming interface;
  • AUC: area under the ROC curve;
  • BAcc_01: balanced accuracy score normalized to interval [0, 1];
  • BAcc: balanced accuracy score;
  • EDA: exploratory data analysis;
  • F1_neg: F1 score for the negative class;
  • F1: F1 score (usually for the positive class);
  • K: thousand;
  • M: million;
  • ML: machine learning;
  • NPV: negative predictive value;
  • PPV: positive predictive value (precision);
  • ROC: receiver operating characteristic;
  • SHAP: SHapley Additive exPlanations;
  • TNR: true negative rate;
  • TPR: true positive rate (recall).
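
The relationships between several of these metrics can be sketched from the four cells of a binary confusion matrix. This is a minimal illustration, not the project's own code: the function name `binary_metrics` and the toy counts are hypothetical, and the formula used for BAcc_01 (rescaling so that chance level maps to 0 and a perfect score to 1) is an assumption about how the normalization is defined.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute common binary classification metrics from confusion-matrix counts."""
    tpr = tp / (tp + fn)  # true positive rate (recall)
    tnr = tn / (tn + fp)  # true negative rate
    ppv = tp / (tp + fp)  # positive predictive value (precision)
    npv = tn / (tn + fn)  # negative predictive value
    bacc = (tpr + tnr) / 2  # balanced accuracy
    # ASSUMPTION: BAcc_01 rescales BAcc so that 0.5 (chance) -> 0 and 1 -> 1.
    bacc_01 = 2 * bacc - 1
    f1 = 2 * ppv * tpr / (ppv + tpr)  # F1 for the positive class
    f1_neg = 2 * npv * tnr / (npv + tnr)  # F1 for the negative class
    return {
        "TPR": tpr,
        "TNR": tnr,
        "PPV": ppv,
        "NPV": npv,
        "BAcc": bacc,
        "BAcc_01": bacc_01,
        "F1": f1,
        "F1_neg": f1_neg,
    }


# Toy counts (hypothetical, not results from this project)
metrics = binary_metrics(tp=40, fp=10, tn=80, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With an imbalanced target, BAcc and the class-specific F1 scores are usually more informative than plain accuracy, which is presumably why they appear in the list above.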

1 Plan

In this project, a dataset from Home Credit Group will be analyzed. The main purpose of this analysis is to investigate whether including credit history data significantly increases the performance of models that predict whether a client who is about to take out a loan will face financial difficulties. For this purpose, the following plan will be implemented:

  1. EDA on data will be performed.
  2. A predictive model based on data from the application table (no credit history data included) will be created (the first model);
    • assumption: this data is always available.
  3. The first model will be deployed.
  4. Data from the remaining tables will be pre-processed (extracted, aggregated) and merged with the applications dataset.
  5. The model based on all currently available data (including credit and loan history) will be created (the second model).
    • assumption: the data in tables other than application might be updated rarely, might be costly to acquire, or might even be unavailable for some people who want to take out a loan.
  6. The second model will be deployed.

2 Setup

Some preparation steps are described in the README.md file of the project. The Python code that imports the necessary tools:

# Automatically reload certain modules
%reload_ext autoreload
%autoreload 1

# Plotting
%matplotlib inline

# Packages and modules -------------------------------
# Utilities
import os
import warnings
import joblib

# Data frames
import pandas as pd

# EDA and plotting
import seaborn as sns
import matplotlib.pyplot as plt

import sweetviz
import klib

# Data wrangling, maths, feature engineering
import numpy as np

# Patch sklearn with Intel's version
from sklearnex import patch_sklearn

patch_sklearn()  # Run this code before importing from sklearn

# Machine learning
import lightgbm as lgb
from sklearn import set_config

from sklearn.base import clone, BaseEstimator, TransformerMixin

from sklearn.pipeline import Pipeline
from sklearn.compose import (
    ColumnTransformer,
    make_column_selector,
)
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer

from sklearn.model_selection import (
    cross_val_score,
    train_test_split,
    StratifiedKFold,
)

# ML: classification models
from lightgbm import LGBMClassifier

# ML: feature engineering and selection
from feature_engine.selection import (
    DropDuplicateFeatures,
    SmartCorrelatedSelection,
)

# ML: hyperparameter tuning
import optuna

# ML: explainability
import shap

# Display
from IPython.display import display

# Custom functions
import functions.fun_utils as my
import functions.fun_analysis as an
import functions.fun_ml as ml
from functions.utils import (
    ColumnSelector,
    CleanColumnNames,
)

%aimport functions.fun_utils
%aimport functions.fun_analysis
%aimport functions.fun_ml
%aimport functions.utils

# Settings --------------------------------------------
# Default plot options
plt.rc("figure", titleweight="bold")
plt.rc("axes", labelweight="bold", titleweight="bold")
plt.rc("font", weight="normal", size=10)
plt.rc("figure", figsize=(10, 3))

# Pandas options
pd.set_option("display.max_rows", 1000)
pd.set_option("display.max_columns", 300)
pd.set_option("display.max_colwidth", 50)  # Possible option: None
pd.set_option("display.float_format", lambda x: f"{x:.2f}")
pd.set_option("styler.format.thousands", ",")

# Turn off the scientific notation for floating point numbers.
np.set_printoptions(suppress=True)

# Scikit-learn options
set_config(transform_output="pandas")

# Analysis parameters: use Sweetviz for eda?
do_eda = True

# For caching results ---------------------------------
dir_interim = "data/interim/"
os.makedirs(dir_interim, exist_ok=True)  # Also creates missing parent directories
Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)

3 Data

In this project, a dataset from Home Credit Group is investigated.

The target variable TARGET has 2 values:

  • 1 - client with payment difficulties: he/she had late payment more than X days on at least one of the first Y installments of the loan in our sample,
  • 0 - all other cases.

In most cases, the following values mean:

  • XNA: unknown / not available
  • XAP: not applicable
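
One possible way to handle these placeholder values in pandas is sketched below. This is an illustration under stated assumptions rather than the approach used in this project: the toy data frame is made up, and treating XNA as missing while keeping XAP as a regular category is only one of several reasonable choices.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, only for illustration)
toy = pd.DataFrame(
    {
        "CODE_GENDER": ["M", "XNA", "F"],
        "NAME_CASH_LOAN_PURPOSE": ["XAP", "Repairs", "XNA"],
    }
)

# Count how often each placeholder occurs per column
placeholder_counts = pd.DataFrame(
    {code: toy.eq(code).sum() for code in ["XNA", "XAP"]}
)
print(placeholder_counts)

# ASSUMPTION: "unknown" (XNA) is treated as missing, whereas
# "not applicable" (XAP) is kept as an informative category.
cleaned = toy.replace({"XNA": np.nan})
print(cleaned.isna().sum())
```

Keeping the two codes separate preserves the distinction between "unknown" and "not applicable", which may carry different information for a model.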

More information on the dataset:

The dataset was downloaded as a zip file and extracted into the data folder (data/raw/ and data/info/ directories; more details in the next section).

3.1 Explore Data Files

In this section, the data and metadata files are explored.

The files with metadata and descriptions are stored in the data/info directory; they were acquired from the main source as well as from elsewhere.

Details:

Code
!echo Files with data description:
!ls data/info/
Files with data description:
HomeCredit_column_descriptions.xlsx
HomeCredit_columns_description.csv
description--Home Credit Default Risk.pdf

The files with datasets are stored in the data/raw directory.

Code
!echo Data files:
!ls data/raw/
Data files:
POS_CASH_balance.csv
application_test.csv
application_train.csv
bureau.csv
bureau_balance.csv
credit_card_balance.csv
installments_payments.csv
previous_application.csv
sample_submission.csv
Code
!echo File sizes:
!cd data/raw/ &&\
   du -m * | sed 's/\([0-9]\+\)/\1 MB /'
File sizes:
375 MB  POS_CASH_balance.csv
26 MB   application_test.csv
159 MB  application_train.csv
163 MB  bureau.csv
359 MB  bureau_balance.csv
405 MB  credit_card_balance.csv
690 MB  installments_payments.csv
387 MB  previous_application.csv
1 MB    sample_submission.csv
Code
# NOTE: header line is also included here
!echo Number of lines per file:
!cd data/raw/ &&\
    wc --lines *
Number of lines per file:
  10001359 POS_CASH_balance.csv
     48745 application_test.csv
    307512 application_train.csv
   1716429 bureau.csv
  27299926 bureau_balance.csv
   3840313 credit_card_balance.csv
  13605402 installments_payments.csv
   1670215 previous_application.csv
     48745 sample_submission.csv
  58538646 total
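
Because each count above includes the header line, the number of data rows per file is one less than the number of lines. A small sketch of this arithmetic follows; it assumes that every file has exactly one header line and that no quoted field contains an embedded line break. The variable names are illustrative only.

```python
# Line counts as reported by `wc --lines` above (header line included)
line_counts = {
    "POS_CASH_balance.csv": 10001359,
    "application_test.csv": 48745,
    "application_train.csv": 307512,
    "bureau.csv": 1716429,
    "bureau_balance.csv": 27299926,
    "credit_card_balance.csv": 3840313,
    "installments_payments.csv": 13605402,
    "previous_application.csv": 1670215,
    "sample_submission.csv": 48745,
}

# ASSUMPTION: one header line per file, no multi-line records
row_counts = {name: n_lines - 1 for name, n_lines in line_counts.items()}

for name, n_rows in row_counts.items():
    print(f"{name:<28} {n_rows:>12,}")
```

For example, application_train.csv has 307,512 lines and therefore 307,511 data rows.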

The first few rows of application_train.csv, shown first as raw text and then formatted as a table:

Code
!cd data/raw/ &&\
    head -n 5 application_train.csv
SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,NAME_TYPE_SUITE,NAME_INCOME_TYPE,NAME_EDUCATION_TYPE,NAME_FAMILY_STATUS,NAME_HOUSING_TYPE,REGION_POPULATION_RELATIVE,DAYS_BIRTH,DAYS_EMPLOYED,DAYS_REGISTRATION,DAYS_ID_PUBLISH,OWN_CAR_AGE,FLAG_MOBIL,FLAG_EMP_PHONE,FLAG_WORK_PHONE,FLAG_CONT_MOBILE,FLAG_PHONE,FLAG_EMAIL,OCCUPATION_TYPE,CNT_FAM_MEMBERS,REGION_RATING_CLIENT,REGION_RATING_CLIENT_W_CITY,WEEKDAY_APPR_PROCESS_START,HOUR_APPR_PROCESS_START,REG_REGION_NOT_LIVE_REGION,REG_REGION_NOT_WORK_REGION,LIVE_REGION_NOT_WORK_REGION,REG_CITY_NOT_LIVE_CITY,REG_CITY_NOT_WORK_CITY,LIVE_CITY_NOT_WORK_CITY,ORGANIZATION_TYPE,EXT_SOURCE_1,EXT_SOURCE_2,EXT_SOURCE_3,APARTMENTS_AVG,BASEMENTAREA_AVG,YEARS_BEGINEXPLUATATION_AVG,YEARS_BUILD_AVG,COMMONAREA_AVG,ELEVATORS_AVG,ENTRANCES_AVG,FLOORSMAX_AVG,FLOORSMIN_AVG,LANDAREA_AVG,LIVINGAPARTMENTS_AVG,LIVINGAREA_AVG,NONLIVINGAPARTMENTS_AVG,NONLIVINGAREA_AVG,APARTMENTS_MODE,BASEMENTAREA_MODE,YEARS_BEGINEXPLUATATION_MODE,YEARS_BUILD_MODE,COMMONAREA_MODE,ELEVATORS_MODE,ENTRANCES_MODE,FLOORSMAX_MODE,FLOORSMIN_MODE,LANDAREA_MODE,LIVINGAPARTMENTS_MODE,LIVINGAREA_MODE,NONLIVINGAPARTMENTS_MODE,NONLIVINGAREA_MODE,APARTMENTS_MEDI,BASEMENTAREA_MEDI,YEARS_BEGINEXPLUATATION_MEDI,YEARS_BUILD_MEDI,COMMONAREA_MEDI,ELEVATORS_MEDI,ENTRANCES_MEDI,FLOORSMAX_MEDI,FLOORSMIN_MEDI,LANDAREA_MEDI,LIVINGAPARTMENTS_MEDI,LIVINGAREA_MEDI,NONLIVINGAPARTMENTS_MEDI,NONLIVINGAREA_MEDI,FONDKAPREMONT_MODE,HOUSETYPE_MODE,TOTALAREA_MODE,WALLSMATERIAL_MODE,EMERGENCYSTATE_MODE,OBS_30_CNT_SOCIAL_CIRCLE,DEF_30_CNT_SOCIAL_CIRCLE,OBS_60_CNT_SOCIAL_CIRCLE,DEF_60_CNT_SOCIAL_CIRCLE,DAYS_LAST_PHONE_CHANGE,FLAG_DOCUMENT_2,FLAG_DOCUMENT_3,FLAG_DOCUMENT_4,FLAG_DOCUMENT_5,FLAG_DOCUMENT_6,FLAG_DOCUMENT_7,FLAG_DOCUMENT_8,FLAG_DOCUMENT_9,FLAG_DOCUMENT_10,FLAG_DOCUMENT_11,FLAG_DOCUMENT_12,FLAG_DOCUMENT_13,FLAG_DOCUMENT_14,FLAG_DOCUMENT_15,FLAG_DOCUMENT_16,FLAG_DOCUMENT_17,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
100002,1,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,351000.0,Unaccompanied,Working,Secondary / secondary special,Single / not married,House / apartment,0.018801,-9461,-637,-3648.0,-2120,,1,1,0,1,1,0,Laborers,1.0,2,2,WEDNESDAY,10,0,0,0,0,0,0,Business Entity Type 3,0.08303696739132256,0.2629485927471776,0.13937578009978951,0.0247,0.0369,0.9722,0.6192,0.0143,0.0,0.069,0.0833,0.125,0.0369,0.0202,0.019,0.0,0.0,0.0252,0.0383,0.9722,0.6341,0.0144,0.0,0.069,0.0833,0.125,0.0377,0.022,0.0198,0.0,0.0,0.025,0.0369,0.9722,0.6243,0.0144,0.0,0.069,0.0833,0.125,0.0375,0.0205,0.0193,0.0,0.0,reg oper account,block of flats,0.0149,"Stone, brick",No,2.0,2.0,2.0,2.0,-1134.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
100003,0,Cash loans,F,N,N,0,270000.0,1293502.5,35698.5,1129500.0,Family,State servant,Higher education,Married,House / apartment,0.003540999999999999,-16765,-1188,-1186.0,-291,,1,1,0,1,1,0,Core staff,2.0,1,1,MONDAY,11,0,0,0,0,0,0,School,0.3112673113812225,0.6222457752555098,,0.0959,0.0529,0.9851,0.7959999999999999,0.0605,0.08,0.0345,0.2917,0.3333,0.013,0.0773,0.0549,0.0039,0.0098,0.0924,0.0538,0.9851,0.804,0.0497,0.0806,0.0345,0.2917,0.3333,0.0128,0.079,0.0554,0.0,0.0,0.0968,0.0529,0.9851,0.7987,0.0608,0.08,0.0345,0.2917,0.3333,0.0132,0.0787,0.0558,0.0039,0.01,reg oper account,block of flats,0.0714,Block,No,1.0,0.0,1.0,0.0,-828.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
100004,0,Revolving loans,M,Y,Y,0,67500.0,135000.0,6750.0,135000.0,Unaccompanied,Working,Secondary / secondary special,Single / not married,House / apartment,0.010032,-19046,-225,-4260.0,-2531,26.0,1,1,1,1,1,0,Laborers,1.0,2,2,MONDAY,9,0,0,0,0,0,0,Government,,0.5559120833904428,0.7295666907060153,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,0.0,0.0,-815.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
100006,0,Cash loans,F,N,Y,0,135000.0,312682.5,29686.5,297000.0,Unaccompanied,Working,Secondary / secondary special,Civil marriage,House / apartment,0.008019,-19005,-3039,-9833.0,-2437,,1,1,0,1,0,0,Laborers,2.0,2,2,WEDNESDAY,17,0,0,0,0,0,0,Business Entity Type 3,,0.6504416904014653,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2.0,0.0,2.0,0.0,-617.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,,,,,
Code
!cd data/raw/ &&\
    head -n 5 application_train.csv | csvlook
| SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL |  AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | NAME_TYPE_SUITE | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE           | NAME_FAMILY_STATUS   | NAME_HOUSING_TYPE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | DAYS_ID_PUBLISH | OWN_CAR_AGE | FLAG_MOBIL | FLAG_EMP_PHONE | FLAG_WORK_PHONE | FLAG_CONT_MOBILE | FLAG_PHONE | FLAG_EMAIL | OCCUPATION_TYPE | CNT_FAM_MEMBERS | REGION_RATING_CLIENT | REGION_RATING_CLIENT_W_CITY | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | REG_REGION_NOT_LIVE_REGION | REG_REGION_NOT_WORK_REGION | LIVE_REGION_NOT_WORK_REGION | REG_CITY_NOT_LIVE_CITY | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | ORGANIZATION_TYPE      | EXT_SOURCE_1 | EXT_SOURCE_2 | EXT_SOURCE_3 | APARTMENTS_AVG | BASEMENTAREA_AVG | YEARS_BEGINEXPLUATATION_AVG | YEARS_BUILD_AVG | COMMONAREA_AVG | ELEVATORS_AVG | ENTRANCES_AVG | FLOORSMAX_AVG | FLOORSMIN_AVG | LANDAREA_AVG | LIVINGAPARTMENTS_AVG | LIVINGAREA_AVG | NONLIVINGAPARTMENTS_AVG | NONLIVINGAREA_AVG | APARTMENTS_MODE | BASEMENTAREA_MODE | YEARS_BEGINEXPLUATATION_MODE | YEARS_BUILD_MODE | COMMONAREA_MODE | ELEVATORS_MODE | ENTRANCES_MODE | FLOORSMAX_MODE | FLOORSMIN_MODE | LANDAREA_MODE | LIVINGAPARTMENTS_MODE | LIVINGAREA_MODE | NONLIVINGAPARTMENTS_MODE | NONLIVINGAREA_MODE | APARTMENTS_MEDI | BASEMENTAREA_MEDI | YEARS_BEGINEXPLUATATION_MEDI | YEARS_BUILD_MEDI | COMMONAREA_MEDI | ELEVATORS_MEDI | ENTRANCES_MEDI | FLOORSMAX_MEDI | FLOORSMIN_MEDI | LANDAREA_MEDI | LIVINGAPARTMENTS_MEDI | LIVINGAREA_MEDI | NONLIVINGAPARTMENTS_MEDI | NONLIVINGAREA_MEDI | FONDKAPREMONT_MODE | HOUSETYPE_MODE | TOTALAREA_MODE | WALLSMATERIAL_MODE | EMERGENCYSTATE_MODE | OBS_30_CNT_SOCIAL_CIRCLE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | DAYS_LAST_PHONE_CHANGE | FLAG_DOCUMENT_2 | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_4 | FLAG_DOCUMENT_5 | 
FLAG_DOCUMENT_6 | FLAG_DOCUMENT_7 | FLAG_DOCUMENT_8 | FLAG_DOCUMENT_9 | FLAG_DOCUMENT_10 | FLAG_DOCUMENT_11 | FLAG_DOCUMENT_12 | FLAG_DOCUMENT_13 | FLAG_DOCUMENT_14 | FLAG_DOCUMENT_15 | FLAG_DOCUMENT_16 | FLAG_DOCUMENT_17 | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR |
| ---------- | ------ | ------------------ | ----------- | ------------ | --------------- | ------------ | ---------------- | ----------- | ----------- | --------------- | --------------- | ---------------- | ----------------------------- | -------------------- | ----------------- | -------------------------- | ---------- | ------------- | ----------------- | --------------- | ----------- | ---------- | -------------- | --------------- | ---------------- | ---------- | ---------- | --------------- | --------------- | -------------------- | --------------------------- | -------------------------- | ----------------------- | -------------------------- | -------------------------- | --------------------------- | ---------------------- | ---------------------- | ----------------------- | ---------------------- | ------------ | ------------ | ------------ | -------------- | ---------------- | --------------------------- | --------------- | -------------- | ------------- | ------------- | ------------- | ------------- | ------------ | -------------------- | -------------- | ----------------------- | ----------------- | --------------- | ----------------- | ---------------------------- | ---------------- | --------------- | -------------- | -------------- | -------------- | -------------- | ------------- | --------------------- | --------------- | ------------------------ | ------------------ | --------------- | ----------------- | ---------------------------- | ---------------- | --------------- | -------------- | -------------- | -------------- | -------------- | ------------- | --------------------- | --------------- | ------------------------ | ------------------ | ------------------ | -------------- | -------------- | ------------------ | ------------------- | ------------------------ | ------------------------ | ------------------------ | ------------------------ | ---------------------- | --------------- | --------------- | --------------- | --------------- | 
--------------- | --------------- | --------------- | --------------- | ---------------- | ---------------- | ---------------- | ---------------- | ---------------- | ---------------- | ---------------- | ---------------- | ---------------- | ---------------- | ---------------- | ---------------- | -------------------------- | ------------------------- | -------------------------- | ------------------------- | ------------------------- | -------------------------- |
|    100,002 |   True | Cash loans         | M           |        False |            True |        False |          202,500 |   406,597.5 |    24,700.5 |         351,000 | Unaccompanied   | Working          | Secondary / secondary special | Single / not married | House / apartment |                     0.019… |     -9,461 |          -637 |            -3,648 |          -2,120 |             |       True |           True |           False |             True |       True |      False | Laborers        |               1 |                    2 |                           2 |                 0001-01-03 |                      10 |                      False |                      False |                       False |                  False |                  False |                   False | Business Entity Type 3 |       0.083… |       0.263… |       0.139… |         0.025… |           0.037… |                      0.972… |          0.619… |         0.014… |          0.00 |        0.069… |        0.083… |        0.125… |       0.037… |               0.020… |         0.019… |                  0.000… |            0.000… |          0.025… |            0.038… |                       0.972… |           0.634… |          0.014… |         0.000… |         0.069… |         0.083… |         0.125… |        0.038… |                 0.022 |          0.020… |                        0 |                  0 |          0.025… |            0.037… |                       0.972… |           0.624… |          0.014… |           0.00 |         0.069… |         0.083… |         0.125… |        0.038… |                0.020… |          0.019… |                   0.000… |               0.00 | reg oper account   | block of flats |         0.015… | Stone, brick       |               False |                        2 |                        2 |                        2 |                        2 |                 -1,134 |           False |            True |           False |           False |       
    False |           False |           False |           False |            False |            False |            False |            False |            False |            False |            False |            False |            False |            False |            False |            False |                          0 |                         0 |                          0 |                         0 |                         0 |                          1 |
|    100,003 |  False | Cash loans         | F           |        False |           False |        False |          270,000 | 1,293,502.5 |    35,698.5 |       1,129,500 | Family          | State servant    | Higher education              | Married              | House / apartment |                     0.004… |    -16,765 |        -1,188 |            -1,186 |            -291 |             |       True |           True |           False |             True |       True |      False | Core staff      |               2 |                    1 |                           1 |                 0001-01-08 |                      11 |                      False |                      False |                       False |                  False |                  False |                   False | School                 |       0.311… |       0.622… |              |         0.096… |           0.053… |                      0.985… |          0.796… |         0.060… |          0.08 |        0.034… |        0.292… |        0.333… |       0.013… |               0.077… |         0.055… |                  0.004… |            0.010… |          0.092… |            0.054… |                       0.985… |           0.804… |          0.050… |         0.081… |         0.034… |         0.292… |         0.333… |        0.013… |                 0.079 |          0.055… |                        0 |                  0 |          0.097… |            0.053… |                       0.985… |           0.799… |          0.061… |           0.08 |         0.034… |         0.292… |         0.333… |        0.013… |                0.079… |          0.056… |                   0.004… |               0.01 | reg oper account   | block of flats |         0.071… | Block              |               False |                        1 |                        0 |                        1 |                        0 |                   -828 |           False |            True |           False |           False |       
    False |           False |           False |           False |            False |            False |            False |            False |            False |            False |            False |            False |            False |            False |            False |            False |                          0 |                         0 |                          0 |                         0 |                         0 |                          0 |
|    100,004 |  False | Revolving loans    | M           |         True |            True |        False |           67,500 |   135,000.0 |     6,750.0 |         135,000 | Unaccompanied   | Working          | Secondary / secondary special | Single / not married | House / apartment |                     0.010… |    -19,046 |          -225 |            -4,260 |          -2,531 |          26 |       True |           True |            True |             True |       True |      False | Laborers        |               1 |                    2 |                           2 |                 0001-01-08 |                       9 |                      False |                      False |                       False |                  False |                  False |                   False | Government             |              |       0.556… |       0.730… |                |                  |                             |                 |                |               |               |               |               |              |                      |                |                         |                   |                 |                   |                              |                  |                 |                |                |                |                |               |                       |                 |                          |                    |                 |                   |                              |                  |                 |                |                |                |                |               |                       |                 |                          |                    |                    |                |                |                    |                     |                        0 |                        0 |                        0 |                        0 |                   -815 |           False |           False |           False |           False |       
    False |           False |           False |           False |            False |            False |            False |            False |            False |            False |            False |            False |            False |            False |            False |            False |                          0 |                         0 |                          0 |                         0 |                         0 |                          0 |
|    100,006 |  False | Cash loans         | F           |        False |            True |        False |          135,000 |   312,682.5 |    29,686.5 |         297,000 | Unaccompanied   | Working          | Secondary / secondary special | Civil marriage       | House / apartment |                     0.008… |    -19,005 |        -3,039 |            -9,833 |          -2,437 |             |       True |           True |           False |             True |      False |      False | Laborers        |               2 |                    2 |                           2 |                 0001-01-03 |                      17 |                      False |                      False |                       False |                  False |                  False |                   False | Business Entity Type 3 |              |       0.650… |              |                |                  |                             |                 |                |               |               |               |               |              |                      |                |                         |                   |                 |                   |                              |                  |                 |                |                |                |                |               |                       |                 |                          |                    |                 |                   |                              |                  |                 |                |                |                |                |               |                       |                 |                          |                    |                    |                |                |                    |                     |                        2 |                        0 |                        2 |                        0 |                   -617 |           False |            True |           False |           False |       
    False |           False |           False |           False |            False |            False |            False |            False |            False |            False |            False |            False |            False |            False |            False |            False |                            |                           |                            |                           |                           |                            |

3.2 Read Data

Next, the data files are read into the Python environment.

Potentially useful files are:

  • application_train.csv
  • bureau.csv
  • bureau_balance.csv
  • previous_application.csv
  • POS_CASH_balance.csv
  • credit_card_balance.csv
  • installments_payments.csv

Files to discard from the analysis:

  • application_test.csv: no target variable.
  • sample_submission.csv: just an example of a submission file, not relevant for the analysis.
Code
if not os.path.exists("data/interim/raw--application.feather"):
    # Read CSV data, convert to smaller data types
    # and save as Feather files
    application = pd.read_csv("data/raw/application_train.csv").pipe(
        klib.convert_datatypes
    )
    bureau = pd.read_csv("data/raw/bureau.csv").pipe(klib.convert_datatypes)
    bureau_balance = pd.read_csv("data/raw/bureau_balance.csv").pipe(
        klib.convert_datatypes
    )
    previous_application = pd.read_csv("data/raw/previous_application.csv").pipe(
        klib.convert_datatypes
    )
    credit_card_balance = pd.read_csv("data/raw/credit_card_balance.csv").pipe(
        klib.convert_datatypes
    )
    installments_payments = pd.read_csv("data/raw/installments_payments.csv").pipe(
        klib.convert_datatypes
    )
    pos_cash_balance = pd.read_csv("data/raw/POS_CASH_balance.csv").pipe(
        klib.convert_datatypes
    )

    application_test = pd.read_csv("data/raw/application_test.csv").pipe(
        klib.convert_datatypes
    )
    sample_submission = pd.read_csv("data/raw/sample_submission.csv").pipe(
        klib.convert_datatypes
    )
    # Time to read CSV data: 2m 1.1s

    # Use Feather format for quicker loading of data
    application.to_feather("data/interim/raw--application.feather")
    bureau.to_feather("data/interim/raw--bureau.feather")
    bureau_balance.to_feather("data/interim/raw--bureau_balance.feather")
    previous_application.to_feather("data/interim/raw--previous_application.feather")
    credit_card_balance.to_feather("data/interim/raw--credit_card_balance.feather")
    installments_payments.to_feather("data/interim/raw--installments_payments.feather")
    pos_cash_balance.to_feather("data/interim/raw--pos_cash_balance.feather")

    application_test.to_feather("data/interim/raw--application_test.feather")
    sample_submission.to_feather("data/interim/raw--sample_submission.feather")

else:
    # Read cached data
    application = pd.read_feather("data/interim/raw--application.feather")
    bureau = pd.read_feather("data/interim/raw--bureau.feather")
    bureau_balance = pd.read_feather("data/interim/raw--bureau_balance.feather")
    previous_application = pd.read_feather(
        "data/interim/raw--previous_application.feather"
    )
    credit_card_balance = pd.read_feather(
        "data/interim/raw--credit_card_balance.feather"
    )
    installments_payments = pd.read_feather(
        "data/interim/raw--installments_payments.feather"
    )
    pos_cash_balance = pd.read_feather("data/interim/raw--pos_cash_balance.feather")

    application_test = pd.read_feather("data/interim/raw--application_test.feather")
    sample_submission = pd.read_feather("data/interim/raw--sample_submission.feather")
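
The read-or-cache pattern used above can also be written as a small reusable helper. The following is a minimal sketch, not the project's code: the function `read_cached` and its parameters are hypothetical, and the optional `convert` argument merely stands in for a data type conversion step such as `klib.convert_datatypes`. Unlike the code above, which checks only for the cached application file, this helper checks the cache of each file separately.

```python
from pathlib import Path

import pandas as pd


def read_cached(
    csv_path,
    cache_path,
    convert=None,
    reader=pd.read_feather,
    writer=pd.DataFrame.to_feather,
):
    """Read a CSV file once and reuse a cached copy on later calls.

    All names here are hypothetical. The defaults cache to Feather,
    which requires the `pyarrow` package to be installed.
    """
    cache_path = Path(cache_path)
    if cache_path.exists():
        # Cached copy found: skip the slow CSV parsing
        return reader(cache_path)

    df = pd.read_csv(csv_path)
    if convert is not None:
        # Optional step, e.g. conversion to smaller data types
        df = convert(df)

    cache_path.parent.mkdir(parents=True, exist_ok=True)
    writer(df, cache_path)
    return df
```

Assuming the same directory layout as above, a call could look like `bureau = read_cached("data/raw/bureau.csv", "data/interim/raw--bureau.feather")`. Passing `reader=pd.read_pickle` and `writer=pd.DataFrame.to_pickle` would switch the cache format without changing the helper.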

3.3 Inspect Data

Next, the data files are inspected. The purpose of this step is to get a general idea of the data and to spot potential issues. More detailed data exploration (EDA) will be performed later on training data only.

Table application_train has the largest number of columns (122), and table bureau_balance has the largest number of rows (27.3M). The next sub-sections explore each table in more detail.
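
A comparison of table sizes like the one above can be obtained with a short helper. This is a sketch under the assumption that the tables are collected in a dictionary; the function `summarize_shapes` and the toy data frames are hypothetical and only illustrate the idea.

```python
import pandas as pd


def summarize_shapes(tables):
    """Return one row per table with its number of rows and columns."""
    rows = [(name, df.shape[0], df.shape[1]) for name, df in tables.items()]
    summary = pd.DataFrame(rows, columns=["table", "n_rows", "n_cols"])
    return summary.set_index("table").sort_values("n_rows", ascending=False)


# Toy tables (hypothetical stand-ins for the real data frames)
toy_tables = {
    "wide": pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]}),
    "long": pd.DataFrame({"a": range(5)}),
}
shape_summary = summarize_shapes(toy_tables)
print(shape_summary)
```

With the real data, the dictionary could map names such as "application" and "bureau_balance" to the corresponding data frames read in the previous section.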

3.3.1 Table application

Code
application.info(verbose=False)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307511 entries, 0 to 307510
Columns: 122 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR
dtypes: category(16), float32(64), float64(1), int16(2), int32(2), int8(37)
memory usage: 96.5 MB
Code
application.head()
SK_ID_CURR TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE NAME_TYPE_SUITE NAME_INCOME_TYPE NAME_EDUCATION_TYPE NAME_FAMILY_STATUS NAME_HOUSING_TYPE REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION DAYS_ID_PUBLISH OWN_CAR_AGE FLAG_MOBIL FLAG_EMP_PHONE FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL OCCUPATION_TYPE CNT_FAM_MEMBERS REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY WEEKDAY_APPR_PROCESS_START HOUR_APPR_PROCESS_START REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY ORGANIZATION_TYPE EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3 APARTMENTS_AVG BASEMENTAREA_AVG YEARS_BEGINEXPLUATATION_AVG YEARS_BUILD_AVG COMMONAREA_AVG ELEVATORS_AVG ENTRANCES_AVG FLOORSMAX_AVG FLOORSMIN_AVG LANDAREA_AVG LIVINGAPARTMENTS_AVG LIVINGAREA_AVG NONLIVINGAPARTMENTS_AVG NONLIVINGAREA_AVG APARTMENTS_MODE BASEMENTAREA_MODE YEARS_BEGINEXPLUATATION_MODE YEARS_BUILD_MODE COMMONAREA_MODE ELEVATORS_MODE ENTRANCES_MODE FLOORSMAX_MODE FLOORSMIN_MODE LANDAREA_MODE LIVINGAPARTMENTS_MODE LIVINGAREA_MODE NONLIVINGAPARTMENTS_MODE NONLIVINGAREA_MODE APARTMENTS_MEDI BASEMENTAREA_MEDI YEARS_BEGINEXPLUATATION_MEDI YEARS_BUILD_MEDI COMMONAREA_MEDI ELEVATORS_MEDI ENTRANCES_MEDI FLOORSMAX_MEDI FLOORSMIN_MEDI LANDAREA_MEDI LIVINGAPARTMENTS_MEDI LIVINGAREA_MEDI NONLIVINGAPARTMENTS_MEDI NONLIVINGAREA_MEDI FONDKAPREMONT_MODE HOUSETYPE_MODE TOTALAREA_MODE WALLSMATERIAL_MODE EMERGENCYSTATE_MODE OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE FLAG_DOCUMENT_2 FLAG_DOCUMENT_3 FLAG_DOCUMENT_4 FLAG_DOCUMENT_5 FLAG_DOCUMENT_6 FLAG_DOCUMENT_7 FLAG_DOCUMENT_8 FLAG_DOCUMENT_9 FLAG_DOCUMENT_10 FLAG_DOCUMENT_11 FLAG_DOCUMENT_12 FLAG_DOCUMENT_13 FLAG_DOCUMENT_14 FLAG_DOCUMENT_15 FLAG_DOCUMENT_16 FLAG_DOCUMENT_17 FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
0 100002 1 Cash loans M N Y 0 202500.00 406597.50 24700.50 351000.00 Unaccompanied Working Secondary / secondary special Single / not married House / apartment 0.02 -9461 -637 -3648.00 -2120 NaN 1 1 0 1 1 0 Laborers 1.00 2 2 WEDNESDAY 10 0 0 0 0 0 0 Business Entity Type 3 0.08 0.26 0.14 0.02 0.04 0.97 0.62 0.01 0.00 0.07 0.08 0.12 0.04 0.02 0.02 0.00 0.00 0.03 0.04 0.97 0.63 0.01 0.00 0.07 0.08 0.12 0.04 0.02 0.02 0.00 0.00 0.03 0.04 0.97 0.62 0.01 0.00 0.07 0.08 0.12 0.04 0.02 0.02 0.00 0.00 reg oper account block of flats 0.01 Stone, brick No 2.00 2.00 2.00 2.00 -1134.00 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00 0.00 0.00 0.00 0.00 1.00
1 100003 0 Cash loans F N N 0 270000.00 1293502.50 35698.50 1129500.00 Family State servant Higher education Married House / apartment 0.00 -16765 -1188 -1186.00 -291 NaN 1 1 0 1 1 0 Core staff 2.00 1 1 MONDAY 11 0 0 0 0 0 0 School 0.31 0.62 NaN 0.10 0.05 0.99 0.80 0.06 0.08 0.03 0.29 0.33 0.01 0.08 0.05 0.00 0.01 0.09 0.05 0.99 0.80 0.05 0.08 0.03 0.29 0.33 0.01 0.08 0.06 0.00 0.00 0.10 0.05 0.99 0.80 0.06 0.08 0.03 0.29 0.33 0.01 0.08 0.06 0.00 0.01 reg oper account block of flats 0.07 Block No 1.00 0.00 1.00 0.00 -828.00 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00 0.00 0.00 0.00 0.00 0.00
2 100004 0 Revolving loans M Y Y 0 67500.00 135000.00 6750.00 135000.00 Unaccompanied Working Secondary / secondary special Single / not married House / apartment 0.01 -19046 -225 -4260.00 -2531 26.00 1 1 1 1 1 0 Laborers 1.00 2 2 MONDAY 9 0 0 0 0 0 0 Government NaN 0.56 0.73 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.00 0.00 0.00 0.00 -815.00 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00 0.00 0.00 0.00 0.00 0.00
3 100006 0 Cash loans F N Y 0 135000.00 312682.50 29686.50 297000.00 Unaccompanied Working Secondary / secondary special Civil marriage House / apartment 0.01 -19005 -3039 -9833.00 -2437 NaN 1 1 0 1 0 0 Laborers 2.00 2 2 WEDNESDAY 17 0 0 0 0 0 0 Business Entity Type 3 NaN 0.65 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2.00 0.00 2.00 0.00 -617.00 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 NaN NaN NaN NaN NaN NaN
4 100007 0 Cash loans M N Y 0 121500.00 513000.00 21865.50 513000.00 Unaccompanied Working Secondary / secondary special Single / not married House / apartment 0.03 -19932 -3038 -4311.00 -3458 NaN 1 1 0 1 0 0 Core staff 1.00 2 2 THURSDAY 11 0 0 0 0 1 1 Religion NaN 0.32 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.00 0.00 0.00 0.00 -1106.00 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00 0.00 0.00 0.00 0.00 0.00
Code
an.col_info(application, style=True)
  column data_type memory_size n_unique p_unique n_missing p_missing n_dominant p_dominant p_dom_excl_na dominant
1 SK_ID_CURR int32 1.2 MB 307,511 100.0% 0 0% 1 <0.1% <0.1% 100002
2 TARGET int8 307.5 kB 2 <0.1% 0 0% 282,686 91.9% 91.9% 0
3 NAME_CONTRACT_TYPE category 307.8 kB 2 <0.1% 0 0% 278,232 90.5% 90.5% Cash loans
4 CODE_GENDER category 307.8 kB 3 <0.1% 0 0% 202,448 65.8% 65.8% F
5 FLAG_OWN_CAR category 307.7 kB 2 <0.1% 0 0% 202,924 66.0% 66.0% N
6 FLAG_OWN_REALTY category 307.7 kB 2 <0.1% 0 0% 213,312 69.4% 69.4% Y
7 CNT_CHILDREN int8 307.5 kB 15 <0.1% 0 0% 215,371 70.0% 70.0% 0
8 AMT_INCOME_TOTAL float64 2.5 MB 2,548 0.8% 0 0% 35,750 11.6% 11.6% 135000.0
9 AMT_CREDIT float32 1.2 MB 5,603 1.8% 0 0% 9,709 3.2% 3.2% 450000.0
10 AMT_ANNUITY float32 1.2 MB 13,672 4.4% 12 <0.1% 6,385 2.1% 2.1% 9000.0
11 AMT_GOODS_PRICE float32 1.2 MB 1,002 0.3% 278 0.1% 26,022 8.5% 8.5% 450000.0
12 NAME_TYPE_SUITE category 308.3 kB 7 <0.1% 1,292 0.4% 248,526 80.8% 81.2% Unaccompanied
13 NAME_INCOME_TYPE category 308.4 kB 8 <0.1% 0 0% 158,774 51.6% 51.6% Working
14 NAME_EDUCATION_TYPE category 308.1 kB 5 <0.1% 0 0% 218,391 71.0% 71.0% Secondary / secondary special
15 NAME_FAMILY_STATUS category 308.1 kB 6 <0.1% 0 0% 196,432 63.9% 63.9% Married
16 NAME_HOUSING_TYPE category 308.1 kB 6 <0.1% 0 0% 272,868 88.7% 88.7% House / apartment
17 REGION_POPULATION_RELATIVE float32 1.2 MB 81 <0.1% 0 0% 16,408 5.3% 5.3% 0.035792
18 DAYS_BIRTH int16 615.0 kB 17,460 5.7% 0 0% 43 <0.1% <0.1% -13749
19 DAYS_EMPLOYED int32 1.2 MB 12,574 4.1% 0 0% 55,374 18.0% 18.0% 365243
20 DAYS_REGISTRATION float32 1.2 MB 15,688 5.1% 0 0% 113 <0.1% <0.1% -1.0
21 DAYS_ID_PUBLISH int16 615.0 kB 6,168 2.0% 0 0% 169 0.1% 0.1% -4053
22 OWN_CAR_AGE float32 1.2 MB 62 <0.1% 202,929 66.0% 7,424 2.4% 7.1% 7.0
23 FLAG_MOBIL int8 307.5 kB 2 <0.1% 0 0% 307,510 >99.9% >99.9% 1
24 FLAG_EMP_PHONE int8 307.5 kB 2 <0.1% 0 0% 252,125 82.0% 82.0% 1
25 FLAG_WORK_PHONE int8 307.5 kB 2 <0.1% 0 0% 246,203 80.1% 80.1% 0
26 FLAG_CONT_MOBILE int8 307.5 kB 2 <0.1% 0 0% 306,937 99.8% 99.8% 1
27 FLAG_PHONE int8 307.5 kB 2 <0.1% 0 0% 221,080 71.9% 71.9% 0
28 FLAG_EMAIL int8 307.5 kB 2 <0.1% 0 0% 290,069 94.3% 94.3% 0
29 OCCUPATION_TYPE category 309.3 kB 18 <0.1% 96,391 31.3% 55,186 17.9% 26.1% Laborers
30 CNT_FAM_MEMBERS float32 1.2 MB 17 <0.1% 2 <0.1% 158,357 51.5% 51.5% 2.0
31 REGION_RATING_CLIENT int8 307.5 kB 3 <0.1% 0 0% 226,984 73.8% 73.8% 2
32 REGION_RATING_CLIENT_W_CITY int8 307.5 kB 3 <0.1% 0 0% 229,484 74.6% 74.6% 2
33 WEEKDAY_APPR_PROCESS_START category 308.3 kB 7 <0.1% 0 0% 53,901 17.5% 17.5% TUESDAY
34 HOUR_APPR_PROCESS_START int8 307.5 kB 24 <0.1% 0 0% 37,722 12.3% 12.3% 10
35 REG_REGION_NOT_LIVE_REGION int8 307.5 kB 2 <0.1% 0 0% 302,854 98.5% 98.5% 0
36 REG_REGION_NOT_WORK_REGION int8 307.5 kB 2 <0.1% 0 0% 291,899 94.9% 94.9% 0
37 LIVE_REGION_NOT_WORK_REGION int8 307.5 kB 2 <0.1% 0 0% 295,008 95.9% 95.9% 0
38 REG_CITY_NOT_LIVE_CITY int8 307.5 kB 2 <0.1% 0 0% 283,472 92.2% 92.2% 0
39 REG_CITY_NOT_WORK_CITY int8 307.5 kB 2 <0.1% 0 0% 236,644 77.0% 77.0% 0
40 LIVE_CITY_NOT_WORK_CITY int8 307.5 kB 2 <0.1% 0 0% 252,296 82.0% 82.0% 0
41 ORGANIZATION_TYPE category 313.6 kB 58 <0.1% 0 0% 67,992 22.1% 22.1% Business Entity Type 3
42 EXT_SOURCE_1 float32 1.2 MB 114,584 37.3% 173,378 56.4% 5 <0.1% <0.1% 0.62270665
43 EXT_SOURCE_2 float32 1.2 MB 119,831 39.0% 660 0.2% 721 0.2% 0.2% 0.28589788
44 EXT_SOURCE_3 float32 1.2 MB 814 0.3% 60,965 19.8% 1,460 0.5% 0.6% 0.7463002
45 APARTMENTS_AVG float32 1.2 MB 2,339 0.8% 156,061 50.7% 6,663 2.2% 4.4% 0.0825
46 BASEMENTAREA_AVG float32 1.2 MB 3,780 1.2% 179,943 58.5% 14,745 4.8% 11.6% 0.0
47 YEARS_BEGINEXPLUATATION_AVG float32 1.2 MB 285 0.1% 150,007 48.8% 4,311 1.4% 2.7% 0.9871
48 YEARS_BUILD_AVG float32 1.2 MB 149 <0.1% 204,488 66.5% 2,999 1.0% 2.9% 0.8232
49 COMMONAREA_AVG float32 1.2 MB 3,181 1.0% 214,865 69.9% 8,442 2.7% 9.1% 0.0
50 ELEVATORS_AVG float32 1.2 MB 257 0.1% 163,891 53.3% 85,718 27.9% 59.7% 0.0
51 ENTRANCES_AVG float32 1.2 MB 285 0.1% 154,828 50.3% 34,007 11.1% 22.3% 0.1379
52 FLOORSMAX_AVG float32 1.2 MB 403 0.1% 153,020 49.8% 61,875 20.1% 40.1% 0.1667
53 FLOORSMIN_AVG float32 1.2 MB 305 0.1% 208,642 67.8% 32,875 10.7% 33.3% 0.2083
54 LANDAREA_AVG float32 1.2 MB 3,527 1.1% 182,590 59.4% 15,600 5.1% 12.5% 0.0
55 LIVINGAPARTMENTS_AVG float32 1.2 MB 1,868 0.6% 210,199 68.4% 4,272 1.4% 4.4% 0.0504
56 LIVINGAREA_AVG float32 1.2 MB 5,199 1.7% 154,350 50.2% 284 0.1% 0.2% 0.0
57 NONLIVINGAPARTMENTS_AVG float32 1.2 MB 386 0.1% 213,514 69.4% 54,549 17.7% 58.0% 0.0
58 NONLIVINGAREA_AVG float32 1.2 MB 3,290 1.1% 169,682 55.2% 58,735 19.1% 42.6% 0.0
59 APARTMENTS_MODE float32 1.2 MB 760 0.2% 156,061 50.7% 7,522 2.4% 5.0% 0.084
60 BASEMENTAREA_MODE float32 1.2 MB 3,841 1.2% 179,943 58.5% 16,598 5.4% 13.0% 0.0
61 YEARS_BEGINEXPLUATATION_MODE float32 1.2 MB 221 0.1% 150,007 48.8% 4,291 1.4% 2.7% 0.9871
62 YEARS_BUILD_MODE float32 1.2 MB 154 0.1% 204,488 66.5% 2,960 1.0% 2.9% 0.8301
63 COMMONAREA_MODE float32 1.2 MB 3,128 1.0% 214,865 69.9% 9,690 3.2% 10.5% 0.0
64 ELEVATORS_MODE float32 1.2 MB 26 <0.1% 163,891 53.3% 89,498 29.1% 62.3% 0.0
65 ENTRANCES_MODE float32 1.2 MB 30 <0.1% 154,828 50.3% 36,041 11.7% 23.6% 0.1379
66 FLOORSMAX_MODE float32 1.2 MB 25 <0.1% 153,020 49.8% 65,550 21.3% 42.4% 0.1667
67 FLOORSMIN_MODE float32 1.2 MB 25 <0.1% 208,642 67.8% 34,403 11.2% 34.8% 0.2083
68 LANDAREA_MODE float32 1.2 MB 3,563 1.2% 182,590 59.4% 17,453 5.7% 14.0% 0.0
69 LIVINGAPARTMENTS_MODE float32 1.2 MB 736 0.2% 210,199 68.4% 4,931 1.6% 5.1% 0.0551
70 LIVINGAREA_MODE float32 1.2 MB 5,301 1.7% 154,350 50.2% 444 0.1% 0.3% 0.0
71 NONLIVINGAPARTMENTS_MODE float32 1.2 MB 167 0.1% 213,514 69.4% 59,255 19.3% 63.0% 0.0
72 NONLIVINGAREA_MODE float32 1.2 MB 3,327 1.1% 169,682 55.2% 67,126 21.8% 48.7% 0.0
73 APARTMENTS_MEDI float32 1.2 MB 1,148 0.4% 156,061 50.7% 7,109 2.3% 4.7% 0.0833
74 BASEMENTAREA_MEDI float32 1.2 MB 3,772 1.2% 179,943 58.5% 14,991 4.9% 11.8% 0.0
75 YEARS_BEGINEXPLUATATION_MEDI float32 1.2 MB 245 0.1% 150,007 48.8% 4,314 1.4% 2.7% 0.9871
76 YEARS_BUILD_MEDI float32 1.2 MB 151 <0.1% 204,488 66.5% 2,994 1.0% 2.9% 0.8256
77 COMMONAREA_MEDI float32 1.2 MB 3,202 1.0% 214,865 69.9% 8,691 2.8% 9.4% 0.0
78 ELEVATORS_MEDI float32 1.2 MB 46 <0.1% 163,891 53.3% 87,026 28.3% 60.6% 0.0
79 ENTRANCES_MEDI float32 1.2 MB 46 <0.1% 154,828 50.3% 35,535 11.6% 23.3% 0.1379
80 FLOORSMAX_MEDI float32 1.2 MB 49 <0.1% 153,020 49.8% 63,607 20.7% 41.2% 0.1667
81 FLOORSMIN_MEDI float32 1.2 MB 47 <0.1% 208,642 67.8% 33,737 11.0% 34.1% 0.2083
82 LANDAREA_MEDI float32 1.2 MB 3,560 1.2% 182,590 59.4% 15,919 5.2% 12.7% 0.0
83 LIVINGAPARTMENTS_MEDI float32 1.2 MB 1,097 0.4% 210,199 68.4% 4,500 1.5% 4.6% 0.0513
84 LIVINGAREA_MEDI float32 1.2 MB 5,281 1.7% 154,350 50.2% 299 0.1% 0.2% 0.0
85 NONLIVINGAPARTMENTS_MEDI float32 1.2 MB 214 0.1% 213,514 69.4% 56,097 18.2% 59.7% 0.0
86 NONLIVINGAREA_MEDI float32 1.2 MB 3,323 1.1% 169,682 55.2% 60,954 19.8% 44.2% 0.0
87 FONDKAPREMONT_MODE category 308.0 kB 4 <0.1% 210,295 68.4% 73,830 24.0% 75.9% reg oper account
88 HOUSETYPE_MODE category 307.8 kB 3 <0.1% 154,297 50.2% 150,503 48.9% 98.2% block of flats
89 TOTALAREA_MODE float32 1.2 MB 5,116 1.7% 148,431 48.3% 582 0.2% 0.4% 0.0
90 WALLSMATERIAL_MODE category 308.3 kB 7 <0.1% 156,341 50.8% 66,040 21.5% 43.7% Panel
91 EMERGENCYSTATE_MODE category 307.7 kB 2 <0.1% 145,755 47.4% 159,428 51.8% 98.6% No
92 OBS_30_CNT_SOCIAL_CIRCLE float32 1.2 MB 33 <0.1% 1,021 0.3% 163,910 53.3% 53.5% 0.0
93 DEF_30_CNT_SOCIAL_CIRCLE float32 1.2 MB 10 <0.1% 1,021 0.3% 271,324 88.2% 88.5% 0.0
94 OBS_60_CNT_SOCIAL_CIRCLE float32 1.2 MB 33 <0.1% 1,021 0.3% 164,666 53.5% 53.7% 0.0
95 DEF_60_CNT_SOCIAL_CIRCLE float32 1.2 MB 9 <0.1% 1,021 0.3% 280,721 91.3% 91.6% 0.0
96 DAYS_LAST_PHONE_CHANGE float32 1.2 MB 3,773 1.2% 1 <0.1% 37,672 12.3% 12.3% 0.0
97 FLAG_DOCUMENT_2 int8 307.5 kB 2 <0.1% 0 0% 307,498 >99.9% >99.9% 0
98 FLAG_DOCUMENT_3 int8 307.5 kB 2 <0.1% 0 0% 218,340 71.0% 71.0% 1
99 FLAG_DOCUMENT_4 int8 307.5 kB 2 <0.1% 0 0% 307,486 >99.9% >99.9% 0
100 FLAG_DOCUMENT_5 int8 307.5 kB 2 <0.1% 0 0% 302,863 98.5% 98.5% 0
101 FLAG_DOCUMENT_6 int8 307.5 kB 2 <0.1% 0 0% 280,433 91.2% 91.2% 0
102 FLAG_DOCUMENT_7 int8 307.5 kB 2 <0.1% 0 0% 307,452 >99.9% >99.9% 0
103 FLAG_DOCUMENT_8 int8 307.5 kB 2 <0.1% 0 0% 282,487 91.9% 91.9% 0
104 FLAG_DOCUMENT_9 int8 307.5 kB 2 <0.1% 0 0% 306,313 99.6% 99.6% 0
105 FLAG_DOCUMENT_10 int8 307.5 kB 2 <0.1% 0 0% 307,504 >99.9% >99.9% 0
106 FLAG_DOCUMENT_11 int8 307.5 kB 2 <0.1% 0 0% 306,308 99.6% 99.6% 0
107 FLAG_DOCUMENT_12 int8 307.5 kB 2 <0.1% 0 0% 307,509 >99.9% >99.9% 0
108 FLAG_DOCUMENT_13 int8 307.5 kB 2 <0.1% 0 0% 306,427 99.6% 99.6% 0
109 FLAG_DOCUMENT_14 int8 307.5 kB 2 <0.1% 0 0% 306,608 99.7% 99.7% 0
110 FLAG_DOCUMENT_15 int8 307.5 kB 2 <0.1% 0 0% 307,139 99.9% 99.9% 0
111 FLAG_DOCUMENT_16 int8 307.5 kB 2 <0.1% 0 0% 304,458 99.0% 99.0% 0
112 FLAG_DOCUMENT_17 int8 307.5 kB 2 <0.1% 0 0% 307,429 >99.9% >99.9% 0
113 FLAG_DOCUMENT_18 int8 307.5 kB 2 <0.1% 0 0% 305,011 99.2% 99.2% 0
114 FLAG_DOCUMENT_19 int8 307.5 kB 2 <0.1% 0 0% 307,328 99.9% 99.9% 0
115 FLAG_DOCUMENT_20 int8 307.5 kB 2 <0.1% 0 0% 307,355 99.9% 99.9% 0
116 FLAG_DOCUMENT_21 int8 307.5 kB 2 <0.1% 0 0% 307,408 >99.9% >99.9% 0
117 AMT_REQ_CREDIT_BUREAU_HOUR float32 1.2 MB 5 <0.1% 41,519 13.5% 264,366 86.0% 99.4% 0.0
118 AMT_REQ_CREDIT_BUREAU_DAY float32 1.2 MB 9 <0.1% 41,519 13.5% 264,503 86.0% 99.4% 0.0
119 AMT_REQ_CREDIT_BUREAU_WEEK float32 1.2 MB 9 <0.1% 41,519 13.5% 257,456 83.7% 96.8% 0.0
120 AMT_REQ_CREDIT_BUREAU_MON float32 1.2 MB 24 <0.1% 41,519 13.5% 222,233 72.3% 83.5% 0.0
121 AMT_REQ_CREDIT_BUREAU_QRT float32 1.2 MB 11 <0.1% 41,519 13.5% 215,417 70.1% 81.0% 0.0
122 AMT_REQ_CREDIT_BUREAU_YEAR float32 1.2 MB 25 <0.1% 41,519 13.5% 71,801 23.3% 27.0% 0.0
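The an.col_info helper appears to be a project-specific function, and its implementation is not shown in this report. As a rough sketch only, several of its columns can be approximated with plain pandas. The function name column_summary and the toy table are hypothetical, and the formulas are inferred from the output above: p_dominant is assumed to be the dominant count divided by all rows, and p_dom_excl_na the dominant count divided by non-missing rows.

```python
import numpy as np
import pandas as pd


def column_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column overview (hypothetical stand-in, not the author's an.col_info)."""
    n_rows = len(df)
    records = []
    for name in df.columns:
        col = df[name]
        counts = col.value_counts(dropna=True)  # sorted by frequency, NaN ignored
        n_missing = int(col.isna().sum())
        n_non_missing = n_rows - n_missing
        n_dominant = int(counts.iloc[0]) if len(counts) else 0
        records.append(
            {
                "column": name,
                "data_type": str(col.dtype),
                "n_unique": int(col.nunique(dropna=True)),
                "n_missing": n_missing,
                "p_missing": 100 * n_missing / n_rows if n_rows else np.nan,
                "n_dominant": n_dominant,
                "p_dominant": 100 * n_dominant / n_rows if n_rows else np.nan,
                "p_dom_excl_na": (
                    100 * n_dominant / n_non_missing if n_non_missing else np.nan
                ),
                "dominant": counts.index[0] if len(counts) else None,
            }
        )
    return pd.DataFrame.from_records(records)


# Toy data for illustration only (not taken from the Home Credit tables).
toy = pd.DataFrame(
    {"STATUS": ["C", "C", "0", None], "MONTHS_BALANCE": [0, -1, -2, -3]}
)
summary = column_summary(toy)
print(summary)
```

Reporting the dominant share both with and without missing values is useful for columns such as OWN_CAR_AGE, where most rows are missing and the two percentages differ substantially.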
Code
if do_eda:
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")

        report_inspect_1 = sweetviz.analyze(
            [application, "application"],
            pairwise_analysis="off",
        )
        report_inspect_1.show_notebook()

Let’s test the hypothesis that males (M) and females (F) have different repayment abilities. For this purpose, the chi-squared test of independence will be used. The CODE_GENDER and TARGET columns are selected from the application table and treated as nominal variables; the few rows with CODE_GENDER coded as XNA are excluded. The null hypothesis is that the proportions of 0 (no financial difficulties) and 1 (financial difficulties) are the same in both groups. The alternative hypothesis is that the proportions differ.

The test shows that the difference is statistically significant (p < 0.001), and the frequency table (see below) shows that the share of clients with financial difficulties is approximately 3 percentage points higher among males (10.1%) than among females (7.0%).

Note. Because basing financial decisions on gender is illegal in many countries, this variable will be excluded from further analysis.

Code
sns.set_theme(style="whitegrid")
crosstab = an.CrossTab(
    "CODE_GENDER", "TARGET", application.query("CODE_GENDER != 'XNA'")
)
crosstab.barplot(normalize="rows", stacked=True)
print(crosstab.chisq_test())
# The percentages of each row add up to 100%
crosstab.row_percentage.style.format("{:.1f}%")
chi-square test, χ²(1, n = 307507) = 920.01, p < 0.001
TARGET 0 1
CODE_GENDER    
F 93.0% 7.0%
M 89.9% 10.1%
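For readers without access to the an.CrossTab helper, the same kind of test can be sketched with pandas and SciPy alone. This is a minimal illustration, not the author's implementation: the counts below are toy values chosen only to mimic the row percentages above, and SciPy's default continuity correction for 2×2 tables may differ from the settings used to produce the reported statistic.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy counts for illustration only (NOT the real Home Credit frequencies).
counts = {("F", 0): 930, ("F", 1): 70, ("M", 0): 899, ("M", 1): 101}
rows = [
    {"CODE_GENDER": gender, "TARGET": target}
    for (gender, target), n in counts.items()
    for _ in range(n)
]
toy = pd.DataFrame(rows)

# Contingency table of observed counts: rows are genders, columns are targets.
table = pd.crosstab(toy["CODE_GENDER"], toy["TARGET"])

# Chi-squared test of independence (Yates' correction is applied by default
# to 2x2 tables; pass correction=False to switch it off).
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2({dof}, n = {len(toy)}) = {chi2:.2f}, p = {p_value:.4f}")

# Row percentages: each row adds up to 100%.
row_pct = pd.crosstab(toy["CODE_GENDER"], toy["TARGET"], normalize="index") * 100
print(row_pct.round(1))
```

With a sample as large as the application table, even small differences in proportions yield very small p-values, so the row percentages are the more informative measure of effect size.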


3.3.2 Table bureau

Code
bureau.info(verbose=False)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1716428 entries, 0 to 1716427
Columns: 17 entries, SK_ID_CURR to AMT_ANNUITY
dtypes: category(3), float32(2), float64(6), int16(2), int32(3), int8(1)
memory usage: 124.4 MB
Code
bureau.head()
SK_ID_CURR SK_ID_BUREAU CREDIT_ACTIVE CREDIT_CURRENCY DAYS_CREDIT CREDIT_DAY_OVERDUE DAYS_CREDIT_ENDDATE DAYS_ENDDATE_FACT AMT_CREDIT_MAX_OVERDUE CNT_CREDIT_PROLONG AMT_CREDIT_SUM AMT_CREDIT_SUM_DEBT AMT_CREDIT_SUM_LIMIT AMT_CREDIT_SUM_OVERDUE CREDIT_TYPE DAYS_CREDIT_UPDATE AMT_ANNUITY
0 215354 5714462 Closed currency 1 -497 0 -153.00 -153.00 NaN 0 91323.00 0.00 NaN 0.00 Consumer credit -131 NaN
1 215354 5714463 Active currency 1 -208 0 1075.00 NaN NaN 0 225000.00 171342.00 NaN 0.00 Credit card -20 NaN
2 215354 5714464 Active currency 1 -203 0 528.00 NaN NaN 0 464323.50 NaN NaN 0.00 Consumer credit -16 NaN
3 215354 5714465 Active currency 1 -203 0 NaN NaN NaN 0 90000.00 NaN NaN 0.00 Credit card -16 NaN
4 215354 5714466 Active currency 1 -629 0 1197.00 NaN 77674.50 0 2700000.00 NaN NaN 0.00 Consumer credit -21 NaN
Code
an.col_info(bureau, style=True)
  column data_type memory_size n_unique p_unique n_missing p_missing n_dominant p_dominant p_dom_excl_na dominant
1 SK_ID_CURR int32 6.9 MB 305,811 17.8% 0 0% 116 <0.1% <0.1% 120860
2 SK_ID_BUREAU int32 6.9 MB 1,716,428 100.0% 0 0% 1 <0.1% <0.1% 5714462
3 CREDIT_ACTIVE category 1.7 MB 4 <0.1% 0 0% 1,079,273 62.9% 62.9% Closed
4 CREDIT_CURRENCY category 1.7 MB 4 <0.1% 0 0% 1,715,020 99.9% 99.9% currency 1
5 DAYS_CREDIT int16 3.4 MB 2,923 0.2% 0 0% 1,330 0.1% 0.1% -364
6 CREDIT_DAY_OVERDUE int16 3.4 MB 942 0.1% 0 0% 1,712,211 99.8% 99.8% 0
7 DAYS_CREDIT_ENDDATE float32 6.9 MB 14,096 0.8% 105,553 6.1% 883 0.1% 0.1% 0.0
8 DAYS_ENDDATE_FACT float32 6.9 MB 2,917 0.2% 633,653 36.9% 811 <0.1% 0.1% -329.0
9 AMT_CREDIT_MAX_OVERDUE float64 13.7 MB 68,251 4.0% 1,124,488 65.5% 470,650 27.4% 79.5% 0.0
10 CNT_CREDIT_PROLONG int8 1.7 MB 10 <0.1% 0 0% 1,707,314 99.5% 99.5% 0
11 AMT_CREDIT_SUM float64 13.7 MB 236,708 13.8% 13 <0.1% 66,582 3.9% 3.9% 0.0
12 AMT_CREDIT_SUM_DEBT float64 13.7 MB 226,537 13.2% 257,669 15.0% 1,016,434 59.2% 69.7% 0.0
13 AMT_CREDIT_SUM_LIMIT float64 13.7 MB 51,726 3.0% 591,780 34.5% 1,050,142 61.2% 93.4% 0.0
14 AMT_CREDIT_SUM_OVERDUE float64 13.7 MB 1,616 0.1% 0 0% 1,712,270 99.8% 99.8% 0.0
15 CREDIT_TYPE category 1.7 MB 15 <0.1% 0 0% 1,251,615 72.9% 72.9% Consumer credit
16 DAYS_CREDIT_UPDATE int32 6.9 MB 2,982 0.2% 0 0% 18,503 1.1% 1.1% -7
17 AMT_ANNUITY float64 13.7 MB 40,321 2.3% 1,226,791 71.5% 256,915 15.0% 52.5% 0.0
Code
if do_eda:
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")

        report_inspect_2 = sweetviz.analyze(
            [bureau, "bureau"],
            pairwise_analysis="off",
        )
        report_inspect_2.show_notebook()

3.3.3 Table bureau_balance

Code
bureau_balance.info(verbose=False)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27299925 entries, 0 to 27299924
Columns: 3 entries, SK_ID_BUREAU to STATUS
dtypes: category(1), int32(1), int8(1)
memory usage: 156.2 MB
Code
bureau_balance.head()
SK_ID_BUREAU MONTHS_BALANCE STATUS
0 5715448 0 C
1 5715448 -1 C
2 5715448 -2 C
3 5715448 -3 C
4 5715448 -4 C
Code
an.col_info(bureau_balance, style=True)
  column data_type memory_size n_unique p_unique n_missing p_missing n_dominant p_dominant p_dom_excl_na dominant
1 SK_ID_BUREAU int32 109.2 MB 817,395 3.0% 0 0% 97 <0.1% <0.1% 5645521
2 MONTHS_BALANCE int8 27.3 MB 97 <0.1% 0 0% 622,601 2.3% 2.3% -1
3 STATUS category 27.3 MB 8 <0.1% 0 0% 13,646,993 50.0% 50.0% C
Code
if do_eda:
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")

        report_inspect_3 = sweetviz.analyze(
            [bureau_balance, "bureau_balance"],
            pairwise_analysis="off",
        )
        report_inspect_3.show_notebook()

3.3.4 Table previous_application

Code
previous_application.info(verbose=False)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1670214 entries, 0 to 1670213
Columns: 37 entries, SK_ID_PREV to NFLAG_INSURED_ON_APPROVAL
dtypes: category(16), float32(10), float64(5), int16(1), int32(3), int8(2)
memory usage: 178.4 MB
Code
previous_application.head()
SK_ID_PREV SK_ID_CURR NAME_CONTRACT_TYPE AMT_ANNUITY AMT_APPLICATION AMT_CREDIT AMT_DOWN_PAYMENT AMT_GOODS_PRICE WEEKDAY_APPR_PROCESS_START HOUR_APPR_PROCESS_START FLAG_LAST_APPL_PER_CONTRACT NFLAG_LAST_APPL_IN_DAY RATE_DOWN_PAYMENT RATE_INTEREST_PRIMARY RATE_INTEREST_PRIVILEGED NAME_CASH_LOAN_PURPOSE NAME_CONTRACT_STATUS DAYS_DECISION NAME_PAYMENT_TYPE CODE_REJECT_REASON NAME_TYPE_SUITE NAME_CLIENT_TYPE NAME_GOODS_CATEGORY NAME_PORTFOLIO NAME_PRODUCT_TYPE CHANNEL_TYPE SELLERPLACE_AREA NAME_SELLER_INDUSTRY CNT_PAYMENT NAME_YIELD_GROUP PRODUCT_COMBINATION DAYS_FIRST_DRAWING DAYS_FIRST_DUE DAYS_LAST_DUE_1ST_VERSION DAYS_LAST_DUE DAYS_TERMINATION NFLAG_INSURED_ON_APPROVAL
0 2030495 271877 Consumer loans 1730.43 17145.00 17145.00 0.00 17145.00 SATURDAY 15 Y 1 0.00 0.18 0.87 XAP Approved -73 Cash through the bank XAP NaN Repeater Mobile POS XNA Country-wide 35 Connectivity 12.00 middle POS mobile with interest 365243.00 -42.00 300.00 -42.00 -37.00 0.00
1 2802425 108129 Cash loans 25188.62 607500.00 679671.00 NaN 607500.00 THURSDAY 11 Y 1 NaN NaN NaN XNA Approved -164 XNA XAP Unaccompanied Repeater XNA Cash x-sell Contact center -1 XNA 36.00 low_action Cash X-Sell: low 365243.00 -134.00 916.00 365243.00 365243.00 1.00
2 2523466 122040 Cash loans 15060.74 112500.00 136444.50 NaN 112500.00 TUESDAY 11 Y 1 NaN NaN NaN XNA Approved -301 Cash through the bank XAP Spouse, partner Repeater XNA Cash x-sell Credit and cash offices -1 XNA 12.00 high Cash X-Sell: high 365243.00 -271.00 59.00 365243.00 365243.00 1.00
3 2819243 176158 Cash loans 47041.33 450000.00 470790.00 NaN 450000.00 MONDAY 7 Y 1 NaN NaN NaN XNA Approved -512 Cash through the bank XAP NaN Repeater XNA Cash x-sell Credit and cash offices -1 XNA 12.00 middle Cash X-Sell: middle 365243.00 -482.00 -152.00 -182.00 -177.00 1.00
4 1784265 202054 Cash loans 31924.40 337500.00 404055.00 NaN 337500.00 THURSDAY 9 Y 1 NaN NaN NaN Repairs Refused -781 Cash through the bank HC NaN Repeater XNA Cash walk-in Credit and cash offices -1 XNA 24.00 high Cash Street: high NaN NaN NaN NaN NaN NaN
Code
an.col_info(previous_application, style=True)
  column data_type memory_size n_unique p_unique n_missing p_missing n_dominant p_dominant p_dom_excl_na dominant
1 SK_ID_PREV int32 6.7 MB 1,670,214 100.0% 0 0% 1 <0.1% <0.1% 2030495
2 SK_ID_CURR int32 6.7 MB 338,857 20.3% 0 0% 77 <0.1% <0.1% 187868
3 NAME_CONTRACT_TYPE category 1.7 MB 4 <0.1% 0 0% 747,553 44.8% 44.8% Cash loans
4 AMT_ANNUITY float64 13.4 MB 357,959 21.4% 372,235 22.3% 31,865 1.9% 2.5% 2250.0
5 AMT_APPLICATION float64 13.4 MB 93,885 5.6% 0 0% 392,402 23.5% 23.5% 0.0
6 AMT_CREDIT float64 13.4 MB 86,803 5.2% 1 <0.1% 336,768 20.2% 20.2% 0.0
7 AMT_DOWN_PAYMENT float64 13.4 MB 29,278 1.8% 895,844 53.6% 369,854 22.1% 47.8% 0.0
8 AMT_GOODS_PRICE float64 13.4 MB 93,885 5.6% 385,515 23.1% 47,831 2.9% 3.7% 45000.0
9 WEEKDAY_APPR_PROCESS_START category 1.7 MB 7 <0.1% 0 0% 255,118 15.3% 15.3% TUESDAY
10 HOUR_APPR_PROCESS_START int8 1.7 MB 24 <0.1% 0 0% 192,728 11.5% 11.5% 11
11 FLAG_LAST_APPL_PER_CONTRACT category 1.7 MB 2 <0.1% 0 0% 1,661,739 99.5% 99.5% Y
12 NFLAG_LAST_APPL_IN_DAY int8 1.7 MB 2 <0.1% 0 0% 1,664,314 99.6% 99.6% 1
13 RATE_DOWN_PAYMENT float32 6.7 MB 191,301 11.5% 895,844 53.6% 369,854 22.1% 47.8% 0.0
14 RATE_INTEREST_PRIMARY float32 6.7 MB 148 <0.1% 1,664,263 99.6% 1,218 0.1% 20.5% 0.18913634
15 RATE_INTEREST_PRIVILEGED float32 6.7 MB 25 <0.1% 1,664,263 99.6% 1,717 0.1% 28.9% 0.83509517
16 NAME_CASH_LOAN_PURPOSE category 1.7 MB 25 <0.1% 0 0% 922,661 55.2% 55.2% XAP
17 NAME_CONTRACT_STATUS category 1.7 MB 4 <0.1% 0 0% 1,036,781 62.1% 62.1% Approved
18 DAYS_DECISION int16 3.3 MB 2,922 0.2% 0 0% 2,444 0.1% 0.1% -245
19 NAME_PAYMENT_TYPE category 1.7 MB 4 <0.1% 0 0% 1,033,552 61.9% 61.9% Cash through the bank
20 CODE_REJECT_REASON category 1.7 MB 9 <0.1% 0 0% 1,353,093 81.0% 81.0% XAP
21 NAME_TYPE_SUITE category 1.7 MB 7 <0.1% 820,405 49.1% 508,970 30.5% 59.9% Unaccompanied
22 NAME_CLIENT_TYPE category 1.7 MB 4 <0.1% 0 0% 1,231,261 73.7% 73.7% Repeater
23 NAME_GOODS_CATEGORY category 1.7 MB 28 <0.1% 0 0% 950,809 56.9% 56.9% XNA
24 NAME_PORTFOLIO category 1.7 MB 5 <0.1% 0 0% 691,011 41.4% 41.4% POS
25 NAME_PRODUCT_TYPE category 1.7 MB 3 <0.1% 0 0% 1,063,666 63.7% 63.7% XNA
26 CHANNEL_TYPE category 1.7 MB 8 <0.1% 0 0% 719,968 43.1% 43.1% Credit and cash offices
27 SELLERPLACE_AREA int32 6.7 MB 2,097 0.1% 0 0% 762,675 45.7% 45.7% -1
28 NAME_SELLER_INDUSTRY category 1.7 MB 11 <0.1% 0 0% 855,720 51.2% 51.2% XNA
29 CNT_PAYMENT float32 6.7 MB 49 <0.1% 372,230 22.3% 323,049 19.3% 24.9% 12.0
30 NAME_YIELD_GROUP category 1.7 MB 5 <0.1% 0 0% 517,215 31.0% 31.0% XNA
31 PRODUCT_COMBINATION category 1.7 MB 17 <0.1% 346 <0.1% 285,990 17.1% 17.1% Cash
32 DAYS_FIRST_DRAWING float32 6.7 MB 2,838 0.2% 673,065 40.3% 934,444 55.9% 93.7% 365243.0
33 DAYS_FIRST_DUE float32 6.7 MB 2,892 0.2% 673,065 40.3% 40,645 2.4% 4.1% 365243.0
34 DAYS_LAST_DUE_1ST_VERSION float32 6.7 MB 4,605 0.3% 673,065 40.3% 93,864 5.6% 9.4% 365243.0
35 DAYS_LAST_DUE float32 6.7 MB 2,873 0.2% 673,065 40.3% 211,221 12.6% 21.2% 365243.0
36 DAYS_TERMINATION float32 6.7 MB 2,830 0.2% 673,065 40.3% 225,913 13.5% 22.7% 365243.0
37 NFLAG_INSURED_ON_APPROVAL float32 6.7 MB 2 <0.1% 673,065 40.3% 665,527 39.8% 66.7% 0.0
Code
if do_eda:
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")

        report_inspect_4 = sweetviz.analyze(
            [previous_application, "previous_application"],
            pairwise_analysis="off",
        )
        report_inspect_4.show_notebook()

3.3.5 Table pos_cash_balance

Code
pos_cash_balance.info(verbose=False)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10001358 entries, 0 to 10001357
Columns: 8 entries, SK_ID_PREV to SK_DPD_DEF
dtypes: category(1), float32(2), int16(2), int32(2), int8(1)
memory usage: 209.8 MB
Code
pos_cash_balance.head()
SK_ID_PREV SK_ID_CURR MONTHS_BALANCE CNT_INSTALMENT CNT_INSTALMENT_FUTURE NAME_CONTRACT_STATUS SK_DPD SK_DPD_DEF
0 1803195 182943 -31 48.00 45.00 Active 0 0
1 1715348 367990 -33 36.00 35.00 Active 0 0
2 1784872 397406 -32 12.00 9.00 Active 0 0
3 1903291 269225 -35 48.00 42.00 Active 0 0
4 2341044 334279 -35 36.00 35.00 Active 0 0
Code
an.col_info(pos_cash_balance, style=True)
  column data_type memory_size n_unique p_unique n_missing p_missing n_dominant p_dominant p_dom_excl_na dominant
1 SK_ID_PREV int32 40.0 MB 936,325 9.4% 0 0% 96 <0.1% <0.1% 1856103
2 SK_ID_CURR int32 40.0 MB 337,252 3.4% 0 0% 295 <0.1% <0.1% 265042
3 MONTHS_BALANCE int8 10.0 MB 96 <0.1% 0 0% 216,441 2.2% 2.2% -10
4 CNT_INSTALMENT float32 40.0 MB 73 <0.1% 26,071 0.3% 2,496,845 25.0% 25.0% 12.0
5 CNT_INSTALMENT_FUTURE float32 40.0 MB 79 <0.1% 26,087 0.3% 1,185,960 11.9% 11.9% 0.0
6 NAME_CONTRACT_STATUS category 10.0 MB 9 <0.1% 0 0% 9,151,119 91.5% 91.5% Active
7 SK_DPD int16 20.0 MB 3,400 <0.1% 0 0% 9,706,131 97.0% 97.0% 0
8 SK_DPD_DEF int16 20.0 MB 2,307 <0.1% 0 0% 9,887,389 98.9% 98.9% 0
Code
if do_eda:
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")

        report_inspect_5 = sweetviz.analyze(
            [pos_cash_balance, "pos_cash_balance"],
            pairwise_analysis="off",
        )
        report_inspect_5.show_notebook()

3.3.6 Table credit_card_balance

Code
credit_card_balance.info(verbose=False)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3840312 entries, 0 to 3840311
Columns: 23 entries, SK_ID_PREV to SK_DPD_DEF
dtypes: category(1), float32(4), float64(11), int16(3), int32(3), int8(1)
memory usage: 454.1 MB
Code
credit_card_balance.head()
SK_ID_PREV SK_ID_CURR MONTHS_BALANCE AMT_BALANCE AMT_CREDIT_LIMIT_ACTUAL AMT_DRAWINGS_ATM_CURRENT AMT_DRAWINGS_CURRENT AMT_DRAWINGS_OTHER_CURRENT AMT_DRAWINGS_POS_CURRENT AMT_INST_MIN_REGULARITY AMT_PAYMENT_CURRENT AMT_PAYMENT_TOTAL_CURRENT AMT_RECEIVABLE_PRINCIPAL AMT_RECIVABLE AMT_TOTAL_RECEIVABLE CNT_DRAWINGS_ATM_CURRENT CNT_DRAWINGS_CURRENT CNT_DRAWINGS_OTHER_CURRENT CNT_DRAWINGS_POS_CURRENT CNT_INSTALMENT_MATURE_CUM NAME_CONTRACT_STATUS SK_DPD SK_DPD_DEF
0 2562384 378907 -6 56.97 135000 0.00 877.50 0.00 877.50 1700.33 1800.00 1800.00 0.00 0.00 0.00 0.00 1 0.00 1.00 35.00 Active 0 0
1 2582071 363914 -1 63975.56 45000 2250.00 2250.00 0.00 0.00 2250.00 2250.00 2250.00 60175.08 64875.56 64875.56 1.00 1 0.00 0.00 69.00 Active 0 0
2 1740877 371185 -7 31815.22 450000 0.00 0.00 0.00 0.00 2250.00 2250.00 2250.00 26926.42 31460.08 31460.08 0.00 0 0.00 0.00 30.00 Active 0 0
3 1389973 337855 -4 236572.11 225000 2250.00 2250.00 0.00 0.00 11795.76 11925.00 11925.00 224949.29 233048.97 233048.97 1.00 1 0.00 0.00 10.00 Active 0 0
4 1891521 126868 -1 453919.46 450000 0.00 11547.00 0.00 11547.00 22924.89 27000.00 27000.00 443044.40 453919.46 453919.46 0.00 1 0.00 1.00 101.00 Active 0 0
Code
an.col_info(credit_card_balance, style=True)
  column data_type memory_size n_unique p_unique n_missing p_missing n_dominant p_dominant p_dom_excl_na dominant
1 SK_ID_PREV int32 15.4 MB 104,307 2.7% 0 0% 96 <0.1% <0.1% 2377894
2 SK_ID_CURR int32 15.4 MB 103,558 2.7% 0 0% 192 <0.1% <0.1% 186401
3 MONTHS_BALANCE int8 3.8 MB 96 <0.1% 0 0% 102,115 2.7% 2.7% -4
4 AMT_BALANCE float64 30.7 MB 1,347,904 35.1% 0 0% 2,156,420 56.2% 56.2% 0.0
5 AMT_CREDIT_LIMIT_ACTUAL int32 15.4 MB 181 <0.1% 0 0% 753,823 19.6% 19.6% 0
6 AMT_DRAWINGS_ATM_CURRENT float64 30.7 MB 2,267 0.1% 749,816 19.5% 2,665,718 69.4% 86.3% 0.0
7 AMT_DRAWINGS_CURRENT float64 30.7 MB 187,005 4.9% 0 0% 3,223,443 83.9% 83.9% 0.0
8 AMT_DRAWINGS_OTHER_CURRENT float64 30.7 MB 1,832 <0.1% 749,816 19.5% 3,078,163 80.2% 99.6% 0.0
9 AMT_DRAWINGS_POS_CURRENT float64 30.7 MB 168,748 4.4% 749,816 19.5% 2,825,595 73.6% 91.4% 0.0
10 AMT_INST_MIN_REGULARITY float64 30.7 MB 312,266 8.1% 305,236 7.9% 1,928,864 50.2% 54.6% 0.0
11 AMT_PAYMENT_CURRENT float64 30.7 MB 163,209 4.2% 767,988 20.0% 390,507 10.2% 12.7% 0.0
12 AMT_PAYMENT_TOTAL_CURRENT float64 30.7 MB 182,957 4.8% 0 0% 2,172,223 56.6% 56.6% 0.0
13 AMT_RECEIVABLE_PRINCIPAL float64 30.7 MB 1,195,839 31.1% 0 0% 2,296,167 59.8% 59.8% 0.0
14 AMT_RECIVABLE float64 30.7 MB 1,338,878 34.9% 0 0% 2,113,816 55.0% 55.0% 0.0
15 AMT_TOTAL_RECEIVABLE float64 30.7 MB 1,339,008 34.9% 0 0% 2,113,643 55.0% 55.0% 0.0
16 CNT_DRAWINGS_ATM_CURRENT float32 15.4 MB 44 <0.1% 749,816 19.5% 2,665,718 69.4% 86.3% 0.0
17 CNT_DRAWINGS_CURRENT int16 7.7 MB 129 <0.1% 0 0% 3,229,952 84.1% 84.1% 0
18 CNT_DRAWINGS_OTHER_CURRENT float32 15.4 MB 11 <0.1% 749,816 19.5% 3,077,688 80.1% 99.6% 0.0
19 CNT_DRAWINGS_POS_CURRENT float32 15.4 MB 133 <0.1% 749,816 19.5% 2,825,594 73.6% 91.4% 0.0
20 CNT_INSTALMENT_MATURE_CUM float32 15.4 MB 121 <0.1% 305,236 7.9% 551,467 14.4% 15.6% 0.0
21 NAME_CONTRACT_STATUS category 3.8 MB 7 <0.1% 0 0% 3,698,436 96.3% 96.3% Active
22 SK_DPD int16 7.7 MB 917 <0.1% 0 0% 3,686,957 96.0% 96.0% 0
23 SK_DPD_DEF int16 7.7 MB 378 <0.1% 0 0% 3,750,972 97.7% 97.7% 0
Code
if do_eda:
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")

        report_inspect_6 = sweetviz.analyze(
            [credit_card_balance, "credit_card_balance"],
            pairwise_analysis="off",
        )
        report_inspect_6.show_notebook()

3.3.7 Table installments_payments

Code
installments_payments.info(verbose=False)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13605401 entries, 0 to 13605400
Columns: 8 entries, SK_ID_PREV to AMT_PAYMENT
dtypes: float32(3), float64(2), int16(1), int32(2)
memory usage: 493.1 MB
Code
installments_payments.head()
SK_ID_PREV SK_ID_CURR NUM_INSTALMENT_VERSION NUM_INSTALMENT_NUMBER DAYS_INSTALMENT DAYS_ENTRY_PAYMENT AMT_INSTALMENT AMT_PAYMENT
0 1054186 161674 1.00 6 -1180.00 -1187.00 6948.36 6948.36
1 1330831 151639 0.00 34 -2156.00 -2156.00 1716.53 1716.53
2 2085231 193053 2.00 1 -63.00 -63.00 25425.00 25425.00
3 2452527 199697 1.00 3 -2418.00 -2426.00 24350.13 24350.13
4 2714724 167756 1.00 2 -1383.00 -1366.00 2165.04 2160.59
Code
an.col_info(installments_payments, style=True)
  column data_type memory_size n_unique p_unique n_missing p_missing n_dominant p_dominant p_dom_excl_na dominant
1 SK_ID_PREV int32 54.4 MB 997,752 7.3% 0 0% 293 <0.1% <0.1% 2360056
2 SK_ID_CURR int32 54.4 MB 339,587 2.5% 0 0% 372 <0.1% <0.1% 145728
3 NUM_INSTALMENT_VERSION float32 54.4 MB 65 <0.1% 0 0% 8,485,004 62.4% 62.4% 1.0
4 NUM_INSTALMENT_NUMBER int16 27.2 MB 277 <0.1% 0 0% 1,004,160 7.4% 7.4% 1
5 DAYS_INSTALMENT float32 54.4 MB 2,922 <0.1% 0 0% 11,512 0.1% 0.1% -120.0
6 DAYS_ENTRY_PAYMENT float32 54.4 MB 3,039 <0.1% 2,905 <0.1% 13,103 0.1% 0.1% -91.0
7 AMT_INSTALMENT float64 108.8 MB 902,539 6.6% 0 0% 254,062 1.9% 1.9% 9000.0
8 AMT_PAYMENT float64 108.8 MB 944,235 6.9% 2,905 <0.1% 248,757 1.8% 1.8% 9000.0
Code
if do_eda:
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")

        report_inspect_7 = sweetviz.analyze(
            [installments_payments, "installments_payments"],
            pairwise_analysis="off",
        )
        report_inspect_7.show_notebook()

3.3.8 Tables application_test and sample_submission

Table application_test contains the same variables as table application, except for the target variable TARGET. Table sample_submission contains only placeholder values, not real data. Both tables will be excluded from the analysis.
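The claim that the two tables differ only by TARGET can be checked by comparing their column indexes. The sketch below uses two small toy frames (the names train and test and all values are hypothetical); with the real data, application and application_test would take their place.

```python
import pandas as pd

# Toy stand-ins for the application and application_test tables.
train = pd.DataFrame(
    {"SK_ID_CURR": [1, 2], "TARGET": [1, 0], "AMT_CREDIT": [1000.0, 2500.0]}
)
test = pd.DataFrame({"SK_ID_CURR": [3], "AMT_CREDIT": [1800.0]})

# Columns present in one table but not in the other.
only_in_train = train.columns.difference(test.columns).tolist()
only_in_test = test.columns.difference(train.columns).tolist()

print("Only in train:", only_in_train)
print("Only in test:", only_in_test)
```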

Code
application_test.info(verbose=False)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48744 entries, 0 to 48743
Columns: 121 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR
dtypes: category(16), float32(64), float64(1), int16(2), int32(2), int8(36)
memory usage: 15.3 MB
Code
application_test.head()
SK_ID_CURR NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE NAME_TYPE_SUITE NAME_INCOME_TYPE NAME_EDUCATION_TYPE NAME_FAMILY_STATUS NAME_HOUSING_TYPE REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION DAYS_ID_PUBLISH OWN_CAR_AGE FLAG_MOBIL FLAG_EMP_PHONE FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL OCCUPATION_TYPE CNT_FAM_MEMBERS REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY WEEKDAY_APPR_PROCESS_START HOUR_APPR_PROCESS_START REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY ORGANIZATION_TYPE EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3 APARTMENTS_AVG BASEMENTAREA_AVG YEARS_BEGINEXPLUATATION_AVG YEARS_BUILD_AVG COMMONAREA_AVG ELEVATORS_AVG ENTRANCES_AVG FLOORSMAX_AVG FLOORSMIN_AVG LANDAREA_AVG LIVINGAPARTMENTS_AVG LIVINGAREA_AVG NONLIVINGAPARTMENTS_AVG NONLIVINGAREA_AVG APARTMENTS_MODE BASEMENTAREA_MODE YEARS_BEGINEXPLUATATION_MODE YEARS_BUILD_MODE COMMONAREA_MODE ELEVATORS_MODE ENTRANCES_MODE FLOORSMAX_MODE FLOORSMIN_MODE LANDAREA_MODE LIVINGAPARTMENTS_MODE LIVINGAREA_MODE NONLIVINGAPARTMENTS_MODE NONLIVINGAREA_MODE APARTMENTS_MEDI BASEMENTAREA_MEDI YEARS_BEGINEXPLUATATION_MEDI YEARS_BUILD_MEDI COMMONAREA_MEDI ELEVATORS_MEDI ENTRANCES_MEDI FLOORSMAX_MEDI FLOORSMIN_MEDI LANDAREA_MEDI LIVINGAPARTMENTS_MEDI LIVINGAREA_MEDI NONLIVINGAPARTMENTS_MEDI NONLIVINGAREA_MEDI FONDKAPREMONT_MODE HOUSETYPE_MODE TOTALAREA_MODE WALLSMATERIAL_MODE EMERGENCYSTATE_MODE OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE FLAG_DOCUMENT_2 FLAG_DOCUMENT_3 FLAG_DOCUMENT_4 FLAG_DOCUMENT_5 FLAG_DOCUMENT_6 FLAG_DOCUMENT_7 FLAG_DOCUMENT_8 FLAG_DOCUMENT_9 FLAG_DOCUMENT_10 FLAG_DOCUMENT_11 FLAG_DOCUMENT_12 FLAG_DOCUMENT_13 FLAG_DOCUMENT_14 FLAG_DOCUMENT_15 FLAG_DOCUMENT_16 FLAG_DOCUMENT_17 FLAG_DOCUMENT_18 
FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
0 100001 Cash loans F N Y 0 135000.00 568800.00 20560.50 450000.00 Unaccompanied Working Higher education Married House / apartment 0.02 -19241 -2329 -5170.00 -812 NaN 1 1 0 1 0 1 NaN 2.00 2 2 TUESDAY 18 0 0 0 0 0 0 Kindergarten 0.75 0.79 0.16 0.07 0.06 0.97 NaN NaN NaN 0.14 0.12 NaN NaN NaN 0.05 NaN NaN 0.07 0.06 0.97 NaN NaN NaN 0.14 0.12 NaN NaN NaN 0.05 NaN NaN 0.07 0.06 0.97 NaN NaN NaN 0.14 0.12 NaN NaN NaN 0.05 NaN NaN NaN block of flats 0.04 Stone, brick No 0.00 0.00 0.00 0.00 -1740.00 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00 0.00 0.00 0.00 0.00 0.00
1 100005 Cash loans M N Y 0 99000.00 222768.00 17370.00 180000.00 Unaccompanied Working Secondary / secondary special Married House / apartment 0.04 -18064 -4469 -9118.00 -1623 NaN 1 1 0 1 0 0 Low-skill Laborers 2.00 2 2 FRIDAY 9 0 0 0 0 0 0 Self-employed 0.56 0.29 0.43 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.00 0.00 0.00 0.00 0.00 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00 0.00 0.00 0.00 0.00 3.00
2 100013 Cash loans M Y Y 0 202500.00 663264.00 69777.00 630000.00 NaN Working Higher education Married House / apartment 0.02 -20038 -4458 -2175.00 -3503 5.00 1 1 0 1 0 0 Drivers 2.00 2 2 MONDAY 14 0 0 0 0 0 0 Transport: type 3 NaN 0.70 0.61 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.00 0.00 0.00 0.00 -856.00 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00 0.00 0.00 0.00 1.00 4.00
3 100028 Cash loans F N Y 2 315000.00 1575000.00 49018.50 1575000.00 Unaccompanied Working Secondary / secondary special Married House / apartment 0.03 -13976 -1866 -2000.00 -4208 NaN 1 1 0 1 1 0 Sales staff 4.00 2 2 WEDNESDAY 11 0 0 0 0 0 0 Business Entity Type 3 0.53 0.51 0.61 0.31 0.20 1.00 0.96 0.12 0.32 0.28 0.38 0.04 0.20 0.24 0.37 0.04 0.08 0.31 0.20 1.00 0.96 0.12 0.32 0.28 0.38 0.04 0.21 0.26 0.38 0.04 0.08 0.31 0.20 1.00 0.96 0.12 0.32 0.28 0.38 0.04 0.21 0.24 0.37 0.04 0.08 reg oper account block of flats 0.37 Panel No 0.00 0.00 0.00 0.00 -1805.00 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00 0.00 0.00 0.00 0.00 3.00
4 100038 Cash loans M Y N 1 180000.00 625500.00 32067.00 625500.00 Unaccompanied Working Secondary / secondary special Married House / apartment 0.01 -13040 -2191 -4000.00 -4262 16.00 1 1 1 1 0 0 NaN 3.00 2 2 FRIDAY 5 0 0 0 0 1 1 Business Entity Type 3 0.20 0.43 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.00 0.00 0.00 0.00 -821.00 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 NaN NaN NaN NaN NaN NaN
Code
sample_submission.info(verbose=False)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48744 entries, 0 to 48743
Columns: 2 entries, SK_ID_CURR to TARGET
dtypes: float32(1), int32(1)
memory usage: 380.9 KB
Code
sample_submission.head()
SK_ID_CURR TARGET
0 100001 0.50
1 100005 0.50
2 100013 0.50
3 100028 0.50
4 100038 0.50

3.4 Split to Train, Validation, and Test Sets

  • To estimate how well the models generalize to unseen data, the data is split into training, validation, and test sets (70%:15%:15%).
  • Stratification by target is used to ensure that the proportions of target values are the same in all sets.
Code
application_train, application_validation = train_test_split(
    application, test_size=0.3, random_state=42, stratify=application.TARGET
)

application_validation, application_test = train_test_split(
    application_validation,
    test_size=0.5,
    random_state=42,
    stratify=application_validation.TARGET,
)
Code
X_train = application_train.drop(columns=["TARGET"])
y_train = application_train["TARGET"]

X_validation = application_validation.drop(columns=["TARGET"])
y_validation = application_validation["TARGET"]

X_test = application_test.drop(columns=["TARGET"])
y_test = application_test["TARGET"]

The sizes of the sets (“k” stands for thousand) are:

Code
print(f"{application_train.shape[0]/1e3:.1f}k rows in training set.")
print(f"{application_validation.shape[0]/1e3: .1f}k rows in validation set.")
print(f"{application_test.shape[0]/1e3: .1f}k rows in test set.")
215.3k rows in training set.
 46.1k rows in validation set.
 46.1k rows in test set.
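
As a quick sanity check of the two-stage split, the sketch below repeats the same procedure on synthetic data and compares row shares and target rates across the three sets. The table `toy` and its roughly 8% positive rate are assumptions made for illustration only; they are not the project's data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the `application` table (assumed ~8% positives)
rng = np.random.default_rng(42)
toy = pd.DataFrame(
    {
        "feature": rng.normal(size=10_000),
        "TARGET": (rng.random(10_000) < 0.08).astype("int8"),
    }
)

# Stage 1: 70% training, 30% held out
toy_train, toy_rest = train_test_split(
    toy, test_size=0.3, random_state=42, stratify=toy["TARGET"]
)
# Stage 2: split the held-out part in half (15% validation, 15% test)
toy_validation, toy_test = train_test_split(
    toy_rest, test_size=0.5, random_state=42, stratify=toy_rest["TARGET"]
)

# Compare set sizes and target rates
split_summary = pd.DataFrame(
    {
        name: {
            "n_rows": len(part),
            "share_of_rows": len(part) / len(toy),
            "target_rate": part["TARGET"].mean(),
        }
        for name, part in [
            ("train", toy_train),
            ("validation", toy_validation),
            ("test", toy_test),
        ]
    }
).T
print(split_summary.round(3))
```

With stratification, the `target_rate` column should be nearly identical in all three rows; small differences remain only because the number of positive cases cannot be divided exactly.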

3.5 EDA on Train Set

A more detailed EDA is performed on the training set only. Note that the sweetviz report shows not only the distribution of each variable as bars, but also the mean of the target variable in each category/interval, indicated by dark blue dots connected with a line.

Code
application_train.head()
SK_ID_CURR TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE NAME_TYPE_SUITE NAME_INCOME_TYPE NAME_EDUCATION_TYPE NAME_FAMILY_STATUS NAME_HOUSING_TYPE REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION DAYS_ID_PUBLISH OWN_CAR_AGE FLAG_MOBIL FLAG_EMP_PHONE FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL OCCUPATION_TYPE CNT_FAM_MEMBERS REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY WEEKDAY_APPR_PROCESS_START HOUR_APPR_PROCESS_START REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY ORGANIZATION_TYPE EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3 APARTMENTS_AVG BASEMENTAREA_AVG YEARS_BEGINEXPLUATATION_AVG YEARS_BUILD_AVG COMMONAREA_AVG ELEVATORS_AVG ENTRANCES_AVG FLOORSMAX_AVG FLOORSMIN_AVG LANDAREA_AVG LIVINGAPARTMENTS_AVG LIVINGAREA_AVG NONLIVINGAPARTMENTS_AVG NONLIVINGAREA_AVG APARTMENTS_MODE BASEMENTAREA_MODE YEARS_BEGINEXPLUATATION_MODE YEARS_BUILD_MODE COMMONAREA_MODE ELEVATORS_MODE ENTRANCES_MODE FLOORSMAX_MODE FLOORSMIN_MODE LANDAREA_MODE LIVINGAPARTMENTS_MODE LIVINGAREA_MODE NONLIVINGAPARTMENTS_MODE NONLIVINGAREA_MODE APARTMENTS_MEDI BASEMENTAREA_MEDI YEARS_BEGINEXPLUATATION_MEDI YEARS_BUILD_MEDI COMMONAREA_MEDI ELEVATORS_MEDI ENTRANCES_MEDI FLOORSMAX_MEDI FLOORSMIN_MEDI LANDAREA_MEDI LIVINGAPARTMENTS_MEDI LIVINGAREA_MEDI NONLIVINGAPARTMENTS_MEDI NONLIVINGAREA_MEDI FONDKAPREMONT_MODE HOUSETYPE_MODE TOTALAREA_MODE WALLSMATERIAL_MODE EMERGENCYSTATE_MODE OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE FLAG_DOCUMENT_2 FLAG_DOCUMENT_3 FLAG_DOCUMENT_4 FLAG_DOCUMENT_5 FLAG_DOCUMENT_6 FLAG_DOCUMENT_7 FLAG_DOCUMENT_8 FLAG_DOCUMENT_9 FLAG_DOCUMENT_10 FLAG_DOCUMENT_11 FLAG_DOCUMENT_12 FLAG_DOCUMENT_13 FLAG_DOCUMENT_14 FLAG_DOCUMENT_15 FLAG_DOCUMENT_16 FLAG_DOCUMENT_17 FLAG_DOCUMENT_18 
FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
159703 285133 0 Cash loans F Y Y 2 405000.00 1971072.00 68643.00 1800000.00 Unaccompanied Commercial associate Higher education Married House / apartment 0.01 -13587 -1028 -7460.00 -1823 13.00 1 1 0 1 0 0 Accountants 4.00 3 3 SATURDAY 11 0 0 0 0 0 0 Self-employed 0.68 0.33 0.64 0.12 0.10 0.98 0.78 NaN 0.00 0.24 0.17 0.21 0.00 0.10 0.12 NaN 0.03 0.12 0.10 0.98 0.79 NaN 0.00 0.24 0.17 0.21 0.00 0.11 0.13 NaN 0.03 0.12 0.10 0.98 0.79 NaN 0.00 0.24 0.17 0.21 0.00 0.10 0.13 NaN 0.03 reg oper account block of flats 0.10 Stone, brick No 4.00 0.00 4.00 0.00 -2169.00 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00 0.00 0.00 0.00 0.00 0.00
79269 191894 0 Cash loans M N Y 0 337500.00 508495.50 38146.50 454500.00 Family State servant Higher education Married House / apartment 0.01 -17543 -1208 -4054.00 -1090 NaN 1 1 0 1 0 0 Managers 2.00 2 2 WEDNESDAY 11 0 0 0 0 0 0 Agriculture NaN 0.62 0.44 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2.00 1.00 2.00 1.00 -659.00 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00 0.00 0.00 0.00 0.00 6.00
232615 369428 0 Cash loans M N Y 1 112500.00 110146.50 13068.00 90000.00 Unaccompanied Commercial associate Secondary / secondary special Married House / apartment 0.01 -11557 -593 -5554.00 -4130 NaN 1 1 0 1 1 1 Laborers 3.00 2 2 FRIDAY 11 0 0 0 0 0 0 Business Entity Type 3 0.36 0.65 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.00 0.00 0.00 0.00 -172.00 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 NaN NaN NaN NaN NaN NaN
33420 138717 0 Cash loans F N Y 2 40500.00 66384.00 3519.00 45000.00 Unaccompanied Commercial associate Secondary / secondary special Married House / apartment 0.03 -15750 -5376 -5285.00 -5290 NaN 1 1 0 1 0 0 Sales staff 4.00 2 2 TUESDAY 13 0 0 0 0 0 0 Self-employed 0.39 0.60 0.45 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.00 0.00 0.00 0.00 -1576.00 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00 0.00 0.00 1.00 0.00 2.00
88191 202381 0 Cash loans M Y N 0 225000.00 298512.00 31801.50 270000.00 Unaccompanied Commercial associate Secondary / secondary special Married House / apartment 0.02 -19912 -1195 -86.00 -3033 11.00 1 1 0 1 0 0 Drivers 2.00 2 2 FRIDAY 16 0 0 0 0 0 0 Construction 0.74 0.66 0.72 0.30 0.14 1.00 0.99 0.10 0.40 0.17 0.46 0.00 0.00 0.24 0.25 0.00 0.00 0.30 0.14 1.00 0.99 0.10 0.40 0.17 0.46 0.00 0.00 0.26 0.26 0.00 0.00 0.30 0.14 1.00 0.99 0.10 0.40 0.17 0.46 0.00 0.00 0.25 0.25 0.00 0.00 reg oper account block of flats 0.27 Stone, brick No 3.00 0.00 3.00 0.00 -624.00 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00 0.00 0.00 0.00 0.00 0.00
Code
an.col_info(application_train, style=True)
  column data_type memory_size n_unique p_unique n_missing p_missing n_dominant p_dominant p_dom_excl_na dominant
1 SK_ID_CURR int32 861.0 kB 215,257 100.0% 0 0% 1 <0.1% <0.1% 285133
2 TARGET int8 215.3 kB 2 <0.1% 0 0% 197,880 91.9% 91.9% 0
3 NAME_CONTRACT_TYPE category 215.5 kB 2 <0.1% 0 0% 194,675 90.4% 90.4% Cash loans
4 CODE_GENDER category 215.5 kB 3 <0.1% 0 0% 141,622 65.8% 65.8% F
5 FLAG_OWN_CAR category 215.5 kB 2 <0.1% 0 0% 142,086 66.0% 66.0% N
6 FLAG_OWN_REALTY category 215.5 kB 2 <0.1% 0 0% 149,412 69.4% 69.4% Y
7 CNT_CHILDREN int8 215.3 kB 12 <0.1% 0 0% 150,641 70.0% 70.0% 0
8 AMT_INCOME_TOTAL float64 1.7 MB 1,949 0.9% 0 0% 24,982 11.6% 11.6% 135000.0
9 AMT_CREDIT float32 861.0 kB 5,097 2.4% 0 0% 6,823 3.2% 3.2% 450000.0
10 AMT_ANNUITY float32 861.0 kB 12,801 5.9% 8 <0.1% 4,499 2.1% 2.1% 9000.0
11 AMT_GOODS_PRICE float32 861.0 kB 828 0.4% 187 0.1% 18,194 8.5% 8.5% 450000.0
12 NAME_TYPE_SUITE category 216.0 kB 7 <0.1% 901 0.4% 174,089 80.9% 81.2% Unaccompanied
13 NAME_INCOME_TYPE category 216.1 kB 8 <0.1% 0 0% 110,984 51.6% 51.6% Working
14 NAME_EDUCATION_TYPE category 215.8 kB 5 <0.1% 0 0% 152,993 71.1% 71.1% Secondary / secondary special
15 NAME_FAMILY_STATUS category 215.8 kB 6 <0.1% 0 0% 137,457 63.9% 63.9% Married
16 NAME_HOUSING_TYPE category 215.9 kB 6 <0.1% 0 0% 191,159 88.8% 88.8% House / apartment
17 REGION_POPULATION_RELATIVE float32 861.0 kB 81 <0.1% 0 0% 11,494 5.3% 5.3% 0.035792
18 DAYS_BIRTH int16 430.5 kB 17,377 8.1% 0 0% 32 <0.1% <0.1% -14890
19 DAYS_EMPLOYED int32 861.0 kB 11,770 5.5% 0 0% 38,756 18.0% 18.0% 365243
20 DAYS_REGISTRATION float32 861.0 kB 15,249 7.1% 0 0% 79 <0.1% <0.1% -7.0
21 DAYS_ID_PUBLISH int16 430.5 kB 6,122 2.8% 0 0% 119 0.1% 0.1% -4074
22 OWN_CAR_AGE float32 861.0 kB 61 <0.1% 142,091 66.0% 5,232 2.4% 7.2% 7.0
23 FLAG_MOBIL int8 215.3 kB 2 <0.1% 0 0% 215,256 >99.9% >99.9% 1
24 FLAG_EMP_PHONE int8 215.3 kB 2 <0.1% 0 0% 176,491 82.0% 82.0% 1
25 FLAG_WORK_PHONE int8 215.3 kB 2 <0.1% 0 0% 172,406 80.1% 80.1% 0
26 FLAG_CONT_MOBILE int8 215.3 kB 2 <0.1% 0 0% 214,855 99.8% 99.8% 1
27 FLAG_PHONE int8 215.3 kB 2 <0.1% 0 0% 154,906 72.0% 72.0% 0
28 FLAG_EMAIL int8 215.3 kB 2 <0.1% 0 0% 203,006 94.3% 94.3% 0
29 OCCUPATION_TYPE category 217.1 kB 18 <0.1% 67,480 31.3% 38,591 17.9% 26.1% Laborers
30 CNT_FAM_MEMBERS float32 861.0 kB 12 <0.1% 1 <0.1% 110,671 51.4% 51.4% 2.0
31 REGION_RATING_CLIENT int8 215.3 kB 3 <0.1% 0 0% 158,846 73.8% 73.8% 2
32 REGION_RATING_CLIENT_W_CITY int8 215.3 kB 3 <0.1% 0 0% 160,564 74.6% 74.6% 2
33 WEEKDAY_APPR_PROCESS_START category 216.0 kB 7 <0.1% 0 0% 37,826 17.6% 17.6% TUESDAY
34 HOUR_APPR_PROCESS_START int8 215.3 kB 24 <0.1% 0 0% 26,465 12.3% 12.3% 10
35 REG_REGION_NOT_LIVE_REGION int8 215.3 kB 2 <0.1% 0 0% 211,999 98.5% 98.5% 0
36 REG_REGION_NOT_WORK_REGION int8 215.3 kB 2 <0.1% 0 0% 204,222 94.9% 94.9% 0
37 LIVE_REGION_NOT_WORK_REGION int8 215.3 kB 2 <0.1% 0 0% 206,386 95.9% 95.9% 0
38 REG_CITY_NOT_LIVE_CITY int8 215.3 kB 2 <0.1% 0 0% 198,549 92.2% 92.2% 0
39 REG_CITY_NOT_WORK_CITY int8 215.3 kB 2 <0.1% 0 0% 165,697 77.0% 77.0% 0
40 LIVE_CITY_NOT_WORK_CITY int8 215.3 kB 2 <0.1% 0 0% 176,518 82.0% 82.0% 0
41 ORGANIZATION_TYPE category 221.4 kB 58 <0.1% 0 0% 47,582 22.1% 22.1% Business Entity Type 3
42 EXT_SOURCE_1 float32 861.0 kB 83,961 39.0% 121,373 56.4% 5 <0.1% <0.1% 0.44398212
43 EXT_SOURCE_2 float32 861.0 kB 102,229 47.5% 464 0.2% 503 0.2% 0.2% 0.28589788
44 EXT_SOURCE_3 float32 861.0 kB 804 0.4% 42,680 19.8% 985 0.5% 0.6% 0.7463002
45 APARTMENTS_AVG float32 861.0 kB 2,207 1.0% 109,076 50.7% 4,712 2.2% 4.4% 0.0825
46 BASEMENTAREA_AVG float32 861.0 kB 3,626 1.7% 125,793 58.4% 10,282 4.8% 11.5% 0.0
47 YEARS_BEGINEXPLUATATION_AVG float32 861.0 kB 260 0.1% 104,910 48.7% 3,073 1.4% 2.8% 0.9871
48 YEARS_BUILD_AVG float32 861.0 kB 146 0.1% 143,036 66.4% 2,132 1.0% 3.0% 0.8232
49 COMMONAREA_AVG float32 861.0 kB 2,964 1.4% 150,300 69.8% 5,899 2.7% 9.1% 0.0
50 ELEVATORS_AVG float32 861.0 kB 241 0.1% 114,570 53.2% 60,109 27.9% 59.7% 0.0
51 ENTRANCES_AVG float32 861.0 kB 266 0.1% 108,270 50.3% 23,867 11.1% 22.3% 0.1379
52 FLOORSMAX_AVG float32 861.0 kB 371 0.2% 106,970 49.7% 43,449 20.2% 40.1% 0.1667
53 FLOORSMIN_AVG float32 861.0 kB 280 0.1% 146,054 67.9% 23,117 10.7% 33.4% 0.2083
54 LANDAREA_AVG float32 861.0 kB 3,360 1.6% 127,644 59.3% 10,845 5.0% 12.4% 0.0
55 LIVINGAPARTMENTS_AVG float32 861.0 kB 1,761 0.8% 147,049 68.3% 2,984 1.4% 4.4% 0.0504
56 LIVINGAREA_AVG float32 861.0 kB 4,983 2.3% 107,990 50.2% 202 0.1% 0.2% 0.0
57 NONLIVINGAPARTMENTS_AVG float32 861.0 kB 345 0.2% 149,354 69.4% 38,319 17.8% 58.1% 0.0
58 NONLIVINGAREA_AVG float32 861.0 kB 3,042 1.4% 118,577 55.1% 41,099 19.1% 42.5% 0.0
59 APARTMENTS_MODE float32 861.0 kB 744 0.3% 109,076 50.7% 5,301 2.5% 5.0% 0.084
60 BASEMENTAREA_MODE float32 861.0 kB 3,687 1.7% 125,793 58.4% 11,561 5.4% 12.9% 0.0
61 YEARS_BEGINEXPLUATATION_MODE float32 861.0 kB 210 0.1% 104,910 48.7% 3,039 1.4% 2.8% 0.9871
62 YEARS_BUILD_MODE float32 861.0 kB 152 0.1% 143,036 66.4% 2,090 1.0% 2.9% 0.8301
63 COMMONAREA_MODE float32 861.0 kB 2,908 1.4% 150,300 69.8% 6,770 3.1% 10.4% 0.0
64 ELEVATORS_MODE float32 861.0 kB 26 <0.1% 114,570 53.2% 62,808 29.2% 62.4% 0.0
65 ENTRANCES_MODE float32 861.0 kB 30 <0.1% 108,270 50.3% 25,310 11.8% 23.7% 0.1379
66 FLOORSMAX_MODE float32 861.0 kB 25 <0.1% 106,970 49.7% 46,048 21.4% 42.5% 0.1667
67 FLOORSMIN_MODE float32 861.0 kB 25 <0.1% 146,054 67.9% 24,209 11.2% 35.0% 0.2083
68 LANDAREA_MODE float32 861.0 kB 3,406 1.6% 127,644 59.3% 12,121 5.6% 13.8% 0.0
69 LIVINGAPARTMENTS_MODE float32 861.0 kB 715 0.3% 147,049 68.3% 3,447 1.6% 5.1% 0.0551
70 LIVINGAREA_MODE float32 861.0 kB 5,083 2.4% 107,990 50.2% 310 0.1% 0.3% 0.0
71 NONLIVINGAPARTMENTS_MODE float32 861.0 kB 148 0.1% 149,354 69.4% 41,574 19.3% 63.1% 0.0
72 NONLIVINGAREA_MODE float32 861.0 kB 3,090 1.4% 118,577 55.1% 46,933 21.8% 48.5% 0.0
73 APARTMENTS_MEDI float32 861.0 kB 1,120 0.5% 109,076 50.7% 5,000 2.3% 4.7% 0.0833
74 BASEMENTAREA_MEDI float32 861.0 kB 3,614 1.7% 125,793 58.4% 10,458 4.9% 11.7% 0.0
75 YEARS_BEGINEXPLUATATION_MEDI float32 861.0 kB 232 0.1% 104,910 48.7% 3,060 1.4% 2.8% 0.9871
76 YEARS_BUILD_MEDI float32 861.0 kB 148 0.1% 143,036 66.4% 2,118 1.0% 2.9% 0.8256
77 COMMONAREA_MEDI float32 861.0 kB 2,982 1.4% 150,300 69.8% 6,068 2.8% 9.3% 0.0
78 ELEVATORS_MEDI float32 861.0 kB 46 <0.1% 114,570 53.2% 61,040 28.4% 60.6% 0.0
79 ENTRANCES_MEDI float32 861.0 kB 46 <0.1% 108,270 50.3% 24,940 11.6% 23.3% 0.1379
80 FLOORSMAX_MEDI float32 861.0 kB 49 <0.1% 106,970 49.7% 44,659 20.7% 41.2% 0.1667
81 FLOORSMIN_MEDI float32 861.0 kB 47 <0.1% 146,054 67.9% 23,733 11.0% 34.3% 0.2083
82 LANDAREA_MEDI float32 861.0 kB 3,393 1.6% 127,644 59.3% 11,058 5.1% 12.6% 0.0
83 LIVINGAPARTMENTS_MEDI float32 861.0 kB 1,063 0.5% 147,049 68.3% 3,142 1.5% 4.6% 0.0513
84 LIVINGAREA_MEDI float32 861.0 kB 5,067 2.4% 107,990 50.2% 210 0.1% 0.2% 0.0
85 NONLIVINGAPARTMENTS_MEDI float32 861.0 kB 190 0.1% 149,354 69.4% 39,384 18.3% 59.8% 0.0
86 NONLIVINGAREA_MEDI float32 861.0 kB 3,083 1.4% 118,577 55.1% 42,610 19.8% 44.1% 0.0
87 FONDKAPREMONT_MODE category 215.7 kB 4 <0.1% 147,099 68.3% 51,785 24.1% 76.0% reg oper account
88 HOUSETYPE_MODE category 215.6 kB 3 <0.1% 107,834 50.1% 105,515 49.0% 98.2% block of flats
89 TOTALAREA_MODE float32 861.0 kB 4,896 2.3% 103,833 48.2% 417 0.2% 0.4% 0.0
90 WALLSMATERIAL_MODE category 216.0 kB 7 <0.1% 109,329 50.8% 46,298 21.5% 43.7% Panel
91 EMERGENCYSTATE_MODE category 215.5 kB 2 <0.1% 101,963 47.4% 111,665 51.9% 98.6% No
92 OBS_30_CNT_SOCIAL_CIRCLE float32 861.0 kB 32 <0.1% 714 0.3% 114,550 53.2% 53.4% 0.0
93 DEF_30_CNT_SOCIAL_CIRCLE float32 861.0 kB 10 <0.1% 714 0.3% 189,988 88.3% 88.6% 0.0
94 OBS_60_CNT_SOCIAL_CIRCLE float32 861.0 kB 32 <0.1% 714 0.3% 115,085 53.5% 53.6% 0.0
95 DEF_60_CNT_SOCIAL_CIRCLE float32 861.0 kB 9 <0.1% 714 0.3% 196,614 91.3% 91.6% 0.0
96 DAYS_LAST_PHONE_CHANGE float32 861.0 kB 3,720 1.7% 1 <0.1% 26,201 12.2% 12.2% 0.0
97 FLAG_DOCUMENT_2 int8 215.3 kB 2 <0.1% 0 0% 215,246 >99.9% >99.9% 0
98 FLAG_DOCUMENT_3 int8 215.3 kB 2 <0.1% 0 0% 152,845 71.0% 71.0% 1
99 FLAG_DOCUMENT_4 int8 215.3 kB 2 <0.1% 0 0% 215,238 >99.9% >99.9% 0
100 FLAG_DOCUMENT_5 int8 215.3 kB 2 <0.1% 0 0% 212,025 98.5% 98.5% 0
101 FLAG_DOCUMENT_6 int8 215.3 kB 2 <0.1% 0 0% 196,348 91.2% 91.2% 0
102 FLAG_DOCUMENT_7 int8 215.3 kB 2 <0.1% 0 0% 215,221 >99.9% >99.9% 0
103 FLAG_DOCUMENT_8 int8 215.3 kB 2 <0.1% 0 0% 197,689 91.8% 91.8% 0
104 FLAG_DOCUMENT_9 int8 215.3 kB 2 <0.1% 0 0% 214,440 99.6% 99.6% 0
105 FLAG_DOCUMENT_10 int8 215.3 kB 2 <0.1% 0 0% 215,253 >99.9% >99.9% 0
106 FLAG_DOCUMENT_11 int8 215.3 kB 2 <0.1% 0 0% 214,448 99.6% 99.6% 0
107 FLAG_DOCUMENT_12 int8 215.3 kB 2 <0.1% 0 0% 215,256 >99.9% >99.9% 0
108 FLAG_DOCUMENT_13 int8 215.3 kB 2 <0.1% 0 0% 214,541 99.7% 99.7% 0
109 FLAG_DOCUMENT_14 int8 215.3 kB 2 <0.1% 0 0% 214,614 99.7% 99.7% 0
110 FLAG_DOCUMENT_15 int8 215.3 kB 2 <0.1% 0 0% 215,015 99.9% 99.9% 0
111 FLAG_DOCUMENT_16 int8 215.3 kB 2 <0.1% 0 0% 213,089 99.0% 99.0% 0
112 FLAG_DOCUMENT_17 int8 215.3 kB 2 <0.1% 0 0% 215,200 >99.9% >99.9% 0
113 FLAG_DOCUMENT_18 int8 215.3 kB 2 <0.1% 0 0% 213,525 99.2% 99.2% 0
114 FLAG_DOCUMENT_19 int8 215.3 kB 2 <0.1% 0 0% 215,124 99.9% 99.9% 0
115 FLAG_DOCUMENT_20 int8 215.3 kB 2 <0.1% 0 0% 215,146 99.9% 99.9% 0
116 FLAG_DOCUMENT_21 int8 215.3 kB 2 <0.1% 0 0% 215,187 >99.9% >99.9% 0
117 AMT_REQ_CREDIT_BUREAU_HOUR float32 861.0 kB 5 <0.1% 29,081 13.5% 185,061 86.0% 99.4% 0.0
118 AMT_REQ_CREDIT_BUREAU_DAY float32 861.0 kB 9 <0.1% 29,081 13.5% 185,147 86.0% 99.4% 0.0
119 AMT_REQ_CREDIT_BUREAU_WEEK float32 861.0 kB 9 <0.1% 29,081 13.5% 180,246 83.7% 96.8% 0.0
120 AMT_REQ_CREDIT_BUREAU_MON float32 861.0 kB 22 <0.1% 29,081 13.5% 155,679 72.3% 83.6% 0.0
121 AMT_REQ_CREDIT_BUREAU_QRT float32 861.0 kB 10 <0.1% 29,081 13.5% 150,895 70.1% 81.0% 0.0
122 AMT_REQ_CREDIT_BUREAU_YEAR float32 861.0 kB 24 <0.1% 29,081 13.5% 50,313 23.4% 27.0% 0.0
Code
if do_eda:
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")

        report_eda_1 = sweetviz.analyze(
            [application_train, "application (train)"],
            target_feat="TARGET",
            pairwise_analysis="off",
        )
        report_eda_1.show_notebook()

The distribution of the number of children (CNT_CHILDREN) is right-skewed, with a few outliers. In the sweetviz report, the trends in the distribution are not clear due to the extreme values. The frequency table below shows that the majority of clients (about 70%) have no children and only 29 clients have more than 5 children.

Code
application_train.value_counts("CNT_CHILDREN").sort_index()
CNT_CHILDREN
0     150641
1      42945
2      18697
3       2598
4        280
5         67
6         17
7          6
8          2
9          1
10         2
19         1
Name: count, dtype: int64

Total income (AMT_INCOME_TOTAL) is also right-skewed, with a few outliers. Let’s categorize values, draw a frequency table and plot the distribution with the most extreme values removed to see the trends more clearly.

Code
above_1m_count = (
    application_train.assign(
        FLAG_INCOME_TOTAL_ABOVE_1M=lambda df: pd.cut(
            df["AMT_INCOME_TOTAL"],
            bins=[0, 5e5, 1e6, 1.5e6, 2e6, 1e7, np.inf],
            labels=["0-500k", "500k-1M", "1M-1.5M", "1.5M-2M", "2M-10M", "10M+"],
        )
    )
    .value_counts("FLAG_INCOME_TOTAL_ABOVE_1M")
    .sort_index()
)
above_1m_count
FLAG_INCOME_TOTAL_ABOVE_1M
0-500k     213341
500k-1M      1745
1M-1.5M       109
1.5M-2M        29
2M-10M         31
10M+            2
Name: count, dtype: int64
Code
plt.figure(figsize=(10, 3))
plt.hist(application_train["AMT_INCOME_TOTAL"], bins=40, range=(0, 1.5e6), ec="black")
plt.xlabel("AMT_INCOME_TOTAL")
plt.ylabel("Frequency")
plt.title("Distribution of AMT_INCOME_TOTAL (up to 1.5M$)")
plt.show()
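
When bins are labeled by hand, the labels can drift out of sync with the bin edges. The sketch below shows one possible way to derive the labels from the edges themselves. The helper `short()` and the series `toy_income` are hypothetical and used for illustration only.

```python
import numpy as np
import pandas as pd

# Illustrative values only (not taken from the dataset)
toy_income = pd.Series(
    [90_000, 500_000, 500_001, 1_200_000, 20_000_000], name="AMT_INCOME_TOTAL"
)
edges = [0, 5e5, 1e6, 1.5e6, 2e6, 1e7, np.inf]


def short(value):
    """Format an amount as e.g. '500k' or '1.5M' (hypothetical helper)."""
    if value >= 1e6:
        return f"{value / 1e6:g}M"
    if value >= 1e3:
        return f"{value / 1e3:g}k"
    return f"{value:g}"


# Build one label per pair of neighboring edges
labels = [
    f"{short(lo)}+" if np.isinf(hi) else f"{short(lo)}-{short(hi)}"
    for lo, hi in zip(edges[:-1], edges[1:])
]

# By default, pd.cut uses right-closed intervals: (lo, hi]
binned = pd.cut(toy_income, bins=edges, labels=labels)
print(binned.value_counts().sort_index())
```

Because the intervals are right-closed, a value of exactly 500,000 falls into the first bin, and a value of exactly 0 would not be assigned to any bin unless `include_lowest=True` is set.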

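Most of the values in DAYS_EMPLOYED are negative: they count days backwards from the application date, so a negative value means that the client is employed. The only positive value, 365243 (roughly 1,000 years), occurs 38,756 times and is evidently a placeholder, so it will be treated as a missing value.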

Code
application_train.value_counts("DAYS_EMPLOYED").sort_index()
DAYS_EMPLOYED
-17583         1
-17546         1
-17522         1
-17139         1
-16849         1
           ...  
-3             2
-2             2
-1             1
 0             1
 365243    38756
Name: count, Length: 11770, dtype: int64
Code
application_train.eval("DAYS_EMPLOYED > 0").value_counts().sort_index()
DAYS_EMPLOYED
False    176501
True      38756
Name: count, dtype: int64
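
The treatment of the placeholder value can be sketched as follows. This is a minimal illustration on a made-up series (`toy_days`); the missing-value flag `flag_missing` is an assumption added here to show that the information "no employment recorded" can be kept as a separate feature.

```python
import numpy as np
import pandas as pd

# Illustrative values only; 365243 is the placeholder seen in the table above
toy_days = pd.Series([-1028, -593, 365243, -5376, 0], name="DAYS_EMPLOYED")
SENTINEL = 365243

# Replace the placeholder with an explicit missing value
days_clean = toy_days.replace(SENTINEL, np.nan)

# Convert "days before application" (negative) to positive years
years_employed = days_clean / -365

# Optional indicator of the placeholder (hypothetical extra feature)
flag_missing = days_clean.isna().astype("int8")

print(pd.DataFrame({"years_employed": years_employed, "flag_missing": flag_missing}))
```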

There are 58 types of organizations. The value XNA occurs 38,756 times (the same count as the value 365243 in DAYS_EMPLOYED) and should be converted to an explicit missing value:

Code
application_train.value_counts("ORGANIZATION_TYPE").sort_index()
ORGANIZATION_TYPE
Advertising                 289
Agriculture                1730
Bank                       1735
Business Entity Type 1     4214
Business Entity Type 2     7374
Business Entity Type 3    47582
Cleaning                    195
Construction               4704
Culture                     269
Electricity                 674
Emergency                   395
Government                 7324
Hotel                       686
Housing                    2055
Industry: type 1            737
Industry: type 10            75
Industry: type 11          1888
Industry: type 12           258
Industry: type 13            46
Industry: type 2            326
Industry: type 3           2292
Industry: type 4            633
Industry: type 5            393
Industry: type 6             77
Industry: type 7            903
Industry: type 8             17
Industry: type 9           2396
Insurance                   415
Kindergarten               4891
Legal Services              218
Medicine                   7917
Military                   1857
Mobile                      211
Other                     11662
Police                     1608
Postal                     1520
Realtor                     279
Religion                     59
Restaurant                 1285
School                     6296
Security                   2302
Security Ministries        1403
Self-employed             26681
Services                   1089
Telecom                     396
Trade: type 1               237
Trade: type 2              1338
Trade: type 3              2425
Trade: type 4                45
Trade: type 5                34
Trade: type 6               425
Trade: type 7              5450
Transport: type 1           145
Transport: type 2          1529
Transport: type 3           851
Transport: type 4          3749
University                  917
XNA                       38756
Name: count, dtype: int64
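
For a column of categorical dtype, one possible way to turn a placeholder category into an explicit missing value is to remove that category. The sketch below uses a made-up series (`toy_org`) and pandas' `Series.cat.remove_categories()`; it is an illustration and not necessarily the approach used later in this project.

```python
import pandas as pd

# Illustrative categorical column with the placeholder category "XNA"
toy_org = pd.Series(
    ["Bank", "XNA", "School", "XNA"], dtype="category", name="ORGANIZATION_TYPE"
)

# Removing a category sets all values of that category to NaN
org_clean = toy_org.cat.remove_categories(["XNA"])

print(org_clean)
print("Missing values:", org_clean.isna().sum())
```

An advantage of this variant is that XNA also disappears from the list of categories, so one-hot encoding does not create an empty column for it.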

4 Modeling (w/o Historical Data)

In this section, a model based only on the data from the main table application is built. The historical credit data is not included here.

4.1 Create Pipelines

A few pre-processing steps are defined in this section. The pipelines are created using sklearn’s Pipeline class.

Some general steps:

Code
# Numeric variables (missing value indicators are added later by the imputer)
select_numeric = make_column_selector(dtype_include="number")

# Categorical variables
select_categorical = make_column_selector(dtype_include="category")

# Create the pipelines
# Use median imputation for numeric variables
numeric_transformer = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="median", add_indicator=True))]
)

# Use one-hot encoding for categorical variables
# and clean column names (remove spaces, special characters, etc.)
categorical_transformer = Pipeline(
    steps=[
        ("onehot", OneHotEncoder(sparse_output=False, handle_unknown="ignore")),
        ("clean_names", CleanColumnNames()),
    ]
)

# Merge pipelines of numeric and categorical variables
pre_processing = ColumnTransformer(
    transformers=[
        ("numeric", numeric_transformer, select_numeric),
        ("categorical", categorical_transformer, select_categorical),
    ],
    verbose_feature_names_out=False,
    remainder="passthrough",
)
pre_processing
ColumnTransformer(remainder='passthrough',
                  transformers=[('numeric',
                                 Pipeline(steps=[('imputer',
                                                  SimpleImputer(add_indicator=True,
                                                                strategy='median'))]),
                                 <sklearn.compose._column_transformer.make_column_selector object at 0x000001E1D640F1D0>),
                                ('categorical',
                                 Pipeline(steps=[('onehot',
                                                  OneHotEncoder(handle_unknown='ignore',
                                                                sparse_output=False)),
                                                 ('clean_names',
                                                  CleanColumnNames())]),
                                 <sklearn.compose._column_transformer.make_column_selector object at 0x000001E1D640E5D0>)],
                  verbose_feature_names_out=False)

Some pre-processing steps specific to the main table application are implemented below. These are:

  1. Data cleaning steps;
  2. Feature engineering steps.

In feature engineering, variables such as the number of family members excluding children, income per family member, and others are created.

Some variables that might be considered discriminatory under the law (age, sex, and family status) are discarded from the analysis. Some others, whose use might also be considered unethical (e.g., the day of the week and the hour of the day when the application started), are also removed.

Code
class PreprocessorForApplications(BaseEstimator, TransformerMixin):
    """Transformer that cleans table `application` and engineers features."""

    def __init__(self):
        pass

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        education_values = [
            "Lower secondary",
            "Secondary / secondary special",
            "Incomplete higher",
            "Higher education",
            "Academic degree",
        ]

        education_dtype = pd.CategoricalDtype(categories=education_values, ordered=True)

        X = X.assign(
            # Extract features
            FLAG_OWN_CAR=lambda df: (df["FLAG_OWN_CAR"] == "Y").astype("Int8"),
            FLAG_OWN_REALTY=lambda df: (df["FLAG_OWN_REALTY"] == "Y").astype("Int8"),
            FLAG_IS_EMERGENCY=lambda df: (df["EMERGENCYSTATE_MODE"] == "Yes").astype(
                "Int8"
            ),
            NAME_EDUCATION_TYPE=lambda df: df["NAME_EDUCATION_TYPE"].astype(
                education_dtype
            ),
            ord_education_type=lambda df: df["NAME_EDUCATION_TYPE"].cat.codes,
            flag_has_children=lambda df: (df["CNT_CHILDREN"] > 0).astype("Int8"),
            DAYS_EMPLOYED=lambda df: df["DAYS_EMPLOYED"].replace(365243, np.nan),
            years_employed=lambda df: df["DAYS_EMPLOYED"] / -365,
            amt_income_total_per_family_member=lambda df: df["AMT_INCOME_TOTAL"]
            / df["CNT_FAM_MEMBERS"],
            cnt_fam_members_excluding_children=lambda df: df["CNT_FAM_MEMBERS"]
            - df["CNT_CHILDREN"],
            amt_annuity_to_credit_ratio=lambda df: df["AMT_ANNUITY"] / df["AMT_CREDIT"],
            amt_annuity_to_income_ratio=lambda df: df["AMT_ANNUITY"]
            / df["AMT_INCOME_TOTAL"],
            amt_credit_to_income_ratio=lambda df: df["AMT_CREDIT"]
            / df["AMT_INCOME_TOTAL"],
            amt_annuity_to_income_per_family_member=lambda df: df["AMT_ANNUITY"]
            / df["amt_income_total_per_family_member"],
            # Make explicit the missing values: XNA → NaN
            ORGANIZATION_TYPE=lambda df: df["ORGANIZATION_TYPE"].replace("XNA", np.nan),
        )
        return X.drop(
            columns=[
                "SK_ID_CURR",
                # Restricted by legal constraints
                "CODE_GENDER",
                "NAME_FAMILY_STATUS",
                "DAYS_BIRTH",
                # Not useful / Unethical
                "WEEKDAY_APPR_PROCESS_START",
                "HOUR_APPR_PROCESS_START",
                # Almost constant
                "FLAG_MOBIL",
                # Already used/processed
                "EMERGENCYSTATE_MODE",
                "DAYS_EMPLOYED",
            ]
        )

    def get_feature_names_out(self):
        pass
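
To make two of the transformations above concrete, here is a minimal sketch on a toy data frame (the three rows are invented for illustration and are not taken from the dataset). It shows how the ordered categorical type turns education levels into integer codes, and how the `365243` placeholder in `DAYS_EMPLOYED` becomes a missing value before the conversion to years.

```python
import numpy as np
import pandas as pd

education_values = [
    "Lower secondary",
    "Secondary / secondary special",
    "Incomplete higher",
    "Higher education",
    "Academic degree",
]
education_dtype = pd.CategoricalDtype(categories=education_values, ordered=True)

# Toy rows, invented for illustration only
toy = pd.DataFrame(
    {
        "NAME_EDUCATION_TYPE": ["Higher education", "Lower secondary", "Academic degree"],
        "DAYS_EMPLOYED": [-730, 365243, -3650],
    }
)

toy = toy.assign(
    # Ordered categories: the integer code follows the order of `education_values`
    NAME_EDUCATION_TYPE=lambda df: df["NAME_EDUCATION_TYPE"].astype(education_dtype),
    ord_education_type=lambda df: df["NAME_EDUCATION_TYPE"].cat.codes,
    # The placeholder value 365243 is replaced with NaN before computing years
    DAYS_EMPLOYED=lambda df: df["DAYS_EMPLOYED"].replace(365243, np.nan),
    years_employed=lambda df: df["DAYS_EMPLOYED"] / -365,
)

print(toy)
```

Because later keyword arguments of `assign` can refer to columns created by earlier ones, `ord_education_type` and `years_employed` are computed from the already converted columns.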

4.2 Train Full Model

A Light Gradient Boosting Machine (LGBM) is used as the model here because it is fast, reasonably accurate, and robust to outliers, missing values, and some other issues, which means that a few pre-processing steps can be skipped.

Code
lgbm_classifier = LGBMClassifier(
    random_state=1, class_weight="balanced", n_jobs=-1, device="gpu"
)
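
The option `class_weight="balanced"` weights each class inversely to its frequency, using `n_samples / (n_classes * class_count)`. The sketch below is an illustration rather than part of the original analysis: it reproduces these weights with scikit-learn's `compute_class_weight` for the class counts that the LightGBM training log in this section reports (17377 positive and 197880 negative cases). The variable names are chosen for this example only.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Class counts as reported in the LightGBM training log of this section
n_positive, n_negative = 17_377, 197_880
y_toy = np.concatenate(
    [np.ones(n_positive, dtype=int), np.zeros(n_negative, dtype=int)]
)

# "balanced" weights: n_samples / (n_classes * class_count)
weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y_toy)
weight_negative, weight_positive = weights

print(f"Weight of class 0: {weight_negative:.3f}")
print(f"Weight of class 1: {weight_positive:.3f}")

# After weighting, both classes carry the same total weight
total_negative = weight_negative * n_negative
total_positive = weight_positive * n_positive
print(np.isclose(total_negative, total_positive))
```

Equal total weights per class are consistent with the `pavg=0.500000` line in the training log.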

The model creation in this section will consist of the following steps:

  1. application-specific pre-processing steps (see code in the previous section);
  2. general pre-processing steps for each data type (see the pipeline in the previous section);
  3. feature pre-selection removing duplicated and correlated columns;
  4. training the model.
Code
if "models_default_prediction" not in locals():
    models_default_prediction = {}


@my.cache_results(dir_interim + "task-1-applications-only--model-01_lgbm.pickle")
def fit_lgbm_default():
    """Fit a LGBM model."""
    pipeline = Pipeline(
        steps=[
            ("preprocessor_1", PreprocessorForApplications()),
            ("preprocessor_2", clone(pre_processing)),
            ("drop_duplicate_features", DropDuplicateFeatures()),
            (
                "drop_corr_features",
                SmartCorrelatedSelection(selection_method="variance"),
            ),
            ("classifier", clone(lgbm_classifier)),
        ]
    )
    pipeline.fit(X_train, y_train)
    return pipeline


models_default_prediction["LGBM"] = fit_lgbm_default()
models_default_prediction["LGBM"]
[LightGBM] [Info] Number of positive: 17377, number of negative: 197880
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 6162
[LightGBM] [Info] Number of data points in the train set: 215257, number of used features: 185
[LightGBM] [Info] Using GPU Device: NVIDIA GeForce GTX 950M, Vendor: NVIDIA Corporation
[LightGBM] [Info] Compiling OpenCL Kernel with 256 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 8
[LightGBM] [Info] 35 dense feature groups (7.39 MB) transferred to GPU in 0.023343 secs. 1 sparse feature groups
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=-0.000000
[LightGBM] [Info] Start training from score -0.000000
Pipeline(steps=[('preprocessor_1', PreprocessorForApplications()),
                ('preprocessor_2',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('numeric',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(add_indicator=True,
                                                                                 strategy='median'))]),
                                                  <sklearn.compose._column_transformer.make_column_selector object at 0x0000027560D0E090>),
                                                 ('categorical',
                                                  Pipeline(steps=[(...
                                                                   CleanColumnNames())]),
                                                  <sklearn.compose._column_transformer.make_column_selector object at 0x00000275A157FD50>)],
                                   verbose_feature_names_out=False)),
                ('drop_duplicate_features', DropDuplicateFeatures()),
                ('drop_corr_features',
                 SmartCorrelatedSelection(selection_method='variance')),
                ('classifier',
                 LGBMClassifier(class_weight='balanced', device='gpu',
                                n_jobs=-1, random_state=1))])
Code
feature_names_all = (
    models_default_prediction["LGBM"].named_steps["classifier"].feature_name_
)
print(f"N features used by LGBM: {len(feature_names_all)}")
N features used by LGBM: 197

4.3 Evaluate Models

The model is evaluated on the validation set. For reference, the results are also calculated on the training set.

The main metric to rank models here and in the other sections is the ROC AUC score. The values of the other metrics are taken into account too.
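
The helper `ml.classification_scores` is the author's own function and its code is not shown here. As a hedged sketch of how such columns are conventionally computed, the example below derives several of them with scikit-learn on toy labels (invented for illustration). It assumes that BAcc_01 equals `2 * BAcc - 1`, which is what `balanced_accuracy_score(..., adjusted=True)` returns for two classes and which agrees with the tabulated values in this section up to rounding.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, confusion_matrix, roc_auc_score

# Toy labels and predicted scores, invented for illustration only
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_score = np.array([0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.8, 0.35, 0.9, 0.65])
y_pred = (y_score >= 0.5).astype(int)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

tpr = tp / (tp + fn)  # true positive rate (recall)
tnr = tn / (tn + fp)  # true negative rate
ppv = tp / (tp + fp)  # positive predictive value (precision)
npv = tn / (tn + fn)  # negative predictive value

bacc = balanced_accuracy_score(y_true, y_pred)  # mean of TPR and TNR
bacc_01 = balanced_accuracy_score(y_true, y_pred, adjusted=True)  # rescaled BAcc

# Share of the most frequent class: accuracy of always predicting that class
no_info_rate = max(y_true.mean(), 1 - y_true.mean())

# ROC AUC is computed from scores, not from thresholded predictions
roc_auc = roc_auc_score(y_true, y_score)

print(f"TPR={tpr:.3f} TNR={tnr:.3f} BAcc={bacc:.3f} BAcc_01={bacc_01:.3f}")
print(f"No_info_rate={no_info_rate:.3f} ROC_AUC={roc_auc:.3f}")
```

With a no-information rate of 0.919, plain accuracy is a weak yardstick for this problem, which is one reason to rank models by ROC AUC and balanced accuracy instead.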

print("--- Train ---")

ml.classification_scores(
    models_default_prediction,
    X_train,
    y_train,
    color="orange",
    sort_by="ROC_AUC",
)
--- Train ---
Table 4.1. Classification scores for the train set. The rows are sorted by ROC-AUC score. The best values in each column are highlighted.
Model  n       No_info_rate  Accuracy  BAcc   BAcc_01  F1     F1_neg  TPR    TNR    PPV    NPV    ROC_AUC
LGBM   215257  0.919         0.714     0.721  0.443    0.292  0.821   0.731  0.712  0.182  0.968  0.799
print("--- Validation ---")

ml.classification_scores(
    models_default_prediction,
    X_validation,
    y_validation,
    sort_by="ROC_AUC",
)
--- Validation ---
Table 4.2. Classification scores for the validation set. The rows are sorted by ROC-AUC score. The best values in each column are highlighted.
Model  n       No_info_rate  Accuracy  BAcc   BAcc_01  F1     F1_neg  TPR    TNR    PPV    NPV    ROC_AUC
LGBM   46127   0.919         0.705     0.688  0.375    0.268  0.815   0.667  0.708  0.167  0.960  0.759
Code
sns.set_style("white")
y_pred_validation_lgbm = models_default_prediction["LGBM"].predict(X_validation)
ml.plot_confusion_matrices(y_validation, y_pred_validation_lgbm, figsize=(13, 3));

4.4 Feature Importance

In this section, feature importance is calculated using both the internal LGBM feature importance and SHAP values. The results indicate that the 5 most important features captured by both methods are:

  • the ratio of annuity to credit amount (amt_annuity_to_credit_ratio)
  • EXT_SOURCE_3
  • EXT_SOURCE_2
  • EXT_SOURCE_1
  • length of employment (years_employed).

Note. Feature names in CAPITALS indicate original features from the application table, and feature names in lowercase indicate features that were derived or extracted, or whose values were pre-processed.

Find the details below.

Code
@my.cache_results(dir_interim + "task-1-applications-only--shap_lgbm_k=all.pickle")
def get_shap_values_lgbm():
    model = "LGBM"
    preproc = Pipeline(steps=models_default_prediction[model].steps[:-1])
    classifier = models_default_prediction[model]["classifier"]
    X_validation_preproc = preproc.transform(X_validation)

    tree_explainer = shap.TreeExplainer(classifier)
    shap_values = tree_explainer.shap_values(X_validation_preproc)
    return shap_values, X_validation_preproc


shap_values_lgbm, data_for_lgbm = get_shap_values_lgbm()
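
The function above separates the fitted pre-processing steps from the final classifier, so that the explainer receives exactly the data the classifier was trained on. A minimal sketch of the same idea on synthetic data follows. The `StandardScaler` and `LogisticRegression` steps are stand-ins chosen for this example, and pipeline slicing (`pipe[:-1]`) is used as an alternative to rebuilding a `Pipeline` from `steps[:-1]`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data and a stand-in two-step pipeline
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=1)
pipe = Pipeline(
    steps=[("scaler", StandardScaler()), ("classifier", LogisticRegression())]
)
pipe.fit(X_toy, y_toy)

# All steps except the last one; the fitted transformers are reused, not refitted
preproc = pipe[:-1]
classifier = pipe["classifier"]
X_toy_preproc = preproc.transform(X_toy)

# The classifier applied to pre-processed data reproduces the pipeline's output
same_predictions = bool(
    (classifier.predict(X_toy_preproc) == pipe.predict(X_toy)).all()
)
print(same_predictions)
```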
Code
vals = np.abs(shap_values_lgbm).mean(0).mean(0)
feature_importance = (
    pd.DataFrame(
        list(zip(data_for_lgbm.columns, vals)),
        columns=["col_name", "importance"],
    )
    .sort_values(by=["importance"], ascending=False)
    .reset_index()
)
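
The reduction `np.abs(...).mean(0).mean(0)` can be checked on a small example. The sketch below uses random numbers as a stand-in for SHAP values and hypothetical feature names. It assumes the list-of-two-arrays format (one array per class) that the indexing `shap_values_lgbm[1]` in this section suggests, with class 0 values mirroring those of class 1.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
feature_names = ["feature_a", "feature_b", "feature_c"]  # hypothetical names

# Stand-in for SHAP values of the positive class: 100 samples x 3 features,
# scaled so that feature_b has the largest contributions
shap_class_1 = rng.normal(size=(100, 3)) * np.array([0.1, 1.0, 0.5])
# Assumed binary format: the class 0 array mirrors the class 1 array
shap_values_toy = [-shap_class_1, shap_class_1]

# Stacked shape is (2, 100, 3): average over classes, then over samples
vals = np.abs(shap_values_toy).mean(0).mean(0)

importance_toy = (
    pd.DataFrame({"col_name": feature_names, "importance": vals})
    .sort_values(by="importance", ascending=False)
    .reset_index(drop=True)
)
print(importance_toy)
```

Under this mirroring assumption, averaging over both classes gives the same result as the mean absolute SHAP value of the positive class alone.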
Code
sns.set_style("whitegrid")
lgb.plot_importance(
    models_default_prediction["LGBM"]["classifier"],
    max_num_features=50,
    figsize=(8, 10),
    height=0.8,
    title="LGBM Feature Importance",
);

Code
sns.set_style("whitegrid")
shap.summary_plot(
    shap_values_lgbm[1],
    data_for_lgbm,
    plot_type="bar",
    max_display=110,
    plot_size=(10, 15),
    show=False,
)
plt.tick_params(axis="y", labelsize=8)
plt.title("SHAP Feature Importance", fontsize=12)
plt.xlabel("Mean (|SHAP value|) \n(impact on model output)", fontsize=10);

Code
sns.set_style("whitegrid")
shap.summary_plot(
    shap_values_lgbm[1], data_for_lgbm, max_display=50, plot_size=(10, 9), show=False
)
plt.title("SHAP Feature Importance", fontsize=12)
plt.tick_params(axis="y", labelsize=8)
plt.tick_params(axis="x", labelsize=8)

Code of the table
feature_importance.query("importance > 0.00001").style.format(precision=5)
Table 4.3. SHAP feature importance for the validation set.
  index col_name importance
0 21 EXT_SOURCE_2 0.34360
1 22 EXT_SOURCE_3 0.33933
2 68 amt_annuity_to_credit_ratio 0.18039
3 20 EXT_SOURCE_1 0.15745
4 66 years_employed 0.11401
5 65 ord_education_type 0.09472
6 4 AMT_ANNUITY 0.06261
7 82 NAME_CONTRACT_TYPE_Cash_loans 0.05932
8 8 OWN_CAR_AGE 0.04880
9 3 AMT_CREDIT 0.04073
10 80 missingindicator_AMT_REQ_CREDIT_BUREAU_HOUR 0.04033
11 7 DAYS_ID_PUBLISH 0.03977
12 36 DEF_30_CNT_SOCIAL_CIRCLE 0.03616
13 37 DAYS_LAST_PHONE_CHANGE 0.02921
14 39 FLAG_DOCUMENT_3 0.02860
15 62 AMT_REQ_CREDIT_BUREAU_QRT 0.02840
16 15 REGION_RATING_CLIENT 0.02754
17 74 missingindicator_EXT_SOURCE_1 0.02751
18 67 cnt_fam_members_excluding_children 0.02707
19 0 FLAG_OWN_CAR 0.02629
20 97 NAME_INCOME_TYPE_Working 0.02496
21 6 DAYS_REGISTRATION 0.02361
22 115 OCCUPATION_TYPE_Laborers 0.02352
23 111 OCCUPATION_TYPE_Drivers 0.01835
24 69 amt_annuity_to_income_ratio 0.01825
25 63 AMT_REQ_CREDIT_BUREAU_YEAR 0.01549
26 18 REG_CITY_NOT_LIVE_CITY 0.01509
27 194 WALLSMATERIAL_MODE_Panel 0.01506
28 10 FLAG_WORK_PHONE 0.01406
29 110 OCCUPATION_TYPE_Core_staff 0.01404
30 27 YEARS_BEGINEXPLUATATION_MODE 0.01197
31 71 amt_annuity_to_income_per_family_member 0.01172
32 32 FLOORSMAX_MEDI 0.01132
33 168 ORGANIZATION_TYPE_Self_employed 0.01129
34 5 REGION_POPULATION_RELATIVE 0.00989
35 70 amt_credit_to_income_ratio 0.00968
36 107 OCCUPATION_TYPE_Accountants 0.00918
37 131 ORGANIZATION_TYPE_Business_Entity_Type_3 0.00904
38 2 AMT_INCOME_TOTAL 0.00888
39 133 ORGANIZATION_TYPE_Construction 0.00708
40 34 LANDAREA_MEDI 0.00694
41 9 FLAG_EMP_PHONE 0.00664
42 125 OCCUPATION_TYPE_nan 0.00643
43 12 FLAG_PHONE 0.00627
44 94 NAME_INCOME_TYPE_State_servant 0.00561
45 30 NONLIVINGAREA_MODE 0.00554
46 152 ORGANIZATION_TYPE_Industry_type_9 0.00502
47 28 ENTRANCES_MODE 0.00459
48 35 OBS_30_CNT_SOCIAL_CIRCLE 0.00429
49 157 ORGANIZATION_TYPE_Military 0.00425
50 14 CNT_FAM_MEMBERS 0.00404
51 31 COMMONAREA_MEDI 0.00387
52 165 ORGANIZATION_TYPE_School 0.00377
53 23 YEARS_BUILD_AVG 0.00369
54 172 ORGANIZATION_TYPE_Trade_type_2 0.00353
55 116 OCCUPATION_TYPE_Low_skill_Laborers 0.00291
56 26 BASEMENTAREA_MODE 0.00290
57 128 ORGANIZATION_TYPE_Bank 0.00287
58 29 LIVINGAPARTMENTS_MODE 0.00261
59 123 OCCUPATION_TYPE_Security_staff 0.00238
60 52 FLAG_DOCUMENT_16 0.00222
61 24 ELEVATORS_AVG 0.00204
62 84 NAME_TYPE_SUITE_Family 0.00178
63 180 ORGANIZATION_TYPE_Transport_type_3 0.00162
64 49 FLAG_DOCUMENT_13 0.00157
65 19 REG_CITY_NOT_WORK_CITY 0.00152
66 44 FLAG_DOCUMENT_8 0.00144
67 25 NONLIVINGAPARTMENTS_AVG 0.00140
68 33 FLOORSMIN_MEDI 0.00129
69 77 missingindicator_YEARS_BUILD_AVG 0.00128
70 104 NAME_HOUSING_TYPE_Office_apartment 0.00124
71 54 FLAG_DOCUMENT_18 0.00123
72 89 NAME_TYPE_SUITE_Unaccompanied 0.00115
73 61 AMT_REQ_CREDIT_BUREAU_MON 0.00107
74 169 ORGANIZATION_TYPE_Services 0.00098
75 154 ORGANIZATION_TYPE_Kindergarten 0.00087
76 1 FLAG_OWN_REALTY 0.00073
77 102 NAME_HOUSING_TYPE_House_apartment 0.00067
78 92 NAME_INCOME_TYPE_Commercial_associate 0.00063
79 160 ORGANIZATION_TYPE_Police 0.00059
80 105 NAME_HOUSING_TYPE_Rented_apartment 0.00056
81 167 ORGANIZATION_TYPE_Security_Ministries 0.00054
82 51 FLAG_DOCUMENT_15 0.00054
83 90 NAME_TYPE_SUITE_nan 0.00052
84 129 ORGANIZATION_TYPE_Business_Entity_Type_1 0.00047
85 155 ORGANIZATION_TYPE_Legal_Services 0.00047
86 162 ORGANIZATION_TYPE_Realtor 0.00045
87 60 AMT_REQ_CREDIT_BUREAU_WEEK 0.00043
88 186 FONDKAPREMONT_MODE_reg_oper_spec_account 0.00041
89 176 ORGANIZATION_TYPE_Trade_type_6 0.00039
90 181 ORGANIZATION_TYPE_Transport_type_4 0.00033
91 195 WALLSMATERIAL_MODE_Stone_brick 0.00030
92 58 AMT_REQ_CREDIT_BUREAU_HOUR 0.00028
93 177 ORGANIZATION_TYPE_Trade_type_7 0.00025
94 103 NAME_HOUSING_TYPE_Municipal_apartment 0.00024
95 16 REG_REGION_NOT_LIVE_REGION 0.00022
96 47 FLAG_DOCUMENT_11 0.00022
97 88 NAME_TYPE_SUITE_Spouse_partner 0.00021
98 13 FLAG_EMAIL 0.00020
99 122 OCCUPATION_TYPE_Secretaries 0.00019
100 83 NAME_TYPE_SUITE_Children 0.00018
101 159 ORGANIZATION_TYPE_Other 0.00018
102 185 FONDKAPREMONT_MODE_reg_oper_account 0.00018
103 147 ORGANIZATION_TYPE_Industry_type_4 0.00015
104 174 ORGANIZATION_TYPE_Trade_type_4 0.00015
105 156 ORGANIZATION_TYPE_Medicine 0.00015
106 108 OCCUPATION_TYPE_Cleaning_staff 0.00014
107 41 FLAG_DOCUMENT_5 0.00014
108 121 OCCUPATION_TYPE_Sales_staff 0.00012
109 191 WALLSMATERIAL_MODE_Mixed 0.00012
110 183 FONDKAPREMONT_MODE_not_specified 0.00011
111 118 OCCUPATION_TYPE_Medicine_staff 0.00010
112 76 missingindicator_EXT_SOURCE_3 0.00010
113 189 HOUSETYPE_MODE_nan 0.00009
114 124 OCCUPATION_TYPE_Waiters_barmen_staff 0.00009
115 101 NAME_HOUSING_TYPE_Co_op_apartment 0.00008
116 109 OCCUPATION_TYPE_Cooking_staff 0.00007
117 179 ORGANIZATION_TYPE_Transport_type_2 0.00007
118 190 WALLSMATERIAL_MODE_Block 0.00006
119 113 OCCUPATION_TYPE_High_skill_tech_staff 0.00006
120 119 OCCUPATION_TYPE_Private_service_staff 0.00005
121 17 REG_REGION_NOT_WORK_REGION 0.00004
122 117 OCCUPATION_TYPE_Managers 0.00004
123 170 ORGANIZATION_TYPE_Telecom 0.00004
124 127 ORGANIZATION_TYPE_Agriculture 0.00002

4.5 Training Models with Feature Selection

In this section, LGBM models are trained on subsets of features of the training set. These subsets are determined by SHAP values: 7 thresholds of SHAP values are used to select features.

This method is quicker than sequential feature selection (SFS) but might not be as accurate, as not all combinations of features are tested.
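
The threshold-based selection can be illustrated with the first 10 rows of Table 4.3. This is a minimal sketch: the name `feature_importance_top10` is introduced for this example only, and the full analysis applies the same query to the complete importance table.

```python
import pandas as pd

# First 10 rows of Table 4.3 (SHAP feature importance for the validation set)
feature_importance_top10 = pd.DataFrame(
    {
        "col_name": [
            "EXT_SOURCE_2",
            "EXT_SOURCE_3",
            "amt_annuity_to_credit_ratio",
            "EXT_SOURCE_1",
            "years_employed",
            "ord_education_type",
            "AMT_ANNUITY",
            "NAME_CONTRACT_TYPE_Cash_loans",
            "OWN_CAR_AGE",
            "AMT_CREDIT",
        ],
        "importance": [
            0.34360, 0.33933, 0.18039, 0.15745, 0.11401,
            0.09472, 0.06261, 0.05932, 0.04880, 0.04073,
        ],
    }
)

# Keep only the features whose mean |SHAP value| exceeds the threshold
threshold = 0.050
selected = feature_importance_top10.query(
    f"importance > {threshold}"
).col_name.to_list()

print(f"{len(selected)} features selected: {selected}")
```

With the threshold of 0.050, 8 features pass the filter, which corresponds to the smallest model in this section, "LGBM (8 features)".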

Although the validation ROC AUC value was highest for the full model (0.759), the model with 111 features was chosen for the next steps because its ROC AUC is lower by just 0.001 and most of its other metrics are better than those of the full model.

Find the details below.

Code
def fit_lgbm_on_features(features):
    """Template to fit a LGBM model with a smaller number of features."""
    pipeline = Pipeline(
        steps=[
            ("preprocessor_1", PreprocessorForApplications()),
            ("preprocessor_2", clone(pre_processing)),
            ("drop_duplicate_features", DropDuplicateFeatures()),
            (
                "drop_corr_features",
                SmartCorrelatedSelection(selection_method="variance"),
            ),
            ("selector", ColumnSelector(features)),
            ("classifier", clone(lgbm_classifier)),
        ]
    )
    pipeline.fit(X_train, y_train)
    return pipeline


def fit_lgbm_with_shap_threshold(threshold):
    """Function for feature selection based on SHAP values"""
    features = feature_importance.query(f"importance > {threshold}").col_name.to_list()

    k = len(features)

    return f"LGBM ({k} features)", fit_lgbm_on_features(features)
Code
# Restore from file or calculate
file = dir_interim + "task-1-default--lgbm_models_as_dict.pkl"

if os.path.exists(file):
    with open(file, "rb") as f:
        models_default_prediction = joblib.load(f)
else:
    for threshold in [0.00004, 0.0001, 0.0002, 0.001, 0.003, 0.010, 0.050]:
        model_name, model = fit_lgbm_with_shap_threshold(threshold)
        models_default_prediction[model_name] = model

    # Change name of the full model
    models_default_prediction[
        "LGBM (FULL | 197 feat.)"
    ] = models_default_prediction.pop("LGBM")

    with open(file, "wb") as f:
        joblib.dump(models_default_prediction, f)

del file

# Time: 9m 10.8s

The label of the full model is changed in the code above to make it easier to tell apart from the other models in the tables below.

Code
print("--- Train ---")
ml.classification_scores(
    models_default_prediction,
    X_train,
    y_train,
    sort_by="ROC_AUC",
    color="orange",
)
--- Train ---
Model                    n       No_info_rate  Accuracy  BAcc   BAcc_01  F1     F1_neg  TPR    TNR    PPV    NPV    ROC_AUC
LGBM (123 features)      215257  0.919         0.714     0.722  0.443    0.292  0.821   0.731  0.712  0.182  0.968  0.800
LGBM (FULL | 197 feat.)  215257  0.919         0.714     0.721  0.443    0.292  0.821   0.731  0.712  0.182  0.968  0.799
LGBM (98 features)       215257  0.919         0.714     0.722  0.444    0.292  0.820   0.733  0.712  0.183  0.968  0.799
LGBM (74 features)       215257  0.919         0.713     0.721  0.443    0.292  0.820   0.731  0.712  0.182  0.968  0.799
LGBM (111 features)      215257  0.919         0.714     0.721  0.443    0.292  0.820   0.731  0.712  0.182  0.968  0.799
LGBM (55 features)       215257  0.919         0.713     0.721  0.441    0.291  0.820   0.730  0.711  0.182  0.968  0.799
LGBM (34 features)       215257  0.919         0.711     0.718  0.437    0.289  0.819   0.726  0.710  0.180  0.967  0.796
LGBM (8 features)        215257  0.919         0.705     0.708  0.416    0.280  0.814   0.711  0.704  0.174  0.965  0.784
Code
print("--- Validation ---")
ml.classification_scores(
    models_default_prediction,
    X_validation,
    y_validation,
    sort_by="ROC_AUC",
)
--- Validation ---
Model                    n       No_info_rate  Accuracy  BAcc   BAcc_01  F1     F1_neg  TPR    TNR    PPV    NPV    ROC_AUC
LGBM (FULL | 197 feat.)  46127   0.919         0.705     0.688  0.375    0.268  0.815   0.667  0.708  0.167  0.960  0.759
LGBM (123 features)      46127   0.919         0.704     0.687  0.375    0.267  0.815   0.668  0.707  0.167  0.960  0.759
LGBM (111 features)      46127   0.919         0.705     0.691  0.381    0.269  0.816   0.673  0.708  0.168  0.961  0.758
LGBM (98 features)       46127   0.919         0.704     0.689  0.377    0.268  0.814   0.671  0.707  0.167  0.961  0.758
LGBM (74 features)       46127   0.919         0.704     0.689  0.377    0.268  0.815   0.670  0.707  0.167  0.961  0.758
LGBM (34 features)       46127   0.919         0.703     0.688  0.375    0.267  0.814   0.669  0.706  0.167  0.960  0.758
LGBM (55 features)       46127   0.919         0.703     0.688  0.375    0.267  0.814   0.669  0.706  0.167  0.961  0.758
LGBM (8 features)        46127   0.919         0.698     0.688  0.376    0.265  0.810   0.675  0.700  0.165  0.961  0.754

4.6 Hyperparameter Tuning

The model with 111 features is tuned using the Optuna package (Bayesian optimization).

Code
# Features to use: 111 features
feature_names_to_tune_111 = (
    models_default_prediction["LGBM (111 features)"]
    .named_steps["classifier"]
    .feature_name_
)

# Use 3-fold stratified CV
stratified_kfold = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)
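
How a stratified 3-fold split behaves on imbalanced data, and how it is scored by `cross_val_score`, can be illustrated with a minimal sketch on synthetic data. The `LogisticRegression` model and the generated dataset are stand-ins chosen for this example. Note that `cross_val_score` uses the estimator's default scorer (accuracy for classifiers) unless a `scoring` argument is passed.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced data (about 10% positives), for illustration only
X_toy, y_toy = make_classification(
    n_samples=600, n_features=10, weights=[0.9, 0.1], random_state=1
)
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)
model = LogisticRegression(max_iter=1000)

# Stratification keeps the share of positives nearly identical in every fold
fold_positive_rates = [y_toy[test].mean() for _, test in cv.split(X_toy, y_toy)]
print("Positive rate per fold:", np.round(fold_positive_rates, 3))

# Without `scoring`, the default scorer of a classifier is accuracy
scores_default = cross_val_score(model, X_toy, y_toy, cv=cv)
scores_accuracy = cross_val_score(model, X_toy, y_toy, cv=cv, scoring="accuracy")
# ROC AUC has to be requested explicitly
scores_roc_auc = cross_val_score(model, X_toy, y_toy, cv=cv, scoring="roc_auc")

print("Default scorer:", np.round(scores_default, 3))
print("Accuracy:      ", np.round(scores_accuracy, 3))
print("ROC AUC:       ", np.round(scores_roc_auc, 3))
```

On imbalanced data the two metrics can differ considerably, so the `scoring` argument determines which quantity an optimization routine actually maximizes.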
Code
# Define objective function for Optuna
def objective_1(trial):
    "Objective function for hyperparameter tuning"
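    # LGBM params.
    # NOTE: some keys below are LightGBM aliases of one another
    # (reg_alpha/lambda_l1, reg_lambda/lambda_l2, subsample/bagging_fraction,
    # colsample_bytree/feature_fraction, min_child_samples/min_data_in_leaf),
    # so each such pair is sampled twice but sets a single underlying parameter.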
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 50, 1000, step=50),
        "max_depth": trial.suggest_int("max_depth", 1, 12),
        "boosting_type": trial.suggest_categorical("boosting_type", ["gbdt"]),
        # Tree Structure and Complexity
        "num_leaves": trial.suggest_int("num_leaves", 2, 256),
        "min_child_samples": trial.suggest_int("min_child_samples", 5, 100),
        # Regularization
        "lambda_l1": trial.suggest_float("lambda_l1", 1e-8, 10.0, log=True),
        "lambda_l2": trial.suggest_float("lambda_l2", 1e-8, 10.0, log=True),
        "reg_alpha": trial.suggest_float("reg_alpha", 0.0, 1.0),
        "reg_lambda": trial.suggest_float("reg_lambda", 0.0, 1.0),
        # Learning Rate and Feature Selection
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        "feature_fraction": trial.suggest_float("feature_fraction", 0.4, 1.0),
        "subsample": trial.suggest_float("subsample", 0.05, 1.0),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.05, 1.0),
        # Other Parameters
        "bagging_fraction": trial.suggest_float("bagging_fraction", 0.4, 1.0),
        "bagging_freq": trial.suggest_int("bagging_freq", 1, 7),
        "min_child_weight": trial.suggest_float(
            "min_child_weight", 1e-3, 1e3, log=True
        ),
        "min_split_gain": trial.suggest_float("min_split_gain", 0.0, 1.0),
        "min_data_in_leaf": trial.suggest_int("min_data_in_leaf", 1, 50),
        "max_delta_step": trial.suggest_int("max_delta_step", 0, 10),
    }

    model = LGBMClassifier(
        objective="binary",
        metric="auc",
        random_state=1,
        class_weight="balanced",
        n_jobs=-1,
        device="gpu",
        **params,
    )

    pipeline_to_tune = Pipeline(
        steps=[
            ("preprocessor_1", PreprocessorForApplications()),
            ("preprocessor_2", clone(pre_processing)),
            ("selector", ColumnSelector(feature_names_to_tune_111)),
            ("classifier", model),
        ]
    )

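    # NOTE: no `scoring` argument is passed, so cross_val_score uses the
    # pipeline's default scorer (accuracy for classifiers), not ROC AUC.
    scores = cross_val_score(
        pipeline_to_tune, X_train, y_train, n_jobs=-1, cv=stratified_kfold
    )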

    return scores.mean()


study_name_1 = "tune--without-credit-history"
storage_name_1 = f"sqlite:///{dir_interim}/optuna--{study_name_1}.db"

study_1 = optuna.create_study(
    study_name=study_name_1,
    storage=storage_name_1,
    load_if_exists=True,
    direction="maximize",
)
study_1.optimize(objective_1, n_trials=100, timeout=3600)
# Time: 61m 42.5s
[I 2023-12-27 23:19:35,808] A new study created in RDB with name: tune--without-credit-history
[I 2023-12-27 23:21:36,524] Trial 0 finished with value: 0.8748890771937772 and parameters: {'n_estimators': 700, 'max_depth': 8, 'boosting_type': 'gbdt', 'num_leaves': 70, 'min_child_samples': 35, 'lambda_l1': 0.025550868350601743, 'lambda_l2': 6.238522519933288e-06, 'reg_alpha': 0.8322445184530112, 'reg_lambda': 0.4781666892844665, 'learning_rate': 0.22386006944324777, 'feature_fraction': 0.605273438694538, 'subsample': 0.2815517349153884, 'colsample_bytree': 0.5105571507272018, 'bagging_fraction': 0.6965379020611673, 'bagging_freq': 5, 'min_child_weight': 0.0659826384593787, 'min_split_gain': 0.04703076905732695, 'min_data_in_leaf': 24, 'max_delta_step': 3}. Best is trial 0 with value: 0.8748890771937772.
[I 2023-12-27 23:23:31,224] Trial 1 finished with value: 0.8719995072472626 and parameters: {'n_estimators': 450, 'max_depth': 11, 'boosting_type': 'gbdt', 'num_leaves': 213, 'min_child_samples': 85, 'lambda_l1': 1.661141373652044e-05, 'lambda_l2': 0.004818453214986181, 'reg_alpha': 0.4467088053506053, 'reg_lambda': 0.8312315360481795, 'learning_rate': 0.14446555019790158, 'feature_fraction': 0.9148147970988095, 'subsample': 0.06025781999414132, 'colsample_bytree': 0.12426740411827782, 'bagging_fraction': 0.819760663190308, 'bagging_freq': 5, 'min_child_weight': 0.13983078649703026, 'min_split_gain': 0.9980604904877477, 'min_data_in_leaf': 39, 'max_delta_step': 8}. Best is trial 0 with value: 0.8748890771937772.
[I 2023-12-27 23:25:13,496] Trial 2 finished with value: 0.8716975552657328 and parameters: {'n_estimators': 1000, 'max_depth': 10, 'boosting_type': 'gbdt', 'num_leaves': 217, 'min_child_samples': 81, 'lambda_l1': 0.009261946937607285, 'lambda_l2': 0.06187411358260759, 'reg_alpha': 0.8360855695430213, 'reg_lambda': 0.924252800361061, 'learning_rate': 0.2656294979420381, 'feature_fraction': 0.465108015073728, 'subsample': 0.45509562950354704, 'colsample_bytree': 0.8388483403850614, 'bagging_fraction': 0.7223922062864684, 'bagging_freq': 6, 'min_child_weight': 4.831699218012537, 'min_split_gain': 0.962711189116782, 'min_data_in_leaf': 35, 'max_delta_step': 2}. Best is trial 0 with value: 0.8748890771937772.
[I 2023-12-27 23:25:33,913] Trial 3 finished with value: 0.6886373087416217 and parameters: {'n_estimators': 300, 'max_depth': 3, 'boosting_type': 'gbdt', 'num_leaves': 248, 'min_child_samples': 33, 'lambda_l1': 0.004828148870358621, 'lambda_l2': 3.7703945012157024e-08, 'reg_alpha': 0.59959488943298, 'reg_lambda': 0.1437503088550186, 'learning_rate': 0.0378365539137563, 'feature_fraction': 0.8254606113930397, 'subsample': 0.8122840147057663, 'colsample_bytree': 0.9718079730031982, 'bagging_fraction': 0.5958803375842794, 'bagging_freq': 7, 'min_child_weight': 27.943197609701645, 'min_split_gain': 0.22052965204241526, 'min_data_in_leaf': 39, 'max_delta_step': 9}. Best is trial 0 with value: 0.8748890771937772.
[I 2023-12-27 23:27:54,032] Trial 4 finished with value: 0.8427321660875374 and parameters: {'n_estimators': 950, 'max_depth': 7, 'boosting_type': 'gbdt', 'num_leaves': 243, 'min_child_samples': 13, 'lambda_l1': 3.167458423762587e-07, 'lambda_l2': 2.0396359964114315e-05, 'reg_alpha': 0.9294582973364317, 'reg_lambda': 0.5563919876289434, 'learning_rate': 0.07876875308940313, 'feature_fraction': 0.7357755986308083, 'subsample': 0.21739585888299084, 'colsample_bytree': 0.7852352864892946, 'bagging_fraction': 0.7029190171696713, 'bagging_freq': 2, 'min_child_weight': 1.5343452877981487, 'min_split_gain': 0.9847878845340966, 'min_data_in_leaf': 4, 'max_delta_step': 6}. Best is trial 0 with value: 0.8748890771937772.
[I 2023-12-27 23:28:43,962] Trial 5 finished with value: 0.7071918616650716 and parameters: {'n_estimators': 850, 'max_depth': 4, 'boosting_type': 'gbdt', 'num_leaves': 144, 'min_child_samples': 26, 'lambda_l1': 0.01415493365081948, 'lambda_l2': 0.006424681506007423, 'reg_alpha': 0.6180900050335616, 'reg_lambda': 0.29405590359524547, 'learning_rate': 0.02219633676489674, 'feature_fraction': 0.9691971625917661, 'subsample': 0.11557394834032685, 'colsample_bytree': 0.062123273241513156, 'bagging_fraction': 0.6974205405994247, 'bagging_freq': 4, 'min_child_weight': 0.003574530682905926, 'min_split_gain': 0.9711972007891173, 'min_data_in_leaf': 49, 'max_delta_step': 4}. Best is trial 0 with value: 0.8748890771937772.
[I 2023-12-27 23:29:58,291] Trial 6 finished with value: 0.7174493746075338 and parameters: {'n_estimators': 1000, 'max_depth': 5, 'boosting_type': 'gbdt', 'num_leaves': 25, 'min_child_samples': 9, 'lambda_l1': 1.1640499253732652, 'lambda_l2': 1.4978615689896486, 'reg_alpha': 0.6703920743197584, 'reg_lambda': 0.6951344076255069, 'learning_rate': 0.03934371100464988, 'feature_fraction': 0.5426727265714434, 'subsample': 0.9003412951794741, 'colsample_bytree': 0.9572435750544385, 'bagging_fraction': 0.7122492104249079, 'bagging_freq': 2, 'min_child_weight': 383.3789145469394, 'min_split_gain': 0.13037245311437773, 'min_data_in_leaf': 15, 'max_delta_step': 0}. Best is trial 0 with value: 0.8748890771937772.
[I 2023-12-27 23:30:36,511] Trial 7 finished with value: 0.6979471073413656 and parameters: {'n_estimators': 750, 'max_depth': 2, 'boosting_type': 'gbdt', 'num_leaves': 25, 'min_child_samples': 45, 'lambda_l1': 0.00036467107284622217, 'lambda_l2': 0.9664719945485747, 'reg_alpha': 0.0037572914004266877, 'reg_lambda': 0.35831306235287763, 'learning_rate': 0.06805395733114952, 'feature_fraction': 0.5522713863361657, 'subsample': 0.28819032440828074, 'colsample_bytree': 0.5917850497085058, 'bagging_fraction': 0.9832808929105473, 'bagging_freq': 4, 'min_child_weight': 0.059396449073026505, 'min_split_gain': 0.8609953778780896, 'min_data_in_leaf': 2, 'max_delta_step': 4}. Best is trial 0 with value: 0.8748890771937772.
[I 2023-12-27 23:32:39,473] Trial 8 finished with value: 0.7632829528249984 and parameters: {'n_estimators': 700, 'max_depth': 9, 'boosting_type': 'gbdt', 'num_leaves': 183, 'min_child_samples': 86, 'lambda_l1': 0.43707893573098267, 'lambda_l2': 1.1063823368943688e-07, 'reg_alpha': 0.37683976035684597, 'reg_lambda': 0.7766568902763252, 'learning_rate': 0.011339287234428277, 'feature_fraction': 0.5522059526815742, 'subsample': 0.7322810968584744, 'colsample_bytree': 0.17684575464900754, 'bagging_fraction': 0.6152408818469279, 'bagging_freq': 1, 'min_child_weight': 0.07350345998766064, 'min_split_gain': 0.6272191527067923, 'min_data_in_leaf': 46, 'max_delta_step': 9}. Best is trial 0 with value: 0.8748890771937772.
[I 2023-12-27 23:32:58,013] Trial 9 finished with value: 0.6852599447300033 and parameters: {'n_estimators': 450, 'max_depth': 2, 'boosting_type': 'gbdt', 'num_leaves': 16, 'min_child_samples': 13, 'lambda_l1': 0.0070758192800633914, 'lambda_l2': 0.13717467880445933, 'reg_alpha': 0.7362306944558158, 'reg_lambda': 0.3272592750320643, 'learning_rate': 0.029383915056277837, 'feature_fraction': 0.8289326333877685, 'subsample': 0.9345982070505822, 'colsample_bytree': 0.2984780648394578, 'bagging_fraction': 0.925810197700263, 'bagging_freq': 1, 'min_child_weight': 49.251384853626384, 'min_split_gain': 0.008451976684397788, 'min_data_in_leaf': 2, 'max_delta_step': 6}. Best is trial 0 with value: 0.8748890771937772.
[I 2023-12-27 23:33:28,532] Trial 10 finished with value: 0.7684488721125219 and parameters: {'n_estimators': 150, 'max_depth': 7, 'boosting_type': 'gbdt', 'num_leaves': 87, 'min_child_samples': 65, 'lambda_l1': 4.891527454057723, 'lambda_l2': 1.7766807872936614e-05, 'reg_alpha': 0.9978156714517683, 'reg_lambda': 0.0016067837261933837, 'learning_rate': 0.2611021234227985, 'feature_fraction': 0.6516022331735631, 'subsample': 0.4763179523459447, 'colsample_bytree': 0.4636511039806581, 'bagging_fraction': 0.46156670236544706, 'bagging_freq': 5, 'min_child_weight': 0.0014304249163018315, 'min_split_gain': 0.3246805866112631, 'min_data_in_leaf': 23, 'max_delta_step': 0}. Best is trial 0 with value: 0.8748890771937772.
[I 2023-12-27 23:35:15,562] Trial 11 finished with value: 0.8782710858654724 and parameters: {'n_estimators': 500, 'max_depth': 12, 'boosting_type': 'gbdt', 'num_leaves': 113, 'min_child_samples': 98, 'lambda_l1': 2.8861819174336533e-06, 'lambda_l2': 0.0007058725506922919, 'reg_alpha': 0.42669822608800045, 'reg_lambda': 0.9783691932791746, 'learning_rate': 0.14222487027417705, 'feature_fraction': 0.9635586125518529, 'subsample': 0.050624589771094415, 'colsample_bytree': 0.342473029527568, 'bagging_fraction': 0.8553943871270485, 'bagging_freq': 5, 'min_child_weight': 0.15274206501441348, 'min_split_gain': 0.4461762335815785, 'min_data_in_leaf': 27, 'max_delta_step': 7}. Best is trial 11 with value: 0.8782710858654724.
[I 2023-12-27 23:36:53,579] Trial 12 finished with value: 0.8738531048568623 and parameters: {'n_estimators': 600, 'max_depth': 12, 'boosting_type': 'gbdt', 'num_leaves': 87, 'min_child_samples': 99, 'lambda_l1': 1.3423524891267378e-07, 'lambda_l2': 0.00013164904456447555, 'reg_alpha': 0.3753630433121057, 'reg_lambda': 0.9897294778233382, 'learning_rate': 0.14170726301672681, 'feature_fraction': 0.9980175031070498, 'subsample': 0.22833986121884964, 'colsample_bytree': 0.39512367125352293, 'bagging_fraction': 0.8724062667064009, 'bagging_freq': 5, 'min_child_weight': 0.2622582690107168, 'min_split_gain': 0.4416845674532798, 'min_data_in_leaf': 22, 'max_delta_step': 3}. Best is trial 11 with value: 0.8782710858654724.
[I 2023-12-27 23:38:25,091] Trial 13 finished with value: 0.8620160976845547 and parameters: {'n_estimators': 600, 'max_depth': 9, 'boosting_type': 'gbdt', 'num_leaves': 84, 'min_child_samples': 57, 'lambda_l1': 4.833402031282412e-06, 'lambda_l2': 2.5977647272347386e-06, 'reg_alpha': 0.5195545278104725, 'reg_lambda': 0.5971811647641649, 'learning_rate': 0.14126902389604182, 'feature_fraction': 0.6877658982489735, 'subsample': 0.056309405751741703, 'colsample_bytree': 0.5861630773127292, 'bagging_fraction': 0.8123145203990759, 'bagging_freq': 7, 'min_child_weight': 0.014161748826676928, 'min_split_gain': 0.523022406210781, 'min_data_in_leaf': 30, 'max_delta_step': 7}. Best is trial 11 with value: 0.8782710858654724.
[I 2023-12-27 23:39:35,072] Trial 14 finished with value: 0.8780248740132749 and parameters: {'n_estimators': 300, 'max_depth': 12, 'boosting_type': 'gbdt', 'num_leaves': 129, 'min_child_samples': 40, 'lambda_l1': 3.159875322258148e-08, 'lambda_l2': 0.0009420741355173971, 'reg_alpha': 0.8079070267854085, 'reg_lambda': 0.47196676315342473, 'learning_rate': 0.28064826534912485, 'feature_fraction': 0.4100906099818725, 'subsample': 0.32040123209157056, 'colsample_bytree': 0.2884983489599452, 'bagging_fraction': 0.823637627022563, 'bagging_freq': 3, 'min_child_weight': 0.01352913024023231, 'min_split_gain': 0.009335610475266924, 'min_data_in_leaf': 14, 'max_delta_step': 2}. Best is trial 11 with value: 0.8782710858654724.
[I 2023-12-27 23:40:07,058] Trial 15 finished with value: 0.7674314856025837 and parameters: {'n_estimators': 100, 'max_depth': 12, 'boosting_type': 'gbdt', 'num_leaves': 135, 'min_child_samples': 64, 'lambda_l1': 1.1093894693363508e-08, 'lambda_l2': 0.0007252146370295743, 'reg_alpha': 0.26404462746108115, 'reg_lambda': 0.674582771693113, 'learning_rate': 0.09954423258559975, 'feature_fraction': 0.4089650219134541, 'subsample': 0.38088460195602175, 'colsample_bytree': 0.26963460056226346, 'bagging_fraction': 0.9894528142903561, 'bagging_freq': 3, 'min_child_weight': 0.015885521142713823, 'min_split_gain': 0.2231871227491393, 'min_data_in_leaf': 10, 'max_delta_step': 10}. Best is trial 11 with value: 0.8782710858654724.
[I 2023-12-27 23:41:25,977] Trial 16 finished with value: 0.8775463754838221 and parameters: {'n_estimators': 300, 'max_depth': 11, 'boosting_type': 'gbdt', 'num_leaves': 165, 'min_child_samples': 49, 'lambda_l1': 2.979801132039705e-06, 'lambda_l2': 0.00025417590238099034, 'reg_alpha': 0.7249036921470404, 'reg_lambda': 0.8202778015863035, 'learning_rate': 0.18006312787180753, 'feature_fraction': 0.7370066028683423, 'subsample': 0.596637826806498, 'colsample_bytree': 0.33624300721567546, 'bagging_fraction': 0.8764736731421038, 'bagging_freq': 3, 'min_child_weight': 0.5791617342788797, 'min_split_gain': 0.3512094898270119, 'min_data_in_leaf': 16, 'max_delta_step': 2}. Best is trial 11 with value: 0.8782710858654724.
[I 2023-12-27 23:42:32,478] Trial 17 finished with value: 0.8545180808335937 and parameters: {'n_estimators': 300, 'max_depth': 10, 'boosting_type': 'gbdt', 'num_leaves': 114, 'min_child_samples': 74, 'lambda_l1': 1.0424205102368452e-08, 'lambda_l2': 8.250065245439023, 'reg_alpha': 0.5771786186662261, 'reg_lambda': 0.46386315662637, 'learning_rate': 0.2743647210971697, 'feature_fraction': 0.43022418099957055, 'subsample': 0.15288268840229252, 'colsample_bytree': 0.212170834584281, 'bagging_fraction': 0.7965514894530638, 'bagging_freq': 3, 'min_child_weight': 0.009215891417021284, 'min_split_gain': 0.13522409806794833, 'min_data_in_leaf': 16, 'max_delta_step': 5}. Best is trial 11 with value: 0.8782710858654724.
[I 2023-12-27 23:43:52,718] Trial 18 finished with value: 0.840214250784558 and parameters: {'n_estimators': 400, 'max_depth': 12, 'boosting_type': 'gbdt', 'num_leaves': 118, 'min_child_samples': 97, 'lambda_l1': 4.028505692985566e-05, 'lambda_l2': 0.0024298720059222328, 'reg_alpha': 0.8068097048036752, 'reg_lambda': 0.658718828376563, 'learning_rate': 0.11360943815373552, 'feature_fraction': 0.480041144602146, 'subsample': 0.3602857201118521, 'colsample_bytree': 0.39457375811046463, 'bagging_fraction': 0.912398175273994, 'bagging_freq': 6, 'min_child_weight': 0.001030133654874284, 'min_split_gain': 0.6525591832264386, 'min_data_in_leaf': 30, 'max_delta_step': 1}. Best is trial 11 with value: 0.8782710858654724.
[I 2023-12-27 23:44:19,977] Trial 19 finished with value: 0.7405938010394489 and parameters: {'n_estimators': 200, 'max_depth': 5, 'boosting_type': 'gbdt', 'num_leaves': 57, 'min_child_samples': 39, 'lambda_l1': 3.2047555533872153e-07, 'lambda_l2': 0.020605679980235606, 'reg_alpha': 0.5068287729814005, 'reg_lambda': 0.9227978766682805, 'learning_rate': 0.2064521033259073, 'feature_fraction': 0.8044307556062196, 'subsample': 0.17588926419955378, 'colsample_bytree': 0.208274754943567, 'bagging_fraction': 0.8068835108366229, 'bagging_freq': 2, 'min_child_weight': 0.709777498318447, 'min_split_gain': 0.3098974642919132, 'min_data_in_leaf': 10, 'max_delta_step': 8}. Best is trial 11 with value: 0.8782710858654724.
[I 2023-12-27 23:44:42,086] Trial 20 finished with value: 0.7612528170842955 and parameters: {'n_estimators': 50, 'max_depth': 10, 'boosting_type': 'gbdt', 'num_leaves': 158, 'min_child_samples': 24, 'lambda_l1': 1.1944409555908412e-06, 'lambda_l2': 0.0006399993856977176, 'reg_alpha': 0.6904445761992442, 'reg_lambda': 0.5190310484924262, 'learning_rate': 0.1782897440083468, 'feature_fraction': 0.9016233224676775, 'subsample': 0.5730556441213499, 'colsample_bytree': 0.06783581520353882, 'bagging_fraction': 0.8596259140100362, 'bagging_freq': 4, 'min_child_weight': 0.026274533589895142, 'min_split_gain': 0.016636042354496304, 'min_data_in_leaf': 29, 'max_delta_step': 6}. Best is trial 11 with value: 0.8782710858654724.
[I 2023-12-27 23:46:01,622] Trial 21 finished with value: 0.8833719634667491 and parameters: {'n_estimators': 300, 'max_depth': 11, 'boosting_type': 'gbdt', 'num_leaves': 177, 'min_child_samples': 45, 'lambda_l1': 3.4991269944278816e-06, 'lambda_l2': 0.0002491284134385983, 'reg_alpha': 0.7480374305092647, 'reg_lambda': 0.8248409723216801, 'learning_rate': 0.1901342759969906, 'feature_fraction': 0.7462514961118765, 'subsample': 0.5970550676337194, 'colsample_bytree': 0.36799990149343387, 'bagging_fraction': 0.9183731050628999, 'bagging_freq': 3, 'min_child_weight': 0.36135045191782317, 'min_split_gain': 0.40707406736393204, 'min_data_in_leaf': 16, 'max_delta_step': 2}. Best is trial 21 with value: 0.8833719634667491.
[I 2023-12-27 23:47:15,553] Trial 22 finished with value: 0.8881755261066901 and parameters: {'n_estimators': 250, 'max_depth': 11, 'boosting_type': 'gbdt', 'num_leaves': 197, 'min_child_samples': 54, 'lambda_l1': 6.487095526319997e-08, 'lambda_l2': 0.0017328406272876602, 'reg_alpha': 0.8954019209806247, 'reg_lambda': 0.7574800975978729, 'learning_rate': 0.28442706550634245, 'feature_fraction': 0.7823651612335393, 'subsample': 0.6274888850559157, 'colsample_bytree': 0.3549984446040611, 'bagging_fraction': 0.9350132824135351, 'bagging_freq': 3, 'min_child_weight': 0.18974615877784037, 'min_split_gain': 0.47269394006479004, 'min_data_in_leaf': 20, 'max_delta_step': 2}. Best is trial 22 with value: 0.8881755261066901.
[I 2023-12-27 23:48:14,237] Trial 23 finished with value: 0.851493793278416 and parameters: {'n_estimators': 200, 'max_depth': 11, 'boosting_type': 'gbdt', 'num_leaves': 190, 'min_child_samples': 57, 'lambda_l1': 9.149873937494441e-08, 'lambda_l2': 9.731946260935053e-05, 'reg_alpha': 0.9412953784916732, 'reg_lambda': 0.7672495740592127, 'learning_rate': 0.18715023155531416, 'feature_fraction': 0.7729667315426292, 'subsample': 0.6467429302022853, 'colsample_bytree': 0.39652678699156313, 'bagging_fraction': 0.9486007346471711, 'bagging_freq': 4, 'min_child_weight': 0.16338460855819342, 'min_split_gain': 0.4593591636130272, 'min_data_in_leaf': 20, 'max_delta_step': 1}. Best is trial 22 with value: 0.8881755261066901.
[I 2023-12-27 23:49:35,797] Trial 24 finished with value: 0.8814626237854742 and parameters: {'n_estimators': 550, 'max_depth': 9, 'boosting_type': 'gbdt', 'num_leaves': 210, 'min_child_samples': 70, 'lambda_l1': 6.206372509221023e-07, 'lambda_l2': 0.009433328937434073, 'reg_alpha': 0.933946630859976, 'reg_lambda': 0.9954225256080727, 'learning_rate': 0.2988681483708137, 'feature_fraction': 0.8673030919587643, 'subsample': 0.5347153032152265, 'colsample_bytree': 0.34481442679746765, 'bagging_fraction': 0.9428166360548743, 'bagging_freq': 2, 'min_child_weight': 0.41137454052279926, 'min_split_gain': 0.552588353202933, 'min_data_in_leaf': 28, 'max_delta_step': 4}. Best is trial 22 with value: 0.8881755261066901.
[I 2023-12-27 23:50:42,099] Trial 25 finished with value: 0.8713862821743685 and parameters: {'n_estimators': 400, 'max_depth': 9, 'boosting_type': 'gbdt', 'num_leaves': 215, 'min_child_samples': 66, 'lambda_l1': 5.606353525411963e-07, 'lambda_l2': 0.01645968116387086, 'reg_alpha': 0.900328990911367, 'reg_lambda': 0.8882478647335285, 'learning_rate': 0.2995296376396346, 'feature_fraction': 0.8653931193758865, 'subsample': 0.6956790691548889, 'colsample_bytree': 0.4605688189731501, 'bagging_fraction': 0.9963554434544024, 'bagging_freq': 2, 'min_child_weight': 1.3790999793509264, 'min_split_gain': 0.5500198236137742, 'min_data_in_leaf': 8, 'max_delta_step': 4}. Best is trial 22 with value: 0.8881755261066901.
[I 2023-12-27 23:51:58,432] Trial 26 finished with value: 0.8757299330182601 and parameters: {'n_estimators': 600, 'max_depth': 8, 'boosting_type': 'gbdt', 'num_leaves': 194, 'min_child_samples': 73, 'lambda_l1': 3.458832037131109e-08, 'lambda_l2': 0.01062829757405875, 'reg_alpha': 0.9090182869071985, 'reg_lambda': 0.8535617462549212, 'learning_rate': 0.2068294241225185, 'feature_fraction': 0.763010580961449, 'subsample': 0.5453704964594214, 'colsample_bytree': 0.24605307868957346, 'bagging_fraction': 0.942644568029684, 'bagging_freq': 1, 'min_child_weight': 0.42312410212590734, 'min_split_gain': 0.616416087546423, 'min_data_in_leaf': 20, 'max_delta_step': 3}. Best is trial 22 with value: 0.8881755261066901.
[I 2023-12-27 23:52:43,882] Trial 27 finished with value: 0.8116483935336053 and parameters: {'n_estimators': 200, 'max_depth': 8, 'boosting_type': 'gbdt', 'num_leaves': 229, 'min_child_samples': 52, 'lambda_l1': 8.577767779820766e-07, 'lambda_l2': 0.0018848346117680019, 'reg_alpha': 0.9868091766180361, 'reg_lambda': 0.9828569897195489, 'learning_rate': 0.21901708186437807, 'feature_fraction': 0.8703601172344868, 'subsample': 0.4981492493361206, 'colsample_bytree': 0.1845769619284794, 'bagging_fraction': 0.9078603509428422, 'bagging_freq': 3, 'min_child_weight': 3.1809205085934575, 'min_split_gain': 0.6988582226953822, 'min_data_in_leaf': 34, 'max_delta_step': 1}. Best is trial 22 with value: 0.8881755261066901.
[I 2023-12-27 23:54:13,932] Trial 28 finished with value: 0.8534635288851998 and parameters: {'n_estimators': 350, 'max_depth': 10, 'boosting_type': 'gbdt', 'num_leaves': 173, 'min_child_samples': 60, 'lambda_l1': 1.145739349972685e-07, 'lambda_l2': 0.00013318361338986002, 'reg_alpha': 0.7665287665052277, 'reg_lambda': 0.7432817893085424, 'learning_rate': 0.10513798925474571, 'feature_fraction': 0.7015728921275299, 'subsample': 0.6058590655055107, 'colsample_bytree': 0.3606173051524219, 'bagging_fraction': 0.9616218937977218, 'bagging_freq': 2, 'min_child_weight': 0.46816099764162866, 'min_split_gain': 0.39738802885664537, 'min_data_in_leaf': 18, 'max_delta_step': 5}. Best is trial 22 with value: 0.8881755261066901.
[I 2023-12-27 23:55:11,491] Trial 29 finished with value: 0.8186632554762338 and parameters: {'n_estimators': 550, 'max_depth': 6, 'boosting_type': 'gbdt', 'num_leaves': 199, 'min_child_samples': 71, 'lambda_l1': 1.9300245738733512e-05, 'lambda_l2': 0.06529332322007117, 'reg_alpha': 0.8540002668256893, 'reg_lambda': 0.8910470873527566, 'learning_rate': 0.2322779764126315, 'feature_fraction': 0.7815730411921928, 'subsample': 0.4178043783507136, 'colsample_bytree': 0.44913002804719304, 'bagging_fraction': 0.997686972664943, 'bagging_freq': 3, 'min_child_weight': 0.047450020650408674, 'min_split_gain': 0.5261641319776578, 'min_data_in_leaf': 27, 'max_delta_step': 3}. Best is trial 22 with value: 0.8881755261066901.
[I 2023-12-27 23:57:55,258] Trial 30 finished with value: 0.8921939752528024 and parameters: {'n_estimators': 700, 'max_depth': 9, 'boosting_type': 'gbdt', 'num_leaves': 152, 'min_child_samples': 33, 'lambda_l1': 0.00011264523436153885, 'lambda_l2': 0.002535426217214707, 'reg_alpha': 0.8531783896338508, 'reg_lambda': 0.8386482461746168, 'learning_rate': 0.22932817006049094, 'feature_fraction': 0.6501957138290259, 'subsample': 0.519408739568785, 'colsample_bytree': 0.5360604044432666, 'bagging_fraction': 0.9107998816579961, 'bagging_freq': 2, 'min_child_weight': 0.23203258978937152, 'min_split_gain': 0.3946794330666563, 'min_data_in_leaf': 25, 'max_delta_step': 3}. Best is trial 30 with value: 0.8921939752528024.
[I 2023-12-28 00:00:16,380] Trial 31 finished with value: 0.8922264966138842 and parameters: {'n_estimators': 700, 'max_depth': 9, 'boosting_type': 'gbdt', 'num_leaves': 156, 'min_child_samples': 29, 'lambda_l1': 0.00014582924208031534, 'lambda_l2': 0.00207070770891591, 'reg_alpha': 0.879099482559331, 'reg_lambda': 0.8226420082257161, 'learning_rate': 0.23972844391778333, 'feature_fraction': 0.6389072863404655, 'subsample': 0.4944031203227879, 'colsample_bytree': 0.5442477245125785, 'bagging_fraction': 0.9100444612104912, 'bagging_freq': 2, 'min_child_weight': 0.26729999120841763, 'min_split_gain': 0.39571056657151543, 'min_data_in_leaf': 24, 'max_delta_step': 3}. Best is trial 31 with value: 0.8922264966138842.
[I 2023-12-28 00:02:49,485] Trial 32 finished with value: 0.8989997991337638 and parameters: {'n_estimators': 800, 'max_depth': 11, 'boosting_type': 'gbdt', 'num_leaves': 149, 'min_child_samples': 26, 'lambda_l1': 9.803860851134294e-05, 'lambda_l2': 0.0030044424373395283, 'reg_alpha': 0.8485579521271137, 'reg_lambda': 0.823632033229195, 'learning_rate': 0.16123569549530453, 'feature_fraction': 0.6545602810054852, 'subsample': 0.5129109164629315, 'colsample_bytree': 0.5255478300943521, 'bagging_fraction': 0.9073804905823786, 'bagging_freq': 1, 'min_child_weight': 0.12693393736639316, 'min_split_gain': 0.3846614060379614, 'min_data_in_leaf': 25, 'max_delta_step': 2}. Best is trial 32 with value: 0.8989997991337638.
[I 2023-12-28 00:05:09,170] Trial 33 finished with value: 0.8958036145465975 and parameters: {'n_estimators': 750, 'max_depth': 10, 'boosting_type': 'gbdt', 'num_leaves': 151, 'min_child_samples': 25, 'lambda_l1': 0.00013239045157530194, 'lambda_l2': 0.0032071894959281267, 'reg_alpha': 0.8662145021497873, 'reg_lambda': 0.7533395165967287, 'learning_rate': 0.23146274771366598, 'feature_fraction': 0.6289110073229573, 'subsample': 0.45122154653169216, 'colsample_bytree': 0.5406432957366871, 'bagging_fraction': 0.8907563452214535, 'bagging_freq': 1, 'min_child_weight': 0.08022134072683336, 'min_split_gain': 0.3830191281331384, 'min_data_in_leaf': 24, 'max_delta_step': 2}. Best is trial 32 with value: 0.8989997991337638.
[I 2023-12-28 00:07:22,684] Trial 34 finished with value: 0.8897318053882265 and parameters: {'n_estimators': 750, 'max_depth': 8, 'boosting_type': 'gbdt', 'num_leaves': 150, 'min_child_samples': 23, 'lambda_l1': 0.0002756081233310541, 'lambda_l2': 0.004038521806636954, 'reg_alpha': 0.8368395098367797, 'reg_lambda': 0.8351955687445882, 'learning_rate': 0.15985743765335148, 'feature_fraction': 0.6189789865249556, 'subsample': 0.4704072511809242, 'colsample_bytree': 0.5594715372860741, 'bagging_fraction': 0.7656943167789485, 'bagging_freq': 1, 'min_child_weight': 0.09350196583068483, 'min_split_gain': 0.3716033548247211, 'min_data_in_leaf': 24, 'max_delta_step': 3}. Best is trial 32 with value: 0.8989997991337638.
[I 2023-12-28 00:09:19,708] Trial 35 finished with value: 0.8992227839303806 and parameters: {'n_estimators': 850, 'max_depth': 10, 'boosting_type': 'gbdt', 'num_leaves': 158, 'min_child_samples': 31, 'lambda_l1': 0.00012594780677761396, 'lambda_l2': 0.04190854332710144, 'reg_alpha': 0.8578304010160532, 'reg_lambda': 0.9082046808688874, 'learning_rate': 0.2269423609657828, 'feature_fraction': 0.6267650752224884, 'subsample': 0.45372996056218884, 'colsample_bytree': 0.6231626398111181, 'bagging_fraction': 0.8942737662563495, 'bagging_freq': 1, 'min_child_weight': 0.03139373537452999, 'min_split_gain': 0.2889146281291505, 'min_data_in_leaf': 34, 'max_delta_step': 1}. Best is trial 35 with value: 0.8992227839303806.
[I 2023-12-28 00:12:12,040] Trial 36 finished with value: 0.8967002186582534 and parameters: {'n_estimators': 850, 'max_depth': 10, 'boosting_type': 'gbdt', 'num_leaves': 133, 'min_child_samples': 17, 'lambda_l1': 0.0012579716529809257, 'lambda_l2': 0.13286685086210484, 'reg_alpha': 0.7970066890973759, 'reg_lambda': 0.9052548375736079, 'learning_rate': 0.12827308333727266, 'feature_fraction': 0.5947904298434539, 'subsample': 0.42544021303549495, 'colsample_bytree': 0.6185053528543994, 'bagging_fraction': 0.8897744944480148, 'bagging_freq': 1, 'min_child_weight': 0.03348700916069735, 'min_split_gain': 0.28462873394322535, 'min_data_in_leaf': 35, 'max_delta_step': 1}. Best is trial 35 with value: 0.8992227839303806.
[I 2023-12-28 00:15:01,246] Trial 37 finished with value: 0.8975457216031145 and parameters: {'n_estimators': 900, 'max_depth': 10, 'boosting_type': 'gbdt', 'num_leaves': 102, 'min_child_samples': 17, 'lambda_l1': 0.0010511873817839884, 'lambda_l2': 0.04511572731653093, 'reg_alpha': 0.8013630825405028, 'reg_lambda': 0.9251516562576017, 'learning_rate': 0.1589772264584597, 'feature_fraction': 0.6038001674080635, 'subsample': 0.42037162655371074, 'colsample_bytree': 0.6502318100866276, 'bagging_fraction': 0.8411653909921795, 'bagging_freq': 1, 'min_child_weight': 0.03848395291096782, 'min_split_gain': 0.27887820629397675, 'min_data_in_leaf': 39, 'max_delta_step': 0}. Best is trial 35 with value: 0.8992227839303806.
[I 2023-12-28 00:17:09,325] Trial 38 finished with value: 0.85945636749193 and parameters: {'n_estimators': 900, 'max_depth': 7, 'boosting_type': 'gbdt', 'num_leaves': 100, 'min_child_samples': 18, 'lambda_l1': 0.0009059956557422825, 'lambda_l2': 0.2372632702197369, 'reg_alpha': 0.7913680828324612, 'reg_lambda': 0.9399333038053995, 'learning_rate': 0.12402294496183226, 'feature_fraction': 0.5967861272024316, 'subsample': 0.39973665639568184, 'colsample_bytree': 0.6506821200780779, 'bagging_fraction': 0.835482168603026, 'bagging_freq': 1, 'min_child_weight': 0.02841779979609808, 'min_split_gain': 0.26635896836483625, 'min_data_in_leaf': 41, 'max_delta_step': 0}. Best is trial 35 with value: 0.8992227839303806.
[I 2023-12-28 00:18:37,903] Trial 39 finished with value: 0.8140501732349567 and parameters: {'n_estimators': 850, 'max_depth': 11, 'boosting_type': 'gbdt', 'num_leaves': 50, 'min_child_samples': 8, 'lambda_l1': 0.0014391679638090364, 'lambda_l2': 0.04193995020208812, 'reg_alpha': 0.6704979008319145, 'reg_lambda': 0.9220114570199825, 'learning_rate': 0.079993010687535, 'feature_fraction': 0.5948035295441012, 'subsample': 0.3540651650868418, 'colsample_bytree': 0.6798399781297993, 'bagging_fraction': 0.752945071168631, 'bagging_freq': 1, 'min_child_weight': 0.007388820142358995, 'min_split_gain': 0.2852848467749537, 'min_data_in_leaf': 35, 'max_delta_step': 1}. Best is trial 35 with value: 0.8992227839303806.
[I 2023-12-28 00:21:16,910] Trial 40 finished with value: 0.9046627973391566 and parameters: {'n_estimators': 950, 'max_depth': 10, 'boosting_type': 'gbdt', 'num_leaves': 134, 'min_child_samples': 16, 'lambda_l1': 0.002865600014675554, 'lambda_l2': 0.2596842570474358, 'reg_alpha': 0.7882191530750605, 'reg_lambda': 0.8807986419485984, 'learning_rate': 0.1590297123957936, 'feature_fraction': 0.6798887463051443, 'subsample': 0.4338721676814267, 'colsample_bytree': 0.7119935931199451, 'bagging_fraction': 0.8456859403262338, 'bagging_freq': 1, 'min_child_weight': 0.004745177681063355, 'min_split_gain': 0.17818587330243982, 'min_data_in_leaf': 40, 'max_delta_step': 0}. Best is trial 40 with value: 0.9046627973391566.

The best set of hyperparameters was found in trial 40 with CV ROC AUC of 0.905.

  • 'n_estimators': 950,
  • 'max_depth': 10,
  • 'boosting_type': 'gbdt',
  • 'num_leaves': 134,
  • 'min_child_samples': 16,
  • 'lambda_l1': 0.002865600014675554,
  • 'lambda_l2': 0.2596842570474358,
  • 'reg_alpha': 0.7882191530750605,
  • 'reg_lambda': 0.8807986419485984,
  • 'learning_rate': 0.1590297123957936,
  • 'feature_fraction': 0.6798887463051443,
  • 'subsample': 0.4338721676814267,
  • 'colsample_bytree': 0.7119935931199451,
  • 'bagging_fraction': 0.8456859403262338,
  • 'bagging_freq': 1,
  • 'min_child_weight': 0.004745177681063355,
  • 'min_split_gain': 0.17818587330243982,
  • 'min_data_in_leaf': 40,
  • 'max_delta_step': 0
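
The best trial can also be recovered from the log itself. The following is a minimal sketch, assuming only the log format shown above; the helper name `best_trial_from_log` is hypothetical, and the three log lines are abbreviated copies of the output (their parameter dictionaries are shortened to `{...}`). When the Optuna study object is still in memory, its `best_trial` and `best_params` attributes give the same information directly.

```python
import re

# Abbreviated copies of three log lines from the output above
# (parameter dictionaries shortened to "{...}" for brevity).
log_lines = [
    "[I 2023-12-28 00:09:19,708] Trial 35 finished with value: 0.8992227839303806 and parameters: {...}. Best is trial 35 with value: 0.8992227839303806.",
    "[I 2023-12-28 00:18:37,903] Trial 39 finished with value: 0.8140501732349567 and parameters: {...}. Best is trial 35 with value: 0.8992227839303806.",
    "[I 2023-12-28 00:21:16,910] Trial 40 finished with value: 0.9046627973391566 and parameters: {...}. Best is trial 40 with value: 0.9046627973391566.",
]

# Captures the trial number and the objective value of each finished trial.
pattern = re.compile(r"Trial (\d+) finished with value: ([0-9.eE+-]+) and")


def best_trial_from_log(lines):
    """Return (trial_number, value) of the highest-scoring trial in the log."""
    trials = []
    for line in lines:
        match = pattern.search(line)
        if match:
            trials.append((int(match.group(1)), float(match.group(2))))
    return max(trials, key=lambda trial: trial[1])


best_number, best_value = best_trial_from_log(log_lines)
print(f"Best trial: {best_number}, CV ROC AUC = {best_value:.3f}")
```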

4.7 Evaluate Tuned Model

The tuned model with the best set of hyperparameters is evaluated on the validation set. For reference, the results are also calculated on the training set.

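The warning messages indicate that, when several names (aliases) of the same LightGBM parameter are set at once (e.g., min_data_in_leaf and min_child_samples), only one value is used and the others are ignored. The ignored parameters are therefore left out of the final parameter set.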
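
The rule described by these warnings can be sketched in a few lines. This is only an illustration, not the code used in the project: the helper `drop_ignored_aliases` is hypothetical, the alias pairs are limited to those named in the warnings of this section, and the example dictionary is a subset of the trial 40 values.

```python
# Alias -> primary-name pairs, taken from the LightGBM warnings in this section.
ALIAS_TO_PRIMARY = {
    "min_child_samples": "min_data_in_leaf",
    "colsample_bytree": "feature_fraction",
    "reg_alpha": "lambda_l1",
    "reg_lambda": "lambda_l2",
    "subsample": "bagging_fraction",
    "subsample_freq": "bagging_freq",
}


def drop_ignored_aliases(params):
    """Remove aliases whose primary parameter is also present in `params`."""
    return {
        name: value
        for name, value in params.items()
        # Non-aliases map to None, which is never a key, so they are kept.
        if ALIAS_TO_PRIMARY.get(name) not in params
    }


# A subset of the trial 40 parameters, including three alias pairs.
trial_40_subset = {
    "min_child_samples": 16,
    "min_data_in_leaf": 40,
    "reg_alpha": 0.7882191530750605,
    "lambda_l1": 0.002865600014675554,
    "subsample": 0.4338721676814267,
    "bagging_fraction": 0.8456859403262338,
    "learning_rate": 0.1590297123957936,
}

cleaned = drop_ignored_aliases(trial_40_subset)
print(sorted(cleaned))
```

An alias is kept when its primary name is absent, so a dictionary that only uses scikit-learn style names passes through unchanged.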

The results show that the tuned model performs best on the training set (ROC AUC = 1.000) but worst on the validation set (ROC AUC = 0.716). This indicates overfitting. Feasible options in this situation may include:

  1. tuning the model with fewer features (e.g., 34, as its ROC AUC is almost the same as that of the model with 111 features);
  2. using a simpler model (e.g., Logistic Regression).

Unfortunately, there is currently not enough time to implement these options, so the untuned model with 111 features is selected as the best one.

See the details below.

An example of warnings
[LightGBM] [Warning] min_data_in_leaf is set=40, min_child_samples=16 will be ignored. Current value: min_data_in_leaf=40
[LightGBM] [Warning] feature_fraction is set=0.6798887463051443, colsample_bytree=0.7119935931199451 will be ignored. Current value: feature_fraction=0.6798887463051443
[LightGBM] [Warning] lambda_l1 is set=0.002865600014675554, reg_alpha=0.7882191530750605 will be ignored. Current value: lambda_l1=0.002865600014675554
[LightGBM] [Warning] lambda_l2 is set=0.2596842570474358, reg_lambda=0.8807986419485984 will be ignored. Current value: lambda_l2=0.2596842570474358
[LightGBM] [Warning] bagging_fraction is set=0.8456859403262338, subsample=0.4338721676814267 will be ignored. Current value: bagging_fraction=0.8456859403262338
[LightGBM] [Warning] bagging_freq is set=1, subsample_freq=0 will be ignored. Current value: bagging_freq=1
[LightGBM] [Warning] min_data_in_leaf is set=40, min_child_samples=16 will be ignored. Current value: min_data_in_leaf=40
[LightGBM] [Warning] feature_fraction is set=0.6798887463051443, colsample_bytree=0.7119935931199451 will be ignored. Current value: feature_fraction=0.6798887463051443
[LightGBM] [Warning] lambda_l1 is set=0.002865600014675554, reg_alpha=0.7882191530750605 will be ignored. Current value: lambda_l1=0.002865600014675554
[LightGBM] [Warning] lambda_l2 is set=0.2596842570474358, reg_lambda=0.8807986419485984 will be ignored. Current value: lambda_l2=0.2596842570474358
[LightGBM] [Warning] bagging_fraction is set=0.8456859403262338, subsample=0.4338721676814267 will be ignored. Current value: bagging_fraction=0.8456859403262338
[LightGBM] [Warning] bagging_freq is set=1, subsample_freq=0 will be ignored. Current value: bagging_freq=1
Code
# The ignored parameters are left commented out
params_tuned_1 = {
    "n_estimators": 950,
    "max_depth": 10,
    "boosting_type": "gbdt",
    "num_leaves": 134,
    # "min_child_samples": 16,
    "lambda_l1": 0.002865600014675554,
    "lambda_l2": 0.2596842570474358,
    # "reg_alpha": 0.7882191530750605,
    # "reg_lambda": 0.8807986419485984,
    "learning_rate": 0.1590297123957936,
    "feature_fraction": 0.6798887463051443,
    # "subsample": 0.4338721676814267,
    # "colsample_bytree": 0.7119935931199451,
    "bagging_fraction": 0.8456859403262338,
    "bagging_freq": 1,
    "min_child_weight": 0.004745177681063355,
    "min_split_gain": 0.17818587330243982,
    "min_data_in_leaf": 40,
    "max_delta_step": 0,
}

model_tuned_1 = LGBMClassifier(
    objective="binary",
    metric="auc",
    random_state=1,
    class_weight="balanced",
    n_jobs=-1,
    device="gpu",
    **params_tuned_1,
)

pipeline_with_tuned_model = Pipeline(
    steps=[
        ("preprocessor_1", PreprocessorForApplications()),
        ("preprocessor_2", clone(pre_processing)),
        ("selector", ColumnSelector(feature_names_to_tune_111)),
        ("classifier", clone(model_tuned_1)),
    ]
)

models_default_prediction["LGBM (111 feat. | tuned)"] = pipeline_with_tuned_model.fit(
    X_train, y_train
)
# Time: 3m 20.7s
Code
performance_train_1 = ml.classification_scores(
    models_default_prediction,
    X_train,
    y_train,
    sort_by="ROC_AUC",
    color="orange",
)
[LightGBM] [Warning] min_data_in_leaf is set=40, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=40
[LightGBM] [Warning] feature_fraction is set=0.6798887463051443, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.6798887463051443
[LightGBM] [Warning] lambda_l1 is set=0.002865600014675554, reg_alpha=0.0 will be ignored. Current value: lambda_l1=0.002865600014675554
[LightGBM] [Warning] lambda_l2 is set=0.2596842570474358, reg_lambda=0.0 will be ignored. Current value: lambda_l2=0.2596842570474358
[LightGBM] [Warning] bagging_fraction is set=0.8456859403262338, subsample=1.0 will be ignored. Current value: bagging_fraction=0.8456859403262338
[LightGBM] [Warning] bagging_freq is set=1, subsample_freq=0 will be ignored. Current value: bagging_freq=1
[LightGBM] [Warning] min_data_in_leaf is set=40, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=40
[LightGBM] [Warning] feature_fraction is set=0.6798887463051443, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.6798887463051443
[LightGBM] [Warning] lambda_l1 is set=0.002865600014675554, reg_alpha=0.0 will be ignored. Current value: lambda_l1=0.002865600014675554
[LightGBM] [Warning] lambda_l2 is set=0.2596842570474358, reg_lambda=0.0 will be ignored. Current value: lambda_l2=0.2596842570474358
[LightGBM] [Warning] bagging_fraction is set=0.8456859403262338, subsample=1.0 will be ignored. Current value: bagging_fraction=0.8456859403262338
[LightGBM] [Warning] bagging_freq is set=1, subsample_freq=0 will be ignored. Current value: bagging_freq=1
Code
performance_validation_1 = ml.classification_scores(
    models_default_prediction,
    X_validation,
    y_validation,
    sort_by="ROC_AUC",
)
[LightGBM] [Warning] min_data_in_leaf is set=40, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=40
[LightGBM] [Warning] feature_fraction is set=0.6798887463051443, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.6798887463051443
[LightGBM] [Warning] lambda_l1 is set=0.002865600014675554, reg_alpha=0.0 will be ignored. Current value: lambda_l1=0.002865600014675554
[LightGBM] [Warning] lambda_l2 is set=0.2596842570474358, reg_lambda=0.0 will be ignored. Current value: lambda_l2=0.2596842570474358
[LightGBM] [Warning] bagging_fraction is set=0.8456859403262338, subsample=1.0 will be ignored. Current value: bagging_fraction=0.8456859403262338
[LightGBM] [Warning] bagging_freq is set=1, subsample_freq=0 will be ignored. Current value: bagging_freq=1
[LightGBM] [Warning] min_data_in_leaf is set=40, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=40
[LightGBM] [Warning] feature_fraction is set=0.6798887463051443, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.6798887463051443
[LightGBM] [Warning] lambda_l1 is set=0.002865600014675554, reg_alpha=0.0 will be ignored. Current value: lambda_l1=0.002865600014675554
[LightGBM] [Warning] lambda_l2 is set=0.2596842570474358, reg_lambda=0.0 will be ignored. Current value: lambda_l2=0.2596842570474358
[LightGBM] [Warning] bagging_fraction is set=0.8456859403262338, subsample=1.0 will be ignored. Current value: bagging_fraction=0.8456859403262338
[LightGBM] [Warning] bagging_freq is set=1, subsample_freq=0 will be ignored. Current value: bagging_freq=1
Code
print("--- Train ---")
performance_train_1
--- Train ---
Model                     n       No_info_rate  Accuracy  BAcc   BAcc_01  F1     F1_neg  TPR    TNR    PPV    NPV    ROC_AUC
LGBM (111 feat. | tuned)  215257  0.919         0.995     0.997  0.995    0.972  0.997   1.000  0.995  0.945  1.000  1.000
LGBM (123 features)       215257  0.919         0.714     0.722  0.443    0.292  0.821   0.731  0.712  0.182  0.968  0.800
LGBM (FULL | 197 feat.)   215257  0.919         0.714     0.721  0.443    0.292  0.821   0.731  0.712  0.182  0.968  0.799
LGBM (98 features)        215257  0.919         0.714     0.722  0.444    0.292  0.820   0.733  0.712  0.183  0.968  0.799
LGBM (74 features)        215257  0.919         0.713     0.721  0.443    0.292  0.820   0.731  0.712  0.182  0.968  0.799
LGBM (111 features)       215257  0.919         0.714     0.721  0.443    0.292  0.820   0.731  0.712  0.182  0.968  0.799
LGBM (55 features)        215257  0.919         0.713     0.721  0.441    0.291  0.820   0.730  0.711  0.182  0.968  0.799
LGBM (34 features)        215257  0.919         0.711     0.718  0.437    0.289  0.819   0.726  0.710  0.180  0.967  0.796
LGBM (8 features)         215257  0.919         0.705     0.708  0.416    0.280  0.814   0.711  0.704  0.174  0.965  0.784
Code
print("--- Validation ---")
performance_validation_1
--- Validation ---
Model                     n       No_info_rate  Accuracy  BAcc   BAcc_01  F1     F1_neg  TPR    TNR    PPV    NPV    ROC_AUC
LGBM (FULL | 197 feat.)   46127   0.919         0.705     0.688  0.375    0.268  0.815   0.667  0.708  0.167  0.960  0.759
LGBM (123 features)       46127   0.919         0.704     0.687  0.375    0.267  0.815   0.668  0.707  0.167  0.960  0.759
LGBM (111 features)       46127   0.919         0.705     0.691  0.381    0.269  0.816   0.673  0.708  0.168  0.961  0.758
LGBM (98 features)        46127   0.919         0.704     0.689  0.377    0.268  0.814   0.671  0.707  0.167  0.961  0.758
LGBM (74 features)        46127   0.919         0.704     0.689  0.377    0.268  0.815   0.670  0.707  0.167  0.961  0.758
LGBM (34 features)        46127   0.919         0.703     0.688  0.375    0.267  0.814   0.669  0.706  0.167  0.960  0.758
LGBM (55 features)        46127   0.919         0.703     0.688  0.375    0.267  0.814   0.669  0.706  0.167  0.961  0.758
LGBM (8 features)         46127   0.919         0.698     0.688  0.376    0.265  0.810   0.675  0.700  0.165  0.961  0.754
LGBM (111 feat. | tuned)  46127   0.919         0.896     0.568  0.136    0.215  0.944   0.178  0.959  0.274  0.930  0.716
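
The overfitting of the tuned model can be expressed as the gap between training and validation ROC AUC. The snippet below is a small illustration using values copied from the two tables of this section; the dictionary layout and variable names are chosen here for the example and are not part of the project code.

```python
# ROC AUC values copied from the train and validation tables above.
roc_auc = {
    "LGBM (111 feat. | tuned)": {"train": 1.000, "validation": 0.716},
    "LGBM (111 features)": {"train": 0.799, "validation": 0.758},
    "LGBM (34 features)": {"train": 0.796, "validation": 0.758},
}

# A large positive gap suggests that the model memorizes the training data.
gaps = {
    name: round(scores["train"] - scores["validation"], 3)
    for name, scores in roc_auc.items()
}

for name, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: train - validation ROC AUC = {gap:.3f}")
```

The gap of the tuned model is several times larger than that of the untuned models, which is consistent with the overfitting conclusion.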

4.8 Final Evaluation

After hyperparameter tuning, the trade-off between model complexity and performance was reconsidered. Instead of the best-performing model based on 111 features, a much less complex model based on 34 features with comparable performance (validation AUC = 0.758, which differs by less than 0.0005) was chosen as the final model to be deployed.

The final performance of the model based on these features, measured on the test set, is AUC = 0.763 (slightly better, which may be related to the fact that the model was trained on a larger dataset).
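
The complexity versus performance trade-off can be written as a simple rule: take the smallest feature set whose validation AUC stays within a tolerance of the best one. This is a sketch of that idea and not the procedure actually used in the project: the function name and the tolerance of 0.002 are assumptions, the comparison is made against the best rounded value rather than against the 111-feature model, and the AUC values are the rounded ones from the validation table in Section 4.7.

```python
# Validation ROC AUC by number of features (rounded values from Section 4.7).
validation_auc = {
    197: 0.759,
    123: 0.759,
    111: 0.758,
    98: 0.758,
    74: 0.758,
    55: 0.758,
    34: 0.758,
    8: 0.754,
}


def smallest_comparable_model(auc_by_n_features, tolerance):
    """Return the smallest feature count whose AUC is within `tolerance` of the best."""
    best = max(auc_by_n_features.values())
    candidates = [
        n_features
        for n_features, auc in auc_by_n_features.items()
        if best - auc <= tolerance
    ]
    return min(candidates)


# The tolerance is an assumed value chosen for illustration.
print(smallest_comparable_model(validation_auc, tolerance=0.002))
```

With this tolerance the rule selects the 34-feature model, while the 8-feature model is rejected because its AUC is 0.005 below the best value.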

Code
features_34 = feature_importance.head(34).col_name.to_list()

pipeline_final_1_with_34_feat = Pipeline(
    steps=[
        ("preprocessor_1", PreprocessorForApplications()),
        ("preprocessor_2", clone(pre_processing)),
        ("selector", ColumnSelector(features_34)),
        ("classifier", clone(lgbm_classifier)),
    ]
)
Code
# For performance evaluation
X_train_validation = pd.concat([X_train, X_validation])
y_train_validation = pd.concat([y_train, y_validation])

file = dir_interim + "task-1-applications-only--lgbm_models_final_1.pkl"

if os.path.exists(file):
    with open(file, "rb") as f:
        models_final_1 = joblib.load(f)
else:
    models_final_1 = {}
    models_final_1["LGBM (34 feat. | final)"] = pipeline_final_1_with_34_feat.fit(
        X_train_validation, y_train_validation
    )
    with open(file, "wb") as f:
        joblib.dump(models_final_1, f)

del file
[LightGBM] [Info] Number of positive: 21101, number of negative: 240283
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 3410
[LightGBM] [Info] Number of data points in the train set: 261384, number of used features: 34
[LightGBM] [Info] Using GPU Device: NVIDIA GeForce GTX 950M, Vendor: NVIDIA Corporation
[LightGBM] [Info] Compiling OpenCL Kernel with 256 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 8
[LightGBM] [Info] 17 dense feature groups (4.99 MB) transferred to GPU in 0.013168 secs. 1 sparse feature groups
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=-0.000000
[LightGBM] [Info] Start training from score -0.000000
Code
print("--- Train ---")

ml.classification_scores(
    models_final_1,
    X_train_validation,
    y_train_validation,
    color="orange",
    sort_by="ROC_AUC",
)
--- Train ---
Model                     n       No_info_rate  Accuracy  BAcc   BAcc_01  F1     F1_neg  TPR    TNR    PPV    NPV    ROC_AUC
LGBM (34 feat. | final)   261384  0.919         0.710     0.714  0.429    0.286  0.818   0.720  0.709  0.178  0.966  0.791
Code of the figure
sns.set_style("white")
y_pred_train_val_1 = models_final_1["LGBM (34 feat. | final)"].predict(
    X_train_validation
)
ml.plot_confusion_matrices(y_train_validation, y_pred_train_val_1, figsize=(13, 3));
Fig. 4.1. Confusion matrices for the joint train and validation set.
Code
print("--- Test ---")

ml.classification_scores(
    models_final_1,
    X_test,
    y_test,
    sort_by="ROC_AUC",
)
--- Test ---
Model                     n       No_info_rate  Accuracy  BAcc   BAcc_01  F1     F1_neg  TPR    TNR    PPV    NPV    ROC_AUC
LGBM (34 feat. | final)   46127   0.919         0.707     0.696  0.392    0.273  0.816   0.683  0.709  0.171  0.962  0.763
Code of the figure
sns.set_style("white")
y_pred_test_1 = models_final_1["LGBM (34 feat. | final)"].predict(X_test)
ml.plot_confusion_matrices(y_test, y_pred_test_1, figsize=(13, 3));
Fig. 4.2. Confusion matrices for the test set.
Code
# SHAP values for the final model
@my.cache_results(dir_interim + "task-1-applications-only--shap_lgbm_k=34-final.pickle")
def get_shap_values_lgbm_final_1():
    model = "LGBM (34 feat. | final)"
    preproc = Pipeline(steps=models_final_1[model].steps[:-1])
    classifier = models_final_1[model]["classifier"]
    X_test_preproc = preproc.transform(X_test)

    tree_explainer = shap.TreeExplainer(classifier)
    shap_values = tree_explainer.shap_values(X_test_preproc)
    return shap_values, X_test_preproc


shap_values_lgbm_test_1, data_for_lgbm_test_1 = get_shap_values_lgbm_final_1()

feature_importance_test_1 = (
    pd.DataFrame(
        list(
            zip(
                data_for_lgbm_test_1.columns,
                np.abs(shap_values_lgbm_test_1).mean(0).mean(0),
            )
        ),
        columns=["col_name", "importance"],
    )
    .sort_values(by=["importance"], ascending=False)
    .reset_index()
)
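The expression `np.abs(...).mean(0).mean(0)` is compact, so a toy illustration of the array shapes may help. The sketch assumes, as the `[1]` indexing in the plotting cells suggests, that the SHAP values are a list with one `(n_samples, n_features)` array per class, and that for a binary model the two arrays mirror each other. All numbers are made up.

```python
import numpy as np

# Toy SHAP values for class 1: 3 samples x 2 features (made-up numbers)
shap_class_1 = np.array([[0.2, -0.1], [-0.4, 0.3], [0.6, -0.2]])
# Assumption: for a binary model, class-0 values are the negatives of class-1 values
shap_values = [-shap_class_1, shap_class_1]

stacked = np.abs(shap_values)  # shape (2, 3, 2): class, sample, feature
importance = stacked.mean(0).mean(0)  # average over classes, then over samples

# Under the mirroring assumption this equals the mean |SHAP| of class 1 alone
importance_class_1 = np.abs(shap_class_1).mean(axis=0)
print(importance, importance_class_1)
```

Under this assumption, the ranking produced by the double mean matches the one shown by the SHAP bar plot for class 1.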
Code
sns.set_style("whitegrid")
lgb.plot_importance(
    models_final_1["LGBM (34 feat. | final)"]["classifier"],
    max_num_features=50,
    figsize=(8, 8),
    height=0.8,
    title="LGBM Feature Importance (Final Model)",
);

Code
sns.set_style("whitegrid")
shap.summary_plot(
    shap_values_lgbm_test_1[1],
    data_for_lgbm_test_1,
    plot_type="bar",
    max_display=50,
    plot_size=(10, 5),
    show=False,
)
plt.tick_params(axis="y", labelsize=8)
plt.title("SHAP Feature Importance (Final Model)", fontsize=12)
plt.xlabel("Mean (|SHAP value|) \n(impact on model output)", fontsize=10);

Code
sns.set_style("whitegrid")
shap.summary_plot(
    shap_values_lgbm_test_1[1],
    data_for_lgbm_test_1,
    max_display=50,
    plot_size=(10, 5),
    show=False,
)
plt.title("SHAP Feature Importance (Final Model)", fontsize=12)
plt.tick_params(axis="y", labelsize=8)
plt.tick_params(axis="x", labelsize=8)

4.9 Model for Deployment (w/o Historical Data)

Code
# For deployment
X_all = pd.concat([X_train, X_validation, X_test], axis=0)
y_all = pd.concat([y_train, y_validation, y_test], axis=0)

pipeline_to_deploy_1 = clone(pipeline_final_1_with_34_feat)
pipeline_to_deploy_1 = pipeline_to_deploy_1.fit(X_all, y_all)

For simplicity, the model will be deployed without the pre-processing pipeline.

Code
# Extract and save classifier
classifier_to_deploy_1 = pipeline_to_deploy_1.named_steps["classifier"]

with open("models/classifier-1--without_credit_history.pickle", "wb") as f:
    joblib.dump(classifier_to_deploy_1, f)
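Deploying only the classifier means that whoever calls the model has to supply features that have already passed through the same pre-processing steps. The sketch below illustrates this on a toy scikit-learn pipeline: the step names, the data and the temporary file path are illustrative assumptions and do not reproduce the project's actual pipeline.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# Toy stand-in for the fitted pipeline (hypothetical steps)
pipeline = Pipeline(
    steps=[("scaler", StandardScaler()), ("classifier", LogisticRegression())]
).fit(X, y)

preprocessor = pipeline[:-1]  # all steps except the last one
classifier = pipeline.named_steps["classifier"]  # the part that gets saved

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "classifier.pickle")
    joblib.dump(classifier, path)
    loaded = joblib.load(path)

# The saved classifier expects input that is already pre-processed
proba_full = pipeline.predict_proba(X)
proba_split = loaded.predict_proba(preprocessor.transform(X))
print(np.allclose(proba_full, proba_split))
```

If raw features were passed to the loaded classifier directly, its predictions would generally differ from those of the full pipeline.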

5 Feature Engineering

In this section, data from the tables with historical credit data will be prepared for modeling. First, each subsection presents the steps used to pre-process one dataset and get it ready for merging. Then, the datasets will be merged into a single joint dataset.

Note.

  1. In this section, all the features from the application dataset will be used again even though some of them were not used in the previous model.
  2. The feature selection will be performed after all the features are created and merged into a single dataset.

The main strategy for aggregating features was as follows:

  • for numeric features, the mean, median, standard deviation, maximum, minimum and range were calculated. On rare occasions, other statistics (e.g., the sum) were calculated too.
  • for categorical features, the frequency of either each category or only the most frequent categories was calculated.

As the datasets differed in structure and features, the pre-processing steps were adapted for each dataset.
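The aggregation strategy can be sketched on a toy table with several records per client. The data frame `toy` and its columns `AMT` and `TYPE` are made up for illustration; only the pattern (named aggregations grouped by `SK_ID_CURR`) mirrors the cells that follow.

```python
import pandas as pd

# Toy stand-in for a historical table: several records per client
toy = pd.DataFrame(
    {
        "SK_ID_CURR": [1, 1, 1, 2, 2],
        "AMT": [100.0, 300.0, 200.0, 50.0, 150.0],
        "TYPE": ["Credit card", "Car loan", "Credit card", "Mortgage", "Credit card"],
    }
)

aggregated = (
    toy.groupby("SK_ID_CURR")
    .agg(
        # Numeric feature: summary statistics
        amt_min=("AMT", "min"),
        amt_max=("AMT", "max"),
        amt_mean=("AMT", "mean"),
        amt_median=("AMT", "median"),
        amt_std=("AMT", "std"),
        amt_range=("AMT", lambda x: x.max() - x.min()),
        # Categorical feature: frequency of a selected category and variety
        n_credit_card=("TYPE", lambda x: (x == "Credit card").sum()),
        n_types=("TYPE", "nunique"),
    )
    .reset_index()
)
print(aggregated)
```

The result has one row per client, which is the shape needed for merging with the application table on `SK_ID_CURR`.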

5.1 Table bureau

Code
bureau.head()
SK_ID_CURR SK_ID_BUREAU CREDIT_ACTIVE CREDIT_CURRENCY DAYS_CREDIT CREDIT_DAY_OVERDUE DAYS_CREDIT_ENDDATE DAYS_ENDDATE_FACT AMT_CREDIT_MAX_OVERDUE CNT_CREDIT_PROLONG AMT_CREDIT_SUM AMT_CREDIT_SUM_DEBT AMT_CREDIT_SUM_LIMIT AMT_CREDIT_SUM_OVERDUE CREDIT_TYPE DAYS_CREDIT_UPDATE AMT_ANNUITY
0 215354 5714462 Closed currency 1 -497 0 -153.00 -153.00 NaN 0 91323.00 0.00 NaN 0.00 Consumer credit -131 NaN
1 215354 5714463 Active currency 1 -208 0 1075.00 NaN NaN 0 225000.00 171342.00 NaN 0.00 Credit card -20 NaN
2 215354 5714464 Active currency 1 -203 0 528.00 NaN NaN 0 464323.50 NaN NaN 0.00 Consumer credit -16 NaN
3 215354 5714465 Active currency 1 -203 0 NaN NaN NaN 0 90000.00 NaN NaN 0.00 Credit card -16 NaN
4 215354 5714466 Active currency 1 -629 0 1197.00 NaN 77674.50 0 2700000.00 NaN NaN 0.00 Consumer credit -21 NaN
Code
file = dir_interim + "aggregated--bureau_aggregated.feather"

if os.path.exists(file):
    bureau_aggregated = pd.read_feather(file)

else:
    bureau_aggregated = (
        bureau.assign(
            CREDIT_TYPE=lambda df: df["CREDIT_TYPE"].apply(
                lambda x: x
                if x
                in [
                    "Consumer credit",
                    "Credit card",
                    "Car loan",
                    "Mortgage",
                    "Microloan",
                ]
                else "Other"
            ),
        )
        .groupby("SK_ID_CURR")
        .agg(
            n_credits_total=("SK_ID_BUREAU", "count"),
            n_credits_active=("CREDIT_ACTIVE", lambda x: (x == "Active").sum()),
            n_credits_closed=("CREDIT_ACTIVE", lambda x: (x == "Closed").sum()),
            n_credits_bad_debt=("CREDIT_ACTIVE", lambda x: (x == "Bad debt").sum()),
            n_credits_sold=("CREDIT_ACTIVE", lambda x: (x == "Sold").sum()),
            mode_credit_currency=(
                "CREDIT_CURRENCY",
                lambda x: x.mode().iloc[0] if not x.empty else None,
            ),
            n_different_currencies=("CREDIT_CURRENCY", "nunique"),
            n_currency_1=("CREDIT_CURRENCY", lambda x: (x == "currency 1").sum()),
            n_currency_2=("CREDIT_CURRENCY", lambda x: (x == "currency 2").sum()),
            n_currency_3=("CREDIT_CURRENCY", lambda x: (x == "currency 3").sum()),
            n_currency_4=("CREDIT_CURRENCY", lambda x: (x == "currency 4").sum()),
            days_credit_min=("DAYS_CREDIT", "min"),
            days_credit_max=("DAYS_CREDIT", "max"),
            days_credit_mean=("DAYS_CREDIT", "mean"),
            days_credit_std=("DAYS_CREDIT", "std"),
            days_credit_median=("DAYS_CREDIT", "median"),
            days_credit_range=("DAYS_CREDIT", lambda x: x.max() - x.min()),
            days_credit_overdue_min=("CREDIT_DAY_OVERDUE", "min"),
            days_credit_overdue_max=("CREDIT_DAY_OVERDUE", "max"),
            days_credit_overdue_mean=("CREDIT_DAY_OVERDUE", "mean"),
            days_credit_overdue_std=("CREDIT_DAY_OVERDUE", "std"),
            days_credit_overdue_median=("CREDIT_DAY_OVERDUE", "median"),
            days_credit_overdue_range=(
                "CREDIT_DAY_OVERDUE",
                lambda x: x.max() - x.min(),
            ),
            days_credit_enddate_min=("DAYS_CREDIT_ENDDATE", "min"),
            days_credit_enddate_max=("DAYS_CREDIT_ENDDATE", "max"),
            days_credit_enddate_mean=("DAYS_CREDIT_ENDDATE", "mean"),
            days_credit_enddate_std=("DAYS_CREDIT_ENDDATE", "std"),
            days_credit_enddate_median=("DAYS_CREDIT_ENDDATE", "median"),
            days_credit_enddate_range=(
                "DAYS_CREDIT_ENDDATE",
                lambda x: x.max() - x.min(),
            ),
            days_enddate_fact_min=("DAYS_ENDDATE_FACT", "min"),
            days_enddate_fact_max=("DAYS_ENDDATE_FACT", "max"),
            days_enddate_fact_mean=("DAYS_ENDDATE_FACT", "mean"),
            days_enddate_fact_std=("DAYS_ENDDATE_FACT", "std"),
            days_enddate_fact_median=("DAYS_ENDDATE_FACT", "median"),
            days_enddate_fact_range=("DAYS_ENDDATE_FACT", lambda x: x.max() - x.min()),
            amt_credit_max_overdue_min=("AMT_CREDIT_MAX_OVERDUE", "min"),
            amt_credit_max_overdue_max=("AMT_CREDIT_MAX_OVERDUE", "max"),
            amt_credit_max_overdue_mean=("AMT_CREDIT_MAX_OVERDUE", "mean"),
            amt_credit_max_overdue_std=("AMT_CREDIT_MAX_OVERDUE", "std"),
            amt_credit_max_overdue_median=("AMT_CREDIT_MAX_OVERDUE", "median"),
            amt_credit_max_overdue_range=(
                "AMT_CREDIT_MAX_OVERDUE",
                lambda x: x.max() - x.min(),
            ),
            cnt_credit_prolong_min=("CNT_CREDIT_PROLONG", "min"),
            cnt_credit_prolong_max=("CNT_CREDIT_PROLONG", "max"),
            cnt_credit_prolong_mean=("CNT_CREDIT_PROLONG", "mean"),
            cnt_credit_prolong_std=("CNT_CREDIT_PROLONG", "std"),
            cnt_credit_prolong_median=("CNT_CREDIT_PROLONG", "median"),
            cnt_credit_prolong_range=(
                "CNT_CREDIT_PROLONG",
                lambda x: x.max() - x.min(),
            ),
            cnt_credit_prolong_sum=("CNT_CREDIT_PROLONG", "sum"),
            amt_credit_sum_min=("AMT_CREDIT_SUM", "min"),
            amt_credit_sum_max=("AMT_CREDIT_SUM", "max"),
            amt_credit_sum_mean=("AMT_CREDIT_SUM", "mean"),
            amt_credit_sum_std=("AMT_CREDIT_SUM", "std"),
            amt_credit_sum_median=("AMT_CREDIT_SUM", "median"),
            amt_credit_sum_range=("AMT_CREDIT_SUM", lambda x: x.max() - x.min()),
            amt_credit_sum_sum=("AMT_CREDIT_SUM", "sum"),
            amt_credit_sum_debt_min=("AMT_CREDIT_SUM_DEBT", "min"),
            amt_credit_sum_debt_max=("AMT_CREDIT_SUM_DEBT", "max"),
            amt_credit_sum_debt_mean=("AMT_CREDIT_SUM_DEBT", "mean"),
            amt_credit_sum_debt_std=("AMT_CREDIT_SUM_DEBT", "std"),
            amt_credit_sum_debt_median=("AMT_CREDIT_SUM_DEBT", "median"),
            amt_credit_sum_debt_range=(
                "AMT_CREDIT_SUM_DEBT",
                lambda x: x.max() - x.min(),
            ),
            amt_credit_sum_debt_sum=("AMT_CREDIT_SUM_DEBT", "sum"),
            amt_credit_sum_limit_min=("AMT_CREDIT_SUM_LIMIT", "min"),
            amt_credit_sum_limit_max=("AMT_CREDIT_SUM_LIMIT", "max"),
            amt_credit_sum_limit_mean=("AMT_CREDIT_SUM_LIMIT", "mean"),
            amt_credit_sum_limit_std=("AMT_CREDIT_SUM_LIMIT", "std"),
            amt_credit_sum_limit_median=("AMT_CREDIT_SUM_LIMIT", "median"),
            amt_credit_sum_limit_range=(
                "AMT_CREDIT_SUM_LIMIT",
                lambda x: x.max() - x.min(),
            ),
            amt_credit_sum_limit_sum=("AMT_CREDIT_SUM_LIMIT", "sum"),
            amt_credit_sum_overdue_min=("AMT_CREDIT_SUM_OVERDUE", "min"),
            amt_credit_sum_overdue_max=("AMT_CREDIT_SUM_OVERDUE", "max"),
            amt_credit_sum_overdue_mean=("AMT_CREDIT_SUM_OVERDUE", "mean"),
            amt_credit_sum_overdue_std=("AMT_CREDIT_SUM_OVERDUE", "std"),
            amt_credit_sum_overdue_median=("AMT_CREDIT_SUM_OVERDUE", "median"),
            amt_credit_sum_overdue_range=(
                "AMT_CREDIT_SUM_OVERDUE",
                lambda x: x.max() - x.min(),
            ),
            amt_credit_sum_overdue_sum=("AMT_CREDIT_SUM_OVERDUE", "sum"),
            mode_credit_type=(
                "CREDIT_TYPE",
                lambda x: x.mode().iloc[0] if not x.empty else None,
            ),
            n_different_credit_types=("CREDIT_TYPE", "nunique"),
            n_consumer_credits=(
                "CREDIT_TYPE",
                lambda x: (x == "Consumer credit").sum(),
            ),
            n_credit_card_credits=("CREDIT_TYPE", lambda x: (x == "Credit card").sum()),
            n_car_loans=("CREDIT_TYPE", lambda x: (x == "Car loan").sum()),
            n_mortgages=("CREDIT_TYPE", lambda x: (x == "Mortgage").sum()),
            n_microloans=("CREDIT_TYPE", lambda x: (x == "Microloan").sum()),
            n_other_type_credit=("CREDIT_TYPE", lambda x: (x == "Other").sum()),
            days_credit_update_min=("DAYS_CREDIT_UPDATE", "min"),
            days_credit_update_max=("DAYS_CREDIT_UPDATE", "max"),
            days_credit_update_mean=("DAYS_CREDIT_UPDATE", "mean"),
            days_credit_update_std=("DAYS_CREDIT_UPDATE", "std"),
            days_credit_update_median=("DAYS_CREDIT_UPDATE", "median"),
            days_credit_update_range=(
                "DAYS_CREDIT_UPDATE",
                lambda x: x.max() - x.min(),
            ),
            amt_annuity_min=("AMT_ANNUITY", "min"),
            amt_annuity_max=("AMT_ANNUITY", "max"),
            amt_annuity_mean=("AMT_ANNUITY", "mean"),
            amt_annuity_std=("AMT_ANNUITY", "std"),
            amt_annuity_median=("AMT_ANNUITY", "median"),
            amt_annuity_range=("AMT_ANNUITY", lambda x: x.max() - x.min()),
        )
        .reset_index()
    )

    bureau_aggregated.to_feather(file)

del file
# Time: 17m 16.8s
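Aggregations passed as Python lambdas are evaluated group by group, which may account for part of the long runtime noted above. A possible alternative, sketched below on toy data, is to build indicator columns before grouping and to derive ranges from the built-in `min` and `max` afterwards. This is a suggestion rather than the code used in the project, and the toy values are made up.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "SK_ID_CURR": [1, 1, 1, 2, 2],
        "CREDIT_ACTIVE": ["Active", "Closed", "Closed", "Active", "Sold"],
        "DAYS_CREDIT": [-500, -200, -100, -50, -900],
    }
)

# Lambda-based version, in the style of the cell above
slow = toy.groupby("SK_ID_CURR").agg(
    n_credits_active=("CREDIT_ACTIVE", lambda x: (x == "Active").sum()),
    days_credit_range=("DAYS_CREDIT", lambda x: x.max() - x.min()),
)

# Built-in aggregations only: indicator column first, range afterwards
fast = (
    toy.assign(is_active=lambda df: df["CREDIT_ACTIVE"].eq("Active"))
    .groupby("SK_ID_CURR")
    .agg(
        n_credits_active=("is_active", "sum"),
        days_credit_min=("DAYS_CREDIT", "min"),
        days_credit_max=("DAYS_CREDIT", "max"),
    )
    .assign(
        days_credit_range=lambda df: df["days_credit_max"] - df["days_credit_min"]
    )
)

print(fast[["n_credits_active", "days_credit_range"]])
```

Both versions give the same counts and ranges on the toy data; any speed-up on the real table would have to be measured.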
Code
bureau_aggregated.shape
(305811, 97)
Code
bureau_aggregated.head()
SK_ID_CURR n_credits_total n_credits_active n_credits_closed n_credits_bad_debt n_credits_sold mode_credit_currency n_different_currencies n_currency_1 n_currency_2 n_currency_3 n_currency_4 days_credit_min days_credit_max days_credit_mean days_credit_std days_credit_median days_credit_range days_credit_overdue_min days_credit_overdue_max days_credit_overdue_mean days_credit_overdue_std days_credit_overdue_median days_credit_overdue_range days_credit_enddate_min days_credit_enddate_max days_credit_enddate_mean days_credit_enddate_std days_credit_enddate_median days_credit_enddate_range days_enddate_fact_min days_enddate_fact_max days_enddate_fact_mean days_enddate_fact_std days_enddate_fact_median days_enddate_fact_range amt_credit_max_overdue_min amt_credit_max_overdue_max amt_credit_max_overdue_mean amt_credit_max_overdue_std amt_credit_max_overdue_median amt_credit_max_overdue_range cnt_credit_prolong_min cnt_credit_prolong_max cnt_credit_prolong_mean cnt_credit_prolong_std cnt_credit_prolong_median cnt_credit_prolong_range cnt_credit_prolong_sum amt_credit_sum_min amt_credit_sum_max amt_credit_sum_mean amt_credit_sum_std amt_credit_sum_median amt_credit_sum_range amt_credit_sum_sum amt_credit_sum_debt_min amt_credit_sum_debt_max amt_credit_sum_debt_mean amt_credit_sum_debt_std amt_credit_sum_debt_median amt_credit_sum_debt_range amt_credit_sum_debt_sum amt_credit_sum_limit_min amt_credit_sum_limit_max amt_credit_sum_limit_mean amt_credit_sum_limit_std amt_credit_sum_limit_median amt_credit_sum_limit_range amt_credit_sum_limit_sum amt_credit_sum_overdue_min amt_credit_sum_overdue_max amt_credit_sum_overdue_mean amt_credit_sum_overdue_std amt_credit_sum_overdue_median amt_credit_sum_overdue_range amt_credit_sum_overdue_sum mode_credit_type n_different_credit_types n_consumer_credits n_credit_card_credits n_car_loans n_mortgages n_microloans n_other_type_credit days_credit_update_min days_credit_update_max days_credit_update_mean days_credit_update_std days_credit_update_median days_credit_update_range amt_annuity_min amt_annuity_max amt_annuity_mean amt_annuity_std amt_annuity_median amt_annuity_range
0 100001 7 3 4 0 0 currency 1 1 7 0 0 0 -1572 -49 -735.00 489.94 -857.00 1523 0 0 0.00 0.00 0.00 0 -1329 1778 82.43 1032.86 -179.00 3107 -1328 -544 -825.50 369.08 -715.00 784 NaN NaN NaN NaN NaN NaN 0 0 0.00 0.00 0.00 0 0 85500.00 378000.00 207623.57 122544.54 168345.00 292500.00 1453365.00 0.00 373239.00 85240.93 137485.63 0.00 373239.00 596686.50 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 Consumer credit 1 7 0 0 0 0 0 -155 -6 -93.14 77.20 -155.00 149 0.00 10822.50 3545.36 4800.61 0.00 10822.50
1 100002 8 2 6 0 0 currency 1 1 8 0 0 0 -1437 -103 -874.00 431.45 -1042.50 1334 0 0 0.00 0.00 0.00 0 -1072 780 -349.00 767.49 -424.50 1852 -1185 -36 -697.50 515.99 -939.00 1149 0.00 5043.65 1681.03 2363.25 40.50 5043.65 0 0 0.00 0.00 0.00 0 0 0.00 450000.00 108131.95 146075.56 54130.50 450000.00 865055.56 0.00 245781.00 49156.20 109916.60 0.00 245781.00 245781.00 0.00 31988.56 7997.14 15994.28 0.00 31988.56 31988.56 0.00 0.00 0.00 0.00 0.00 0.00 0.00 Consumer credit 2 4 4 0 0 0 0 -1185 -7 -499.88 518.52 -402.50 1178 0.00 0.00 0.00 0.00 0.00 0.00
2 100003 4 1 3 0 0 currency 1 1 4 0 0 0 -2586 -606 -1400.75 909.83 -1205.50 1980 0 0 0.00 0.00 0.00 0 -2434 1216 -544.50 1492.77 -480.00 3650 -2131 -540 -1097.33 896.10 -621.00 1591 0.00 0.00 0.00 0.00 0.00 0.00 0 0 0.00 0.00 0.00 0 0 22248.00 810000.00 254350.12 372269.47 92576.25 787752.00 1017400.50 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 810000.00 202500.00 405000.00 0.00 810000.00 810000.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 Consumer credit 2 2 2 0 0 0 0 -2131 -43 -816.00 908.05 -545.00 2088 NaN NaN NaN NaN NaN NaN
3 100004 2 0 2 0 0 currency 1 1 2 0 0 0 -1326 -408 -867.00 649.12 -867.00 918 0 0 0.00 0.00 0.00 0 -595 -382 -488.50 150.61 -488.50 213 -683 -382 -532.50 212.84 -532.50 301 0.00 0.00 0.00 NaN 0.00 0.00 0 0 0.00 0.00 0.00 0 0 94500.00 94537.80 94518.90 26.73 94518.90 37.80 189037.80 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 Consumer credit 1 2 0 0 0 0 0 -682 -382 -532.00 212.13 -532.00 300 NaN NaN NaN NaN NaN NaN
4 100005 3 2 1 0 0 currency 1 1 3 0 0 0 -373 -62 -190.67 162.30 -137.00 311 0 0 0.00 0.00 0.00 0 -128 1324 439.33 776.27 122.00 1452 -123 -123 -123.00 <NA> -123.00 0 0.00 0.00 0.00 NaN 0.00 0.00 0 0 0.00 0.00 0.00 0 0 29826.00 568800.00 219042.00 303238.43 58500.00 538974.00 657126.00 0.00 543087.00 189469.50 306503.34 25321.50 543087.00 568408.50 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 Consumer credit 2 2 1 0 0 0 0 -121 -11 -54.33 58.59 -31.00 110 0.00 4261.50 1420.50 2460.38 0.00 4261.50
Code
an.col_info(bureau_aggregated, style=True)
  column data_type memory_size n_unique p_unique n_missing p_missing n_dominant p_dominant p_dom_excl_na dominant
1 SK_ID_CURR int64 2.4 MB 305,811 100.0% 0 0% 1 <0.1% <0.1% 100001
2 n_credits_total int64 2.4 MB 64 <0.1% 0 0% 41,520 13.6% 13.6% 1
3 n_credits_active int64 2.4 MB 23 <0.1% 0 0% 85,488 28.0% 28.0% 1
4 n_credits_closed int64 2.4 MB 57 <0.1% 0 0% 61,695 20.2% 20.2% 1
5 n_credits_bad_debt int64 2.4 MB 2 <0.1% 0 0% 305,790 >99.9% >99.9% 0
6 n_credits_sold int64 2.4 MB 8 <0.1% 0 0% 299,790 98.0% 98.0% 0
7 mode_credit_currency object 20.5 MB 3 <0.1% 0 0% 305,759 >99.9% >99.9% currency 1
8 n_different_currencies int64 2.4 MB 3 <0.1% 0 0% 304,739 99.6% 99.6% 1
9 n_currency_1 int64 2.4 MB 65 <0.1% 0 0% 41,555 13.6% 13.6% 1
10 n_currency_2 int64 2.4 MB 7 <0.1% 0 0% 304,841 99.7% 99.7% 0
11 n_currency_3 int64 2.4 MB 4 <0.1% 0 0% 305,654 99.9% 99.9% 0
12 n_currency_4 int64 2.4 MB 2 <0.1% 0 0% 305,801 >99.9% >99.9% 0
13 days_credit_min int64 2.4 MB 2,922 1.0% 0 0% 323 0.1% 0.1% -2871
14 days_credit_max int64 2.4 MB 2,923 1.0% 0 0% 751 0.2% 0.2% -91
15 days_credit_mean float64 2.4 MB 69,801 22.8% 0 0% 94 <0.1% <0.1% -441.0
16 days_credit_std float64 2.4 MB 219,744 71.9% 41,520 13.6% 2,219 0.7% 0.8% 0.0
17 days_credit_median float64 2.4 MB 5,774 1.9% 0 0% 195 0.1% 0.1% -911.0
18 days_credit_range int64 2.4 MB 2,917 1.0% 0 0% 43,739 14.3% 14.3% 0
19 days_credit_overdue_min int64 2.4 MB 95 <0.1% 0 0% 305,650 99.9% 99.9% 0
20 days_credit_overdue_max int64 2.4 MB 917 0.3% 0 0% 301,947 98.7% 98.7% 0
21 days_credit_overdue_mean float64 2.4 MB 1,657 0.5% 0 0% 301,947 98.7% 98.7% 0.0
22 days_credit_overdue_std float64 2.4 MB 2,150 0.7% 41,520 13.6% 260,586 85.2% 98.6% 0.0
23 days_credit_overdue_median float64 2.4 MB 227 0.1% 0 0% 305,305 99.8% 99.8% 0.0
24 days_credit_overdue_range int64 2.4 MB 895 0.3% 0 0% 302,106 98.8% 98.8% 0
25 days_credit_enddate_min Int64 2.8 MB 7,154 2.3% 2,585 0.8% 191 0.1% 0.1% -2359
26 days_credit_enddate_max Int64 2.8 MB 13,537 4.4% 2,585 0.8% 279 0.1% 0.1% 31060
27 days_credit_enddate_mean Float64 2.8 MB 108,600 35.5% 2,585 0.8% 70 <0.1% <0.1% -99.0
28 days_credit_enddate_std Float64 2.8 MB 219,344 71.7% 46,899 15.3% 2,242 0.7% 0.9% 0.0
29 days_credit_enddate_median Float64 2.8 MB 15,834 5.2% 2,585 0.8% 181 0.1% 0.1% 0.0
30 days_credit_enddate_range Int64 2.8 MB 19,727 6.5% 2,585 0.8% 46,556 15.2% 15.4% 0
31 days_enddate_fact_min Int64 2.8 MB 2,917 1.0% 37,656 12.3% 191 0.1% 0.1% -2353
32 days_enddate_fact_max Int64 2.8 MB 2,816 0.9% 37,656 12.3% 559 0.2% 0.2% -35
33 days_enddate_fact_mean Float64 2.8 MB 45,759 15.0% 37,656 12.3% 106 <0.1% <0.1% -448.0
34 days_enddate_fact_std Float64 2.8 MB 154,869 50.6% 99,195 32.4% 1,591 0.5% 0.8% 0.0
35 days_enddate_fact_median Float64 2.8 MB 5,425 1.8% 37,656 12.3% 199 0.1% 0.1% -525.0
36 days_enddate_fact_range Int64 2.8 MB 2,818 0.9% 37,656 12.3% 63,130 20.6% 23.5% 0
37 amt_credit_max_overdue_min float64 2.4 MB 15,665 5.1% 92,840 30.4% 192,372 62.9% 90.3% 0.0
38 amt_credit_max_overdue_max float64 2.4 MB 50,443 16.5% 92,840 30.4% 132,669 43.4% 62.3% 0.0
39 amt_credit_max_overdue_mean float64 2.4 MB 62,613 20.5% 92,840 30.4% 132,669 43.4% 62.3% 0.0
40 amt_credit_max_overdue_std float64 2.4 MB 56,744 18.6% 169,242 55.3% 71,930 23.5% 52.7% 0.0
41 amt_credit_max_overdue_median float64 2.4 MB 33,054 10.8% 92,840 30.4% 166,468 54.4% 78.2% 0.0
42 amt_credit_max_overdue_range float64 2.4 MB 42,074 13.8% 92,840 30.4% 148,332 48.5% 69.6% 0.0
43 cnt_credit_prolong_min int64 2.4 MB 7 <0.1% 0 0% 305,499 99.9% 99.9% 0
44 cnt_credit_prolong_max int64 2.4 MB 10 <0.1% 0 0% 297,015 97.1% 97.1% 0
45 cnt_credit_prolong_mean float64 2.4 MB 111 <0.1% 0 0% 297,015 97.1% 97.1% 0.0
46 cnt_credit_prolong_std float64 2.4 MB 262 0.1% 41,520 13.6% 255,807 83.6% 96.8% 0.0
47 cnt_credit_prolong_median float64 2.4 MB 9 <0.1% 0 0% 304,960 99.7% 99.7% 0.0
48 cnt_credit_prolong_range int64 2.4 MB 10 <0.1% 0 0% 297,327 97.2% 97.2% 0
49 cnt_credit_prolong_sum int64 2.4 MB 10 <0.1% 0 0% 297,015 97.1% 97.1% 0
50 amt_credit_sum_min float64 2.4 MB 61,581 20.1% 2 <0.1% 50,710 16.6% 16.6% 0.0
51 amt_credit_sum_max float64 2.4 MB 73,784 24.1% 2 <0.1% 10,288 3.4% 3.4% 450000.0
52 amt_credit_sum_mean float64 2.4 MB 241,361 78.9% 2 <0.1% 1,534 0.5% 0.5% 225000.0
53 amt_credit_sum_std float64 2.4 MB 243,565 79.6% 41,521 13.6% 1,924 0.6% 0.7% 0.0
54 amt_credit_sum_median float64 2.4 MB 114,504 37.4% 2 <0.1% 8,249 2.7% 2.7% 225000.0
55 amt_credit_sum_range float64 2.4 MB 144,282 47.2% 2 <0.1% 43,443 14.2% 14.2% 0.0
56 amt_credit_sum_sum float64 2.4 MB 236,430 77.3% 0 0% 1,513 0.5% 0.5% 225000.0
57 amt_credit_sum_debt_min float64 2.4 MB 31,581 10.3% 8,372 2.7% 259,741 84.9% 87.3% 0.0
58 amt_credit_sum_debt_max float64 2.4 MB 157,148 51.4% 8,372 2.7% 81,812 26.8% 27.5% 0.0
59 amt_credit_sum_debt_mean float64 2.4 MB 195,128 63.8% 8,372 2.7% 80,654 26.4% 27.1% 0.0
60 amt_credit_sum_debt_std float64 2.4 MB 191,562 62.6% 56,661 18.5% 52,345 17.1% 21.0% 0.0
61 amt_credit_sum_debt_median float64 2.4 MB 73,411 24.0% 8,372 2.7% 201,540 65.9% 67.8% 0.0
62 amt_credit_sum_debt_range float64 2.4 MB 150,306 49.1% 8,372 2.7% 100,634 32.9% 33.8% 0.0
63 amt_credit_sum_debt_sum float64 2.4 MB 176,861 57.8% 0 0% 89,026 29.1% 29.1% 0.0
64 amt_credit_sum_limit_min float64 2.4 MB 3,544 1.2% 25,308 8.3% 276,325 90.4% 98.5% 0.0
65 amt_credit_sum_limit_max float64 2.4 MB 39,697 13.0% 25,308 8.3% 224,051 73.3% 79.9% 0.0
66 amt_credit_sum_limit_mean float64 2.4 MB 44,905 14.7% 25,308 8.3% 223,992 73.2% 79.9% 0.0
67 amt_credit_sum_limit_std float64 2.4 MB 43,893 14.4% 84,010 27.5% 168,698 55.2% 76.1% 0.0
68 amt_credit_sum_limit_median float64 2.4 MB 9,814 3.2% 25,308 8.3% 268,175 87.7% 95.6% 0.0
69 amt_credit_sum_limit_range float64 2.4 MB 37,439 12.2% 25,308 8.3% 227,400 74.4% 81.1% 0.0
70 amt_credit_sum_limit_sum float64 2.4 MB 42,987 14.1% 0 0% 249,300 81.5% 81.5% 0.0
71 amt_credit_sum_overdue_min float64 2.4 MB 116 <0.1% 0 0% 305,649 99.9% 99.9% 0.0
72 amt_credit_sum_overdue_max float64 2.4 MB 1,350 0.4% 0 0% 302,010 98.8% 98.8% 0.0
73 amt_credit_sum_overdue_mean float64 2.4 MB 2,081 0.7% 0 0% 302,010 98.8% 98.8% 0.0
74 amt_credit_sum_overdue_std float64 2.4 MB 2,399 0.8% 41,520 13.6% 260,648 85.2% 98.6% 0.0
75 amt_credit_sum_overdue_median float64 2.4 MB 295 0.1% 0 0% 305,321 99.8% 99.8% 0.0
76 amt_credit_sum_overdue_range float64 2.4 MB 1,312 0.4% 0 0% 302,168 98.8% 98.8% 0.0
77 amt_credit_sum_overdue_sum float64 2.4 MB 1,369 0.4% 0 0% 302,010 98.8% 98.8% 0.0
78 mode_credit_type object 21.8 MB 6 <0.1% 0 0% 266,665 87.2% 87.2% Consumer credit
79 n_different_credit_types int64 2.4 MB 5 <0.1% 0 0% 166,664 54.5% 54.5% 2
80 n_consumer_credits int64 2.4 MB 57 <0.1% 0 0% 55,195 18.0% 18.0% 1
81 n_credit_card_credits int64 2.4 MB 22 <0.1% 0 0% 105,846 34.6% 34.6% 0
82 n_car_loans int64 2.4 MB 11 <0.1% 0 0% 283,015 92.5% 92.5% 0
83 n_mortgages int64 2.4 MB 8 <0.1% 0 0% 288,957 94.5% 94.5% 0
84 n_microloans int64 2.4 MB 37 <0.1% 0 0% 301,246 98.5% 98.5% 0
85 n_other_type_credit int64 2.4 MB 10 <0.1% 0 0% 302,296 98.9% 98.9% 0
86 days_credit_update_min int64 2.4 MB 2,971 1.0% 0 0% 925 0.3% 0.3% -18
87 days_credit_update_max int64 2.4 MB 2,694 0.9% 0 0% 12,598 4.1% 4.1% -7
88 days_credit_update_mean float64 2.4 MB 59,481 19.5% 0 0% 872 0.3% 0.3% -14.0
89 days_credit_update_std float64 2.4 MB 215,765 70.6% 41,520 13.6% 2,983 1.0% 1.1% 0.0
90 days_credit_update_median float64 2.4 MB 4,968 1.6% 0 0% 1,841 0.6% 0.6% -18.0
91 days_credit_update_range int64 2.4 MB 2,951 1.0% 0 0% 44,503 14.6% 14.6% 0
92 amt_annuity_min float64 2.4 MB 15,274 5.0% 187,587 61.3% 83,584 27.3% 70.7% 0.0
93 amt_annuity_max float64 2.4 MB 30,558 10.0% 187,587 61.3% 28,057 9.2% 23.7% 0.0
94 amt_annuity_mean float64 2.4 MB 58,097 19.0% 187,587 61.3% 28,057 9.2% 23.7% 0.0
95 amt_annuity_std float64 2.4 MB 58,197 19.0% 213,412 69.8% 25,973 8.5% 28.1% 0.0
96 amt_annuity_median float64 2.4 MB 27,073 8.9% 187,587 61.3% 53,165 17.4% 45.0% 0.0
97 amt_annuity_range float64 2.4 MB 27,081 8.9% 187,587 61.3% 51,798 16.9% 43.8% 0.0
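In the summary above, every `*_std` column has at least 41,520 missing values, which equals the number of clients with exactly one bureau record (`n_credits_total` = 1). This is consistent with pandas computing the sample standard deviation (`ddof=1`) by default, which is undefined for a single observation. A toy illustration with made-up data:

```python
import pandas as pd

toy = pd.DataFrame({"SK_ID_CURR": [1, 1, 2], "DAYS_CREDIT": [-500, -100, -300]})

by_client = toy.groupby("SK_ID_CURR")["DAYS_CREDIT"]

std_sample = by_client.std()  # default ddof=1: NaN for the single-record client 2
std_population = by_client.std(ddof=0)  # population formula: 0.0 for client 2
print(pd.DataFrame({"ddof=1": std_sample, "ddof=0": std_population}))
```

Whether such NaN values should be kept or replaced (for example, with 0) is a modeling decision that this sketch does not settle.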

5.2 Table bureau_balance

Code
bureau_balance.head()
SK_ID_BUREAU MONTHS_BALANCE STATUS
0 5715448 0 C
1 5715448 -1 C
2 5715448 -2 C
3 5715448 -3 C
4 5715448 -4 C

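The following steps are applied to table bureau_balance:

  1. Add the client identifier SK_ID_CURR (taken from table bureau).
  2. Remove rows with irrelevant SK_ID_BUREAU values (those not present in table bureau).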
Code
bureau_balance_relevant = pd.merge(
    bureau[["SK_ID_CURR", "SK_ID_BUREAU"]].drop_duplicates(),
    bureau_balance,
    on="SK_ID_BUREAU",
    how="inner",
)
Code
n_bureau_total = bureau["SK_ID_BUREAU"].nunique()
n_bureau_only = len(set(bureau["SK_ID_BUREAU"]) - set(bureau_balance["SK_ID_BUREAU"]))
n_balance_total = bureau_balance["SK_ID_BUREAU"].nunique()
n_balance_only = len(set(bureau_balance["SK_ID_BUREAU"]) - set(bureau["SK_ID_BUREAU"]))
n_common = len(
    set(bureau_balance["SK_ID_BUREAU"]).intersection(set(bureau["SK_ID_BUREAU"]))
)
Code
print(
    "Number of unique SK_ID_BUREAU values:\n",
    f"{n_bureau_total:8.0f} - in `bureau` table (total);\n",
    f"{n_bureau_only:8.0f} - in `bureau` but not in `bureau_balance`;\n",
    f"{n_balance_total:8.0f} - in `bureau_balance` table (total);\n",
    f"{n_balance_only:8.0f} - in `bureau_balance` but not in `bureau`;\n",
    f"{n_common:8.0f} - common in `bureau` and `bureau_balance`.\n",
)
Number of unique SK_ID_BUREAU values:
  1716428 - in `bureau` table (total);
   942074 - in `bureau` but not in `bureau_balance`;
   817395 - in `bureau_balance` table (total);
    43041 - in `bureau_balance` but not in `bureau`;
   774354 - common in `bureau` and `bureau_balance`.
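Overlap counts of this kind can also be obtained in one step with an outer merge and `indicator=True`. The sketch below uses toy identifiers rather than the real tables; the frame names `bureau_ids` and `balance_ids` are illustrative.

```python
import pandas as pd

bureau_ids = pd.DataFrame({"SK_ID_BUREAU": [1, 2, 3, 4]})
balance_ids = pd.DataFrame({"SK_ID_BUREAU": [3, 4, 4, 5]})

overlap = pd.merge(
    bureau_ids.drop_duplicates(),
    balance_ids.drop_duplicates(),
    on="SK_ID_BUREAU",
    how="outer",
    indicator=True,  # adds a `_merge` column: left_only / right_only / both
)
counts = overlap["_merge"].value_counts()
print(counts)
```

Here `left_only` corresponds to identifiers found only in the first table, `right_only` to those found only in the second, and `both` to the common ones.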
Code
print("In `application`:")
print(application[["SK_ID_CURR"]].nunique())

print("\nIn `bureau`:")
print(bureau[["SK_ID_CURR", "SK_ID_BUREAU"]].nunique())

print("\nIn `bureau_balance_relevant`:")
print(bureau_balance_relevant[["SK_ID_CURR"]].nunique())
In `application`:
SK_ID_CURR    307511
dtype: int64

In `bureau`:
SK_ID_CURR       305811
SK_ID_BUREAU    1716428
dtype: int64

In `bureau_balance_relevant`:
SK_ID_CURR    134542
dtype: int64
Code
file = dir_interim + "aggregated--bureau_balance_aggregated.feather"

if os.path.exists(file):
    bureau_balance_aggregated = pd.read_feather(file)
else:
    bureau_balance_aggregated_1 = (
        bureau_balance_relevant.groupby("SK_ID_CURR")
        .agg(
            bureau_months_balance_min=("MONTHS_BALANCE", "min"),
            bureau_months_balance_max=("MONTHS_BALANCE", "max"),
        )
        .reset_index()
    )

    bureau_balance_aggregated_2 = (
        bureau_balance_relevant
        # Remove non-numeric status C and X
        .query("STATUS not in ['C', 'X']")
        # drop unused categories and convert STATUS to numeric
        .assign(STATUS=lambda df: df["STATUS"].astype(str).astype(int))
        .groupby("SK_ID_CURR")
        .agg(
            bureau_dpd_status_min=("STATUS", "min"),
            bureau_dpd_status_max=("STATUS", "max"),
            bureau_dpd_status_mean=("STATUS", "mean"),
            bureau_dpd_status_std=("STATUS", "std"),
            bureau_dpd_status_median=("STATUS", "median"),
            bureau_dpd_status_range=("STATUS", lambda x: x.max() - x.min()),
        )
        .reset_index()
    )

    # merge bureau_balance_aggregated_1 and bureau_balance_aggregated_2
    bureau_balance_aggregated = pd.merge(
        bureau_balance_aggregated_1,
        bureau_balance_aggregated_2,
        on="SK_ID_CURR",
        how="inner",
    )

    bureau_balance_aggregated.to_feather(file)

del file
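Because rows with status `C` or `X` are removed before the second aggregation, clients whose monthly records contain only these statuses are absent from `bureau_balance_aggregated_2`, and the inner merge then drops them entirely. This probably explains why the aggregated table contains fewer clients (130,773) than `bureau_balance_relevant` (134,542). A toy sketch of the effect, and of how a left merge would keep such clients with missing DPD statistics instead (made-up data, shortened column names):

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "SK_ID_CURR": [1, 1, 2, 2],
        "MONTHS_BALANCE": [0, -1, 0, -1],
        "STATUS": ["0", "1", "C", "X"],  # client 2 has only non-numeric statuses
    }
)

months = (
    toy.groupby("SK_ID_CURR")
    .agg(months_min=("MONTHS_BALANCE", "min"))
    .reset_index()
)

dpd = (
    toy.query("STATUS not in ['C', 'X']")
    .assign(STATUS=lambda df: df["STATUS"].astype(int))
    .groupby("SK_ID_CURR")
    .agg(dpd_max=("STATUS", "max"))
    .reset_index()
)

inner = pd.merge(months, dpd, on="SK_ID_CURR", how="inner")  # client 2 is lost
left = pd.merge(months, dpd, on="SK_ID_CURR", how="left")  # client 2 kept, dpd_max is NaN
print(inner, left, sep="\n\n")
```

Which variant is preferable depends on whether the absence of numeric DPD statuses should itself be visible to the model.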
Code
bureau_balance_aggregated.shape
(130773, 9)
Code
bureau_balance_aggregated.head()
SK_ID_CURR bureau_months_balance_min bureau_months_balance_max bureau_dpd_status_min bureau_dpd_status_max bureau_dpd_status_mean bureau_dpd_status_std bureau_dpd_status_median bureau_dpd_status_range
0 100001 -51 0 0 1 0.03 0.18 0.00 1
1 100002 -47 0 0 1 0.38 0.49 0.00 1
2 100005 -12 0 0 0 0.00 0.00 0.00 0
3 100010 -90 -2 0 0 0.00 0.00 0.00 0
4 100013 -68 0 0 1 0.08 0.28 0.00 1
Code
an.col_info(bureau_balance_aggregated, style=True)
  column data_type memory_size n_unique p_unique n_missing p_missing n_dominant p_dominant p_dom_excl_na dominant
1 SK_ID_CURR int32 523.1 kB 130,773 100.0% 0 0% 1 <0.1% <0.1% 100001
2 bureau_months_balance_min int8 130.8 kB 97 0.1% 0 0% 3,241 2.5% 2.5% -95
3 bureau_months_balance_max int8 130.8 kB 91 0.1% 0 0% 126,223 96.5% 96.5% 0
4 bureau_dpd_status_min int32 523.1 kB 6 <0.1% 0 0% 130,695 99.9% 99.9% 0
5 bureau_dpd_status_max int32 523.1 kB 6 <0.1% 0 0% 82,251 62.9% 62.9% 0
6 bureau_dpd_status_mean float64 1.0 MB 5,801 4.4% 0 0% 82,251 62.9% 62.9% 0.0
7 bureau_dpd_status_std float64 1.0 MB 16,075 12.3% 1,188 0.9% 81,108 62.0% 62.6% 0.0
8 bureau_dpd_status_median float64 1.0 MB 11 <0.1% 0 0% 129,117 98.7% 98.7% 0.0
9 bureau_dpd_status_range int32 523.1 kB 6 <0.1% 0 0% 82,296 62.9% 62.9% 0

5.3 Table previous_application

Code
previous_application.head()
SK_ID_PREV SK_ID_CURR NAME_CONTRACT_TYPE AMT_ANNUITY AMT_APPLICATION AMT_CREDIT AMT_DOWN_PAYMENT AMT_GOODS_PRICE WEEKDAY_APPR_PROCESS_START HOUR_APPR_PROCESS_START FLAG_LAST_APPL_PER_CONTRACT NFLAG_LAST_APPL_IN_DAY RATE_DOWN_PAYMENT RATE_INTEREST_PRIMARY RATE_INTEREST_PRIVILEGED NAME_CASH_LOAN_PURPOSE NAME_CONTRACT_STATUS DAYS_DECISION NAME_PAYMENT_TYPE CODE_REJECT_REASON NAME_TYPE_SUITE NAME_CLIENT_TYPE NAME_GOODS_CATEGORY NAME_PORTFOLIO NAME_PRODUCT_TYPE CHANNEL_TYPE SELLERPLACE_AREA NAME_SELLER_INDUSTRY CNT_PAYMENT NAME_YIELD_GROUP PRODUCT_COMBINATION DAYS_FIRST_DRAWING DAYS_FIRST_DUE DAYS_LAST_DUE_1ST_VERSION DAYS_LAST_DUE DAYS_TERMINATION NFLAG_INSURED_ON_APPROVAL
0 2030495 271877 Consumer loans 1730.43 17145.00 17145.00 0.00 17145.00 SATURDAY 15 Y 1 0.00 0.18 0.87 XAP Approved -73 Cash through the bank XAP NaN Repeater Mobile POS XNA Country-wide 35 Connectivity 12.00 middle POS mobile with interest 365243.00 -42.00 300.00 -42.00 -37.00 0.00
1 2802425 108129 Cash loans 25188.62 607500.00 679671.00 NaN 607500.00 THURSDAY 11 Y 1 NaN NaN NaN XNA Approved -164 XNA XAP Unaccompanied Repeater XNA Cash x-sell Contact center -1 XNA 36.00 low_action Cash X-Sell: low 365243.00 -134.00 916.00 365243.00 365243.00 1.00
2 2523466 122040 Cash loans 15060.74 112500.00 136444.50 NaN 112500.00 TUESDAY 11 Y 1 NaN NaN NaN XNA Approved -301 Cash through the bank XAP Spouse, partner Repeater XNA Cash x-sell Credit and cash offices -1 XNA 12.00 high Cash X-Sell: high 365243.00 -271.00 59.00 365243.00 365243.00 1.00
3 2819243 176158 Cash loans 47041.33 450000.00 470790.00 NaN 450000.00 MONDAY 7 Y 1 NaN NaN NaN XNA Approved -512 Cash through the bank XAP NaN Repeater XNA Cash x-sell Credit and cash offices -1 XNA 12.00 middle Cash X-Sell: middle 365243.00 -482.00 -152.00 -182.00 -177.00 1.00
4 1784265 202054 Cash loans 31924.40 337500.00 404055.00 NaN 337500.00 THURSDAY 9 Y 1 NaN NaN NaN Repairs Refused -781 Cash through the bank HC NaN Repeater XNA Cash walk-in Credit and cash offices -1 XNA 24.00 high Cash Street: high NaN NaN NaN NaN NaN NaN
Code
previous_application[["SK_ID_PREV", "SK_ID_CURR"]].nunique()
SK_ID_PREV    1670214
SK_ID_CURR     338857
dtype: int64
Code
file = dir_interim + "aggregated--previous_application_aggregated.feather"

if os.path.exists(file):
    previous_application_aggregated = pd.read_feather(file)

else:
    previous_application_aggregated = (
        previous_application.groupby("SK_ID_CURR")
        .agg(
            n_different_loans=("NAME_CONTRACT_TYPE", "nunique"),
            n_cash_loans=("NAME_CONTRACT_TYPE", lambda x: (x == "Cash loans").sum()),
            n_consumer_loans=(
                "NAME_CONTRACT_TYPE",
                lambda x: (x == "Consumer loans").sum(),
            ),
            n_revolving_loans=(
                "NAME_CONTRACT_TYPE",
                lambda x: (x == "Revolving loans").sum(),
            ),
            amt_annuity_min=("AMT_ANNUITY", "min"),
            amt_annuity_max=("AMT_ANNUITY", "max"),
            amt_annuity_mean=("AMT_ANNUITY", "mean"),
            amt_annuity_std=("AMT_ANNUITY", "std"),
            amt_annuity_median=("AMT_ANNUITY", "median"),
            amt_annuity_range=("AMT_ANNUITY", lambda x: x.max() - x.min()),
            amt_application_min=("AMT_APPLICATION", "min"),
            amt_application_max=("AMT_APPLICATION", "max"),
            amt_application_mean=("AMT_APPLICATION", "mean"),
            amt_application_std=("AMT_APPLICATION", "std"),
            amt_application_median=("AMT_APPLICATION", "median"),
            amt_application_range=("AMT_APPLICATION", lambda x: x.max() - x.min()),
            amt_credit_min=("AMT_CREDIT", "min"),
            amt_credit_max=("AMT_CREDIT", "max"),
            amt_credit_mean=("AMT_CREDIT", "mean"),
            amt_credit_std=("AMT_CREDIT", "std"),
            amt_credit_median=("AMT_CREDIT", "median"),
            amt_credit_range=("AMT_CREDIT", lambda x: x.max() - x.min()),
            amt_down_payment_min=("AMT_DOWN_PAYMENT", "min"),
            amt_down_payment_max=("AMT_DOWN_PAYMENT", "max"),
            amt_down_payment_mean=("AMT_DOWN_PAYMENT", "mean"),
            amt_down_payment_std=("AMT_DOWN_PAYMENT", "std"),
            amt_down_payment_median=("AMT_DOWN_PAYMENT", "median"),
            amt_down_payment_range=("AMT_DOWN_PAYMENT", lambda x: x.max() - x.min()),
            amt_goods_price_min=("AMT_GOODS_PRICE", "min"),
            amt_goods_price_max=("AMT_GOODS_PRICE", "max"),
            amt_goods_price_mean=("AMT_GOODS_PRICE", "mean"),
            amt_goods_price_std=("AMT_GOODS_PRICE", "std"),
            amt_goods_price_median=("AMT_GOODS_PRICE", "median"),
            amt_goods_price_range=("AMT_GOODS_PRICE", lambda x: x.max() - x.min()),
            rate_down_payment_min=("RATE_DOWN_PAYMENT", "min"),
            rate_down_payment_max=("RATE_DOWN_PAYMENT", "max"),
            rate_down_payment_mean=("RATE_DOWN_PAYMENT", "mean"),
            rate_down_payment_std=("RATE_DOWN_PAYMENT", "std"),
            rate_down_payment_median=("RATE_DOWN_PAYMENT", "median"),
            rate_down_payment_range=("RATE_DOWN_PAYMENT", lambda x: x.max() - x.min()),
            # Many missing values
            rate_interest_primary_min=("RATE_INTEREST_PRIMARY", "min"),
            rate_interest_primary_max=("RATE_INTEREST_PRIMARY", "max"),
            rate_interest_primary_mean=("RATE_INTEREST_PRIMARY", "mean"),
            rate_interest_primary_std=("RATE_INTEREST_PRIMARY", "std"),
            rate_interest_primary_median=("RATE_INTEREST_PRIMARY", "median"),
            rate_interest_primary_range=(
                "RATE_INTEREST_PRIMARY",
                lambda x: x.max() - x.min(),
            ),
            rate_interest_primary_count=("RATE_INTEREST_PRIMARY", "count"),
            # Many missing values
            rate_interest_privileged_min=("RATE_INTEREST_PRIVILEGED", "min"),
            rate_interest_privileged_max=("RATE_INTEREST_PRIVILEGED", "max"),
            rate_interest_privileged_mean=("RATE_INTEREST_PRIVILEGED", "mean"),
            rate_interest_privileged_std=("RATE_INTEREST_PRIVILEGED", "std"),
            rate_interest_privileged_median=("RATE_INTEREST_PRIVILEGED", "median"),
            rate_interest_privileged_range=(
                "RATE_INTEREST_PRIVILEGED",
                lambda x: x.max() - x.min(),
            ),
            rate_interest_privileged_count=("RATE_INTEREST_PRIVILEGED", "count"),
            # NOTE: same definition as n_different_loans above
            n_different_contract_types=("NAME_CONTRACT_TYPE", "nunique"),
            n_contract_status_approved=(
                "NAME_CONTRACT_STATUS",
                lambda x: (x == "Approved").sum(),
            ),
            n_contract_status_canceled=(
                "NAME_CONTRACT_STATUS",
                lambda x: (x == "Canceled").sum(),
            ),
            n_contract_status_refused=(
                "NAME_CONTRACT_STATUS",
                lambda x: (x == "Refused").sum(),
            ),
            n_contract_status_unused_offer=(
                "NAME_CONTRACT_STATUS",
                lambda x: (x == "Unused offer").sum(),
            ),
            days_decision_min=("DAYS_DECISION", "min"),
            days_decision_max=("DAYS_DECISION", "max"),
            days_decision_mean=("DAYS_DECISION", "mean"),
            days_decision_std=("DAYS_DECISION", "std"),
            days_decision_median=("DAYS_DECISION", "median"),
            days_decision_range=("DAYS_DECISION", lambda x: x.max() - x.min()),
            n_payment_type_cash_through_bank=(
                "NAME_PAYMENT_TYPE",
                lambda x: (x == "Cash through the bank").sum(),
            ),
            n_payment_type_cash_from_account=(
                "NAME_PAYMENT_TYPE",
                lambda x: (x == "Non-cash from your account").sum(),
            ),
            n_payment_type_not_available=(
                "NAME_PAYMENT_TYPE",
                lambda x: (x == "XNA").sum(),
            ),
            n_reject_reason_not_applicable=(
                "CODE_REJECT_REASON",
                lambda x: (x == "XAP").sum(),
            ),
            n_reject_reason_hc=("CODE_REJECT_REASON", lambda x: (x == "HC").sum()),
            n_reject_reason_limit=(
                "CODE_REJECT_REASON",
                lambda x: (x == "LIMIT").sum(),
            ),
            n_reject_reason_scoc=("CODE_REJECT_REASON", lambda x: (x == "SCO").sum()),
            n_reject_reason_client=(
                "CODE_REJECT_REASON",
                lambda x: (x == "CLIENT").sum(),
            ),
            n_reject_reason_scofr=(
                "CODE_REJECT_REASON",
                lambda x: (x == "SCOFR").sum(),
            ),
            n_client_type_new=("NAME_CLIENT_TYPE", lambda x: (x == "New").sum()),
            n_client_type_repeater=(
                "NAME_CLIENT_TYPE",
                lambda x: (x == "Repeater").sum(),
            ),
            n_client_type_refreshed=(
                "NAME_CLIENT_TYPE",
                lambda x: (x == "Refreshed").sum(),
            ),
            n_portfolio_pos=("NAME_PORTFOLIO", lambda x: (x == "POS").sum()),
            n_portfolio_cash=("NAME_PORTFOLIO", lambda x: (x == "Cash").sum()),
            n_portfolio_cards=("NAME_PORTFOLIO", lambda x: (x == "Cards").sum()),
            n_product_type_xsell=("NAME_PRODUCT_TYPE", lambda x: (x == "x-sell").sum()),
            n_product_type_walk_in=(
                "NAME_PRODUCT_TYPE",
                lambda x: (x == "walk-in").sum(),
            ),
            n_different_channels=("CHANNEL_TYPE", "nunique"),
            n_channel_type_credit_and_cash=(
                "CHANNEL_TYPE",
                lambda x: (x == "Credit and cash offices").sum(),
            ),
            n_channel_type_countrywide=(
                "CHANNEL_TYPE",
                lambda x: (x == "Country-wide").sum(),
            ),
            n_channel_type_stone=("CHANNEL_TYPE", lambda x: (x == "Stone").sum()),
            n_channel_type_regional_and_local=(
                "CHANNEL_TYPE",
                lambda x: (x == "Regional / Local").sum(),
            ),
            n_channel_type_contact_center=(
                "CHANNEL_TYPE",
                lambda x: (x == "Contact center").sum(),
            ),
            n_channel_type_ap_minus=(
                "CHANNEL_TYPE",
                lambda x: (x == "AP+ (Cash loan)").sum(),
            ),
            n_channel_type_channel_corporate_sales=(
                "CHANNEL_TYPE",
                lambda x: (x == "Channel of corporate sales").sum(),
            ),
            n_channel_type_car_dealer=(
                "CHANNEL_TYPE",
                lambda x: (x == "Car dealer").sum(),
            ),
            n_cnt_payment_0=("CNT_PAYMENT", lambda x: (x == 0).sum()),
            cnt_payment_min=("CNT_PAYMENT", "min"),
            cnt_payment_max=("CNT_PAYMENT", "max"),
            cnt_payment_mean=("CNT_PAYMENT", "mean"),
            cnt_payment_std=("CNT_PAYMENT", "std"),
            cnt_payment_median=("CNT_PAYMENT", "median"),
            cnt_payment_range=("CNT_PAYMENT", lambda x: x.max() - x.min()),
            n_yield_group_low_action=(
                "NAME_YIELD_GROUP",
                lambda x: (x == "low_action").sum(),
            ),
            n_yield_group_low_normal=(
                "NAME_YIELD_GROUP",
                lambda x: (x == "low_normal").sum(),
            ),
            n_yield_group_middle=("NAME_YIELD_GROUP", lambda x: (x == "middle").sum()),
            n_yield_group_high=("NAME_YIELD_GROUP", lambda x: (x == "high").sum()),
            days_first_draw_min=("DAYS_FIRST_DRAWING", "min"),
            days_first_draw_max=("DAYS_FIRST_DRAWING", "max"),
            days_first_draw_mean=("DAYS_FIRST_DRAWING", "mean"),
            days_first_draw_std=("DAYS_FIRST_DRAWING", "std"),
            days_first_draw_median=("DAYS_FIRST_DRAWING", "median"),
            days_first_draw_range=("DAYS_FIRST_DRAWING", lambda x: x.max() - x.min()),
            days_last_due_1st_version_min=("DAYS_LAST_DUE_1ST_VERSION", "min"),
            days_last_due_1st_version_max=("DAYS_LAST_DUE_1ST_VERSION", "max"),
            days_last_due_1st_version_mean=("DAYS_LAST_DUE_1ST_VERSION", "mean"),
            days_last_due_1st_version_std=("DAYS_LAST_DUE_1ST_VERSION", "std"),
            days_last_due_1st_version_median=("DAYS_LAST_DUE_1ST_VERSION", "median"),
            days_last_due_1st_version_range=(
                "DAYS_LAST_DUE_1ST_VERSION",
                lambda x: x.max() - x.min(),
            ),
            days_last_due_min=("DAYS_LAST_DUE", "min"),
            days_last_due_max=("DAYS_LAST_DUE", "max"),
            days_last_due_mean=("DAYS_LAST_DUE", "mean"),
            days_last_due_std=("DAYS_LAST_DUE", "std"),
            days_last_due_median=("DAYS_LAST_DUE", "median"),
            days_last_due_range=("DAYS_LAST_DUE", lambda x: x.max() - x.min()),
            days_termination_min=("DAYS_TERMINATION", "min"),
            days_termination_max=("DAYS_TERMINATION", "max"),
            days_termination_mean=("DAYS_TERMINATION", "mean"),
            days_termination_std=("DAYS_TERMINATION", "std"),
            days_termination_median=("DAYS_TERMINATION", "median"),
            days_termination_range=("DAYS_TERMINATION", lambda x: x.max() - x.min()),
            n_nflag_insured_on_approval_sum=("NFLAG_INSURED_ON_APPROVAL", "sum"),
            n_nflag_insured_on_approval_mean=("NFLAG_INSURED_ON_APPROVAL", "mean"),
            # Output: boolean (True/False)
            n_nflag_insured_on_approval_any=(
                "NFLAG_INSURED_ON_APPROVAL",
                lambda x: x.any(),
            ),
        )
        .reset_index()
    )

    previous_application_aggregated.to_feather(file)

del file
# Time: 37m 56.3s
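The lambda-based counts in the cell above are evaluated in Python once per group, which probably contributes to the long run time. The following is a minimal sketch of a possible alternative, not the method used in this project: categories are counted with `pd.crosstab` and numeric summaries use only built-in reducers. The toy data and the names `toy`, `status_counts`, `annuity_stats` and `toy_aggregated` are hypothetical.

```python
import pandas as pd

# Toy stand-in for `previous_application` (values are made up for illustration)
toy = pd.DataFrame(
    {
        "SK_ID_CURR": [1, 1, 2, 2, 2],
        "NAME_CONTRACT_STATUS": [
            "Approved",
            "Refused",
            "Approved",
            "Approved",
            "Canceled",
        ],
        "AMT_ANNUITY": [100.0, 300.0, 50.0, 70.0, None],
    }
)

# Count each category per client without per-group lambdas
status_counts = (
    pd.crosstab(toy["SK_ID_CURR"], toy["NAME_CONTRACT_STATUS"])
    .add_prefix("n_contract_status_")
    .rename(columns=str.lower)
    .rename_axis(columns=None)
)

# Numeric summaries with built-in reducers; the range is derived afterwards
annuity_stats = toy.groupby("SK_ID_CURR")["AMT_ANNUITY"].agg(["min", "max", "mean"])
annuity_stats["range"] = annuity_stats["max"] - annuity_stats["min"]
annuity_stats = annuity_stats.add_prefix("amt_annuity_")

# One row per client, as in the aggregated table
toy_aggregated = status_counts.join(annuity_stats).reset_index()
print(toy_aggregated)
```

Note that `pd.crosstab` creates a column only for categories that actually occur, so a fixed set of columns may require `reindex(columns=..., fill_value=0)` afterwards.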
Code
previous_application_aggregated.shape
(338857, 130)
Code
previous_application_aggregated.head()
SK_ID_CURR n_different_loans n_cash_loans n_consumer_loans n_revolving_loans amt_annuity_min amt_annuity_max amt_annuity_mean amt_annuity_std amt_annuity_median amt_annuity_range amt_application_min amt_application_max amt_application_mean amt_application_std amt_application_median amt_application_range amt_credit_min amt_credit_max amt_credit_mean amt_credit_std amt_credit_median amt_credit_range amt_down_payment_min amt_down_payment_max amt_down_payment_mean amt_down_payment_std amt_down_payment_median amt_down_payment_range amt_goods_price_min amt_goods_price_max amt_goods_price_mean amt_goods_price_std amt_goods_price_median amt_goods_price_range rate_down_payment_min rate_down_payment_max rate_down_payment_mean rate_down_payment_std rate_down_payment_median rate_down_payment_range rate_interest_primary_min rate_interest_primary_max rate_interest_primary_mean rate_interest_primary_std rate_interest_primary_median rate_interest_primary_range rate_interest_primary_count rate_interest_privileged_min rate_interest_privileged_max rate_interest_privileged_mean rate_interest_privileged_std rate_interest_privileged_median rate_interest_privileged_range rate_interest_privileged_count n_different_contract_types n_contract_status_approved n_contract_status_canceled n_contract_status_refused n_contract_status_unused_offer days_decision_min days_decision_max days_decision_mean days_decision_std days_decision_median days_decision_range n_payment_type_cash_through_bank n_payment_type_cash_from_account n_payment_type_not_available n_reject_reason_not_applicable n_reject_reason_hc n_reject_reason_limit n_reject_reason_scoc n_reject_reason_client n_reject_reason_scofr n_client_type_new n_client_type_repeater n_client_type_refreshed n_portfolio_pos n_portfolio_cash n_portfolio_cards n_product_type_xsell n_product_type_walk_in n_different_channels n_channel_type_credit_and_cash n_channel_type_countrywide n_channel_type_stone n_channel_type_regional_and_local n_channel_type_contact_center n_channel_type_ap_minus n_channel_type_channel_corporate_sales n_channel_type_car_dealer n_cnt_payment_0 cnt_payment_min cnt_payment_max cnt_payment_mean cnt_payment_std cnt_payment_median cnt_payment_range n_yield_group_low_action n_yield_group_low_normal n_yield_group_middle n_yield_group_high days_first_draw_min days_first_draw_max days_first_draw_mean days_first_draw_std days_first_draw_median days_first_draw_range days_last_due_1st_version_min days_last_due_1st_version_max days_last_due_1st_version_mean days_last_due_1st_version_std days_last_due_1st_version_median days_last_due_1st_version_range days_last_due_min days_last_due_max days_last_due_mean days_last_due_std days_last_due_median days_last_due_range days_termination_min days_termination_max days_termination_mean days_termination_std days_termination_median days_termination_range n_nflag_insured_on_approval_sum n_nflag_insured_on_approval_mean n_nflag_insured_on_approval_any
0 100001 1 0 1 0 3951.00 3951.00 3951.00 NaN 3951.00 0.00 24835.50 24835.50 24835.50 NaN 24835.50 0.00 23787.00 23787.00 23787.00 NaN 23787.00 0.00 2520.00 2520.00 2520.00 NaN 2520.00 0.00 24835.50 24835.50 24835.50 NaN 24835.50 0.00 0.10 0.10 0.10 NaN 0.10 0.00 NaN NaN NaN NaN NaN NaN 0 NaN NaN NaN NaN NaN NaN 0 1 1 0 0 0 -1740 -1740 -1740.00 NaN -1740.00 0 1 0 0 1 0 0 0 0 0 0 0 1 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 8.00 8.00 8.00 NaN 8.00 0.00 0 0 0 1 365243.00 365243.00 365243.00 NaN 365243.00 0.00 -1499.00 -1499.00 -1499.00 NaN -1499.00 0.00 -1619.00 -1619.00 -1619.00 NaN -1619.00 0.00 -1612.00 -1612.00 -1612.00 NaN -1612.00 0.00 0.00 0.00 False
1 100002 1 0 1 0 9251.77 9251.77 9251.77 NaN 9251.77 0.00 179055.00 179055.00 179055.00 NaN 179055.00 0.00 179055.00 179055.00 179055.00 NaN 179055.00 0.00 0.00 0.00 0.00 NaN 0.00 0.00 179055.00 179055.00 179055.00 NaN 179055.00 0.00 0.00 0.00 0.00 NaN 0.00 0.00 NaN NaN NaN NaN NaN NaN 0 NaN NaN NaN NaN NaN NaN 0 1 1 0 0 0 -606 -606 -606.00 NaN -606.00 0 0 0 1 1 0 0 0 0 0 1 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0 0 24.00 24.00 24.00 NaN 24.00 0.00 0 1 0 0 365243.00 365243.00 365243.00 NaN 365243.00 0.00 125.00 125.00 125.00 NaN 125.00 0.00 -25.00 -25.00 -25.00 NaN -25.00 0.00 -17.00 -17.00 -17.00 NaN -17.00 0.00 0.00 0.00 False
2 100003 2 1 2 0 6737.31 98356.99 56553.99 46332.56 64567.67 91619.68 68809.50 900000.00 435436.50 424161.62 337500.00 831190.50 68053.50 1035882.00 484191.00 497949.86 348637.50 967828.50 0.00 6885.00 3442.50 4868.43 3442.50 6885.00 68809.50 900000.00 435436.50 424161.62 337500.00 831190.50 0.00 0.10 0.05 0.07 0.05 0.10 NaN NaN NaN NaN NaN NaN 0 NaN NaN NaN NaN NaN NaN 0 2 3 0 0 0 -2341 -746 -1305.00 898.14 -828.00 1595 2 0 1 3 0 0 0 0 0 0 1 2 2 1 0 1 0 3 1 1 1 0 0 0 0 0 0 6.00 12.00 10.00 3.46 12.00 6.00 0 1 2 0 365243.00 365243.00 365243.00 0.00 365243.00 0.00 -1980.00 -386.00 -1004.33 854.97 -647.00 1594.00 -1980.00 -536.00 -1054.33 803.57 -647.00 1444.00 -1976.00 -527.00 -1047.33 806.20 -639.00 1449.00 2.00 0.67 True
3 100004 1 0 1 0 5357.25 5357.25 5357.25 NaN 5357.25 0.00 24282.00 24282.00 24282.00 NaN 24282.00 0.00 20106.00 20106.00 20106.00 NaN 20106.00 0.00 4860.00 4860.00 4860.00 NaN 4860.00 0.00 24282.00 24282.00 24282.00 NaN 24282.00 0.00 0.21 0.21 0.21 NaN 0.21 0.00 NaN NaN NaN NaN NaN NaN 0 NaN NaN NaN NaN NaN NaN 0 1 1 0 0 0 -815 -815 -815.00 NaN -815.00 0 1 0 0 1 0 0 0 0 0 1 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 0 0 4.00 4.00 4.00 NaN 4.00 0.00 0 0 1 0 365243.00 365243.00 365243.00 NaN 365243.00 0.00 -694.00 -694.00 -694.00 NaN -694.00 0.00 -724.00 -724.00 -724.00 NaN -724.00 0.00 -714.00 -714.00 -714.00 NaN -714.00 0.00 0.00 0.00 False
4 100005 2 1 1 0 4813.20 4813.20 4813.20 NaN 4813.20 0.00 0.00 44617.50 22308.75 31549.34 22308.75 44617.50 0.00 40153.50 20076.75 28392.81 20076.75 40153.50 4464.00 4464.00 4464.00 NaN 4464.00 0.00 44617.50 44617.50 44617.50 NaN 44617.50 0.00 0.11 0.11 0.11 NaN 0.11 0.00 NaN NaN NaN NaN NaN NaN 0 NaN NaN NaN NaN NaN NaN 0 2 1 1 0 0 -757 -315 -536.00 312.54 -536.00 442 1 0 1 2 0 0 0 0 0 1 1 0 1 0 0 0 0 2 1 1 0 0 0 0 0 0 0 12.00 12.00 12.00 NaN 12.00 0.00 0 0 0 1 365243.00 365243.00 365243.00 NaN 365243.00 0.00 -376.00 -376.00 -376.00 NaN -376.00 0.00 -466.00 -466.00 -466.00 NaN -466.00 0.00 -460.00 -460.00 -460.00 NaN -460.00 0.00 0.00 0.00 False
Code
an.col_info(previous_application_aggregated, style=True)
  column data_type memory_size n_unique p_unique n_missing p_missing n_dominant p_dominant p_dom_excl_na dominant
1 SK_ID_CURR int32 1.4 MB 338,857 100.0% 0 0% 1 <0.1% <0.1% 100001
2 n_different_loans int64 2.7 MB 4 <0.1% 0 0% 129,371 38.2% 38.2% 2
3 n_cash_loans int64 2.7 MB 60 <0.1% 0 0% 138,032 40.7% 40.7% 0
4 n_consumer_loans int64 2.7 MB 37 <0.1% 0 0% 130,690 38.6% 38.6% 1
5 n_revolving_loans int64 2.7 MB 28 <0.1% 0 0% 215,412 63.6% 63.6% 0
6 amt_annuity_min float64 2.7 MB 159,918 47.2% 480 0.1% 27,542 8.1% 8.1% 2250.0
7 amt_annuity_max float64 2.7 MB 164,390 48.5% 480 0.1% 3,835 1.1% 1.1% 22500.0
8 amt_annuity_mean float64 2.7 MB 311,139 91.8% 480 0.1% 567 0.2% 0.2% 2250.0
9 amt_annuity_std float64 2.7 MB 262,010 77.3% 73,917 21.8% 465 0.1% 0.2% 0.0
10 amt_annuity_median float64 2.7 MB 241,854 71.4% 480 0.1% 2,229 0.7% 0.7% 11250.0
11 amt_annuity_range float64 2.7 MB 234,308 69.1% 480 0.1% 73,902 21.8% 21.8% 0.0
12 amt_application_min float64 2.7 MB 39,315 11.6% 0 0% 162,024 47.8% 47.8% 0.0
13 amt_application_max float64 2.7 MB 53,054 15.7% 0 0% 15,919 4.7% 4.7% 450000.0
14 amt_application_mean float64 2.7 MB 218,595 64.5% 0 0% 1,105 0.3% 0.3% 0.0
15 amt_application_std float64 2.7 MB 246,788 72.8% 60,458 17.8% 1,810 0.5% 0.7% 0.0
16 amt_application_median float64 2.7 MB 85,476 25.2% 0 0% 18,425 5.4% 5.4% 0.0
17 amt_application_range float64 2.7 MB 72,070 21.3% 0 0% 62,268 18.4% 18.4% 0.0
18 amt_credit_min float64 2.7 MB 40,983 12.1% 0 0% 136,261 40.2% 40.2% 0.0
19 amt_credit_max float64 2.7 MB 62,833 18.5% 0 0% 7,875 2.3% 2.3% 450000.0
20 amt_credit_mean float64 2.7 MB 239,733 70.7% 0 0% 456 0.1% 0.1% 45000.0
21 amt_credit_std float64 2.7 MB 256,680 75.7% 60,458 17.8% 528 0.2% 0.2% 0.0
22 amt_credit_median float64 2.7 MB 95,228 28.1% 0 0% 14,039 4.1% 4.1% 0.0
23 amt_credit_range float64 2.7 MB 97,356 28.7% 0 0% 60,986 18.0% 18.0% 0.0
24 amt_down_payment_min float64 2.7 MB 13,418 4.0% 20,104 5.9% 209,795 61.9% 65.8% 0.0
25 amt_down_payment_max float64 2.7 MB 23,434 6.9% 20,104 5.9% 91,088 26.9% 28.6% 0.0
26 amt_down_payment_mean float64 2.7 MB 59,904 17.7% 20,104 5.9% 91,087 26.9% 28.6% 0.0
27 amt_down_payment_std float64 2.7 MB 88,724 26.2% 146,201 43.1% 33,106 9.8% 17.2% 0.0
28 amt_down_payment_median float64 2.7 MB 26,428 7.8% 20,104 5.9% 126,088 37.2% 39.6% 0.0
29 amt_down_payment_range float64 2.7 MB 23,245 6.9% 20,104 5.9% 159,203 47.0% 49.9% 0.0
30 amt_goods_price_min float64 2.7 MB 50,839 15.0% 1,064 0.3% 20,036 5.9% 5.9% 45000.0
31 amt_goods_price_max float64 2.7 MB 53,050 15.7% 1,064 0.3% 15,923 4.7% 4.7% 450000.0
32 amt_goods_price_mean float64 2.7 MB 211,422 62.4% 1,064 0.3% 1,315 0.4% 0.4% 135000.0
33 amt_goods_price_std float64 2.7 MB 227,746 67.2% 74,570 22.0% 2,191 0.6% 0.8% 0.0
34 amt_goods_price_median float64 2.7 MB 91,665 27.1% 1,064 0.3% 7,493 2.2% 2.2% 135000.0
35 amt_goods_price_range float64 2.7 MB 110,829 32.7% 1,064 0.3% 75,697 22.3% 22.4% 0.0
36 rate_down_payment_min float32 1.4 MB 69,639 20.6% 20,104 5.9% 209,795 61.9% 65.8% 0.0
37 rate_down_payment_max float32 1.4 MB 125,138 36.9% 20,104 5.9% 91,088 26.9% 28.6% 0.0
38 rate_down_payment_mean float32 1.4 MB 184,996 54.6% 20,104 5.9% 91,087 26.9% 28.6% 0.0
39 rate_down_payment_std float32 1.4 MB 141,668 41.8% 146,201 43.1% 32,914 9.7% 17.1% 0.0
40 rate_down_payment_median float32 1.4 MB 134,571 39.7% 20,104 5.9% 126,088 37.2% 39.6% 0.0
41 rate_down_payment_range float32 1.4 MB 113,325 33.4% 20,104 5.9% 159,011 46.9% 49.9% 0.0
42 rate_interest_primary_min float32 1.4 MB 146 <0.1% 333,136 98.3% 1,161 0.3% 20.3% 0.18913634
43 rate_interest_primary_max float32 1.4 MB 146 <0.1% 333,136 98.3% 1,185 0.3% 20.7% 0.18913634
44 rate_interest_primary_mean float32 1.4 MB 210 0.1% 333,136 98.3% 1,146 0.3% 20.0% 0.18913634
45 rate_interest_primary_std float32 1.4 MB 61 <0.1% 338,639 99.9% 65 <0.1% 29.8% 0.0
46 rate_interest_primary_median float32 1.4 MB 202 0.1% 333,136 98.3% 1,148 0.3% 20.1% 0.18913634
47 rate_interest_primary_range float32 1.4 MB 55 <0.1% 333,136 98.3% 5,568 1.6% 97.3% 0.0
48 rate_interest_primary_count int64 2.7 MB 5 <0.1% 0 0% 333,136 98.3% 98.3% 0
49 rate_interest_privileged_min float32 1.4 MB 25 <0.1% 333,136 98.3% 1,642 0.5% 28.7% 0.83509517
50 rate_interest_privileged_max float32 1.4 MB 25 <0.1% 333,136 98.3% 1,669 0.5% 29.2% 0.83509517
51 rate_interest_privileged_mean float32 1.4 MB 52 <0.1% 333,136 98.3% 1,622 0.5% 28.4% 0.83509517
52 rate_interest_privileged_std float32 1.4 MB 26 <0.1% 338,639 99.9% 97 <0.1% 44.5% 0.0
53 rate_interest_privileged_median float32 1.4 MB 45 <0.1% 333,136 98.3% 1,624 0.5% 28.4% 0.83509517
54 rate_interest_privileged_range float32 1.4 MB 21 <0.1% 333,136 98.3% 5,600 1.7% 97.9% 0.0
55 rate_interest_privileged_count int64 2.7 MB 5 <0.1% 0 0% 333,136 98.3% 98.3% 0
56 n_different_contract_types int64 2.7 MB 4 <0.1% 0 0% 129,371 38.2% 38.2% 2
57 n_contract_status_approved int64 2.7 MB 26 <0.1% 0 0% 88,369 26.1% 26.1% 1
58 n_contract_status_canceled int64 2.7 MB 40 <0.1% 0 0% 206,163 60.8% 60.8% 0
59 n_contract_status_refused int64 2.7 MB 47 <0.1% 0 0% 220,580 65.1% 65.1% 0
60 n_contract_status_unused_offer int64 2.7 MB 13 <0.1% 0 0% 316,778 93.5% 93.5% 0
61 days_decision_min int16 677.7 kB 2,921 0.9% 0 0% 218 0.1% 0.1% -476
62 days_decision_max int16 677.7 kB 2,922 0.9% 0 0% 1,005 0.3% 0.3% -7
63 days_decision_mean float64 2.7 MB 65,447 19.3% 0 0% 174 0.1% 0.1% -355.0
64 days_decision_std float64 2.7 MB 213,594 63.0% 60,458 17.8% 6,599 1.9% 2.4% 0.0
65 days_decision_median float64 2.7 MB 5,723 1.7% 0 0% 413 0.1% 0.1% -364.0
66 days_decision_range int16 677.7 kB 2,920 0.9% 0 0% 67,057 19.8% 19.8% 0
67 n_payment_type_cash_through_bank int64 2.7 MB 48 <0.1% 0 0% 91,254 26.9% 26.9% 1
68 n_payment_type_cash_from_account int64 2.7 MB 1 <0.1% 0 0% 338,857 100.0% 100.0% 0
69 n_payment_type_not_available int64 2.7 MB 49 <0.1% 0 0% 117,205 34.6% 34.6% 0
70 n_reject_reason_not_applicable int64 2.7 MB 48 <0.1% 0 0% 72,626 21.4% 21.4% 1
71 n_reject_reason_hc int64 2.7 MB 38 <0.1% 0 0% 260,046 76.7% 76.7% 0
72 n_reject_reason_limit int64 2.7 MB 23 <0.1% 0 0% 305,796 90.2% 90.2% 0
73 n_reject_reason_scoc int64 2.7 MB 21 <0.1% 0 0% 313,863 92.6% 92.6% 0
74 n_reject_reason_client int64 2.7 MB 13 <0.1% 0 0% 316,778 93.5% 93.5% 0
75 n_reject_reason_scofr int64 2.7 MB 19 <0.1% 0 0% 330,820 97.6% 97.6% 0
76 n_client_type_new int64 2.7 MB 20 <0.1% 0 0% 254,343 75.1% 75.1% 1
77 n_client_type_repeater int64 2.7 MB 66 <0.1% 0 0% 81,285 24.0% 24.0% 0
78 n_client_type_refreshed int64 2.7 MB 25 <0.1% 0 0% 249,259 73.6% 73.6% 0
79 n_portfolio_pos int64 2.7 MB 35 <0.1% 0 0% 136,271 40.2% 40.2% 1
80 n_portfolio_cash int64 2.7 MB 39 <0.1% 0 0% 164,386 48.5% 48.5% 0
81 n_portfolio_cards int64 2.7 MB 22 <0.1% 0 0% 222,739 65.7% 65.7% 0
82 n_product_type_xsell int64 2.7 MB 35 <0.1% 0 0% 161,249 47.6% 47.6% 0
83 n_product_type_walk_in int64 2.7 MB 34 <0.1% 0 0% 253,278 74.7% 74.7% 0
84 n_different_channels int64 2.7 MB 7 <0.1% 0 0% 131,033 38.7% 38.7% 2
85 n_channel_type_credit_and_cash int64 2.7 MB 58 <0.1% 0 0% 158,669 46.8% 46.8% 0
86 n_channel_type_countrywide int64 2.7 MB 36 <0.1% 0 0% 112,423 33.2% 33.2% 1
87 n_channel_type_stone int64 2.7 MB 25 <0.1% 0 0% 202,971 59.9% 59.9% 0
88 n_channel_type_regional_and_local int64 2.7 MB 20 <0.1% 0 0% 263,217 77.7% 77.7% 0
89 n_channel_type_contact_center int64 2.7 MB 23 <0.1% 0 0% 290,870 85.8% 85.8% 0
90 n_channel_type_ap_minus int64 2.7 MB 36 <0.1% 0 0% 312,654 92.3% 92.3% 0
91 n_channel_type_channel_corporate_sales int64 2.7 MB 23 <0.1% 0 0% 336,352 99.3% 99.3% 0
92 n_channel_type_car_dealer int64 2.7 MB 6 <0.1% 0 0% 338,506 99.9% 99.9% 0
93 n_cnt_payment_0 int64 2.7 MB 22 <0.1% 0 0% 222,739 65.7% 65.7% 0
94 cnt_payment_min float32 1.4 MB 33 <0.1% 478 0.1% 116,118 34.3% 34.3% 0.0
95 cnt_payment_max float32 1.4 MB 44 <0.1% 478 0.1% 87,946 26.0% 26.0% 12.0
96 cnt_payment_mean float32 1.4 MB 3,000 0.9% 478 0.1% 41,964 12.4% 12.4% 12.0
97 cnt_payment_std float32 1.4 MB 19,288 5.7% 73,917 21.8% 16,773 4.9% 6.3% 0.0
98 cnt_payment_median float32 1.4 MB 92 <0.1% 478 0.1% 90,394 26.7% 26.7% 12.0
99 cnt_payment_range float32 1.4 MB 69 <0.1% 478 0.1% 90,212 26.6% 26.7% 0.0
100 n_yield_group_low_action int64 2.7 MB 24 <0.1% 0 0% 271,402 80.1% 80.1% 0
101 n_yield_group_low_normal int64 2.7 MB 26 <0.1% 0 0% 156,824 46.3% 46.3% 0
102 n_yield_group_middle int64 2.7 MB 29 <0.1% 0 0% 131,496 38.8% 38.8% 0
103 n_yield_group_high int64 2.7 MB 31 <0.1% 0 0% 149,675 44.2% 44.2% 0
104 days_first_draw_min float32 1.4 MB 2,838 0.8% 1,517 0.4% 274,837 81.1% 81.5% 365243.0
105 days_first_draw_max float32 1.4 MB 1,118 0.3% 1,517 0.4% 334,702 98.8% 99.2% 365243.0
106 days_first_draw_mean float32 1.4 MB 17,379 5.1% 1,517 0.4% 274,837 81.1% 81.5% 365243.0
107 days_first_draw_std float64 2.7 MB 17,027 5.0% 93,408 27.6% 185,577 54.8% 75.6% 0.0
108 days_first_draw_median float32 1.4 MB 3,264 1.0% 1,517 0.4% 322,224 95.1% 95.5% 365243.0
109 days_first_draw_range float32 1.4 MB 2,844 0.8% 1,517 0.4% 277,468 81.9% 82.3% 0.0
110 days_last_due_1st_version_min float32 1.4 MB 4,222 1.2% 1,517 0.4% 2,920 0.9% 0.9% 365243.0
111 days_last_due_1st_version_max float32 1.4 MB 4,560 1.3% 1,517 0.4% 93,314 27.5% 27.7% 365243.0
112 days_last_due_1st_version_mean float32 1.4 MB 65,986 19.5% 1,517 0.4% 2,920 0.9% 0.9% 365243.0
113 days_last_due_1st_version_std float64 2.7 MB 169,901 50.1% 93,408 27.6% 76 <0.1% <0.1% 241.83051916579925
114 days_last_due_1st_version_median float32 1.4 MB 11,486 3.4% 1,517 0.4% 2,966 0.9% 0.9% 365243.0
115 days_last_due_1st_version_range float32 1.4 MB 8,111 2.4% 1,517 0.4% 91,940 27.1% 27.3% 0.0
116 days_last_due_min float32 1.4 MB 2,873 0.8% 1,517 0.4% 23,418 6.9% 6.9% 365243.0
117 days_last_due_max float32 1.4 MB 2,793 0.8% 1,517 0.4% 164,234 48.5% 48.7% 365243.0
118 days_last_due_mean float32 1.4 MB 66,778 19.7% 1,517 0.4% 23,418 6.9% 6.9% 365243.0
119 days_last_due_std float64 2.7 MB 161,247 47.6% 93,408 27.6% 5,082 1.5% 2.1% 0.0
120 days_last_due_median float32 1.4 MB 8,067 2.4% 1,517 0.4% 34,523 10.2% 10.2% 365243.0
121 days_last_due_range float32 1.4 MB 5,628 1.7% 1,517 0.4% 96,973 28.6% 28.7% 0.0
122 days_termination_min float32 1.4 MB 2,830 0.8% 1,517 0.4% 25,760 7.6% 7.6% 365243.0
123 days_termination_max float32 1.4 MB 2,733 0.8% 1,517 0.4% 174,661 51.5% 51.8% 365243.0
124 days_termination_mean float32 1.4 MB 65,778 19.4% 1,517 0.4% 25,760 7.6% 7.6% 365243.0
125 days_termination_std float64 2.7 MB 153,047 45.2% 93,408 27.6% 5,733 1.7% 2.3% 0.0
126 days_termination_median float32 1.4 MB 7,915 2.3% 1,517 0.4% 37,945 11.2% 11.2% 365243.0
127 days_termination_range float32 1.4 MB 5,298 1.6% 1,517 0.4% 97,624 28.8% 28.9% 0.0
128 n_nflag_insured_on_approval_sum float32 1.4 MB 19 <0.1% 0 0% 158,702 46.8% 46.8% 0.0
129 n_nflag_insured_on_approval_mean float32 1.4 MB 113 <0.1% 1,517 0.4% 157,185 46.4% 46.6% 0.0
130 n_nflag_insured_on_approval_any bool 338.9 kB 2 <0.1% 0 0% 180,155 53.2% 53.2% True
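
In the summary above, 365243.0 is the dominant value of most `days_*` aggregates. This value (about 1000 years expressed in days) looks like a placeholder rather than a real day offset, so it likely distorts the min, max, mean and range features. The following is a minimal sketch, assuming the placeholder should be treated as missing, that replaces it with NaN before aggregation. The toy data and the names `SENTINEL`, `toy`, `cleaned` and `days_last_due_max` are hypothetical.

```python
import numpy as np
import pandas as pd

# Assumed placeholder value, inferred from the summary table (not documented here)
SENTINEL = 365243

# Toy stand-in for `previous_application` (values are made up for illustration)
toy = pd.DataFrame(
    {
        "SK_ID_CURR": [1, 1, 2],
        "DAYS_LAST_DUE": [-42.0, 365243.0, 365243.0],
        "DAYS_TERMINATION": [-37.0, 365243.0, -10.0],
    }
)

# Treat the placeholder as missing in the day-offset columns only
days_cols = ["DAYS_LAST_DUE", "DAYS_TERMINATION"]
cleaned = toy.copy()
cleaned[days_cols] = cleaned[days_cols].replace(SENTINEL, np.nan)

# Aggregates now ignore the placeholder; all-missing groups give NaN
days_last_due_max = cleaned.groupby("SK_ID_CURR")["DAYS_LAST_DUE"].max()
print(days_last_due_max)
```

Whether the placeholder carries information of its own (for example, "not yet due") is not established here; a separate count of placeholder values per client could preserve that signal.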

5.4 Table installments_payments

Code
installments_payments.head()
SK_ID_PREV SK_ID_CURR NUM_INSTALMENT_VERSION NUM_INSTALMENT_NUMBER DAYS_INSTALMENT DAYS_ENTRY_PAYMENT AMT_INSTALMENT AMT_PAYMENT
0 1054186 161674 1.00 6 -1180.00 -1187.00 6948.36 6948.36
1 1330831 151639 0.00 34 -2156.00 -2156.00 1716.53 1716.53
2 2085231 193053 2.00 1 -63.00 -63.00 25425.00 25425.00
3 2452527 199697 1.00 3 -2418.00 -2426.00 24350.13 24350.13
4 2714724 167756 1.00 2 -1383.00 -1366.00 2165.04 2160.59
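
The aggregation in the next cell is based on differences between scheduled and actual values. As a small worked example on the five rows printed above, the sketch below shows how the signs read: a negative day difference means the payment was entered after the due date (late), and a positive amount difference means less was paid than scheduled. The names `head`, `diff_days`, `diff_amt` and `status` are used only for this illustration.

```python
import pandas as pd

# The five rows shown above (only the columns needed here)
head = pd.DataFrame(
    {
        "DAYS_INSTALMENT": [-1180.0, -2156.0, -63.0, -2418.0, -1383.0],
        "DAYS_ENTRY_PAYMENT": [-1187.0, -2156.0, -63.0, -2426.0, -1366.0],
        "AMT_INSTALMENT": [6948.36, 1716.53, 25425.00, 24350.13, 2165.04],
        "AMT_PAYMENT": [6948.36, 1716.53, 25425.00, 24350.13, 2160.59],
    }
)

# Same definitions as in the aggregation code: scheduled minus actual
diff_days = head["DAYS_INSTALMENT"] - head["DAYS_ENTRY_PAYMENT"]
diff_amt = head["AMT_INSTALMENT"] - head["AMT_PAYMENT"]

# Negative: paid after the due date; positive: paid before it
status = diff_days.map(
    lambda d: "late" if d < 0 else "early" if d > 0 else "on time"
)

print(pd.DataFrame({"diff_days": diff_days, "status": status, "diff_amt": diff_amt}))
```

For the last row, the payment was entered 17 days after the due date and was about 4.45 below the scheduled amount.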
Code
file = dir_interim + "aggregated--installments_payments_aggregated.feather"

if os.path.exists(file):
    installments_payments_aggregated = pd.read_feather(file)

else:
    installments_payments_aggregated = (
        installments_payments.assign(
            diff_days_installment_payment=lambda df: df["DAYS_INSTALMENT"]
            - df["DAYS_ENTRY_PAYMENT"],
            diff_amt_installment_payment=lambda df: df["AMT_INSTALMENT"]
            - df["AMT_PAYMENT"],
            # Ratio of scheduled to paid amount (1.0 means paid exactly as scheduled)
            diff_percent_installment_payment=lambda df: np.where(
                df["AMT_PAYMENT"] == 0,  # To avoid infinite values
                np.nan,
                df["AMT_INSTALMENT"] / df["AMT_PAYMENT"],
            ),
        )
        .groupby("SK_ID_CURR")
        .agg(
            n_installments_total=("SK_ID_PREV", "count"),
            n_installments_late=(
                "diff_days_installment_payment",
                lambda x: (x < 0).sum(),
            ),
            n_installments_early=(
                "diff_days_installment_payment",
                lambda x: (x > 0).sum(),
            ),
            n_installments_on_time=(
                "diff_days_installment_payment",
                lambda x: (x == 0).sum(),
            ),
            percent_installments_late=(
                "diff_days_installment_payment",
                lambda x: (x < 0).mean(),
            ),
            percent_installments_early=(
                "diff_days_installment_payment",
                lambda x: (x > 0).mean(),
            ),
            percent_installments_on_time=(
                "diff_days_installment_payment",
                lambda x: (x == 0).mean(),
            ),
            n_installments_late_7=(
                "diff_days_installment_payment",
                lambda x: (x < -7).sum(),
            ),
            n_installments_late_30=(
                "diff_days_installment_payment",
                lambda x: (x < -30).sum(),
            ),
            n_installments_late_60=(
                "diff_days_installment_payment",
                lambda x: (x < -60).sum(),
            ),
            any_installments_late_7=(
                "diff_days_installment_payment",
                lambda x: (x < -7).any().astype("int8"),
            ),
            any_installments_late_30=(
                "diff_days_installment_payment",
                lambda x: (x < -30).any().astype("int8"),
            ),
            any_installments_late_60=(
                "diff_days_installment_payment",
                lambda x: (x < -60).any().astype("int8"),
            ),
            percent_installments_late_7=(
                "diff_days_installment_payment",
                lambda x: (x < -7).mean(),
            ),
            percent_installments_late_30=(
                "diff_days_installment_payment",
                lambda x: (x < -30).mean(),
            ),
            percent_installments_late_60=(
                "diff_days_installment_payment",
                lambda x: (x < -60).mean(),
            ),
            diff_days_installment_payment_min=("diff_days_installment_payment", "min"),
            diff_days_installment_payment_max=("diff_days_installment_payment", "max"),
            diff_days_installment_payment_mean=(
                "diff_days_installment_payment",
                "mean",
            ),
            diff_days_installment_payment_std=("diff_days_installment_payment", "std"),
            diff_days_installment_payment_median=(
                "diff_days_installment_payment",
                "median",
            ),
            diff_days_installment_payment_range=(
                "diff_days_installment_payment",
                lambda x: x.max() - x.min(),
            ),
            diff_days_installment_payment_sum=("diff_days_installment_payment", "sum"),
            diff_days_installment_payment_sum_late_only=(
                "diff_days_installment_payment",
                lambda x: x[x < 0].sum(),
            ),
            diff_amt_installment_payment_min=("diff_amt_installment_payment", "min"),
            diff_amt_installment_payment_max=("diff_amt_installment_payment", "max"),
            diff_amt_installment_payment_mean=("diff_amt_installment_payment", "mean"),
            diff_amt_installment_payment_std=("diff_amt_installment_payment", "std"),
            diff_amt_installment_payment_median=(
                "diff_amt_installment_payment",
                "median",
            ),
            diff_amt_installment_payment_range=(
                "diff_amt_installment_payment",
                lambda x: x.max() - x.min(),
            ),
            diff_percent_installment_payment_min=(
                "diff_percent_installment_payment",
                "min",
            ),
            diff_percent_installment_payment_max=(
                "diff_percent_installment_payment",
                "max",
            ),
            diff_percent_installment_payment_mean=(
                "diff_percent_installment_payment",
                "mean",
            ),
            diff_percent_installment_payment_std=(
                "diff_percent_installment_payment",
                "std",
            ),
            diff_percent_installment_payment_median=(
                "diff_percent_installment_payment",
                "median",
            ),
            diff_percent_installment_payment_range=(
                "diff_percent_installment_payment",
                lambda x: x.max() - x.min(),
            ),
        )
    ).reset_index()

    installments_payments_aggregated.to_feather(file)

del file
# Time: 13m 13.4s
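The aggregation above relies on Python lambdas, which pandas evaluates group by group; this is probably one reason the cell takes about 13 minutes. A possible speed-up, sketched here on a made-up toy frame (`toy`, `flags`, and the `is_*` column names are illustrative assumptions, not part of the project code), is to compute boolean indicator columns once for the whole table and then aggregate them with the built-in `"sum"`, `"mean"`, and `"max"` reducers. The sketch has not been benchmarked on the real data.

```python
import pandas as pd

# Toy stand-in for the installments table (values are illustrative only).
toy = pd.DataFrame(
    {
        "SK_ID_CURR": [1, 1, 1, 2, 2],
        "diff_days_installment_payment": [-11.0, 0.0, 5.0, -40.0, 3.0],
    }
)

d = toy["diff_days_installment_payment"]

# Indicator columns are computed once for the whole table (vectorized).
flags = toy.assign(
    is_late=d < 0,
    is_early=d > 0,
    is_on_time=d == 0,
    is_late_30=d < -30,
)

# Built-in reducers replace the per-group lambdas.
fast = flags.groupby("SK_ID_CURR").agg(
    n_installments_late=("is_late", "sum"),
    n_installments_early=("is_early", "sum"),
    n_installments_on_time=("is_on_time", "sum"),
    percent_installments_late=("is_late", "mean"),
    n_installments_late_30=("is_late_30", "sum"),
    any_installments_late_30=("is_late_30", "max"),
)
# Match the int8 flag type used in the lambda-based aggregation.
fast["any_installments_late_30"] = fast["any_installments_late_30"].astype("int8")

# Reference result computed with lambdas, as in the cell above.
slow = toy.groupby("SK_ID_CURR").agg(
    n_installments_late=("diff_days_installment_payment", lambda x: (x < 0).sum()),
    percent_installments_late=(
        "diff_days_installment_payment",
        lambda x: (x < 0).mean(),
    ),
)
```

On the toy data both variants give the same counts and shares, so the indicator-based form could be a drop-in replacement for the count, percent, and any-type features, while the min/max/mean/std/median features already use built-in reducers.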
Code
installments_payments_aggregated.shape
(339587, 37)
Code
installments_payments_aggregated.head()
SK_ID_CURR n_installments_total n_installments_late n_installments_early n_installments_on_time percent_installments_late percent_installments_early percent_installments_on_time n_installments_late_7 n_installments_late_30 n_installments_late_60 any_installments_late_7 any_installments_late_30 any_installments_late_60 percent_installments_late_7 percent_installments_late_30 percent_installments_late_60 diff_days_installment_payment_min diff_days_installment_payment_max diff_days_installment_payment_mean diff_days_installment_payment_std diff_days_installment_payment_median diff_days_installment_payment_range diff_days_installment_payment_sum diff_days_installment_payment_sum_late_only diff_amt_installment_payment_min diff_amt_installment_payment_max diff_amt_installment_payment_mean diff_amt_installment_payment_std diff_amt_installment_payment_median diff_amt_installment_payment_range diff_percent_installment_payment_min diff_percent_installment_payment_max diff_percent_installment_payment_mean diff_percent_installment_payment_std diff_percent_installment_payment_median diff_percent_installment_payment_range
0 100001 7 1 4 2 0.14 0.57 0.29 1 0 0 1 0 0 0.14 0.00 0.00 -11.00 36.00 7.29 14.63 6.00 47.00 51.00 -11.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 1.00 1.00 0.00 1.00 0.00
1 100002 19 0 19 0 0.00 1.00 0.00 0 0 0 0 0 0 0.00 0.00 0.00 12.00 31.00 20.42 4.93 19.00 19.00 388.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 1.00 1.00 0.00 1.00 0.00
2 100003 25 0 25 0 0.00 1.00 0.00 0 0 0 0 0 0 0.00 0.00 0.00 1.00 14.00 7.16 3.73 6.00 13.00 179.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 1.00 1.00 0.00 1.00 0.00
3 100004 3 0 3 0 0.00 1.00 0.00 0 0 0 0 0 0 0.00 0.00 0.00 3.00 11.00 7.67 4.16 9.00 8.00 23.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 1.00 1.00 0.00 1.00 0.00
4 100005 9 1 8 0 0.11 0.89 0.00 0 0 0 0 0 0 0.00 0.00 0.00 -1.00 37.00 23.56 13.51 29.00 38.00 212.00 -1.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 1.00 1.00 0.00 1.00 0.00
Code
an.col_info(installments_payments_aggregated, style=True)
  column data_type memory_size n_unique p_unique n_missing p_missing n_dominant p_dominant p_dom_excl_na dominant
1 SK_ID_CURR int32 1.4 MB 339,587 100.0% 0 0% 1 <0.1% <0.1% 100001
2 n_installments_total int64 2.7 MB 323 0.1% 0 0% 14,187 4.2% 4.2% 12
3 n_installments_late int64 2.7 MB 103 <0.1% 0 0% 159,742 47.0% 47.0% 0
4 n_installments_early int64 2.7 MB 227 0.1% 0 0% 15,226 4.5% 4.5% 6
5 n_installments_on_time int64 2.7 MB 150 <0.1% 0 0% 146,945 43.3% 43.3% 0
6 percent_installments_late float64 2.7 MB 5,237 1.5% 0 0% 159,742 47.0% 47.0% 0.0
7 percent_installments_early float64 2.7 MB 9,175 2.7% 0 0% 107,885 31.8% 31.8% 1.0
8 percent_installments_on_time float64 2.7 MB 9,266 2.7% 0 0% 146,945 43.3% 43.3% 0.0
9 n_installments_late_7 int64 2.7 MB 61 <0.1% 0 0% 246,891 72.7% 72.7% 0
10 n_installments_late_30 int64 2.7 MB 47 <0.1% 0 0% 318,171 93.7% 93.7% 0
11 n_installments_late_60 int64 2.7 MB 42 <0.1% 0 0% 329,789 97.1% 97.1% 0
12 any_installments_late_7 int8 339.6 kB 2 <0.1% 0 0% 246,891 72.7% 72.7% 0
13 any_installments_late_30 int8 339.6 kB 2 <0.1% 0 0% 318,171 93.7% 93.7% 0
14 any_installments_late_60 int8 339.6 kB 2 <0.1% 0 0% 329,789 97.1% 97.1% 0
15 percent_installments_late_7 float64 2.7 MB 3,033 0.9% 0 0% 246,891 72.7% 72.7% 0.0
16 percent_installments_late_30 float64 2.7 MB 1,061 0.3% 0 0% 318,171 93.7% 93.7% 0.0
17 percent_installments_late_60 float64 2.7 MB 762 0.2% 0 0% 329,789 97.1% 97.1% 0.0
18 diff_days_installment_payment_min float32 1.4 MB 1,736 0.5% 9 <0.1% 51,800 15.3% 15.3% 0.0
19 diff_days_installment_payment_max float32 1.4 MB 455 0.1% 9 <0.1% 25,531 7.5% 7.5% 30.0
20 diff_days_installment_payment_mean float32 1.4 MB 68,196 20.1% 9 <0.1% 1,295 0.4% 0.4% 9.0
21 diff_days_installment_payment_std float32 1.4 MB 252,379 74.3% 977 0.3% 586 0.2% 0.2% 0.0
22 diff_days_installment_payment_median float32 1.4 MB 351 0.1% 9 <0.1% 35,844 10.6% 10.6% 0.0
23 diff_days_installment_payment_range float32 1.4 MB 1,760 0.5% 9 <0.1% 9,072 2.7% 2.7% 30.0
24 diff_days_installment_payment_sum float32 1.4 MB 5,110 1.5% 0 0% 885 0.3% 0.3% 77.0
25 diff_days_installment_payment_sum_late_only float32 1.4 MB 2,155 0.6% 0 0% 159,742 47.0% 47.0% 0.0
26 diff_amt_installment_payment_min float64 2.7 MB 41,900 12.3% 9 <0.1% 294,774 86.8% 86.8% 0.0
27 diff_amt_installment_payment_max float64 2.7 MB 119,803 35.3% 9 <0.1% 194,218 57.2% 57.2% 0.0
28 diff_amt_installment_payment_mean float64 2.7 MB 160,502 47.3% 9 <0.1% 171,327 50.5% 50.5% 0.0
29 diff_amt_installment_payment_std float64 2.7 MB 167,917 49.4% 977 0.3% 170,360 50.2% 50.3% 0.0
30 diff_amt_installment_payment_median float64 2.7 MB 10,695 3.1% 9 <0.1% 326,204 96.1% 96.1% 0.0
31 diff_amt_installment_payment_range float64 2.7 MB 145,084 42.7% 9 <0.1% 171,328 50.5% 50.5% 0.0
32 diff_percent_installment_payment_min float64 2.7 MB 42,855 12.6% 9 <0.1% 294,774 86.8% 86.8% 1.0
33 diff_percent_installment_payment_max float64 2.7 MB 135,199 39.8% 9 <0.1% 194,452 57.3% 57.3% 1.0
34 diff_percent_installment_payment_mean float64 2.7 MB 145,164 42.7% 9 <0.1% 171,537 50.5% 50.5% 1.0
35 diff_percent_installment_payment_std float64 2.7 MB 167,620 49.4% 977 0.3% 170,571 50.2% 50.4% 0.0
36 diff_percent_installment_payment_median float64 2.7 MB 13,240 3.9% 9 <0.1% 326,204 96.1% 96.1% 1.0
37 diff_percent_installment_payment_range float64 2.7 MB 159,045 46.8% 9 <0.1% 171,539 50.5% 50.5% 0.0
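The summary above shows three different missing-value counts for statistics derived from the same source column: 0 for the `sum` features, 9 for `min`, `max`, `mean`, `median`, and `range`, and 977 for `std`. This pattern is consistent with pandas defaults: `"sum"` returns 0 for a group without valid values, `"min"` returns NaN for such a group, and `"std"` (sample standard deviation) is additionally NaN for a group with a single valid value. A minimal illustration on made-up data (the frame `toy` and the column `diff_days` are assumptions for this sketch):

```python
import numpy as np
import pandas as pd

# Client 1 has two valid values, client 2 has one, client 3 has none.
toy = pd.DataFrame(
    {
        "SK_ID_CURR": [1, 1, 2, 3],
        "diff_days": [4.0, 6.0, 5.0, np.nan],
    }
)

stats = toy.groupby("SK_ID_CURR").agg(
    diff_sum=("diff_days", "sum"),
    diff_min=("diff_days", "min"),
    diff_std=("diff_days", "std"),
)

# Number of missing values per aggregated feature.
n_missing = stats.isna().sum()
```

Under this reading, the 9 clients with missing `min`/`max` values have no valid payment dates at all, and the `sum` features for them are 0 rather than missing.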

5.5 Table pos_cash_balance

Code
pos_cash_balance.head()
SK_ID_PREV SK_ID_CURR MONTHS_BALANCE CNT_INSTALMENT CNT_INSTALMENT_FUTURE NAME_CONTRACT_STATUS SK_DPD SK_DPD_DEF
0 1803195 182943 -31 48.00 45.00 Active 0 0
1 1715348 367990 -33 36.00 35.00 Active 0 0
2 1784872 397406 -32 12.00 9.00 Active 0 0
3 1903291 269225 -35 48.00 42.00 Active 0 0
4 2341044 334279 -35 36.00 35.00 Active 0 0
Code
file = dir_interim + "aggregated--pos_cash_balance_aggregated.feather"

if os.path.exists(file):
    pos_cash_balance_aggregated = pd.read_feather(file)
else:
    pos_cash_balance_aggregated = (
        pos_cash_balance.assign(
            cnt_installments_diff=lambda df: df["CNT_INSTALMENT"]
            - df["CNT_INSTALMENT_FUTURE"]
        )
        .groupby("SK_ID_CURR")
        .agg(
            n_previous_pos_applications=("SK_ID_PREV", "count"),
            n_previous_pos_applications_active=(
                "NAME_CONTRACT_STATUS",
                lambda x: (x == "Active").sum(),
            ),
            n_previous_pos_applications_signed=(
                "NAME_CONTRACT_STATUS",
                lambda x: (x == "Signed").sum(),
            ),
            n_previous_pos_applications_completed=(
                "NAME_CONTRACT_STATUS",
                lambda x: (x == "Completed").sum(),
            ),
            cnt_installment_min=("CNT_INSTALMENT", "min"),
            cnt_installment_max=("CNT_INSTALMENT", "max"),
            cnt_installment_mean=("CNT_INSTALMENT", "mean"),
            cnt_installment_std=("CNT_INSTALMENT", "std"),
            cnt_installment_median=("CNT_INSTALMENT", "median"),
            cnt_installment_range=("CNT_INSTALMENT", lambda x: x.max() - x.min()),
            cnt_installment_future_min=("CNT_INSTALMENT_FUTURE", "min"),
            cnt_installment_future_max=("CNT_INSTALMENT_FUTURE", "max"),
            cnt_installment_future_mean=("CNT_INSTALMENT_FUTURE", "mean"),
            cnt_installment_future_std=("CNT_INSTALMENT_FUTURE", "std"),
            cnt_installment_future_median=("CNT_INSTALMENT_FUTURE", "median"),
            cnt_installment_future_range=(
                "CNT_INSTALMENT_FUTURE",
                lambda x: x.max() - x.min(),
            ),
            cnt_installments_diff_min=("cnt_installments_diff", "min"),
            cnt_installments_diff_max=("cnt_installments_diff", "max"),
            cnt_installments_diff_mean=("cnt_installments_diff", "mean"),
            cnt_installments_diff_std=("cnt_installments_diff", "std"),
            cnt_installments_diff_median=("cnt_installments_diff", "median"),
            cnt_installments_diff_range=(
                "cnt_installments_diff",
                lambda x: x.max() - x.min(),
            ),
            sk_dpd_pos_applications_min=("SK_DPD", "min"),
            sk_dpd_pos_applications_max=("SK_DPD", "max"),
            sk_dpd_pos_applications_mean=("SK_DPD", "mean"),
            sk_dpd_pos_applications_std=("SK_DPD", "std"),
            sk_dpd_pos_applications_median=("SK_DPD", "median"),
            sk_dpd_pos_applications_range=("SK_DPD", lambda x: x.max() - x.min()),
            sk_dpd_def_pos_applications_min=("SK_DPD_DEF", "min"),
            sk_dpd_def_pos_applications_max=("SK_DPD_DEF", "max"),
            sk_dpd_def_pos_applications_mean=("SK_DPD_DEF", "mean"),
            sk_dpd_def_pos_applications_std=("SK_DPD_DEF", "std"),
            sk_dpd_def_pos_applications_median=("SK_DPD_DEF", "median"),
            sk_dpd_def_pos_applications_range=(
                "SK_DPD_DEF",
                lambda x: x.max() - x.min(),
            ),
        )
        .reset_index()
    )

    pos_cash_balance_aggregated.to_feather(file)

del file
# Time: 4m 50.1s
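Note that `pos_cash_balance` appears to hold one row per contract and month (see `MONTHS_BALANCE`), so `("SK_ID_PREV", "count")` counts monthly records rather than distinct previous contracts. If the number of distinct contracts is also of interest, `"nunique"` could be used alongside `"count"`. A minimal sketch on made-up data (the output names `n_monthly_records` and `n_pos_contracts` are hypothetical and not used elsewhere in the project):

```python
import pandas as pd

# Toy data: client 1 has two contracts (10 and 11), client 2 has one (20).
toy = pd.DataFrame(
    {
        "SK_ID_PREV": [10, 10, 10, 11, 20],
        "SK_ID_CURR": [1, 1, 1, 1, 2],
        "MONTHS_BALANCE": [-3, -2, -1, -1, -5],
    }
)

counts = (
    toy.groupby("SK_ID_CURR")
    .agg(
        n_monthly_records=("SK_ID_PREV", "count"),  # rows per client
        n_pos_contracts=("SK_ID_PREV", "nunique"),  # distinct contracts per client
    )
    .reset_index()
)
```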
Code
pos_cash_balance_aggregated.shape
(337252, 35)
Code
an.col_info(pos_cash_balance_aggregated, style=True)
  column data_type memory_size n_unique p_unique n_missing p_missing n_dominant p_dominant p_dom_excl_na dominant
1 SK_ID_CURR int32 1.3 MB 337,252 100.0% 0 0% 1 <0.1% <0.1% 100001
2 n_previous_pos_applications int64 2.7 MB 234 0.1% 0 0% 15,747 4.7% 4.7% 13
3 n_previous_pos_applications_active int64 2.7 MB 217 0.1% 0 0% 18,950 5.6% 5.6% 12
4 n_previous_pos_applications_signed int64 2.7 MB 32 <0.1% 0 0% 269,536 79.9% 79.9% 0
5 n_previous_pos_applications_completed int64 2.7 MB 52 <0.1% 0 0% 121,689 36.1% 36.1% 1
6 cnt_installment_min float32 1.3 MB 58 <0.1% 28 <0.1% 70,190 20.8% 20.8% 6.0
7 cnt_installment_max float32 1.3 MB 65 <0.1% 28 <0.1% 96,939 28.7% 28.7% 12.0
8 cnt_installment_mean float32 1.3 MB 45,080 13.4% 28 <0.1% 24,836 7.4% 7.4% 12.0
9 cnt_installment_std float32 1.3 MB 134,889 40.0% 394 0.1% 81,052 24.0% 24.1% 0.0
10 cnt_installment_median float32 1.3 MB 106 <0.1% 28 <0.1% 102,288 30.3% 30.3% 12.0
11 cnt_installment_range float32 1.3 MB 72 <0.1% 28 <0.1% 81,418 24.1% 24.1% 0.0
12 cnt_installment_future_min float32 1.3 MB 61 <0.1% 28 <0.1% 305,633 90.6% 90.6% 0.0
13 cnt_installment_future_max float32 1.3 MB 65 <0.1% 28 <0.1% 95,391 28.3% 28.3% 12.0
14 cnt_installment_future_mean float32 1.3 MB 43,319 12.8% 28 <0.1% 11,968 3.5% 3.5% 6.0
15 cnt_installment_future_std float32 1.3 MB 145,833 43.2% 394 0.1% 11,689 3.5% 3.5% 2.1602468
16 cnt_installment_future_median float32 1.3 MB 121 <0.1% 28 <0.1% 36,769 10.9% 10.9% 6.0
17 cnt_installment_future_range float32 1.3 MB 68 <0.1% 28 <0.1% 86,320 25.6% 25.6% 12.0
18 cnt_installments_diff_min float32 1.3 MB 63 <0.1% 28 <0.1% 329,680 97.8% 97.8% 0.0
19 cnt_installments_diff_max float32 1.3 MB 68 <0.1% 28 <0.1% 59,647 17.7% 17.7% 12.0
20 cnt_installments_diff_mean float32 1.3 MB 26,103 7.7% 28 <0.1% 14,771 4.4% 4.4% 3.0
21 cnt_installments_diff_std float32 1.3 MB 112,198 33.3% 394 0.1% 12,391 3.7% 3.7% 2.1602468
22 cnt_installments_diff_median float32 1.3 MB 67 <0.1% 28 <0.1% 49,529 14.7% 14.7% 4.0
23 cnt_installments_diff_range float32 1.3 MB 89 <0.1% 28 <0.1% 59,158 17.5% 17.5% 12.0
24 sk_dpd_pos_applications_min int16 674.5 kB 65 <0.1% 0 0% 337,185 >99.9% >99.9% 0
25 sk_dpd_pos_applications_max int16 674.5 kB 2,025 0.6% 0 0% 274,268 81.3% 81.3% 0
26 sk_dpd_pos_applications_mean float64 2.7 MB 11,737 3.5% 0 0% 274,268 81.3% 81.3% 0.0
27 sk_dpd_pos_applications_std float64 2.7 MB 34,423 10.2% 372 0.1% 273,896 81.2% 81.3% 0.0
28 sk_dpd_pos_applications_median float64 2.7 MB 1,210 0.4% 0 0% 334,735 99.3% 99.3% 0.0
29 sk_dpd_pos_applications_range int16 674.5 kB 1,984 0.6% 0 0% 274,268 81.3% 81.3% 0
30 sk_dpd_def_pos_applications_min int16 674.5 kB 3 <0.1% 0 0% 337,250 >99.9% >99.9% 0
31 sk_dpd_def_pos_applications_max int16 674.5 kB 217 0.1% 0 0% 291,303 86.4% 86.4% 0
32 sk_dpd_def_pos_applications_mean float64 2.7 MB 4,722 1.4% 0 0% 291,303 86.4% 86.4% 0.0
33 sk_dpd_def_pos_applications_std float64 2.7 MB 20,867 6.2% 372 0.1% 290,931 86.3% 86.4% 0.0
34 sk_dpd_def_pos_applications_median float64 2.7 MB 80 <0.1% 0 0% 336,957 99.9% 99.9% 0.0
35 sk_dpd_def_pos_applications_range int16 674.5 kB 216 0.1% 0 0% 291,303 86.4% 86.4% 0

5.6 Table credit_card_balance

Code
credit_card_balance.head()
SK_ID_PREV SK_ID_CURR MONTHS_BALANCE AMT_BALANCE AMT_CREDIT_LIMIT_ACTUAL AMT_DRAWINGS_ATM_CURRENT AMT_DRAWINGS_CURRENT AMT_DRAWINGS_OTHER_CURRENT AMT_DRAWINGS_POS_CURRENT AMT_INST_MIN_REGULARITY AMT_PAYMENT_CURRENT AMT_PAYMENT_TOTAL_CURRENT AMT_RECEIVABLE_PRINCIPAL AMT_RECIVABLE AMT_TOTAL_RECEIVABLE CNT_DRAWINGS_ATM_CURRENT CNT_DRAWINGS_CURRENT CNT_DRAWINGS_OTHER_CURRENT CNT_DRAWINGS_POS_CURRENT CNT_INSTALMENT_MATURE_CUM NAME_CONTRACT_STATUS SK_DPD SK_DPD_DEF
0 2562384 378907 -6 56.97 135000 0.00 877.50 0.00 877.50 1700.33 1800.00 1800.00 0.00 0.00 0.00 0.00 1 0.00 1.00 35.00 Active 0 0
1 2582071 363914 -1 63975.56 45000 2250.00 2250.00 0.00 0.00 2250.00 2250.00 2250.00 60175.08 64875.56 64875.56 1.00 1 0.00 0.00 69.00 Active 0 0
2 1740877 371185 -7 31815.22 450000 0.00 0.00 0.00 0.00 2250.00 2250.00 2250.00 26926.42 31460.08 31460.08 0.00 0 0.00 0.00 30.00 Active 0 0
3 1389973 337855 -4 236572.11 225000 2250.00 2250.00 0.00 0.00 11795.76 11925.00 11925.00 224949.29 233048.97 233048.97 1.00 1 0.00 0.00 10.00 Active 0 0
4 1891521 126868 -1 453919.46 450000 0.00 11547.00 0.00 11547.00 22924.89 27000.00 27000.00 443044.40 453919.46 453919.46 0.00 1 0.00 1.00 101.00 Active 0 0
Code
file = dir_interim + "aggregated--credit_card_balance_aggregated.feather"

if os.path.exists(file):
    credit_card_balance_aggregated = pd.read_feather(file)
else:
    credit_card_balance_aggregated = (
        credit_card_balance.groupby("SK_ID_CURR")
        .agg(
            n_previous_credit_card_applications=("SK_ID_PREV", "count"),
            n_previous_credit_card_applications_completed=(
                "NAME_CONTRACT_STATUS",
                lambda x: (x == "Completed").sum(),
            ),
            n_previous_credit_card_applications_active=(
                "NAME_CONTRACT_STATUS",
                lambda x: (x == "Active").sum(),
            ),
            n_previous_credit_card_applications_signed=(
                "NAME_CONTRACT_STATUS",
                lambda x: (x == "Signed").sum(),
            ),
            n_contracts_credit_card_active=(
                "NAME_CONTRACT_STATUS",
                lambda x: (x == "Active").sum(),
            ),
            n_contracts_credit_card_completed=(
                "NAME_CONTRACT_STATUS",
                lambda x: (x == "Completed").sum(),
            ),
            n_contracts_credit_card_signed=(
                "NAME_CONTRACT_STATUS",
                lambda x: (x == "Signed").sum(),
            ),
            amt_balance_credit_card_min=("AMT_BALANCE", "min"),
            amt_balance_credit_card_max=("AMT_BALANCE", "max"),
            amt_balance_credit_card_mean=("AMT_BALANCE", "mean"),
            amt_balance_credit_card_std=("AMT_BALANCE", "std"),
            amt_balance_credit_card_median=("AMT_BALANCE", "median"),
            amt_balance_credit_card_range=("AMT_BALANCE", lambda x: x.max() - x.min()),
            amt_credit_limit_actual_min=("AMT_CREDIT_LIMIT_ACTUAL", "min"),
            amt_credit_limit_actual_max=("AMT_CREDIT_LIMIT_ACTUAL", "max"),
            amt_credit_limit_actual_mean=("AMT_CREDIT_LIMIT_ACTUAL", "mean"),
            amt_credit_limit_actual_std=("AMT_CREDIT_LIMIT_ACTUAL", "std"),
            amt_credit_limit_actual_median=("AMT_CREDIT_LIMIT_ACTUAL", "median"),
            amt_credit_limit_actual_range=(
                "AMT_CREDIT_LIMIT_ACTUAL",
                lambda x: x.max() - x.min(),
            ),
            amt_drawings_atm_current_min=("AMT_DRAWINGS_ATM_CURRENT", "min"),
            amt_drawings_atm_current_max=("AMT_DRAWINGS_ATM_CURRENT", "max"),
            amt_drawings_atm_current_mean=("AMT_DRAWINGS_ATM_CURRENT", "mean"),
            amt_drawings_atm_current_std=("AMT_DRAWINGS_ATM_CURRENT", "std"),
            amt_drawings_atm_current_median=("AMT_DRAWINGS_ATM_CURRENT", "median"),
            amt_drawings_atm_current_range=(
                "AMT_DRAWINGS_ATM_CURRENT",
                lambda x: x.max() - x.min(),
            ),
            amt_drawings_current_min=("AMT_DRAWINGS_CURRENT", "min"),
            amt_drawings_current_max=("AMT_DRAWINGS_CURRENT", "max"),
            amt_drawings_current_mean=("AMT_DRAWINGS_CURRENT", "mean"),
            amt_drawings_current_std=("AMT_DRAWINGS_CURRENT", "std"),
            amt_drawings_current_median=("AMT_DRAWINGS_CURRENT", "median"),
            amt_drawings_current_range=(
                "AMT_DRAWINGS_CURRENT",
                lambda x: x.max() - x.min(),
            ),
            amt_drawings_other_current_min=("AMT_DRAWINGS_OTHER_CURRENT", "min"),
            amt_drawings_other_current_max=("AMT_DRAWINGS_OTHER_CURRENT", "max"),
            amt_drawings_other_current_mean=("AMT_DRAWINGS_OTHER_CURRENT", "mean"),
            amt_drawings_other_current_std=("AMT_DRAWINGS_OTHER_CURRENT", "std"),
            amt_drawings_other_current_median=("AMT_DRAWINGS_OTHER_CURRENT", "median"),
            amt_drawings_other_current_range=(
                "AMT_DRAWINGS_OTHER_CURRENT",
                lambda x: x.max() - x.min(),
            ),
            amt_drawings_pos_current_min=("AMT_DRAWINGS_POS_CURRENT", "min"),
            amt_drawings_pos_current_max=("AMT_DRAWINGS_POS_CURRENT", "max"),
            amt_drawings_pos_current_mean=("AMT_DRAWINGS_POS_CURRENT", "mean"),
            amt_drawings_pos_current_std=("AMT_DRAWINGS_POS_CURRENT", "std"),
            amt_drawings_pos_current_median=("AMT_DRAWINGS_POS_CURRENT", "median"),
            amt_drawings_pos_current_range=(
                "AMT_DRAWINGS_POS_CURRENT",
                lambda x: x.max() - x.min(),
            ),
            amt_inst_min_regularity_min=("AMT_INST_MIN_REGULARITY", "min"),
            amt_inst_min_regularity_max=("AMT_INST_MIN_REGULARITY", "max"),
            amt_inst_min_regularity_mean=("AMT_INST_MIN_REGULARITY", "mean"),
            amt_inst_min_regularity_std=("AMT_INST_MIN_REGULARITY", "std"),
            amt_inst_min_regularity_median=("AMT_INST_MIN_REGULARITY", "median"),
            amt_inst_min_regularity_range=(
                "AMT_INST_MIN_REGULARITY",
                lambda x: x.max() - x.min(),
            ),
            amt_payment_current_min=("AMT_PAYMENT_CURRENT", "min"),
            amt_payment_current_max=("AMT_PAYMENT_CURRENT", "max"),
            amt_payment_current_mean=("AMT_PAYMENT_CURRENT", "mean"),
            amt_payment_current_std=("AMT_PAYMENT_CURRENT", "std"),
            amt_payment_current_median=("AMT_PAYMENT_CURRENT", "median"),
            amt_payment_current_range=(
                "AMT_PAYMENT_CURRENT",
                lambda x: x.max() - x.min(),
            ),
            amt_payment_total_current_min=("AMT_PAYMENT_TOTAL_CURRENT", "min"),
            amt_payment_total_current_max=("AMT_PAYMENT_TOTAL_CURRENT", "max"),
            amt_payment_total_current_mean=("AMT_PAYMENT_TOTAL_CURRENT", "mean"),
            amt_payment_total_current_std=("AMT_PAYMENT_TOTAL_CURRENT", "std"),
            amt_payment_total_current_median=("AMT_PAYMENT_TOTAL_CURRENT", "median"),
            amt_payment_total_current_range=(
                "AMT_PAYMENT_TOTAL_CURRENT",
                lambda x: x.max() - x.min(),
            ),
            amt_receivable_principal_min=("AMT_RECEIVABLE_PRINCIPAL", "min"),
            amt_receivable_principal_max=("AMT_RECEIVABLE_PRINCIPAL", "max"),
            amt_receivable_principal_mean=("AMT_RECEIVABLE_PRINCIPAL", "mean"),
            amt_receivable_principal_std=("AMT_RECEIVABLE_PRINCIPAL", "std"),
            amt_receivable_principal_median=("AMT_RECEIVABLE_PRINCIPAL", "median"),
            amt_receivable_principal_range=(
                "AMT_RECEIVABLE_PRINCIPAL",
                lambda x: x.max() - x.min(),
            ),
            amt_receivable_min=("AMT_RECIVABLE", "min"),
            amt_receivable_max=("AMT_RECIVABLE", "max"),
            amt_receivable_mean=("AMT_RECIVABLE", "mean"),
            amt_receivable_std=("AMT_RECIVABLE", "std"),
            amt_receivable_median=("AMT_RECIVABLE", "median"),
            amt_receivable_range=("AMT_RECIVABLE", lambda x: x.max() - x.min()),
            amt_total_receivable_min=("AMT_TOTAL_RECEIVABLE", "min"),
            amt_total_receivable_max=("AMT_TOTAL_RECEIVABLE", "max"),
            amt_total_receivable_mean=("AMT_TOTAL_RECEIVABLE", "mean"),
            amt_total_receivable_std=("AMT_TOTAL_RECEIVABLE", "std"),
            amt_total_receivable_median=("AMT_TOTAL_RECEIVABLE", "median"),
            amt_total_receivable_range=(
                "AMT_TOTAL_RECEIVABLE",
                lambda x: x.max() - x.min(),
            ),
            cnt_drawings_atm_current_min=("CNT_DRAWINGS_ATM_CURRENT", "min"),
            cnt_drawings_atm_current_max=("CNT_DRAWINGS_ATM_CURRENT", "max"),
            cnt_drawings_atm_current_mean=("CNT_DRAWINGS_ATM_CURRENT", "mean"),
            cnt_drawings_atm_current_std=("CNT_DRAWINGS_ATM_CURRENT", "std"),
            cnt_drawings_atm_current_median=("CNT_DRAWINGS_ATM_CURRENT", "median"),
            cnt_drawings_atm_current_range=(
                "CNT_DRAWINGS_ATM_CURRENT",
                lambda x: x.max() - x.min(),
            ),
            cnt_drawings_current_min=("CNT_DRAWINGS_CURRENT", "min"),
            cnt_drawings_current_max=("CNT_DRAWINGS_CURRENT", "max"),
            cnt_drawings_current_mean=("CNT_DRAWINGS_CURRENT", "mean"),
            cnt_drawings_current_std=("CNT_DRAWINGS_CURRENT", "std"),
            cnt_drawings_current_median=("CNT_DRAWINGS_CURRENT", "median"),
            cnt_drawings_current_range=(
                "CNT_DRAWINGS_CURRENT",
                lambda x: x.max() - x.min(),
            ),
            cnt_drawings_other_current_min=("CNT_DRAWINGS_OTHER_CURRENT", "min"),
            cnt_drawings_other_current_max=("CNT_DRAWINGS_OTHER_CURRENT", "max"),
            cnt_drawings_other_current_mean=("CNT_DRAWINGS_OTHER_CURRENT", "mean"),
            cnt_drawings_other_current_std=("CNT_DRAWINGS_OTHER_CURRENT", "std"),
            cnt_drawings_other_current_median=("CNT_DRAWINGS_OTHER_CURRENT", "median"),
            cnt_drawings_other_current_range=(
                "CNT_DRAWINGS_OTHER_CURRENT",
                lambda x: x.max() - x.min(),
            ),
            cnt_drawings_pos_current_min=("CNT_DRAWINGS_POS_CURRENT", "min"),
            cnt_drawings_pos_current_max=("CNT_DRAWINGS_POS_CURRENT", "max"),
            cnt_drawings_pos_current_mean=("CNT_DRAWINGS_POS_CURRENT", "mean"),
            cnt_drawings_pos_current_std=("CNT_DRAWINGS_POS_CURRENT", "std"),
            cnt_drawings_pos_current_median=("CNT_DRAWINGS_POS_CURRENT", "median"),
            cnt_drawings_pos_current_range=(
                "CNT_DRAWINGS_POS_CURRENT",
                lambda x: x.max() - x.min(),
            ),
            cnt_installment_mature_cum_min=("CNT_INSTALMENT_MATURE_CUM", "min"),
            cnt_installment_mature_cum_max=("CNT_INSTALMENT_MATURE_CUM", "max"),
            cnt_installment_mature_cum_mean=("CNT_INSTALMENT_MATURE_CUM", "mean"),
            cnt_installment_mature_cum_std=("CNT_INSTALMENT_MATURE_CUM", "std"),
            cnt_installment_mature_cum_median=("CNT_INSTALMENT_MATURE_CUM", "median"),
            cnt_installment_mature_cum_range=(
                "CNT_INSTALMENT_MATURE_CUM",
                lambda x: x.max() - x.min(),
            ),
            sk_dpd_credit_card_min=("SK_DPD", "min"),
            sk_dpd_credit_card_max=("SK_DPD", "max"),
            sk_dpd_credit_card_mean=("SK_DPD", "mean"),
            sk_dpd_credit_card_std=("SK_DPD", "std"),
            sk_dpd_credit_card_median=("SK_DPD", "median"),
            sk_dpd_credit_card_range=("SK_DPD", lambda x: x.max() - x.min()),
            sk_dpd_def_credit_card_min=("SK_DPD_DEF", "min"),
            sk_dpd_def_credit_card_max=("SK_DPD_DEF", "max"),
            sk_dpd_def_credit_card_mean=("SK_DPD_DEF", "mean"),
            sk_dpd_def_credit_card_std=("SK_DPD_DEF", "std"),
            sk_dpd_def_credit_card_median=("SK_DPD_DEF", "median"),
            sk_dpd_def_credit_card_range=("SK_DPD_DEF", lambda x: x.max() - x.min()),
        )
        .reset_index()
        .pipe(klib.convert_datatypes)
    )

    credit_card_balance_aggregated.to_feather(file)

del file
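Writing more than a hundred nearly identical aggregation entries by hand is error-prone, because a single mistyped `min` or `max` silently yields a wrong feature. One way to reduce this risk, sketched below under the assumption that every numeric column receives the same six statistics, is to generate the named-aggregation mapping programmatically. The helpers `_range` and `summary_spec` and the toy frame are hypothetical and not part of the project code.

```python
import pandas as pd


def _range(x):
    """Difference between the largest and the smallest value in a group."""
    return x.max() - x.min()


def summary_spec(columns):
    """Build a named-aggregation mapping: {output name: (column, function)}.

    `columns` maps a source column name to the prefix of its output names.
    """
    funcs = {
        "min": "min",
        "max": "max",
        "mean": "mean",
        "std": "std",
        "median": "median",
        "range": _range,
    }
    return {
        f"{prefix}_{suffix}": (column, func)
        for column, prefix in columns.items()
        for suffix, func in funcs.items()
    }


# Toy stand-in for credit_card_balance (values are illustrative only).
toy = pd.DataFrame(
    {
        "SK_ID_CURR": [1, 1, 2],
        "AMT_BALANCE": [0.0, 100.0, 50.0],
        "AMT_DRAWINGS_ATM_CURRENT": [0.0, 180.0, 20.0],
    }
)

spec = summary_spec(
    {
        "AMT_BALANCE": "amt_balance_credit_card",
        "AMT_DRAWINGS_ATM_CURRENT": "amt_drawings_atm_current",
    }
)
result = toy.groupby("SK_ID_CURR").agg(**spec).reset_index()
```

With this approach, the list of source columns and prefixes is the only part that has to be maintained by hand; the count-type features based on `NAME_CONTRACT_STATUS` would still be specified separately.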
Code
credit_card_balance_aggregated.shape
(103558, 122)
Code
credit_card_balance_aggregated.head()
SK_ID_CURR n_previous_credit_card_applications n_previous_credit_card_applications_completed n_previous_credit_card_applications_active n_previous_credit_card_applications_signed n_contracts_credit_card_active n_contracts_credit_card_completed n_contracts_credit_card_signed amt_balance_credit_card_min amt_balance_credit_card_max amt_balance_credit_card_mean amt_balance_credit_card_std amt_balance_credit_card_median amt_balance_credit_card_range amt_credit_limit_actual_min amt_credit_limit_actual_max amt_credit_limit_actual_mean amt_credit_limit_actual_std amt_credit_limit_actual_median amt_credit_limit_actual_range amt_drawings_atm_current_min amt_drawings_atm_current_max amt_drawings_atm_current_mean amt_drawings_atm_current_std amt_drawings_atm_current_median amt_drawings_atm_current_range amt_drawings_current_min amt_drawings_current_max amt_drawings_current_mean amt_drawings_current_std amt_drawings_current_median amt_drawings_current_range amt_drawings_other_current_min amt_drawings_other_current_max amt_drawings_other_current_mean amt_drawings_other_current_std amt_drawings_other_current_median amt_drawings_other_current_range amt_drawings_pos_current_min amt_drawings_pos_current_max amt_drawings_pos_current_mean amt_drawings_pos_current_std amt_drawings_pos_current_median amt_drawings_pos_current_range amt_inst_min_regularity_min amt_inst_min_regularity_max amt_inst_min_regularity_mean amt_inst_min_regularity_std amt_inst_min_regularity_median amt_inst_min_regularity_range amt_payment_current_min amt_payment_current_max amt_payment_current_mean amt_payment_current_std amt_payment_current_median amt_payment_current_range amt_payment_total_current_min amt_payment_total_current_max amt_payment_total_current_mean amt_payment_total_current_std amt_payment_total_current_median amt_payment_total_current_range amt_receivable_principal_min amt_receivable_principal_max amt_receivable_principal_mean amt_receivable_principal_std amt_receivable_principal_median amt_receivable_principal_range amt_receivable_min amt_receivable_max amt_receivable_mean amt_receivable_std amt_receivable_median amt_receivable_range amt_total_receivable_min amt_total_receivable_max amt_total_receivable_mean amt_total_receivable_std amt_total_receivable_median amt_total_receivable_range cnt_drawings_atm_current_min cnt_drawings_atm_current_max cnt_drawings_atm_current_mean cnt_drawings_atm_current_std cnt_drawings_atm_current_median cnt_drawings_atm_current_range cnt_drawings_current_min cnt_drawings_current_max cnt_drawings_current_mean cnt_drawings_current_std cnt_drawings_current_median cnt_drawings_current_range cnt_drawings_other_current_min cnt_drawings_other_current_max cnt_drawings_other_current_mean cnt_drawings_other_current_std cnt_drawings_other_current_median cnt_drawings_other_current_range cnt_drawings_pos_current_min cnt_drawings_pos_current_max cnt_drawings_pos_current_mean cnt_drawings_pos_current_std cnt_drawings_pos_current_median cnt_drawings_pos_current_range cnt_installment_mature_cum_min cnt_installment_mature_cum_max cnt_installment_mature_cum_mean cnt_installment_mature_cum_std cnt_installment_mature_cum_median cnt_installment_mature_cum_range sk_dpd_credit_card_min sk_dpd_credit_card_max sk_dpd_credit_card_mean sk_dpd_credit_card_std sk_dpd_credit_card_median sk_dpd_credit_card_range sk_dpd_def_credit_card_min sk_dpd_def_credit_card_max sk_dpd_def_credit_card_mean sk_dpd_def_credit_card_std sk_dpd_def_credit_card_median sk_dpd_def_credit_card_range
0 100006 6 0 6 0 6 0 0 0.00 0.00 0.00 0.00 0.00 0.00 270000 270000 270000.00 0.00 270000.00 0 NaN NaN NaN NaN NaN NaN 0.00 0.00 0.00 0.00 0.00 0.00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.00 0.00 0.00 0.00 0.00 0.00 NaN NaN NaN NaN NaN NaN 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 NaN NaN NaN NaN NaN NaN 0 0 0.00 0.00 0.00 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.00 0.00 0.00 0.00 0.00 0.00 0 0 0.00 0.00 0.00 0 0 0 0.00 0.00 0.00 0
1 100011 74 0 74 0 74 0 0 0.00 189000.00 54482.11 68127.24 0.00 189000.00 90000 180000 164189.19 34482.74 180000.00 90000 0.00 180000.00 2432.43 20924.57 0.00 0.00 0.00 180000.00 2432.43 20924.57 0.00 180000.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 9000.00 3956.22 4487.75 0.00 9000.00 0.00 55485.00 4843.06 7279.60 563.36 55485.00 0.00 55485.00 4520.07 7473.87 0.00 55485.00 0.00 180000.00 52402.09 65758.82 0.00 180000.00 -563.36 189000.00 54433.18 68166.97 0.00 189563.36 -563.36 189000.00 54433.18 68166.97 0.00 189563.36 0.00 4.00 0.05 0.46 0.00 4.00 0 4 0.05 0.46 0.00 4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 33.00 25.77 10.29 33.00 32.00 0 0 0.00 0.00 0.00 0 0 0 0.00 0.00 0.00 0
2 100013 96 0 96 0 96 0 0 0.00 161420.22 18159.92 43237.41 0.00 161420.22 45000 157500 131718.75 47531.59 157500.00 112500 0.00 157500.00 6350.00 28722.27 0.00 0.00 0.00 157500.00 5953.12 27843.37 0.00 157500.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 7875.00 1454.54 3028.41 0.00 7875.00 0.00 153675.00 7168.35 21626.14 274.32 153675.00 0.00 153675.00 6817.17 21730.66 0.00 153675.00 0.00 157500.00 17255.56 41279.75 0.00 157500.00 -274.32 161420.22 18101.08 43262.03 0.00 161694.54 -274.32 161420.22 18101.08 43262.03 0.00 161694.54 0.00 7.00 0.26 1.19 0.00 7.00 0 7 0.24 1.15 0.00 7 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 22.00 18.72 5.85 22.00 21.00 0 1 0.01 0.10 0.00 1 0 1 0.01 0.10 0.00 1
3 100021 17 10 7 0 7 10 0 0.00 0.00 0.00 0.00 0.00 0.00 675000 675000 675000.00 0.00 675000.00 0 NaN NaN NaN NaN NaN NaN 0.00 0.00 0.00 0.00 0.00 0.00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.00 0.00 0.00 0.00 0.00 0.00 NaN NaN NaN NaN NaN NaN 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 NaN NaN NaN NaN NaN NaN 0 0 0.00 0.00 0.00 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.00 0.00 0.00 0.00 0.00 0.00 0 0 0.00 0.00 0.00 0 0 0 0.00 0.00 0.00 0
4 100023 8 0 8 0 8 0 0 0.00 0.00 0.00 0.00 0.00 0.00 45000 225000 135000.00 96214.05 135000.00 180000 NaN NaN NaN NaN NaN NaN 0.00 0.00 0.00 0.00 0.00 0.00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.00 0.00 0.00 0.00 0.00 0.00 NaN NaN NaN NaN NaN NaN 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 NaN NaN NaN NaN NaN NaN 0 0 0.00 0.00 0.00 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.00 0.00 0.00 0.00 0.00 0.00 0 0 0.00 0.00 0.00 0 0 0 0.00 0.00 0.00 0
Code
an.col_info(credit_card_balance_aggregated, style=True)
  column data_type memory_size n_unique p_unique n_missing p_missing n_dominant p_dominant p_dom_excl_na dominant
1 SK_ID_CURR int32 414.2 kB 103,558 100.0% 0 0% 1 <0.1% <0.1% 100006
2 n_previous_credit_card_applications int16 207.1 kB 132 0.1% 0 0% 7,606 7.3% 7.3% 96
3 n_previous_credit_card_applications_completed int8 103.6 kB 44 <0.1% 0 0% 90,377 87.3% 87.3% 0
4 n_previous_credit_card_applications_active int16 207.1 kB 104 0.1% 0 0% 6,697 6.5% 6.5% 96
5 n_previous_credit_card_applications_signed int8 103.6 kB 45 <0.1% 0 0% 98,629 95.2% 95.2% 0
6 n_contracts_credit_card_active int16 207.1 kB 104 0.1% 0 0% 6,697 6.5% 6.5% 96
7 n_contracts_credit_card_completed int8 103.6 kB 44 <0.1% 0 0% 90,377 87.3% 87.3% 0
8 n_contracts_credit_card_signed int8 103.6 kB 45 <0.1% 0 0% 98,629 95.2% 95.2% 0
9 amt_balance_credit_card_min float64 828.5 kB 13,320 12.9% 0 0% 88,898 85.8% 85.8% 0.0
10 amt_balance_credit_card_max float64 828.5 kB 66,374 64.1% 0 0% 33,355 32.2% 32.2% 0.0
11 amt_balance_credit_card_mean float64 828.5 kB 70,080 67.7% 0 0% 33,325 32.2% 32.2% 0.0
12 amt_balance_credit_card_std float64 828.5 kB 69,961 67.6% 692 0.7% 32,822 31.7% 31.9% 0.0
13 amt_balance_credit_card_median float64 828.5 kB 45,779 44.2% 0 0% 57,018 55.1% 55.1% 0.0
14 amt_balance_credit_card_range float64 828.5 kB 66,657 64.4% 0 0% 33,514 32.4% 32.4% 0.0
15 amt_credit_limit_actual_min int32 414.2 kB 180 0.2% 0 0% 26,273 25.4% 25.4% 45000
16 amt_credit_limit_actual_max int32 414.2 kB 54 0.1% 0 0% 14,894 14.4% 14.4% 135000
17 amt_credit_limit_actual_mean float64 828.5 kB 13,036 12.6% 0 0% 5,575 5.4% 5.4% 45000.0
18 amt_credit_limit_actual_std float64 828.5 kB 26,234 25.3% 692 0.7% 43,431 41.9% 42.2% 0.0
19 amt_credit_limit_actual_median float32 414.2 kB 191 0.2% 0 0% 12,884 12.4% 12.4% 0.0
20 amt_credit_limit_actual_range int32 414.2 kB 185 0.2% 0 0% 44,123 42.6% 42.6% 0
21 amt_drawings_atm_current_min float64 828.5 kB 144 0.1% 31,364 30.3% 71,267 68.8% 98.7% 0.0
22 amt_drawings_atm_current_max float64 828.5 kB 1,370 1.3% 31,364 30.3% 12,162 11.7% 16.8% 0.0
23 amt_drawings_atm_current_mean float64 828.5 kB 24,822 24.0% 31,364 30.3% 12,162 11.7% 16.8% 0.0
24 amt_drawings_atm_current_std float64 828.5 kB 50,080 48.4% 31,866 30.8% 11,947 11.5% 16.7% 0.0
25 amt_drawings_atm_current_median float64 828.5 kB 458 0.4% 31,364 30.3% 61,890 59.8% 85.7% 0.0
26 amt_drawings_atm_current_range float32 414.2 kB 1 <0.1% 31,364 30.3% 72,194 69.7% 100.0% 0.0
27 amt_drawings_current_min float64 828.5 kB 2,363 2.3% 0 0% 100,558 97.1% 97.1% 0.0
28 amt_drawings_current_max float64 828.5 kB 28,333 27.4% 0 0% 33,294 32.2% 32.2% 0.0
29 amt_drawings_current_mean float64 828.5 kB 57,397 55.4% 0 0% 33,293 32.1% 32.1% 0.0
30 amt_drawings_current_std float64 828.5 kB 65,572 63.3% 692 0.7% 32,811 31.7% 31.9% 0.0
31 amt_drawings_current_median float64 828.5 kB 15,653 15.1% 0 0% 81,097 78.3% 78.3% 0.0
32 amt_drawings_current_range float64 828.5 kB 28,388 27.4% 0 0% 33,503 32.4% 32.4% 0.0
33 amt_drawings_other_current_min float32 414.2 kB 7 <0.1% 31,364 30.3% 72,188 69.7% >99.9% 0.0
34 amt_drawings_other_current_max float64 828.5 kB 1,482 1.4% 31,364 30.3% 65,693 63.4% 91.0% 0.0
35 amt_drawings_other_current_mean float64 828.5 kB 4,397 4.2% 31,364 30.3% 65,693 63.4% 91.0% 0.0
36 amt_drawings_other_current_std float64 828.5 kB 5,323 5.1% 31,866 30.8% 65,195 63.0% 90.9% 0.0
37 amt_drawings_other_current_median float64 828.5 kB 45 <0.1% 31,364 30.3% 72,137 69.7% 99.9% 0.0
38 amt_drawings_other_current_range float64 828.5 kB 1,481 1.4% 31,364 30.3% 65,697 63.4% 91.0% 0.0
39 amt_drawings_pos_current_min float64 828.5 kB 2,906 2.8% 31,364 30.3% 68,950 66.6% 95.5% 0.0
40 amt_drawings_pos_current_max float64 828.5 kB 33,877 32.7% 31,364 30.3% 31,370 30.3% 43.5% 0.0
41 amt_drawings_pos_current_mean float64 828.5 kB 39,808 38.4% 31,364 30.3% 31,370 30.3% 43.5% 0.0
42 amt_drawings_pos_current_std float64 828.5 kB 40,122 38.7% 31,866 30.8% 31,159 30.1% 43.5% 0.0
43 amt_drawings_pos_current_median float64 828.5 kB 14,228 13.7% 31,364 30.3% 56,554 54.6% 78.3% 0.0
44 amt_drawings_pos_current_range float64 828.5 kB 33,748 32.6% 31,364 30.3% 31,661 30.6% 43.9% 0.0
45 amt_inst_min_regularity_min float64 828.5 kB 2,652 2.6% 0 0% 98,268 94.9% 94.9% 0.0
46 amt_inst_min_regularity_max float64 828.5 kB 37,619 36.3% 0 0% 33,662 32.5% 32.5% 0.0
47 amt_inst_min_regularity_mean float64 828.5 kB 67,591 65.3% 0 0% 33,662 32.5% 32.5% 0.0
48 amt_inst_min_regularity_std float64 828.5 kB 67,817 65.5% 692 0.7% 33,542 32.4% 32.6% 0.0
49 amt_inst_min_regularity_median float64 828.5 kB 28,060 27.1% 0 0% 57,678 55.7% 55.7% 0.0
50 amt_inst_min_regularity_range float64 828.5 kB 38,181 36.9% 0 0% 34,234 33.1% 33.1% 0.0
51 amt_payment_current_min float64 828.5 kB 11,528 11.1% 31,438 30.4% 45,218 43.7% 62.7% 0.0
52 amt_payment_current_max float64 828.5 kB 29,790 28.8% 31,438 30.4% 1,552 1.5% 2.2% 22500.0
53 amt_payment_current_mean float64 828.5 kB 66,748 64.5% 31,438 30.4% 143 0.1% 0.2% 0.0
54 amt_payment_current_std float64 828.5 kB 69,558 67.2% 31,956 30.9% 672 0.6% 0.9% 0.0
55 amt_payment_current_median float64 828.5 kB 25,981 25.1% 31,438 30.4% 4,491 4.3% 6.2% 9000.0
56 amt_payment_current_range float64 828.5 kB 35,713 34.5% 31,438 30.4% 1,190 1.1% 1.7% 0.0
57 amt_payment_total_current_min float64 828.5 kB 1,754 1.7% 0 0% 100,571 97.1% 97.1% 0.0
58 amt_payment_total_current_max float64 828.5 kB 35,265 34.1% 0 0% 31,936 30.8% 30.8% 0.0
59 amt_payment_total_current_mean float64 828.5 kB 67,932 65.6% 0 0% 31,936 30.8% 30.8% 0.0
60 amt_payment_total_current_std float64 828.5 kB 70,720 68.3% 692 0.7% 31,367 30.3% 30.5% 0.0
61 amt_payment_total_current_median float64 828.5 kB 21,251 20.5% 0 0% 52,198 50.4% 50.4% 0.0
62 amt_payment_total_current_range float64 828.5 kB 35,845 34.6% 0 0% 32,059 31.0% 31.0% 0.0
63 amt_receivable_principal_min float64 828.5 kB 9,864 9.5% 0 0% 90,893 87.8% 87.8% 0.0
64 amt_receivable_principal_max float64 828.5 kB 54,361 52.5% 0 0% 34,174 33.0% 33.0% 0.0
65 amt_receivable_principal_mean float64 828.5 kB 68,980 66.6% 0 0% 34,137 33.0% 33.0% 0.0
66 amt_receivable_principal_std float64 828.5 kB 69,008 66.6% 692 0.7% 33,638 32.5% 32.7% 0.0
67 amt_receivable_principal_median float64 828.5 kB 42,298 40.8% 0 0% 60,279 58.2% 58.2% 0.0
68 amt_receivable_principal_range float64 828.5 kB 56,103 54.2% 0 0% 34,330 33.2% 33.2% 0.0
69 amt_receivable_min float64 828.5 kB 22,497 21.7% 0 0% 75,273 72.7% 72.7% 0.0
70 amt_receivable_max float64 828.5 kB 66,001 63.7% 0 0% 33,593 32.4% 32.4% 0.0
71 amt_receivable_mean float64 828.5 kB 70,224 67.8% 0 0% 33,034 31.9% 31.9% 0.0
72 amt_receivable_std float64 828.5 kB 70,144 67.7% 692 0.7% 32,520 31.4% 31.6% 0.0
73 amt_receivable_median float64 828.5 kB 44,334 42.8% 0 0% 58,685 56.7% 56.7% 0.0
74 amt_receivable_range float64 828.5 kB 67,801 65.5% 0 0% 33,212 32.1% 32.1% 0.0
75 amt_total_receivable_min float64 828.5 kB 22,496 21.7% 0 0% 75,274 72.7% 72.7% 0.0
76 amt_total_receivable_max float64 828.5 kB 66,012 63.7% 0 0% 33,592 32.4% 32.4% 0.0
77 amt_total_receivable_mean float64 828.5 kB 70,224 67.8% 0 0% 33,034 31.9% 31.9% 0.0
78 amt_total_receivable_std float64 828.5 kB 70,145 67.7% 692 0.7% 32,520 31.4% 31.6% 0.0
79 amt_total_receivable_median float64 828.5 kB 44,334 42.8% 0 0% 58,685 56.7% 56.7% 0.0
80 amt_total_receivable_range float64 828.5 kB 67,802 65.5% 0 0% 33,212 32.1% 32.1% 0.0
81 cnt_drawings_atm_current_min float32 414.2 kB 23 <0.1% 31,364 30.3% 71,268 68.8% 98.7% 0.0
82 cnt_drawings_atm_current_max float32 414.2 kB 44 <0.1% 31,364 30.3% 12,162 11.7% 16.8% 0.0
83 cnt_drawings_atm_current_mean float32 414.2 kB 3,544 3.4% 31,364 30.3% 12,162 11.7% 16.8% 0.0
84 cnt_drawings_atm_current_std float32 414.2 kB 23,861 23.0% 31,866 30.8% 11,970 11.6% 16.7% 0.0
85 cnt_drawings_atm_current_median float32 414.2 kB 38 <0.1% 31,364 30.3% 61,890 59.8% 85.7% 0.0
86 cnt_drawings_atm_current_range float32 414.2 kB 44 <0.1% 31,364 30.3% 12,472 12.0% 17.3% 0.0
87 cnt_drawings_current_min int16 207.1 kB 45 <0.1% 0 0% 100,581 97.1% 97.1% 0
88 cnt_drawings_current_max int16 207.1 kB 123 0.1% 0 0% 33,866 32.7% 32.7% 0
89 cnt_drawings_current_mean float32 414.2 kB 7,111 6.9% 0 0% 33,866 32.7% 32.7% 0.0
90 cnt_drawings_current_std float32 414.2 kB 38,027 36.7% 692 0.7% 33,390 32.2% 32.5% 0.0
91 cnt_drawings_current_median float32 414.2 kB 121 0.1% 0 0% 81,288 78.5% 78.5% 0.0
92 cnt_drawings_current_range int16 207.1 kB 123 0.1% 0 0% 34,082 32.9% 32.9% 0
93 cnt_drawings_other_current_min float32 414.2 kB 3 <0.1% 31,364 30.3% 72,188 69.7% >99.9% 0.0
94 cnt_drawings_other_current_max float32 414.2 kB 11 <0.1% 31,364 30.3% 65,675 63.4% 91.0% 0.0
95 cnt_drawings_other_current_mean float32 414.2 kB 470 0.5% 31,364 30.3% 65,675 63.4% 91.0% 0.0
96 cnt_drawings_other_current_std float32 414.2 kB 958 0.9% 31,866 30.8% 65,178 62.9% 90.9% 0.0
97 cnt_drawings_other_current_median float32 414.2 kB 5 <0.1% 31,364 30.3% 72,137 69.7% 99.9% 0.0
98 cnt_drawings_other_current_range float32 414.2 kB 11 <0.1% 31,364 30.3% 65,680 63.4% 91.0% 0.0
99 cnt_drawings_pos_current_min float32 414.2 kB 48 <0.1% 31,364 30.3% 68,950 66.6% 95.5% 0.0
100 cnt_drawings_pos_current_max float32 414.2 kB 128 0.1% 31,364 30.3% 31,370 30.3% 43.5% 0.0
101 cnt_drawings_pos_current_mean float32 414.2 kB 5,492 5.3% 31,364 30.3% 31,370 30.3% 43.5% 0.0
102 cnt_drawings_pos_current_std float32 414.2 kB 21,472 20.7% 31,866 30.8% 31,173 30.1% 43.5% 0.0
103 cnt_drawings_pos_current_median float32 414.2 kB 123 0.1% 31,364 30.3% 56,554 54.6% 78.3% 0.0
104 cnt_drawings_pos_current_range float32 414.2 kB 124 0.1% 31,364 30.3% 31,675 30.6% 43.9% 0.0
105 cnt_installment_mature_cum_min float32 414.2 kB 30 <0.1% 0 0% 66,905 64.6% 64.6% 0.0
106 cnt_installment_mature_cum_max float32 414.2 kB 121 0.1% 0 0% 33,312 32.2% 32.2% 0.0
107 cnt_installment_mature_cum_mean float32 414.2 kB 15,471 14.9% 0 0% 33,312 32.2% 32.2% 0.0
108 cnt_installment_mature_cum_std float32 414.2 kB 19,210 18.5% 692 0.7% 33,274 32.1% 32.3% 0.0
109 cnt_installment_mature_cum_median float32 414.2 kB 148 0.1% 0 0% 35,237 34.0% 34.0% 0.0
110 cnt_installment_mature_cum_range float32 414.2 kB 96 0.1% 0 0% 33,966 32.8% 32.8% 0.0
111 sk_dpd_credit_card_min int16 207.1 kB 2 <0.1% 0 0% 103,557 >99.9% >99.9% 0
112 sk_dpd_credit_card_max int16 207.1 kB 438 0.4% 0 0% 82,898 80.0% 80.0% 0
113 sk_dpd_credit_card_mean float32 414.2 kB 3,945 3.8% 0 0% 82,898 80.0% 80.0% 0.0
114 sk_dpd_credit_card_std float32 414.2 kB 5,159 5.0% 692 0.7% 82,206 79.4% 79.9% 0.0
115 sk_dpd_credit_card_median float32 414.2 kB 292 0.3% 0 0% 102,677 99.1% 99.1% 0.0
116 sk_dpd_credit_card_range int16 207.1 kB 438 0.4% 0 0% 82,898 80.0% 80.0% 0
117 sk_dpd_def_credit_card_min int16 207.1 kB 2 <0.1% 0 0% 103,557 >99.9% >99.9% 0
118 sk_dpd_def_credit_card_max int16 207.1 kB 62 0.1% 0 0% 86,529 83.6% 83.6% 0
119 sk_dpd_def_credit_card_mean float32 414.2 kB 1,629 1.6% 0 0% 86,529 83.6% 83.6% 0.0
120 sk_dpd_def_credit_card_std float32 414.2 kB 2,275 2.2% 692 0.7% 85,837 82.9% 83.4% 0.0
121 sk_dpd_def_credit_card_median float32 414.2 kB 26 <0.1% 0 0% 103,494 99.9% 99.9% 0.0
122 sk_dpd_def_credit_card_range int16 207.1 kB 62 0.1% 0 0% 86,529 83.6% 83.6% 0
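
The statistics in the table above come from the project's own an.col_info helper, whose implementation is not shown in this section. As a rough illustration of how such columns can be read, the following minimal sketch approximates a few of them in plain pandas. The function name summarize_columns and the toy data are hypothetical, and the definitions (for example, p_dom_excl_na as the share of the most frequent value among non-missing values) are assumptions inferred from the printed numbers, not the author's code.

```python
import pandas as pd


def summarize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Approximate a few col_info-style statistics for each column (assumed definitions)."""
    n_rows = len(df)
    rows = []
    for name in df.columns:
        col = df[name]
        # Frequencies of non-missing values, most frequent first
        counts = col.value_counts(dropna=True)
        n_missing = int(col.isna().sum())
        n_present = n_rows - n_missing
        n_dominant = int(counts.iloc[0]) if len(counts) else 0
        rows.append(
            {
                "column": name,
                "data_type": str(col.dtype),
                "n_unique": int(col.nunique(dropna=True)),
                "n_missing": n_missing,
                "p_missing": n_missing / n_rows if n_rows else float("nan"),
                "n_dominant": n_dominant,
                # Share of the most frequent value among all rows
                "p_dominant": n_dominant / n_rows if n_rows else float("nan"),
                # Share of the most frequent value among non-missing rows only
                "p_dom_excl_na": n_dominant / n_present if n_present else float("nan"),
                "dominant": counts.index[0] if len(counts) else None,
            }
        )
    return pd.DataFrame(rows)


# Hypothetical toy data: one column with a missing value, one constant column
toy = pd.DataFrame({"x": [0.0, 0.0, 1.0, None], "y": [1, 1, 1, 1]})
summary = summarize_columns(toy).set_index("column")
```

Under these assumed definitions, a large gap between p_dominant and p_dom_excl_na (as for the amt_drawings_atm_current_* columns) indicates a column where many values are missing and the remaining ones are mostly a single value.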

5.7 Merge and Further Pre-Process Tables

The following tables should be merged on the SK_ID_CURR variable, with the aggregated tables left-joined to the application dataset:

  • application_train
  • bureau_aggregated
  • bureau_balance_aggregated
  • previous_application_aggregated
  • installments_payments_aggregated
  • pos_cash_balance_aggregated
  • credit_card_balance_aggregated

  1. First, the aggregated data will be merged with the application_train table and inspected.
  2. In Section 6.1, features that are either redundant (duplicated or correlated with other features) or problematic (e.g., constant or almost constant) will be identified based on the training set only, and the code for the required pre-processing (e.g., removing the unnecessary columns) will be created.
  3. In Section 6.2, the training, validation, and test datasets will be created: the tables with aggregated features will be merged with the application_train, application_validation, and application_test datasets, respectively, and the pre-processing steps created in the previous sections will be applied.
Code
def merge_credit_history(to, on="SK_ID_CURR"):
    """Left-join all aggregated credit history tables to the data frame `to`."""
    merged = (
        to.merge(bureau_aggregated, on=on, how="left", suffixes=("", "_bureau"))
        .merge(
            bureau_balance_aggregated,
            on=on,
            how="left",
            suffixes=("", "_bureau_balance"),
        )
        .merge(
            previous_application_aggregated,
            on=on,
            how="left",
            suffixes=("", "_previous_application"),
        )
        .merge(
            installments_payments_aggregated,
            on=on,
            how="left",
            suffixes=("", "_installments_payments"),
        )
        .merge(
            pos_cash_balance_aggregated,
            on=on,
            how="left",
            suffixes=("", "_pos_cash_balance"),
        )
        .merge(
            credit_card_balance_aggregated,
            on=on,
            how="left",
            suffixes=("", "_credit_card_balance"),
        )
    )
    return merged
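
A left join keeps every application row only if the join key is unique in each aggregated table. Otherwise rows are silently duplicated. The following minimal sketch shows one way this could be checked with the validate and indicator parameters of pandas merge. The two toy tables and their values are hypothetical stand-ins, and the check is not part of the original pipeline.

```python
import pandas as pd

# Hypothetical toy stand-ins for an application table and one aggregated table
applications = pd.DataFrame(
    {"SK_ID_CURR": [1, 2, 3], "AMT_CREDIT": [100.0, 200.0, 300.0]}
)
aggregated = pd.DataFrame({"SK_ID_CURR": [1, 3], "n_credits_total": [4, 2]})

merged = applications.merge(
    aggregated,
    on="SK_ID_CURR",
    how="left",
    validate="one_to_one",  # raises MergeError if a key is duplicated on either side
    indicator=True,  # adds a "_merge" column: "both" or "left_only"
)

# A left join on unique keys must preserve the number of application rows
assert len(merged) == len(applications)

# Applicants without any record in the aggregated table get NaN in its columns
n_without_history = int((merged["_merge"] == "left_only").sum())
```

Applicants that are absent from an aggregated table are kept, with missing values in that table's columns. This is why many credit history features in the merged dataset contain NaN.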
Code
def preprocess_credit_data(df):
    """Recode flags, engineer new features, and drop unused columns."""
    education_values = [
        "Lower secondary",
        "Secondary / secondary special",
        "Incomplete higher",
        "Higher education",
        "Academic degree",
    ]

    education_dtype = pd.CategoricalDtype(categories=education_values, ordered=True)

    return (
        df.rename(
            columns={"n_nflag_insured_on_approval_any": "any_nflag_insured_on_approval"}
        )
        .assign(
            # Feature engineering
            any_nflag_insured_on_approval=lambda df: (
                df["any_nflag_insured_on_approval"] == "True"
            ).astype("Int8"),
            FLAG_OWN_CAR=lambda df: (df["FLAG_OWN_CAR"] == "Y").astype("Int8"),
            FLAG_OWN_REALTY=lambda df: (df["FLAG_OWN_REALTY"] == "Y").astype("Int8"),
            FLAG_IS_EMERGENCY=lambda df: (df["EMERGENCYSTATE_MODE"] == "Yes").astype(
                "Int8"
            ),
            NAME_EDUCATION_TYPE=lambda df: df["NAME_EDUCATION_TYPE"].astype(
                education_dtype
            ),
            ord_education_type=lambda df: df["NAME_EDUCATION_TYPE"].cat.codes,
            flag_has_children=lambda df: (df["CNT_CHILDREN"] > 0).astype("Int8"),
            DAYS_EMPLOYED=lambda df: df["DAYS_EMPLOYED"].replace(365243, np.nan),
            years_employed=lambda df: df["DAYS_EMPLOYED"] / -365,
            amt_income_total_per_family_member=lambda df: df["AMT_INCOME_TOTAL"]
            / df["CNT_FAM_MEMBERS"],
            cnt_fam_members_excluding_children=lambda df: df["CNT_FAM_MEMBERS"]
            - df["CNT_CHILDREN"],
            amt_annuity_to_credit_ratio=lambda df: df["AMT_ANNUITY"] / df["AMT_CREDIT"],
            amt_annuity_to_income_ratio=lambda df: df["AMT_ANNUITY"]
            / df["AMT_INCOME_TOTAL"],
            amt_credit_to_income_ratio=lambda df: df["AMT_CREDIT"]
            / df["AMT_INCOME_TOTAL"],
            amt_annuity_to_income_per_family_member=lambda df: df["AMT_ANNUITY"]
            / df["amt_income_total_per_family_member"],
            # Make explicit the missing values: XNA → NaN
            ORGANIZATION_TYPE=lambda df: df["ORGANIZATION_TYPE"].replace("XNA", np.nan),
        )
        .drop(
            columns=[
                "SK_ID_CURR",
                # Restricted by legal constraints
                "CODE_GENDER",
                "NAME_FAMILY_STATUS",
                "DAYS_BIRTH",
                # Not useful, unethical
                "WEEKDAY_APPR_PROCESS_START",
                "HOUR_APPR_PROCESS_START",
                # Already used/processed
                "EMERGENCYSTATE_MODE",
                "DAYS_EMPLOYED",
            ]
        )
    )
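
Two steps of the function above are easy to misread: the ordinal encoding of education and the handling of the value 365243 in DAYS_EMPLOYED, which the function replaces with a missing value. The following minimal sketch reproduces only these two steps on hypothetical toy rows. The column names and category order are taken from the function, while the data values are made up for illustration.

```python
import numpy as np
import pandas as pd

education_dtype = pd.CategoricalDtype(
    categories=[
        "Lower secondary",
        "Secondary / secondary special",
        "Incomplete higher",
        "Higher education",
        "Academic degree",
    ],
    ordered=True,
)

# Hypothetical toy rows (not taken from the dataset)
toy = pd.DataFrame(
    {
        "NAME_EDUCATION_TYPE": ["Higher education", "Lower secondary", None],
        "DAYS_EMPLOYED": [-1030, 365243, -730],
    }
)

result = toy.assign(
    # Later keyword arguments of assign() see the columns created by earlier ones
    NAME_EDUCATION_TYPE=lambda d: d["NAME_EDUCATION_TYPE"].astype(education_dtype),
    # Position of each category in the ordered list; missing values get code -1
    ord_education_type=lambda d: d["NAME_EDUCATION_TYPE"].cat.codes,
    # Treat 365243 as a placeholder for "missing", as the function above does
    DAYS_EMPLOYED=lambda d: d["DAYS_EMPLOYED"].replace(365243, np.nan),
    # Days are counted backwards from the application date, hence the negative divisor
    years_employed=lambda d: d["DAYS_EMPLOYED"] / -365,
)
```

Note that cat.codes encodes a missing category as -1 rather than NaN. In the training set shown below this does not matter, because NAME_EDUCATION_TYPE has no missing values, but it could matter for new data.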
Code
file = dir_interim + "merged--credits_train--01.feather"

if os.path.exists(file):
    credits_train = pd.read_feather(file)
else:
    credits_train = (
        merge_credit_history(to=application_train)
        .pipe(klib.convert_datatypes)
        .pipe(preprocess_credit_data)
    )
    credits_train.to_feather(file)

del file

# Time: 1m 1.4s
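
The "load from disk if the file exists, otherwise compute and save" pattern used above could also be wrapped into a small helper. The following is only a sketch: the name load_or_compute is hypothetical, and pickle is used as the default format here purely to keep the example dependency-free, whereas the report itself stores feather files (for which pd.read_feather and pd.DataFrame.to_feather could be passed instead).

```python
import os
import tempfile

import pandas as pd


def load_or_compute(path, compute, read=pd.read_pickle, write=pd.DataFrame.to_pickle):
    """Return the cached data frame at `path`, or compute, cache, and return it."""
    if os.path.exists(path):
        return read(path)
    result = compute()
    write(result, path)
    return result


# Toy demonstration: the expensive step runs only on the first call
calls = []


def build():
    calls.append(1)  # record each time the computation is actually executed
    return pd.DataFrame({"a": [1, 2]})


with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "toy.pkl")
    first = load_or_compute(cache_path, build)
    second = load_or_compute(cache_path, build)
```

As with the original cell, a cached file is reused even if the upstream code has changed, so the file has to be deleted manually to force a recomputation.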

5.8 Inspect Training Set

Code
credits_train.shape
(215257, 548)
Code
credits_train.head()
TARGET NAME_CONTRACT_TYPE FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE NAME_TYPE_SUITE NAME_INCOME_TYPE NAME_EDUCATION_TYPE NAME_HOUSING_TYPE REGION_POPULATION_RELATIVE DAYS_REGISTRATION DAYS_ID_PUBLISH OWN_CAR_AGE FLAG_MOBIL FLAG_EMP_PHONE FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL OCCUPATION_TYPE CNT_FAM_MEMBERS REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY ORGANIZATION_TYPE EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3 APARTMENTS_AVG BASEMENTAREA_AVG YEARS_BEGINEXPLUATATION_AVG YEARS_BUILD_AVG COMMONAREA_AVG ELEVATORS_AVG ENTRANCES_AVG FLOORSMAX_AVG FLOORSMIN_AVG LANDAREA_AVG LIVINGAPARTMENTS_AVG LIVINGAREA_AVG NONLIVINGAPARTMENTS_AVG NONLIVINGAREA_AVG APARTMENTS_MODE BASEMENTAREA_MODE YEARS_BEGINEXPLUATATION_MODE YEARS_BUILD_MODE COMMONAREA_MODE ELEVATORS_MODE ENTRANCES_MODE FLOORSMAX_MODE FLOORSMIN_MODE LANDAREA_MODE LIVINGAPARTMENTS_MODE LIVINGAREA_MODE NONLIVINGAPARTMENTS_MODE NONLIVINGAREA_MODE APARTMENTS_MEDI BASEMENTAREA_MEDI YEARS_BEGINEXPLUATATION_MEDI YEARS_BUILD_MEDI COMMONAREA_MEDI ELEVATORS_MEDI ENTRANCES_MEDI FLOORSMAX_MEDI FLOORSMIN_MEDI LANDAREA_MEDI LIVINGAPARTMENTS_MEDI LIVINGAREA_MEDI NONLIVINGAPARTMENTS_MEDI NONLIVINGAREA_MEDI FONDKAPREMONT_MODE HOUSETYPE_MODE TOTALAREA_MODE WALLSMATERIAL_MODE OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE FLAG_DOCUMENT_2 FLAG_DOCUMENT_3 FLAG_DOCUMENT_4 FLAG_DOCUMENT_5 FLAG_DOCUMENT_6 FLAG_DOCUMENT_7 FLAG_DOCUMENT_8 FLAG_DOCUMENT_9 FLAG_DOCUMENT_10 FLAG_DOCUMENT_11 FLAG_DOCUMENT_12 FLAG_DOCUMENT_13 FLAG_DOCUMENT_14 FLAG_DOCUMENT_15 FLAG_DOCUMENT_16 FLAG_DOCUMENT_17 FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK 
AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR n_credits_total n_credits_active n_credits_closed n_credits_bad_debt n_credits_sold mode_credit_currency n_different_currencies n_currency_1 n_currency_2 n_currency_3 n_currency_4 days_credit_min days_credit_max days_credit_mean days_credit_std days_credit_median days_credit_range days_credit_overdue_min days_credit_overdue_max days_credit_overdue_mean days_credit_overdue_std days_credit_overdue_median days_credit_overdue_range days_credit_enddate_min days_credit_enddate_max days_credit_enddate_mean days_credit_enddate_std days_credit_enddate_median days_credit_enddate_range days_enddate_fact_min days_enddate_fact_max days_enddate_fact_mean days_enddate_fact_std days_enddate_fact_median days_enddate_fact_range amt_credit_max_overdue_min ... cnt_installment_future_range cnt_installments_diff_min cnt_installments_diff_max cnt_installments_diff_mean cnt_installments_diff_std cnt_installments_diff_median cnt_installments_diff_range sk_dpd_pos_applications_min sk_dpd_pos_applications_max sk_dpd_pos_applications_mean sk_dpd_pos_applications_std sk_dpd_pos_applications_median sk_dpd_pos_applications_range sk_dpd_def_pos_applications_min sk_dpd_def_pos_applications_max sk_dpd_def_pos_applications_mean sk_dpd_def_pos_applications_std sk_dpd_def_pos_applications_median sk_dpd_def_pos_applications_range n_previous_credit_card_applications n_previous_credit_card_applications_completed n_previous_credit_card_applications_active n_previous_credit_card_applications_signed n_contracts_credit_card_active n_contracts_credit_card_completed n_contracts_credit_card_signed amt_balance_credit_card_min amt_balance_credit_card_max amt_balance_credit_card_mean amt_balance_credit_card_std amt_balance_credit_card_median amt_balance_credit_card_range amt_credit_limit_actual_min amt_credit_limit_actual_max amt_credit_limit_actual_mean amt_credit_limit_actual_std amt_credit_limit_actual_median 
amt_credit_limit_actual_range amt_drawings_atm_current_min amt_drawings_atm_current_max amt_drawings_atm_current_mean amt_drawings_atm_current_std amt_drawings_atm_current_median amt_drawings_atm_current_range amt_drawings_current_min amt_drawings_current_max amt_drawings_current_mean amt_drawings_current_std amt_drawings_current_median amt_drawings_current_range amt_drawings_other_current_min amt_drawings_other_current_max amt_drawings_other_current_mean amt_drawings_other_current_std amt_drawings_other_current_median amt_drawings_other_current_range amt_drawings_pos_current_min amt_drawings_pos_current_max amt_drawings_pos_current_mean amt_drawings_pos_current_std amt_drawings_pos_current_median amt_drawings_pos_current_range amt_inst_min_regularity_min amt_inst_min_regularity_max amt_inst_min_regularity_mean amt_inst_min_regularity_std amt_inst_min_regularity_median amt_inst_min_regularity_range amt_payment_current_min amt_payment_current_max amt_payment_current_mean amt_payment_current_std amt_payment_current_median amt_payment_current_range amt_payment_total_current_min amt_payment_total_current_max amt_payment_total_current_mean amt_payment_total_current_std amt_payment_total_current_median amt_payment_total_current_range amt_receivable_principal_min amt_receivable_principal_max amt_receivable_principal_mean amt_receivable_principal_std amt_receivable_principal_median amt_receivable_principal_range amt_receivable_min amt_receivable_max amt_receivable_mean amt_receivable_std amt_receivable_median amt_receivable_range amt_total_receivable_min amt_total_receivable_max amt_total_receivable_mean amt_total_receivable_std amt_total_receivable_median amt_total_receivable_range cnt_drawings_atm_current_min cnt_drawings_atm_current_max cnt_drawings_atm_current_mean cnt_drawings_atm_current_std cnt_drawings_atm_current_median cnt_drawings_atm_current_range cnt_drawings_current_min cnt_drawings_current_max cnt_drawings_current_mean cnt_drawings_current_std 
cnt_drawings_current_median cnt_drawings_current_range cnt_drawings_other_current_min cnt_drawings_other_current_max cnt_drawings_other_current_mean cnt_drawings_other_current_std cnt_drawings_other_current_median cnt_drawings_other_current_range cnt_drawings_pos_current_min cnt_drawings_pos_current_max cnt_drawings_pos_current_mean cnt_drawings_pos_current_std cnt_drawings_pos_current_median cnt_drawings_pos_current_range cnt_installment_mature_cum_min cnt_installment_mature_cum_max cnt_installment_mature_cum_mean cnt_installment_mature_cum_std cnt_installment_mature_cum_median cnt_installment_mature_cum_range sk_dpd_credit_card_min sk_dpd_credit_card_max sk_dpd_credit_card_mean sk_dpd_credit_card_std sk_dpd_credit_card_median sk_dpd_credit_card_range sk_dpd_def_credit_card_min sk_dpd_def_credit_card_max sk_dpd_def_credit_card_mean sk_dpd_def_credit_card_std sk_dpd_def_credit_card_median sk_dpd_def_credit_card_range FLAG_IS_EMERGENCY ord_education_type flag_has_children years_employed amt_income_total_per_family_member cnt_fam_members_excluding_children amt_annuity_to_credit_ratio amt_annuity_to_income_ratio amt_credit_to_income_ratio amt_annuity_to_income_per_family_member
0 0 Cash loans 1 1 2 405000.00 1971072.00 68643.00 1800000.00 Unaccompanied Commercial associate Higher education House / apartment 0.01 -7460.00 -1823 13.00 1 1 0 1 0 0 Accountants 4.00 3 3 0 0 0 0 0 0 Self-employed 0.68 0.33 0.64 0.12 0.10 0.98 0.78 NaN 0.00 0.24 0.17 0.21 0.00 0.10 0.12 NaN 0.03 0.12 0.10 0.98 0.79 NaN 0.00 0.24 0.17 0.21 0.00 0.11 0.13 NaN 0.03 0.12 0.10 0.98 0.79 NaN 0.00 0.24 0.17 0.21 0.00 0.10 0.13 NaN 0.03 reg oper account block of flats 0.10 Stone, brick 4.00 0.00 4.00 0.00 -2169.00 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00 0.00 0.00 0.00 0.00 0.00 4.00 2.00 2.00 0.00 0.00 currency 1 1.00 4.00 0.00 0.00 0.00 -1239.00 -145.00 -846.75 489.28 -1001.50 1094.00 0.00 0.00 0.00 0.00 0.00 0.00 -746 934 51.00 698.62 8.00 1680 -746 -362 -554.00 271.53 -554.00 384 0.00 ... 12.00 0.00 12.00 6.00 3.89 6.00 12.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0 3 1 2.82 101250.00 2.00 0.03 0.17 4.87 0.68
1 0 Cash loans 0 1 0 337500.00 508495.50 38146.50 454500.00 Family State servant Higher education House / apartment 0.01 -4054.00 -1090 NaN 1 1 0 1 0 0 Managers 2.00 2 2 0 0 0 0 0 0 Agriculture NaN 0.62 0.44 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2.00 1.00 2.00 1.00 -659.00 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00 0.00 0.00 0.00 0.00 6.00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> NaN ... 24.00 0.00 11.00 5.41 3.51 5.50 11.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 11.00 0.00 11.00 0.00 11.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 765000.00 765000.00 765000.00 0.00 765000.00 0.00 NaN NaN NaN NaN NaN NaN 0.00 0.00 0.00 0.00 0.00 0.00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.00 0.00 0.00 0.00 0.00 0.00 NaN NaN NaN NaN NaN NaN 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 NaN NaN NaN NaN NaN NaN 0.00 0.00 0.00 0.00 0.00 0.00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0 3 0 3.31 168750.00 2.00 0.08 0.11 1.51 0.23
2 0 Cash loans 0 1 1 112500.00 110146.50 13068.00 90000.00 Unaccompanied Commercial associate Secondary / secondary special House / apartment 0.01 -5554.00 -4130 NaN 1 1 0 1 1 1 Laborers 3.00 2 2 0 0 0 0 0 0 Business Entity Type 3 0.36 0.65 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.00 0.00 0.00 0.00 -172.00 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> NaN ... 60.00 0.00 10.00 3.23 2.82 2.50 10.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0 1 1 1.62 37500.00 2.00 0.12 0.12 0.98 0.35
3 0 Cash loans 0 1 2 40500.00 66384.00 3519.00 45000.00 Unaccompanied Commercial associate Secondary / secondary special House / apartment 0.03 -5285.00 -5290 NaN 1 1 0 1 0 0 Sales staff 4.00 2 2 0 0 0 0 0 0 Self-employed 0.39 0.60 0.45 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.00 0.00 0.00 0.00 -1576.00 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00 0.00 0.00 1.00 0.00 2.00 5.00 3.00 2.00 0.00 0.00 currency 1 1.00 5.00 0.00 0.00 0.00 -1345.00 -325.00 -728.00 398.50 -545.00 1020.00 0.00 0.00 0.00 0.00 0.00 0.00 -679 30905 6060.20 13897.16 41.00 31584 -649 -518 -583.50 92.63 -583.50 131 NaN ... 24.00 0.00 24.00 8.53 7.40 6.00 24.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0 1 1 14.73 10125.00 2.00 0.05 0.09 1.64 0.35
4 0 Cash loans 1 0 0 225000.00 298512.00 31801.50 270000.00 Unaccompanied Commercial associate Secondary / secondary special House / apartment 0.02 -86.00 -3033 11.00 1 1 0 1 0 0 Drivers 2.00 2 2 0 0 0 0 0 0 Construction 0.74 0.66 0.72 0.30 0.14 1.00 0.99 0.10 0.40 0.17 0.46 0.00 0.00 0.24 0.25 0.00 0.00 0.30 0.14 1.00 0.99 0.10 0.40 0.17 0.46 0.00 0.00 0.26 0.26 0.00 0.00 0.30 0.14 1.00 0.99 0.10 0.40 0.17 0.46 0.00 0.00 0.25 0.25 0.00 0.00 reg oper account block of flats 0.27 Stone, brick 3.00 0.00 3.00 0.00 -624.00 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00 0.00 0.00 0.00 0.00 0.00 3.00 1.00 2.00 0.00 0.00 currency 1 1.00 3.00 0.00 0.00 0.00 -2861.00 -965.00 -1644.00 1056.31 -1106.00 1896.00 0.00 0.00 0.00 0.00 0.00 0.00 -2526 703 -569.67 1719.64 114.00 3229 -2501 -723 -1612.00 1257.24 -1612.00 1778 41400.00 ... 10.00 0.00 5.00 2.50 1.87 2.50 5.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0 1 0 3.27 112500.00 2.00 0.11 0.14 1.33 0.28

5 rows × 548 columns

Summary information on all columns of the merged dataset:

Column info (whole dataset)
credits_train_col_info = an.col_info(credits_train)
credits_train_col_info.pipe(an.style_col_info)
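
`an.col_info` is this project's own helper, and its implementation is not shown in this section. As a rough sketch only (not the author's implementation), a summary with similar columns could be computed with plain pandas. The function name `col_info_sketch` is hypothetical. The definitions are assumptions inferred from the column names and values in Table 5.1: `p_dominant` appears to be the share of the most frequent value among all rows, and `p_dom_excl_na` its share among non-missing rows only.

```python
import pandas as pd


def col_info_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column summary loosely mirroring the columns of Table 5.1.

    Hypothetical stand-in for `an.col_info`; the definitions are assumptions.
    """
    n_rows = len(df)
    records = []
    for name in df.columns:
        col = df[name]
        # Frequencies of non-missing values, most frequent first.
        counts = col.value_counts(dropna=True)
        n_missing = int(col.isna().sum())
        n_present = n_rows - n_missing
        has_values = n_present > 0
        n_dominant = int(counts.iloc[0]) if has_values else 0
        records.append(
            {
                "column": name,
                "data_type": str(col.dtype),
                "n_unique": int(col.nunique(dropna=True)),
                "n_missing": n_missing,
                "p_missing": n_missing / n_rows if n_rows else float("nan"),
                "n_dominant": n_dominant,
                # Share of the dominant value among all rows.
                "p_dominant": n_dominant / n_rows if n_rows else float("nan"),
                # Share of the dominant value among non-missing rows only.
                "p_dom_excl_na": n_dominant / n_present if has_values else float("nan"),
                "dominant": counts.index[0] if has_values else None,
            }
        )
    return pd.DataFrame.from_records(records)


# Small illustrative frame (made-up values, not taken from the dataset).
demo = pd.DataFrame({"a": [1, 1, 1, 2], "b": [None, "x", "x", "y"]})
print(col_info_sketch(demo))
```

The sketch returns proportions as fractions rather than formatted percentages and omits memory usage. Table 5.1 below is the output of the report's own `an.col_info`, not of this sketch.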
Table 5.1. All columns of the merged dataset.
  column data_type memory_size n_unique p_unique n_missing p_missing n_dominant p_dominant p_dom_excl_na dominant
1 TARGET int8 215.3 kB 2 <0.1% 0 0% 197,880 91.9% 91.9% 0
2 NAME_CONTRACT_TYPE category 215.5 kB 2 <0.1% 0 0% 194,675 90.4% 90.4% Cash loans
3 FLAG_OWN_CAR Int8 430.5 kB 2 <0.1% 0 0% 142,086 66.0% 66.0% 0
4 FLAG_OWN_REALTY Int8 430.5 kB 2 <0.1% 0 0% 149,412 69.4% 69.4% 1
5 CNT_CHILDREN int8 215.3 kB 12 <0.1% 0 0% 150,641 70.0% 70.0% 0
6 AMT_INCOME_TOTAL float64 1.7 MB 1,949 0.9% 0 0% 24,982 11.6% 11.6% 135000.0
7 AMT_CREDIT float32 861.0 kB 5,097 2.4% 0 0% 6,823 3.2% 3.2% 450000.0
8 AMT_ANNUITY float32 861.0 kB 12,801 5.9% 8 <0.1% 4,499 2.1% 2.1% 9000.0
9 AMT_GOODS_PRICE float32 861.0 kB 828 0.4% 187 0.1% 18,194 8.5% 8.5% 450000.0
10 NAME_TYPE_SUITE category 216.0 kB 7 <0.1% 901 0.4% 174,089 80.9% 81.2% Unaccompanied
11 NAME_INCOME_TYPE category 216.1 kB 8 <0.1% 0 0% 110,984 51.6% 51.6% Working
12 NAME_EDUCATION_TYPE category 215.8 kB 5 <0.1% 0 0% 152,993 71.1% 71.1% Secondary / secondary special
13 NAME_HOUSING_TYPE category 215.9 kB 6 <0.1% 0 0% 191,159 88.8% 88.8% House / apartment
14 REGION_POPULATION_RELATIVE float32 861.0 kB 81 <0.1% 0 0% 11,494 5.3% 5.3% 0.035792
15 DAYS_REGISTRATION float32 861.0 kB 15,249 7.1% 0 0% 79 <0.1% <0.1% -7.0
16 DAYS_ID_PUBLISH int16 430.5 kB 6,122 2.8% 0 0% 119 0.1% 0.1% -4074
17 OWN_CAR_AGE float32 861.0 kB 61 <0.1% 142,091 66.0% 5,232 2.4% 7.2% 7.0
18 FLAG_MOBIL int8 215.3 kB 2 <0.1% 0 0% 215,256 >99.9% >99.9% 1
19 FLAG_EMP_PHONE int8 215.3 kB 2 <0.1% 0 0% 176,491 82.0% 82.0% 1
20 FLAG_WORK_PHONE int8 215.3 kB 2 <0.1% 0 0% 172,406 80.1% 80.1% 0
21 FLAG_CONT_MOBILE int8 215.3 kB 2 <0.1% 0 0% 214,855 99.8% 99.8% 1
22 FLAG_PHONE int8 215.3 kB 2 <0.1% 0 0% 154,906 72.0% 72.0% 0
23 FLAG_EMAIL int8 215.3 kB 2 <0.1% 0 0% 203,006 94.3% 94.3% 0
24 OCCUPATION_TYPE category 217.1 kB 18 <0.1% 67,480 31.3% 38,591 17.9% 26.1% Laborers
25 CNT_FAM_MEMBERS float32 861.0 kB 12 <0.1% 1 <0.1% 110,671 51.4% 51.4% 2.0
26 REGION_RATING_CLIENT int8 215.3 kB 3 <0.1% 0 0% 158,846 73.8% 73.8% 2
27 REGION_RATING_CLIENT_W_CITY int8 215.3 kB 3 <0.1% 0 0% 160,564 74.6% 74.6% 2
28 REG_REGION_NOT_LIVE_REGION int8 215.3 kB 2 <0.1% 0 0% 211,999 98.5% 98.5% 0
29 REG_REGION_NOT_WORK_REGION int8 215.3 kB 2 <0.1% 0 0% 204,222 94.9% 94.9% 0
30 LIVE_REGION_NOT_WORK_REGION int8 215.3 kB 2 <0.1% 0 0% 206,386 95.9% 95.9% 0
31 REG_CITY_NOT_LIVE_CITY int8 215.3 kB 2 <0.1% 0 0% 198,549 92.2% 92.2% 0
32 REG_CITY_NOT_WORK_CITY int8 215.3 kB 2 <0.1% 0 0% 165,697 77.0% 77.0% 0
33 LIVE_CITY_NOT_WORK_CITY int8 215.3 kB 2 <0.1% 0 0% 176,518 82.0% 82.0% 0
34 ORGANIZATION_TYPE category 221.3 kB 57 <0.1% 38,756 18.0% 47,582 22.1% 27.0% Business Entity Type 3
35 EXT_SOURCE_1 float32 861.0 kB 83,961 39.0% 121,373 56.4% 5 <0.1% <0.1% 0.44398212
36 EXT_SOURCE_2 float32 861.0 kB 102,229 47.5% 464 0.2% 503 0.2% 0.2% 0.28589788
37 EXT_SOURCE_3 float32 861.0 kB 804 0.4% 42,680 19.8% 985 0.5% 0.6% 0.7463002
38 APARTMENTS_AVG float32 861.0 kB 2,207 1.0% 109,076 50.7% 4,712 2.2% 4.4% 0.0825
39 BASEMENTAREA_AVG float32 861.0 kB 3,626 1.7% 125,793 58.4% 10,282 4.8% 11.5% 0.0
40 YEARS_BEGINEXPLUATATION_AVG float32 861.0 kB 260 0.1% 104,910 48.7% 3,073 1.4% 2.8% 0.9871
41 YEARS_BUILD_AVG float32 861.0 kB 146 0.1% 143,036 66.4% 2,132 1.0% 3.0% 0.8232
42 COMMONAREA_AVG float32 861.0 kB 2,964 1.4% 150,300 69.8% 5,899 2.7% 9.1% 0.0
43 ELEVATORS_AVG float32 861.0 kB 241 0.1% 114,570 53.2% 60,109 27.9% 59.7% 0.0
44 ENTRANCES_AVG float32 861.0 kB 266 0.1% 108,270 50.3% 23,867 11.1% 22.3% 0.1379
45 FLOORSMAX_AVG float32 861.0 kB 371 0.2% 106,970 49.7% 43,449 20.2% 40.1% 0.1667
46 FLOORSMIN_AVG float32 861.0 kB 280 0.1% 146,054 67.9% 23,117 10.7% 33.4% 0.2083
47 LANDAREA_AVG float32 861.0 kB 3,360 1.6% 127,644 59.3% 10,845 5.0% 12.4% 0.0
48 LIVINGAPARTMENTS_AVG float32 861.0 kB 1,761 0.8% 147,049 68.3% 2,984 1.4% 4.4% 0.0504
49 LIVINGAREA_AVG float32 861.0 kB 4,983 2.3% 107,990 50.2% 202 0.1% 0.2% 0.0
50 NONLIVINGAPARTMENTS_AVG float32 861.0 kB 345 0.2% 149,354 69.4% 38,319 17.8% 58.1% 0.0
51 NONLIVINGAREA_AVG float32 861.0 kB 3,042 1.4% 118,577 55.1% 41,099 19.1% 42.5% 0.0
52 APARTMENTS_MODE float32 861.0 kB 744 0.3% 109,076 50.7% 5,301 2.5% 5.0% 0.084
53 BASEMENTAREA_MODE float32 861.0 kB 3,687 1.7% 125,793 58.4% 11,561 5.4% 12.9% 0.0
54 YEARS_BEGINEXPLUATATION_MODE float32 861.0 kB 210 0.1% 104,910 48.7% 3,039 1.4% 2.8% 0.9871
55 YEARS_BUILD_MODE float32 861.0 kB 152 0.1% 143,036 66.4% 2,090 1.0% 2.9% 0.8301
56 COMMONAREA_MODE float32 861.0 kB 2,908 1.4% 150,300 69.8% 6,770 3.1% 10.4% 0.0
57 ELEVATORS_MODE float32 861.0 kB 26 <0.1% 114,570 53.2% 62,808 29.2% 62.4% 0.0
58 ENTRANCES_MODE float32 861.0 kB 30 <0.1% 108,270 50.3% 25,310 11.8% 23.7% 0.1379
59 FLOORSMAX_MODE float32 861.0 kB 25 <0.1% 106,970 49.7% 46,048 21.4% 42.5% 0.1667
60 FLOORSMIN_MODE float32 861.0 kB 25 <0.1% 146,054 67.9% 24,209 11.2% 35.0% 0.2083
61 LANDAREA_MODE float32 861.0 kB 3,406 1.6% 127,644 59.3% 12,121 5.6% 13.8% 0.0
62 LIVINGAPARTMENTS_MODE float32 861.0 kB 715 0.3% 147,049 68.3% 3,447 1.6% 5.1% 0.0551
63 LIVINGAREA_MODE float32 861.0 kB 5,083 2.4% 107,990 50.2% 310 0.1% 0.3% 0.0
64 NONLIVINGAPARTMENTS_MODE float32 861.0 kB 148 0.1% 149,354 69.4% 41,574 19.3% 63.1% 0.0
65 NONLIVINGAREA_MODE float32 861.0 kB 3,090 1.4% 118,577 55.1% 46,933 21.8% 48.5% 0.0
66 APARTMENTS_MEDI float32 861.0 kB 1,120 0.5% 109,076 50.7% 5,000 2.3% 4.7% 0.0833
67 BASEMENTAREA_MEDI float32 861.0 kB 3,614 1.7% 125,793 58.4% 10,458 4.9% 11.7% 0.0
68 YEARS_BEGINEXPLUATATION_MEDI float32 861.0 kB 232 0.1% 104,910 48.7% 3,060 1.4% 2.8% 0.9871
69 YEARS_BUILD_MEDI float32 861.0 kB 148 0.1% 143,036 66.4% 2,118 1.0% 2.9% 0.8256
70 COMMONAREA_MEDI float32 861.0 kB 2,982 1.4% 150,300 69.8% 6,068 2.8% 9.3% 0.0
71 ELEVATORS_MEDI float32 861.0 kB 46 <0.1% 114,570 53.2% 61,040 28.4% 60.6% 0.0
72 ENTRANCES_MEDI float32 861.0 kB 46 <0.1% 108,270 50.3% 24,940 11.6% 23.3% 0.1379
73 FLOORSMAX_MEDI float32 861.0 kB 49 <0.1% 106,970 49.7% 44,659 20.7% 41.2% 0.1667
74 FLOORSMIN_MEDI float32 861.0 kB 47 <0.1% 146,054 67.9% 23,733 11.0% 34.3% 0.2083
75 LANDAREA_MEDI float32 861.0 kB 3,393 1.6% 127,644 59.3% 11,058 5.1% 12.6% 0.0
76 LIVINGAPARTMENTS_MEDI float32 861.0 kB 1,063 0.5% 147,049 68.3% 3,142 1.5% 4.6% 0.0513
77 LIVINGAREA_MEDI float32 861.0 kB 5,067 2.4% 107,990 50.2% 210 0.1% 0.2% 0.0
78 NONLIVINGAPARTMENTS_MEDI float32 861.0 kB 190 0.1% 149,354 69.4% 39,384 18.3% 59.8% 0.0
79 NONLIVINGAREA_MEDI float32 861.0 kB 3,083 1.4% 118,577 55.1% 42,610 19.8% 44.1% 0.0
80 FONDKAPREMONT_MODE category 215.7 kB 4 <0.1% 147,099 68.3% 51,785 24.1% 76.0% reg oper account
81 HOUSETYPE_MODE category 215.6 kB 3 <0.1% 107,834 50.1% 105,515 49.0% 98.2% block of flats
82 TOTALAREA_MODE float32 861.0 kB 4,896 2.3% 103,833 48.2% 417 0.2% 0.4% 0.0
83 WALLSMATERIAL_MODE category 216.0 kB 7 <0.1% 109,329 50.8% 46,298 21.5% 43.7% Panel
84 OBS_30_CNT_SOCIAL_CIRCLE float32 861.0 kB 32 <0.1% 714 0.3% 114,550 53.2% 53.4% 0.0
85 DEF_30_CNT_SOCIAL_CIRCLE float32 861.0 kB 10 <0.1% 714 0.3% 189,988 88.3% 88.6% 0.0
86 OBS_60_CNT_SOCIAL_CIRCLE float32 861.0 kB 32 <0.1% 714 0.3% 115,085 53.5% 53.6% 0.0
87 DEF_60_CNT_SOCIAL_CIRCLE float32 861.0 kB 9 <0.1% 714 0.3% 196,614 91.3% 91.6% 0.0
88 DAYS_LAST_PHONE_CHANGE float32 861.0 kB 3,720 1.7% 1 <0.1% 26,201 12.2% 12.2% 0.0
89 FLAG_DOCUMENT_2 int8 215.3 kB 2 <0.1% 0 0% 215,246 >99.9% >99.9% 0
90 FLAG_DOCUMENT_3 int8 215.3 kB 2 <0.1% 0 0% 152,845 71.0% 71.0% 1
91 FLAG_DOCUMENT_4 int8 215.3 kB 2 <0.1% 0 0% 215,238 >99.9% >99.9% 0
92 FLAG_DOCUMENT_5 int8 215.3 kB 2 <0.1% 0 0% 212,025 98.5% 98.5% 0
93 FLAG_DOCUMENT_6 int8 215.3 kB 2 <0.1% 0 0% 196,348 91.2% 91.2% 0
94 FLAG_DOCUMENT_7 int8 215.3 kB 2 <0.1% 0 0% 215,221 >99.9% >99.9% 0
95 FLAG_DOCUMENT_8 int8 215.3 kB 2 <0.1% 0 0% 197,689 91.8% 91.8% 0
96 FLAG_DOCUMENT_9 int8 215.3 kB 2 <0.1% 0 0% 214,440 99.6% 99.6% 0
97 FLAG_DOCUMENT_10 int8 215.3 kB 2 <0.1% 0 0% 215,253 >99.9% >99.9% 0
98 FLAG_DOCUMENT_11 int8 215.3 kB 2 <0.1% 0 0% 214,448 99.6% 99.6% 0
99 FLAG_DOCUMENT_12 int8 215.3 kB 2 <0.1% 0 0% 215,256 >99.9% >99.9% 0
100 FLAG_DOCUMENT_13 int8 215.3 kB 2 <0.1% 0 0% 214,541 99.7% 99.7% 0
101 FLAG_DOCUMENT_14 int8 215.3 kB 2 <0.1% 0 0% 214,614 99.7% 99.7% 0
102 FLAG_DOCUMENT_15 int8 215.3 kB 2 <0.1% 0 0% 215,015 99.9% 99.9% 0
103 FLAG_DOCUMENT_16 int8 215.3 kB 2 <0.1% 0 0% 213,089 99.0% 99.0% 0
104 FLAG_DOCUMENT_17 int8 215.3 kB 2 <0.1% 0 0% 215,200 >99.9% >99.9% 0
105 FLAG_DOCUMENT_18 int8 215.3 kB 2 <0.1% 0 0% 213,525 99.2% 99.2% 0
106 FLAG_DOCUMENT_19 int8 215.3 kB 2 <0.1% 0 0% 215,124 99.9% 99.9% 0
107 FLAG_DOCUMENT_20 int8 215.3 kB 2 <0.1% 0 0% 215,146 99.9% 99.9% 0
108 FLAG_DOCUMENT_21 int8 215.3 kB 2 <0.1% 0 0% 215,187 >99.9% >99.9% 0
109 AMT_REQ_CREDIT_BUREAU_HOUR float32 861.0 kB 5 <0.1% 29,081 13.5% 185,061 86.0% 99.4% 0.0
110 AMT_REQ_CREDIT_BUREAU_DAY float32 861.0 kB 9 <0.1% 29,081 13.5% 185,147 86.0% 99.4% 0.0
111 AMT_REQ_CREDIT_BUREAU_WEEK float32 861.0 kB 9 <0.1% 29,081 13.5% 180,246 83.7% 96.8% 0.0
112 AMT_REQ_CREDIT_BUREAU_MON float32 861.0 kB 22 <0.1% 29,081 13.5% 155,679 72.3% 83.6% 0.0
113 AMT_REQ_CREDIT_BUREAU_QRT float32 861.0 kB 10 <0.1% 29,081 13.5% 150,895 70.1% 81.0% 0.0
114 AMT_REQ_CREDIT_BUREAU_YEAR float32 861.0 kB 24 <0.1% 29,081 13.5% 50,313 23.4% 27.0% 0.0
115 n_credits_total float32 861.0 kB 57 <0.1% 30,836 14.3% 25,129 11.7% 13.6% 1.0
116 n_credits_active float32 861.0 kB 22 <0.1% 30,836 14.3% 51,735 24.0% 28.1% 1.0
117 n_credits_closed float32 861.0 kB 52 <0.1% 30,836 14.3% 37,807 17.6% 20.5% 1.0
118 n_credits_bad_debt float32 861.0 kB 2 <0.1% 30,836 14.3% 184,408 85.7% >99.9% 0.0
119 n_credits_sold float32 861.0 kB 7 <0.1% 30,836 14.3% 180,711 84.0% 98.0% 0.0
120 mode_credit_currency category 215.6 kB 3 <0.1% 30,836 14.3% 184,386 85.7% >99.9% currency 1
121 n_different_currencies float32 861.0 kB 3 <0.1% 30,836 14.3% 183,765 85.4% 99.6% 1.0
122 n_currency_1 float32 861.0 kB 58 <0.1% 30,836 14.3% 25,155 11.7% 13.6% 1.0
123 n_currency_2 float32 861.0 kB 7 <0.1% 30,836 14.3% 183,835 85.4% 99.7% 0.0
124 n_currency_3 float32 861.0 kB 4 <0.1% 30,836 14.3% 184,319 85.6% 99.9% 0.0
125 n_currency_4 float32 861.0 kB 2 <0.1% 30,836 14.3% 184,414 85.7% >99.9% 0.0
126 days_credit_min float32 861.0 kB 2,921 1.4% 30,836 14.3% 205 0.1% 0.1% -2919.0
127 days_credit_max float32 861.0 kB 2,922 1.4% 30,836 14.3% 480 0.2% 0.3% -91.0
128 days_credit_mean float32 861.0 kB 53,697 24.9% 30,836 14.3% 61 <0.1% <0.1% -441.0
129 days_credit_std float32 861.0 kB 133,052 61.8% 55,965 26.0% 1,383 0.6% 0.9% 0.0
130 days_credit_median float32 861.0 kB 5,711 2.7% 30,836 14.3% 118 0.1% 0.1% -561.0
131 days_credit_range float32 861.0 kB 2,913 1.4% 30,836 14.3% 26,512 12.3% 14.4% 0.0
132 days_credit_overdue_min float32 861.0 kB 69 <0.1% 30,836 14.3% 184,320 85.6% 99.9% 0.0
133 days_credit_overdue_max float32 861.0 kB 671 0.3% 30,836 14.3% 182,056 84.6% 98.7% 0.0
134 days_credit_overdue_mean float32 861.0 kB 1,195 0.6% 30,836 14.3% 182,056 84.6% 98.7% 0.0
135 days_credit_overdue_std float32 861.0 kB 1,441 0.7% 55,965 26.0% 157,026 72.9% 98.6% 0.0
136 days_credit_overdue_median float32 861.0 kB 168 0.1% 30,836 14.3% 184,119 85.5% 99.8% 0.0
137 days_credit_overdue_range float32 861.0 kB 655 0.3% 30,836 14.3% 182,155 84.6% 98.8% 0.0
138 days_credit_enddate_min Int32 1.1 MB 6,266 2.9% 32,432 15.1% 119 0.1% 0.1% -2359
139 days_credit_enddate_max Int32 1.1 MB 12,274 5.7% 32,432 15.1% 187 0.1% 0.1% 31060
140 days_credit_enddate_mean Float64 1.9 MB 77,581 36.0% 32,432 15.1% 46 <0.1% <0.1% -99.0
141 days_credit_enddate_std Float64 1.9 MB 134,001 62.3% 59,197 27.5% 1,369 0.6% 0.9% 0.0
142 days_credit_enddate_median Float32 1.1 MB 13,238 6.1% 32,432 15.1% 113 0.1% 0.1% 0.0
143 days_credit_enddate_range Int32 1.1 MB 17,383 8.1% 32,432 15.1% 28,134 13.1% 15.4% 0
144 days_enddate_fact_min Int32 1.1 MB 2,901 1.3% 53,870 25.0% 122 0.1% 0.1% -2450
145 days_enddate_fact_max Int16 645.8 kB 2,793 1.3% 53,870 25.0% 340 0.2% 0.2% -84
146 days_enddate_fact_mean Float32 1.1 MB 35,685 16.6% 53,870 25.0% 71 <0.1% <0.1% -795.0
147 days_enddate_fact_std Float32 1.1 MB 93,662 43.5% 91,572 42.5% 921 0.4% 0.7% 0.0
148 days_enddate_fact_median Float32 1.1 MB 5,341 2.5% 53,870 25.0% 135 0.1% 0.1% -919.0
149 days_enddate_fact_range Int32 1.1 MB 2,796 1.3% 53,870 25.0% 38,623 17.9% 23.9% 0
150 amt_credit_max_overdue_min float64 1.7 MB 9,923 4.6% 86,638 40.2% 116,256 54.0% 90.4% 0.0
151 amt_credit_max_overdue_max float64 1.7 MB 32,871 15.3% 86,638 40.2% 79,549 37.0% 61.8% 0.0
152 amt_credit_max_overdue_mean float64 1.7 MB 39,837 18.5% 86,638 40.2% 79,549 37.0% 61.8% 0.0
153 amt_credit_max_overdue_std float64 1.7 MB 35,648 16.6% 132,328 61.5% 43,267 20.1% 52.2% 0.0
154 amt_credit_max_overdue_median float64 1.7 MB 21,151 9.8% 86,638 40.2% 100,477 46.7% 78.1% 0.0
155 amt_credit_max_overdue_range float64 1.7 MB 27,267 12.7% 86,638 40.2% 88,957 41.3% 69.2% 0.0
156 cnt_credit_prolong_min float32 861.0 kB 6 <0.1% 30,836 14.3% 184,215 85.6% 99.9% 0.0
157 cnt_credit_prolong_max float32 861.0 kB 9 <0.1% 30,836 14.3% 178,412 82.9% 96.7% 0.0
158 cnt_credit_prolong_mean float32 861.0 kB 100 <0.1% 30,836 14.3% 178,412 82.9% 96.7% 0.0
159 cnt_credit_prolong_std float32 861.0 kB 167 0.1% 55,965 26.0% 153,489 71.3% 96.4% 0.0
160 cnt_credit_prolong_median float32 861.0 kB 8 <0.1% 30,836 14.3% 183,844 85.4% 99.7% 0.0
161 cnt_credit_prolong_range float32 861.0 kB 9 <0.1% 30,836 14.3% 178,618 83.0% 96.9% 0.0
162 cnt_credit_prolong_sum float32 861.0 kB 10 <0.1% 30,836 14.3% 178,412 82.9% 96.7% 0.0
163 amt_credit_sum_min float64 1.7 MB 44,136 20.5% 30,836 14.3% 30,083 14.0% 16.3% 0.0
164 amt_credit_sum_max float64 1.7 MB 49,429 23.0% 30,836 14.3% 6,293 2.9% 3.4% 450000.0
165 amt_credit_sum_mean float64 1.7 MB 150,070 69.7% 30,836 14.3% 943 0.4% 0.5% 225000.0
166 amt_credit_sum_std float64 1.7 MB 148,439 69.0% 55,965 26.0% 1,156 0.5% 0.7% 0.0
167 amt_credit_sum_median float64 1.7 MB 77,800 36.1% 30,836 14.3% 5,011 2.3% 2.7% 225000.0
168 amt_credit_sum_range float64 1.7 MB 94,343 43.8% 30,836 14.3% 26,285 12.2% 14.3% 0.0
169 amt_credit_sum_sum float64 1.7 MB 147,742 68.6% 30,836 14.3% 924 0.4% 0.5% 225000.0
170 amt_credit_sum_debt_min float64 1.7 MB 20,754 9.6% 36,039 16.7% 155,688 72.3% 86.9% 0.0
171 amt_credit_sum_debt_max float64 1.7 MB 104,430 48.5% 36,039 16.7% 49,345 22.9% 27.5% 0.0
172 amt_credit_sum_debt_mean float64 1.7 MB 121,544 56.5% 36,039 16.7% 48,543 22.6% 27.1% 0.0
173 amt_credit_sum_debt_std float64 1.7 MB 116,314 54.0% 65,302 30.3% 31,435 14.6% 21.0% 0.0
174 amt_credit_sum_debt_median float64 1.7 MB 48,592 22.6% 36,039 16.7% 120,818 56.1% 67.4% 0.0
175 amt_credit_sum_debt_range float64 1.7 MB 98,760 45.9% 36,039 16.7% 60,698 28.2% 33.9% 0.0
176 amt_credit_sum_debt_sum float64 1.7 MB 113,811 52.9% 30,836 14.3% 53,746 25.0% 29.1% 0.0
177 amt_credit_sum_limit_min float64 1.7 MB 2,121 1.0% 45,585 21.2% 167,209 77.7% 98.5% 0.0
178 amt_credit_sum_limit_max float64 1.7 MB 24,324 11.3% 45,585 21.2% 135,642 63.0% 79.9% 0.0
179 amt_credit_sum_limit_mean float64 1.7 MB 27,475 12.8% 45,585 21.2% 135,599 63.0% 79.9% 0.0
180 amt_credit_sum_limit_std float64 1.7 MB 26,937 12.5% 80,896 37.6% 102,265 47.5% 76.1% 0.0
181 amt_credit_sum_limit_median float64 1.7 MB 5,916 2.7% 45,585 21.2% 162,356 75.4% 95.7% 0.0
182 amt_credit_sum_limit_range float64 1.7 MB 22,987 10.7% 45,585 21.2% 137,576 63.9% 81.1% 0.0
183 amt_credit_sum_limit_sum float64 1.7 MB 26,367 12.2% 30,836 14.3% 150,348 69.8% 81.5% 0.0
184 amt_credit_sum_overdue_min float32 861.0 kB 81 <0.1% 30,836 14.3% 184,318 85.6% 99.9% 0.0
185 amt_credit_sum_overdue_max float64 1.7 MB 918 0.4% 30,836 14.3% 182,090 84.6% 98.7% 0.0
186 amt_credit_sum_overdue_mean float64 1.7 MB 1,424 0.7% 30,836 14.3% 182,090 84.6% 98.7% 0.0
187 amt_credit_sum_overdue_std float64 1.7 MB 1,618 0.8% 55,965 26.0% 157,060 73.0% 98.6% 0.0
188 amt_credit_sum_overdue_median float64 1.7 MB 200 0.1% 30,836 14.3% 184,121 85.5% 99.8% 0.0
189 amt_credit_sum_overdue_range float64 1.7 MB 895 0.4% 30,836 14.3% 182,189 84.6% 98.8% 0.0
190 amt_credit_sum_overdue_sum float64 1.7 MB 930 0.4% 30,836 14.3% 182,090 84.6% 98.7% 0.0
191 mode_credit_type category 215.8 kB 6 <0.1% 30,836 14.3% 160,802 74.7% 87.2% Consumer credit
192 n_different_credit_types float32 861.0 kB 5 <0.1% 30,836 14.3% 100,733 46.8% 54.6% 2.0
193 n_consumer_credits float32 861.0 kB 51 <0.1% 30,836 14.3% 33,496 15.6% 18.2% 1.0
194 n_credit_card_credits float32 861.0 kB 22 <0.1% 30,836 14.3% 63,863 29.7% 34.6% 0.0
195 n_car_loans float32 861.0 kB 9 <0.1% 30,836 14.3% 170,683 79.3% 92.6% 0.0
196 n_mortgages float32 861.0 kB 7 <0.1% 30,836 14.3% 174,434 81.0% 94.6% 0.0
197 n_microloans float32 861.0 kB 28 <0.1% 30,836 14.3% 181,975 84.5% 98.7% 0.0
198 n_other_type_credit float32 861.0 kB 9 <0.1% 30,836 14.3% 182,373 84.7% 98.9% 0.0
199 days_credit_update_min float32 861.0 kB 2,949 1.4% 30,836 14.3% 549 0.3% 0.3% -19.0
200 days_credit_update_max float32 861.0 kB 2,585 1.2% 30,836 14.3% 7,529 3.5% 4.1% -7.0
201 days_credit_update_mean float32 861.0 kB 46,055 21.4% 30,836 14.3% 512 0.2% 0.3% -12.0
202 days_credit_update_std float64 1.7 MB 131,798 61.2% 55,965 26.0% 1,885 0.9% 1.2% 0.0
203 days_credit_update_median float32 861.0 kB 4,779 2.2% 30,836 14.3% 1,055 0.5% 0.6% -22.0
204 days_credit_update_range float32 861.0 kB 2,925 1.4% 30,836 14.3% 27,014 12.5% 14.6% 0.0
205 amt_annuity_min float64 1.7 MB 9,921 4.6% 159,480 74.1% 36,975 17.2% 66.3% 0.0
206 amt_annuity_max float64 1.7 MB 18,638 8.7% 159,480 74.1% 13,781 6.4% 24.7% 0.0
207 amt_annuity_mean float64 1.7 MB 29,816 13.9% 159,480 74.1% 13,781 6.4% 24.7% 0.0
208 amt_annuity_std float64 1.7 MB 25,917 12.0% 171,585 79.7% 15,071 7.0% 34.5% 0.0
209 amt_annuity_median float64 1.7 MB 16,441 7.6% 159,480 74.1% 23,785 11.0% 42.6% 0.0
210 amt_annuity_range float64 1.7 MB 15,462 7.2% 159,480 74.1% 27,176 12.6% 48.7% 0.0
211 bureau_months_balance_min float32 861.0 kB 97 <0.1% 152,586 70.9% 1,508 0.7% 2.4% -95.0
212 bureau_months_balance_max float32 861.0 kB 89 <0.1% 152,586 70.9% 59,695 27.7% 95.3% 0.0
213 bureau_dpd_status_min float32 861.0 kB 6 <0.1% 152,586 70.9% 62,638 29.1% 99.9% 0.0
214 bureau_dpd_status_max float32 861.0 kB 6 <0.1% 152,586 70.9% 41,042 19.1% 65.5% 0.0
215 bureau_dpd_status_mean float32 861.0 kB 3,772 1.8% 152,586 70.9% 41,042 19.1% 65.5% 0.0
216 bureau_dpd_status_std float32 861.0 kB 7,016 3.3% 153,149 71.1% 40,500 18.8% 65.2% 0.0
217 bureau_dpd_status_median float32 861.0 kB 11 <0.1% 152,586 70.9% 61,726 28.7% 98.5% 0.0
218 bureau_dpd_status_range float32 861.0 kB 6 <0.1% 152,586 70.9% 41,063 19.1% 65.5% 0.0
219 n_different_loans float32 861.0 kB 4 <0.1% 11,456 5.3% 77,974 36.2% 38.3% 2.0
220 n_cash_loans float32 861.0 kB 55 <0.1% 11,456 5.3% 83,697 38.9% 41.1% 0.0
221 n_consumer_loans float32 861.0 kB 36 <0.1% 11,456 5.3% 78,331 36.4% 38.4% 1.0
222 n_revolving_loans float32 861.0 kB 25 <0.1% 11,456 5.3% 130,792 60.8% 64.2% 0.0
223 amt_annuity_min_previous_application float64 1.7 MB 113,816 52.9% 11,752 5.5% 16,017 7.4% 7.9% 2250.0
224 amt_annuity_max_previous_application float64 1.7 MB 110,598 51.4% 11,752 5.5% 2,363 1.1% 1.2% 22500.0
225 amt_annuity_mean_previous_application float64 1.7 MB 191,798 89.1% 11,752 5.5% 367 0.2% 0.2% 2250.0
226 amt_annuity_std_previous_application float64 1.7 MB 157,678 73.3% 56,274 26.1% 296 0.1% 0.2% 0.0
227 amt_annuity_median_previous_application float64 1.7 MB 157,063 73.0% 11,752 5.5% 1,357 0.6% 0.7% 11250.0
228 amt_annuity_range_previous_application float64 1.7 MB 146,639 68.1% 11,752 5.5% 44,818 20.8% 22.0% 0.0
229 amt_application_min float64 1.7 MB 29,672 13.8% 11,456 5.3% 95,786 44.5% 47.0% 0.0
230 amt_application_max float64 1.7 MB 39,568 18.4% 11,456 5.3% 9,541 4.4% 4.7% 450000.0
231 amt_application_mean float64 1.7 MB 142,462 66.2% 11,456 5.3% 736 0.3% 0.4% 0.0
232 amt_application_std float64 1.7 MB 150,921 70.1% 48,154 22.4% 1,132 0.5% 0.7% 0.0
233 amt_application_median float64 1.7 MB 63,472 29.5% 11,456 5.3% 10,838 5.0% 5.3% 0.0
234 amt_application_range float64 1.7 MB 51,986 24.2% 11,456 5.3% 37,830 17.6% 18.6% 0.0
235 amt_credit_min float64 1.7 MB 33,220 15.4% 11,456 5.3% 79,660 37.0% 39.1% 0.0
236 amt_credit_max float64 1.7 MB 49,618 23.1% 11,456 5.3% 4,696 2.2% 2.3% 450000.0
237 amt_credit_mean float64 1.7 MB 156,814 72.8% 11,456 5.3% 293 0.1% 0.1% 45000.0
238 amt_credit_std float64 1.7 MB 157,015 72.9% 48,154 22.4% 340 0.2% 0.2% 0.0
239 amt_credit_median float64 1.7 MB 73,966 34.4% 11,456 5.3% 8,095 3.8% 4.0% 0.0
240 amt_credit_range float64 1.7 MB 71,950 33.4% 11,456 5.3% 37,038 17.2% 18.2% 0.0
241 amt_down_payment_min float64 1.7 MB 10,194 4.7% 23,703 11.0% 125,181 58.2% 65.4% 0.0
242 amt_down_payment_max float64 1.7 MB 17,607 8.2% 23,703 11.0% 53,725 25.0% 28.0% 0.0
243 amt_down_payment_mean float64 1.7 MB 42,577 19.8% 23,703 11.0% 53,725 25.0% 28.0% 0.0
244 amt_down_payment_std float64 1.7 MB 57,310 26.6% 99,327 46.1% 19,374 9.0% 16.7% 0.0
245 amt_down_payment_median float64 1.7 MB 19,734 9.2% 23,703 11.0% 74,539 34.6% 38.9% 0.0
246 amt_down_payment_range float64 1.7 MB 17,144 8.0% 23,703 11.0% 94,998 44.1% 49.6% 0.0
247 amt_goods_price_min float64 1.7 MB 39,170 18.2% 12,169 5.7% 11,596 5.4% 5.7% 45000.0
248 amt_goods_price_max float64 1.7 MB 39,563 18.4% 12,169 5.7% 9,543 4.4% 4.7% 450000.0
249 amt_goods_price_mean float64 1.7 MB 138,760 64.5% 12,169 5.7% 777 0.4% 0.4% 135000.0
250 amt_goods_price_std float64 1.7 MB 140,074 65.1% 56,728 26.4% 1,360 0.6% 0.9% 0.0
251 amt_goods_price_median float64 1.7 MB 67,080 31.2% 12,169 5.7% 4,499 2.1% 2.2% 135000.0
252 amt_goods_price_range float64 1.7 MB 79,283 36.8% 12,169 5.7% 45,919 21.3% 22.6% 0.0
253 rate_down_payment_min float32 861.0 kB 46,257 21.5% 23,703 11.0% 125,181 58.2% 65.4% 0.0
254 rate_down_payment_max float32 861.0 kB 84,883 39.4% 23,703 11.0% 53,725 25.0% 28.0% 0.0
255 rate_down_payment_mean float32 861.0 kB 116,968 54.3% 23,703 11.0% 53,725 25.0% 28.0% 0.0
256 rate_down_payment_std float32 861.0 kB 88,115 40.9% 99,327 46.1% 19,263 8.9% 16.6% 0.0
257 rate_down_payment_median float32 861.0 kB 87,629 40.7% 23,703 11.0% 74,539 34.6% 38.9% 0.0
258 rate_down_payment_range float32 861.0 kB 73,615 34.2% 23,703 11.0% 94,887 44.1% 49.5% 0.0
259 rate_interest_primary_min float32 861.0 kB 119 0.1% 212,016 98.5% 666 0.3% 20.5% 0.18913634
260 rate_interest_primary_max float32 861.0 kB 119 0.1% 212,016 98.5% 674 0.3% 20.8% 0.18913634
261 rate_interest_primary_mean float32 861.0 kB 160 0.1% 212,016 98.5% 655 0.3% 20.2% 0.18913634
262 rate_interest_primary_std float32 861.0 kB 39 <0.1% 215,139 99.9% 37 <0.1% 31.4% 0.0
263 rate_interest_primary_median float32 861.0 kB 157 0.1% 212,016 98.5% 655 0.3% 20.2% 0.18913634
264 rate_interest_primary_range float32 861.0 kB 37 <0.1% 212,016 98.5% 3,160 1.5% 97.5% 0.0
265 rate_interest_primary_count float32 861.0 kB 4 <0.1% 11,456 5.3% 200,560 93.2% 98.4% 0.0
266 rate_interest_privileged_min float32 861.0 kB 21 <0.1% 212,016 98.5% 892 0.4% 27.5% 0.83509517
267 rate_interest_privileged_max float32 861.0 kB 21 <0.1% 212,016 98.5% 906 0.4% 28.0% 0.83509517
268 rate_interest_privileged_mean float32 861.0 kB 42 <0.1% 212,016 98.5% 881 0.4% 27.2% 0.83509517
269 rate_interest_privileged_std float32 861.0 kB 21 <0.1% 215,139 99.9% 50 <0.1% 42.4% 0.0
270 rate_interest_privileged_median float32 861.0 kB 40 <0.1% 212,016 98.5% 881 0.4% 27.2% 0.83509517
271 rate_interest_privileged_range float32 861.0 kB 20 <0.1% 212,016 98.5% 3,173 1.5% 97.9% 0.0
272 rate_interest_privileged_count float32 861.0 kB 4 <0.1% 11,456 5.3% 200,560 93.2% 98.4% 0.0
273 n_different_contract_types float32 861.0 kB 4 <0.1% 11,456 5.3% 77,974 36.2% 38.3% 2.0
274 n_contract_status_approved float32 861.0 kB 25 <0.1% 11,456 5.3% 53,519 24.9% 26.3% 1.0
275 n_contract_status_canceled float32 861.0 kB 36 <0.1% 11,456 5.3% 126,281 58.7% 62.0% 0.0
276 n_contract_status_refused float32 861.0 kB 44 <0.1% 11,456 5.3% 133,394 62.0% 65.5% 0.0
277 n_contract_status_unused_offer float32 861.0 kB 11 <0.1% 11,456 5.3% 190,553 88.5% 93.5% 0.0
278 days_decision_min float32 861.0 kB 2,921 1.4% 11,456 5.3% 136 0.1% 0.1% -476.0
279 days_decision_max float32 861.0 kB 2,921 1.4% 11,456 5.3% 598 0.3% 0.3% -7.0
280 days_decision_mean float32 861.0 kB 50,330 23.4% 11,456 5.3% 109 0.1% 0.1% -351.0
281 days_decision_std float32 861.0 kB 129,009 59.9% 48,154 22.4% 3,867 1.8% 2.3% 0.0
282 days_decision_median float32 861.0 kB 5,656 2.6% 11,456 5.3% 255 0.1% 0.1% -364.0
283 days_decision_range float32 861.0 kB 2,919 1.4% 11,456 5.3% 40,565 18.8% 19.9% 0.0
284 n_payment_type_cash_through_bank float32 861.0 kB 44 <0.1% 11,456 5.3% 54,943 25.5% 27.0% 1.0
285 n_payment_type_cash_from_account float32 861.0 kB 1 <0.1% 11,456 5.3% 203,801 94.7% 100.0% 0.0
286 n_payment_type_not_available float32 861.0 kB 46 <0.1% 11,456 5.3% 71,796 33.4% 35.2% 0.0
287 n_reject_reason_not_applicable float32 861.0 kB 44 <0.1% 11,456 5.3% 44,154 20.5% 21.7% 1.0
288 n_reject_reason_hc float32 861.0 kB 36 <0.1% 11,456 5.3% 157,346 73.1% 77.2% 0.0
289 n_reject_reason_limit float32 861.0 kB 22 <0.1% 11,456 5.3% 183,819 85.4% 90.2% 0.0
290 n_reject_reason_scoc float32 861.0 kB 20 <0.1% 11,456 5.3% 188,558 87.6% 92.5% 0.0
291 n_reject_reason_client float32 861.0 kB 11 <0.1% 11,456 5.3% 190,553 88.5% 93.5% 0.0
292 n_reject_reason_scofr float32 861.0 kB 16 <0.1% 11,456 5.3% 199,055 92.5% 97.7% 0.0
293 n_client_type_new float32 861.0 kB 14 <0.1% 11,456 5.3% 154,064 71.6% 75.6% 1.0
294 n_client_type_repeater float32 861.0 kB 61 <0.1% 11,456 5.3% 49,122 22.8% 24.1% 0.0
295 n_client_type_refreshed float32 861.0 kB 23 <0.1% 11,456 5.3% 150,108 69.7% 73.7% 0.0
296 n_portfolio_pos float32 861.0 kB 32 <0.1% 11,456 5.3% 81,754 38.0% 40.1% 1.0
297 n_portfolio_cash float32 861.0 kB 39 <0.1% 11,456 5.3% 99,269 46.1% 48.7% 0.0
298 n_portfolio_cards float32 861.0 kB 21 <0.1% 11,456 5.3% 135,213 62.8% 66.3% 0.0
299 n_product_type_xsell float32 861.0 kB 33 <0.1% 11,456 5.3% 97,659 45.4% 47.9% 0.0
300 n_product_type_walk_in float32 861.0 kB 28 <0.1% 11,456 5.3% 152,783 71.0% 75.0% 0.0
301 n_different_channels float32 861.0 kB 7 <0.1% 11,456 5.3% 79,085 36.7% 38.8% 2.0
302 n_channel_type_credit_and_cash float32 861.0 kB 52 <0.1% 11,456 5.3% 96,482 44.8% 47.3% 0.0
303 n_channel_type_countrywide float32 861.0 kB 34 <0.1% 11,456 5.3% 67,466 31.3% 33.1% 1.0
304 n_channel_type_stone float32 861.0 kB 22 <0.1% 11,456 5.3% 121,683 56.5% 59.7% 0.0
305 n_channel_type_regional_and_local float32 861.0 kB 19 <0.1% 11,456 5.3% 158,328 73.6% 77.7% 0.0
306 n_channel_type_contact_center float32 861.0 kB 19 <0.1% 11,456 5.3% 175,621 81.6% 86.2% 0.0
307 n_channel_type_ap_minus float32 861.0 kB 33 <0.1% 11,456 5.3% 187,751 87.2% 92.1% 0.0
308 n_channel_type_channel_corporate_sales float32 861.0 kB 20 <0.1% 11,456 5.3% 202,289 94.0% 99.3% 0.0
309 n_channel_type_car_dealer float32 861.0 kB 6 <0.1% 11,456 5.3% 203,580 94.6% 99.9% 0.0
310 n_cnt_payment_0 float32 861.0 kB 21 <0.1% 11,456 5.3% 135,213 62.8% 66.3% 0.0
311 cnt_payment_min float32 861.0 kB 31 <0.1% 11,752 5.5% 68,588 31.9% 33.7% 0.0
312 cnt_payment_max float32 861.0 kB 39 <0.1% 11,752 5.5% 52,776 24.5% 25.9% 12.0
313 cnt_payment_mean float32 861.0 kB 2,495 1.2% 11,752 5.5% 25,110 11.7% 12.3% 12.0
314 cnt_payment_std float32 861.0 kB 14,394 6.7% 56,274 26.1% 10,117 4.7% 6.4% 0.0
315 cnt_payment_median float32 861.0 kB 87 <0.1% 11,752 5.5% 53,998 25.1% 26.5% 12.0
316 cnt_payment_range float32 861.0 kB 69 <0.1% 11,752 5.5% 54,639 25.4% 26.8% 0.0
317 n_yield_group_low_action float32 861.0 kB 22 <0.1% 11,456 5.3% 163,415 75.9% 80.2% 0.0
318 n_yield_group_low_normal float32 861.0 kB 23 <0.1% 11,456 5.3% 94,724 44.0% 46.5% 0.0
319 n_yield_group_middle float32 861.0 kB 25 <0.1% 11,456 5.3% 80,043 37.2% 39.3% 0.0
320 n_yield_group_high float32 861.0 kB 30 <0.1% 11,456 5.3% 89,153 41.4% 43.7% 0.0
321 days_first_draw_min float32 861.0 kB 2,718 1.3% 12,377 5.7% 165,404 76.8% 81.5% 365243.0
322 days_first_draw_max float32 861.0 kB 939 0.4% 12,377 5.7% 201,133 93.4% 99.1% 365243.0
323 days_first_draw_mean float32 861.0 kB 14,131 6.6% 12,377 5.7% 165,404 76.8% 81.5% 365243.0
324 days_first_draw_std float64 1.7 MB 13,562 6.3% 67,931 31.6% 111,591 51.8% 75.7% 0.0
325 days_first_draw_median float32 861.0 kB 2,812 1.3% 12,377 5.7% 193,631 90.0% 95.4% 365243.0
326 days_first_draw_range float32 861.0 kB 2,723 1.3% 12,377 5.7% 167,145 77.6% 82.4% 0.0
327 days_last_due_1st_version_min float32 861.0 kB 4,081 1.9% 12,377 5.7% 1,911 0.9% 0.9% 365243.0
328 days_last_due_1st_version_max float32 861.0 kB 4,521 2.1% 12,377 5.7% 55,263 25.7% 27.2% 365243.0
329 days_last_due_1st_version_mean float32 861.0 kB 51,499 23.9% 12,377 5.7% 1,911 0.9% 0.9% 365243.0
330 days_last_due_1st_version_std float64 1.7 MB 104,185 48.4% 67,931 31.6% 50 <0.1% <0.1% 241.83051916579925
331 days_last_due_1st_version_median float32 861.0 kB 10,719 5.0% 12,377 5.7% 1,937 0.9% 1.0% 365243.0
332 days_last_due_1st_version_range float32 861.0 kB 7,864 3.7% 12,377 5.7% 55,584 25.8% 27.4% 0.0
333 days_last_due_min float32 861.0 kB 2,859 1.3% 12,377 5.7% 14,374 6.7% 7.1% 365243.0
334 days_last_due_max float32 861.0 kB 2,761 1.3% 12,377 5.7% 98,527 45.8% 48.6% 365243.0
335 days_last_due_mean float32 861.0 kB 51,645 24.0% 12,377 5.7% 14,374 6.7% 7.1% 365243.0
336 days_last_due_std float64 1.7 MB 99,434 46.2% 67,931 31.6% 3,105 1.4% 2.1% 0.0
337 days_last_due_median float32 861.0 kB 7,906 3.7% 12,377 5.7% 21,138 9.8% 10.4% 365243.0
338 days_last_due_range float32 861.0 kB 5,592 2.6% 12,377 5.7% 58,659 27.3% 28.9% 0.0
339 days_termination_min float32 861.0 kB 2,797 1.3% 12,377 5.7% 15,833 7.4% 7.8% 365243.0
340 days_termination_max float32 861.0 kB 2,683 1.2% 12,377 5.7% 105,005 48.8% 51.8% 365243.0
341 days_termination_mean float32 861.0 kB 51,017 23.7% 12,377 5.7% 15,833 7.4% 7.8% 365243.0
342 days_termination_std float64 1.7 MB 95,145 44.2% 67,931 31.6% 3,494 1.6% 2.4% 0.0
343 days_termination_median float32 861.0 kB 7,716 3.6% 12,377 5.7% 23,269 10.8% 11.5% 365243.0
344 days_termination_range float32 861.0 kB 5,101 2.4% 12,377 5.7% 59,048 27.4% 29.1% 0.0
345 n_nflag_insured_on_approval_sum float32 861.0 kB 19 <0.1% 11,456 5.3% 96,596 44.9% 47.4% 0.0
346 n_nflag_insured_on_approval_mean float32 861.0 kB 102 <0.1% 12,377 5.7% 95,675 44.4% 47.2% 0.0
347 any_nflag_insured_on_approval Int8 430.5 kB 1 <0.1% 0 0% 215,257 100.0% 100.0% 0
348 n_installments_total float32 861.0 kB 310 0.1% 11,034 5.1% 8,624 4.0% 4.2% 12.0
349 n_installments_late float32 861.0 kB 99 <0.1% 11,034 5.1% 95,670 44.4% 46.8% 0.0
350 n_installments_early float32 861.0 kB 215 0.1% 11,034 5.1% 9,335 4.3% 4.6% 6.0
351 n_installments_on_time float32 861.0 kB 140 0.1% 11,034 5.1% 88,381 41.1% 43.3% 0.0
352 percent_installments_late float32 861.0 kB 4,464 2.1% 11,034 5.1% 95,670 44.4% 46.8% 0.0
353 percent_installments_early float32 861.0 kB 7,892 3.7% 11,034 5.1% 64,688 30.1% 31.7% 1.0
354 percent_installments_on_time float32 861.0 kB 7,944 3.7% 11,034 5.1% 88,381 41.1% 43.3% 0.0
355 n_installments_late_7 float32 861.0 kB 59 <0.1% 11,034 5.1% 147,558 68.5% 72.3% 0.0
356 n_installments_late_30 float32 861.0 kB 42 <0.1% 11,034 5.1% 190,963 88.7% 93.5% 0.0
357 n_installments_late_60 float32 861.0 kB 39 <0.1% 11,034 5.1% 198,146 92.1% 97.0% 0.0
358 any_installments_late_7 float32 861.0 kB 2 <0.1% 11,034 5.1% 147,558 68.5% 72.3% 0.0
359 any_installments_late_30 float32 861.0 kB 2 <0.1% 11,034 5.1% 190,963 88.7% 93.5% 0.0
360 any_installments_late_60 float32 861.0 kB 2 <0.1% 11,034 5.1% 198,146 92.1% 97.0% 0.0
361 percent_installments_late_7 float32 861.0 kB 2,595 1.2% 11,034 5.1% 147,558 68.5% 72.3% 0.0
362 percent_installments_late_30 float32 861.0 kB 894 0.4% 11,034 5.1% 190,963 88.7% 93.5% 0.0
363 percent_installments_late_60 float32 861.0 kB 629 0.3% 11,034 5.1% 198,146 92.1% 97.0% 0.0
364 diff_days_installment_payment_min float32 861.0 kB 1,465 0.7% 11,037 5.1% 30,953 14.4% 15.2% 0.0
365 diff_days_installment_payment_max float32 861.0 kB 409 0.2% 11,037 5.1% 15,321 7.1% 7.5% 30.0
366 diff_days_installment_payment_mean float32 861.0 kB 50,246 23.3% 11,037 5.1% 761 0.4% 0.4% 9.0
367 diff_days_installment_payment_std float32 861.0 kB 159,834 74.3% 11,500 5.3% 341 0.2% 0.2% 0.0
368 diff_days_installment_payment_median float32 861.0 kB 320 0.1% 11,037 5.1% 21,620 10.0% 10.6% 0.0
369 diff_days_installment_payment_range float32 861.0 kB 1,465 0.7% 11,037 5.1% 5,349 2.5% 2.6% 30.0
370 diff_days_installment_payment_sum float32 861.0 kB 4,383 2.0% 11,034 5.1% 540 0.3% 0.3% 66.0
371 diff_days_installment_payment_sum_late_only float32 861.0 kB 1,815 0.8% 11,034 5.1% 95,670 44.4% 46.8% 0.0
372 diff_amt_installment_payment_min float64 1.7 MB 25,190 11.7% 11,037 5.1% 177,973 82.7% 87.1% 0.0
373 diff_amt_installment_payment_max float64 1.7 MB 75,445 35.0% 11,037 5.1% 116,518 54.1% 57.1% 0.0
374 diff_amt_installment_payment_mean float64 1.7 MB 97,257 45.2% 11,037 5.1% 103,060 47.9% 50.5% 0.0
375 diff_amt_installment_payment_std float64 1.7 MB 101,021 46.9% 11,500 5.3% 102,599 47.7% 50.4% 0.0
376 diff_amt_installment_payment_median float64 1.7 MB 6,855 3.2% 11,037 5.1% 195,960 91.0% 96.0% 0.0
377 diff_amt_installment_payment_range float64 1.7 MB 90,195 41.9% 11,037 5.1% 103,062 47.9% 50.5% 0.0
378 diff_percent_installment_payment_min float32 861.0 kB 25,589 11.9% 11,037 5.1% 177,973 82.7% 87.1% 1.0
379 diff_percent_installment_payment_max float64 1.7 MB 83,143 38.6% 11,037 5.1% 116,664 54.2% 57.1% 1.0
380 diff_percent_installment_payment_mean float64 1.7 MB 87,934 40.9% 11,037 5.1% 103,191 47.9% 50.5% 1.0
381 diff_percent_installment_payment_std float64 1.7 MB 100,863 46.9% 11,500 5.3% 102,727 47.7% 50.4% 0.0
382 diff_percent_installment_payment_median float32 861.0 kB 7,969 3.7% 11,037 5.1% 195,960 91.0% 96.0% 1.0
383 diff_percent_installment_payment_range float64 1.7 MB 97,055 45.1% 11,037 5.1% 103,190 47.9% 50.5% 0.0
384 n_previous_pos_applications float32 861.0 kB 221 0.1% 12,570 5.8% 9,559 4.4% 4.7% 13.0
385 n_previous_pos_applications_active float32 861.0 kB 207 0.1% 12,570 5.8% 11,535 5.4% 5.7% 12.0
386 n_previous_pos_applications_signed float32 861.0 kB 31 <0.1% 12,570 5.8% 162,017 75.3% 79.9% 0.0
387 n_previous_pos_applications_completed float32 861.0 kB 45 <0.1% 12,570 5.8% 73,226 34.0% 36.1% 1.0
388 cnt_installment_min float32 861.0 kB 53 <0.1% 12,588 5.8% 42,362 19.7% 20.9% 6.0
389 cnt_installment_max float32 861.0 kB 54 <0.1% 12,588 5.8% 57,934 26.9% 28.6% 12.0
390 cnt_installment_mean float32 861.0 kB 34,036 15.8% 12,588 5.8% 15,121 7.0% 7.5% 12.0
391 cnt_installment_std float32 861.0 kB 86,454 40.2% 12,828 6.0% 49,452 23.0% 24.4% 0.0
392 cnt_installment_median float32 861.0 kB 103 <0.1% 12,588 5.8% 61,162 28.4% 30.2% 12.0
393 cnt_installment_range float32 861.0 kB 69 <0.1% 12,588 5.8% 49,692 23.1% 24.5% 0.0
394 cnt_installment_future_min float32 861.0 kB 61 <0.1% 12,588 5.8% 183,466 85.2% 90.5% 0.0
395 cnt_installment_future_max float32 861.0 kB 61 <0.1% 12,588 5.8% 56,961 26.5% 28.1% 12.0
396 cnt_installment_future_mean float32 861.0 kB 33,098 15.4% 12,588 5.8% 7,294 3.4% 3.6% 6.0
397 cnt_installment_future_std float32 861.0 kB 94,015 43.7% 12,828 6.0% 7,063 3.3% 3.5% 2.1602468
398 cnt_installment_future_median float32 861.0 kB 121 0.1% 12,588 5.8% 22,039 10.2% 10.9% 6.0
399 cnt_installment_future_range float32 861.0 kB 65 <0.1% 12,588 5.8% 51,476 23.9% 25.4% 12.0
400 cnt_installments_diff_min float32 861.0 kB 58 <0.1% 12,588 5.8% 198,083 92.0% 97.7% 0.0
401 cnt_installments_diff_max float32 861.0 kB 65 <0.1% 12,588 5.8% 36,048 16.7% 17.8% 12.0
402 cnt_installments_diff_mean float32 861.0 kB 20,290 9.4% 12,588 5.8% 9,014 4.2% 4.4% 3.0
403 cnt_installments_diff_std float32 861.0 kB 73,650 34.2% 12,828 6.0% 7,541 3.5% 3.7% 2.1602468
404 cnt_installments_diff_median float32 861.0 kB 64 <0.1% 12,588 5.8% 29,837 13.9% 14.7% 4.0
405 cnt_installments_diff_range float32 861.0 kB 82 <0.1% 12,588 5.8% 35,742 16.6% 17.6% 12.0
406 sk_dpd_pos_applications_min float32 861.0 kB 44 <0.1% 12,570 5.8% 202,642 94.1% >99.9% 0.0
407 sk_dpd_pos_applications_max float32 861.0 kB 1,595 0.7% 12,570 5.8% 164,332 76.3% 81.1% 0.0
408 sk_dpd_pos_applications_mean float32 861.0 kB 8,594 4.0% 12,570 5.8% 164,332 76.3% 81.1% 0.0
409 sk_dpd_pos_applications_std float32 861.0 kB 20,325 9.4% 12,819 6.0% 164,083 76.2% 81.1% 0.0
410 sk_dpd_pos_applications_median float32 861.0 kB 856 0.4% 12,570 5.8% 201,113 93.4% 99.2% 0.0
411 sk_dpd_pos_applications_range float32 861.0 kB 1,566 0.7% 12,570 5.8% 164,332 76.3% 81.1% 0.0
412 sk_dpd_def_pos_applications_min float32 861.0 kB 3 <0.1% 12,570 5.8% 202,685 94.2% >99.9% 0.0
413 sk_dpd_def_pos_applications_max float32 861.0 kB 173 0.1% 12,570 5.8% 174,617 81.1% 86.2% 0.0
414 sk_dpd_def_pos_applications_mean float32 861.0 kB 3,858 1.8% 12,570 5.8% 174,617 81.1% 86.2% 0.0
415 sk_dpd_def_pos_applications_std float32 861.0 kB 12,093 5.6% 12,819 6.0% 174,368 81.0% 86.1% 0.0
416 sk_dpd_def_pos_applications_median float32 861.0 kB 61 <0.1% 12,570 5.8% 202,489 94.1% 99.9% 0.0
417 sk_dpd_def_pos_applications_range float32 861.0 kB 172 0.1% 12,570 5.8% 174,617 81.1% 86.2% 0.0
418 n_previous_credit_card_applications float32 861.0 kB 126 0.1% 154,158 71.6% 4,332 2.0% 7.1% 96.0
419 n_previous_credit_card_applications_completed float32 861.0 kB 40 <0.1% 154,158 71.6% 53,625 24.9% 87.8% 0.0
420 n_previous_credit_card_applications_active float32 861.0 kB 102 <0.1% 154,158 71.6% 3,810 1.8% 6.2% 96.0
421 n_previous_credit_card_applications_signed float32 861.0 kB 37 <0.1% 154,158 71.6% 58,091 27.0% 95.1% 0.0
422 n_contracts_credit_card_active float32 861.0 kB 102 <0.1% 154,158 71.6% 3,810 1.8% 6.2% 96.0
423 n_contracts_credit_card_completed float32 861.0 kB 40 <0.1% 154,158 71.6% 53,625 24.9% 87.8% 0.0
424 n_contracts_credit_card_signed float32 861.0 kB 37 <0.1% 154,158 71.6% 58,091 27.0% 95.1% 0.0
425 amt_balance_credit_card_min float64 1.7 MB 8,310 3.9% 154,158 71.6% 52,144 24.2% 85.3% 0.0
426 amt_balance_credit_card_max float64 1.7 MB 40,175 18.7% 154,158 71.6% 19,232 8.9% 31.5% 0.0
427 amt_balance_credit_card_mean float64 1.7 MB 41,818 19.4% 154,158 71.6% 19,214 8.9% 31.4% 0.0
428 amt_balance_credit_card_std float64 1.7 MB 41,728 19.4% 154,590 71.8% 18,904 8.8% 31.2% 0.0
429 amt_balance_credit_card_median float64 1.7 MB 27,685 12.9% 154,158 71.6% 33,027 15.3% 54.1% 0.0
430 amt_balance_credit_card_range float64 1.7 MB 40,268 18.7% 154,158 71.6% 19,336 9.0% 31.6% 0.0
431 amt_credit_limit_actual_min float32 861.0 kB 150 0.1% 154,158 71.6% 15,769 7.3% 25.8% 45000.0
432 amt_credit_limit_actual_max float32 861.0 kB 52 <0.1% 154,158 71.6% 8,852 4.1% 14.5% 135000.0
433 amt_credit_limit_actual_mean float64 1.7 MB 9,366 4.4% 154,158 71.6% 3,297 1.5% 5.4% 45000.0
434 amt_credit_limit_actual_std float64 1.7 MB 17,158 8.0% 154,590 71.8% 25,868 12.0% 42.6% 0.0
435 amt_credit_limit_actual_median float32 861.0 kB 151 0.1% 154,158 71.6% 7,600 3.5% 12.4% 0.0
436 amt_credit_limit_actual_range float32 861.0 kB 147 0.1% 154,158 71.6% 26,300 12.2% 43.0% 0.0
437 amt_drawings_atm_current_min float32 861.0 kB 114 0.1% 172,254 80.0% 42,401 19.7% 98.6% 0.0
438 amt_drawings_atm_current_max float64 1.7 MB 1,131 0.5% 172,254 80.0% 6,929 3.2% 16.1% 0.0
439 amt_drawings_atm_current_mean float64 1.7 MB 17,404 8.1% 172,254 80.0% 6,929 3.2% 16.1% 0.0
440 amt_drawings_atm_current_std float64 1.7 MB 30,960 14.4% 172,561 80.2% 6,804 3.2% 15.9% 0.0
441 amt_drawings_atm_current_median float64 1.7 MB 378 0.2% 172,254 80.0% 36,581 17.0% 85.1% 0.0
442 amt_drawings_atm_current_range float32 861.0 kB 1 <0.1% 172,254 80.0% 43,003 20.0% 100.0% 0.0
443 amt_drawings_current_min float64 1.7 MB 1,475 0.7% 154,158 71.6% 59,264 27.5% 97.0% 0.0
444 amt_drawings_current_max float64 1.7 MB 17,325 8.0% 154,158 71.6% 19,196 8.9% 31.4% 0.0
445 amt_drawings_current_mean float64 1.7 MB 35,095 16.3% 154,158 71.6% 19,196 8.9% 31.4% 0.0
446 amt_drawings_current_std float64 1.7 MB 39,419 18.3% 154,590 71.8% 18,901 8.8% 31.2% 0.0
447 amt_drawings_current_median float64 1.7 MB 9,561 4.4% 154,158 71.6% 47,512 22.1% 77.8% 0.0
448 amt_drawings_current_range float64 1.7 MB 17,342 8.1% 154,158 71.6% 19,333 9.0% 31.6% 0.0
449 amt_drawings_other_current_min float32 861.0 kB 4 <0.1% 172,254 80.0% 43,000 20.0% >99.9% 0.0
450 amt_drawings_other_current_max float64 1.7 MB 1,084 0.5% 172,254 80.0% 38,999 18.1% 90.7% 0.0
451 amt_drawings_other_current_mean float64 1.7 MB 2,925 1.4% 172,254 80.0% 38,999 18.1% 90.7% 0.0
452 amt_drawings_other_current_std float64 1.7 MB 3,439 1.6% 172,561 80.2% 38,694 18.0% 90.6% 0.0
453 amt_drawings_other_current_median float64 1.7 MB 33 <0.1% 172,254 80.0% 42,965 20.0% 99.9% 0.0
454 amt_drawings_other_current_range float64 1.7 MB 1,083 0.5% 172,254 80.0% 39,001 18.1% 90.7% 0.0
455 amt_drawings_pos_current_min float64 1.7 MB 1,772 0.8% 172,254 80.0% 41,083 19.1% 95.5% 0.0
456 amt_drawings_pos_current_max float64 1.7 MB 20,726 9.6% 172,254 80.0% 19,027 8.8% 44.2% 0.0
457 amt_drawings_pos_current_mean float64 1.7 MB 23,516 10.9% 172,254 80.0% 19,027 8.8% 44.2% 0.0
458 amt_drawings_pos_current_std float64 1.7 MB 23,623 11.0% 172,561 80.2% 18,898 8.8% 44.3% 0.0
459 amt_drawings_pos_current_median float64 1.7 MB 8,634 4.0% 172,254 80.0% 33,721 15.7% 78.4% 0.0
460 amt_drawings_pos_current_range float64 1.7 MB 20,626 9.6% 172,254 80.0% 19,205 8.9% 44.7% 0.0
461 amt_inst_min_regularity_min float64 1.7 MB 1,664 0.8% 154,158 71.6% 57,788 26.8% 94.6% 0.0
462 amt_inst_min_regularity_max float64 1.7 MB 22,887 10.6% 154,158 71.6% 19,437 9.0% 31.8% 0.0
463 amt_inst_min_regularity_mean float64 1.7 MB 40,398 18.8% 154,158 71.6% 19,437 9.0% 31.8% 0.0
464 amt_inst_min_regularity_std float64 1.7 MB 40,484 18.8% 154,590 71.8% 19,359 9.0% 31.9% 0.0
465 amt_inst_min_regularity_median float64 1.7 MB 16,994 7.9% 154,158 71.6% 33,468 15.5% 54.8% 0.0
466 amt_inst_min_regularity_range float64 1.7 MB 23,219 10.8% 154,158 71.6% 19,791 9.2% 32.4% 0.0
467 amt_payment_current_min float64 1.7 MB 7,398 3.4% 172,336 80.1% 26,925 12.5% 62.7% 0.0
468 amt_payment_current_max float64 1.7 MB 19,208 8.9% 172,336 80.1% 907 0.4% 2.1% 22500.0
469 amt_payment_current_mean float64 1.7 MB 40,261 18.7% 172,336 80.1% 83 <0.1% 0.2% 0.0
470 amt_payment_current_std float64 1.7 MB 41,555 19.3% 172,647 80.2% 371 0.2% 0.9% 0.0
471 amt_payment_current_median float64 1.7 MB 17,066 7.9% 172,336 80.1% 2,689 1.2% 6.3% 9000.0
472 amt_payment_current_range float64 1.7 MB 22,545 10.5% 172,336 80.1% 682 0.3% 1.6% 0.0
473 amt_payment_total_current_min float64 1.7 MB 1,131 0.5% 154,158 71.6% 59,285 27.5% 97.0% 0.0
474 amt_payment_total_current_max float64 1.7 MB 22,332 10.4% 154,158 71.6% 18,441 8.6% 30.2% 0.0
475 amt_payment_total_current_mean float64 1.7 MB 40,916 19.0% 154,158 71.6% 18,441 8.6% 30.2% 0.0
476 amt_payment_total_current_std float64 1.7 MB 42,215 19.6% 154,590 71.8% 18,090 8.4% 29.8% 0.0
477 amt_payment_total_current_median float64 1.7 MB 13,261 6.2% 154,158 71.6% 30,408 14.1% 49.8% 0.0
478 amt_payment_total_current_range float64 1.7 MB 22,686 10.5% 154,158 71.6% 18,522 8.6% 30.3% 0.0
479 amt_receivable_principal_min float64 1.7 MB 6,082 2.8% 154,158 71.6% 53,385 24.8% 87.4% 0.0
480 amt_receivable_principal_max float64 1.7 MB 33,039 15.3% 154,158 71.6% 19,707 9.2% 32.3% 0.0
481 amt_receivable_principal_mean float64 1.7 MB 41,189 19.1% 154,158 71.6% 19,683 9.1% 32.2% 0.0
482 amt_receivable_principal_std float64 1.7 MB 41,193 19.1% 154,590 71.8% 19,378 9.0% 31.9% 0.0
483 amt_receivable_principal_median float64 1.7 MB 25,587 11.9% 154,158 71.6% 34,981 16.3% 57.3% 0.0
484 amt_receivable_principal_range float64 1.7 MB 33,975 15.8% 154,158 71.6% 19,810 9.2% 32.4% 0.0
485 amt_receivable_min float64 1.7 MB 14,658 6.8% 154,158 71.6% 43,946 20.4% 71.9% 0.0
486 amt_receivable_max float64 1.7 MB 39,955 18.6% 154,158 71.6% 19,362 9.0% 31.7% 0.0
487 amt_receivable_mean float64 1.7 MB 41,873 19.5% 154,158 71.6% 19,064 8.9% 31.2% 0.0
488 amt_receivable_std float64 1.7 MB 41,816 19.4% 154,590 71.8% 18,748 8.7% 30.9% 0.0
489 amt_receivable_median float64 1.7 MB 26,844 12.5% 154,158 71.6% 33,993 15.8% 55.6% 0.0
490 amt_receivable_range float64 1.7 MB 40,943 19.0% 154,158 71.6% 19,180 8.9% 31.4% 0.0
491 amt_total_receivable_min float64 1.7 MB 14,657 6.8% 154,158 71.6% 43,947 20.4% 71.9% 0.0
492 amt_total_receivable_max float64 1.7 MB 39,959 18.6% 154,158 71.6% 19,361 9.0% 31.7% 0.0
493 amt_total_receivable_mean float64 1.7 MB 41,873 19.5% 154,158 71.6% 19,064 8.9% 31.2% 0.0
494 amt_total_receivable_std float64 1.7 MB 41,817 19.4% 154,590 71.8% 18,748 8.7% 30.9% 0.0
495 amt_total_receivable_median float64 1.7 MB 26,843 12.5% 154,158 71.6% 33,993 15.8% 55.6% 0.0
496 amt_total_receivable_range float64 1.7 MB 40,943 19.0% 154,158 71.6% 19,180 8.9% 31.4% 0.0
497 cnt_drawings_atm_current_min float32 861.0 kB 19 <0.1% 172,254 80.0% 42,402 19.7% 98.6% 0.0
498 cnt_drawings_atm_current_max float32 861.0 kB 43 <0.1% 172,254 80.0% 6,929 3.2% 16.1% 0.0
499 cnt_drawings_atm_current_mean float32 861.0 kB 3,073 1.4% 172,254 80.0% 6,929 3.2% 16.1% 0.0
500 cnt_drawings_atm_current_std float32 861.0 kB 16,770 7.8% 172,561 80.2% 6,817 3.2% 16.0% 0.0
501 cnt_drawings_atm_current_median float32 861.0 kB 33 <0.1% 172,254 80.0% 36,581 17.0% 85.1% 0.0
502 cnt_drawings_atm_current_range float32 861.0 kB 43 <0.1% 172,254 80.0% 7,124 3.3% 16.6% 0.0
503 cnt_drawings_current_min float32 861.0 kB 39 <0.1% 154,158 71.6% 59,278 27.5% 97.0% 0.0
504 cnt_drawings_current_max float32 861.0 kB 114 0.1% 154,158 71.6% 19,499 9.1% 31.9% 0.0
505 cnt_drawings_current_mean float32 861.0 kB 5,724 2.7% 154,158 71.6% 19,499 9.1% 31.9% 0.0
506 cnt_drawings_current_std float32 861.0 kB 25,425 11.8% 154,590 71.8% 19,208 8.9% 31.7% 0.0
507 cnt_drawings_current_median float32 861.0 kB 113 0.1% 154,158 71.6% 47,629 22.1% 78.0% 0.0
508 cnt_drawings_current_range float32 861.0 kB 114 0.1% 154,158 71.6% 19,640 9.1% 32.1% 0.0
509 cnt_drawings_other_current_min float32 861.0 kB 3 <0.1% 172,254 80.0% 43,000 20.0% >99.9% 0.0
510 cnt_drawings_other_current_max float32 861.0 kB 11 <0.1% 172,254 80.0% 38,987 18.1% 90.7% 0.0
511 cnt_drawings_other_current_mean float32 861.0 kB 382 0.2% 172,254 80.0% 38,987 18.1% 90.7% 0.0
512 cnt_drawings_other_current_std float32 861.0 kB 724 0.3% 172,561 80.2% 38,683 18.0% 90.6% 0.0
513 cnt_drawings_other_current_median float32 861.0 kB 4 <0.1% 172,254 80.0% 42,965 20.0% 99.9% 0.0
514 cnt_drawings_other_current_range float32 861.0 kB 11 <0.1% 172,254 80.0% 38,990 18.1% 90.7% 0.0
515 cnt_drawings_pos_current_min float32 861.0 kB 40 <0.1% 172,254 80.0% 41,083 19.1% 95.5% 0.0
516 cnt_drawings_pos_current_max float32 861.0 kB 116 0.1% 172,254 80.0% 19,027 8.8% 44.2% 0.0
517 cnt_drawings_pos_current_mean float32 861.0 kB 4,240 2.0% 172,254 80.0% 19,027 8.8% 44.2% 0.0
518 cnt_drawings_pos_current_std float32 861.0 kB 13,887 6.5% 172,561 80.2% 18,908 8.8% 44.3% 0.0
519 cnt_drawings_pos_current_median float32 861.0 kB 113 0.1% 172,254 80.0% 33,721 15.7% 78.4% 0.0
520 cnt_drawings_pos_current_range float32 861.0 kB 116 0.1% 172,254 80.0% 19,215 8.9% 44.7% 0.0
521 cnt_installment_mature_cum_min float32 861.0 kB 28 <0.1% 154,158 71.6% 38,853 18.0% 63.6% 0.0
522 cnt_installment_mature_cum_max float32 861.0 kB 120 0.1% 154,158 71.6% 19,249 8.9% 31.5% 0.0
523 cnt_installment_mature_cum_mean float32 861.0 kB 11,238 5.2% 154,158 71.6% 19,249 8.9% 31.5% 0.0
524 cnt_installment_mature_cum_std float32 861.0 kB 12,965 6.0% 154,590 71.8% 19,175 8.9% 31.6% 0.0
525 cnt_installment_mature_cum_median float32 861.0 kB 144 0.1% 154,158 71.6% 20,299 9.4% 33.2% 0.0
526 cnt_installment_mature_cum_range float32 861.0 kB 96 <0.1% 154,158 71.6% 19,607 9.1% 32.1% 0.0
527 sk_dpd_credit_card_min float32 861.0 kB 1 <0.1% 154,158 71.6% 61,099 28.4% 100.0% 0.0
528 sk_dpd_credit_card_max float32 861.0 kB 353 0.2% 154,158 71.6% 48,474 22.5% 79.3% 0.0
529 sk_dpd_credit_card_mean float32 861.0 kB 2,882 1.3% 154,158 71.6% 48,474 22.5% 79.3% 0.0
530 sk_dpd_credit_card_std float32 861.0 kB 3,641 1.7% 154,590 71.8% 48,042 22.3% 79.2% 0.0
531 sk_dpd_credit_card_median float32 861.0 kB 222 0.1% 154,158 71.6% 60,546 28.1% 99.1% 0.0
532 sk_dpd_credit_card_range float32 861.0 kB 353 0.2% 154,158 71.6% 48,474 22.5% 79.3% 0.0
533 sk_dpd_def_credit_card_min float32 861.0 kB 1 <0.1% 154,158 71.6% 61,099 28.4% 100.0% 0.0
534 sk_dpd_def_credit_card_max float32 861.0 kB 47 <0.1% 154,158 71.6% 50,652 23.5% 82.9% 0.0
535 sk_dpd_def_credit_card_mean float32 861.0 kB 1,328 0.6% 154,158 71.6% 50,652 23.5% 82.9% 0.0
536 sk_dpd_def_credit_card_std float32 861.0 kB 1,757 0.8% 154,590 71.8% 50,220 23.3% 82.8% 0.0
537 sk_dpd_def_credit_card_median float32 861.0 kB 16 <0.1% 154,158 71.6% 61,061 28.4% 99.9% 0.0
538 sk_dpd_def_credit_card_range float32 861.0 kB 47 <0.1% 154,158 71.6% 50,652 23.5% 82.9% 0.0
539 FLAG_IS_EMERGENCY Int8 430.5 kB 2 <0.1% 0 0% 213,628 99.2% 99.2% 0
540 ord_education_type int8 215.3 kB 5 <0.1% 0 0% 152,993 71.1% 71.1% 1
541 flag_has_children Int8 430.5 kB 2 <0.1% 0 0% 150,641 70.0% 70.0% 0
542 years_employed float64 1.7 MB 11,769 5.5% 38,756 18.0% 112 0.1% 0.1% 0.6273972602739726
543 amt_income_total_per_family_member float64 1.7 MB 2,362 1.1% 1 <0.1% 17,111 7.9% 7.9% 67500.0
544 cnt_fam_members_excluding_children float32 861.0 kB 2 <0.1% 1 <0.1% 158,301 73.5% 73.5% 2.0
545 amt_annuity_to_credit_ratio float32 861.0 kB 33,148 15.4% 8 <0.1% 20,556 9.5% 9.5% 0.05
546 amt_annuity_to_income_ratio float64 1.7 MB 71,916 33.4% 8 <0.1% 2,049 1.0% 1.0% 0.1
547 amt_credit_to_income_ratio float64 1.7 MB 39,372 18.3% 0 0% 3,691 1.7% 1.7% 2.0
548 amt_annuity_to_income_per_family_member float64 1.7 MB 88,172 41.0% 9 <0.1% 1,500 0.7% 0.7% 0.3

6 Further Pre-Processing

In this chapter, further data pre-processing and pre-selection of features are performed to prepare the data for modeling.

6.1 Identify Redundant and Problematic Features

The purpose of this section is to identify two sets of variables:

  1. A set of variables from the merged data table that should be kept and passed on to pre-processing (the “before pre-processing” set);
  2. A set of variables that should be kept after pre-processing and used to create a predictive model (the “after pre-processing” set).

To achieve this, problematic, duplicated, and correlated columns are identified first; the columns that remain after excluding them (the complement) form each set.

In this section, only the training set is used.

6.1.1 Steps Before Pre-Processing

Columns are considered problematic and are excluded before further pre-processing if they:

  1. have at most one unique value (this includes columns in which all values are missing);
  2. have at least 90% of values missing;
  3. have a single value that accounts for at least 99.85% of the non-missing cases (displayed as 99.9% or more in the table below because of rounding).
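The statistics used by these criteria (`n_unique`, `p_missing`, `p_dom_excl_na`) come from the project's own column-info helper, whose implementation is not shown here. The block below is only a minimal sketch of how such statistics could be defined, written in plain Python so that the definitions are explicit. The function names `column_stats` and `is_problematic` are hypothetical, and missing values are assumed to be represented by `None`.

```python
from collections import Counter


def column_stats(values):
    """Return n_unique, p_missing and p_dom_excl_na for one column.

    `values` is a list in which missing entries are represented by None.
    Percentages are on a 0-100 scale, like the thresholds in the query.
    """
    n_total = len(values)
    present = [v for v in values if v is not None]
    counts = Counter(present)
    n_dominant = max(counts.values()) if counts else 0
    return {
        # Missing entries are not counted as a unique value (assumption).
        "n_unique": len(counts),
        "p_missing": 100 * (n_total - len(present)) / n_total if n_total else 0.0,
        # Share of the most frequent value among non-missing entries only.
        "p_dom_excl_na": 100 * n_dominant / len(present) if present else 0.0,
    }


def is_problematic(stats):
    """Apply the same three thresholds as the query in this section."""
    return (
        stats["n_unique"] <= 1
        or stats["p_missing"] >= 90.0
        or stats["p_dom_excl_na"] >= 99.85
    )


# A flag that is 0 in 2000 of 2001 rows is dominated by one value.
sparse_flag = [0] * 2000 + [1]
print(column_stats(sparse_flag), is_problematic(column_stats(sparse_flag)))
```

Computing the dominant-value share on non-missing entries only is what allows a column such as `bureau_dpd_status_min` (about 71% missing) to be flagged by the third criterion even though it passes the second.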

problematic_columns = credits_train_col_info.query(
    "n_unique <= 1 or p_missing >= 90.00 or p_dom_excl_na >= 99.85"
)

print(f"N columns to remove: {problematic_columns.shape[0]}")
problematic_columns.pipe(an.style_col_info)
N columns to remove: 45
Table 6.1. Info on the problematic columns to remove before preprocessing.
  column data_type memory_size n_unique p_unique n_missing p_missing n_dominant p_dominant p_dom_excl_na dominant
18 FLAG_MOBIL int8 215.3 kB 2 <0.1% 0 0% 215,256 >99.9% >99.9% 1
89 FLAG_DOCUMENT_2 int8 215.3 kB 2 <0.1% 0 0% 215,246 >99.9% >99.9% 0
91 FLAG_DOCUMENT_4 int8 215.3 kB 2 <0.1% 0 0% 215,238 >99.9% >99.9% 0
94 FLAG_DOCUMENT_7 int8 215.3 kB 2 <0.1% 0 0% 215,221 >99.9% >99.9% 0
97 FLAG_DOCUMENT_10 int8 215.3 kB 2 <0.1% 0 0% 215,253 >99.9% >99.9% 0
99 FLAG_DOCUMENT_12 int8 215.3 kB 2 <0.1% 0 0% 215,256 >99.9% >99.9% 0
102 FLAG_DOCUMENT_15 int8 215.3 kB 2 <0.1% 0 0% 215,015 99.9% 99.9% 0
104 FLAG_DOCUMENT_17 int8 215.3 kB 2 <0.1% 0 0% 215,200 >99.9% >99.9% 0
106 FLAG_DOCUMENT_19 int8 215.3 kB 2 <0.1% 0 0% 215,124 99.9% 99.9% 0
107 FLAG_DOCUMENT_20 int8 215.3 kB 2 <0.1% 0 0% 215,146 99.9% 99.9% 0
108 FLAG_DOCUMENT_21 int8 215.3 kB 2 <0.1% 0 0% 215,187 >99.9% >99.9% 0
118 n_credits_bad_debt float32 861.0 kB 2 <0.1% 30,836 14.3% 184,408 85.7% >99.9% 0.0
120 mode_credit_currency category 215.6 kB 3 <0.1% 30,836 14.3% 184,386 85.7% >99.9% currency 1
124 n_currency_3 float32 861.0 kB 4 <0.1% 30,836 14.3% 184,319 85.6% 99.9% 0.0
125 n_currency_4 float32 861.0 kB 2 <0.1% 30,836 14.3% 184,414 85.7% >99.9% 0.0
132 days_credit_overdue_min float32 861.0 kB 69 <0.1% 30,836 14.3% 184,320 85.6% 99.9% 0.0
156 cnt_credit_prolong_min float32 861.0 kB 6 <0.1% 30,836 14.3% 184,215 85.6% 99.9% 0.0
184 amt_credit_sum_overdue_min float32 861.0 kB 81 <0.1% 30,836 14.3% 184,318 85.6% 99.9% 0.0
213 bureau_dpd_status_min float32 861.0 kB 6 <0.1% 152,586 70.9% 62,638 29.1% 99.9% 0.0
259 rate_interest_primary_min float32 861.0 kB 119 0.1% 212,016 98.5% 666 0.3% 20.5% 0.18913634
260 rate_interest_primary_max float32 861.0 kB 119 0.1% 212,016 98.5% 674 0.3% 20.8% 0.18913634
261 rate_interest_primary_mean float32 861.0 kB 160 0.1% 212,016 98.5% 655 0.3% 20.2% 0.18913634
262 rate_interest_primary_std float32 861.0 kB 39 <0.1% 215,139 99.9% 37 <0.1% 31.4% 0.0
263 rate_interest_primary_median float32 861.0 kB 157 0.1% 212,016 98.5% 655 0.3% 20.2% 0.18913634
264 rate_interest_primary_range float32 861.0 kB 37 <0.1% 212,016 98.5% 3,160 1.5% 97.5% 0.0
266 rate_interest_privileged_min float32 861.0 kB 21 <0.1% 212,016 98.5% 892 0.4% 27.5% 0.83509517
267 rate_interest_privileged_max float32 861.0 kB 21 <0.1% 212,016 98.5% 906 0.4% 28.0% 0.83509517
268 rate_interest_privileged_mean float32 861.0 kB 42 <0.1% 212,016 98.5% 881 0.4% 27.2% 0.83509517
269 rate_interest_privileged_std float32 861.0 kB 21 <0.1% 215,139 99.9% 50 <0.1% 42.4% 0.0
270 rate_interest_privileged_median float32 861.0 kB 40 <0.1% 212,016 98.5% 881 0.4% 27.2% 0.83509517
271 rate_interest_privileged_range float32 861.0 kB 20 <0.1% 212,016 98.5% 3,173 1.5% 97.9% 0.0
285 n_payment_type_cash_from_account float32 861.0 kB 1 <0.1% 11,456 5.3% 203,801 94.7% 100.0% 0.0
309 n_channel_type_car_dealer float32 861.0 kB 6 <0.1% 11,456 5.3% 203,580 94.6% 99.9% 0.0
347 any_nflag_insured_on_approval Int8 430.5 kB 1 <0.1% 0 0% 215,257 100.0% 100.0% 0
406 sk_dpd_pos_applications_min float32 861.0 kB 44 <0.1% 12,570 5.8% 202,642 94.1% >99.9% 0.0
412 sk_dpd_def_pos_applications_min float32 861.0 kB 3 <0.1% 12,570 5.8% 202,685 94.2% >99.9% 0.0
416 sk_dpd_def_pos_applications_median float32 861.0 kB 61 <0.1% 12,570 5.8% 202,489 94.1% 99.9% 0.0
442 amt_drawings_atm_current_range float32 861.0 kB 1 <0.1% 172,254 80.0% 43,003 20.0% 100.0% 0.0
449 amt_drawings_other_current_min float32 861.0 kB 4 <0.1% 172,254 80.0% 43,000 20.0% >99.9% 0.0
453 amt_drawings_other_current_median float64 1.7 MB 33 <0.1% 172,254 80.0% 42,965 20.0% 99.9% 0.0
509 cnt_drawings_other_current_min float32 861.0 kB 3 <0.1% 172,254 80.0% 43,000 20.0% >99.9% 0.0
513 cnt_drawings_other_current_median float32 861.0 kB 4 <0.1% 172,254 80.0% 42,965 20.0% 99.9% 0.0
527 sk_dpd_credit_card_min float32 861.0 kB 1 <0.1% 154,158 71.6% 61,099 28.4% 100.0% 0.0
533 sk_dpd_def_credit_card_min float32 861.0 kB 1 <0.1% 154,158 71.6% 61,099 28.4% 100.0% 0.0
537 sk_dpd_def_credit_card_median float32 861.0 kB 16 <0.1% 154,158 71.6% 61,061 28.4% 99.9% 0.0
# Create list of columns to keep
cols_to_keep_1 = list(
    set(credits_train.columns) - set(problematic_columns.column) - set(["TARGET"])
)
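Because Python sets are unordered, converting a set difference back to a list does not guarantee the same column order on every run, which may matter for order-sensitive steps such as duplicate removal. As a hedged alternative (not the approach used in this project), the complement can be built with a list comprehension that preserves the original column order. The helper name `keep_columns` is hypothetical, and the example uses a few column names from the tables above.

```python
def keep_columns(all_columns, excluded):
    """Return the columns not in `excluded`, preserving their original order."""
    excluded = set(excluded)  # set membership tests are O(1) on average
    return [col for col in all_columns if col not in excluded]


# A handful of column names, for illustration only.
all_columns = ["TARGET", "AMT_CREDIT", "FLAG_MOBIL", "AMT_ANNUITY"]
cols_to_keep = keep_columns(all_columns, excluded=["FLAG_MOBIL", "TARGET"])
print(cols_to_keep)
```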

The next steps are to:

  1. remove the problematic columns identified above;
  2. drop duplicated columns;
  3. use the SmartCorrelatedSelection algorithm to identify groups of correlated variables and keep only a single variable from each group (here, the one with the highest variance).
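To illustrate the idea behind the third step, the block below gives a simplified greedy sketch of variance-based selection among correlated columns. It is not the library's implementation: the function names, the 0.8 correlation threshold, and the toy data are assumptions made for illustration, and the sketch assumes numeric columns without missing values or constant columns.

```python
from statistics import mean, pvariance


def pearson(x, y):
    """Pearson correlation of two equally long numeric sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)  # assumes neither column is constant


def select_by_variance(columns, threshold=0.8):
    """Group correlated columns greedily; keep the highest-variance one.

    `columns` maps a column name to a list of numeric values.
    """
    remaining = list(columns)
    kept = []
    while remaining:
        first = remaining.pop(0)
        # Columns strongly correlated with `first` form one group.
        group = [first] + [
            name
            for name in remaining
            if abs(pearson(columns[first], columns[name])) > threshold
        ]
        remaining = [name for name in remaining if name not in group]
        kept.append(max(group, key=lambda name: pvariance(columns[name])))
    return kept


toy = {
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],  # perfectly correlated with "a", larger variance
    "c": [5, 1, 4, 2, 3],   # only weakly correlated with "a"
}
print(select_by_variance(toy))
```

In this toy example, `a` and `b` end up in one group and only `b` is retained, while `c` is kept because it is not strongly correlated with any other column.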
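# Feature pre-selection applied before any other pre-processing
pipeline_selection_before_preprec = Pipeline(
    steps=[
        # 1. Keep only the columns that were not flagged as problematic
        ("column_selector_1", ColumnSelector(cols_to_keep_1)),
        # 2. Drop columns that duplicate another column
        ("drop_duplicate_features", DropDuplicateFeatures()),
        # 3. From each group of correlated columns, keep the highest-variance one
        (
            "drop_corr_features",
            SmartCorrelatedSelection(selection_method="variance"),
        ),
    ]
)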

pipeline_selection_before_preprec.fit(credits_train)
# Time: 5m 36.1s
Pipeline(steps=[('column_selector_1',
                 ColumnSelector(keep=['days_credit_update_max',
                                      'cnt_installment_mature_cum_min',
                                      'cnt_drawings_current_min',
                                      'sk_dpd_pos_applications_mean',
                                      'ord_education_type',
                                      'REG_REGION_NOT_WORK_REGION',
                                      'amt_goods_price_mean', 'FLOORSMIN_MEDI',
                                      'cnt_installment_future_std',
                                      'n_previous_pos_applications',
                                      'diff_percent_installment_p...
                                      'amt_credit_sum_std',
                                      'amt_credit_sum_debt_sum',
                                      'DAYS_ID_PUBLISH', 'FLAG_DOCUMENT_11',
                                      'LIVINGAPARTMENTS_MODE',
                                      'amt_payment_total_current_std',
                                      'cnt_payment_min',
                                      'sk_dpd_def_pos_applications_max',
                                      'n_channel_type_regional_and_local', ...])),
                ('drop_duplicate_features', DropDuplicateFeatures()),
                ('drop_corr_features',
                 SmartCorrelatedSelection(selection_method='variance'))])
df_before_preproc = pipeline_selection_before_preprec.transform(credits_train)
df_before_preproc.shape
(215257, 251)
df_before_preproc = df_before_preproc.sort_index(axis=1)
before_preproc_col_info = an.col_info(df_before_preproc)
before_preproc_col_info.pipe(an.style_col_info)
Table 6.2. Info on the selected columns before preprocessing.
  column data_type memory_size n_unique p_unique n_missing p_missing n_dominant p_dominant p_dom_excl_na dominant
1 AMT_ANNUITY float32 861.0 kB 12,801 5.9% 8 <0.1% 4,499 2.1% 2.1% 9000.0
2 AMT_CREDIT float32 861.0 kB 5,097 2.4% 0 0% 6,823 3.2% 3.2% 450000.0
3 AMT_INCOME_TOTAL float64 1.7 MB 1,949 0.9% 0 0% 24,982 11.6% 11.6% 135000.0
4 AMT_REQ_CREDIT_BUREAU_DAY float32 861.0 kB 9 <0.1% 29,081 13.5% 185,147 86.0% 99.4% 0.0
5 AMT_REQ_CREDIT_BUREAU_HOUR float32 861.0 kB 5 <0.1% 29,081 13.5% 185,061 86.0% 99.4% 0.0
6 AMT_REQ_CREDIT_BUREAU_MON float32 861.0 kB 22 <0.1% 29,081 13.5% 155,679 72.3% 83.6% 0.0
7 AMT_REQ_CREDIT_BUREAU_QRT float32 861.0 kB 10 <0.1% 29,081 13.5% 150,895 70.1% 81.0% 0.0
8 AMT_REQ_CREDIT_BUREAU_WEEK float32 861.0 kB 9 <0.1% 29,081 13.5% 180,246 83.7% 96.8% 0.0
9 AMT_REQ_CREDIT_BUREAU_YEAR float32 861.0 kB 24 <0.1% 29,081 13.5% 50,313 23.4% 27.0% 0.0
10 BASEMENTAREA_MODE float32 861.0 kB 3,687 1.7% 125,793 58.4% 11,561 5.4% 12.9% 0.0
11 CNT_FAM_MEMBERS float32 861.0 kB 12 <0.1% 1 <0.1% 110,671 51.4% 51.4% 2.0
12 COMMONAREA_MEDI float32 861.0 kB 2,982 1.4% 150,300 69.8% 6,068 2.8% 9.3% 0.0
13 DAYS_ID_PUBLISH int16 430.5 kB 6,122 2.8% 0 0% 119 0.1% 0.1% -4074
14 DAYS_LAST_PHONE_CHANGE float32 861.0 kB 3,720 1.7% 1 <0.1% 26,201 12.2% 12.2% 0.0
15 DAYS_REGISTRATION float32 861.0 kB 15,249 7.1% 0 0% 79 <0.1% <0.1% -7.0
16 DEF_30_CNT_SOCIAL_CIRCLE float32 861.0 kB 10 <0.1% 714 0.3% 189,988 88.3% 88.6% 0.0
17 ELEVATORS_AVG float32 861.0 kB 241 0.1% 114,570 53.2% 60,109 27.9% 59.7% 0.0
18 ELEVATORS_MEDI float32 861.0 kB 46 <0.1% 114,570 53.2% 61,040 28.4% 60.6% 0.0
19 ENTRANCES_MODE float32 861.0 kB 30 <0.1% 108,270 50.3% 25,310 11.8% 23.7% 0.1379
20 EXT_SOURCE_1 float32 861.0 kB 83,961 39.0% 121,373 56.4% 5 <0.1% <0.1% 0.44398212
21 EXT_SOURCE_2 float32 861.0 kB 102,229 47.5% 464 0.2% 503 0.2% 0.2% 0.28589788
22 EXT_SOURCE_3 float32 861.0 kB 804 0.4% 42,680 19.8% 985 0.5% 0.6% 0.7463002
23 FLAG_CONT_MOBILE int8 215.3 kB 2 <0.1% 0 0% 214,855 99.8% 99.8% 1
24 FLAG_DOCUMENT_11 int8 215.3 kB 2 <0.1% 0 0% 214,448 99.6% 99.6% 0
25 FLAG_DOCUMENT_13 int8 215.3 kB 2 <0.1% 0 0% 214,541 99.7% 99.7% 0
26 FLAG_DOCUMENT_14 int8 215.3 kB 2 <0.1% 0 0% 214,614 99.7% 99.7% 0
27 FLAG_DOCUMENT_16 int8 215.3 kB 2 <0.1% 0 0% 213,089 99.0% 99.0% 0
28 FLAG_DOCUMENT_18 int8 215.3 kB 2 <0.1% 0 0% 213,525 99.2% 99.2% 0
29 FLAG_DOCUMENT_3 int8 215.3 kB 2 <0.1% 0 0% 152,845 71.0% 71.0% 1
30 FLAG_DOCUMENT_5 int8 215.3 kB 2 <0.1% 0 0% 212,025 98.5% 98.5% 0
31 FLAG_DOCUMENT_6 int8 215.3 kB 2 <0.1% 0 0% 196,348 91.2% 91.2% 0
32 FLAG_DOCUMENT_8 int8 215.3 kB 2 <0.1% 0 0% 197,689 91.8% 91.8% 0
33 FLAG_DOCUMENT_9 int8 215.3 kB 2 <0.1% 0 0% 214,440 99.6% 99.6% 0
34 FLAG_EMAIL int8 215.3 kB 2 <0.1% 0 0% 203,006 94.3% 94.3% 0
35 FLAG_EMP_PHONE int8 215.3 kB 2 <0.1% 0 0% 176,491 82.0% 82.0% 1
36 FLAG_IS_EMERGENCY Int8 430.5 kB 2 <0.1% 0 0% 213,628 99.2% 99.2% 0
37 FLAG_OWN_CAR Int8 430.5 kB 2 <0.1% 0 0% 142,086 66.0% 66.0% 0
38 FLAG_OWN_REALTY Int8 430.5 kB 2 <0.1% 0 0% 149,412 69.4% 69.4% 1
39 FLAG_PHONE int8 215.3 kB 2 <0.1% 0 0% 154,906 72.0% 72.0% 0
40 FLAG_WORK_PHONE int8 215.3 kB 2 <0.1% 0 0% 172,406 80.1% 80.1% 0
41 FLOORSMAX_MEDI float32 861.0 kB 49 <0.1% 106,970 49.7% 44,659 20.7% 41.2% 0.1667
42 FLOORSMIN_MEDI float32 861.0 kB 47 <0.1% 146,054 67.9% 23,733 11.0% 34.3% 0.2083
43 FONDKAPREMONT_MODE category 215.7 kB 4 <0.1% 147,099 68.3% 51,785 24.1% 76.0% reg oper account
44 HOUSETYPE_MODE category 215.6 kB 3 <0.1% 107,834 50.1% 105,515 49.0% 98.2% block of flats
45 LANDAREA_MEDI float32 861.0 kB 3,393 1.6% 127,644 59.3% 11,058 5.1% 12.6% 0.0
46 NAME_CONTRACT_TYPE category 215.5 kB 2 <0.1% 0 0% 194,675 90.4% 90.4% Cash loans
47 NAME_EDUCATION_TYPE category 215.8 kB 5 <0.1% 0 0% 152,993 71.1% 71.1% Secondary / secondary special
48 NAME_HOUSING_TYPE category 215.9 kB 6 <0.1% 0 0% 191,159 88.8% 88.8% House / apartment
49 NAME_INCOME_TYPE category 216.1 kB 8 <0.1% 0 0% 110,984 51.6% 51.6% Working
50 NAME_TYPE_SUITE category 216.0 kB 7 <0.1% 901 0.4% 174,089 80.9% 81.2% Unaccompanied
51 NONLIVINGAPARTMENTS_AVG float32 861.0 kB 345 0.2% 149,354 69.4% 38,319 17.8% 58.1% 0.0
52 NONLIVINGAREA_MODE float32 861.0 kB 3,090 1.4% 118,577 55.1% 46,933 21.8% 48.5% 0.0
53 OBS_30_CNT_SOCIAL_CIRCLE float32 861.0 kB 32 <0.1% 714 0.3% 114,550 53.2% 53.4% 0.0
54 OCCUPATION_TYPE category 217.1 kB 18 <0.1% 67,480 31.3% 38,591 17.9% 26.1% Laborers
55 ORGANIZATION_TYPE category 221.3 kB 57 <0.1% 38,756 18.0% 47,582 22.1% 27.0% Business Entity Type 3
56 OWN_CAR_AGE float32 861.0 kB 61 <0.1% 142,091 66.0% 5,232 2.4% 7.2% 7.0
57 REGION_POPULATION_RELATIVE float32 861.0 kB 81 <0.1% 0 0% 11,494 5.3% 5.3% 0.035792
58 REGION_RATING_CLIENT int8 215.3 kB 3 <0.1% 0 0% 158,846 73.8% 73.8% 2
59 REG_CITY_NOT_LIVE_CITY int8 215.3 kB 2 <0.1% 0 0% 198,549 92.2% 92.2% 0
60 REG_CITY_NOT_WORK_CITY int8 215.3 kB 2 <0.1% 0 0% 165,697 77.0% 77.0% 0
61 REG_REGION_NOT_LIVE_REGION int8 215.3 kB 2 <0.1% 0 0% 211,999 98.5% 98.5% 0
62 REG_REGION_NOT_WORK_REGION int8 215.3 kB 2 <0.1% 0 0% 204,222 94.9% 94.9% 0
63 WALLSMATERIAL_MODE category 216.0 kB 7 <0.1% 109,329 50.8% 46,298 21.5% 43.7% Panel
64 YEARS_BEGINEXPLUATATION_MODE float32 861.0 kB 210 0.1% 104,910 48.7% 3,039 1.4% 2.8% 0.9871
65 YEARS_BUILD_AVG float32 861.0 kB 146 0.1% 143,036 66.4% 2,132 1.0% 3.0% 0.8232
66 amt_annuity_max float64 1.7 MB 18,638 8.7% 159,480 74.1% 13,781 6.4% 24.7% 0.0
67 amt_annuity_max_previous_application float64 1.7 MB 110,598 51.4% 11,752 5.5% 2,363 1.1% 1.2% 22500.0
68 amt_annuity_median float64 1.7 MB 16,441 7.6% 159,480 74.1% 23,785 11.0% 42.6% 0.0
69 amt_annuity_median_previous_application float64 1.7 MB 157,063 73.0% 11,752 5.5% 1,357 0.6% 0.7% 11250.0
70 amt_annuity_min float64 1.7 MB 9,921 4.6% 159,480 74.1% 36,975 17.2% 66.3% 0.0
71 amt_annuity_min_previous_application float64 1.7 MB 113,816 52.9% 11,752 5.5% 16,017 7.4% 7.9% 2250.0
72 amt_annuity_to_credit_ratio float32 861.0 kB 33,148 15.4% 8 <0.1% 20,556 9.5% 9.5% 0.05
73 amt_annuity_to_income_per_family_member float64 1.7 MB 88,172 41.0% 9 <0.1% 1,500 0.7% 0.7% 0.3
74 amt_annuity_to_income_ratio float64 1.7 MB 71,916 33.4% 8 <0.1% 2,049 1.0% 1.0% 0.1
75 amt_balance_credit_card_max float64 1.7 MB 40,175 18.7% 154,158 71.6% 19,232 8.9% 31.5% 0.0
76 amt_balance_credit_card_median float64 1.7 MB 27,685 12.9% 154,158 71.6% 33,027 15.3% 54.1% 0.0
77 amt_balance_credit_card_min float64 1.7 MB 8,310 3.9% 154,158 71.6% 52,144 24.2% 85.3% 0.0
78 amt_credit_limit_actual_median float32 861.0 kB 151 0.1% 154,158 71.6% 7,600 3.5% 12.4% 0.0
79 amt_credit_limit_actual_range float32 861.0 kB 147 0.1% 154,158 71.6% 26,300 12.2% 43.0% 0.0
80 amt_credit_max float64 1.7 MB 49,618 23.1% 11,456 5.3% 4,696 2.2% 2.3% 450000.0
81 amt_credit_max_overdue_max float64 1.7 MB 32,871 15.3% 86,638 40.2% 79,549 37.0% 61.8% 0.0
82 amt_credit_max_overdue_range float64 1.7 MB 27,267 12.7% 86,638 40.2% 88,957 41.3% 69.2% 0.0
83 amt_credit_median float64 1.7 MB 73,966 34.4% 11,456 5.3% 8,095 3.8% 4.0% 0.0
84 amt_credit_min float64 1.7 MB 33,220 15.4% 11,456 5.3% 79,660 37.0% 39.1% 0.0
85 amt_credit_range float64 1.7 MB 71,950 33.4% 11,456 5.3% 37,038 17.2% 18.2% 0.0
86 amt_credit_sum_debt_mean float64 1.7 MB 121,544 56.5% 36,039 16.7% 48,543 22.6% 27.1% 0.0
87 amt_credit_sum_debt_median float64 1.7 MB 48,592 22.6% 36,039 16.7% 120,818 56.1% 67.4% 0.0
88 amt_credit_sum_debt_sum float64 1.7 MB 113,811 52.9% 30,836 14.3% 53,746 25.0% 29.1% 0.0
89 amt_credit_sum_limit_min float64 1.7 MB 2,121 1.0% 45,585 21.2% 167,209 77.7% 98.5% 0.0
90 amt_credit_sum_limit_std float64 1.7 MB 26,937 12.5% 80,896 37.6% 102,265 47.5% 76.1% 0.0
91 amt_credit_sum_limit_sum float64 1.7 MB 26,367 12.2% 30,836 14.3% 150,348 69.8% 81.5% 0.0
92 amt_credit_sum_median float64 1.7 MB 77,800 36.1% 30,836 14.3% 5,011 2.3% 2.7% 225000.0
93 amt_credit_sum_overdue_std float64 1.7 MB 1,618 0.8% 55,965 26.0% 157,060 73.0% 98.6% 0.0
94 amt_credit_sum_overdue_sum float64 1.7 MB 930 0.4% 30,836 14.3% 182,090 84.6% 98.7% 0.0
95 amt_credit_sum_std float64 1.7 MB 148,439 69.0% 55,965 26.0% 1,156 0.5% 0.7% 0.0
96 amt_credit_sum_sum float64 1.7 MB 147,742 68.6% 30,836 14.3% 924 0.4% 0.5% 225000.0
97 amt_credit_to_income_ratio float64 1.7 MB 39,372 18.3% 0 0% 3,691 1.7% 1.7% 2.0
98 amt_down_payment_max float64 1.7 MB 17,607 8.2% 23,703 11.0% 53,725 25.0% 28.0% 0.0
99 amt_down_payment_mean float64 1.7 MB 42,577 19.8% 23,703 11.0% 53,725 25.0% 28.0% 0.0
100 amt_drawings_atm_current_max float64 1.7 MB 1,131 0.5% 172,254 80.0% 6,929 3.2% 16.1% 0.0
101 amt_drawings_atm_current_median float64 1.7 MB 378 0.2% 172,254 80.0% 36,581 17.0% 85.1% 0.0
102 amt_drawings_atm_current_min float32 861.0 kB 114 0.1% 172,254 80.0% 42,401 19.7% 98.6% 0.0
103 amt_drawings_current_max float64 1.7 MB 17,325 8.0% 154,158 71.6% 19,196 8.9% 31.4% 0.0
104 amt_drawings_current_mean float64 1.7 MB 35,095 16.3% 154,158 71.6% 19,196 8.9% 31.4% 0.0
105 amt_drawings_current_min float64 1.7 MB 1,475 0.7% 154,158 71.6% 59,264 27.5% 97.0% 0.0
106 amt_drawings_other_current_max float64 1.7 MB 1,084 0.5% 172,254 80.0% 38,999 18.1% 90.7% 0.0
107 amt_drawings_pos_current_max float64 1.7 MB 20,726 9.6% 172,254 80.0% 19,027 8.8% 44.2% 0.0
108 amt_drawings_pos_current_mean float64 1.7 MB 23,516 10.9% 172,254 80.0% 19,027 8.8% 44.2% 0.0
109 amt_drawings_pos_current_min float64 1.7 MB 1,772 0.8% 172,254 80.0% 41,083 19.1% 95.5% 0.0
110 amt_goods_price_min float64 1.7 MB 39,170 18.2% 12,169 5.7% 11,596 5.4% 5.7% 45000.0
111 amt_inst_min_regularity_min float64 1.7 MB 1,664 0.8% 154,158 71.6% 57,788 26.8% 94.6% 0.0
112 amt_payment_current_median float64 1.7 MB 17,066 7.9% 172,336 80.1% 2,689 1.2% 6.3% 9000.0
113 amt_payment_current_min float64 1.7 MB 7,398 3.4% 172,336 80.1% 26,925 12.5% 62.7% 0.0
114 amt_payment_current_range float64 1.7 MB 22,545 10.5% 172,336 80.1% 682 0.3% 1.6% 0.0
115 amt_payment_total_current_min float64 1.7 MB 1,131 0.5% 154,158 71.6% 59,285 27.5% 97.0% 0.0
116 any_installments_late_30 float32 861.0 kB 2 <0.1% 11,034 5.1% 190,963 88.7% 93.5% 0.0
117 any_installments_late_60 float32 861.0 kB 2 <0.1% 11,034 5.1% 198,146 92.1% 97.0% 0.0
118 any_installments_late_7 float32 861.0 kB 2 <0.1% 11,034 5.1% 147,558 68.5% 72.3% 0.0
119 bureau_dpd_status_max float32 861.0 kB 6 <0.1% 152,586 70.9% 41,042 19.1% 65.5% 0.0
120 bureau_dpd_status_median float32 861.0 kB 11 <0.1% 152,586 70.9% 61,726 28.7% 98.5% 0.0
121 bureau_months_balance_max float32 861.0 kB 89 <0.1% 152,586 70.9% 59,695 27.7% 95.3% 0.0
122 cnt_credit_prolong_mean float32 861.0 kB 100 <0.1% 30,836 14.3% 178,412 82.9% 96.7% 0.0
123 cnt_credit_prolong_sum float32 861.0 kB 10 <0.1% 30,836 14.3% 178,412 82.9% 96.7% 0.0
124 cnt_drawings_atm_current_max float32 861.0 kB 43 <0.1% 172,254 80.0% 6,929 3.2% 16.1% 0.0
125 cnt_drawings_atm_current_std float32 861.0 kB 16,770 7.8% 172,561 80.2% 6,817 3.2% 16.0% 0.0
126 cnt_drawings_current_min float32 861.0 kB 39 <0.1% 154,158 71.6% 59,278 27.5% 97.0% 0.0
127 cnt_drawings_other_current_max float32 861.0 kB 11 <0.1% 172,254 80.0% 38,987 18.1% 90.7% 0.0
128 cnt_drawings_pos_current_max float32 861.0 kB 116 0.1% 172,254 80.0% 19,027 8.8% 44.2% 0.0
129 cnt_drawings_pos_current_median float32 861.0 kB 113 0.1% 172,254 80.0% 33,721 15.7% 78.4% 0.0
130 cnt_drawings_pos_current_min float32 861.0 kB 40 <0.1% 172,254 80.0% 41,083 19.1% 95.5% 0.0
131 cnt_fam_members_excluding_children float32 861.0 kB 2 <0.1% 1 <0.1% 158,301 73.5% 73.5% 2.0
132 cnt_installment_future_min float32 861.0 kB 61 <0.1% 12,588 5.8% 183,466 85.2% 90.5% 0.0
133 cnt_installment_mature_cum_max float32 861.0 kB 120 0.1% 154,158 71.6% 19,249 8.9% 31.5% 0.0
134 cnt_installment_mature_cum_min float32 861.0 kB 28 <0.1% 154,158 71.6% 38,853 18.0% 63.6% 0.0
135 cnt_installment_median float32 861.0 kB 103 <0.1% 12,588 5.8% 61,162 28.4% 30.2% 12.0
136 cnt_installment_min float32 861.0 kB 53 <0.1% 12,588 5.8% 42,362 19.7% 20.9% 6.0
137 cnt_installment_range float32 861.0 kB 69 <0.1% 12,588 5.8% 49,692 23.1% 24.5% 0.0
138 cnt_installments_diff_mean float32 861.0 kB 20,290 9.4% 12,588 5.8% 9,014 4.2% 4.4% 3.0
139 cnt_installments_diff_min float32 861.0 kB 58 <0.1% 12,588 5.8% 198,083 92.0% 97.7% 0.0
140 cnt_installments_diff_range float32 861.0 kB 82 <0.1% 12,588 5.8% 35,742 16.6% 17.6% 12.0
141 cnt_payment_median float32 861.0 kB 87 <0.1% 11,752 5.5% 53,998 25.1% 26.5% 12.0
142 cnt_payment_min float32 861.0 kB 31 <0.1% 11,752 5.5% 68,588 31.9% 33.7% 0.0
143 cnt_payment_range float32 861.0 kB 69 <0.1% 11,752 5.5% 54,639 25.4% 26.8% 0.0
144 days_credit_enddate_max Int32 1.1 MB 12,274 5.7% 32,432 15.1% 187 0.1% 0.1% 31060
145 days_credit_enddate_min Int32 1.1 MB 6,266 2.9% 32,432 15.1% 119 0.1% 0.1% -2359
146 days_credit_enddate_std Float64 1.9 MB 134,001 62.3% 59,197 27.5% 1,369 0.6% 0.9% 0.0
147 days_credit_max float32 861.0 kB 2,922 1.4% 30,836 14.3% 480 0.2% 0.3% -91.0
148 days_credit_median float32 861.0 kB 5,711 2.7% 30,836 14.3% 118 0.1% 0.1% -561.0
149 days_credit_overdue_max float32 861.0 kB 671 0.3% 30,836 14.3% 182,056 84.6% 98.7% 0.0
150 days_credit_overdue_mean float32 861.0 kB 1,195 0.6% 30,836 14.3% 182,056 84.6% 98.7% 0.0
151 days_credit_overdue_median float32 861.0 kB 168 0.1% 30,836 14.3% 184,119 85.5% 99.8% 0.0
152 days_credit_range float32 861.0 kB 2,913 1.4% 30,836 14.3% 26,512 12.3% 14.4% 0.0
153 days_credit_std float32 861.0 kB 133,052 61.8% 55,965 26.0% 1,383 0.6% 0.9% 0.0
154 days_credit_update_max float32 861.0 kB 2,585 1.2% 30,836 14.3% 7,529 3.5% 4.1% -7.0
155 days_credit_update_median float32 861.0 kB 4,779 2.2% 30,836 14.3% 1,055 0.5% 0.6% -22.0
156 days_credit_update_range float32 861.0 kB 2,925 1.4% 30,836 14.3% 27,014 12.5% 14.6% 0.0
157 days_decision_max float32 861.0 kB 2,921 1.4% 11,456 5.3% 598 0.3% 0.3% -7.0
158 days_decision_median float32 861.0 kB 5,656 2.6% 11,456 5.3% 255 0.1% 0.1% -364.0
159 days_decision_range float32 861.0 kB 2,919 1.4% 11,456 5.3% 40,565 18.8% 19.9% 0.0
160 days_enddate_fact_max Int16 645.8 kB 2,793 1.3% 53,870 25.0% 340 0.2% 0.2% -84
161 days_enddate_fact_median Float32 1.1 MB 5,341 2.5% 53,870 25.0% 135 0.1% 0.1% -919.0
162 days_enddate_fact_range Int32 1.1 MB 2,796 1.3% 53,870 25.0% 38,623 17.9% 23.9% 0
163 days_first_draw_min float32 861.0 kB 2,718 1.3% 12,377 5.7% 165,404 76.8% 81.5% 365243.0
164 days_last_due_1st_version_max float32 861.0 kB 4,521 2.1% 12,377 5.7% 55,263 25.7% 27.2% 365243.0
165 days_last_due_1st_version_mean float32 861.0 kB 51,499 23.9% 12,377 5.7% 1,911 0.9% 0.9% 365243.0
166 days_last_due_1st_version_median float32 861.0 kB 10,719 5.0% 12,377 5.7% 1,937 0.9% 1.0% 365243.0
167 days_last_due_1st_version_min float32 861.0 kB 4,081 1.9% 12,377 5.7% 1,911 0.9% 0.9% 365243.0
168 days_last_due_max float32 861.0 kB 2,761 1.3% 12,377 5.7% 98,527 45.8% 48.6% 365243.0
169 days_last_due_range float32 861.0 kB 5,592 2.6% 12,377 5.7% 58,659 27.3% 28.9% 0.0
170 days_termination_median float32 861.0 kB 7,716 3.6% 12,377 5.7% 23,269 10.8% 11.5% 365243.0
171 days_termination_min float32 861.0 kB 2,797 1.3% 12,377 5.7% 15,833 7.4% 7.8% 365243.0
172 diff_amt_installment_payment_max float64 1.7 MB 75,445 35.0% 11,037 5.1% 116,518 54.1% 57.1% 0.0
173 diff_amt_installment_payment_mean float64 1.7 MB 97,257 45.2% 11,037 5.1% 103,060 47.9% 50.5% 0.0
174 diff_amt_installment_payment_median float64 1.7 MB 6,855 3.2% 11,037 5.1% 195,960 91.0% 96.0% 0.0
175 diff_amt_installment_payment_range float64 1.7 MB 90,195 41.9% 11,037 5.1% 103,062 47.9% 50.5% 0.0
176 diff_days_installment_payment_max float32 861.0 kB 409 0.2% 11,037 5.1% 15,321 7.1% 7.5% 30.0
177 diff_days_installment_payment_mean float32 861.0 kB 50,246 23.3% 11,037 5.1% 761 0.4% 0.4% 9.0
178 diff_days_installment_payment_median float32 861.0 kB 320 0.1% 11,037 5.1% 21,620 10.0% 10.6% 0.0
179 diff_days_installment_payment_range float32 861.0 kB 1,465 0.7% 11,037 5.1% 5,349 2.5% 2.6% 30.0
180 diff_days_installment_payment_sum float32 861.0 kB 4,383 2.0% 11,034 5.1% 540 0.3% 0.3% 66.0
181 diff_days_installment_payment_sum_late_only float32 861.0 kB 1,815 0.8% 11,034 5.1% 95,670 44.4% 46.8% 0.0
182 diff_percent_installment_payment_mean float64 1.7 MB 87,934 40.9% 11,037 5.1% 103,191 47.9% 50.5% 1.0
183 diff_percent_installment_payment_median float32 861.0 kB 7,969 3.7% 11,037 5.1% 195,960 91.0% 96.0% 1.0
184 diff_percent_installment_payment_min float32 861.0 kB 25,589 11.9% 11,037 5.1% 177,973 82.7% 87.1% 1.0
185 diff_percent_installment_payment_range float64 1.7 MB 97,055 45.1% 11,037 5.1% 103,190 47.9% 50.5% 0.0
186 mode_credit_type category 215.8 kB 6 <0.1% 30,836 14.3% 160,802 74.7% 87.2% Consumer credit
187 n_car_loans float32 861.0 kB 9 <0.1% 30,836 14.3% 170,683 79.3% 92.6% 0.0
188 n_cash_loans float32 861.0 kB 55 <0.1% 11,456 5.3% 83,697 38.9% 41.1% 0.0
189 n_channel_type_ap_minus float32 861.0 kB 33 <0.1% 11,456 5.3% 187,751 87.2% 92.1% 0.0
190 n_channel_type_channel_corporate_sales float32 861.0 kB 20 <0.1% 11,456 5.3% 202,289 94.0% 99.3% 0.0
191 n_channel_type_contact_center float32 861.0 kB 19 <0.1% 11,456 5.3% 175,621 81.6% 86.2% 0.0
192 n_channel_type_countrywide float32 861.0 kB 34 <0.1% 11,456 5.3% 67,466 31.3% 33.1% 1.0
193 n_channel_type_credit_and_cash float32 861.0 kB 52 <0.1% 11,456 5.3% 96,482 44.8% 47.3% 0.0
194 n_channel_type_regional_and_local float32 861.0 kB 19 <0.1% 11,456 5.3% 158,328 73.6% 77.7% 0.0
195 n_channel_type_stone float32 861.0 kB 22 <0.1% 11,456 5.3% 121,683 56.5% 59.7% 0.0
196 n_client_type_new float32 861.0 kB 14 <0.1% 11,456 5.3% 154,064 71.6% 75.6% 1.0
197 n_client_type_refreshed float32 861.0 kB 23 <0.1% 11,456 5.3% 150,108 69.7% 73.7% 0.0
198 n_client_type_repeater float32 861.0 kB 61 <0.1% 11,456 5.3% 49,122 22.8% 24.1% 0.0
199 n_consumer_loans float32 861.0 kB 36 <0.1% 11,456 5.3% 78,331 36.4% 38.4% 1.0
200 n_contract_status_refused float32 861.0 kB 44 <0.1% 11,456 5.3% 133,394 62.0% 65.5% 0.0
201 n_contract_status_unused_offer float32 861.0 kB 11 <0.1% 11,456 5.3% 190,553 88.5% 93.5% 0.0
202 n_contracts_credit_card_completed float32 861.0 kB 40 <0.1% 154,158 71.6% 53,625 24.9% 87.8% 0.0
203 n_credit_card_credits float32 861.0 kB 22 <0.1% 30,836 14.3% 63,863 29.7% 34.6% 0.0
204 n_credits_active float32 861.0 kB 22 <0.1% 30,836 14.3% 51,735 24.0% 28.1% 1.0
205 n_credits_sold float32 861.0 kB 7 <0.1% 30,836 14.3% 180,711 84.0% 98.0% 0.0
206 n_credits_total float32 861.0 kB 57 <0.1% 30,836 14.3% 25,129 11.7% 13.6% 1.0
207 n_currency_2 float32 861.0 kB 7 <0.1% 30,836 14.3% 183,835 85.4% 99.7% 0.0
208 n_different_channels float32 861.0 kB 7 <0.1% 11,456 5.3% 79,085 36.7% 38.8% 2.0
209 n_different_contract_types float32 861.0 kB 4 <0.1% 11,456 5.3% 77,974 36.2% 38.3% 2.0
210 n_different_credit_types float32 861.0 kB 5 <0.1% 30,836 14.3% 100,733 46.8% 54.6% 2.0
211 n_different_currencies float32 861.0 kB 3 <0.1% 30,836 14.3% 183,765 85.4% 99.6% 1.0
212 n_installments_late float32 861.0 kB 99 <0.1% 11,034 5.1% 95,670 44.4% 46.8% 0.0
213 n_installments_late_30 float32 861.0 kB 42 <0.1% 11,034 5.1% 190,963 88.7% 93.5% 0.0
214 n_installments_late_7 float32 861.0 kB 59 <0.1% 11,034 5.1% 147,558 68.5% 72.3% 0.0
215 n_installments_total float32 861.0 kB 310 0.1% 11,034 5.1% 8,624 4.0% 4.2% 12.0
216 n_microloans float32 861.0 kB 28 <0.1% 30,836 14.3% 181,975 84.5% 98.7% 0.0
217 n_mortgages float32 861.0 kB 7 <0.1% 30,836 14.3% 174,434 81.0% 94.6% 0.0
218 n_nflag_insured_on_approval_mean float32 861.0 kB 102 <0.1% 12,377 5.7% 95,675 44.4% 47.2% 0.0
219 n_nflag_insured_on_approval_sum float32 861.0 kB 19 <0.1% 11,456 5.3% 96,596 44.9% 47.4% 0.0
220 n_other_type_credit float32 861.0 kB 9 <0.1% 30,836 14.3% 182,373 84.7% 98.9% 0.0
221 n_payment_type_cash_through_bank float32 861.0 kB 44 <0.1% 11,456 5.3% 54,943 25.5% 27.0% 1.0
222 n_payment_type_not_available float32 861.0 kB 46 <0.1% 11,456 5.3% 71,796 33.4% 35.2% 0.0
223 n_previous_credit_card_applications float32 861.0 kB 126 0.1% 154,158 71.6% 4,332 2.0% 7.1% 96.0
224 n_previous_credit_card_applications_signed float32 861.0 kB 37 <0.1% 154,158 71.6% 58,091 27.0% 95.1% 0.0
225 n_previous_pos_applications float32 861.0 kB 221 0.1% 12,570 5.8% 9,559 4.4% 4.7% 13.0
226 n_previous_pos_applications_completed float32 861.0 kB 45 <0.1% 12,570 5.8% 73,226 34.0% 36.1% 1.0
227 n_previous_pos_applications_signed float32 861.0 kB 31 <0.1% 12,570 5.8% 162,017 75.3% 79.9% 0.0
228 n_product_type_walk_in float32 861.0 kB 28 <0.1% 11,456 5.3% 152,783 71.0% 75.0% 0.0
229 n_reject_reason_limit float32 861.0 kB 22 <0.1% 11,456 5.3% 183,819 85.4% 90.2% 0.0
230 n_reject_reason_scoc float32 861.0 kB 20 <0.1% 11,456 5.3% 188,558 87.6% 92.5% 0.0
231 n_reject_reason_scofr float32 861.0 kB 16 <0.1% 11,456 5.3% 199,055 92.5% 97.7% 0.0
232 n_revolving_loans float32 861.0 kB 25 <0.1% 11,456 5.3% 130,792 60.8% 64.2% 0.0
233 n_yield_group_high float32 861.0 kB 30 <0.1% 11,456 5.3% 89,153 41.4% 43.7% 0.0
234 n_yield_group_low_action float32 861.0 kB 22 <0.1% 11,456 5.3% 163,415 75.9% 80.2% 0.0
235 n_yield_group_low_normal float32 861.0 kB 23 <0.1% 11,456 5.3% 94,724 44.0% 46.5% 0.0
236 n_yield_group_middle float32 861.0 kB 25 <0.1% 11,456 5.3% 80,043 37.2% 39.3% 0.0
237 ord_education_type int8 215.3 kB 5 <0.1% 0 0% 152,993 71.1% 71.1% 1
238 percent_installments_early float32 861.0 kB 7,892 3.7% 11,034 5.1% 64,688 30.1% 31.7% 1.0
239 percent_installments_late float32 861.0 kB 4,464 2.1% 11,034 5.1% 95,670 44.4% 46.8% 0.0
240 percent_installments_late_30 float32 861.0 kB 894 0.4% 11,034 5.1% 190,963 88.7% 93.5% 0.0
241 percent_installments_late_60 float32 861.0 kB 629 0.3% 11,034 5.1% 198,146 92.1% 97.0% 0.0
242 percent_installments_late_7 float32 861.0 kB 2,595 1.2% 11,034 5.1% 147,558 68.5% 72.3% 0.0
243 rate_down_payment_max float32 861.0 kB 84,883 39.4% 23,703 11.0% 53,725 25.0% 28.0% 0.0
244 rate_down_payment_range float32 861.0 kB 73,615 34.2% 23,703 11.0% 94,887 44.1% 49.5% 0.0
245 rate_interest_privileged_count float32 861.0 kB 4 <0.1% 11,456 5.3% 200,560 93.2% 98.4% 0.0
246 sk_dpd_credit_card_max float32 861.0 kB 353 0.2% 154,158 71.6% 48,474 22.5% 79.3% 0.0
247 sk_dpd_credit_card_median float32 861.0 kB 222 0.1% 154,158 71.6% 60,546 28.1% 99.1% 0.0
248 sk_dpd_def_credit_card_max float32 861.0 kB 47 <0.1% 154,158 71.6% 50,652 23.5% 82.9% 0.0
249 sk_dpd_def_pos_applications_max float32 861.0 kB 173 0.1% 12,570 5.8% 174,617 81.1% 86.2% 0.0
250 sk_dpd_pos_applications_max float32 861.0 kB 1,595 0.7% 12,570 5.8% 164,332 76.3% 81.1% 0.0
251 years_employed float64 1.7 MB 11,769 5.5% 38,756 18.0% 112 0.1% 0.1% 0.6273972602739726
Code
# Save to file
file_path = dir_interim + "colnames--cols_to_include_in_preprocessing.csv"
before_preproc_col_info.column.to_csv(file_path, index=False)

# Read from file (to check)
cols_to_include_in_preprocessing = pd.read_csv(file_path).column.tolist()
del file_path

6.1.2 Pre-Processing

Next, the data are pre-processed with the following pipeline:

  1. Keep only the columns selected in the previous step (all other columns are removed).
  2. Apply different pre-processing steps to different data types:
    1. For numeric data, use SimpleImputer to impute missing values with the median and to create missing value indicators;
    2. For categorical data, use OneHotEncoder to encode the categories and then fix the resulting column names to be in snake case;
    3. Other types of data (if any) are left unchanged.
Code
pipeline_pre_processing = Pipeline(
    steps=[
        ("selector", ColumnSelector(cols_to_include_in_preprocessing)),
        ("preprocessor", clone(pre_processing)),
    ]
)

pipeline_pre_processing
Pipeline(steps=[('selector',
                 ColumnSelector(keep=['AMT_ANNUITY', 'AMT_CREDIT',
                                      'AMT_INCOME_TOTAL',
                                      'AMT_REQ_CREDIT_BUREAU_DAY',
                                      'AMT_REQ_CREDIT_BUREAU_HOUR',
                                      'AMT_REQ_CREDIT_BUREAU_MON',
                                      'AMT_REQ_CREDIT_BUREAU_QRT',
                                      'AMT_REQ_CREDIT_BUREAU_WEEK',
                                      'AMT_REQ_CREDIT_BUREAU_YEAR',
                                      'BASEMENTAREA_MODE', 'CNT_FAM_MEMBERS',
                                      'COMMONAREA_MEDI', 'DAYS_ID_PUBLISH',
                                      'DAYS_LAST...
                                                                                 strategy='median'))]),
                                                  <sklearn.compose._column_transformer.make_column_selector object at 0x0000027679139410>),
                                                 ('categorical',
                                                  Pipeline(steps=[('onehot',
                                                                   OneHotEncoder(handle_unknown='ignore',
                                                                                 sparse_output=False)),
                                                                  ('clean_names',
                                                                   CleanColumnNames())]),
                                                  <sklearn.compose._column_transformer.make_column_selector object at 0x0000027726FD0B90>)],
                                   verbose_feature_names_out=False))])
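The `clean_names` step in the pipeline above relies on a custom `CleanColumnNames` transformer whose implementation is not shown in this section. As a rough illustration only, the function below replaces every run of non-alphanumeric characters with a single underscore. The function name `clean_column_name` and the raw input names are hypothetical; the rule was chosen because it yields names of the same form as those in the transformed data (e.g., `ORGANIZATION_TYPE_Industry_type_1`), where letter case is preserved.

```python
import re


def clean_column_name(name: str) -> str:
    """Replace each run of non-alphanumeric characters with one underscore."""
    return re.sub(r"[^0-9A-Za-z]+", "_", name).strip("_")


# Hypothetical raw names, as a one-hot encoder might produce them
raw_names = [
    "OCCUPATION_TYPE_Waiters/barmen staff",
    "NAME_HOUSING_TYPE_House / apartment",
    "ORGANIZATION_TYPE_Industry: type 1",
]

cleaned = [clean_column_name(name) for name in raw_names]
print(cleaned)
```

Names without spaces or punctuation are easier to reference in code and are accepted by ML libraries that restrict the characters allowed in feature names.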
Code
credits_train_transformed = pipeline_pre_processing.fit_transform(credits_train)

Let’s look at the transformed data:

Code
credits_train_transformed.shape
(215257, 580)
Code
credits_train_transformed.head()
AMT_ANNUITY AMT_CREDIT AMT_INCOME_TOTAL AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_YEAR BASEMENTAREA_MODE CNT_FAM_MEMBERS COMMONAREA_MEDI DAYS_ID_PUBLISH DAYS_LAST_PHONE_CHANGE DAYS_REGISTRATION DEF_30_CNT_SOCIAL_CIRCLE ELEVATORS_AVG ELEVATORS_MEDI ENTRANCES_MODE EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3 FLAG_CONT_MOBILE FLAG_DOCUMENT_11 FLAG_DOCUMENT_13 FLAG_DOCUMENT_14 FLAG_DOCUMENT_16 FLAG_DOCUMENT_18 FLAG_DOCUMENT_3 FLAG_DOCUMENT_5 FLAG_DOCUMENT_6 FLAG_DOCUMENT_8 FLAG_DOCUMENT_9 FLAG_EMAIL FLAG_EMP_PHONE FLAG_IS_EMERGENCY FLAG_OWN_CAR FLAG_OWN_REALTY FLAG_PHONE FLAG_WORK_PHONE FLOORSMAX_MEDI FLOORSMIN_MEDI LANDAREA_MEDI NONLIVINGAPARTMENTS_AVG NONLIVINGAREA_MODE OBS_30_CNT_SOCIAL_CIRCLE OWN_CAR_AGE REGION_POPULATION_RELATIVE REGION_RATING_CLIENT REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION YEARS_BEGINEXPLUATATION_MODE YEARS_BUILD_AVG amt_annuity_max amt_annuity_max_previous_application amt_annuity_median amt_annuity_median_previous_application amt_annuity_min amt_annuity_min_previous_application amt_annuity_to_credit_ratio amt_annuity_to_income_per_family_member amt_annuity_to_income_ratio amt_balance_credit_card_max amt_balance_credit_card_median amt_balance_credit_card_min amt_credit_limit_actual_median amt_credit_limit_actual_range amt_credit_max amt_credit_max_overdue_max amt_credit_max_overdue_range amt_credit_median amt_credit_min amt_credit_range amt_credit_sum_debt_mean amt_credit_sum_debt_median amt_credit_sum_debt_sum amt_credit_sum_limit_min amt_credit_sum_limit_std amt_credit_sum_limit_sum amt_credit_sum_median amt_credit_sum_overdue_std amt_credit_sum_overdue_sum amt_credit_sum_std amt_credit_sum_sum amt_credit_to_income_ratio amt_down_payment_max amt_down_payment_mean amt_drawings_atm_current_max amt_drawings_atm_current_median amt_drawings_atm_current_min amt_drawings_current_max 
amt_drawings_current_mean amt_drawings_current_min amt_drawings_other_current_max amt_drawings_pos_current_max amt_drawings_pos_current_mean amt_drawings_pos_current_min amt_goods_price_min amt_inst_min_regularity_min amt_payment_current_median amt_payment_current_min amt_payment_current_range amt_payment_total_current_min any_installments_late_30 any_installments_late_60 any_installments_late_7 bureau_dpd_status_max bureau_dpd_status_median bureau_months_balance_max cnt_credit_prolong_mean cnt_credit_prolong_sum cnt_drawings_atm_current_max cnt_drawings_atm_current_std cnt_drawings_current_min cnt_drawings_other_current_max cnt_drawings_pos_current_max cnt_drawings_pos_current_median cnt_drawings_pos_current_min cnt_fam_members_excluding_children cnt_installment_future_min cnt_installment_mature_cum_max cnt_installment_mature_cum_min cnt_installment_median cnt_installment_min cnt_installment_range cnt_installments_diff_mean cnt_installments_diff_min cnt_installments_diff_range cnt_payment_median cnt_payment_min cnt_payment_range days_credit_enddate_max days_credit_enddate_min days_credit_enddate_std days_credit_max days_credit_median days_credit_overdue_max days_credit_overdue_mean days_credit_overdue_median days_credit_range days_credit_std days_credit_update_max days_credit_update_median days_credit_update_range days_decision_max days_decision_median days_decision_range days_enddate_fact_max ... 
missingindicator_n_reject_reason_scofr missingindicator_n_revolving_loans missingindicator_n_yield_group_high missingindicator_n_yield_group_low_action missingindicator_n_yield_group_low_normal missingindicator_n_yield_group_middle missingindicator_percent_installments_early missingindicator_percent_installments_late missingindicator_percent_installments_late_30 missingindicator_percent_installments_late_60 missingindicator_percent_installments_late_7 missingindicator_rate_down_payment_max missingindicator_rate_down_payment_range missingindicator_rate_interest_privileged_count missingindicator_sk_dpd_credit_card_max missingindicator_sk_dpd_credit_card_median missingindicator_sk_dpd_def_credit_card_max missingindicator_sk_dpd_def_pos_applications_max missingindicator_sk_dpd_pos_applications_max missingindicator_years_employed FONDKAPREMONT_MODE_not_specified FONDKAPREMONT_MODE_org_spec_account FONDKAPREMONT_MODE_reg_oper_account FONDKAPREMONT_MODE_reg_oper_spec_account FONDKAPREMONT_MODE_nan HOUSETYPE_MODE_block_of_flats HOUSETYPE_MODE_specific_housing HOUSETYPE_MODE_terraced_house HOUSETYPE_MODE_nan NAME_CONTRACT_TYPE_Cash_loans NAME_CONTRACT_TYPE_Revolving_loans NAME_EDUCATION_TYPE_Academic_degree NAME_EDUCATION_TYPE_Higher_education NAME_EDUCATION_TYPE_Incomplete_higher NAME_EDUCATION_TYPE_Lower_secondary NAME_EDUCATION_TYPE_Secondary_secondary_special NAME_HOUSING_TYPE_Co_op_apartment NAME_HOUSING_TYPE_House_apartment NAME_HOUSING_TYPE_Municipal_apartment NAME_HOUSING_TYPE_Office_apartment NAME_HOUSING_TYPE_Rented_apartment NAME_HOUSING_TYPE_With_parents NAME_INCOME_TYPE_Businessman NAME_INCOME_TYPE_Commercial_associate NAME_INCOME_TYPE_Maternity_leave NAME_INCOME_TYPE_Pensioner NAME_INCOME_TYPE_State_servant NAME_INCOME_TYPE_Student NAME_INCOME_TYPE_Unemployed NAME_INCOME_TYPE_Working NAME_TYPE_SUITE_Children NAME_TYPE_SUITE_Family NAME_TYPE_SUITE_Group_of_people NAME_TYPE_SUITE_Other_A NAME_TYPE_SUITE_Other_B NAME_TYPE_SUITE_Spouse_partner 
NAME_TYPE_SUITE_Unaccompanied NAME_TYPE_SUITE_nan OCCUPATION_TYPE_Accountants OCCUPATION_TYPE_Cleaning_staff OCCUPATION_TYPE_Cooking_staff OCCUPATION_TYPE_Core_staff OCCUPATION_TYPE_Drivers OCCUPATION_TYPE_HR_staff OCCUPATION_TYPE_High_skill_tech_staff OCCUPATION_TYPE_IT_staff OCCUPATION_TYPE_Laborers OCCUPATION_TYPE_Low_skill_Laborers OCCUPATION_TYPE_Managers OCCUPATION_TYPE_Medicine_staff OCCUPATION_TYPE_Private_service_staff OCCUPATION_TYPE_Realty_agents OCCUPATION_TYPE_Sales_staff OCCUPATION_TYPE_Secretaries OCCUPATION_TYPE_Security_staff OCCUPATION_TYPE_Waiters_barmen_staff OCCUPATION_TYPE_nan ORGANIZATION_TYPE_Advertising ORGANIZATION_TYPE_Agriculture ORGANIZATION_TYPE_Bank ORGANIZATION_TYPE_Business_Entity_Type_1 ORGANIZATION_TYPE_Business_Entity_Type_2 ORGANIZATION_TYPE_Business_Entity_Type_3 ORGANIZATION_TYPE_Cleaning ORGANIZATION_TYPE_Construction ORGANIZATION_TYPE_Culture ORGANIZATION_TYPE_Electricity ORGANIZATION_TYPE_Emergency ORGANIZATION_TYPE_Government ORGANIZATION_TYPE_Hotel ORGANIZATION_TYPE_Housing ORGANIZATION_TYPE_Industry_type_1 ORGANIZATION_TYPE_Industry_type_10 ORGANIZATION_TYPE_Industry_type_11 ORGANIZATION_TYPE_Industry_type_12 ORGANIZATION_TYPE_Industry_type_13 ORGANIZATION_TYPE_Industry_type_2 ORGANIZATION_TYPE_Industry_type_3 ORGANIZATION_TYPE_Industry_type_4 ORGANIZATION_TYPE_Industry_type_5 ORGANIZATION_TYPE_Industry_type_6 ORGANIZATION_TYPE_Industry_type_7 ORGANIZATION_TYPE_Industry_type_8 ORGANIZATION_TYPE_Industry_type_9 ORGANIZATION_TYPE_Insurance ORGANIZATION_TYPE_Kindergarten ORGANIZATION_TYPE_Legal_Services ORGANIZATION_TYPE_Medicine ORGANIZATION_TYPE_Military ORGANIZATION_TYPE_Mobile ORGANIZATION_TYPE_Other ORGANIZATION_TYPE_Police ORGANIZATION_TYPE_Postal ORGANIZATION_TYPE_Realtor ORGANIZATION_TYPE_Religion ORGANIZATION_TYPE_Restaurant ORGANIZATION_TYPE_School ORGANIZATION_TYPE_Security ORGANIZATION_TYPE_Security_Ministries ORGANIZATION_TYPE_Self_employed ORGANIZATION_TYPE_Services ORGANIZATION_TYPE_Telecom 
ORGANIZATION_TYPE_Trade_type_1 ORGANIZATION_TYPE_Trade_type_2 ORGANIZATION_TYPE_Trade_type_3 ORGANIZATION_TYPE_Trade_type_4 ORGANIZATION_TYPE_Trade_type_5 ORGANIZATION_TYPE_Trade_type_6 ORGANIZATION_TYPE_Trade_type_7 ORGANIZATION_TYPE_Transport_type_1 ORGANIZATION_TYPE_Transport_type_2 ORGANIZATION_TYPE_Transport_type_3 ORGANIZATION_TYPE_Transport_type_4 ORGANIZATION_TYPE_University ORGANIZATION_TYPE_nan WALLSMATERIAL_MODE_Block WALLSMATERIAL_MODE_Mixed WALLSMATERIAL_MODE_Monolithic WALLSMATERIAL_MODE_Others WALLSMATERIAL_MODE_Panel WALLSMATERIAL_MODE_Stone_brick WALLSMATERIAL_MODE_Wooden WALLSMATERIAL_MODE_nan mode_credit_type_Car_loan mode_credit_type_Consumer_credit mode_credit_type_Credit_card mode_credit_type_Microloan mode_credit_type_Mortgage mode_credit_type_Other mode_credit_type_nan
0 68643.00 1971072.00 405000.00 0.00 0.00 0.00 0.00 0.00 0.00 0.10 4.00 0.02 -1823.00 -2169.00 -7460.00 0.00 0.00 0.00 0.24 0.68 0.33 0.64 1.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 1.00 1.00 0.00 0.00 0.17 0.21 0.00 0.00 0.03 4.00 13.00 0.01 3.00 0.00 0.00 0.00 0.00 0.98 0.78 45459.00 5920.02 27009.00 5920.02 0.00 5920.02 0.03 0.68 0.17 97790.49 0.00 0.00 157500.00 45000.00 51034.50 0.00 0.00 51034.50 51034.50 0.00 297855.00 161358.75 1191420.00 0.00 0.00 0.00 346479.75 0.00 0.00 522819.33 2141271.18 4.87 5175.00 5175.00 90000.00 0.00 0.00 69750.00 3498.70 0.00 0.00 6300.00 303.43 0.00 51610.50 0.00 5850.00 0.00 63000.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 3.00 0.75 0.00 0.00 1.00 0.00 0.00 2.00 0.00 7.00 0.00 12.00 12.00 0.00 6.00 0.00 12.00 12.00 12.00 0.00 934.00 -746.00 698.62 -145.00 -1001.50 0.00 0.00 0.00 1094.00 489.28 -7.00 -189.50 734.00 -2169.00 -2169.00 0.00 -362.00 ... 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 1.00 1.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 1.00 0.00 0.00 0.00 1.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00
1 38146.50 508495.50 337500.00 0.00 0.00 0.00 0.00 0.00 6.00 0.07 2.00 0.02 -1090.00 -659.00 -4054.00 1.00 0.00 0.00 0.14 0.51 0.62 0.44 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 1.00 0.00 0.00 1.00 0.00 0.00 0.17 0.21 0.05 0.00 0.00 2.00 9.00 0.01 2.00 0.00 0.00 0.00 0.00 0.98 0.76 12500.01 38443.23 3942.00 38250.00 0.00 28879.88 0.08 0.23 0.11 0.00 0.00 0.00 765000.00 0.00 765000.00 0.00 0.00 404878.50 0.00 765000.00 44370.00 0.00 169746.66 0.00 0.00 0.00 133852.50 0.00 0.00 183202.89 964161.00 1.51 5853.24 3375.00 90000.00 0.00 0.00 0.00 0.00 0.00 0.00 6300.00 303.43 0.00 337500.00 0.00 5850.00 0.00 63000.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 3.00 0.75 0.00 0.00 1.00 0.00 0.00 2.00 0.00 0.00 0.00 12.00 11.00 13.00 5.41 0.00 11.00 12.00 0.00 24.00 911.00 -1267.00 1014.06 -300.00 -957.00 0.00 0.00 0.00 1262.00 621.29 -19.00 -360.00 904.00 -330.00 -361.00 329.00 -345.00 ... 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 1.00 1.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00
2 13068.00 110146.50 112500.00 0.00 0.00 0.00 0.00 0.00 1.00 0.07 3.00 0.02 -4130.00 -172.00 -5554.00 0.00 0.00 0.00 0.14 0.36 0.65 0.54 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 1.00 1.00 0.00 0.00 1.00 1.00 0.00 0.17 0.21 0.05 0.00 0.00 0.00 9.00 0.01 2.00 0.00 0.00 0.00 0.00 0.98 0.76 12500.01 29840.31 3942.00 10251.99 0.00 7074.85 0.12 0.35 0.12 97790.49 0.00 0.00 157500.00 45000.00 808650.00 0.00 0.00 40045.50 0.00 808650.00 44370.00 0.00 169746.66 0.00 0.00 0.00 133852.50 0.00 0.00 183202.89 964161.00 0.98 24750.00 11407.50 90000.00 0.00 0.00 69750.00 3498.70 0.00 0.00 6300.00 303.43 0.00 37800.00 0.00 5850.00 0.00 63000.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 3.00 0.75 0.00 0.00 1.00 0.00 0.00 2.00 0.00 7.00 0.00 8.00 2.00 58.00 3.23 0.00 10.00 10.00 4.00 56.00 911.00 -1267.00 1014.06 -300.00 -957.00 0.00 0.00 0.00 1262.00 621.29 -19.00 -360.00 904.00 -121.00 -172.00 2606.00 -345.00 ... 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 1.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 1.00 1.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00
3 3519.00 66384.00 40500.00 0.00 0.00 1.00 0.00 0.00 2.00 0.07 4.00 0.02 -5290.00 -1576.00 -5285.00 0.00 0.00 0.00 0.14 0.39 0.60 0.45 1.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 1.00 0.00 0.00 0.17 0.21 0.05 0.00 0.00 0.00 9.00 0.03 2.00 0.00 0.00 0.00 0.00 0.98 0.76 14647.50 33316.83 4387.50 10444.18 0.00 8532.81 0.05 0.35 0.09 97790.49 0.00 0.00 157500.00 45000.00 593460.00 0.00 0.00 102568.50 43321.50 550138.50 69847.88 46305.00 279391.50 0.00 0.00 0.00 136719.00 0.00 0.00 88112.62 800424.00 1.64 6268.50 3134.25 90000.00 0.00 0.00 69750.00 3498.70 0.00 0.00 6300.00 303.43 0.00 36540.00 0.00 5850.00 0.00 63000.00 0.00 0.00 0.00 1.00 1.00 0.00 0.00 0.00 0.00 3.00 0.75 0.00 0.00 1.00 0.00 0.00 2.00 0.00 7.00 0.00 24.00 6.00 18.00 8.53 0.00 24.00 15.00 6.00 24.00 30905.00 -679.00 13897.16 -325.00 -545.00 0.00 0.00 0.00 1020.00 398.50 -14.00 -20.00 629.00 -575.00 -1190.00 2293.00 -518.00 ... 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 1.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 1.00 1.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00
4 31801.50 298512.00 225000.00 0.00 0.00 0.00 0.00 0.00 0.00 0.14 2.00 0.10 -3033.00 -624.00 -86.00 0.00 0.40 0.40 0.17 0.74 0.66 0.72 1.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 1.00 0.00 0.00 0.00 0.46 0.00 0.00 0.00 0.00 3.00 11.00 0.02 2.00 0.00 0.00 0.00 0.00 1.00 0.99 12500.01 18041.58 3942.00 18041.58 0.00 18041.58 0.11 0.28 0.14 97790.49 0.00 0.00 157500.00 45000.00 162405.00 41400.00 0.00 162405.00 162405.00 0.00 9328.50 0.00 27985.50 0.00 0.00 0.00 120690.00 0.00 0.00 70766.58 435690.00 1.33 18045.00 18045.00 90000.00 0.00 0.00 69750.00 3498.70 0.00 0.00 6300.00 303.43 0.00 180450.00 0.00 5850.00 0.00 63000.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 3.00 0.75 0.00 0.00 1.00 0.00 0.00 2.00 0.00 7.00 0.00 10.00 5.00 5.00 2.50 0.00 5.00 10.00 10.00 0.00 703.00 -2526.00 1719.64 -965.00 -1106.00 0.00 0.00 0.00 1896.00 1056.31 -50.00 -696.00 2445.00 -624.00 -624.00 0.00 -723.00 ... 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 1.00 1.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 1.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00

5 rows × 580 columns

Code
df_transformed_col_info = an.col_info(credits_train_transformed)
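The `an.col_info()` helper is project-specific and its implementation is not shown in this report. As a rough illustration only, a similar per-column summary could be computed with plain pandas as sketched below. The function name `col_info_sketch` is hypothetical, and the definitions of the shares (fractions of all rows, or of non-missing rows for `p_dom_excl_na`) are assumptions inferred from the column names of the summary table, not the author's confirmed method.

```python
import pandas as pd


def _share(part: int, whole: int) -> float:
    """Return part / whole, or NaN when the denominator is zero."""
    return part / whole if whole else float("nan")


def col_info_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for an.col_info(): one summary row per column."""
    n_rows = len(df)
    rows = []
    for name in df.columns:
        col = df[name]
        # Frequencies of non-missing values, most frequent first
        counts = col.value_counts(dropna=True)
        n_missing = int(col.isna().sum())
        n_unique = int(col.nunique(dropna=True))
        n_dominant = int(counts.iloc[0]) if not counts.empty else 0
        rows.append(
            {
                "column": name,
                "data_type": str(col.dtype),
                "memory_bytes": int(col.memory_usage(index=False, deep=True)),
                "n_unique": n_unique,
                "p_unique": _share(n_unique, n_rows),
                "n_missing": n_missing,
                "p_missing": _share(n_missing, n_rows),
                "n_dominant": n_dominant,
                "p_dominant": _share(n_dominant, n_rows),
                # Assumed meaning: share of the dominant value among non-missing rows
                "p_dom_excl_na": _share(n_dominant, n_rows - n_missing),
                "dominant": counts.index[0] if not counts.empty else None,
            }
        )
    return pd.DataFrame(rows)


# Toy data (not from the Home Credit dataset)
toy = pd.DataFrame({"a": [1.0, 1.0, 2.0, None], "b": [0.0, 0.0, 0.0, 0.0]})
print(col_info_sketch(toy))
```

In this sketch the shares are kept as fractions; formatting them as percentages (and the memory size in MB) is left to a separate styling step, in the spirit of `an.style_col_info`.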
Column info (pre-processed data)
df_transformed_col_info.pipe(an.style_col_info)
Table 6.3. Info on all columns after preprocessing.
  column data_type memory_size n_unique p_unique n_missing p_missing n_dominant p_dominant p_dom_excl_na dominant
1 AMT_ANNUITY float64 1.7 MB 12,801 5.9% 0 0% 4,499 2.1% 2.1% 9000.0
2 AMT_CREDIT float64 1.7 MB 5,097 2.4% 0 0% 6,823 3.2% 3.2% 450000.0
3 AMT_INCOME_TOTAL float64 1.7 MB 1,949 0.9% 0 0% 24,982 11.6% 11.6% 135000.0
4 AMT_REQ_CREDIT_BUREAU_DAY float64 1.7 MB 9 <0.1% 0 0% 214,228 99.5% 99.5% 0.0
5 AMT_REQ_CREDIT_BUREAU_HOUR float64 1.7 MB 5 <0.1% 0 0% 214,142 99.5% 99.5% 0.0
6 AMT_REQ_CREDIT_BUREAU_MON float64 1.7 MB 22 <0.1% 0 0% 184,760 85.8% 85.8% 0.0
7 AMT_REQ_CREDIT_BUREAU_QRT float64 1.7 MB 10 <0.1% 0 0% 179,976 83.6% 83.6% 0.0
8 AMT_REQ_CREDIT_BUREAU_WEEK float64 1.7 MB 9 <0.1% 0 0% 209,327 97.2% 97.2% 0.0
9 AMT_REQ_CREDIT_BUREAU_YEAR float64 1.7 MB 24 <0.1% 0 0% 73,441 34.1% 34.1% 1.0
10 BASEMENTAREA_MODE float64 1.7 MB 3,687 1.7% 0 0% 125,860 58.5% 58.5% 0.07460000365972519
11 CNT_FAM_MEMBERS float64 1.7 MB 12 <0.1% 0 0% 110,672 51.4% 51.4% 2.0
12 COMMONAREA_MEDI float64 1.7 MB 2,982 1.4% 0 0% 150,382 69.9% 69.9% 0.020899999886751175
13 DAYS_ID_PUBLISH float64 1.7 MB 6,122 2.8% 0 0% 119 0.1% 0.1% -4074.0
14 DAYS_LAST_PHONE_CHANGE float64 1.7 MB 3,720 1.7% 0 0% 26,201 12.2% 12.2% 0.0
15 DAYS_REGISTRATION float64 1.7 MB 15,249 7.1% 0 0% 79 <0.1% <0.1% -7.0
16 DEF_30_CNT_SOCIAL_CIRCLE float64 1.7 MB 10 <0.1% 0 0% 190,702 88.6% 88.6% 0.0
17 ELEVATORS_AVG float64 1.7 MB 241 0.1% 0 0% 174,679 81.1% 81.1% 0.0
18 ELEVATORS_MEDI float64 1.7 MB 46 <0.1% 0 0% 175,610 81.6% 81.6% 0.0
19 ENTRANCES_MODE float64 1.7 MB 30 <0.1% 0 0% 133,580 62.1% 62.1% 0.1378999948501587
20 EXT_SOURCE_1 float64 1.7 MB 83,962 39.0% 0 0% 121,373 56.4% 56.4% 0.5052886605262756
21 EXT_SOURCE_2 float64 1.7 MB 102,229 47.5% 0 0% 503 0.2% 0.2% 0.2858978807926178
22 EXT_SOURCE_3 float64 1.7 MB 804 0.4% 0 0% 43,202 20.1% 20.1% 0.5352762341499329
23 FLAG_CONT_MOBILE float64 1.7 MB 2 <0.1% 0 0% 214,855 99.8% 99.8% 1.0
24 FLAG_DOCUMENT_11 float64 1.7 MB 2 <0.1% 0 0% 214,448 99.6% 99.6% 0.0
25 FLAG_DOCUMENT_13 float64 1.7 MB 2 <0.1% 0 0% 214,541 99.7% 99.7% 0.0
26 FLAG_DOCUMENT_14 float64 1.7 MB 2 <0.1% 0 0% 214,614 99.7% 99.7% 0.0
27 FLAG_DOCUMENT_15 float64 1.7 MB 2 <0.1% 0 0% 215,015 99.9% 99.9% 0.0
28 FLAG_DOCUMENT_16 float64 1.7 MB 2 <0.1% 0 0% 213,089 99.0% 99.0% 0.0
29 FLAG_DOCUMENT_18 float64 1.7 MB 2 <0.1% 0 0% 213,525 99.2% 99.2% 0.0
30 FLAG_DOCUMENT_3 float64 1.7 MB 2 <0.1% 0 0% 152,845 71.0% 71.0% 1.0
31 FLAG_DOCUMENT_5 float64 1.7 MB 2 <0.1% 0 0% 212,025 98.5% 98.5% 0.0
32 FLAG_DOCUMENT_6 float64 1.7 MB 2 <0.1% 0 0% 196,348 91.2% 91.2% 0.0
33 FLAG_DOCUMENT_8 float64 1.7 MB 2 <0.1% 0 0% 197,689 91.8% 91.8% 0.0
34 FLAG_DOCUMENT_9 float64 1.7 MB 2 <0.1% 0 0% 214,440 99.6% 99.6% 0.0
35 FLAG_EMAIL float64 1.7 MB 2 <0.1% 0 0% 203,006 94.3% 94.3% 0.0
36 FLAG_EMP_PHONE float64 1.7 MB 2 <0.1% 0 0% 176,491 82.0% 82.0% 1.0
37 FLAG_OWN_CAR float64 1.7 MB 2 <0.1% 0 0% 142,086 66.0% 66.0% 0.0
38 FLAG_OWN_REALTY float64 1.7 MB 2 <0.1% 0 0% 149,412 69.4% 69.4% 1.0
39 FLAG_PHONE float64 1.7 MB 2 <0.1% 0 0% 154,906 72.0% 72.0% 0.0
40 FLAG_WORK_PHONE float64 1.7 MB 2 <0.1% 0 0% 172,406 80.1% 80.1% 0.0
41 FLOORSMAX_MEDI float64 1.7 MB 49 <0.1% 0 0% 151,629 70.4% 70.4% 0.16670000553131104
42 FLOORSMIN_MEDI float64 1.7 MB 47 <0.1% 0 0% 169,787 78.9% 78.9% 0.20829999446868896
43 LANDAREA_MEDI float64 1.7 MB 3,393 1.6% 0 0% 127,718 59.3% 59.3% 0.048700001090765
44 NONLIVINGAPARTMENTS_AVG float64 1.7 MB 345 0.2% 0 0% 187,673 87.2% 87.2% 0.0
45 NONLIVINGAREA_MODE float64 1.7 MB 3,090 1.4% 0 0% 118,905 55.2% 55.2% 0.0010999999940395355
46 OBS_30_CNT_SOCIAL_CIRCLE float64 1.7 MB 32 <0.1% 0 0% 115,264 53.5% 53.5% 0.0
47 OWN_CAR_AGE float64 1.7 MB 61 <0.1% 0 0% 145,584 67.6% 67.6% 9.0
48 REGION_POPULATION_RELATIVE float64 1.7 MB 81 <0.1% 0 0% 11,494 5.3% 5.3% 0.03579200059175491
49 REGION_RATING_CLIENT float64 1.7 MB 3 <0.1% 0 0% 158,846 73.8% 73.8% 2.0
50 REG_CITY_NOT_LIVE_CITY float64 1.7 MB 2 <0.1% 0 0% 198,549 92.2% 92.2% 0.0
51 REG_CITY_NOT_WORK_CITY float64 1.7 MB 2 <0.1% 0 0% 165,697 77.0% 77.0% 0.0
52 REG_REGION_NOT_LIVE_REGION float64 1.7 MB 2 <0.1% 0 0% 211,999 98.5% 98.5% 0.0
53 REG_REGION_NOT_WORK_REGION float64 1.7 MB 2 <0.1% 0 0% 204,222 94.9% 94.9% 0.0
54 YEARS_BEGINEXPLUATATION_MODE float64 1.7 MB 210 0.1% 0 0% 107,681 50.0% 50.0% 0.9815999865531921
55 YEARS_BUILD_AVG float64 1.7 MB 146 0.1% 0 0% 144,837 67.3% 67.3% 0.7552000284194946
56 amt_annuity_max float64 1.7 MB 18,638 8.7% 0 0% 159,516 74.1% 74.1% 12500.01
57 amt_annuity_max_previous_application float64 1.7 MB 110,598 51.4% 0 0% 11,756 5.5% 5.5% 17954.865
58 amt_annuity_median float64 1.7 MB 16,441 7.6% 0 0% 159,485 74.1% 74.1% 3942.0
59 amt_annuity_median_previous_application float64 1.7 MB 157,063 73.0% 0 0% 11,753 5.5% 5.5% 10773.157500000001
60 amt_annuity_min float64 1.7 MB 9,921 4.6% 0 0% 196,455 91.3% 91.3% 0.0
61 amt_annuity_min_previous_application float64 1.7 MB 113,816 52.9% 0 0% 16,017 7.4% 7.4% 2250.0
62 amt_annuity_to_credit_ratio float64 1.7 MB 33,148 15.4% 0 0% 20,564 9.6% 9.6% 0.05000000074505806
63 amt_annuity_to_income_per_family_member float64 1.7 MB 88,172 41.0% 0 0% 1,500 0.7% 0.7% 0.3
64 amt_annuity_to_income_ratio float64 1.7 MB 71,916 33.4% 0 0% 2,049 1.0% 1.0% 0.1
65 amt_balance_credit_card_max float64 1.7 MB 40,175 18.7% 0 0% 154,159 71.6% 71.6% 97790.49
66 amt_balance_credit_card_median float64 1.7 MB 27,685 12.9% 0 0% 187,185 87.0% 87.0% 0.0
67 amt_balance_credit_card_min float64 1.7 MB 8,310 3.9% 0 0% 206,302 95.8% 95.8% 0.0
68 amt_credit_limit_actual_median float64 1.7 MB 151 0.1% 0 0% 155,593 72.3% 72.3% 157500.0
69 amt_credit_limit_actual_range float64 1.7 MB 147 0.1% 0 0% 157,689 73.3% 73.3% 45000.0
70 amt_credit_max float64 1.7 MB 49,618 23.1% 0 0% 14,581 6.8% 6.8% 225000.0
71 amt_credit_max_overdue_max float64 1.7 MB 32,871 15.3% 0 0% 166,187 77.2% 77.2% 0.0
72 amt_credit_max_overdue_range float64 1.7 MB 27,267 12.7% 0 0% 175,595 81.6% 81.6% 0.0
73 amt_credit_median float64 1.7 MB 73,966 34.4% 0 0% 11,457 5.3% 5.3% 83054.25
74 amt_credit_min float64 1.7 MB 33,220 15.4% 0 0% 79,660 37.0% 37.0% 0.0
75 amt_credit_range float64 1.7 MB 71,950 33.4% 0 0% 37,038 17.2% 17.2% 0.0
76 amt_credit_sum_debt_mean float64 1.7 MB 121,544 56.5% 0 0% 48,543 22.6% 22.6% 0.0
77 amt_credit_sum_debt_median float64 1.7 MB 48,592 22.6% 0 0% 156,857 72.9% 72.9% 0.0
78 amt_credit_sum_debt_sum float64 1.7 MB 113,811 52.9% 0 0% 53,746 25.0% 25.0% 0.0
79 amt_credit_sum_limit_min float64 1.7 MB 2,121 1.0% 0 0% 212,794 98.9% 98.9% 0.0
80 amt_credit_sum_limit_std float64 1.7 MB 26,937 12.5% 0 0% 183,161 85.1% 85.1% 0.0
81 amt_credit_sum_limit_sum float64 1.7 MB 26,367 12.2% 0 0% 181,184 84.2% 84.2% 0.0
82 amt_credit_sum_median float64 1.7 MB 77,800 36.1% 0 0% 30,841 14.3% 14.3% 133852.5
83 amt_credit_sum_overdue_std float64 1.7 MB 1,618 0.8% 0 0% 213,025 99.0% 99.0% 0.0
84 amt_credit_sum_overdue_sum float64 1.7 MB 930 0.4% 0 0% 212,926 98.9% 98.9% 0.0
85 amt_credit_sum_std float64 1.7 MB 148,440 69.0% 0 0% 55,965 26.0% 26.0% 183202.88926385253
86 amt_credit_sum_sum float64 1.7 MB 147,742 68.6% 0 0% 30,837 14.3% 14.3% 964161.0
87 amt_credit_to_income_ratio float64 1.7 MB 39,372 18.3% 0 0% 3,691 1.7% 1.7% 2.0
88 amt_down_payment_max float64 1.7 MB 17,608 8.2% 0 0% 53,725 25.0% 25.0% 0.0
89 amt_down_payment_mean float64 1.7 MB 42,577 19.8% 0 0% 53,725 25.0% 25.0% 0.0
90 amt_drawings_atm_current_max float64 1.7 MB 1,131 0.5% 0 0% 175,102 81.3% 81.3% 90000.0
91 amt_drawings_atm_current_median float64 1.7 MB 378 0.2% 0 0% 208,835 97.0% 97.0% 0.0
92 amt_drawings_atm_current_min float64 1.7 MB 114 0.1% 0 0% 214,655 99.7% 99.7% 0.0
93 amt_drawings_current_max float64 1.7 MB 17,325 8.0% 0 0% 154,198 71.6% 71.6% 69750.0
94 amt_drawings_current_mean float64 1.7 MB 35,095 16.3% 0 0% 154,159 71.6% 71.6% 3498.702077922078
95 amt_drawings_current_min float64 1.7 MB 1,475 0.7% 0 0% 213,422 99.1% 99.1% 0.0
96 amt_drawings_other_current_max float64 1.7 MB 1,084 0.5% 0 0% 211,253 98.1% 98.1% 0.0
97 amt_drawings_pos_current_max float64 1.7 MB 20,726 9.6% 0 0% 172,260 80.0% 80.0% 6300.0
98 amt_drawings_pos_current_mean float64 1.7 MB 23,516 10.9% 0 0% 172,255 80.0% 80.0% 303.42857142857144
99 amt_drawings_pos_current_min float64 1.7 MB 1,772 0.8% 0 0% 213,337 99.1% 99.1% 0.0
100 amt_goods_price_min float64 1.7 MB 39,171 18.2% 0 0% 12,169 5.7% 5.7% 45735.75
101 amt_inst_min_regularity_min float64 1.7 MB 1,664 0.8% 0 0% 211,946 98.5% 98.5% 0.0
102 amt_payment_current_median float64 1.7 MB 17,066 7.9% 0 0% 172,523 80.1% 80.1% 5850.0
103 amt_payment_current_min float64 1.7 MB 7,398 3.4% 0 0% 199,261 92.6% 92.6% 0.0
104 amt_payment_current_range float64 1.7 MB 22,545 10.5% 0 0% 172,454 80.1% 80.1% 63000.0
105 amt_payment_total_current_min float64 1.7 MB 1,131 0.5% 0 0% 213,443 99.2% 99.2% 0.0
106 any_installments_late_30 float64 1.7 MB 2 <0.1% 0 0% 201,997 93.8% 93.8% 0.0
107 any_installments_late_60 float64 1.7 MB 2 <0.1% 0 0% 209,180 97.2% 97.2% 0.0
108 any_installments_late_7 float64 1.7 MB 2 <0.1% 0 0% 158,592 73.7% 73.7% 0.0
109 bureau_dpd_status_max float64 1.7 MB 6 <0.1% 0 0% 193,628 90.0% 90.0% 0.0
110 bureau_dpd_status_median float64 1.7 MB 11 <0.1% 0 0% 214,312 99.6% 99.6% 0.0
111 bureau_months_balance_max float64 1.7 MB 89 <0.1% 0 0% 212,281 98.6% 98.6% 0.0
112 cnt_credit_prolong_mean float64 1.7 MB 100 <0.1% 0 0% 209,248 97.2% 97.2% 0.0
113 cnt_credit_prolong_sum float64 1.7 MB 10 <0.1% 0 0% 209,248 97.2% 97.2% 0.0
114 cnt_drawings_atm_current_max float64 1.7 MB 43 <0.1% 0 0% 178,554 82.9% 82.9% 3.0
115 cnt_drawings_atm_current_std float64 1.7 MB 16,771 7.8% 0 0% 172,561 80.2% 80.2% 0.7457481920719147
116 cnt_drawings_current_min float64 1.7 MB 39 <0.1% 0 0% 213,436 99.2% 99.2% 0.0
117 cnt_drawings_other_current_max float64 1.7 MB 11 <0.1% 0 0% 211,241 98.1% 98.1% 0.0
118 cnt_drawings_pos_current_max float64 1.7 MB 116 0.1% 0 0% 176,434 82.0% 82.0% 1.0
119 cnt_drawings_pos_current_median float64 1.7 MB 113 0.1% 0 0% 205,975 95.7% 95.7% 0.0
120 cnt_drawings_pos_current_min float64 1.7 MB 40 <0.1% 0 0% 213,337 99.1% 99.1% 0.0
121 cnt_fam_members_excluding_children float64 1.7 MB 2 <0.1% 0 0% 158,302 73.5% 73.5% 2.0
122 cnt_installment_future_min float64 1.7 MB 61 <0.1% 0 0% 196,054 91.1% 91.1% 0.0
123 cnt_installment_mature_cum_max float64 1.7 MB 120 0.1% 0 0% 156,329 72.6% 72.6% 7.0
124 cnt_installment_mature_cum_min float64 1.7 MB 28 <0.1% 0 0% 193,011 89.7% 89.7% 0.0
125 cnt_installment_median float64 1.7 MB 103 <0.1% 0 0% 73,750 34.3% 34.3% 12.0
126 cnt_installment_min float64 1.7 MB 53 <0.1% 0 0% 54,950 25.5% 25.5% 6.0
127 cnt_installment_range float64 1.7 MB 69 <0.1% 0 0% 49,692 23.1% 23.1% 0.0
128 cnt_installments_diff_mean float64 1.7 MB 20,290 9.4% 0 0% 19,490 9.1% 9.1% 5.0
129 cnt_installments_diff_min float64 1.7 MB 58 <0.1% 0 0% 210,671 97.9% 97.9% 0.0
130 cnt_installments_diff_range float64 1.7 MB 82 <0.1% 0 0% 48,330 22.5% 22.5% 12.0
131 cnt_payment_median float64 1.7 MB 87 <0.1% 0 0% 65,750 30.5% 30.5% 12.0
132 cnt_payment_min float64 1.7 MB 31 <0.1% 0 0% 68,588 31.9% 31.9% 0.0
133 cnt_payment_range float64 1.7 MB 69 <0.1% 0 0% 54,639 25.4% 25.4% 0.0
134 days_credit_enddate_max float64 1.7 MB 12,274 5.7% 0 0% 32,491 15.1% 15.1% 911.0
135 days_credit_enddate_min float64 1.7 MB 6,266 2.9% 0 0% 32,492 15.1% 15.1% -1267.0
136 days_credit_enddate_std float64 1.7 MB 134,002 62.3% 0 0% 59,197 27.5% 27.5% 1014.057521898929
137 days_credit_max float64 1.7 MB 2,922 1.4% 0 0% 31,067 14.4% 14.4% -300.0
138 days_credit_median float64 1.7 MB 5,711 2.7% 0 0% 30,932 14.4% 14.4% -957.0
139 days_credit_overdue_max float64 1.7 MB 671 0.3% 0 0% 212,892 98.9% 98.9% 0.0
140 days_credit_overdue_mean float64 1.7 MB 1,195 0.6% 0 0% 212,892 98.9% 98.9% 0.0
141 days_credit_overdue_median float64 1.7 MB 168 0.1% 0 0% 214,955 99.9% 99.9% 0.0
142 days_credit_range float64 1.7 MB 2,913 1.4% 0 0% 30,890 14.4% 14.4% 1262.0
143 days_credit_std float64 1.7 MB 133,053 61.8% 0 0% 55,965 26.0% 26.0% 621.2873840332031
144 days_credit_update_max float64 1.7 MB 2,585 1.2% 0 0% 34,359 16.0% 16.0% -19.0
145 days_credit_update_median float64 1.7 MB 4,779 2.2% 0 0% 30,948 14.4% 14.4% -360.0
146 days_credit_update_range float64 1.7 MB 2,925 1.4% 0 0% 30,911 14.4% 14.4% 904.0
147 days_decision_max float64 1.7 MB 2,921 1.4% 0 0% 11,697 5.4% 5.4% -299.0
148 days_decision_median float64 1.7 MB 5,656 2.6% 0 0% 11,546 5.4% 5.4% -647.0
149 days_decision_range float64 1.7 MB 2,919 1.4% 0 0% 40,565 18.8% 18.8% 0.0
150 days_enddate_fact_max float64 1.7 MB 2,793 1.3% 0 0% 54,020 25.1% 25.1% -345.0
151 days_enddate_fact_median float64 1.7 MB 5,341 2.5% 0 0% 53,910 25.0% 25.0% -872.5
152 days_enddate_fact_range float64 1.7 MB 2,796 1.3% 0 0% 53,924 25.1% 25.1% 821.0
153 days_first_draw_min float64 1.7 MB 2,718 1.3% 0 0% 177,781 82.6% 82.6% 365243.0
154 days_last_due_1st_version_max float64 1.7 MB 4,521 2.1% 0 0% 55,263 25.7% 25.7% 365243.0
155 days_last_due_1st_version_mean float64 1.7 MB 51,499 23.9% 0 0% 12,398 5.8% 5.8% -207.5
156 days_last_due_1st_version_median float64 1.7 MB 10,719 5.0% 0 0% 12,497 5.8% 5.8% -325.0
157 days_last_due_1st_version_min float64 1.7 MB 4,081 1.9% 0 0% 12,430 5.8% 5.8% -1089.0
158 days_last_due_max float64 1.7 MB 2,761 1.3% 0 0% 98,527 45.8% 45.8% 365243.0
159 days_last_due_range float64 1.7 MB 5,592 2.6% 0 0% 58,659 27.3% 27.3% 0.0
160 days_termination_median float64 1.7 MB 7,716 3.6% 0 0% 23,269 10.8% 10.8% 365243.0
161 days_termination_min float64 1.7 MB 2,797 1.3% 0 0% 15,833 7.4% 7.4% 365243.0
162 diff_amt_installment_payment_max float64 1.7 MB 75,445 35.0% 0 0% 127,555 59.3% 59.3% 0.0
163 diff_amt_installment_payment_mean float64 1.7 MB 97,257 45.2% 0 0% 114,097 53.0% 53.0% 0.0
164 diff_amt_installment_payment_median float64 1.7 MB 6,855 3.2% 0 0% 206,997 96.2% 96.2% 0.0
165 diff_amt_installment_payment_range float64 1.7 MB 90,195 41.9% 0 0% 114,099 53.0% 53.0% 0.0
166 diff_days_installment_payment_max float64 1.7 MB 409 0.2% 0 0% 18,396 8.5% 8.5% 31.0
167 diff_days_installment_payment_mean float64 1.7 MB 50,247 23.3% 0 0% 11,037 5.1% 5.1% 9.524199962615967
168 diff_days_installment_payment_median float64 1.7 MB 320 0.1% 0 0% 21,620 10.0% 10.0% 0.0
169 diff_days_installment_payment_range float64 1.7 MB 1,465 0.7% 0 0% 14,802 6.9% 6.9% 37.0
170 diff_days_installment_payment_sum float64 1.7 MB 4,383 2.0% 0 0% 11,369 5.3% 5.3% 240.0
171 diff_days_installment_payment_sum_late_only float64 1.7 MB 1,815 0.8% 0 0% 95,670 44.4% 44.4% 0.0
172 diff_percent_installment_payment_mean float64 1.7 MB 87,934 40.9% 0 0% 114,228 53.1% 53.1% 1.0
173 diff_percent_installment_payment_median float64 1.7 MB 7,969 3.7% 0 0% 206,997 96.2% 96.2% 1.0
174 diff_percent_installment_payment_min float64 1.7 MB 25,589 11.9% 0 0% 189,010 87.8% 87.8% 1.0
175 diff_percent_installment_payment_range float64 1.7 MB 97,055 45.1% 0 0% 114,227 53.1% 53.1% 0.0
176 flag_emergency_state float64 1.7 MB 2 <0.1% 0 0% 213,628 99.2% 99.2% 0.0
177 n_car_loans float64 1.7 MB 9 <0.1% 0 0% 201,519 93.6% 93.6% 0.0
178 n_cash_loans float64 1.7 MB 55 <0.1% 0 0% 83,697 38.9% 38.9% 0.0
179 n_channel_type_ap_minus float64 1.7 MB 33 <0.1% 0 0% 199,207 92.5% 92.5% 0.0
180 n_channel_type_car_dealer float64 1.7 MB 6 <0.1% 0 0% 215,036 99.9% 99.9% 0.0
181 n_channel_type_channel_corporate_sales float64 1.7 MB 20 <0.1% 0 0% 213,745 99.3% 99.3% 0.0
182 n_channel_type_contact_center float64 1.7 MB 19 <0.1% 0 0% 187,077 86.9% 86.9% 0.0
183 n_channel_type_countrywide float64 1.7 MB 34 <0.1% 0 0% 78,922 36.7% 36.7% 1.0
184 n_channel_type_credit_and_cash float64 1.7 MB 52 <0.1% 0 0% 96,482 44.8% 44.8% 0.0
185 n_channel_type_regional_and_local float64 1.7 MB 19 <0.1% 0 0% 169,784 78.9% 78.9% 0.0
186 n_channel_type_stone float64 1.7 MB 22 <0.1% 0 0% 133,139 61.9% 61.9% 0.0
187 n_client_type_new float64 1.7 MB 14 <0.1% 0 0% 165,520 76.9% 76.9% 1.0
188 n_client_type_refreshed float64 1.7 MB 23 <0.1% 0 0% 161,564 75.1% 75.1% 0.0
189 n_client_type_repeater float64 1.7 MB 61 <0.1% 0 0% 49,122 22.8% 22.8% 0.0
190 n_consumer_loans float64 1.7 MB 36 <0.1% 0 0% 78,331 36.4% 36.4% 1.0
191 n_contract_status_refused float64 1.7 MB 44 <0.1% 0 0% 144,850 67.3% 67.3% 0.0
192 n_contract_status_unused_offer float64 1.7 MB 11 <0.1% 0 0% 202,009 93.8% 93.8% 0.0
193 n_contracts_credit_card_completed float64 1.7 MB 40 <0.1% 0 0% 207,783 96.5% 96.5% 0.0
194 n_credit_card_credits float64 1.7 MB 22 <0.1% 0 0% 91,194 42.4% 42.4% 1.0
195 n_credits_active float64 1.7 MB 22 <0.1% 0 0% 71,863 33.4% 33.4% 2.0
196 n_credits_sold float64 1.7 MB 7 <0.1% 0 0% 211,547 98.3% 98.3% 0.0
197 n_credits_total float64 1.7 MB 57 <0.1% 0 0% 51,153 23.8% 23.8% 4.0
198 n_currency_2 float64 1.7 MB 7 <0.1% 0 0% 214,671 99.7% 99.7% 0.0
199 n_different_channels float64 1.7 MB 7 <0.1% 0 0% 90,541 42.1% 42.1% 2.0
200 n_different_contract_types float64 1.7 MB 4 <0.1% 0 0% 89,430 41.5% 41.5% 2.0
201 n_different_credit_types float64 1.7 MB 5 <0.1% 0 0% 131,569 61.1% 61.1% 2.0
202 n_different_currencies float64 1.7 MB 3 <0.1% 0 0% 214,601 99.7% 99.7% 1.0
203 n_installments_late float64 1.7 MB 99 <0.1% 0 0% 95,670 44.4% 44.4% 0.0
204 n_installments_late_30 float64 1.7 MB 42 <0.1% 0 0% 201,997 93.8% 93.8% 0.0
205 n_installments_late_7 float64 1.7 MB 59 <0.1% 0 0% 158,592 73.7% 73.7% 0.0
206 n_installments_total float64 1.7 MB 310 0.1% 0 0% 14,007 6.5% 6.5% 25.0
207 n_microloans float64 1.7 MB 28 <0.1% 0 0% 212,811 98.9% 98.9% 0.0
208 n_mortgages float64 1.7 MB 7 <0.1% 0 0% 205,270 95.4% 95.4% 0.0
209 n_nflag_insured_on_approval_mean float64 1.7 MB 102 <0.1% 0 0% 95,675 44.4% 44.4% 0.0
210 n_nflag_insured_on_approval_sum float64 1.7 MB 19 <0.1% 0 0% 96,596 44.9% 44.9% 0.0
211 n_other_type_credit float64 1.7 MB 9 <0.1% 0 0% 213,209 99.0% 99.0% 0.0
212 n_payment_type_cash_through_bank float64 1.7 MB 44 <0.1% 0 0% 54,943 25.5% 25.5% 1.0
213 n_payment_type_not_available float64 1.7 MB 46 <0.1% 0 0% 71,796 33.4% 33.4% 0.0
214 n_previous_credit_card_applications float64 1.7 MB 126 0.1% 0 0% 155,013 72.0% 72.0% 21.0
215 n_previous_credit_card_applications_signed float64 1.7 MB 37 <0.1% 0 0% 212,249 98.6% 98.6% 0.0
216 n_previous_pos_applications float64 1.7 MB 221 0.1% 0 0% 16,495 7.7% 7.7% 22.0
217 n_previous_pos_applications_completed float64 1.7 MB 45 <0.1% 0 0% 73,226 34.0% 34.0% 1.0
218 n_previous_pos_applications_signed float64 1.7 MB 31 <0.1% 0 0% 174,587 81.1% 81.1% 0.0
219 n_product_type_walk_in float64 1.7 MB 28 <0.1% 0 0% 164,239 76.3% 76.3% 0.0
220 n_reject_reason_limit float64 1.7 MB 22 <0.1% 0 0% 195,275 90.7% 90.7% 0.0
221 n_reject_reason_scoc float64 1.7 MB 20 <0.1% 0 0% 200,014 92.9% 92.9% 0.0
222 n_reject_reason_scofr float64 1.7 MB 16 <0.1% 0 0% 210,511 97.8% 97.8% 0.0
223 n_revolving_loans float64 1.7 MB 25 <0.1% 0 0% 142,248 66.1% 66.1% 0.0
224 n_yield_group_high float64 1.7 MB 30 <0.1% 0 0% 89,153 41.4% 41.4% 0.0
225 n_yield_group_low_action float64 1.7 MB 22 <0.1% 0 0% 174,871 81.2% 81.2% 0.0
226 n_yield_group_low_normal float64 1.7 MB 23 <0.1% 0 0% 94,724 44.0% 44.0% 0.0
227 n_yield_group_middle float64 1.7 MB 25 <0.1% 0 0% 80,132 37.2% 37.2% 1.0
228 percent_installments_early float64 1.7 MB 7,892 3.7% 0 0% 64,688 30.1% 30.1% 1.0
229 percent_installments_late float64 1.7 MB 4,464 2.1% 0 0% 95,670 44.4% 44.4% 0.0
230 percent_installments_late_30 float64 1.7 MB 894 0.4% 0 0% 201,997 93.8% 93.8% 0.0
231 percent_installments_late_60 float64 1.7 MB 629 0.3% 0 0% 209,180 97.2% 97.2% 0.0
232 percent_installments_late_7 float64 1.7 MB 2,595 1.2% 0 0% 158,592 73.7% 73.7% 0.0
233 rate_down_payment_max float64 1.7 MB 84,884 39.4% 0 0% 53,725 25.0% 25.0% 0.0
234 rate_down_payment_range float64 1.7 MB 73,616 34.2% 0 0% 94,887 44.1% 44.1% 0.0
235 rate_interest_privileged_count float64 1.7 MB 4 <0.1% 0 0% 212,016 98.5% 98.5% 0.0
236 sk_dpd_credit_card_max float64 1.7 MB 353 0.2% 0 0% 202,632 94.1% 94.1% 0.0
237 sk_dpd_credit_card_median float64 1.7 MB 222 0.1% 0 0% 214,704 99.7% 99.7% 0.0
238 sk_dpd_def_credit_card_max float64 1.7 MB 47 <0.1% 0 0% 204,810 95.1% 95.1% 0.0
239 sk_dpd_def_pos_applications_max float64 1.7 MB 173 0.1% 0 0% 187,187 87.0% 87.0% 0.0
240 sk_dpd_pos_applications_max float64 1.7 MB 1,595 0.7% 0 0% 176,902 82.2% 82.2% 0.0
241 years_employed float64 1.7 MB 11,769 5.5% 0 0% 38,801 18.0% 18.0% 4.517808219178082
242 missingindicator_AMT_ANNUITY float64 1.7 MB 2 <0.1% 0 0% 215,249 >99.9% >99.9% 0.0
243 missingindicator_AMT_REQ_CREDIT_BUREAU_DAY float64 1.7 MB 2 <0.1% 0 0% 186,176 86.5% 86.5% 0.0
244 missingindicator_AMT_REQ_CREDIT_BUREAU_HOUR float64 1.7 MB 2 <0.1% 0 0% 186,176 86.5% 86.5% 0.0
245 missingindicator_AMT_REQ_CREDIT_BUREAU_MON float64 1.7 MB 2 <0.1% 0 0% 186,176 86.5% 86.5% 0.0
246 missingindicator_AMT_REQ_CREDIT_BUREAU_QRT float64 1.7 MB 2 <0.1% 0 0% 186,176 86.5% 86.5% 0.0
247 missingindicator_AMT_REQ_CREDIT_BUREAU_WEEK float64 1.7 MB 2 <0.1% 0 0% 186,176 86.5% 86.5% 0.0
248 missingindicator_AMT_REQ_CREDIT_BUREAU_YEAR float64 1.7 MB 2 <0.1% 0 0% 186,176 86.5% 86.5% 0.0
249 missingindicator_BASEMENTAREA_MODE float64 1.7 MB 2 <0.1% 0 0% 125,793 58.4% 58.4% 1.0
250 missingindicator_CNT_FAM_MEMBERS float64 1.7 MB 2 <0.1% 0 0% 215,256 >99.9% >99.9% 0.0
251 missingindicator_COMMONAREA_MEDI float64 1.7 MB 2 <0.1% 0 0% 150,300 69.8% 69.8% 1.0
252 missingindicator_DAYS_LAST_PHONE_CHANGE float64 1.7 MB 2 <0.1% 0 0% 215,256 >99.9% >99.9% 0.0
253 missingindicator_DEF_30_CNT_SOCIAL_CIRCLE float64 1.7 MB 2 <0.1% 0 0% 214,543 99.7% 99.7% 0.0
254 missingindicator_ELEVATORS_AVG float64 1.7 MB 2 <0.1% 0 0% 114,570 53.2% 53.2% 1.0
255 missingindicator_ELEVATORS_MEDI float64 1.7 MB 2 <0.1% 0 0% 114,570 53.2% 53.2% 1.0
256 missingindicator_ENTRANCES_MODE float64 1.7 MB 2 <0.1% 0 0% 108,270 50.3% 50.3% 1.0
257 missingindicator_EXT_SOURCE_1 float64 1.7 MB 2 <0.1% 0 0% 121,373 56.4% 56.4% 1.0
258 missingindicator_EXT_SOURCE_2 float64 1.7 MB 2 <0.1% 0 0% 214,793 99.8% 99.8% 0.0
259 missingindicator_EXT_SOURCE_3 float64 1.7 MB 2 <0.1% 0 0% 172,577 80.2% 80.2% 0.0
260 missingindicator_FLOORSMAX_MEDI float64 1.7 MB 2 <0.1% 0 0% 108,287 50.3% 50.3% 0.0
261 missingindicator_FLOORSMIN_MEDI float64 1.7 MB 2 <0.1% 0 0% 146,054 67.9% 67.9% 1.0
262 missingindicator_LANDAREA_MEDI float64 1.7 MB 2 <0.1% 0 0% 127,644 59.3% 59.3% 1.0
263 missingindicator_NONLIVINGAPARTMENTS_AVG float64 1.7 MB 2 <0.1% 0 0% 149,354 69.4% 69.4% 1.0
264 missingindicator_NONLIVINGAREA_MODE float64 1.7 MB 2 <0.1% 0 0% 118,577 55.1% 55.1% 1.0
265 missingindicator_OBS_30_CNT_SOCIAL_CIRCLE float64 1.7 MB 2 <0.1% 0 0% 214,543 99.7% 99.7% 0.0
266 missingindicator_OWN_CAR_AGE float64 1.7 MB 2 <0.1% 0 0% 142,091 66.0% 66.0% 1.0
267 missingindicator_YEARS_BEGINEXPLUATATION_MODE float64 1.7 MB 2 <0.1% 0 0% 110,347 51.3% 51.3% 0.0
268 missingindicator_YEARS_BUILD_AVG float64 1.7 MB 2 <0.1% 0 0% 143,036 66.4% 66.4% 1.0
269 missingindicator_amt_annuity_max float64 1.7 MB 2 <0.1% 0 0% 159,480 74.1% 74.1% 1.0
270 missingindicator_amt_annuity_max_previous_application float64 1.7 MB 2 <0.1% 0 0% 203,505 94.5% 94.5% 0.0
271 missingindicator_amt_annuity_median float64 1.7 MB 2 <0.1% 0 0% 159,480 74.1% 74.1% 1.0
272 missingindicator_amt_annuity_median_previous_application float64 1.7 MB 2 <0.1% 0 0% 203,505 94.5% 94.5% 0.0
273 missingindicator_amt_annuity_min float64 1.7 MB 2 <0.1% 0 0% 159,480 74.1% 74.1% 1.0
274 missingindicator_amt_annuity_min_previous_application float64 1.7 MB 2 <0.1% 0 0% 203,505 94.5% 94.5% 0.0
275 missingindicator_amt_annuity_to_credit_ratio float64 1.7 MB 2 <0.1% 0 0% 215,249 >99.9% >99.9% 0.0
276 missingindicator_amt_annuity_to_income_per_family_member float64 1.7 MB 2 <0.1% 0 0% 215,248 >99.9% >99.9% 0.0
277 missingindicator_amt_annuity_to_income_ratio float64 1.7 MB 2 <0.1% 0 0% 215,249 >99.9% >99.9% 0.0
278 missingindicator_amt_balance_credit_card_max float64 1.7 MB 2 <0.1% 0 0% 154,158 71.6% 71.6% 1.0
279 missingindicator_amt_balance_credit_card_median float64 1.7 MB 2 <0.1% 0 0% 154,158 71.6% 71.6% 1.0
280 missingindicator_amt_balance_credit_card_min float64 1.7 MB 2 <0.1% 0 0% 154,158 71.6% 71.6% 1.0
281 missingindicator_amt_credit_limit_actual_median float64 1.7 MB 2 <0.1% 0 0% 154,158 71.6% 71.6% 1.0
282 missingindicator_amt_credit_limit_actual_range float64 1.7 MB 2 <0.1% 0 0% 154,158 71.6% 71.6% 1.0
283 missingindicator_amt_credit_max float64 1.7 MB 2 <0.1% 0 0% 203,801 94.7% 94.7% 0.0
284 missingindicator_amt_credit_max_overdue_max float64 1.7 MB 2 <0.1% 0 0% 128,619 59.8% 59.8% 0.0
285 missingindicator_amt_credit_max_overdue_range float64 1.7 MB 2 <0.1% 0 0% 128,619 59.8% 59.8% 0.0
286 missingindicator_amt_credit_median float64 1.7 MB 2 <0.1% 0 0% 203,801 94.7% 94.7% 0.0
287 missingindicator_amt_credit_min float64 1.7 MB 2 <0.1% 0 0% 203,801 94.7% 94.7% 0.0
288 missingindicator_amt_credit_range float64 1.7 MB 2 <0.1% 0 0% 203,801 94.7% 94.7% 0.0
289 missingindicator_amt_credit_sum_debt_mean float64 1.7 MB 2 <0.1% 0 0% 179,218 83.3% 83.3% 0.0
290 missingindicator_amt_credit_sum_debt_median float64 1.7 MB 2 <0.1% 0 0% 179,218 83.3% 83.3% 0.0
291 missingindicator_amt_credit_sum_debt_sum float64 1.7 MB 2 <0.1% 0 0% 184,421 85.7% 85.7% 0.0
292 missingindicator_amt_credit_sum_limit_min float64 1.7 MB 2 <0.1% 0 0% 169,672 78.8% 78.8% 0.0
293 missingindicator_amt_credit_sum_limit_std float64 1.7 MB 2 <0.1% 0 0% 134,361 62.4% 62.4% 0.0
294 missingindicator_amt_credit_sum_limit_sum float64 1.7 MB 2 <0.1% 0 0% 184,421 85.7% 85.7% 0.0
295 missingindicator_amt_credit_sum_median float64 1.7 MB 2 <0.1% 0 0% 184,421 85.7% 85.7% 0.0
296 missingindicator_amt_credit_sum_overdue_std float64 1.7 MB 2 <0.1% 0 0% 159,292 74.0% 74.0% 0.0
297 missingindicator_amt_credit_sum_overdue_sum float64 1.7 MB 2 <0.1% 0 0% 184,421 85.7% 85.7% 0.0
298 missingindicator_amt_credit_sum_std float64 1.7 MB 2 <0.1% 0 0% 159,292 74.0% 74.0% 0.0
299 missingindicator_amt_credit_sum_sum float64 1.7 MB 2 <0.1% 0 0% 184,421 85.7% 85.7% 0.0
300 missingindicator_amt_down_payment_max float64 1.7 MB 2 <0.1% 0 0% 191,554 89.0% 89.0% 0.0
301 missingindicator_amt_down_payment_mean float64 1.7 MB 2 <0.1% 0 0% 191,554 89.0% 89.0% 0.0
302 missingindicator_amt_drawings_atm_current_max float64 1.7 MB 2 <0.1% 0 0% 172,254 80.0% 80.0% 1.0
303 missingindicator_amt_drawings_atm_current_median float64 1.7 MB 2 <0.1% 0 0% 172,254 80.0% 80.0% 1.0
304 missingindicator_amt_drawings_atm_current_min float64 1.7 MB 2 <0.1% 0 0% 172,254 80.0% 80.0% 1.0
305 missingindicator_amt_drawings_current_max float64 1.7 MB 2 <0.1% 0 0% 154,158 71.6% 71.6% 1.0
306 missingindicator_amt_drawings_current_mean float64 1.7 MB 2 <0.1% 0 0% 154,158 71.6% 71.6% 1.0
307 missingindicator_amt_drawings_current_min float64 1.7 MB 2 <0.1% 0 0% 154,158 71.6% 71.6% 1.0
308 missingindicator_amt_drawings_other_current_max float64 1.7 MB 2 <0.1% 0 0% 172,254 80.0% 80.0% 1.0
309 missingindicator_amt_drawings_pos_current_max float64 1.7 MB 2 <0.1% 0 0% 172,254 80.0% 80.0% 1.0
310 missingindicator_amt_drawings_pos_current_mean float64 1.7 MB 2 <0.1% 0 0% 172,254 80.0% 80.0% 1.0
311 missingindicator_amt_drawings_pos_current_min float64 1.7 MB 2 <0.1% 0 0% 172,254 80.0% 80.0% 1.0
312 missingindicator_amt_goods_price_min float64 1.7 MB 2 <0.1% 0 0% 203,088 94.3% 94.3% 0.0
313 missingindicator_amt_inst_min_regularity_min float64 1.7 MB 2 <0.1% 0 0% 154,158 71.6% 71.6% 1.0
314 missingindicator_amt_payment_current_median float64 1.7 MB 2 <0.1% 0 0% 172,336 80.1% 80.1% 1.0
315 missingindicator_amt_payment_current_min float64 1.7 MB 2 <0.1% 0 0% 172,336 80.1% 80.1% 1.0
316 missingindicator_amt_payment_current_range float64 1.7 MB 2 <0.1% 0 0% 172,336 80.1% 80.1% 1.0
317 missingindicator_amt_payment_total_current_min float64 1.7 MB 2 <0.1% 0 0% 154,158 71.6% 71.6% 1.0
318 missingindicator_any_installments_late_30 float64 1.7 MB 2 <0.1% 0 0% 204,223 94.9% 94.9% 0.0
319 missingindicator_any_installments_late_60 float64 1.7 MB 2 <0.1% 0 0% 204,223 94.9% 94.9% 0.0
320 missingindicator_any_installments_late_7 float64 1.7 MB 2 <0.1% 0 0% 204,223 94.9% 94.9% 0.0
321 missingindicator_bureau_dpd_status_max float64 1.7 MB 2 <0.1% 0 0% 152,586 70.9% 70.9% 1.0
322 missingindicator_bureau_dpd_status_median float64 1.7 MB 2 <0.1% 0 0% 152,586 70.9% 70.9% 1.0
323 missingindicator_bureau_months_balance_max float64 1.7 MB 2 <0.1% 0 0% 152,586 70.9% 70.9% 1.0
324 missingindicator_cnt_credit_prolong_mean float64 1.7 MB 2 <0.1% 0 0% 184,421 85.7% 85.7% 0.0
325 missingindicator_cnt_credit_prolong_sum float64 1.7 MB 2 <0.1% 0 0% 184,421 85.7% 85.7% 0.0
326 missingindicator_cnt_drawings_atm_current_max float64 1.7 MB 2 <0.1% 0 0% 172,254 80.0% 80.0% 1.0
327 missingindicator_cnt_drawings_atm_current_std float64 1.7 MB 2 <0.1% 0 0% 172,561 80.2% 80.2% 1.0
328 missingindicator_cnt_drawings_current_min float64 1.7 MB 2 <0.1% 0 0% 154,158 71.6% 71.6% 1.0
329 missingindicator_cnt_drawings_other_current_max float64 1.7 MB 2 <0.1% 0 0% 172,254 80.0% 80.0% 1.0
330 missingindicator_cnt_drawings_pos_current_max float64 1.7 MB 2 <0.1% 0 0% 172,254 80.0% 80.0% 1.0
331 missingindicator_cnt_drawings_pos_current_median float64 1.7 MB 2 <0.1% 0 0% 172,254 80.0% 80.0% 1.0
332 missingindicator_cnt_drawings_pos_current_min float64 1.7 MB 2 <0.1% 0 0% 172,254 80.0% 80.0% 1.0
333 missingindicator_cnt_fam_members_excluding_children float64 1.7 MB 2 <0.1% 0 0% 215,256 >99.9% >99.9% 0.0
334 missingindicator_cnt_installment_future_min float64 1.7 MB 2 <0.1% 0 0% 202,669 94.2% 94.2% 0.0
335 missingindicator_cnt_installment_mature_cum_max float64 1.7 MB 2 <0.1% 0 0% 154,158 71.6% 71.6% 1.0
336 missingindicator_cnt_installment_mature_cum_min float64 1.7 MB 2 <0.1% 0 0% 154,158 71.6% 71.6% 1.0
337 missingindicator_cnt_installment_median float64 1.7 MB 2 <0.1% 0 0% 202,669 94.2% 94.2% 0.0
338 missingindicator_cnt_installment_min float64 1.7 MB 2 <0.1% 0 0% 202,669 94.2% 94.2% 0.0
339 missingindicator_cnt_installment_range float64 1.7 MB 2 <0.1% 0 0% 202,669 94.2% 94.2% 0.0
340 missingindicator_cnt_installments_diff_mean float64 1.7 MB 2 <0.1% 0 0% 202,669 94.2% 94.2% 0.0
341 missingindicator_cnt_installments_diff_min float64 1.7 MB 2 <0.1% 0 0% 202,669 94.2% 94.2% 0.0
342 missingindicator_cnt_installments_diff_range float64 1.7 MB 2 <0.1% 0 0% 202,669 94.2% 94.2% 0.0
343 missingindicator_cnt_payment_median float64 1.7 MB 2 <0.1% 0 0% 203,505 94.5% 94.5% 0.0
344 missingindicator_cnt_payment_min float64 1.7 MB 2 <0.1% 0 0% 203,505 94.5% 94.5% 0.0
345 missingindicator_cnt_payment_range float64 1.7 MB 2 <0.1% 0 0% 203,505 94.5% 94.5% 0.0
346 missingindicator_days_credit_enddate_max float64 1.7 MB 2 <0.1% 0 0% 182,825 84.9% 84.9% 0.0
347 missingindicator_days_credit_enddate_min float64 1.7 MB 2 <0.1% 0 0% 182,825 84.9% 84.9% 0.0
348 missingindicator_days_credit_enddate_std float64 1.7 MB 2 <0.1% 0 0% 156,060 72.5% 72.5% 0.0
349 missingindicator_days_credit_max float64 1.7 MB 2 <0.1% 0 0% 184,421 85.7% 85.7% 0.0
350 missingindicator_days_credit_median float64 1.7 MB 2 <0.1% 0 0% 184,421 85.7% 85.7% 0.0
351 missingindicator_days_credit_overdue_max float64 1.7 MB 2 <0.1% 0 0% 184,421 85.7% 85.7% 0.0
352 missingindicator_days_credit_overdue_mean float64 1.7 MB 2 <0.1% 0 0% 184,421 85.7% 85.7% 0.0
353 missingindicator_days_credit_overdue_median float64 1.7 MB 2 <0.1% 0 0% 184,421 85.7% 85.7% 0.0
354 missingindicator_days_credit_range float64 1.7 MB 2 <0.1% 0 0% 184,421 85.7% 85.7% 0.0
355 missingindicator_days_credit_std float64 1.7 MB 2 <0.1% 0 0% 159,292 74.0% 74.0% 0.0
356 missingindicator_days_credit_update_max float64 1.7 MB 2 <0.1% 0 0% 184,421 85.7% 85.7% 0.0
357 missingindicator_days_credit_update_median float64 1.7 MB 2 <0.1% 0 0% 184,421 85.7% 85.7% 0.0
358 missingindicator_days_credit_update_range float64 1.7 MB 2 <0.1% 0 0% 184,421 85.7% 85.7% 0.0
359 missingindicator_days_decision_max float64 1.7 MB 2 <0.1% 0 0% 203,801 94.7% 94.7% 0.0
360 missingindicator_days_decision_median float64 1.7 MB 2 <0.1% 0 0% 203,801 94.7% 94.7% 0.0
361 missingindicator_days_decision_range float64 1.7 MB 2 <0.1% 0 0% 203,801 94.7% 94.7% 0.0
362 missingindicator_days_enddate_fact_max float64 1.7 MB 2 <0.1% 0 0% 161,387 75.0% 75.0% 0.0
363 missingindicator_days_enddate_fact_median float64 1.7 MB 2 <0.1% 0 0% 161,387 75.0% 75.0% 0.0
364 missingindicator_days_enddate_fact_range float64 1.7 MB 2 <0.1% 0 0% 161,387 75.0% 75.0% 0.0
365 missingindicator_days_first_draw_min float64 1.7 MB 2 <0.1% 0 0% 202,880 94.3% 94.3% 0.0
366 missingindicator_days_last_due_1st_version_max float64 1.7 MB 2 <0.1% 0 0% 202,880 94.3% 94.3% 0.0
367 missingindicator_days_last_due_1st_version_mean float64 1.7 MB 2 <0.1% 0 0% 202,880 94.3% 94.3% 0.0
368 missingindicator_days_last_due_1st_version_median float64 1.7 MB 2 <0.1% 0 0% 202,880 94.3% 94.3% 0.0
369 missingindicator_days_last_due_1st_version_min float64 1.7 MB 2 <0.1% 0 0% 202,880 94.3% 94.3% 0.0
370 missingindicator_days_last_due_max float64 1.7 MB 2 <0.1% 0 0% 202,880 94.3% 94.3% 0.0
371 missingindicator_days_last_due_range float64 1.7 MB 2 <0.1% 0 0% 202,880 94.3% 94.3% 0.0
372 missingindicator_days_termination_median float64 1.7 MB 2 <0.1% 0 0% 202,880 94.3% 94.3% 0.0
373 missingindicator_days_termination_min float64 1.7 MB 2 <0.1% 0 0% 202,880 94.3% 94.3% 0.0
374 missingindicator_diff_amt_installment_payment_max float64 1.7 MB 2 <0.1% 0 0% 204,220 94.9% 94.9% 0.0
375 missingindicator_diff_amt_installment_payment_mean float64 1.7 MB 2 <0.1% 0 0% 204,220 94.9% 94.9% 0.0
376 missingindicator_diff_amt_installment_payment_median float64 1.7 MB 2 <0.1% 0 0% 204,220 94.9% 94.9% 0.0
377 missingindicator_diff_amt_installment_payment_range float64 1.7 MB 2 <0.1% 0 0% 204,220 94.9% 94.9% 0.0
378 missingindicator_diff_days_installment_payment_max float64 1.7 MB 2 <0.1% 0 0% 204,220 94.9% 94.9% 0.0
379 missingindicator_diff_days_installment_payment_mean float64 1.7 MB 2 <0.1% 0 0% 204,220 94.9% 94.9% 0.0
380 missingindicator_diff_days_installment_payment_median float64 1.7 MB 2 <0.1% 0 0% 204,220 94.9% 94.9% 0.0
381 missingindicator_diff_days_installment_payment_range float64 1.7 MB 2 <0.1% 0 0% 204,220 94.9% 94.9% 0.0
382 missingindicator_diff_days_installment_payment_sum float64 1.7 MB 2 <0.1% 0 0% 204,223 94.9% 94.9% 0.0
383 missingindicator_diff_days_installment_payment_sum_late_only float64 1.7 MB 2 <0.1% 0 0% 204,223 94.9% 94.9% 0.0
384 missingindicator_diff_percent_installment_payment_mean float64 1.7 MB 2 <0.1% 0 0% 204,220 94.9% 94.9% 0.0
385 missingindicator_diff_percent_installment_payment_median float64 1.7 MB 2 <0.1% 0 0% 204,220 94.9% 94.9% 0.0
386 missingindicator_diff_percent_installment_payment_min float64 1.7 MB 2 <0.1% 0 0% 204,220 94.9% 94.9% 0.0
387 missingindicator_diff_percent_installment_payment_range float64 1.7 MB 2 <0.1% 0 0% 204,220 94.9% 94.9% 0.0
388 missingindicator_n_car_loans float64 1.7 MB 2 <0.1% 0 0% 184,421 85.7% 85.7% 0.0
389 missingindicator_n_cash_loans float64 1.7 MB 2 <0.1% 0 0% 203,801 94.7% 94.7% 0.0
390 missingindicator_n_channel_type_ap_minus float64 1.7 MB 2 <0.1% 0 0% 203,801 94.7% 94.7% 0.0
391 missingindicator_n_channel_type_car_dealer float64 1.7 MB 2 <0.1% 0 0% 203,801 94.7% 94.7% 0.0
392 missingindicator_n_channel_type_channel_corporate_sales float64 1.7 MB 2 <0.1% 0 0% 203,801 94.7% 94.7% 0.0
393 missingindicator_n_channel_type_contact_center float64 1.7 MB 2 <0.1% 0 0% 203,801 94.7% 94.7% 0.0
394 missingindicator_n_channel_type_countrywide float64 1.7 MB 2 <0.1% 0 0% 203,801 94.7% 94.7% 0.0
395 missingindicator_n_channel_type_credit_and_cash float64 1.7 MB 2 <0.1% 0 0% 203,801 94.7% 94.7% 0.0
396 missingindicator_n_channel_type_regional_and_local float64 1.7 MB 2 <0.1% 0 0% 203,801 94.7% 94.7% 0.0
397 missingindicator_n_channel_type_stone float64 1.7 MB 2 <0.1% 0 0% 203,801 94.7% 94.7% 0.0
398 missingindicator_n_client_type_new float64 1.7 MB 2 <0.1% 0 0% 203,801 94.7% 94.7% 0.0
399 missingindicator_n_client_type_refreshed float64 1.7 MB 2 <0.1% 0 0% 203,801 94.7% 94.7% 0.0
400 missingindicator_n_client_type_repeater float64 1.7 MB 2 <0.1% 0 0% 203,801 94.7% 94.7% 0.0
401 missingindicator_n_consumer_loans float64 1.7 MB 2 <0.1% 0 0% 203,801 94.7% 94.7% 0.0
402 missingindicator_n_contract_status_refused float64 1.7 MB 2 <0.1% 0 0% 203,801 94.7% 94.7% 0.0
403 missingindicator_n_contract_status_unused_offer float64 1.7 MB 2 <0.1% 0 0% 203,801 94.7% 94.7% 0.0
404 missingindicator_n_contracts_credit_card_completed float64 1.7 MB 2 <0.1% 0 0% 154,158 71.6% 71.6% 1.0
405 missingindicator_n_credit_card_credits float64 1.7 MB 2 <0.1% 0 0% 184,421 85.7% 85.7% 0.0
406 missingindicator_n_credits_active float64 1.7 MB 2 <0.1% 0 0% 184,421 85.7% 85.7% 0.0
407 missingindicator_n_credits_sold float64 1.7 MB 2 <0.1% 0 0% 184,421 85.7% 85.7% 0.0
408 missingindicator_n_credits_total float64 1.7 MB 2 <0.1% 0 0% 184,421 85.7% 85.7% 0.0
409 missingindicator_n_currency_2 float64 1.7 MB 2 <0.1% 0 0% 184,421 85.7% 85.7% 0.0
410 missingindicator_n_different_channels float64 1.7 MB 2 <0.1% 0 0% 203,801 94.7% 94.7% 0.0
411 missingindicator_n_different_contract_types float64 1.7 MB 2 <0.1% 0 0% 203,801 94.7% 94.7% 0.0
412 missingindicator_n_different_credit_types float64 1.7 MB 2 <0.1% 0 0% 184,421 85.7% 85.7% 0.0
413 missingindicator_n_different_currencies float64 1.7 MB 2 <0.1% 0 0% 184,421 85.7% 85.7% 0.0
414 missingindicator_n_installments_late float64 1.7 MB 2 <0.1% 0 0% 204,223 94.9% 94.9% 0.0
415 missingindicator_n_installments_late_30 float64 1.7 MB 2 <0.1% 0 0% 204,223 94.9% 94.9% 0.0
416 missingindicator_n_installments_late_7 float64 1.7 MB 2 <0.1% 0 0% 204,223 94.9% 94.9% 0.0
417 missingindicator_n_installments_total float64 1.7 MB 2 <0.1% 0 0% 204,223 94.9% 94.9% 0.0
418 missingindicator_n_microloans float64 1.7 MB 2 <0.1% 0 0% 184,421 85.7% 85.7% 0.0
419 missingindicator_n_mortgages float64 1.7 MB 2 <0.1% 0 0% 184,421 85.7% 85.7% 0.0
420 missingindicator_n_nflag_insured_on_approval_mean float64 1.7 MB 2 <0.1% 0 0% 202,880 94.3% 94.3% 0.0
421 missingindicator_n_nflag_insured_on_approval_sum float64 1.7 MB 2 <0.1% 0 0% 203,801 94.7% 94.7% 0.0
422 missingindicator_n_other_type_credit float64 1.7 MB 2 <0.1% 0 0% 184,421 85.7% 85.7% 0.0
423 missingindicator_n_payment_type_cash_through_bank float64 1.7 MB 2 <0.1% 0 0% 203,801 94.7% 94.7% 0.0
424 missingindicator_n_payment_type_not_available float64 1.7 MB 2 <0.1% 0 0% 203,801 94.7% 94.7% 0.0
425 missingindicator_n_previous_credit_card_applications float64 1.7 MB 2 <0.1% 0 0% 154,158 71.6% 71.6% 1.0
426 missingindicator_n_previous_credit_card_applications_signed float64 1.7 MB 2 <0.1% 0 0% 154,158 71.6% 71.6% 1.0
427 missingindicator_n_previous_pos_applications float64 1.7 MB 2 <0.1% 0 0% 202,687 94.2% 94.2% 0.0
428 missingindicator_n_previous_pos_applications_completed float64 1.7 MB 2 <0.1% 0 0% 202,687 94.2% 94.2% 0.0
429 missingindicator_n_previous_pos_applications_signed float64 1.7 MB 2 <0.1% 0 0% 202,687 94.2% 94.2% 0.0
430 missingindicator_n_product_type_walk_in float64 1.7 MB 2 <0.1% 0 0% 203,801 94.7% 94.7% 0.0
431 missingindicator_n_reject_reason_limit float64 1.7 MB 2 <0.1% 0 0% 203,801 94.7% 94.7% 0.0
432 missingindicator_n_reject_reason_scoc float64 1.7 MB 2 <0.1% 0 0% 203,801 94.7% 94.7% 0.0
433 missingindicator_n_reject_reason_scofr float64 1.7 MB 2 <0.1% 0 0% 203,801 94.7% 94.7% 0.0
434 missingindicator_n_revolving_loans float64 1.7 MB 2 <0.1% 0 0% 203,801 94.7% 94.7% 0.0
435 missingindicator_n_yield_group_high float64 1.7 MB 2 <0.1% 0 0% 203,801 94.7% 94.7% 0.0
436 missingindicator_n_yield_group_low_action float64 1.7 MB 2 <0.1% 0 0% 203,801 94.7% 94.7% 0.0
437 missingindicator_n_yield_group_low_normal float64 1.7 MB 2 <0.1% 0 0% 203,801 94.7% 94.7% 0.0
438 missingindicator_n_yield_group_middle float64 1.7 MB 2 <0.1% 0 0% 203,801 94.7% 94.7% 0.0
439 missingindicator_percent_installments_early float64 1.7 MB 2 <0.1% 0 0% 204,223 94.9% 94.9% 0.0
440 missingindicator_percent_installments_late float64 1.7 MB 2 <0.1% 0 0% 204,223 94.9% 94.9% 0.0
441 missingindicator_percent_installments_late_30 float64 1.7 MB 2 <0.1% 0 0% 204,223 94.9% 94.9% 0.0
442 missingindicator_percent_installments_late_60 float64 1.7 MB 2 <0.1% 0 0% 204,223 94.9% 94.9% 0.0
443 missingindicator_percent_installments_late_7 float64 1.7 MB 2 <0.1% 0 0% 204,223 94.9% 94.9% 0.0
444 missingindicator_rate_down_payment_max float64 1.7 MB 2 <0.1% 0 0% 191,554 89.0% 89.0% 0.0
445 missingindicator_rate_down_payment_range float64 1.7 MB 2 <0.1% 0 0% 191,554 89.0% 89.0% 0.0
446 missingindicator_rate_interest_privileged_count float64 1.7 MB 2 <0.1% 0 0% 203,801 94.7% 94.7% 0.0
447 missingindicator_sk_dpd_credit_card_max float64 1.7 MB 2 <0.1% 0 0% 154,158 71.6% 71.6% 1.0
448 missingindicator_sk_dpd_credit_card_median float64 1.7 MB 2 <0.1% 0 0% 154,158 71.6% 71.6% 1.0
449 missingindicator_sk_dpd_def_credit_card_max float64 1.7 MB 2 <0.1% 0 0% 154,158 71.6% 71.6% 1.0
450 missingindicator_sk_dpd_def_pos_applications_max float64 1.7 MB 2 <0.1% 0 0% 202,687 94.2% 94.2% 0.0
451 missingindicator_sk_dpd_pos_applications_max float64 1.7 MB 2 <0.1% 0 0% 202,687 94.2% 94.2% 0.0
452 missingindicator_years_employed float64 1.7 MB 2 <0.1% 0 0% 176,501 82.0% 82.0% 0.0
453 fondkapremont_mode_not_specified float64 1.7 MB 2 <0.1% 0 0% 211,294 98.2% 98.2% 0.0
454 fondkapremont_mode_org_spec_account float64 1.7 MB 2 <0.1% 0 0% 211,329 98.2% 98.2% 0.0
455 fondkapremont_mode_reg_oper_account float64 1.7 MB 2 <0.1% 0 0% 163,472 75.9% 75.9% 0.0
456 fondkapremont_mode_reg_oper_spec_account float64 1.7 MB 2 <0.1% 0 0% 206,775 96.1% 96.1% 0.0
457 fondkapremont_mode_nan float64 1.7 MB 2 <0.1% 0 0% 147,099 68.3% 68.3% 1.0
458 housetype_mode_block_of_flats float64 1.7 MB 2 <0.1% 0 0% 109,742 51.0% 51.0% 0.0
459 housetype_mode_specific_housing float64 1.7 MB 2 <0.1% 0 0% 214,216 99.5% 99.5% 0.0
460 housetype_mode_terraced_house float64 1.7 MB 2 <0.1% 0 0% 214,390 99.6% 99.6% 0.0
461 housetype_mode_nan float64 1.7 MB 2 <0.1% 0 0% 107,834 50.1% 50.1% 1.0
462 name_contract_type_cash_loans float64 1.7 MB 2 <0.1% 0 0% 194,675 90.4% 90.4% 1.0
463 name_contract_type_revolving_loans float64 1.7 MB 2 <0.1% 0 0% 194,675 90.4% 90.4% 0.0
464 name_housing_type_co_op_apartment float64 1.7 MB 2 <0.1% 0 0% 214,466 99.6% 99.6% 0.0
465 name_housing_type_house_apartment float64 1.7 MB 2 <0.1% 0 0% 191,159 88.8% 88.8% 1.0
466 name_housing_type_municipal_apartment float64 1.7 MB 2 <0.1% 0 0% 207,454 96.4% 96.4% 0.0
467 name_housing_type_office_apartment float64 1.7 MB 2 <0.1% 0 0% 213,440 99.2% 99.2% 0.0
468 name_housing_type_rented_apartment float64 1.7 MB 2 <0.1% 0 0% 211,900 98.4% 98.4% 0.0
469 name_housing_type_with_parents float64 1.7 MB 2 <0.1% 0 0% 204,927 95.2% 95.2% 0.0
470 name_income_type_businessman float64 1.7 MB 2 <0.1% 0 0% 215,248 >99.9% >99.9% 0.0
471 name_income_type_commercial_associate float64 1.7 MB 2 <0.1% 0 0% 165,151 76.7% 76.7% 0.0
472 name_income_type_maternity_leave float64 1.7 MB 2 <0.1% 0 0% 215,254 >99.9% >99.9% 0.0
473 name_income_type_pensioner float64 1.7 MB 2 <0.1% 0 0% 176,509 82.0% 82.0% 0.0
474 name_income_type_state_servant float64 1.7 MB 2 <0.1% 0 0% 199,875 92.9% 92.9% 0.0
475 name_income_type_student float64 1.7 MB 2 <0.1% 0 0% 215,248 >99.9% >99.9% 0.0
476 name_income_type_unemployed float64 1.7 MB 2 <0.1% 0 0% 215,241 >99.9% >99.9% 0.0
477 name_income_type_working float64 1.7 MB 2 <0.1% 0 0% 110,984 51.6% 51.6% 1.0
478 name_type_suite_children float64 1.7 MB 2 <0.1% 0 0% 212,930 98.9% 98.9% 0.0
479 name_type_suite_family float64 1.7 MB 2 <0.1% 0 0% 187,256 87.0% 87.0% 0.0
480 name_type_suite_group_of_people float64 1.7 MB 2 <0.1% 0 0% 215,059 99.9% 99.9% 0.0
481 name_type_suite_other_a float64 1.7 MB 2 <0.1% 0 0% 214,643 99.7% 99.7% 0.0
482 name_type_suite_other_b float64 1.7 MB 2 <0.1% 0 0% 214,009 99.4% 99.4% 0.0
483 name_type_suite_spouse_partner float64 1.7 MB 2 <0.1% 0 0% 207,378 96.3% 96.3% 0.0
484 name_type_suite_unaccompanied float64 1.7 MB 2 <0.1% 0 0% 174,089 80.9% 80.9% 1.0
485 name_type_suite_nan float64 1.7 MB 2 <0.1% 0 0% 214,356 99.6% 99.6% 0.0
486 occupation_type_accountants float64 1.7 MB 2 <0.1% 0 0% 208,415 96.8% 96.8% 0.0
487 occupation_type_cleaning_staff float64 1.7 MB 2 <0.1% 0 0% 211,947 98.5% 98.5% 0.0
488 occupation_type_cooking_staff float64 1.7 MB 2 <0.1% 0 0% 211,079 98.1% 98.1% 0.0
489 occupation_type_core_staff float64 1.7 MB 2 <0.1% 0 0% 195,912 91.0% 91.0% 0.0
490 occupation_type_drivers float64 1.7 MB 2 <0.1% 0 0% 202,169 93.9% 93.9% 0.0
491 occupation_type_hr_staff float64 1.7 MB 2 <0.1% 0 0% 214,884 99.8% 99.8% 0.0
492 occupation_type_high_skill_tech_staff float64 1.7 MB 2 <0.1% 0 0% 207,280 96.3% 96.3% 0.0
493 occupation_type_it_staff float64 1.7 MB 2 <0.1% 0 0% 214,896 99.8% 99.8% 0.0
494 occupation_type_laborers float64 1.7 MB 2 <0.1% 0 0% 176,666 82.1% 82.1% 0.0
495 occupation_type_low_skill_laborers float64 1.7 MB 2 <0.1% 0 0% 213,777 99.3% 99.3% 0.0
496 occupation_type_managers float64 1.7 MB 2 <0.1% 0 0% 200,272 93.0% 93.0% 0.0
497 occupation_type_medicine_staff float64 1.7 MB 2 <0.1% 0 0% 209,207 97.2% 97.2% 0.0
498 occupation_type_private_service_staff float64 1.7 MB 2 <0.1% 0 0% 213,406 99.1% 99.1% 0.0
499 occupation_type_realty_agents float64 1.7 MB 2 <0.1% 0 0% 214,733 99.8% 99.8% 0.0
500 occupation_type_sales_staff float64 1.7 MB 2 <0.1% 0 0% 192,972 89.6% 89.6% 0.0
501 occupation_type_secretaries float64 1.7 MB 2 <0.1% 0 0% 214,342 99.6% 99.6% 0.0
502 occupation_type_security_staff float64 1.7 MB 2 <0.1% 0 0% 210,559 97.8% 97.8% 0.0
503 occupation_type_waiters_barmen_staff float64 1.7 MB 2 <0.1% 0 0% 214,333 99.6% 99.6% 0.0
504 occupation_type_nan float64 1.7 MB 2 <0.1% 0 0% 147,777 68.7% 68.7% 0.0
505 organization_type_advertising float64 1.7 MB 2 <0.1% 0 0% 214,968 99.9% 99.9% 0.0
506 organization_type_agriculture float64 1.7 MB 2 <0.1% 0 0% 213,527 99.2% 99.2% 0.0
507 organization_type_bank float64 1.7 MB 2 <0.1% 0 0% 213,522 99.2% 99.2% 0.0
508 organization_type_business_entity_type_1 float64 1.7 MB 2 <0.1% 0 0% 211,043 98.0% 98.0% 0.0
509 organization_type_business_entity_type_2 float64 1.7 MB 2 <0.1% 0 0% 207,883 96.6% 96.6% 0.0
510 organization_type_business_entity_type_3 float64 1.7 MB 2 <0.1% 0 0% 167,675 77.9% 77.9% 0.0
511 organization_type_cleaning float64 1.7 MB 2 <0.1% 0 0% 215,062 99.9% 99.9% 0.0
512 organization_type_construction float64 1.7 MB 2 <0.1% 0 0% 210,553 97.8% 97.8% 0.0
513 organization_type_culture float64 1.7 MB 2 <0.1% 0 0% 214,988 99.9% 99.9% 0.0
514 organization_type_electricity float64 1.7 MB 2 <0.1% 0 0% 214,583 99.7% 99.7% 0.0
515 organization_type_emergency float64 1.7 MB 2 <0.1% 0 0% 214,862 99.8% 99.8% 0.0
516 organization_type_government float64 1.7 MB 2 <0.1% 0 0% 207,933 96.6% 96.6% 0.0
517 organization_type_hotel float64 1.7 MB 2 <0.1% 0 0% 214,571 99.7% 99.7% 0.0
518 organization_type_housing float64 1.7 MB 2 <0.1% 0 0% 213,202 99.0% 99.0% 0.0
519 organization_type_industry_type_1 float64 1.7 MB 2 <0.1% 0 0% 214,520 99.7% 99.7% 0.0
520 organization_type_industry_type_10 float64 1.7 MB 2 <0.1% 0 0% 215,182 >99.9% >99.9% 0.0
521 organization_type_industry_type_11 float64 1.7 MB 2 <0.1% 0 0% 213,369 99.1% 99.1% 0.0
522 organization_type_industry_type_12 float64 1.7 MB 2 <0.1% 0 0% 214,999 99.9% 99.9% 0.0
523 organization_type_industry_type_13 float64 1.7 MB 2 <0.1% 0 0% 215,211 >99.9% >99.9% 0.0
524 organization_type_industry_type_2 float64 1.7 MB 2 <0.1% 0 0% 214,931 99.8% 99.8% 0.0
525 organization_type_industry_type_3 float64 1.7 MB 2 <0.1% 0 0% 212,965 98.9% 98.9% 0.0
526 organization_type_industry_type_4 float64 1.7 MB 2 <0.1% 0 0% 214,624 99.7% 99.7% 0.0
527 organization_type_industry_type_5 float64 1.7 MB 2 <0.1% 0 0% 214,864 99.8% 99.8% 0.0
528 organization_type_industry_type_6 float64 1.7 MB 2 <0.1% 0 0% 215,180 >99.9% >99.9% 0.0
529 organization_type_industry_type_7 float64 1.7 MB 2 <0.1% 0 0% 214,354 99.6% 99.6% 0.0
530 organization_type_industry_type_8 float64 1.7 MB 2 <0.1% 0 0% 215,240 >99.9% >99.9% 0.0
531 organization_type_industry_type_9 float64 1.7 MB 2 <0.1% 0 0% 212,861 98.9% 98.9% 0.0
532 organization_type_insurance float64 1.7 MB 2 <0.1% 0 0% 214,842 99.8% 99.8% 0.0
533 organization_type_kindergarten float64 1.7 MB 2 <0.1% 0 0% 210,366 97.7% 97.7% 0.0
534 organization_type_legal_services float64 1.7 MB 2 <0.1% 0 0% 215,039 99.9% 99.9% 0.0
535 organization_type_medicine float64 1.7 MB 2 <0.1% 0 0% 207,340 96.3% 96.3% 0.0
536 organization_type_military float64 1.7 MB 2 <0.1% 0 0% 213,400 99.1% 99.1% 0.0
537 organization_type_mobile float64 1.7 MB 2 <0.1% 0 0% 215,046 99.9% 99.9% 0.0
538 organization_type_other float64 1.7 MB 2 <0.1% 0 0% 203,595 94.6% 94.6% 0.0
539 organization_type_police float64 1.7 MB 2 <0.1% 0 0% 213,649 99.3% 99.3% 0.0
540 organization_type_postal float64 1.7 MB 2 <0.1% 0 0% 213,737 99.3% 99.3% 0.0
541 organization_type_realtor float64 1.7 MB 2 <0.1% 0 0% 214,978 99.9% 99.9% 0.0
542 organization_type_religion float64 1.7 MB 2 <0.1% 0 0% 215,198 >99.9% >99.9% 0.0
543 organization_type_restaurant float64 1.7 MB 2 <0.1% 0 0% 213,972 99.4% 99.4% 0.0
544 organization_type_school float64 1.7 MB 2 <0.1% 0 0% 208,961 97.1% 97.1% 0.0
545 organization_type_security float64 1.7 MB 2 <0.1% 0 0% 212,955 98.9% 98.9% 0.0
546 organization_type_security_ministries float64 1.7 MB 2 <0.1% 0 0% 213,854 99.3% 99.3% 0.0
547 organization_type_self_employed float64 1.7 MB 2 <0.1% 0 0% 188,576 87.6% 87.6% 0.0
548 organization_type_services float64 1.7 MB 2 <0.1% 0 0% 214,168 99.5% 99.5% 0.0
549 organization_type_telecom float64 1.7 MB 2 <0.1% 0 0% 214,861 99.8% 99.8% 0.0
550 organization_type_trade_type_1 float64 1.7 MB 2 <0.1% 0 0% 215,020 99.9% 99.9% 0.0
551 organization_type_trade_type_2 float64 1.7 MB 2 <0.1% 0 0% 213,919 99.4% 99.4% 0.0
552 organization_type_trade_type_3 float64 1.7 MB 2 <0.1% 0 0% 212,832 98.9% 98.9% 0.0
553 organization_type_trade_type_4 float64 1.7 MB 2 <0.1% 0 0% 215,212 >99.9% >99.9% 0.0
554 organization_type_trade_type_5 float64 1.7 MB 2 <0.1% 0 0% 215,223 >99.9% >99.9% 0.0
555 organization_type_trade_type_6 float64 1.7 MB 2 <0.1% 0 0% 214,832 99.8% 99.8% 0.0
556 organization_type_trade_type_7 float64 1.7 MB 2 <0.1% 0 0% 209,807 97.5% 97.5% 0.0
557 organization_type_transport_type_1 float64 1.7 MB 2 <0.1% 0 0% 215,112 99.9% 99.9% 0.0
558 organization_type_transport_type_2 float64 1.7 MB 2 <0.1% 0 0% 213,728 99.3% 99.3% 0.0
559 organization_type_transport_type_3 float64 1.7 MB 2 <0.1% 0 0% 214,406 99.6% 99.6% 0.0
560 organization_type_transport_type_4 float64 1.7 MB 2 <0.1% 0 0% 211,508 98.3% 98.3% 0.0
561 organization_type_university float64 1.7 MB 2 <0.1% 0 0% 214,340 99.6% 99.6% 0.0
562 organization_type_xna float64 1.7 MB 2 <0.1% 0 0% 176,501 82.0% 82.0% 0.0
563 wallsmaterial_mode_block float64 1.7 MB 2 <0.1% 0 0% 208,728 97.0% 97.0% 0.0
564 wallsmaterial_mode_mixed float64 1.7 MB 2 <0.1% 0 0% 213,683 99.3% 99.3% 0.0
565 wallsmaterial_mode_monolithic float64 1.7 MB 2 <0.1% 0 0% 214,008 99.4% 99.4% 0.0
566 wallsmaterial_mode_others float64 1.7 MB 2 <0.1% 0 0% 214,119 99.5% 99.5% 0.0
567 wallsmaterial_mode_panel float64 1.7 MB 2 <0.1% 0 0% 168,959 78.5% 78.5% 0.0
568 wallsmaterial_mode_stone_brick float64 1.7 MB 2 <0.1% 0 0% 169,849 78.9% 78.9% 0.0
569 wallsmaterial_mode_wooden float64 1.7 MB 2 <0.1% 0 0% 211,525 98.3% 98.3% 0.0
570 wallsmaterial_mode_nan float64 1.7 MB 2 <0.1% 0 0% 109,329 50.8% 50.8% 1.0
571 mode_credit_type_car_loan float64 1.7 MB 2 <0.1% 0 0% 212,121 98.5% 98.5% 0.0
572 mode_credit_type_consumer_credit float64 1.7 MB 2 <0.1% 0 0% 160,802 74.7% 74.7% 1.0
573 mode_credit_type_credit_card float64 1.7 MB 2 <0.1% 0 0% 196,123 91.1% 91.1% 0.0
574 mode_credit_type_microloan float64 1.7 MB 2 <0.1% 0 0% 214,789 99.8% 99.8% 0.0
575 mode_credit_type_mortgage float64 1.7 MB 2 <0.1% 0 0% 214,478 99.6% 99.6% 0.0
576 mode_credit_type_other float64 1.7 MB 2 <0.1% 0 0% 215,155 >99.9% >99.9% 0.0
577 mode_credit_type_nan float64 1.7 MB 2 <0.1% 0 0% 184,421 85.7% 85.7% 0.0
578 name_education_type_academic_degree float64 1.7 MB 2 <0.1% 0 0% 215,153 >99.9% >99.9% 0.0
579 name_education_type_higher_education float64 1.7 MB 2 <0.1% 0 0% 163,003 75.7% 75.7% 0.0
580 name_education_type_incomplete_higher float64 1.7 MB 2 <0.1% 0 0% 208,006 96.6% 96.6% 0.0
581 name_education_type_lower_secondary float64 1.7 MB 2 <0.1% 0 0% 212,602 98.8% 98.8% 0.0
582 name_education_type_secondary_secondary_special float64 1.7 MB 2 <0.1% 0 0% 152,993 71.1% 71.1% 1.0

6.1.3 Steps After Pre-Processing

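Next, let’s identify the columns that remain problematic after pre-processing, i.e., those with at most one unique value, at least 90% missing values, or a dominant value that covers at least 99.85% of the non-missing entries: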

Code
problematic_columns_2 = df_processed_col_info.query(
    "n_unique <= 1 or p_missing >= 90.00 or p_dom_excl_na >= 99.85"
)
print(f"N columns to remove: {problematic_columns_2.shape[0]}")
problematic_columns_2.pipe(an.style_col_info)
N columns to remove: 33
Table 6.4. Info on problematic columns to remove after pre-processing
  column data_type memory_size n_unique p_unique n_missing p_missing n_dominant p_dominant p_dom_excl_na dominant
27 FLAG_DOCUMENT_15 float64 1.7 MB 2 <0.1% 0 0% 215,015 99.9% 99.9% 0.0
141 days_credit_overdue_median float64 1.7 MB 168 0.1% 0 0% 214,955 99.9% 99.9% 0.0
180 n_channel_type_car_dealer float64 1.7 MB 6 <0.1% 0 0% 215,036 99.9% 99.9% 0.0
242 missingindicator_AMT_ANNUITY float64 1.7 MB 2 <0.1% 0 0% 215,249 >99.9% >99.9% 0.0
250 missingindicator_CNT_FAM_MEMBERS float64 1.7 MB 2 <0.1% 0 0% 215,256 >99.9% >99.9% 0.0
252 missingindicator_DAYS_LAST_PHONE_CHANGE float64 1.7 MB 2 <0.1% 0 0% 215,256 >99.9% >99.9% 0.0
275 missingindicator_amt_annuity_to_credit_ratio float64 1.7 MB 2 <0.1% 0 0% 215,249 >99.9% >99.9% 0.0
276 missingindicator_amt_annuity_to_income_per_family_member float64 1.7 MB 2 <0.1% 0 0% 215,248 >99.9% >99.9% 0.0
277 missingindicator_amt_annuity_to_income_ratio float64 1.7 MB 2 <0.1% 0 0% 215,249 >99.9% >99.9% 0.0
333 missingindicator_cnt_fam_members_excluding_children float64 1.7 MB 2 <0.1% 0 0% 215,256 >99.9% >99.9% 0.0
470 name_income_type_businessman float64 1.7 MB 2 <0.1% 0 0% 215,248 >99.9% >99.9% 0.0
472 name_income_type_maternity_leave float64 1.7 MB 2 <0.1% 0 0% 215,254 >99.9% >99.9% 0.0
475 name_income_type_student float64 1.7 MB 2 <0.1% 0 0% 215,248 >99.9% >99.9% 0.0
476 name_income_type_unemployed float64 1.7 MB 2 <0.1% 0 0% 215,241 >99.9% >99.9% 0.0
480 name_type_suite_group_of_people float64 1.7 MB 2 <0.1% 0 0% 215,059 99.9% 99.9% 0.0
505 organization_type_advertising float64 1.7 MB 2 <0.1% 0 0% 214,968 99.9% 99.9% 0.0
511 organization_type_cleaning float64 1.7 MB 2 <0.1% 0 0% 215,062 99.9% 99.9% 0.0
513 organization_type_culture float64 1.7 MB 2 <0.1% 0 0% 214,988 99.9% 99.9% 0.0
520 organization_type_industry_type_10 float64 1.7 MB 2 <0.1% 0 0% 215,182 >99.9% >99.9% 0.0
522 organization_type_industry_type_12 float64 1.7 MB 2 <0.1% 0 0% 214,999 99.9% 99.9% 0.0
523 organization_type_industry_type_13 float64 1.7 MB 2 <0.1% 0 0% 215,211 >99.9% >99.9% 0.0
528 organization_type_industry_type_6 float64 1.7 MB 2 <0.1% 0 0% 215,180 >99.9% >99.9% 0.0
530 organization_type_industry_type_8 float64 1.7 MB 2 <0.1% 0 0% 215,240 >99.9% >99.9% 0.0
534 organization_type_legal_services float64 1.7 MB 2 <0.1% 0 0% 215,039 99.9% 99.9% 0.0
537 organization_type_mobile float64 1.7 MB 2 <0.1% 0 0% 215,046 99.9% 99.9% 0.0
541 organization_type_realtor float64 1.7 MB 2 <0.1% 0 0% 214,978 99.9% 99.9% 0.0
542 organization_type_religion float64 1.7 MB 2 <0.1% 0 0% 215,198 >99.9% >99.9% 0.0
550 organization_type_trade_type_1 float64 1.7 MB 2 <0.1% 0 0% 215,020 99.9% 99.9% 0.0
553 organization_type_trade_type_4 float64 1.7 MB 2 <0.1% 0 0% 215,212 >99.9% >99.9% 0.0
554 organization_type_trade_type_5 float64 1.7 MB 2 <0.1% 0 0% 215,223 >99.9% >99.9% 0.0
557 organization_type_transport_type_1 float64 1.7 MB 2 <0.1% 0 0% 215,112 99.9% 99.9% 0.0
576 mode_credit_type_other float64 1.7 MB 2 <0.1% 0 0% 215,155 >99.9% >99.9% 0.0
578 name_education_type_academic_degree float64 1.7 MB 2 <0.1% 0 0% 215,153 >99.9% >99.9% 0.0
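
To make the filtering rule above concrete, here is a minimal sketch on a toy column-info table. The column names and values are invented for illustration, and it assumes that `p_missing` and `p_dom_excl_na` are stored as numeric percentages (the table above shows them only in styled form):

```python
import pandas as pd

# Toy column-info table; names and values are made up for illustration.
col_info = pd.DataFrame(
    {
        "column": ["const_flag", "mostly_missing", "rare_category", "useful"],
        "n_unique": [1, 40, 2, 120],
        "p_missing": [0.0, 95.0, 0.0, 3.5],       # percent of missing values
        "p_dom_excl_na": [100.0, 20.0, 99.9, 12.0],  # percent taken by the dominant value
    }
)

# Same three conditions as in the query above, joined with "or":
# a column is flagged as soon as any one of them holds.
flagged = col_info.query(
    "n_unique <= 1 or p_missing >= 90.00 or p_dom_excl_na >= 99.85"
)
print(flagged["column"].tolist())
# → ['const_flag', 'mostly_missing', 'rare_category']
```

Each of the first three toy columns trips exactly one condition, while `useful` passes all three checks and is kept.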

Next, the problematic columns will be dropped, and redundant (duplicate or highly correlated) features will be identified and removed in the same way as before pre-processing:

Code
cols_to_keep_2 = list(
    set(credits_train_transformed.columns) - set(problematic_columns_2.column)
)

pipeline_selection = Pipeline(
    steps=[
        ("column_selector_2", ColumnSelector(cols_to_keep_2)),
        ("drop_duplicate_features", DropDuplicateFeatures()),
        (
            "drop_corr_features",
            SmartCorrelatedSelection(selection_method="variance"),
        ),
    ]
)

pipeline_selection.fit(credits_train_transformed)
# Time: 2m 8.4s
Pipeline(steps=[('column_selector_2',
                 ColumnSelector(keep=['cnt_installment_mature_cum_min',
                                      'days_credit_update_max',
                                      'ORGANIZATION_TYPE_Trade_type_5',
                                      'REG_REGION_NOT_WORK_REGION',
                                      'missingindicator_days_last_due_1st_version_min',
                                      'FLOORSMIN_MEDI',
                                      'ORGANIZATION_TYPE_Business_Entity_Type_1',
                                      'missingindicator_amt_down_payment_max',
                                      'diff_percent_installme...
                                      'missingindicator_diff_days_installment_payment_sum_late_only',
                                      'missingindicator_cnt_drawings_pos_current_median',
                                      'amt_inst_min_regularity_min',
                                      'missingindicator_percent_installments_early',
                                      'missingindicator_amt_credit_sum_limit_sum',
                                      'n_reject_reason_limit', ...])),
                ('drop_duplicate_features', DropDuplicateFeatures()),
                ('drop_corr_features',
                 SmartCorrelatedSelection(selection_method='variance'))])
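
The keep-list above is built with a set difference, which compares column names exactly, including letter case. As a minimal sketch (toy data frame, all names hypothetical), one way to check that every flagged name actually occurs among the data frame's columns could look like this:

```python
import pandas as pd

# Toy stand-ins; all names here are hypothetical.
df = pd.DataFrame({"FLAG_A": [0, 1], "TYPE_Student": [0, 0], "amount": [1.0, 2.0]})
flagged_names = ["FLAG_A", "type_student"]  # second name differs only in letter case

# Flagged names that do not occur verbatim among the data frame's columns.
unmatched = sorted(set(flagged_names) - set(df.columns))

# Keep-list built by set difference: an unmatched name removes nothing.
cols_to_keep = sorted(set(df.columns) - set(flagged_names))

print(unmatched)     # → ['type_student']
print(cols_to_keep)  # → ['TYPE_Student', 'amount']
```

An empty `unmatched` list would indicate that all flagged columns were found and excluded from the keep-list.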
Code
credits_train_transformed_not_correlated_cols = pipeline_selection.transform(
    credits_train_transformed
).sort_index(axis=1)
Code
credits_train_transformed_not_correlated_cols.shape
(215257, 361)
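
The idea behind the two selection steps can be illustrated with plain pandas. The sketch below is a simplified stand-in, not the feature_engine implementation: it drops exact duplicate columns and then, within each group of strongly correlated columns, keeps the one with the largest variance. The toy data, the greedy grouping, and the correlation threshold of 0.8 are assumptions made for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame(
    {
        "a": a,
        "a_copy": a,  # exact duplicate of "a"
        "a_scaled": 3 * a + rng.normal(scale=0.1, size=200),  # correlated, larger variance
        "b": rng.normal(size=200),  # unrelated feature
    }
)

# Step 1: drop exact duplicate columns, keeping the first occurrence.
deduped = toy.loc[:, ~toy.T.duplicated()]

# Step 2: greedy grouping by absolute Pearson correlation;
# from each group, keep the column with the highest variance.
threshold = 0.8  # assumed value for this sketch
corr = deduped.corr().abs()
variances = deduped.var()
remaining = list(deduped.columns)
selected = []
while remaining:
    first = remaining[0]
    group = [c for c in remaining if corr.loc[first, c] > threshold]
    selected.append(variances[group].idxmax())
    remaining = [c for c in remaining if c not in group]

print(selected)
# → ['a_scaled', 'b']
```

Note that selecting by variance depends on feature scale, so which column survives within a correlated group may change if the features are rescaled first.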
Code
credits_train_transformed_not_correlated_cols.head()
AMT_ANNUITY AMT_CREDIT AMT_INCOME_TOTAL AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_YEAR BASEMENTAREA_MODE CNT_FAM_MEMBERS COMMONAREA_MEDI DAYS_ID_PUBLISH DAYS_LAST_PHONE_CHANGE DAYS_REGISTRATION DEF_30_CNT_SOCIAL_CIRCLE ELEVATORS_AVG ENTRANCES_MODE EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3 FLAG_CONT_MOBILE FLAG_DOCUMENT_11 FLAG_DOCUMENT_13 FLAG_DOCUMENT_14 FLAG_DOCUMENT_16 FLAG_DOCUMENT_18 FLAG_DOCUMENT_3 FLAG_DOCUMENT_5 FLAG_DOCUMENT_6 FLAG_DOCUMENT_8 FLAG_DOCUMENT_9 FLAG_EMAIL FLAG_EMP_PHONE FLAG_IS_EMERGENCY FLAG_OWN_CAR FLAG_OWN_REALTY FLAG_PHONE FLAG_WORK_PHONE FLOORSMAX_MEDI FLOORSMIN_MEDI FONDKAPREMONT_MODE_not_specified FONDKAPREMONT_MODE_org_spec_account FONDKAPREMONT_MODE_reg_oper_account FONDKAPREMONT_MODE_reg_oper_spec_account HOUSETYPE_MODE_nan HOUSETYPE_MODE_specific_housing HOUSETYPE_MODE_terraced_house LANDAREA_MEDI NAME_CONTRACT_TYPE_Cash_loans NAME_EDUCATION_TYPE_Academic_degree NAME_EDUCATION_TYPE_Incomplete_higher NAME_EDUCATION_TYPE_Lower_secondary NAME_HOUSING_TYPE_Co_op_apartment NAME_HOUSING_TYPE_House_apartment NAME_HOUSING_TYPE_Municipal_apartment NAME_HOUSING_TYPE_Office_apartment NAME_HOUSING_TYPE_Rented_apartment NAME_HOUSING_TYPE_With_parents NAME_INCOME_TYPE_Businessman NAME_INCOME_TYPE_Commercial_associate NAME_INCOME_TYPE_Maternity_leave NAME_INCOME_TYPE_State_servant NAME_INCOME_TYPE_Student NAME_INCOME_TYPE_Unemployed NAME_INCOME_TYPE_Working NAME_TYPE_SUITE_Children NAME_TYPE_SUITE_Family NAME_TYPE_SUITE_Group_of_people NAME_TYPE_SUITE_Other_A NAME_TYPE_SUITE_Other_B NAME_TYPE_SUITE_Spouse_partner NAME_TYPE_SUITE_Unaccompanied NAME_TYPE_SUITE_nan NONLIVINGAPARTMENTS_AVG NONLIVINGAREA_MODE OBS_30_CNT_SOCIAL_CIRCLE OCCUPATION_TYPE_Accountants OCCUPATION_TYPE_Cleaning_staff OCCUPATION_TYPE_Cooking_staff OCCUPATION_TYPE_Core_staff OCCUPATION_TYPE_Drivers OCCUPATION_TYPE_HR_staff OCCUPATION_TYPE_High_skill_tech_staff 
OCCUPATION_TYPE_IT_staff OCCUPATION_TYPE_Laborers OCCUPATION_TYPE_Low_skill_Laborers OCCUPATION_TYPE_Managers OCCUPATION_TYPE_Medicine_staff OCCUPATION_TYPE_Private_service_staff OCCUPATION_TYPE_Realty_agents OCCUPATION_TYPE_Sales_staff OCCUPATION_TYPE_Secretaries OCCUPATION_TYPE_Security_staff OCCUPATION_TYPE_Waiters_barmen_staff OCCUPATION_TYPE_nan ORGANIZATION_TYPE_Advertising ORGANIZATION_TYPE_Agriculture ORGANIZATION_TYPE_Bank ORGANIZATION_TYPE_Business_Entity_Type_1 ORGANIZATION_TYPE_Business_Entity_Type_2 ORGANIZATION_TYPE_Business_Entity_Type_3 ORGANIZATION_TYPE_Cleaning ORGANIZATION_TYPE_Construction ORGANIZATION_TYPE_Culture ORGANIZATION_TYPE_Electricity ORGANIZATION_TYPE_Emergency ORGANIZATION_TYPE_Government ORGANIZATION_TYPE_Hotel ORGANIZATION_TYPE_Housing ORGANIZATION_TYPE_Industry_type_1 ORGANIZATION_TYPE_Industry_type_10 ORGANIZATION_TYPE_Industry_type_11 ORGANIZATION_TYPE_Industry_type_12 ORGANIZATION_TYPE_Industry_type_13 ORGANIZATION_TYPE_Industry_type_2 ORGANIZATION_TYPE_Industry_type_3 ORGANIZATION_TYPE_Industry_type_4 ORGANIZATION_TYPE_Industry_type_5 ORGANIZATION_TYPE_Industry_type_6 ORGANIZATION_TYPE_Industry_type_7 ORGANIZATION_TYPE_Industry_type_8 ORGANIZATION_TYPE_Industry_type_9 ORGANIZATION_TYPE_Insurance ORGANIZATION_TYPE_Kindergarten ORGANIZATION_TYPE_Legal_Services ORGANIZATION_TYPE_Medicine ORGANIZATION_TYPE_Military ORGANIZATION_TYPE_Mobile ORGANIZATION_TYPE_Other ORGANIZATION_TYPE_Police ORGANIZATION_TYPE_Postal ORGANIZATION_TYPE_Realtor ORGANIZATION_TYPE_Religion ORGANIZATION_TYPE_Restaurant ORGANIZATION_TYPE_School ORGANIZATION_TYPE_Security ORGANIZATION_TYPE_Security_Ministries ORGANIZATION_TYPE_Self_employed ORGANIZATION_TYPE_Services ORGANIZATION_TYPE_Telecom ORGANIZATION_TYPE_Trade_type_1 ORGANIZATION_TYPE_Trade_type_2 ORGANIZATION_TYPE_Trade_type_3 ORGANIZATION_TYPE_Trade_type_4 ORGANIZATION_TYPE_Trade_type_5 ORGANIZATION_TYPE_Trade_type_6 ORGANIZATION_TYPE_Trade_type_7 ORGANIZATION_TYPE_Transport_type_1 
ORGANIZATION_TYPE_Transport_type_2 ... amt_payment_current_range amt_payment_total_current_min any_installments_late_30 any_installments_late_60 any_installments_late_7 bureau_dpd_status_max bureau_dpd_status_median bureau_months_balance_max cnt_credit_prolong_mean cnt_credit_prolong_sum cnt_drawings_atm_current_max cnt_drawings_current_min cnt_drawings_other_current_max cnt_drawings_pos_current_max cnt_drawings_pos_current_median cnt_drawings_pos_current_min cnt_fam_members_excluding_children cnt_installment_future_min cnt_installment_mature_cum_max cnt_installment_mature_cum_min cnt_installment_median cnt_installment_min cnt_installment_range cnt_installments_diff_min cnt_installments_diff_range cnt_payment_median cnt_payment_min cnt_payment_range days_credit_enddate_max days_credit_enddate_min days_credit_max days_credit_median days_credit_overdue_max days_credit_overdue_mean days_credit_range days_credit_std days_credit_update_max days_credit_update_median days_credit_update_range days_decision_max days_decision_median days_decision_range days_enddate_fact_max days_enddate_fact_median days_enddate_fact_range days_first_draw_min days_last_due_1st_version_max days_last_due_1st_version_mean days_last_due_1st_version_median days_last_due_1st_version_min days_last_due_max days_termination_median days_termination_min diff_amt_installment_payment_max diff_amt_installment_payment_mean diff_amt_installment_payment_median diff_amt_installment_payment_range diff_days_installment_payment_max diff_days_installment_payment_mean diff_days_installment_payment_median diff_days_installment_payment_range diff_days_installment_payment_sum diff_days_installment_payment_sum_late_only diff_percent_installment_payment_mean diff_percent_installment_payment_median diff_percent_installment_payment_min diff_percent_installment_payment_range missingindicator_DEF_30_CNT_SOCIAL_CIRCLE missingindicator_EXT_SOURCE_1 missingindicator_EXT_SOURCE_2 missingindicator_EXT_SOURCE_3 
missingindicator_YEARS_BUILD_AVG missingindicator_amt_credit_max_overdue_max missingindicator_amt_credit_sum_debt_mean missingindicator_amt_credit_sum_limit_min missingindicator_amt_credit_sum_limit_std missingindicator_amt_down_payment_max missingindicator_bureau_months_balance_max missingindicator_cnt_installment_range missingindicator_days_credit_enddate_std missingindicator_days_enddate_fact_range mode_credit_type_Car_loan mode_credit_type_Consumer_credit mode_credit_type_Credit_card mode_credit_type_Microloan mode_credit_type_Mortgage mode_credit_type_Other n_car_loans n_channel_type_ap_minus n_channel_type_channel_corporate_sales n_channel_type_contact_center n_channel_type_countrywide n_channel_type_regional_and_local n_channel_type_stone n_client_type_new n_client_type_refreshed n_client_type_repeater n_consumer_loans n_contract_status_refused n_contract_status_unused_offer n_contracts_credit_card_completed n_credit_card_credits n_credits_active n_credits_sold n_credits_total n_currency_2 n_different_channels n_different_contract_types n_different_credit_types n_different_currencies n_installments_late n_installments_late_30 n_installments_late_7 n_installments_total n_microloans n_mortgages n_nflag_insured_on_approval_mean n_nflag_insured_on_approval_sum n_other_type_credit n_payment_type_cash_through_bank n_payment_type_not_available n_previous_credit_card_applications n_previous_credit_card_applications_signed n_previous_pos_applications n_previous_pos_applications_completed n_previous_pos_applications_signed n_product_type_walk_in n_reject_reason_limit n_reject_reason_scoc n_reject_reason_scofr n_revolving_loans n_yield_group_high n_yield_group_low_action n_yield_group_low_normal n_yield_group_middle ord_education_type percent_installments_early percent_installments_late percent_installments_late_30 percent_installments_late_60 percent_installments_late_7 rate_down_payment_max rate_down_payment_range rate_interest_privileged_count sk_dpd_credit_card_max 
sk_dpd_credit_card_median sk_dpd_def_credit_card_max sk_dpd_def_pos_applications_max sk_dpd_pos_applications_max years_employed
0 68643.00 1971072.00 405000.00 0.00 0.00 0.00 0.00 0.00 0.00 0.10 4.00 0.02 -1823.00 -2169.00 -7460.00 0.00 0.00 0.24 0.68 0.33 0.64 1.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 1.00 1.00 0.00 0.00 0.17 0.21 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.03 4.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ... 63000.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 3.00 0.00 0.00 1.00 0.00 0.00 2.00 0.00 7.00 0.00 12.00 12.00 0.00 0.00 12.00 12.00 12.00 0.00 934.00 -746.00 -145.00 -1001.50 0.00 0.00 1094.00 489.28 -7.00 -189.50 734.00 -2169.00 -2169.00 0.00 -362.00 -554.00 384.00 365243.00 -1808.00 -1808.00 -1808.00 -1808.00 -1808.00 -1805.00 -1805.00 0.00 0.00 0.00 0.00 25.00 12.75 13.50 24.00 153.00 0.00 1.00 1.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 1.00 0.00 0.00 1.00 0.00 0.00 1.00 0.00 0.00 0.00 1.00 2.00 0.00 4.00 0.00 1.00 1.00 3.00 1.00 0.00 0.00 0.00 12.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 21.00 0.00 13.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 3.00 1.00 0.00 0.00 0.00 0.00 0.10 0.00 0.00 0.00 0.00 0.00 0.00 0.00 2.82
1 38146.50 508495.50 337500.00 0.00 0.00 0.00 0.00 0.00 6.00 0.07 2.00 0.02 -1090.00 -659.00 -4054.00 1.00 0.00 0.14 0.51 0.62 0.44 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 1.00 0.00 0.00 1.00 0.00 0.00 0.17 0.21 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.05 1.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 2.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ... 63000.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 3.00 0.00 0.00 1.00 0.00 0.00 2.00 0.00 0.00 0.00 12.00 11.00 13.00 0.00 11.00 12.00 0.00 24.00 911.00 -1267.00 -300.00 -957.00 0.00 0.00 1262.00 621.29 -19.00 -360.00 904.00 -330.00 -361.00 329.00 -345.00 -872.50 821.00 365243.00 365243.00 121778.00 61.00 30.00 365243.00 365243.00 -325.00 0.00 0.00 0.00 0.00 7.00 3.24 3.00 6.00 68.00 0.00 1.00 1.00 1.00 0.00 0.00 1.00 0.00 0.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.00 1.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 4.00 0.00 0.00 0.00 0.00 1.00 2.00 0.00 4.00 0.00 1.00 2.00 2.00 1.00 0.00 0.00 0.00 21.00 0.00 0.00 0.67 2.00 0.00 2.00 3.00 11.00 0.00 22.00 1.00 0.00 1.00 0.00 0.00 0.00 2.00 0.00 0.00 1.00 1.00 3.00 1.00 0.00 0.00 0.00 0.00 0.11 0.00 0.00 0.00 0.00 0.00 0.00 0.00 3.31
2 13068.00 110146.50 112500.00 0.00 0.00 0.00 0.00 0.00 1.00 0.07 3.00 0.02 -4130.00 -172.00 -5554.00 0.00 0.00 0.14 0.36 0.65 0.54 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 1.00 1.00 0.00 0.00 1.00 1.00 0.00 0.17 0.21 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.05 1.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ... 63000.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 3.00 0.00 0.00 1.00 0.00 0.00 2.00 0.00 7.00 0.00 8.00 2.00 58.00 0.00 10.00 10.00 4.00 56.00 911.00 -1267.00 -300.00 -957.00 0.00 0.00 1262.00 621.29 -19.00 -360.00 904.00 -121.00 -172.00 2606.00 -345.00 -872.50 821.00 365243.00 1628.00 -301.50 -204.00 -2426.00 -112.00 -229.00 -2420.00 0.00 -15000.00 0.00 285159.69 27.00 16.19 16.00 23.00 340.00 0.00 0.95 1.00 0.09 0.91 0.00 0.00 0.00 1.00 1.00 1.00 1.00 1.00 1.00 0.00 1.00 0.00 1.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 2.00 0.00 1.00 1.00 5.00 3.00 1.00 0.00 0.00 1.00 2.00 0.00 4.00 0.00 3.00 2.00 2.00 1.00 0.00 0.00 0.00 21.00 0.00 0.00 0.75 3.00 0.00 5.00 2.00 21.00 0.00 26.00 3.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 1.00 3.00 1.00 1.00 1.00 0.00 0.00 0.00 0.00 0.47 0.47 0.00 0.00 0.00 0.00 0.00 0.00 1.62
3 3519.00 66384.00 40500.00 0.00 0.00 1.00 0.00 0.00 2.00 0.07 4.00 0.02 -5290.00 -1576.00 -5285.00 0.00 0.00 0.14 0.39 0.60 0.45 1.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 1.00 0.00 0.00 0.17 0.21 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.05 1.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ... 63000.00 0.00 0.00 0.00 1.00 1.00 0.00 0.00 0.00 0.00 3.00 0.00 0.00 1.00 0.00 0.00 2.00 0.00 7.00 0.00 24.00 6.00 18.00 0.00 24.00 15.00 6.00 24.00 30905.00 -679.00 -325.00 -545.00 0.00 0.00 1020.00 398.50 -14.00 -20.00 629.00 -575.00 -1190.00 2293.00 -518.00 -583.50 131.00 365243.00 -84.00 -1387.67 -1392.00 -2687.00 -84.00 -1388.00 -2683.00 9004.50 243.42 0.00 9004.50 20.00 4.76 5.00 30.00 176.00 -11.00 121.18 1.00 1.00 4446.67 0.00 0.00 0.00 0.00 1.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 2.00 0.00 0.00 0.00 1.00 3.00 2.00 1.00 0.00 0.00 1.00 3.00 0.00 5.00 0.00 3.00 2.00 2.00 1.00 2.00 0.00 1.00 37.00 0.00 0.00 0.67 2.00 0.00 2.00 2.00 21.00 0.00 38.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 1.00 2.00 1.00 0.78 0.05 0.00 0.00 0.03 0.10 0.10 0.00 0.00 0.00 0.00 0.00 0.00 14.73
4 31801.50 298512.00 225000.00 0.00 0.00 0.00 0.00 0.00 0.00 0.14 2.00 0.10 -3033.00 -624.00 -86.00 0.00 0.40 0.17 0.74 0.66 0.72 1.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 1.00 0.00 0.00 0.00 0.46 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 3.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ... 63000.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 3.00 0.00 0.00 1.00 0.00 0.00 2.00 0.00 7.00 0.00 10.00 5.00 5.00 0.00 5.00 10.00 10.00 0.00 703.00 -2526.00 -965.00 -1106.00 0.00 0.00 1896.00 1056.31 -50.00 -696.00 2445.00 -624.00 -624.00 0.00 -723.00 -1612.00 1778.00 365243.00 -323.00 -323.00 -323.00 -323.00 -473.00 -467.00 -467.00 0.00 0.00 0.00 0.00 14.00 6.40 4.00 14.00 32.00 0.00 1.00 1.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 1.00 0.00 0.00 1.00 0.00 0.00 0.00 1.00 1.00 0.00 3.00 0.00 1.00 1.00 2.00 1.00 0.00 0.00 0.00 5.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 21.00 0.00 6.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 1.00 0.80 0.00 0.00 0.00 0.00 0.11 0.00 0.00 0.00 0.00 0.00 0.00 0.00 3.27

5 rows × 361 columns
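
The column summary in Table 6.5 is produced by `an.col_info`, a helper from the project's own module whose implementation is not shown in this report. As a rough illustration of what the reported fields mean, the dependency-free sketch below computes them for a single column. The function names `is_missing` and `column_summary` are hypothetical, and treating `None` and NaN as missing is an assumption. The percentages in the table appear consistent with dividing counts by the total number of rows, which is what the sketch does.

```python
from collections import Counter
import math


def is_missing(value):
    """Treat None and float NaN as missing (an assumption of this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def column_summary(values):
    """Summarize one column; field names loosely mirror those of Table 6.5."""
    n_rows = len(values)
    if n_rows == 0:
        raise ValueError("column must contain at least one row")

    present = [v for v in values if not is_missing(v)]
    n_missing = n_rows - len(present)
    counts = Counter(present)

    # most_common(1) yields [(value, count)]; an all-missing column has no dominant value
    dominant, n_dominant = counts.most_common(1)[0] if counts else (None, 0)

    return {
        "n_unique": len(counts),            # distinct non-missing values
        "p_unique": len(counts) / n_rows,   # share of distinct values among all rows
        "n_missing": n_missing,
        "p_missing": n_missing / n_rows,
        "n_dominant": n_dominant,           # frequency of the most common value
        "p_dominant": n_dominant / n_rows,
        # same share, but computed over non-missing rows only
        "p_dom_excl_na": n_dominant / len(present) if present else float("nan"),
        "dominant": dominant,
    }


summary = column_summary([0.0, 0.0, 1.0, None, 0.0])
print(summary)
```

With no missing values left after imputation, `p_dominant` and `p_dom_excl_na` coincide, which matches the identical pairs of percentages in the table. A high `p_dominant` (for example, above 99%) marks a near-constant column.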

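# Summarize each remaining column (data type, uniqueness, missingness, dominant value)
credits_train_transformed_not_correlated_col_info = an.col_info(
    credits_train_transformed_not_correlated_cols
)

# Display the summary as a styled table
credits_train_transformed_not_correlated_col_info.pipe(an.style_col_info)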
Table 6.5. Info on the final set of columns after preprocessing.
  column data_type memory_size n_unique p_unique n_missing p_missing n_dominant p_dominant p_dom_excl_na dominant
1 AMT_ANNUITY float64 1.7 MB 12,801 5.9% 0 0% 4,499 2.1% 2.1% 9000.0
2 AMT_CREDIT float64 1.7 MB 5,097 2.4% 0 0% 6,823 3.2% 3.2% 450000.0
3 AMT_INCOME_TOTAL float64 1.7 MB 1,949 0.9% 0 0% 24,982 11.6% 11.6% 135000.0
4 AMT_REQ_CREDIT_BUREAU_DAY float64 1.7 MB 9 <0.1% 0 0% 214,228 99.5% 99.5% 0.0
5 AMT_REQ_CREDIT_BUREAU_HOUR float64 1.7 MB 5 <0.1% 0 0% 214,142 99.5% 99.5% 0.0
6 AMT_REQ_CREDIT_BUREAU_MON float64 1.7 MB 22 <0.1% 0 0% 184,760 85.8% 85.8% 0.0
7 AMT_REQ_CREDIT_BUREAU_QRT float64 1.7 MB 10 <0.1% 0 0% 179,976 83.6% 83.6% 0.0
8 AMT_REQ_CREDIT_BUREAU_WEEK float64 1.7 MB 9 <0.1% 0 0% 209,327 97.2% 97.2% 0.0
9 AMT_REQ_CREDIT_BUREAU_YEAR float64 1.7 MB 24 <0.1% 0 0% 73,441 34.1% 34.1% 1.0
10 BASEMENTAREA_MODE float64 1.7 MB 3,687 1.7% 0 0% 125,860 58.5% 58.5% 0.07460000365972519
11 CNT_FAM_MEMBERS float64 1.7 MB 12 <0.1% 0 0% 110,672 51.4% 51.4% 2.0
12 COMMONAREA_MEDI float64 1.7 MB 2,982 1.4% 0 0% 150,382 69.9% 69.9% 0.020899999886751175
13 DAYS_ID_PUBLISH float64 1.7 MB 6,122 2.8% 0 0% 119 0.1% 0.1% -4074.0
14 DAYS_LAST_PHONE_CHANGE float64 1.7 MB 3,720 1.7% 0 0% 26,201 12.2% 12.2% 0.0
15 DAYS_REGISTRATION float64 1.7 MB 15,249 7.1% 0 0% 79 <0.1% <0.1% -7.0
16 DEF_30_CNT_SOCIAL_CIRCLE float64 1.7 MB 10 <0.1% 0 0% 190,702 88.6% 88.6% 0.0
17 ELEVATORS_AVG float64 1.7 MB 241 0.1% 0 0% 174,679 81.1% 81.1% 0.0
18 ENTRANCES_MODE float64 1.7 MB 30 <0.1% 0 0% 133,580 62.1% 62.1% 0.1378999948501587
19 EXT_SOURCE_1 float64 1.7 MB 83,962 39.0% 0 0% 121,373 56.4% 56.4% 0.5052886605262756
20 EXT_SOURCE_2 float64 1.7 MB 102,229 47.5% 0 0% 503 0.2% 0.2% 0.2858978807926178
21 EXT_SOURCE_3 float64 1.7 MB 804 0.4% 0 0% 43,202 20.1% 20.1% 0.5352762341499329
22 FLAG_CONT_MOBILE float64 1.7 MB 2 <0.1% 0 0% 214,855 99.8% 99.8% 1.0
23 FLAG_DOCUMENT_11 float64 1.7 MB 2 <0.1% 0 0% 214,448 99.6% 99.6% 0.0
24 FLAG_DOCUMENT_13 float64 1.7 MB 2 <0.1% 0 0% 214,541 99.7% 99.7% 0.0
25 FLAG_DOCUMENT_14 float64 1.7 MB 2 <0.1% 0 0% 214,614 99.7% 99.7% 0.0
26 FLAG_DOCUMENT_16 float64 1.7 MB 2 <0.1% 0 0% 213,089 99.0% 99.0% 0.0
27 FLAG_DOCUMENT_18 float64 1.7 MB 2 <0.1% 0 0% 213,525 99.2% 99.2% 0.0
28 FLAG_DOCUMENT_3 float64 1.7 MB 2 <0.1% 0 0% 152,845 71.0% 71.0% 1.0
29 FLAG_DOCUMENT_5 float64 1.7 MB 2 <0.1% 0 0% 212,025 98.5% 98.5% 0.0
30 FLAG_DOCUMENT_6 float64 1.7 MB 2 <0.1% 0 0% 196,348 91.2% 91.2% 0.0
31 FLAG_DOCUMENT_8 float64 1.7 MB 2 <0.1% 0 0% 197,689 91.8% 91.8% 0.0
32 FLAG_DOCUMENT_9 float64 1.7 MB 2 <0.1% 0 0% 214,440 99.6% 99.6% 0.0
33 FLAG_EMAIL float64 1.7 MB 2 <0.1% 0 0% 203,006 94.3% 94.3% 0.0
34 FLAG_EMP_PHONE float64 1.7 MB 2 <0.1% 0 0% 176,491 82.0% 82.0% 1.0
35 FLAG_IS_EMERGENCY float64 1.7 MB 2 <0.1% 0 0% 213,628 99.2% 99.2% 0.0
36 FLAG_OWN_CAR float64 1.7 MB 2 <0.1% 0 0% 142,086 66.0% 66.0% 0.0
37 FLAG_OWN_REALTY float64 1.7 MB 2 <0.1% 0 0% 149,412 69.4% 69.4% 1.0
38 FLAG_PHONE float64 1.7 MB 2 <0.1% 0 0% 154,906 72.0% 72.0% 0.0
39 FLAG_WORK_PHONE float64 1.7 MB 2 <0.1% 0 0% 172,406 80.1% 80.1% 0.0
40 FLOORSMAX_MEDI float64 1.7 MB 49 <0.1% 0 0% 151,629 70.4% 70.4% 0.16670000553131104
41 FLOORSMIN_MEDI float64 1.7 MB 47 <0.1% 0 0% 169,787 78.9% 78.9% 0.20829999446868896
42 FONDKAPREMONT_MODE_not_specified float64 1.7 MB 2 <0.1% 0 0% 211,294 98.2% 98.2% 0.0
43 FONDKAPREMONT_MODE_org_spec_account float64 1.7 MB 2 <0.1% 0 0% 211,329 98.2% 98.2% 0.0
44 FONDKAPREMONT_MODE_reg_oper_account float64 1.7 MB 2 <0.1% 0 0% 163,472 75.9% 75.9% 0.0
45 FONDKAPREMONT_MODE_reg_oper_spec_account float64 1.7 MB 2 <0.1% 0 0% 206,775 96.1% 96.1% 0.0
46 HOUSETYPE_MODE_nan float64 1.7 MB 2 <0.1% 0 0% 107,834 50.1% 50.1% 1.0
47 HOUSETYPE_MODE_specific_housing float64 1.7 MB 2 <0.1% 0 0% 214,216 99.5% 99.5% 0.0
48 HOUSETYPE_MODE_terraced_house float64 1.7 MB 2 <0.1% 0 0% 214,390 99.6% 99.6% 0.0
49 LANDAREA_MEDI float64 1.7 MB 3,393 1.6% 0 0% 127,718 59.3% 59.3% 0.048700001090765
50 NAME_CONTRACT_TYPE_Cash_loans float64 1.7 MB 2 <0.1% 0 0% 194,675 90.4% 90.4% 1.0
51 NAME_EDUCATION_TYPE_Academic_degree float64 1.7 MB 2 <0.1% 0 0% 215,153 >99.9% >99.9% 0.0
52 NAME_EDUCATION_TYPE_Incomplete_higher float64 1.7 MB 2 <0.1% 0 0% 208,006 96.6% 96.6% 0.0
53 NAME_EDUCATION_TYPE_Lower_secondary float64 1.7 MB 2 <0.1% 0 0% 212,602 98.8% 98.8% 0.0
54 NAME_HOUSING_TYPE_Co_op_apartment float64 1.7 MB 2 <0.1% 0 0% 214,466 99.6% 99.6% 0.0
55 NAME_HOUSING_TYPE_House_apartment float64 1.7 MB 2 <0.1% 0 0% 191,159 88.8% 88.8% 1.0
56 NAME_HOUSING_TYPE_Municipal_apartment float64 1.7 MB 2 <0.1% 0 0% 207,454 96.4% 96.4% 0.0
57 NAME_HOUSING_TYPE_Office_apartment float64 1.7 MB 2 <0.1% 0 0% 213,440 99.2% 99.2% 0.0
58 NAME_HOUSING_TYPE_Rented_apartment float64 1.7 MB 2 <0.1% 0 0% 211,900 98.4% 98.4% 0.0
59 NAME_HOUSING_TYPE_With_parents float64 1.7 MB 2 <0.1% 0 0% 204,927 95.2% 95.2% 0.0
60 NAME_INCOME_TYPE_Businessman float64 1.7 MB 2 <0.1% 0 0% 215,248 >99.9% >99.9% 0.0
61 NAME_INCOME_TYPE_Commercial_associate float64 1.7 MB 2 <0.1% 0 0% 165,151 76.7% 76.7% 0.0
62 NAME_INCOME_TYPE_Maternity_leave float64 1.7 MB 2 <0.1% 0 0% 215,254 >99.9% >99.9% 0.0
63 NAME_INCOME_TYPE_State_servant float64 1.7 MB 2 <0.1% 0 0% 199,875 92.9% 92.9% 0.0
64 NAME_INCOME_TYPE_Student float64 1.7 MB 2 <0.1% 0 0% 215,248 >99.9% >99.9% 0.0
65 NAME_INCOME_TYPE_Unemployed float64 1.7 MB 2 <0.1% 0 0% 215,241 >99.9% >99.9% 0.0
66 NAME_INCOME_TYPE_Working float64 1.7 MB 2 <0.1% 0 0% 110,984 51.6% 51.6% 1.0
67 NAME_TYPE_SUITE_Children float64 1.7 MB 2 <0.1% 0 0% 212,930 98.9% 98.9% 0.0
68 NAME_TYPE_SUITE_Family float64 1.7 MB 2 <0.1% 0 0% 187,256 87.0% 87.0% 0.0
69 NAME_TYPE_SUITE_Group_of_people float64 1.7 MB 2 <0.1% 0 0% 215,059 99.9% 99.9% 0.0
70 NAME_TYPE_SUITE_Other_A float64 1.7 MB 2 <0.1% 0 0% 214,643 99.7% 99.7% 0.0
71 NAME_TYPE_SUITE_Other_B float64 1.7 MB 2 <0.1% 0 0% 214,009 99.4% 99.4% 0.0
72 NAME_TYPE_SUITE_Spouse_partner float64 1.7 MB 2 <0.1% 0 0% 207,378 96.3% 96.3% 0.0
73 NAME_TYPE_SUITE_Unaccompanied float64 1.7 MB 2 <0.1% 0 0% 174,089 80.9% 80.9% 1.0
74 NAME_TYPE_SUITE_nan float64 1.7 MB 2 <0.1% 0 0% 214,356 99.6% 99.6% 0.0
75 NONLIVINGAPARTMENTS_AVG float64 1.7 MB 345 0.2% 0 0% 187,673 87.2% 87.2% 0.0
76 NONLIVINGAREA_MODE float64 1.7 MB 3,090 1.4% 0 0% 118,905 55.2% 55.2% 0.0010999999940395355
77 OBS_30_CNT_SOCIAL_CIRCLE float64 1.7 MB 32 <0.1% 0 0% 115,264 53.5% 53.5% 0.0
78 OCCUPATION_TYPE_Accountants float64 1.7 MB 2 <0.1% 0 0% 208,415 96.8% 96.8% 0.0
79 OCCUPATION_TYPE_Cleaning_staff float64 1.7 MB 2 <0.1% 0 0% 211,947 98.5% 98.5% 0.0
80 OCCUPATION_TYPE_Cooking_staff float64 1.7 MB 2 <0.1% 0 0% 211,079 98.1% 98.1% 0.0
81 OCCUPATION_TYPE_Core_staff float64 1.7 MB 2 <0.1% 0 0% 195,912 91.0% 91.0% 0.0
82 OCCUPATION_TYPE_Drivers float64 1.7 MB 2 <0.1% 0 0% 202,169 93.9% 93.9% 0.0
83 OCCUPATION_TYPE_HR_staff float64 1.7 MB 2 <0.1% 0 0% 214,884 99.8% 99.8% 0.0
84 OCCUPATION_TYPE_High_skill_tech_staff float64 1.7 MB 2 <0.1% 0 0% 207,280 96.3% 96.3% 0.0
85 OCCUPATION_TYPE_IT_staff float64 1.7 MB 2 <0.1% 0 0% 214,896 99.8% 99.8% 0.0
86 OCCUPATION_TYPE_Laborers float64 1.7 MB 2 <0.1% 0 0% 176,666 82.1% 82.1% 0.0
87 OCCUPATION_TYPE_Low_skill_Laborers float64 1.7 MB 2 <0.1% 0 0% 213,777 99.3% 99.3% 0.0
88 OCCUPATION_TYPE_Managers float64 1.7 MB 2 <0.1% 0 0% 200,272 93.0% 93.0% 0.0
89 OCCUPATION_TYPE_Medicine_staff float64 1.7 MB 2 <0.1% 0 0% 209,207 97.2% 97.2% 0.0
90 OCCUPATION_TYPE_Private_service_staff float64 1.7 MB 2 <0.1% 0 0% 213,406 99.1% 99.1% 0.0
91 OCCUPATION_TYPE_Realty_agents float64 1.7 MB 2 <0.1% 0 0% 214,733 99.8% 99.8% 0.0
92 OCCUPATION_TYPE_Sales_staff float64 1.7 MB 2 <0.1% 0 0% 192,972 89.6% 89.6% 0.0
93 OCCUPATION_TYPE_Secretaries float64 1.7 MB 2 <0.1% 0 0% 214,342 99.6% 99.6% 0.0
94 OCCUPATION_TYPE_Security_staff float64 1.7 MB 2 <0.1% 0 0% 210,559 97.8% 97.8% 0.0
95 OCCUPATION_TYPE_Waiters_barmen_staff float64 1.7 MB 2 <0.1% 0 0% 214,333 99.6% 99.6% 0.0
96 OCCUPATION_TYPE_nan float64 1.7 MB 2 <0.1% 0 0% 147,777 68.7% 68.7% 0.0
97 ORGANIZATION_TYPE_Advertising float64 1.7 MB 2 <0.1% 0 0% 214,968 99.9% 99.9% 0.0
98 ORGANIZATION_TYPE_Agriculture float64 1.7 MB 2 <0.1% 0 0% 213,527 99.2% 99.2% 0.0
99 ORGANIZATION_TYPE_Bank float64 1.7 MB 2 <0.1% 0 0% 213,522 99.2% 99.2% 0.0
100 ORGANIZATION_TYPE_Business_Entity_Type_1 float64 1.7 MB 2 <0.1% 0 0% 211,043 98.0% 98.0% 0.0
101 ORGANIZATION_TYPE_Business_Entity_Type_2 float64 1.7 MB 2 <0.1% 0 0% 207,883 96.6% 96.6% 0.0
102 ORGANIZATION_TYPE_Business_Entity_Type_3 float64 1.7 MB 2 <0.1% 0 0% 167,675 77.9% 77.9% 0.0
103 ORGANIZATION_TYPE_Cleaning float64 1.7 MB 2 <0.1% 0 0% 215,062 99.9% 99.9% 0.0
104 ORGANIZATION_TYPE_Construction float64 1.7 MB 2 <0.1% 0 0% 210,553 97.8% 97.8% 0.0
105 ORGANIZATION_TYPE_Culture float64 1.7 MB 2 <0.1% 0 0% 214,988 99.9% 99.9% 0.0
106 ORGANIZATION_TYPE_Electricity float64 1.7 MB 2 <0.1% 0 0% 214,583 99.7% 99.7% 0.0
107 ORGANIZATION_TYPE_Emergency float64 1.7 MB 2 <0.1% 0 0% 214,862 99.8% 99.8% 0.0
108 ORGANIZATION_TYPE_Government float64 1.7 MB 2 <0.1% 0 0% 207,933 96.6% 96.6% 0.0
109 ORGANIZATION_TYPE_Hotel float64 1.7 MB 2 <0.1% 0 0% 214,571 99.7% 99.7% 0.0
110 ORGANIZATION_TYPE_Housing float64 1.7 MB 2 <0.1% 0 0% 213,202 99.0% 99.0% 0.0
111 ORGANIZATION_TYPE_Industry_type_1 float64 1.7 MB 2 <0.1% 0 0% 214,520 99.7% 99.7% 0.0
112 ORGANIZATION_TYPE_Industry_type_10 float64 1.7 MB 2 <0.1% 0 0% 215,182 >99.9% >99.9% 0.0
113 ORGANIZATION_TYPE_Industry_type_11 float64 1.7 MB 2 <0.1% 0 0% 213,369 99.1% 99.1% 0.0
114 ORGANIZATION_TYPE_Industry_type_12 float64 1.7 MB 2 <0.1% 0 0% 214,999 99.9% 99.9% 0.0
115 ORGANIZATION_TYPE_Industry_type_13 float64 1.7 MB 2 <0.1% 0 0% 215,211 >99.9% >99.9% 0.0
116 ORGANIZATION_TYPE_Industry_type_2 float64 1.7 MB 2 <0.1% 0 0% 214,931 99.8% 99.8% 0.0
117 ORGANIZATION_TYPE_Industry_type_3 float64 1.7 MB 2 <0.1% 0 0% 212,965 98.9% 98.9% 0.0
118 ORGANIZATION_TYPE_Industry_type_4 float64 1.7 MB 2 <0.1% 0 0% 214,624 99.7% 99.7% 0.0
119 ORGANIZATION_TYPE_Industry_type_5 float64 1.7 MB 2 <0.1% 0 0% 214,864 99.8% 99.8% 0.0
120 ORGANIZATION_TYPE_Industry_type_6 float64 1.7 MB 2 <0.1% 0 0% 215,180 >99.9% >99.9% 0.0
121 ORGANIZATION_TYPE_Industry_type_7 float64 1.7 MB 2 <0.1% 0 0% 214,354 99.6% 99.6% 0.0
122 ORGANIZATION_TYPE_Industry_type_8 float64 1.7 MB 2 <0.1% 0 0% 215,240 >99.9% >99.9% 0.0
123 ORGANIZATION_TYPE_Industry_type_9 float64 1.7 MB 2 <0.1% 0 0% 212,861 98.9% 98.9% 0.0
124 ORGANIZATION_TYPE_Insurance float64 1.7 MB 2 <0.1% 0 0% 214,842 99.8% 99.8% 0.0
125 ORGANIZATION_TYPE_Kindergarten float64 1.7 MB 2 <0.1% 0 0% 210,366 97.7% 97.7% 0.0
126 ORGANIZATION_TYPE_Legal_Services float64 1.7 MB 2 <0.1% 0 0% 215,039 99.9% 99.9% 0.0
127 ORGANIZATION_TYPE_Medicine float64 1.7 MB 2 <0.1% 0 0% 207,340 96.3% 96.3% 0.0
128 ORGANIZATION_TYPE_Military float64 1.7 MB 2 <0.1% 0 0% 213,400 99.1% 99.1% 0.0
129 ORGANIZATION_TYPE_Mobile float64 1.7 MB 2 <0.1% 0 0% 215,046 99.9% 99.9% 0.0
130 ORGANIZATION_TYPE_Other float64 1.7 MB 2 <0.1% 0 0% 203,595 94.6% 94.6% 0.0
131 ORGANIZATION_TYPE_Police float64 1.7 MB 2 <0.1% 0 0% 213,649 99.3% 99.3% 0.0
132 ORGANIZATION_TYPE_Postal float64 1.7 MB 2 <0.1% 0 0% 213,737 99.3% 99.3% 0.0
133 ORGANIZATION_TYPE_Realtor float64 1.7 MB 2 <0.1% 0 0% 214,978 99.9% 99.9% 0.0
134 ORGANIZATION_TYPE_Religion float64 1.7 MB 2 <0.1% 0 0% 215,198 >99.9% >99.9% 0.0
135 ORGANIZATION_TYPE_Restaurant float64 1.7 MB 2 <0.1% 0 0% 213,972 99.4% 99.4% 0.0
136 ORGANIZATION_TYPE_School float64 1.7 MB 2 <0.1% 0 0% 208,961 97.1% 97.1% 0.0
137 ORGANIZATION_TYPE_Security float64 1.7 MB 2 <0.1% 0 0% 212,955 98.9% 98.9% 0.0
138 ORGANIZATION_TYPE_Security_Ministries float64 1.7 MB 2 <0.1% 0 0% 213,854 99.3% 99.3% 0.0
139 ORGANIZATION_TYPE_Self_employed float64 1.7 MB 2 <0.1% 0 0% 188,576 87.6% 87.6% 0.0
140 ORGANIZATION_TYPE_Services float64 1.7 MB 2 <0.1% 0 0% 214,168 99.5% 99.5% 0.0
141 ORGANIZATION_TYPE_Telecom float64 1.7 MB 2 <0.1% 0 0% 214,861 99.8% 99.8% 0.0
142 ORGANIZATION_TYPE_Trade_type_1 float64 1.7 MB 2 <0.1% 0 0% 215,020 99.9% 99.9% 0.0
143 ORGANIZATION_TYPE_Trade_type_2 float64 1.7 MB 2 <0.1% 0 0% 213,919 99.4% 99.4% 0.0
144 ORGANIZATION_TYPE_Trade_type_3 float64 1.7 MB 2 <0.1% 0 0% 212,832 98.9% 98.9% 0.0
145 ORGANIZATION_TYPE_Trade_type_4 float64 1.7 MB 2 <0.1% 0 0% 215,212 >99.9% >99.9% 0.0
146 ORGANIZATION_TYPE_Trade_type_5 float64 1.7 MB 2 <0.1% 0 0% 215,223 >99.9% >99.9% 0.0
147 ORGANIZATION_TYPE_Trade_type_6 float64 1.7 MB 2 <0.1% 0 0% 214,832 99.8% 99.8% 0.0
148 ORGANIZATION_TYPE_Trade_type_7 float64 1.7 MB 2 <0.1% 0 0% 209,807 97.5% 97.5% 0.0
149 ORGANIZATION_TYPE_Transport_type_1 float64 1.7 MB 2 <0.1% 0 0% 215,112 99.9% 99.9% 0.0
150 ORGANIZATION_TYPE_Transport_type_2 float64 1.7 MB 2 <0.1% 0 0% 213,728 99.3% 99.3% 0.0
151 ORGANIZATION_TYPE_Transport_type_3 float64 1.7 MB 2 <0.1% 0 0% 214,406 99.6% 99.6% 0.0
152 ORGANIZATION_TYPE_Transport_type_4 float64 1.7 MB 2 <0.1% 0 0% 211,508 98.3% 98.3% 0.0
153 ORGANIZATION_TYPE_University float64 1.7 MB 2 <0.1% 0 0% 214,340 99.6% 99.6% 0.0
154 OWN_CAR_AGE float64 1.7 MB 61 <0.1% 0 0% 145,584 67.6% 67.6% 9.0
155 REGION_POPULATION_RELATIVE float64 1.7 MB 81 <0.1% 0 0% 11,494 5.3% 5.3% 0.03579200059175491
156 REGION_RATING_CLIENT float64 1.7 MB 3 <0.1% 0 0% 158,846 73.8% 73.8% 2.0
157 REG_CITY_NOT_LIVE_CITY float64 1.7 MB 2 <0.1% 0 0% 198,549 92.2% 92.2% 0.0
158 REG_CITY_NOT_WORK_CITY float64 1.7 MB 2 <0.1% 0 0% 165,697 77.0% 77.0% 0.0
159 REG_REGION_NOT_LIVE_REGION float64 1.7 MB 2 <0.1% 0 0% 211,999 98.5% 98.5% 0.0
160 REG_REGION_NOT_WORK_REGION float64 1.7 MB 2 <0.1% 0 0% 204,222 94.9% 94.9% 0.0
161 WALLSMATERIAL_MODE_Block float64 1.7 MB 2 <0.1% 0 0% 208,728 97.0% 97.0% 0.0
162 WALLSMATERIAL_MODE_Mixed float64 1.7 MB 2 <0.1% 0 0% 213,683 99.3% 99.3% 0.0
163 WALLSMATERIAL_MODE_Monolithic float64 1.7 MB 2 <0.1% 0 0% 214,008 99.4% 99.4% 0.0
164 WALLSMATERIAL_MODE_Others float64 1.7 MB 2 <0.1% 0 0% 214,119 99.5% 99.5% 0.0
165 WALLSMATERIAL_MODE_Panel float64 1.7 MB 2 <0.1% 0 0% 168,959 78.5% 78.5% 0.0
166 WALLSMATERIAL_MODE_Stone_brick float64 1.7 MB 2 <0.1% 0 0% 169,849 78.9% 78.9% 0.0
167 WALLSMATERIAL_MODE_Wooden float64 1.7 MB 2 <0.1% 0 0% 211,525 98.3% 98.3% 0.0
168 YEARS_BEGINEXPLUATATION_MODE float64 1.7 MB 210 0.1% 0 0% 107,681 50.0% 50.0% 0.9815999865531921
169 YEARS_BUILD_AVG float64 1.7 MB 146 0.1% 0 0% 144,837 67.3% 67.3% 0.7552000284194946
170 amt_annuity_max float64 1.7 MB 18,638 8.7% 0 0% 159,516 74.1% 74.1% 12500.01
171 amt_annuity_median float64 1.7 MB 16,441 7.6% 0 0% 159,485 74.1% 74.1% 3942.0
172 amt_annuity_median_previous_application float64 1.7 MB 157,063 73.0% 0 0% 11,753 5.5% 5.5% 10773.157500000001
173 amt_annuity_min float64 1.7 MB 9,921 4.6% 0 0% 196,455 91.3% 91.3% 0.0
174 amt_annuity_min_previous_application float64 1.7 MB 113,816 52.9% 0 0% 16,017 7.4% 7.4% 2250.0
175 amt_annuity_to_credit_ratio float64 1.7 MB 33,148 15.4% 0 0% 20,564 9.6% 9.6% 0.05000000074505806
176 amt_annuity_to_income_per_family_member float64 1.7 MB 88,172 41.0% 0 0% 1,500 0.7% 0.7% 0.3
177 amt_annuity_to_income_ratio float64 1.7 MB 71,916 33.4% 0 0% 2,049 1.0% 1.0% 0.1
178 amt_balance_credit_card_max float64 1.7 MB 40,175 18.7% 0 0% 154,159 71.6% 71.6% 97790.49
179 amt_balance_credit_card_median float64 1.7 MB 27,685 12.9% 0 0% 187,185 87.0% 87.0% 0.0
180 amt_balance_credit_card_min float64 1.7 MB 8,310 3.9% 0 0% 206,302 95.8% 95.8% 0.0
181 amt_credit_limit_actual_median float64 1.7 MB 151 0.1% 0 0% 155,593 72.3% 72.3% 157500.0
182 amt_credit_limit_actual_range float64 1.7 MB 147 0.1% 0 0% 157,689 73.3% 73.3% 45000.0
183 amt_credit_max float64 1.7 MB 49,618 23.1% 0 0% 14,581 6.8% 6.8% 225000.0
184 amt_credit_max_overdue_max float64 1.7 MB 32,871 15.3% 0 0% 166,187 77.2% 77.2% 0.0
185 amt_credit_max_overdue_range float64 1.7 MB 27,267 12.7% 0 0% 175,595 81.6% 81.6% 0.0
186 amt_credit_median float64 1.7 MB 73,966 34.4% 0 0% 11,457 5.3% 5.3% 83054.25
187 amt_credit_min float64 1.7 MB 33,220 15.4% 0 0% 79,660 37.0% 37.0% 0.0
188 amt_credit_range float64 1.7 MB 71,950 33.4% 0 0% 37,038 17.2% 17.2% 0.0
189 amt_credit_sum_debt_mean float64 1.7 MB 121,544 56.5% 0 0% 48,543 22.6% 22.6% 0.0
190 amt_credit_sum_debt_sum float64 1.7 MB 113,811 52.9% 0 0% 53,746 25.0% 25.0% 0.0
191 amt_credit_sum_limit_min float64 1.7 MB 2,121 1.0% 0 0% 212,794 98.9% 98.9% 0.0
192 amt_credit_sum_limit_sum float64 1.7 MB 26,367 12.2% 0 0% 181,184 84.2% 84.2% 0.0
193 amt_credit_sum_median float64 1.7 MB 77,800 36.1% 0 0% 30,841 14.3% 14.3% 133852.5
194 amt_credit_sum_overdue_sum float64 1.7 MB 930 0.4% 0 0% 212,926 98.9% 98.9% 0.0
195 amt_credit_sum_std float64 1.7 MB 148,440 69.0% 0 0% 55,965 26.0% 26.0% 183202.88926385253
196 amt_credit_sum_sum float64 1.7 MB 147,742 68.6% 0 0% 30,837 14.3% 14.3% 964161.0
197 amt_credit_to_income_ratio float64 1.7 MB 39,372 18.3% 0 0% 3,691 1.7% 1.7% 2.0
198 amt_down_payment_max float64 1.7 MB 17,608 8.2% 0 0% 53,725 25.0% 25.0% 0.0
199 amt_drawings_atm_current_max float64 1.7 MB 1,131 0.5% 0 0% 175,102 81.3% 81.3% 90000.0
200 amt_drawings_atm_current_median float64 1.7 MB 378 0.2% 0 0% 208,835 97.0% 97.0% 0.0
201 amt_drawings_atm_current_min float64 1.7 MB 114 0.1% 0 0% 214,655 99.7% 99.7% 0.0
202 amt_drawings_current_mean float64 1.7 MB 35,095 16.3% 0 0% 154,159 71.6% 71.6% 3498.702077922078
203 amt_drawings_current_min float64 1.7 MB 1,475 0.7% 0 0% 213,422 99.1% 99.1% 0.0
204 amt_drawings_other_current_max float64 1.7 MB 1,084 0.5% 0 0% 211,253 98.1% 98.1% 0.0
205 amt_drawings_pos_current_max float64 1.7 MB 20,726 9.6% 0 0% 172,260 80.0% 80.0% 6300.0
206 amt_drawings_pos_current_mean float64 1.7 MB 23,516 10.9% 0 0% 172,255 80.0% 80.0% 303.42857142857144
207 amt_drawings_pos_current_min float64 1.7 MB 1,772 0.8% 0 0% 213,337 99.1% 99.1% 0.0
208 amt_goods_price_min float64 1.7 MB 39,171 18.2% 0 0% 12,169 5.7% 5.7% 45735.75
209 amt_inst_min_regularity_min float64 1.7 MB 1,664 0.8% 0 0% 211,946 98.5% 98.5% 0.0
210 amt_payment_current_median float64 1.7 MB 17,066 7.9% 0 0% 172,523 80.1% 80.1% 5850.0
211 amt_payment_current_min float64 1.7 MB 7,398 3.4% 0 0% 199,261 92.6% 92.6% 0.0
212 amt_payment_current_range float64 1.7 MB 22,545 10.5% 0 0% 172,454 80.1% 80.1% 63000.0
213 amt_payment_total_current_min float64 1.7 MB 1,131 0.5% 0 0% 213,443 99.2% 99.2% 0.0
214 any_installments_late_30 float64 1.7 MB 2 <0.1% 0 0% 201,997 93.8% 93.8% 0.0
215 any_installments_late_60 float64 1.7 MB 2 <0.1% 0 0% 209,180 97.2% 97.2% 0.0
216 any_installments_late_7 float64 1.7 MB 2 <0.1% 0 0% 158,592 73.7% 73.7% 0.0
217 bureau_dpd_status_max float64 1.7 MB 6 <0.1% 0 0% 193,628 90.0% 90.0% 0.0
218 bureau_dpd_status_median float64 1.7 MB 11 <0.1% 0 0% 214,312 99.6% 99.6% 0.0
219 bureau_months_balance_max float64 1.7 MB 89 <0.1% 0 0% 212,281 98.6% 98.6% 0.0
220 cnt_credit_prolong_mean float64 1.7 MB 100 <0.1% 0 0% 209,248 97.2% 97.2% 0.0
221 cnt_credit_prolong_sum float64 1.7 MB 10 <0.1% 0 0% 209,248 97.2% 97.2% 0.0
222 cnt_drawings_atm_current_max float64 1.7 MB 43 <0.1% 0 0% 178,554 82.9% 82.9% 3.0
223 cnt_drawings_current_min float64 1.7 MB 39 <0.1% 0 0% 213,436 99.2% 99.2% 0.0
224 cnt_drawings_other_current_max float64 1.7 MB 11 <0.1% 0 0% 211,241 98.1% 98.1% 0.0
225 cnt_drawings_pos_current_max float64 1.7 MB 116 0.1% 0 0% 176,434 82.0% 82.0% 1.0
226 cnt_drawings_pos_current_median float64 1.7 MB 113 0.1% 0 0% 205,975 95.7% 95.7% 0.0
227 cnt_drawings_pos_current_min float64 1.7 MB 40 <0.1% 0 0% 213,337 99.1% 99.1% 0.0
228 cnt_fam_members_excluding_children float64 1.7 MB 2 <0.1% 0 0% 158,302 73.5% 73.5% 2.0
229 cnt_installment_future_min float64 1.7 MB 61 <0.1% 0 0% 196,054 91.1% 91.1% 0.0
230 cnt_installment_mature_cum_max float64 1.7 MB 120 0.1% 0 0% 156,329 72.6% 72.6% 7.0
231 cnt_installment_mature_cum_min float64 1.7 MB 28 <0.1% 0 0% 193,011 89.7% 89.7% 0.0
232 cnt_installment_median float64 1.7 MB 103 <0.1% 0 0% 73,750 34.3% 34.3% 12.0
233 cnt_installment_min float64 1.7 MB 53 <0.1% 0 0% 54,950 25.5% 25.5% 6.0
234 cnt_installment_range float64 1.7 MB 69 <0.1% 0 0% 49,692 23.1% 23.1% 0.0
235 cnt_installments_diff_min float64 1.7 MB 58 <0.1% 0 0% 210,671 97.9% 97.9% 0.0
236 cnt_installments_diff_range float64 1.7 MB 82 <0.1% 0 0% 48,330 22.5% 22.5% 12.0
237 cnt_payment_median float64 1.7 MB 87 <0.1% 0 0% 65,750 30.5% 30.5% 12.0
238 cnt_payment_min float64 1.7 MB 31 <0.1% 0 0% 68,588 31.9% 31.9% 0.0
239 cnt_payment_range float64 1.7 MB 69 <0.1% 0 0% 54,639 25.4% 25.4% 0.0
240 days_credit_enddate_max float64 1.7 MB 12,274 5.7% 0 0% 32,491 15.1% 15.1% 911.0
241 days_credit_enddate_min float64 1.7 MB 6,266 2.9% 0 0% 32,492 15.1% 15.1% -1267.0
242 days_credit_max float64 1.7 MB 2,922 1.4% 0 0% 31,067 14.4% 14.4% -300.0
243 days_credit_median float64 1.7 MB 5,711 2.7% 0 0% 30,932 14.4% 14.4% -957.0
244 days_credit_overdue_max float64 1.7 MB 671 0.3% 0 0% 212,892 98.9% 98.9% 0.0
245 days_credit_overdue_mean float64 1.7 MB 1,195 0.6% 0 0% 212,892 98.9% 98.9% 0.0
246 days_credit_range float64 1.7 MB 2,913 1.4% 0 0% 30,890 14.4% 14.4% 1262.0
247 days_credit_std float64 1.7 MB 133,053 61.8% 0 0% 55,965 26.0% 26.0% 621.2873840332031
248 days_credit_update_max float64 1.7 MB 2,585 1.2% 0 0% 34,359 16.0% 16.0% -19.0
249 days_credit_update_median float64 1.7 MB 4,779 2.2% 0 0% 30,948 14.4% 14.4% -360.0
250 days_credit_update_range float64 1.7 MB 2,925 1.4% 0 0% 30,911 14.4% 14.4% 904.0
251 days_decision_max float64 1.7 MB 2,921 1.4% 0 0% 11,697 5.4% 5.4% -299.0
252 days_decision_median float64 1.7 MB 5,656 2.6% 0 0% 11,546 5.4% 5.4% -647.0
253 days_decision_range float64 1.7 MB 2,919 1.4% 0 0% 40,565 18.8% 18.8% 0.0
254 days_enddate_fact_max float64 1.7 MB 2,793 1.3% 0 0% 54,020 25.1% 25.1% -345.0
255 days_enddate_fact_median float64 1.7 MB 5,341 2.5% 0 0% 53,910 25.0% 25.0% -872.5
256 days_enddate_fact_range float64 1.7 MB 2,796 1.3% 0 0% 53,924 25.1% 25.1% 821.0
257 days_first_draw_min float64 1.7 MB 2,718 1.3% 0 0% 177,781 82.6% 82.6% 365243.0
258 days_last_due_1st_version_max float64 1.7 MB 4,521 2.1% 0 0% 55,263 25.7% 25.7% 365243.0
259 days_last_due_1st_version_mean float64 1.7 MB 51,499 23.9% 0 0% 12,398 5.8% 5.8% -207.5
260 days_last_due_1st_version_median float64 1.7 MB 10,719 5.0% 0 0% 12,497 5.8% 5.8% -325.0
261 days_last_due_1st_version_min float64 1.7 MB 4,081 1.9% 0 0% 12,430 5.8% 5.8% -1089.0
262 days_last_due_max float64 1.7 MB 2,761 1.3% 0 0% 98,527 45.8% 45.8% 365243.0
263 days_termination_median float64 1.7 MB 7,716 3.6% 0 0% 23,269 10.8% 10.8% 365243.0
264 days_termination_min float64 1.7 MB 2,797 1.3% 0 0% 15,833 7.4% 7.4% 365243.0
265 diff_amt_installment_payment_max float64 1.7 MB 75,445 35.0% 0 0% 127,555 59.3% 59.3% 0.0
266 diff_amt_installment_payment_mean float64 1.7 MB 97,257 45.2% 0 0% 114,097 53.0% 53.0% 0.0
267 diff_amt_installment_payment_median float64 1.7 MB 6,855 3.2% 0 0% 206,997 96.2% 96.2% 0.0
268 diff_amt_installment_payment_range float64 1.7 MB 90,195 41.9% 0 0% 114,099 53.0% 53.0% 0.0
269 diff_days_installment_payment_max float64 1.7 MB 409 0.2% 0 0% 18,396 8.5% 8.5% 31.0
270 diff_days_installment_payment_mean float64 1.7 MB 50,247 23.3% 0 0% 11,037 5.1% 5.1% 9.524199962615967
271 diff_days_installment_payment_median float64 1.7 MB 320 0.1% 0 0% 21,620 10.0% 10.0% 0.0
272 diff_days_installment_payment_range float64 1.7 MB 1,465 0.7% 0 0% 14,802 6.9% 6.9% 37.0
273 diff_days_installment_payment_sum float64 1.7 MB 4,383 2.0% 0 0% 11,369 5.3% 5.3% 240.0
274 diff_days_installment_payment_sum_late_only float64 1.7 MB 1,815 0.8% 0 0% 95,670 44.4% 44.4% 0.0
275 diff_percent_installment_payment_mean float64 1.7 MB 87,934 40.9% 0 0% 114,228 53.1% 53.1% 1.0
276 diff_percent_installment_payment_median float64 1.7 MB 7,969 3.7% 0 0% 206,997 96.2% 96.2% 1.0
277 diff_percent_installment_payment_min float64 1.7 MB 25,589 11.9% 0 0% 189,010 87.8% 87.8% 1.0
278 diff_percent_installment_payment_range float64 1.7 MB 97,055 45.1% 0 0% 114,227 53.1% 53.1% 0.0
279 missingindicator_DEF_30_CNT_SOCIAL_CIRCLE float64 1.7 MB 2 <0.1% 0 0% 214,543 99.7% 99.7% 0.0
280 missingindicator_EXT_SOURCE_1 float64 1.7 MB 2 <0.1% 0 0% 121,373 56.4% 56.4% 1.0
281 missingindicator_EXT_SOURCE_2 float64 1.7 MB 2 <0.1% 0 0% 214,793 99.8% 99.8% 0.0
282 missingindicator_EXT_SOURCE_3 float64 1.7 MB 2 <0.1% 0 0% 172,577 80.2% 80.2% 0.0
283 missingindicator_YEARS_BUILD_AVG float64 1.7 MB 2 <0.1% 0 0% 143,036 66.4% 66.4% 1.0
284 missingindicator_amt_credit_max_overdue_max float64 1.7 MB 2 <0.1% 0 0% 128,619 59.8% 59.8% 0.0
285 missingindicator_amt_credit_sum_debt_mean float64 1.7 MB 2 <0.1% 0 0% 179,218 83.3% 83.3% 0.0
286 missingindicator_amt_credit_sum_limit_min float64 1.7 MB 2 <0.1% 0 0% 169,672 78.8% 78.8% 0.0
287 missingindicator_amt_credit_sum_limit_std float64 1.7 MB 2 <0.1% 0 0% 134,361 62.4% 62.4% 0.0
288 missingindicator_amt_down_payment_max float64 1.7 MB 2 <0.1% 0 0% 191,554 89.0% 89.0% 0.0
289 missingindicator_bureau_months_balance_max float64 1.7 MB 2 <0.1% 0 0% 152,586 70.9% 70.9% 1.0
290 missingindicator_cnt_installment_range float64 1.7 MB 2 <0.1% 0 0% 202,669 94.2% 94.2% 0.0
291 missingindicator_days_credit_enddate_std float64 1.7 MB 2 <0.1% 0 0% 156,060 72.5% 72.5% 0.0
292 missingindicator_days_enddate_fact_range float64 1.7 MB 2 <0.1% 0 0% 161,387 75.0% 75.0% 0.0
293 mode_credit_type_Car_loan float64 1.7 MB 2 <0.1% 0 0% 212,121 98.5% 98.5% 0.0
294 mode_credit_type_Consumer_credit float64 1.7 MB 2 <0.1% 0 0% 160,802 74.7% 74.7% 1.0
295 mode_credit_type_Credit_card float64 1.7 MB 2 <0.1% 0 0% 196,123 91.1% 91.1% 0.0
296 mode_credit_type_Microloan float64 1.7 MB 2 <0.1% 0 0% 214,789 99.8% 99.8% 0.0
297 mode_credit_type_Mortgage float64 1.7 MB 2 <0.1% 0 0% 214,478 99.6% 99.6% 0.0
298 mode_credit_type_Other float64 1.7 MB 2 <0.1% 0 0% 215,155 >99.9% >99.9% 0.0
299 n_car_loans float64 1.7 MB 9 <0.1% 0 0% 201,519 93.6% 93.6% 0.0
300 n_channel_type_ap_minus float64 1.7 MB 33 <0.1% 0 0% 199,207 92.5% 92.5% 0.0
301 n_channel_type_channel_corporate_sales float64 1.7 MB 20 <0.1% 0 0% 213,745 99.3% 99.3% 0.0
302 n_channel_type_contact_center float64 1.7 MB 19 <0.1% 0 0% 187,077 86.9% 86.9% 0.0
303 n_channel_type_countrywide float64 1.7 MB 34 <0.1% 0 0% 78,922 36.7% 36.7% 1.0
304 n_channel_type_regional_and_local float64 1.7 MB 19 <0.1% 0 0% 169,784 78.9% 78.9% 0.0
305 n_channel_type_stone float64 1.7 MB 22 <0.1% 0 0% 133,139 61.9% 61.9% 0.0
306 n_client_type_new float64 1.7 MB 14 <0.1% 0 0% 165,520 76.9% 76.9% 1.0
307 n_client_type_refreshed float64 1.7 MB 23 <0.1% 0 0% 161,564 75.1% 75.1% 0.0
308 n_client_type_repeater float64 1.7 MB 61 <0.1% 0 0% 49,122 22.8% 22.8% 0.0
309 n_consumer_loans float64 1.7 MB 36 <0.1% 0 0% 78,331 36.4% 36.4% 1.0
310 n_contract_status_refused float64 1.7 MB 44 <0.1% 0 0% 144,850 67.3% 67.3% 0.0
311 n_contract_status_unused_offer float64 1.7 MB 11 <0.1% 0 0% 202,009 93.8% 93.8% 0.0
312 n_contracts_credit_card_completed float64 1.7 MB 40 <0.1% 0 0% 207,783 96.5% 96.5% 0.0
313 n_credit_card_credits float64 1.7 MB 22 <0.1% 0 0% 91,194 42.4% 42.4% 1.0
314 n_credits_active float64 1.7 MB 22 <0.1% 0 0% 71,863 33.4% 33.4% 2.0
315 n_credits_sold float64 1.7 MB 7 <0.1% 0 0% 211,547 98.3% 98.3% 0.0
316 n_credits_total float64 1.7 MB 57 <0.1% 0 0% 51,153 23.8% 23.8% 4.0
317 n_currency_2 float64 1.7 MB 7 <0.1% 0 0% 214,671 99.7% 99.7% 0.0
318 n_different_channels float64 1.7 MB 7 <0.1% 0 0% 90,541 42.1% 42.1% 2.0
319 n_different_contract_types float64 1.7 MB 4 <0.1% 0 0% 89,430 41.5% 41.5% 2.0
320 n_different_credit_types float64 1.7 MB 5 <0.1% 0 0% 131,569 61.1% 61.1% 2.0
321 n_different_currencies float64 1.7 MB 3 <0.1% 0 0% 214,601 99.7% 99.7% 1.0
322 n_installments_late float64 1.7 MB 99 <0.1% 0 0% 95,670 44.4% 44.4% 0.0
323 n_installments_late_30 float64 1.7 MB 42 <0.1% 0 0% 201,997 93.8% 93.8% 0.0
324 n_installments_late_7 float64 1.7 MB 59 <0.1% 0 0% 158,592 73.7% 73.7% 0.0
325 n_installments_total float64 1.7 MB 310 0.1% 0 0% 14,007 6.5% 6.5% 25.0
326 n_microloans float64 1.7 MB 28 <0.1% 0 0% 212,811 98.9% 98.9% 0.0
327 n_mortgages float64 1.7 MB 7 <0.1% 0 0% 205,270 95.4% 95.4% 0.0
328 n_nflag_insured_on_approval_mean float64 1.7 MB 102 <0.1% 0 0% 95,675 44.4% 44.4% 0.0
329 n_nflag_insured_on_approval_sum float64 1.7 MB 19 <0.1% 0 0% 96,596 44.9% 44.9% 0.0
330 n_other_type_credit float64 1.7 MB 9 <0.1% 0 0% 213,209 99.0% 99.0% 0.0
331 n_payment_type_cash_through_bank float64 1.7 MB 44 <0.1% 0 0% 54,943 25.5% 25.5% 1.0
332 n_payment_type_not_available float64 1.7 MB 46 <0.1% 0 0% 71,796 33.4% 33.4% 0.0
333 n_previous_credit_card_applications float64 1.7 MB 126 0.1% 0 0% 155,013 72.0% 72.0% 21.0
334 n_previous_credit_card_applications_signed float64 1.7 MB 37 <0.1% 0 0% 212,249 98.6% 98.6% 0.0
335 n_previous_pos_applications float64 1.7 MB 221 0.1% 0 0% 16,495 7.7% 7.7% 22.0
336 n_previous_pos_applications_completed float64 1.7 MB 45 <0.1% 0 0% 73,226 34.0% 34.0% 1.0
337 n_previous_pos_applications_signed float64 1.7 MB 31 <0.1% 0 0% 174,587 81.1% 81.1% 0.0
338 n_product_type_walk_in float64 1.7 MB 28 <0.1% 0 0% 164,239 76.3% 76.3% 0.0
339 n_reject_reason_limit float64 1.7 MB 22 <0.1% 0 0% 195,275 90.7% 90.7% 0.0
340 n_reject_reason_scoc float64 1.7 MB 20 <0.1% 0 0% 200,014 92.9% 92.9% 0.0
341 n_reject_reason_scofr float64 1.7 MB 16 <0.1% 0 0% 210,511 97.8% 97.8% 0.0
342 n_revolving_loans float64 1.7 MB 25 <0.1% 0 0% 142,248 66.1% 66.1% 0.0
343 n_yield_group_high float64 1.7 MB 30 <0.1% 0 0% 89,153 41.4% 41.4% 0.0
344 n_yield_group_low_action float64 1.7 MB 22 <0.1% 0 0% 174,871 81.2% 81.2% 0.0
345 n_yield_group_low_normal float64 1.7 MB 23 <0.1% 0 0% 94,724 44.0% 44.0% 0.0
346 n_yield_group_middle float64 1.7 MB 25 <0.1% 0 0% 80,132 37.2% 37.2% 1.0
347 ord_education_type float64 1.7 MB 5 <0.1% 0 0% 152,993 71.1% 71.1% 1.0
348 percent_installments_early float64 1.7 MB 7,892 3.7% 0 0% 64,688 30.1% 30.1% 1.0
349 percent_installments_late float64 1.7 MB 4,464 2.1% 0 0% 95,670 44.4% 44.4% 0.0
350 percent_installments_late_30 float64 1.7 MB 894 0.4% 0 0% 201,997 93.8% 93.8% 0.0
351 percent_installments_late_60 float64 1.7 MB 629 0.3% 0 0% 209,180 97.2% 97.2% 0.0
352 percent_installments_late_7 float64 1.7 MB 2,595 1.2% 0 0% 158,592 73.7% 73.7% 0.0
353 rate_down_payment_max float64 1.7 MB 84,884 39.4% 0 0% 53,725 25.0% 25.0% 0.0
354 rate_down_payment_range float64 1.7 MB 73,616 34.2% 0 0% 94,887 44.1% 44.1% 0.0
355 rate_interest_privileged_count float64 1.7 MB 4 <0.1% 0 0% 212,016 98.5% 98.5% 0.0
356 sk_dpd_credit_card_max float64 1.7 MB 353 0.2% 0 0% 202,632 94.1% 94.1% 0.0
357 sk_dpd_credit_card_median float64 1.7 MB 222 0.1% 0 0% 214,704 99.7% 99.7% 0.0
358 sk_dpd_def_credit_card_max float64 1.7 MB 47 <0.1% 0 0% 204,810 95.1% 95.1% 0.0
359 sk_dpd_def_pos_applications_max float64 1.7 MB 173 0.1% 0 0% 187,187 87.0% 87.0% 0.0
360 sk_dpd_pos_applications_max float64 1.7 MB 1,595 0.7% 0 0% 176,902 82.2% 82.2% 0.0
361 years_employed float64 1.7 MB 11,769 5.5% 0 0% 38,801 18.0% 18.0% 4.517808219178082
Code
# Save to file
file_path = dir_interim + "colnames--cols_to_keep_after_preprocessing.csv"
credits_train_transformed_not_correlated_col_info.column.to_csv(file_path, index=False)

# Load from file (to check)
cols_to_keep_after_preprocessing = pd.read_csv(file_path).column.tolist()
del file_path
Code
# Clean up a bit
del (
    credits_train,
    credits_train_transformed,
    credits_train_transformed_not_correlated_cols,
)

6.2 Train, Validation, and Test Sets

In this section, the training, validation, and test sets will be created by merging datasets and applying the pre-processing steps created in the previous sections. The results will be cached to avoid repeating the same steps in the future.

Code
file = dir_interim + "merged-selected--credit_train.feather"

if os.path.exists(file):
    credit_train = pd.read_feather(file)
else:
    credit_train = (
        merge_credit_history(to=application_train)
        .pipe(klib.convert_datatypes)
        .pipe(preprocess_credit_data)
        .loc[:, cols_to_include_in_preprocessing + ["TARGET"]]
    )
    credit_train.to_feather(file)

X_credit_train = credit_train.drop(columns=["TARGET"])
y_credit_train = credit_train["TARGET"]

del file
Code
X_credit_train.shape
(215257, 251)
Code
file = dir_interim + "merged-selected--credit_validation.feather"

if os.path.exists(file):
    credit_validation = pd.read_feather(file)
else:
    credit_validation = (
        merge_credit_history(to=application_validation)
        .pipe(klib.convert_datatypes)
        .pipe(preprocess_credit_data)
        .loc[:, cols_to_include_in_preprocessing + ["TARGET"]]
    )
    credit_validation.to_feather(file)

X_credit_validation = credit_validation.drop(columns=["TARGET"])
y_credit_validation = credit_validation["TARGET"]

del file
Code
X_credit_validation.shape
(46127, 251)
Code
file = dir_interim + "merged-selected--credit_test.feather"

if os.path.exists(file):
    credit_test = pd.read_feather(file)
else:
    credit_test = (
        merge_credit_history(to=application_test)
        .pipe(klib.convert_datatypes)
        .pipe(preprocess_credit_data)
        .loc[:, cols_to_include_in_preprocessing + ["TARGET"]]
    )
    credit_test.to_feather(file)

X_credit_test = credit_test.drop(columns=["TARGET"])
y_credit_test = credit_test["TARGET"]

del file
Code
X_credit_test.shape
(46127, 251)
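
The three blocks above repeat the same "load from cache, otherwise build and save" pattern. A minimal sketch of how that pattern could be factored into one helper is shown below. The name `load_or_build` and the stand-in `build_dataset` function are hypothetical and not part of the project code, and `pickle` is used here instead of Feather only to keep the example free of third-party dependencies.

```python
import os
import pickle
import tempfile


def load_or_build(path, build):
    """Return the object cached at `path`; otherwise build it, cache it, and return it."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    result = build()
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result


# Stand-in for an expensive merge + pre-processing step (hypothetical).
calls = []


def build_dataset():
    calls.append(1)  # record each time the expensive step actually runs
    return {"n_rows": 3, "columns": ["a", "b"]}


with tempfile.TemporaryDirectory() as tmp_dir:
    cache_file = os.path.join(tmp_dir, "dataset.pickle")
    first = load_or_build(cache_file, build_dataset)   # builds and writes the cache
    second = load_or_build(cache_file, build_dataset)  # reads the cached copy

print(first == second, len(calls))
```

In the notebook, the builder would be the `merge_credit_history(...)` chain and the reader and writer would be `pd.read_feather` and `DataFrame.to_feather`.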

7 Modeling (w/ Historical Data)

In this section, models based on application data and historical data will be trained and evaluated.

The steps are similar to those in the section Modeling (w/o Historical Data), so most of them are presented without further comment.

7.1 Train Full Model

Let’s start with the model that employs all 361 features that remain after the feature filtering step.

Code
if "models" not in locals():
    models = {}


@my.cache_results(dir_interim + "task-2-w-credit-history--01_lgbm.pickle")
def fit_lgbm_extended():
    """Fit a LGBM model."""
    pipeline = Pipeline(
        steps=[
            ("selector_1", ColumnSelector(cols_to_include_in_preprocessing)),
            ("preprocessor_2", clone(pre_processing)),
            ("selector_2", ColumnSelector(cols_to_keep_after_preprocessing)),
            ("classifier", clone(lgbm_classifier)),
        ]
    )
    pipeline.fit(X_credit_train, y_credit_train)
    return pipeline


# Time: (1-2 minutes)
models["LGBM (FULL | 361 feat.)"] = fit_lgbm_extended()
models["LGBM (FULL | 361 feat.)"]
Pipeline(steps=[('selector_1',
                 ColumnSelector(keep=['AMT_ANNUITY', 'AMT_CREDIT',
                                      'AMT_INCOME_TOTAL',
                                      'AMT_REQ_CREDIT_BUREAU_DAY',
                                      'AMT_REQ_CREDIT_BUREAU_HOUR',
                                      'AMT_REQ_CREDIT_BUREAU_MON',
                                      'AMT_REQ_CREDIT_BUREAU_QRT',
                                      'AMT_REQ_CREDIT_BUREAU_WEEK',
                                      'AMT_REQ_CREDIT_BUREAU_YEAR',
                                      'BASEMENTAREA_MODE', 'CNT_FAM_MEMBERS',
                                      'COMMONAREA_MEDI', 'DAYS_ID_PUBLISH',
                                      'DAYS_LA...
                                      'DEF_30_CNT_SOCIAL_CIRCLE',
                                      'ELEVATORS_AVG', 'ENTRANCES_MODE',
                                      'EXT_SOURCE_1', 'EXT_SOURCE_2',
                                      'EXT_SOURCE_3', 'FLAG_CONT_MOBILE',
                                      'FLAG_DOCUMENT_11', 'FLAG_DOCUMENT_13',
                                      'FLAG_DOCUMENT_14', 'FLAG_DOCUMENT_16',
                                      'FLAG_DOCUMENT_18', 'FLAG_DOCUMENT_3',
                                      'FLAG_DOCUMENT_5', 'FLAG_DOCUMENT_6', ...])),
                ('classifier',
                 LGBMClassifier(class_weight='balanced', device='gpu',
                                n_jobs=-1, random_state=1))])

7.2 Evaluate Models

The validation performance of the model that uses historical credit data is slightly better (ROC AUC = 0.778) than that of the best model that does not use historical data (ROC AUC = 0.759). However, the difference is very small (only 0.019).

Code
print("--- Train ---")

ml.classification_scores(
    models,
    X_credit_train,
    y_credit_train,
    color="orange",
    sort_by="ROC_AUC",
)
--- Train ---
  n No_info_rate Accuracy BAcc BAcc_01 F1 F1_neg TPR TNR PPV NPV ROC_AUC
LGBM (FULL | 361 feat.) 215257 0.919 0.737 0.748 0.495 0.318 0.837 0.760 0.735 0.201 0.972 0.827
Code
print("--- Validation ---")

ml.classification_scores(
    models,
    X_credit_validation,
    y_credit_validation,
    sort_by="ROC_AUC",
)
--- Validation ---
  n No_info_rate Accuracy BAcc BAcc_01 F1 F1_neg TPR TNR PPV NPV ROC_AUC
LGBM (FULL | 361 feat.) 46127 0.919 0.724 0.704 0.409 0.285 0.829 0.680 0.728 0.180 0.963 0.778
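
As a sanity check, several columns of the validation row can be approximately reproduced from TPR, TNR, and the class balance alone. The sketch below rests on two assumptions that this section does not state: that `BAcc_01` is computed as `2 * BAcc - 1`, and that `No_info_rate` is the share of the majority (negative) class. The inputs are the rounded values from the table, so the results match only to within rounding.

```python
def balanced_accuracy(tpr, tnr):
    """Balanced accuracy: mean of the true positive and true negative rates."""
    return (tpr + tnr) / 2


def normalized_balanced_accuracy(bacc):
    """Rescale balanced accuracy so that 0.5 maps to 0 and 1.0 maps to 1 (assumed formula)."""
    return 2 * bacc - 1


def predictive_values(tpr, tnr, prevalence):
    """Return (PPV, NPV) from the rates and the share of the positive class."""
    tp = tpr * prevalence
    fn = (1 - tpr) * prevalence
    tn = tnr * (1 - prevalence)
    fp = (1 - tnr) * (1 - prevalence)
    return tp / (tp + fp), tn / (tn + fn)


# Rounded validation-set values taken from the table above
tpr, tnr = 0.680, 0.728
prevalence = 1 - 0.919  # assumes No_info_rate is the negative-class share

bacc = balanced_accuracy(tpr, tnr)
bacc_01 = normalized_balanced_accuracy(bacc)
ppv, npv = predictive_values(tpr, tnr, prevalence)

print(f"BAcc={bacc:.3f}  BAcc_01={bacc_01:.3f}  PPV={ppv:.3f}  NPV={npv:.3f}")
```

The low PPV next to the high NPV follows directly from the class imbalance: with roughly 8% positives, even a moderate false positive rate produces many more false alarms than true detections.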
Code
sns.set_style("white")
y_pred_validation_lgbm = models["LGBM (FULL | 361 feat.)"].predict(X_credit_validation)
ml.plot_confusion_matrices(y_credit_validation, y_pred_validation_lgbm, figsize=(13, 4));

7.3 Feature Importance

Feature importance analysis revealed that the 6 most important features come from, or are derived from, the application table alone. Only the 7th most important feature (percent_installments_late) is based on historical data.

Note. Feature names in CAPITALS are the original features from the application table. Feature names in lowercase are derived or extracted features, based either on the original application table or on the credit history tables.

Find the details below.

Code
@my.cache_results(dir_interim + "task-2--shap_lgbm_k=all.pickle")
def get_shap_values_lgbm_extended():
    model = "LGBM (FULL | 361 feat.)"
    preproc = Pipeline(steps=models[model].steps[:-1])
    classifier = models[model]["classifier"]
    X_validation_preproc = preproc.transform(X_credit_validation)

    tree_explainer = shap.TreeExplainer(classifier)
    shap_values = tree_explainer.shap_values(X_validation_preproc)
    return shap_values, X_validation_preproc


shap_values_lgbm_ext, data_for_lgbm_ext = get_shap_values_lgbm_extended()
Code
# Mean |SHAP value| per feature: average over the classes, then over the samples
vals = np.abs(shap_values_lgbm_ext).mean(0).mean(0)
feature_importance_ext = (
    pd.DataFrame(
        list(zip(data_for_lgbm_ext.columns, vals)),
        columns=["col_name", "importance"],
    )
    .sort_values(by=["importance"], ascending=False)
    .reset_index()
)
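
The importance score used above is the mean absolute SHAP value per feature. The toy example below illustrates the double averaging on a small hand-made array. It assumes, as the indexing `shap_values_lgbm_ext[1]` in this notebook suggests, that the SHAP output is organised as classes × samples × features. The numbers and feature names are made up for illustration.

```python
import numpy as np

# Toy stand-in for SHAP output: 2 classes x 3 samples x 2 features.
# For a binary classifier the two class slices mirror each other in sign.
shap_values = np.array(
    [
        [[-0.2, 0.1], [0.4, -0.1], [-0.6, 0.0]],  # class 0
        [[0.2, -0.1], [-0.4, 0.1], [0.6, 0.0]],   # class 1
    ]
)
feature_names = ["feature_a", "feature_b"]  # hypothetical names

# Absolute values first, then average over classes (axis 0) and samples (next axis 0)
importance = np.abs(shap_values).mean(axis=0).mean(axis=0)

# Rank features from most to least important
ranking = sorted(
    zip(feature_names, importance), key=lambda pair: pair[1], reverse=True
)
for name, value in ranking:
    print(f"{name}: {value:.4f}")
```

Taking absolute values before averaging matters: positive and negative contributions would otherwise cancel, and a feature with a strong effect in both directions would appear unimportant.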
Code
sns.set_style("whitegrid")
lgb.plot_importance(
    models["LGBM (FULL | 361 feat.)"]["classifier"],
    max_num_features=50,
    figsize=(10, 10),
    height=0.8,
    title="LGBM Feature Importance",
);

Code
sns.set_style("whitegrid")
shap.summary_plot(
    shap_values_lgbm_ext[1],
    data_for_lgbm_ext,
    plot_type="bar",
    max_display=110,
    plot_size=(10, 15),
    show=False,
)
plt.tick_params(axis="y", labelsize=8)
plt.title("SHAP Feature Importance", fontsize=12)
plt.xlabel("Mean (|SHAP value|) \n(impact on model output)", fontsize=10);

Code
sns.set_style("whitegrid")
shap.summary_plot(
    shap_values_lgbm_ext[1],
    data_for_lgbm_ext,
    max_display=50,
    plot_size=(10, 9),
    show=False,
)
plt.title("SHAP Feature Importance", fontsize=12)
plt.tick_params(axis="y", labelsize=8)
plt.tick_params(axis="x", labelsize=8)

feature_importance_ext.style.format(precision=6)
Table 7.1. Mean absolute SHAP values of the features for the LGBM model, sorted by importance.
  index col_name importance
0 19 EXT_SOURCE_2 0.334729
1 20 EXT_SOURCE_3 0.283458
2 18 EXT_SOURCE_1 0.152720
3 174 amt_annuity_to_credit_ratio 0.109900
4 360 years_employed 0.104907
5 346 ord_education_type 0.078970
6 348 percent_installments_late 0.078802
7 0 AMT_ANNUITY 0.076940
8 188 amt_credit_sum_debt_mean 0.058354
9 309 n_contract_status_refused 0.056706
10 197 amt_down_payment_max 0.050801
11 236 cnt_payment_median 0.050325
12 334 n_previous_pos_applications 0.049682
13 171 amt_annuity_median_previous_application 0.049286
14 233 cnt_installment_range 0.045852
15 265 diff_amt_installment_payment_mean 0.044301
16 347 percent_installments_early 0.043852
17 153 OWN_CAR_AGE 0.041045
18 239 days_credit_enddate_max 0.040445
19 342 n_yield_group_high 0.040265
20 259 days_last_due_1st_version_median 0.039270
21 238 cnt_payment_range 0.039066
22 253 days_enddate_fact_max 0.039035
23 183 amt_credit_max_overdue_max 0.036707
24 343 n_yield_group_low_action 0.036122
25 227 cnt_fam_members_excluding_children 0.035529
26 257 days_last_due_1st_version_max 0.034028
27 49 NAME_CONTRACT_TYPE_Cash_loans 0.030754
28 241 days_credit_max 0.030463
29 65 NAME_INCOME_TYPE_Working 0.028907
30 15 DEF_30_CNT_SOCIAL_CIRCLE 0.028700
31 176 amt_annuity_to_income_ratio 0.028551
32 1 AMT_CREDIT 0.028165
33 27 FLAG_DOCUMENT_3 0.027166
34 286 missingindicator_amt_credit_sum_limit_std 0.026530
35 201 amt_drawings_current_mean 0.026340
36 242 days_credit_median 0.024028
37 85 OCCUPATION_TYPE_Laborers 0.023403
38 177 amt_balance_credit_card_max 0.023300
39 279 missingindicator_EXT_SOURCE_1 0.023174
40 12 DAYS_ID_PUBLISH 0.023152
41 155 REGION_RATING_CLIENT 0.023097
42 13 DAYS_LAST_PHONE_CHANGE 0.022974
43 261 days_last_due_max 0.022880
44 324 n_installments_total 0.022650
45 173 amt_annuity_min_previous_application 0.021348
46 14 DAYS_REGISTRATION 0.021207
47 344 n_yield_group_low_normal 0.019307
48 351 percent_installments_late_7 0.018375
49 6 AMT_REQ_CREDIT_BUREAU_QRT 0.018232
50 272 diff_days_installment_payment_sum 0.017944
51 252 days_decision_range 0.017601
52 235 cnt_installments_diff_range 0.017023
53 192 amt_credit_sum_median 0.016875
54 81 OCCUPATION_TYPE_Drivers 0.016874
55 156 REG_CITY_NOT_LIVE_CITY 0.015121
56 263 days_termination_min 0.015035
57 352 rate_down_payment_max 0.014277
58 274 diff_percent_installment_payment_mean 0.014275
59 101 ORGANIZATION_TYPE_Business_Entity_Type_3 0.013745
60 326 n_mortgages 0.013138
61 358 sk_dpd_def_pos_applications_max 0.012471
62 196 amt_credit_to_income_ratio 0.012435
63 35 FLAG_OWN_CAR 0.012376
64 313 n_credits_active 0.012074
65 138 ORGANIZATION_TYPE_Self_employed 0.011357
66 167 YEARS_BEGINEXPLUATATION_MODE 0.009927
67 195 amt_credit_sum_sum 0.009862
68 260 days_last_due_1st_version_min 0.009566
69 33 FLAG_EMP_PHONE 0.009426
70 80 OCCUPATION_TYPE_Core_staff 0.009391
71 255 days_enddate_fact_range 0.008932
72 194 amt_credit_sum_std 0.008921
73 248 days_credit_update_median 0.008893
74 178 amt_balance_credit_card_median 0.008372
75 164 WALLSMATERIAL_MODE_Panel 0.008108
76 283 missingindicator_amt_credit_max_overdue_max 0.008048
77 154 REGION_POPULATION_RELATIVE 0.007998
78 182 amt_credit_max 0.007632
79 175 amt_annuity_to_income_per_family_member 0.007558
80 337 n_product_type_walk_in 0.007203
81 273 diff_days_installment_payment_sum_late_only 0.007141
82 325 n_microloans 0.007115
83 221 cnt_drawings_atm_current_max 0.007105
84 207 amt_goods_price_min 0.006913
85 103 ORGANIZATION_TYPE_Construction 0.006912
86 191 amt_credit_sum_limit_sum 0.006807
87 38 FLAG_WORK_PHONE 0.006806
88 335 n_previous_pos_applications_completed 0.006764
89 288 missingindicator_bureau_months_balance_max 0.006737
90 232 cnt_installment_min 0.006655
91 251 days_decision_median 0.006493
92 37 FLAG_PHONE 0.006251
93 209 amt_payment_current_median 0.006012
94 39 FLOORSMAX_MEDI 0.006009
95 290 missingindicator_days_credit_enddate_std 0.005797
96 77 OCCUPATION_TYPE_Accountants 0.005781
97 199 amt_drawings_atm_current_median 0.005756
98 269 diff_days_installment_payment_mean 0.005724
99 353 rate_down_payment_range 0.005708
100 180 amt_credit_limit_actual_median 0.005691
101 179 amt_balance_credit_card_min 0.005593
102 189 amt_credit_sum_debt_sum 0.005370
103 338 n_reject_reason_limit 0.005277
104 249 days_credit_update_range 0.005114
105 276 diff_percent_installment_payment_min 0.004991
106 271 diff_days_installment_payment_range 0.004699
107 267 diff_amt_installment_payment_range 0.004486
108 245 days_credit_range 0.004452
109 169 amt_annuity_max 0.004424
110 250 days_decision_max 0.004413
111 258 days_last_due_1st_version_mean 0.004406
112 281 missingindicator_EXT_SOURCE_3 0.004364
113 247 days_credit_update_max 0.004330
114 16 ELEVATORS_AVG 0.004316
115 211 amt_payment_current_range 0.004173
116 186 amt_credit_min 0.004173
117 2 AMT_INCOME_TOTAL 0.004114
118 300 n_channel_type_channel_corporate_sales 0.004060
119 315 n_credits_total 0.004046
120 135 ORGANIZATION_TYPE_School 0.003835
121 305 n_client_type_new 0.003779
122 327 n_nflag_insured_on_approval_mean 0.003658
123 291 missingindicator_days_enddate_fact_range 0.003585
124 268 diff_days_installment_payment_max 0.003481
125 187 amt_credit_range 0.003452
126 240 days_credit_enddate_min 0.003372
127 184 amt_credit_max_overdue_range 0.003139
128 185 amt_credit_median 0.002895
129 262 days_termination_median 0.002754
130 62 NAME_INCOME_TYPE_State_servant 0.002689
131 229 cnt_installment_mature_cum_max 0.002672
132 264 diff_amt_installment_payment_max 0.002671
133 17 ENTRANCES_MODE 0.002638
134 86 OCCUPATION_TYPE_Low_skill_Laborers 0.002633
135 224 cnt_drawings_pos_current_max 0.002612
136 321 n_installments_late 0.002576
137 168 YEARS_BUILD_AVG 0.002534
138 9 BASEMENTAREA_MODE 0.002169
139 48 LANDAREA_MEDI 0.002018
140 11 COMMONAREA_MEDI 0.001994
141 299 n_channel_type_ap_minus 0.001951
142 60 NAME_INCOME_TYPE_Commercial_associate 0.001926
143 256 days_first_draw_min 0.001872
144 76 OBS_30_CNT_SOCIAL_CIRCLE 0.001866
145 204 amt_drawings_pos_current_max 0.001863
146 172 amt_annuity_min 0.001810
147 127 ORGANIZATION_TYPE_Military 0.001797
148 340 n_reject_reason_scofr 0.001793
149 301 n_channel_type_contact_center 0.001719
150 98 ORGANIZATION_TYPE_Bank 0.001705
151 270 diff_days_installment_payment_median 0.001700
152 93 OCCUPATION_TYPE_Security_staff 0.001660
153 124 ORGANIZATION_TYPE_Kindergarten 0.001577
154 8 AMT_REQ_CREDIT_BUREAU_YEAR 0.001543
155 10 CNT_FAM_MEMBERS 0.001543
156 284 missingindicator_amt_credit_sum_debt_mean 0.001454
157 304 n_channel_type_stone 0.001438
158 285 missingindicator_amt_credit_sum_limit_min 0.001342
159 193 amt_credit_sum_overdue_sum 0.001334
160 266 diff_amt_installment_payment_median 0.001291
161 75 NONLIVINGAREA_MODE 0.001231
162 298 n_car_loans 0.001193
163 277 diff_percent_installment_payment_range 0.001146
164 228 cnt_installment_future_min 0.001089
165 150 ORGANIZATION_TYPE_Transport_type_3 0.001076
166 122 ORGANIZATION_TYPE_Industry_type_9 0.001053
167 142 ORGANIZATION_TYPE_Trade_type_2 0.000975
168 246 days_credit_std 0.000886
169 345 n_yield_group_middle 0.000860
170 306 n_client_type_refreshed 0.000839
171 331 n_payment_type_not_available 0.000825
172 198 amt_drawings_atm_current_max 0.000804
173 355 sk_dpd_credit_card_max 0.000761
174 231 cnt_installment_median 0.000751
175 40 FLOORSMIN_MEDI 0.000716
176 312 n_credit_card_credits 0.000665
177 359 sk_dpd_pos_applications_max 0.000651
178 311 n_contracts_credit_card_completed 0.000636
179 210 amt_payment_current_min 0.000606
180 170 amt_annuity_median 0.000594
181 275 diff_percent_installment_payment_median 0.000547
182 293 mode_credit_type_Consumer_credit 0.000520
183 130 ORGANIZATION_TYPE_Police 0.000480
184 23 FLAG_DOCUMENT_13 0.000465
185 36 FLAG_OWN_REALTY 0.000456
186 165 WALLSMATERIAL_MODE_Stone_brick 0.000455
187 74 NONLIVINGAPARTMENTS_AVG 0.000438
188 26 FLAG_DOCUMENT_18 0.000433
189 5 AMT_REQ_CREDIT_BUREAU_MON 0.000426
190 303 n_channel_type_regional_and_local 0.000414
191 296 mode_credit_type_Mortgage 0.000363
192 57 NAME_HOUSING_TYPE_Rented_apartment 0.000362
193 219 cnt_credit_prolong_mean 0.000323
194 330 n_payment_type_cash_through_bank 0.000318
195 254 days_enddate_fact_median 0.000284
196 237 cnt_payment_min 0.000279
197 56 NAME_HOUSING_TYPE_Office_apartment 0.000269
198 339 n_reject_reason_scoc 0.000251
199 206 amt_drawings_pos_current_min 0.000246
200 350 percent_installments_late_60 0.000242
201 132 ORGANIZATION_TYPE_Realtor 0.000232
202 349 percent_installments_late_30 0.000226
203 310 n_contract_status_unused_offer 0.000216
204 181 amt_credit_limit_actual_range 0.000209
205 217 bureau_dpd_status_median 0.000203
206 30 FLAG_DOCUMENT_8 0.000196
207 212 amt_payment_total_current_min 0.000173
208 50 NAME_EDUCATION_TYPE_Academic_degree 0.000171
209 157 REG_CITY_NOT_WORK_CITY 0.000165
210 243 days_credit_overdue_max 0.000164
211 95 OCCUPATION_TYPE_nan 0.000152
212 203 amt_drawings_other_current_max 0.000141
213 341 n_revolving_loans 0.000127
214 308 n_consumer_loans 0.000127
215 125 ORGANIZATION_TYPE_Legal_Services 0.000101
216 218 bureau_months_balance_max 0.000094
217 190 amt_credit_sum_limit_min 0.000090
218 323 n_installments_late_7 0.000089
219 147 ORGANIZATION_TYPE_Trade_type_7 0.000087
220 91 OCCUPATION_TYPE_Sales_staff 0.000079
221 356 sk_dpd_credit_card_median 0.000079
222 34 FLAG_IS_EMERGENCY 0.000077
223 67 NAME_TYPE_SUITE_Family 0.000075
224 230 cnt_installment_mature_cum_min 0.000069
225 7 AMT_REQ_CREDIT_BUREAU_WEEK 0.000064
226 161 WALLSMATERIAL_MODE_Mixed 0.000057
227 79 OCCUPATION_TYPE_Cooking_staff 0.000056
228 149 ORGANIZATION_TYPE_Transport_type_2 0.000033
229 333 n_previous_credit_card_applications_signed 0.000027
230 21 FLAG_CONT_MOBILE 0.000000
231 31 FLAG_DOCUMENT_9 0.000000
232 320 n_different_currencies 0.000000
233 282 missingindicator_YEARS_BUILD_AVG 0.000000
234 47 HOUSETYPE_MODE_terraced_house 0.000000
235 332 n_previous_credit_card_applications 0.000000
236 319 n_different_credit_types 0.000000
237 280 missingindicator_EXT_SOURCE_2 0.000000
238 322 n_installments_late_30 0.000000
239 51 NAME_EDUCATION_TYPE_Incomplete_higher 0.000000
240 287 missingindicator_amt_down_payment_max 0.000000
241 354 rate_interest_privileged_count 0.000000
242 278 missingindicator_DEF_30_CNT_SOCIAL_CIRCLE 0.000000
243 357 sk_dpd_def_credit_card_max 0.000000
244 329 n_other_type_credit 0.000000
245 32 FLAG_EMAIL 0.000000
246 52 NAME_EDUCATION_TYPE_Lower_secondary 0.000000
247 42 FONDKAPREMONT_MODE_org_spec_account 0.000000
248 317 n_different_channels 0.000000
249 318 n_different_contract_types 0.000000
250 3 AMT_REQ_CREDIT_BUREAU_DAY 0.000000
251 307 n_client_type_repeater 0.000000
252 41 FONDKAPREMONT_MODE_not_specified 0.000000
253 302 n_channel_type_countrywide 0.000000
254 336 n_previous_pos_applications_signed 0.000000
255 43 FONDKAPREMONT_MODE_reg_oper_account 0.000000
256 53 NAME_HOUSING_TYPE_Co_op_apartment 0.000000
257 24 FLAG_DOCUMENT_14 0.000000
258 297 mode_credit_type_Other 0.000000
259 28 FLAG_DOCUMENT_5 0.000000
260 29 FLAG_DOCUMENT_6 0.000000
261 314 n_credits_sold 0.000000
262 295 mode_credit_type_Microloan 0.000000
263 294 mode_credit_type_Credit_card 0.000000
264 316 n_currency_2 0.000000
265 22 FLAG_DOCUMENT_11 0.000000
266 292 mode_credit_type_Car_loan 0.000000
267 44 FONDKAPREMONT_MODE_reg_oper_spec_account 0.000000
268 45 HOUSETYPE_MODE_nan 0.000000
269 289 missingindicator_cnt_installment_range 0.000000
270 25 FLAG_DOCUMENT_16 0.000000
271 46 HOUSETYPE_MODE_specific_housing 0.000000
272 328 n_nflag_insured_on_approval_sum 0.000000
273 128 ORGANIZATION_TYPE_Mobile 0.000000
274 54 NAME_HOUSING_TYPE_House_apartment 0.000000
275 115 ORGANIZATION_TYPE_Industry_type_2 0.000000
276 121 ORGANIZATION_TYPE_Industry_type_8 0.000000
277 166 WALLSMATERIAL_MODE_Wooden 0.000000
278 120 ORGANIZATION_TYPE_Industry_type_7 0.000000
279 119 ORGANIZATION_TYPE_Industry_type_6 0.000000
280 118 ORGANIZATION_TYPE_Industry_type_5 0.000000
281 117 ORGANIZATION_TYPE_Industry_type_4 0.000000
282 4 AMT_REQ_CREDIT_BUREAU_HOUR 0.000000
283 116 ORGANIZATION_TYPE_Industry_type_3 0.000000
284 114 ORGANIZATION_TYPE_Industry_type_13 0.000000
285 162 WALLSMATERIAL_MODE_Monolithic 0.000000
286 113 ORGANIZATION_TYPE_Industry_type_12 0.000000
287 112 ORGANIZATION_TYPE_Industry_type_11 0.000000
288 111 ORGANIZATION_TYPE_Industry_type_10 0.000000
289 110 ORGANIZATION_TYPE_Industry_type_1 0.000000
290 109 ORGANIZATION_TYPE_Housing 0.000000
291 108 ORGANIZATION_TYPE_Hotel 0.000000
292 107 ORGANIZATION_TYPE_Government 0.000000
293 106 ORGANIZATION_TYPE_Emergency 0.000000
294 163 WALLSMATERIAL_MODE_Others 0.000000
295 160 WALLSMATERIAL_MODE_Block 0.000000
296 104 ORGANIZATION_TYPE_Culture 0.000000
297 141 ORGANIZATION_TYPE_Trade_type_1 0.000000
298 131 ORGANIZATION_TYPE_Postal 0.000000
299 133 ORGANIZATION_TYPE_Religion 0.000000
300 134 ORGANIZATION_TYPE_Restaurant 0.000000
301 136 ORGANIZATION_TYPE_Security 0.000000
302 137 ORGANIZATION_TYPE_Security_Ministries 0.000000
303 126 ORGANIZATION_TYPE_Medicine 0.000000
304 139 ORGANIZATION_TYPE_Services 0.000000
305 140 ORGANIZATION_TYPE_Telecom 0.000000
306 143 ORGANIZATION_TYPE_Trade_type_3 0.000000
307 159 REG_REGION_NOT_WORK_REGION 0.000000
308 144 ORGANIZATION_TYPE_Trade_type_4 0.000000
309 145 ORGANIZATION_TYPE_Trade_type_5 0.000000
310 146 ORGANIZATION_TYPE_Trade_type_6 0.000000
311 148 ORGANIZATION_TYPE_Transport_type_1 0.000000
312 151 ORGANIZATION_TYPE_Transport_type_4 0.000000
313 152 ORGANIZATION_TYPE_University 0.000000
314 123 ORGANIZATION_TYPE_Insurance 0.000000
315 158 REG_REGION_NOT_LIVE_REGION 0.000000
316 105 ORGANIZATION_TYPE_Electricity 0.000000
317 102 ORGANIZATION_TYPE_Cleaning 0.000000
318 55 NAME_HOUSING_TYPE_Municipal_apartment 0.000000
319 71 NAME_TYPE_SUITE_Spouse_partner 0.000000
320 87 OCCUPATION_TYPE_Managers 0.000000
321 84 OCCUPATION_TYPE_IT_staff 0.000000
322 83 OCCUPATION_TYPE_High_skill_tech_staff 0.000000
323 82 OCCUPATION_TYPE_HR_staff 0.000000
324 78 OCCUPATION_TYPE_Cleaning_staff 0.000000
325 129 ORGANIZATION_TYPE_Other 0.000000
326 73 NAME_TYPE_SUITE_nan 0.000000
327 72 NAME_TYPE_SUITE_Unaccompanied 0.000000
328 70 NAME_TYPE_SUITE_Other_B 0.000000
329 88 OCCUPATION_TYPE_Medicine_staff 0.000000
330 69 NAME_TYPE_SUITE_Other_A 0.000000
331 68 NAME_TYPE_SUITE_Group_of_people 0.000000
332 66 NAME_TYPE_SUITE_Children 0.000000
333 64 NAME_INCOME_TYPE_Unemployed 0.000000
334 63 NAME_INCOME_TYPE_Student 0.000000
335 61 NAME_INCOME_TYPE_Maternity_leave 0.000000
336 59 NAME_INCOME_TYPE_Businessman 0.000000
337 58 NAME_HOUSING_TYPE_With_parents 0.000000
338 234 cnt_installments_diff_min 0.000000
339 89 OCCUPATION_TYPE_Private_service_staff 0.000000
340 100 ORGANIZATION_TYPE_Business_Entity_Type_2 0.000000
341 92 OCCUPATION_TYPE_Secretaries 0.000000
342 99 ORGANIZATION_TYPE_Business_Entity_Type_1 0.000000
343 97 ORGANIZATION_TYPE_Agriculture 0.000000
344 96 ORGANIZATION_TYPE_Advertising 0.000000
345 200 amt_drawings_atm_current_min 0.000000
346 202 amt_drawings_current_min 0.000000
347 205 amt_drawings_pos_current_mean 0.000000
348 94 OCCUPATION_TYPE_Waiters_barmen_staff 0.000000
349 208 amt_inst_min_regularity_min 0.000000
350 213 any_installments_late_30 0.000000
351 90 OCCUPATION_TYPE_Realty_agents 0.000000
352 214 any_installments_late_60 0.000000
353 215 any_installments_late_7 0.000000
354 216 bureau_dpd_status_max 0.000000
355 220 cnt_credit_prolong_sum 0.000000
356 222 cnt_drawings_current_min 0.000000
357 223 cnt_drawings_other_current_max 0.000000
358 225 cnt_drawings_pos_current_median 0.000000
359 226 cnt_drawings_pos_current_min 0.000000
360 244 days_credit_overdue_mean 0.000000

7.4 Training Models with Feature Selection

In this section, models are trained on smaller subsets of features. The subsets are defined by thresholds on SHAP-based feature importance. The threshold values were chosen arbitrarily so that, in most cases, each successive threshold removes roughly 20-40 features.

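The model with 216 features (SHAP > 0.0001) shows the best validation performance in terms of ROC AUC (0.779) and some other metrics. The differences are small, though: all models with 66 or more features reach a validation ROC AUC of 0.776-0.779.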
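As a minimal sketch of the threshold-based selection described above, the snippet below filters a toy importance table. The column names col_name and importance follow the code in this section; the feature names and values are made up for illustration and are not project results.

```python
import pandas as pd

# Toy stand-in for `feature_importance_ext` (illustrative values only)
toy_importance = pd.DataFrame(
    {
        "col_name": ["feat_a", "feat_b", "feat_c", "feat_d"],
        "importance": [0.12, 0.004, 0.00005, 0.0],
    }
)


def select_features(importance_table: pd.DataFrame, threshold: float) -> list:
    """Return names of features whose importance exceeds the threshold."""
    mask = importance_table["importance"] > threshold
    return importance_table.loc[mask, "col_name"].to_list()


for threshold in [0.0001, 0.01]:
    selected = select_features(toy_importance, threshold)
    print(f"Threshold: {threshold} | Number of features: {len(selected)}")
```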

Code
def fit_lgbm_ext_on_features(features):
    """Template to fit a LGBM model with a smaller number of features."""
    pipeline = Pipeline(
        steps=[
            ("selector_1", ColumnSelector(cols_to_include_in_preprocessing)),
            ("preprocessor_2", clone(pre_processing)),
            ("selector_2", ColumnSelector(features)),
            ("classifier", clone(lgbm_classifier)),
        ]
    )
    pipeline.fit(X_credit_train, y_credit_train)
    return pipeline


def get_feature_names_by_shap_threshold(threshold):
    """Return names of features with SHAP-based importance above the threshold."""
    return feature_importance_ext.query(f"importance > {threshold}").col_name.to_list()


def n_features_by_shap_threshold(thresholds):
    """Print the number of features selected at each threshold."""
    for threshold in thresholds:
        n_features = len(get_feature_names_by_shap_threshold(threshold))
        print(f"Threshold: {threshold} | Number of features: {n_features}")


def fit_lgbm_ext_with_shap_threshold(threshold):
    """Fit a LGBM model on features selected by a SHAP importance threshold."""
    features = get_feature_names_by_shap_threshold(threshold)
    k = len(features)
    return f"LGBM ({k} features)", fit_lgbm_ext_on_features(features)
Code
thresholds = [
    0.0001,
    0.0005,
    0.0010,
    0.0020,
    0.0040,
    0.0050,
    0.0070,
    0.0100,
    0.0200,
    0.0300,
    0.0400,
    0.0500,
    0.1000,
]
Code
n_features_by_shap_threshold(thresholds)
Threshold: 0.0001 | Number of features: 216
Threshold: 0.0005 | Number of features: 183
Threshold: 0.001 | Number of features: 167
Threshold: 0.002 | Number of features: 140
Threshold: 0.004 | Number of features: 120
Threshold: 0.005 | Number of features: 105
Threshold: 0.007 | Number of features: 84
Threshold: 0.01 | Number of features: 66
Threshold: 0.02 | Number of features: 47
Threshold: 0.03 | Number of features: 29
Threshold: 0.04 | Number of features: 20
Threshold: 0.05 | Number of features: 12
Threshold: 0.1 | Number of features: 5
Code
# Restore from file or calculate
file = dir_interim + "task-2-w-credit-history--lgbm_molels_as_dict.pkl"

if os.path.exists(file):
    with open(file, "rb") as f:
        models = joblib.load(f)
else:
    for threshold in thresholds:
        model_name, model = fit_lgbm_ext_with_shap_threshold(threshold)
        models[model_name] = model

    with open(file, "wb") as f:
        joblib.dump(models, f)

del file
# Time: 5m 7.1s
Code
print("--- Train ---")
ml.classification_scores(
    models,
    X_credit_train,
    y_credit_train,
    sort_by="ROC_AUC",
    color="orange",
)
--- Train ---
  n No_info_rate Accuracy BAcc BAcc_01 F1 F1_neg TPR TNR PPV NPV ROC_AUC
LGBM (216 features) 215257 0.919 0.737 0.747 0.493 0.317 0.837 0.759 0.735 0.201 0.972 0.827
LGBM (167 features) 215257 0.919 0.737 0.747 0.495 0.318 0.837 0.759 0.735 0.201 0.972 0.827
LGBM (FULL | 361 feat.) 215257 0.919 0.737 0.748 0.495 0.318 0.837 0.760 0.735 0.201 0.972 0.827
LGBM (183 features) 215257 0.919 0.737 0.748 0.496 0.318 0.837 0.761 0.735 0.201 0.972 0.827
LGBM (140 features) 215257 0.919 0.737 0.748 0.496 0.319 0.837 0.761 0.735 0.201 0.972 0.827
LGBM (120 features) 215257 0.919 0.737 0.746 0.492 0.317 0.837 0.757 0.735 0.201 0.972 0.826
LGBM (105 features) 215257 0.919 0.736 0.746 0.493 0.317 0.836 0.759 0.734 0.200 0.972 0.826
LGBM (84 features) 215257 0.919 0.734 0.744 0.488 0.315 0.835 0.755 0.733 0.199 0.971 0.824
LGBM (66 features) 215257 0.919 0.733 0.743 0.486 0.313 0.834 0.756 0.731 0.198 0.971 0.823
LGBM (47 features) 215257 0.919 0.730 0.741 0.481 0.311 0.832 0.753 0.728 0.196 0.971 0.819
LGBM (29 features) 215257 0.919 0.724 0.734 0.468 0.304 0.828 0.746 0.722 0.191 0.970 0.812
LGBM (20 features) 215257 0.919 0.720 0.731 0.462 0.300 0.825 0.744 0.718 0.188 0.970 0.807
LGBM (12 features) 215257 0.919 0.711 0.719 0.437 0.289 0.819 0.728 0.710 0.180 0.967 0.795
LGBM (5 features) 215257 0.919 0.701 0.701 0.403 0.275 0.812 0.702 0.701 0.171 0.964 0.777
Code
print("--- Validation ---")
ml.classification_scores(
    models,
    X_credit_validation,
    y_credit_validation,
    sort_by="ROC_AUC",
)
--- Validation ---
  n No_info_rate Accuracy BAcc BAcc_01 F1 F1_neg TPR TNR PPV NPV ROC_AUC
LGBM (216 features) 46127 0.919 0.726 0.709 0.419 0.289 0.830 0.689 0.730 0.183 0.964 0.779
LGBM (183 features) 46127 0.919 0.724 0.705 0.411 0.286 0.829 0.682 0.728 0.181 0.963 0.779
LGBM (167 features) 46127 0.919 0.727 0.706 0.412 0.287 0.831 0.681 0.731 0.182 0.963 0.778
LGBM (140 features) 46127 0.919 0.724 0.706 0.411 0.286 0.829 0.684 0.728 0.181 0.963 0.778
LGBM (FULL | 361 feat.) 46127 0.919 0.724 0.704 0.409 0.285 0.829 0.680 0.728 0.180 0.963 0.778
LGBM (120 features) 46127 0.919 0.726 0.706 0.413 0.287 0.831 0.683 0.730 0.182 0.963 0.777
LGBM (105 features) 46127 0.919 0.725 0.704 0.408 0.285 0.829 0.679 0.729 0.180 0.963 0.777
LGBM (66 features) 46127 0.919 0.723 0.708 0.415 0.287 0.828 0.689 0.726 0.181 0.964 0.777
LGBM (84 features) 46127 0.919 0.722 0.706 0.412 0.285 0.828 0.687 0.726 0.180 0.963 0.776
LGBM (47 features) 46127 0.919 0.720 0.703 0.407 0.283 0.826 0.683 0.723 0.178 0.963 0.774
LGBM (29 features) 46127 0.919 0.714 0.700 0.399 0.278 0.822 0.682 0.717 0.175 0.963 0.770
LGBM (20 features) 46127 0.919 0.710 0.695 0.391 0.274 0.819 0.677 0.713 0.172 0.962 0.767
LGBM (12 features) 46127 0.919 0.705 0.693 0.386 0.271 0.815 0.679 0.707 0.169 0.962 0.760
LGBM (5 features) 46127 0.919 0.696 0.685 0.370 0.263 0.809 0.671 0.699 0.164 0.960 0.748
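
To make the gap between training and validation performance easier to see, the sketch below computes the difference in ROC AUC for a few of the models. The numbers are copied from the two tables above; the dictionary layout and the helper name are illustrative assumptions.

```python
# ROC AUC values copied from the train and validation tables above
roc_auc = {
    "LGBM (FULL | 361 feat.)": {"train": 0.827, "validation": 0.778},
    "LGBM (216 features)": {"train": 0.827, "validation": 0.779},
    "LGBM (66 features)": {"train": 0.823, "validation": 0.777},
    "LGBM (5 features)": {"train": 0.777, "validation": 0.748},
}


def auc_gap(scores: dict) -> float:
    """Train minus validation ROC AUC; larger values suggest more overfitting."""
    return round(scores["train"] - scores["validation"], 3)


for name, scores in roc_auc.items():
    print(f"{name}: gap = {auc_gap(scores):.3f}")
```

For these four models the gap ranges from 0.049 (full model) down to 0.029 (5 features), so removing features narrows the gap only modestly.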

7.5 Tune Hyperparameters

In this section, the best-performing model based on 216 features will be tuned. To tune hyperparameters, the Optuna package is used.

Code
# Use the subset of the selected features
file_path = dir_interim + "colnames--cols_to_include_in_preprocessing.csv"
cols_to_include_in_preprocessing = pd.read_csv(file_path).column.tolist()
features_to_tune = feature_importance_ext.query(
    "importance > 0.0001"
).col_name.to_list()
del file_path

# Use 3-fold stratified CV
stratified_kfold = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)
Code
# Define objective function for optuna
def objective(trial):
    """Objective function for hyperparameter tuning."""
    # LGBM params
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 50, 1000, step=50),
        "max_depth": trial.suggest_int("max_depth", 1, 12),
        "boosting_type": trial.suggest_categorical("boosting_type", ["gbdt"]),
        # Tree Structure and Complexity
        "num_leaves": trial.suggest_int("num_leaves", 2, 256),
        "min_child_samples": trial.suggest_int("min_child_samples", 5, 100),
        # Regularization
        "lambda_l1": trial.suggest_float("lambda_l1", 1e-8, 10.0, log=True),
        "lambda_l2": trial.suggest_float("lambda_l2", 1e-8, 10.0, log=True),
        "reg_alpha": trial.suggest_float("reg_alpha", 0.0, 1.0),
        "reg_lambda": trial.suggest_float("reg_lambda", 0.0, 1.0),
        # Learning Rate and Feature Selection
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        "feature_fraction": trial.suggest_float("feature_fraction", 0.4, 1.0),
        "subsample": trial.suggest_float("subsample", 0.05, 1.0),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.05, 1.0),
        # Other Parameters
        "bagging_fraction": trial.suggest_float("bagging_fraction", 0.4, 1.0),
        "bagging_freq": trial.suggest_int("bagging_freq", 1, 7),
        "min_child_weight": trial.suggest_float(
            "min_child_weight", 1e-3, 1e3, log=True
        ),
        "min_split_gain": trial.suggest_float("min_split_gain", 0.0, 1.0),
        "min_data_in_leaf": trial.suggest_int("min_data_in_leaf", 1, 50),
        "max_delta_step": trial.suggest_int("max_delta_step", 0, 10),
    }

    model = LGBMClassifier(
        objective="binary",
        metric="auc",
        random_state=1,
        class_weight="balanced",
        n_jobs=-1,
        device="gpu",
        **params,
    )

    pipeline_to_tune = Pipeline(
        steps=[
            ("selector_1", ColumnSelector(cols_to_include_in_preprocessing)),
            ("preprocessor_2", clone(pre_processing)),
            ("selector_2", ColumnSelector(features_to_tune)),
            ("classifier", model),
        ]
    )

    scores = cross_val_score(
        pipeline_to_tune, X_credit_train, y_credit_train, n_jobs=-1, cv=stratified_kfold
    )

    return scores.mean()


study_name = "tune-w-credit-history"
storage_name = f"sqlite:///{dir_interim}/optuna--{study_name}.db"

study = optuna.create_study(
    study_name=study_name,
    storage=storage_name,
    load_if_exists=True,
    direction="maximize",
)
study.optimize(objective, n_trials=100, timeout=3600)
# Time 62m 20.0s
[I 2023-12-27 19:20:13,041] A new study created in RDB with name: tune-w-credit-history
[I 2023-12-27 19:21:14,358] Trial 0 finished with value: 0.7133844678540423 and parameters: {'n_estimators': 800, 'max_depth': 1, 'boosting_type': 'gbdt', 'num_leaves': 68, 'min_child_samples': 29, 'lambda_l1': 0.0001520382569789408, 'lambda_l2': 1.0837050089743732e-05, 'reg_alpha': 0.06205900866526515, 'reg_lambda': 0.9973509941432356, 'learning_rate': 0.17385115867266585, 'feature_fraction': 0.457613838544846, 'subsample': 0.7641858502243166, 'colsample_bytree': 0.4691247981564376, 'bagging_fraction': 0.9776044061943601, 'bagging_freq': 3, 'min_child_weight': 0.2632724059022625, 'min_split_gain': 0.3144179175251397, 'min_data_in_leaf': 20, 'max_delta_step': 6}. Best is trial 0 with value: 0.7133844678540423.
[I 2023-12-27 19:21:54,405] Trial 1 finished with value: 0.7188662922677773 and parameters: {'n_estimators': 300, 'max_depth': 2, 'boosting_type': 'gbdt', 'num_leaves': 111, 'min_child_samples': 72, 'lambda_l1': 0.0010433150020650284, 'lambda_l2': 0.000733681979055871, 'reg_alpha': 0.35178729517365503, 'reg_lambda': 0.03423976796711303, 'learning_rate': 0.2903376154019435, 'feature_fraction': 0.6147235927730519, 'subsample': 0.6121461820293557, 'colsample_bytree': 0.8644863075835109, 'bagging_fraction': 0.6308830635202857, 'bagging_freq': 2, 'min_child_weight': 0.2432421435462729, 'min_split_gain': 0.223990834897291, 'min_data_in_leaf': 40, 'max_delta_step': 2}. Best is trial 1 with value: 0.7188662922677773.
[I 2023-12-27 19:22:23,909] Trial 2 finished with value: 0.7126876338496727 and parameters: {'n_estimators': 150, 'max_depth': 2, 'boosting_type': 'gbdt', 'num_leaves': 139, 'min_child_samples': 39, 'lambda_l1': 0.4651706361118245, 'lambda_l2': 0.0113567650942596, 'reg_alpha': 0.28596868511468554, 'reg_lambda': 0.5617766235501069, 'learning_rate': 0.2862326501066601, 'feature_fraction': 0.7760128851639144, 'subsample': 0.48788054317390844, 'colsample_bytree': 0.2983688315154828, 'bagging_fraction': 0.7507872036234606, 'bagging_freq': 7, 'min_child_weight': 814.1423255572674, 'min_split_gain': 0.12315860127817202, 'min_data_in_leaf': 26, 'max_delta_step': 1}. Best is trial 1 with value: 0.7188662922677773.
[I 2023-12-27 19:23:52,349] Trial 3 finished with value: 0.8479259574144425 and parameters: {'n_estimators': 1000, 'max_depth': 5, 'boosting_type': 'gbdt', 'num_leaves': 112, 'min_child_samples': 96, 'lambda_l1': 0.00017306252387348225, 'lambda_l2': 1.977539005227824e-05, 'reg_alpha': 0.4283787945872213, 'reg_lambda': 0.9893280264194702, 'learning_rate': 0.17128861427623032, 'feature_fraction': 0.6845818343036181, 'subsample': 0.17203387975679538, 'colsample_bytree': 0.7980915047057495, 'bagging_fraction': 0.8435277923028028, 'bagging_freq': 4, 'min_child_weight': 0.004024480124332665, 'min_split_gain': 0.4465134798497602, 'min_data_in_leaf': 38, 'max_delta_step': 3}. Best is trial 3 with value: 0.8479259574144425.
[I 2023-12-27 19:25:11,579] Trial 4 finished with value: 0.7387123283474638 and parameters: {'n_estimators': 500, 'max_depth': 6, 'boosting_type': 'gbdt', 'num_leaves': 223, 'min_child_samples': 21, 'lambda_l1': 1.036153287921386e-07, 'lambda_l2': 1.6295147516128015e-07, 'reg_alpha': 0.6652264420514161, 'reg_lambda': 0.18748862643684328, 'learning_rate': 0.01347760446234664, 'feature_fraction': 0.8228363548713917, 'subsample': 0.5904533867022965, 'colsample_bytree': 0.9358815070918077, 'bagging_fraction': 0.5846974106187401, 'bagging_freq': 5, 'min_child_weight': 0.7074335671273558, 'min_split_gain': 0.29439756447372956, 'min_data_in_leaf': 17, 'max_delta_step': 9}. Best is trial 3 with value: 0.8479259574144425.
[I 2023-12-27 19:25:39,516] Trial 5 finished with value: 0.6993268474558821 and parameters: {'n_estimators': 150, 'max_depth': 2, 'boosting_type': 'gbdt', 'num_leaves': 216, 'min_child_samples': 14, 'lambda_l1': 9.750757318137843, 'lambda_l2': 0.34479627430862014, 'reg_alpha': 0.8213573723009959, 'reg_lambda': 0.9754288921619616, 'learning_rate': 0.08283760652331447, 'feature_fraction': 0.782639149015379, 'subsample': 0.43314082646508134, 'colsample_bytree': 0.2156419154071122, 'bagging_fraction': 0.9768194695171522, 'bagging_freq': 3, 'min_child_weight': 0.005020398733771443, 'min_split_gain': 0.710369095642509, 'min_data_in_leaf': 19, 'max_delta_step': 6}. Best is trial 3 with value: 0.8479259574144425.
[I 2023-12-27 19:26:50,629] Trial 6 finished with value: 0.8316849025663076 and parameters: {'n_estimators': 200, 'max_depth': 9, 'boosting_type': 'gbdt', 'num_leaves': 120, 'min_child_samples': 100, 'lambda_l1': 4.911477023170647e-08, 'lambda_l2': 1.3489936875313564e-05, 'reg_alpha': 0.673138884928222, 'reg_lambda': 0.003695339233514061, 'learning_rate': 0.11318013505175774, 'feature_fraction': 0.9208947900557085, 'subsample': 0.27865211549049074, 'colsample_bytree': 0.7964060768332551, 'bagging_fraction': 0.6686535176120247, 'bagging_freq': 1, 'min_child_weight': 4.080570526996787, 'min_split_gain': 0.9121238054632345, 'min_data_in_leaf': 37, 'max_delta_step': 6}. Best is trial 3 with value: 0.8479259574144425.
[I 2023-12-27 19:27:54,315] Trial 7 finished with value: 0.7522496444555378 and parameters: {'n_estimators': 600, 'max_depth': 5, 'boosting_type': 'gbdt', 'num_leaves': 244, 'min_child_samples': 43, 'lambda_l1': 2.534823760275499e-08, 'lambda_l2': 2.7995860787697047, 'reg_alpha': 0.28650632087067907, 'reg_lambda': 0.6249923343905277, 'learning_rate': 0.039342315509374066, 'feature_fraction': 0.7293660504039947, 'subsample': 0.6741250410202273, 'colsample_bytree': 0.16784925739666362, 'bagging_fraction': 0.5496868541785603, 'bagging_freq': 3, 'min_child_weight': 6.446646717695389, 'min_split_gain': 0.9250331900530822, 'min_data_in_leaf': 49, 'max_delta_step': 2}. Best is trial 3 with value: 0.8479259574144425.
[I 2023-12-27 19:28:21,006] Trial 8 finished with value: 0.7072801486649237 and parameters: {'n_estimators': 50, 'max_depth': 3, 'boosting_type': 'gbdt', 'num_leaves': 59, 'min_child_samples': 6, 'lambda_l1': 2.05655116866198e-08, 'lambda_l2': 4.0069835388076785e-08, 'reg_alpha': 0.26041375594498495, 'reg_lambda': 0.38611463427103676, 'learning_rate': 0.2781185770603419, 'feature_fraction': 0.741937028506221, 'subsample': 0.4462685297259511, 'colsample_bytree': 0.5284777436645074, 'bagging_fraction': 0.9732383128478747, 'bagging_freq': 1, 'min_child_weight': 2.319742366056286, 'min_split_gain': 0.4202994519219667, 'min_data_in_leaf': 15, 'max_delta_step': 2}. Best is trial 3 with value: 0.8479259574144425.
[I 2023-12-27 19:28:49,372] Trial 9 finished with value: 0.6756110093515885 and parameters: {'n_estimators': 50, 'max_depth': 4, 'boosting_type': 'gbdt', 'num_leaves': 11, 'min_child_samples': 41, 'lambda_l1': 7.36483774066561e-07, 'lambda_l2': 0.0004694219095532689, 'reg_alpha': 0.8791483544217241, 'reg_lambda': 0.42223959207093187, 'learning_rate': 0.013380954064159494, 'feature_fraction': 0.48647182501042135, 'subsample': 0.33239085856469064, 'colsample_bytree': 0.4521145383014534, 'bagging_fraction': 0.6908504039445515, 'bagging_freq': 6, 'min_child_weight': 82.2433055046293, 'min_split_gain': 0.28892951590992133, 'min_data_in_leaf': 25, 'max_delta_step': 1}. Best is trial 3 with value: 0.8479259574144425.
[I 2023-12-27 19:36:23,915] Trial 10 finished with value: 0.9043887034768833 and parameters: {'n_estimators': 1000, 'max_depth': 12, 'boosting_type': 'gbdt', 'num_leaves': 171, 'min_child_samples': 71, 'lambda_l1': 1.6894405269684752e-05, 'lambda_l2': 2.5479861382375376e-06, 'reg_alpha': 0.5305753243887574, 'reg_lambda': 0.7834468951359224, 'learning_rate': 0.05897985586531902, 'feature_fraction': 0.608096917733464, 'subsample': 0.0823584299376674, 'colsample_bytree': 0.6951910076532127, 'bagging_fraction': 0.4298117967213491, 'bagging_freq': 5, 'min_child_weight': 0.0011663326986012653, 'min_split_gain': 0.553012365836588, 'min_data_in_leaf': 1, 'max_delta_step': 4}. Best is trial 10 with value: 0.9043887034768833.
[I 2023-12-27 19:44:20,709] Trial 11 finished with value: 0.9002541073988085 and parameters: {'n_estimators': 1000, 'max_depth': 11, 'boosting_type': 'gbdt', 'num_leaves': 173, 'min_child_samples': 78, 'lambda_l1': 1.0708866586763564e-05, 'lambda_l2': 2.4683340848716906e-06, 'reg_alpha': 0.5302837150243924, 'reg_lambda': 0.8033717158930774, 'learning_rate': 0.05047074984874552, 'feature_fraction': 0.6196066363480937, 'subsample': 0.08527296183385108, 'colsample_bytree': 0.7184725298338295, 'bagging_fraction': 0.43735780133140306, 'bagging_freq': 5, 'min_child_weight': 0.0010552040603009237, 'min_split_gain': 0.5642018881288399, 'min_data_in_leaf': 2, 'max_delta_step': 4}. Best is trial 10 with value: 0.9043887034768833.
[I 2023-12-27 19:53:01,431] Trial 12 finished with value: 0.9002541115424737 and parameters: {'n_estimators': 1000, 'max_depth': 12, 'boosting_type': 'gbdt', 'num_leaves': 173, 'min_child_samples': 70, 'lambda_l1': 1.165530319451606e-05, 'lambda_l2': 6.557722320701904e-07, 'reg_alpha': 0.55443498043106, 'reg_lambda': 0.750406064876292, 'learning_rate': 0.04803394354033783, 'feature_fraction': 0.5418748467431972, 'subsample': 0.0527277396464233, 'colsample_bytree': 0.6409664835561325, 'bagging_fraction': 0.4057583105199048, 'bagging_freq': 5, 'min_child_weight': 0.0012744723872042992, 'min_split_gain': 0.6235370931106026, 'min_data_in_leaf': 1, 'max_delta_step': 4}. Best is trial 10 with value: 0.9043887034768833.
[I 2023-12-27 19:57:59,299] Trial 13 finished with value: 0.8785080092391984 and parameters: {'n_estimators': 800, 'max_depth': 12, 'boosting_type': 'gbdt', 'num_leaves': 179, 'min_child_samples': 59, 'lambda_l1': 3.870504370488863e-06, 'lambda_l2': 3.5632798838577234e-07, 'reg_alpha': 0.5552302283068307, 'reg_lambda': 0.7441647781250288, 'learning_rate': 0.0331882192882924, 'feature_fraction': 0.5379323013978335, 'subsample': 0.05782808696150672, 'colsample_bytree': 0.6668916697472603, 'bagging_fraction': 0.4042138039242215, 'bagging_freq': 5, 'min_child_weight': 0.02749957725080116, 'min_split_gain': 0.6358682934280196, 'min_data_in_leaf': 2, 'max_delta_step': 8}. Best is trial 10 with value: 0.9043887034768833.
[I 2023-12-27 20:02:08,027] Trial 14 finished with value: 0.8543276011471223 and parameters: {'n_estimators': 850, 'max_depth': 9, 'boosting_type': 'gbdt', 'num_leaves': 197, 'min_child_samples': 65, 'lambda_l1': 5.857325396819327e-06, 'lambda_l2': 1.1540765814883997e-08, 'reg_alpha': 0.9449558716340772, 'reg_lambda': 0.7643735518759827, 'learning_rate': 0.0264377319047264, 'feature_fraction': 0.4138809416134508, 'subsample': 0.9785014763384117, 'colsample_bytree': 0.6217631862933801, 'bagging_fraction': 0.5004604633571963, 'bagging_freq': 7, 'min_child_weight': 0.029231874860064476, 'min_split_gain': 0.7346524444201824, 'min_data_in_leaf': 9, 'max_delta_step': 4}. Best is trial 10 with value: 0.9043887034768833.
[I 2023-12-27 20:05:28,515] Trial 15 finished with value: 0.8773558974807143 and parameters: {'n_estimators': 650, 'max_depth': 9, 'boosting_type': 'gbdt', 'num_leaves': 159, 'min_child_samples': 84, 'lambda_l1': 0.0010947979254749935, 'lambda_l2': 6.566180247402054e-07, 'reg_alpha': 0.700542554170593, 'reg_lambda': 0.6367861929705528, 'learning_rate': 0.06963364500865421, 'feature_fraction': 0.5454571723060675, 'subsample': 0.21789311730283067, 'colsample_bytree': 0.9417867015781536, 'bagging_fraction': 0.4613056213776964, 'bagging_freq': 6, 'min_child_weight': 0.0010050650360104857, 'min_split_gain': 0.4781261442568392, 'min_data_in_leaf': 8, 'max_delta_step': 5}. Best is trial 10 with value: 0.9043887034768833.
[I 2023-12-27 20:09:52,587] Trial 16 finished with value: 0.8996594719728325 and parameters: {'n_estimators': 900, 'max_depth': 11, 'boosting_type': 'gbdt', 'num_leaves': 150, 'min_child_samples': 56, 'lambda_l1': 5.2101850657853695e-05, 'lambda_l2': 1.0270011883222986e-08, 'reg_alpha': 0.47817116755622174, 'reg_lambda': 0.8455458639986815, 'learning_rate': 0.05796270751653336, 'feature_fraction': 0.6127662875632117, 'subsample': 0.1571666236334549, 'colsample_bytree': 0.5457412302228843, 'bagging_fraction': 0.5123096232048837, 'bagging_freq': 4, 'min_child_weight': 0.03226715486140934, 'min_split_gain': 0.5688182774542483, 'min_data_in_leaf': 9, 'max_delta_step': 0}. Best is trial 10 with value: 0.9043887034768833.
[I 2023-12-27 20:11:43,889] Trial 17 finished with value: 0.7970983517618615 and parameters: {'n_estimators': 400, 'max_depth': 8, 'boosting_type': 'gbdt', 'num_leaves': 249, 'min_child_samples': 88, 'lambda_l1': 8.485857943906983e-07, 'lambda_l2': 2.332740208848319e-06, 'reg_alpha': 0.9978838168722097, 'reg_lambda': 0.6798468844976686, 'learning_rate': 0.028667289635909032, 'feature_fraction': 0.40793738816795067, 'subsample': 0.0673086918921274, 'colsample_bytree': 0.9946989885817998, 'bagging_fraction': 0.4246976859779345, 'bagging_freq': 6, 'min_child_weight': 0.004197888784056201, 'min_split_gain': 0.7772416365414636, 'min_data_in_leaf': 1, 'max_delta_step': 8}. Best is trial 10 with value: 0.9043887034768833.
[I 2023-12-27 20:14:34,560] Trial 18 finished with value: 0.8411340787793012 and parameters: {'n_estimators': 700, 'max_depth': 12, 'boosting_type': 'gbdt', 'num_leaves': 79, 'min_child_samples': 70, 'lambda_l1': 0.0064640587661589505, 'lambda_l2': 7.546095424976181e-05, 'reg_alpha': 0.6179256374168793, 'reg_lambda': 0.8653554387092988, 'learning_rate': 0.04692359560723053, 'feature_fraction': 0.540231361055241, 'subsample': 0.3039159708340914, 'colsample_bytree': 0.7111036760812757, 'bagging_fraction': 0.4943364397262184, 'bagging_freq': 5, 'min_child_weight': 0.012808373527934225, 'min_split_gain': 0.6459614317370587, 'min_data_in_leaf': 11, 'max_delta_step': 4}. Best is trial 10 with value: 0.9043887034768833.
[I 2023-12-27 20:18:39,592] Trial 19 finished with value: 0.9003656058507529 and parameters: {'n_estimators': 950, 'max_depth': 10, 'boosting_type': 'gbdt', 'num_leaves': 195, 'min_child_samples': 51, 'lambda_l1': 1.9740443034087055e-05, 'lambda_l2': 1.538492352140492e-07, 'reg_alpha': 0.7820676589881548, 'reg_lambda': 0.485391596878385, 'learning_rate': 0.08338420033598173, 'feature_fraction': 0.6750014906044565, 'subsample': 0.18379862862364665, 'colsample_bytree': 0.3922667871442633, 'bagging_fraction': 0.4002258795457228, 'bagging_freq': 4, 'min_child_weight': 0.0929726219315795, 'min_split_gain': 0.8462828519856542, 'min_data_in_leaf': 30, 'max_delta_step': 7}. Best is trial 10 with value: 0.9043887034768833.
[I 2023-12-27 20:22:31,723] Trial 20 finished with value: 0.9029578514479404 and parameters: {'n_estimators': 900, 'max_depth': 10, 'boosting_type': 'gbdt', 'num_leaves': 209, 'min_child_samples': 47, 'lambda_l1': 3.351553254308848e-07, 'lambda_l2': 6.380193421921823e-08, 'reg_alpha': 0.7908982979418974, 'reg_lambda': 0.4871542957987469, 'learning_rate': 0.08670301703025042, 'feature_fraction': 0.6613735657699136, 'subsample': 0.3683340397111962, 'colsample_bytree': 0.3697916753974209, 'bagging_fraction': 0.5700116779255459, 'bagging_freq': 4, 'min_child_weight': 0.09342824353373518, 'min_split_gain': 0.8503928290277616, 'min_data_in_leaf': 27, 'max_delta_step': 10}. Best is trial 10 with value: 0.9043887034768833.

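The best trial is trial 10, with a mean CV score of 0.9044. Because cross_val_score is called without a scoring argument, this value is the mean accuracy rather than ROC AUC (for comparison, the no-information rate is 0.919).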

Trial 10 finished with value: 0.9043887034768833 and parameters:

  • ‘n_estimators’: 1000,
  • ‘max_depth’: 12,
  • ‘boosting_type’: ‘gbdt’,
  • ‘num_leaves’: 171,
  • ‘min_child_samples’: 71,
  • ‘lambda_l1’: 1.6894405269684752e-05,
  • ‘lambda_l2’: 2.5479861382375376e-06,
  • ‘reg_alpha’: 0.5305753243887574,
  • ‘reg_lambda’: 0.7834468951359224,
  • ‘learning_rate’: 0.05897985586531902,
  • ‘feature_fraction’: 0.608096917733464,
  • ‘subsample’: 0.0823584299376674,
  • ‘colsample_bytree’: 0.6951910076532127,
  • ‘bagging_fraction’: 0.4298117967213491,
  • ‘bagging_freq’: 5,
  • ‘min_child_weight’: 0.0011663326986012653,
  • ‘min_split_gain’: 0.553012365836588,
  • ‘min_data_in_leaf’: 1,
  • ‘max_delta_step’: 4
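
How the scoring argument of cross_val_score changes the returned metric can be sketched on synthetic imbalanced data. Without scoring, scikit-learn falls back to the estimator's score method (accuracy for classifiers); with scoring="roc_auc" it returns ROC AUC. The dataset and the logistic regression below are stand-ins chosen for illustration, not the project's data or model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic data with roughly 92% negatives (an assumption for illustration)
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.92], random_state=1
)
model = LogisticRegression(max_iter=1000)
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)

# Default: the estimator's own `score` method, i.e. accuracy for classifiers
accuracy_scores = cross_val_score(model, X, y, cv=cv)

# Explicit: area under the ROC curve
auc_scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")

print(f"Mean accuracy: {accuracy_scores.mean():.3f}")
print(f"Mean ROC AUC:  {auc_scores.mean():.3f}")
```

On imbalanced data, accuracy can be high even for a model that ranks cases poorly, so an objective meant to maximize ROC AUC would need the explicit scoring argument.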

7.6 Evaluate Tuned Model

The tuned model faces the same overfitting issues as the tuned model built earlier without credit history data. This should be addressed by limiting model complexity. Unfortunately, there was not enough time to do this within the project.
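
One way to limit model complexity during tuning is to narrow the search space towards shallower trees and larger leaves. The sketch below is only an illustration: the function name and all ranges are assumptions, not values tested in this project. It relies only on the suggest_int and suggest_float methods of an Optuna trial.

```python
def suggest_constrained_params(trial):
    """Suggest LightGBM parameters from a deliberately conservative search space.

    All ranges are illustrative assumptions, not tuned values.
    """
    return {
        # Shallow trees with few leaves
        "max_depth": trial.suggest_int("max_depth", 2, 6),
        "num_leaves": trial.suggest_int("num_leaves", 4, 64),
        # Larger leaves and stronger L2 regularization reduce variance
        "min_child_samples": trial.suggest_int("min_child_samples", 50, 500),
        "reg_lambda": trial.suggest_float("reg_lambda", 1e-3, 10.0, log=True),
        # Fewer boosting rounds and smaller steps
        "n_estimators": trial.suggest_int("n_estimators", 50, 400, step=50),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.1, log=True),
    }
```

Inside an objective function, the returned dictionary could be unpacked into LGBMClassifier(**params) in place of the wider params dictionary used in Section 7.5.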

Code
params_tuned_2 = {
    "n_estimators": 1000,
    "max_depth": 12,
    "boosting_type": "gbdt",
    "num_leaves": 171,
    "min_child_samples": 71,
    "lambda_l1": 1.6894405269684752e-05,
    "lambda_l2": 2.5479861382375376e-06,
    "reg_alpha": 0.5305753243887574,
    "reg_lambda": 0.7834468951359224,
    "learning_rate": 0.05897985586531902,
    "feature_fraction": 0.608096917733464,
    "subsample": 0.0823584299376674,
    "colsample_bytree": 0.6951910076532127,
    "bagging_fraction": 0.4298117967213491,
    "bagging_freq": 5,
    "min_child_weight": 0.0011663326986012653,
    "min_split_gain": 0.553012365836588,
    "min_data_in_leaf": 1,
    "max_delta_step": 4,
}

model_tuned_2 = LGBMClassifier(
    objective="binary",
    metric="auc",
    random_state=1,
    class_weight="balanced",
    n_jobs=-1,
    device="gpu",
    **params_tuned_2
)

pipeline_tuned_2 = Pipeline(
    steps=[
        ("selector_1", ColumnSelector(cols_to_include_in_preprocessing)),
        ("preprocessor_2", clone(pre_processing)),
        ("selector_2", ColumnSelector(features_to_tune)),
        ("classifier", model_tuned_2),
    ]
)

pipeline_tuned_2.fit(X_credit_train, y_credit_train)
[LightGBM] [Warning] min_data_in_leaf is set=1, min_child_samples=71 will be ignored. Current value: min_data_in_leaf=1
[LightGBM] [Warning] feature_fraction is set=0.608096917733464, colsample_bytree=0.6951910076532127 will be ignored. Current value: feature_fraction=0.608096917733464
[LightGBM] [Warning] lambda_l1 is set=1.6894405269684752e-05, reg_alpha=0.5305753243887574 will be ignored. Current value: lambda_l1=1.6894405269684752e-05
[LightGBM] [Warning] lambda_l2 is set=2.5479861382375376e-06, reg_lambda=0.7834468951359224 will be ignored. Current value: lambda_l2=2.5479861382375376e-06
[LightGBM] [Warning] bagging_fraction is set=0.4298117967213491, subsample=0.0823584299376674 will be ignored. Current value: bagging_fraction=0.4298117967213491
[LightGBM] [Warning] bagging_freq is set=5, subsample_freq=0 will be ignored. Current value: bagging_freq=5
[LightGBM] [Info] Number of positive: 17377, number of negative: 197880
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 27981
[LightGBM] [Info] Number of data points in the train set: 215257, number of used features: 216
[LightGBM] [Info] Using GPU Device: NVIDIA GeForce GTX 950M, Vendor: NVIDIA Corporation
[LightGBM] [Info] Compiling OpenCL Kernel with 256 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 8
[LightGBM] [Info] 104 dense feature groups (21.35 MB) transferred to GPU in 0.051113 secs. 1 sparse feature groups
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=-0.000000
[LightGBM] [Info] Start training from score -0.000000
Pipeline(steps=[('selector_1',
                 ColumnSelector(keep=['AMT_ANNUITY', 'AMT_CREDIT',
                                      'AMT_INCOME_TOTAL',
                                      'AMT_REQ_CREDIT_BUREAU_DAY',
                                      'AMT_REQ_CREDIT_BUREAU_HOUR',
                                      'AMT_REQ_CREDIT_BUREAU_MON',
                                      'AMT_REQ_CREDIT_BUREAU_QRT',
                                      'AMT_REQ_CREDIT_BUREAU_WEEK',
                                      'AMT_REQ_CREDIT_BUREAU_YEAR',
                                      'BASEMENTAREA_MODE', 'CNT_FAM_MEMBERS',
                                      'COMMONAREA_MEDI', 'DAYS_ID_PUBLISH',
                                      'DAYS_LA...
                                learning_rate=0.05897985586531902,
                                max_delta_step=4, max_depth=12, metric='auc',
                                min_child_samples=71,
                                min_child_weight=0.0011663326986012653,
                                min_data_in_leaf=1,
                                min_split_gain=0.553012365836588,
                                n_estimators=1000, n_jobs=-1, num_leaves=171,
                                objective='binary', random_state=1,
                                reg_alpha=0.5305753243887574,
                                reg_lambda=0.7834468951359224,
                                subsample=0.0823584299376674))])
Code
models["LGBM (216 feat. | tuned)"] = pipeline_tuned_2
Code
performance_train_2 = ml.classification_scores(
    models,
    X_credit_train,
    y_credit_train,
    color="orange",
    sort_by="ROC_AUC",
)
[LightGBM] [Warning] min_data_in_leaf is set=1, min_child_samples=71 will be ignored. Current value: min_data_in_leaf=1
[LightGBM] [Warning] feature_fraction is set=0.608096917733464, colsample_bytree=0.6951910076532127 will be ignored. Current value: feature_fraction=0.608096917733464
[LightGBM] [Warning] lambda_l1 is set=1.6894405269684752e-05, reg_alpha=0.5305753243887574 will be ignored. Current value: lambda_l1=1.6894405269684752e-05
[LightGBM] [Warning] lambda_l2 is set=2.5479861382375376e-06, reg_lambda=0.7834468951359224 will be ignored. Current value: lambda_l2=2.5479861382375376e-06
[LightGBM] [Warning] bagging_fraction is set=0.4298117967213491, subsample=0.0823584299376674 will be ignored. Current value: bagging_fraction=0.4298117967213491
[LightGBM] [Warning] bagging_freq is set=5, subsample_freq=0 will be ignored. Current value: bagging_freq=5
Code
performance_validation_2 = ml.classification_scores(
    models,
    X_credit_validation,
    y_credit_validation,
    sort_by="ROC_AUC",
)
[LightGBM] [Warning] min_data_in_leaf is set=1, min_child_samples=71 will be ignored. Current value: min_data_in_leaf=1
[LightGBM] [Warning] feature_fraction is set=0.608096917733464, colsample_bytree=0.6951910076532127 will be ignored. Current value: feature_fraction=0.608096917733464
[LightGBM] [Warning] lambda_l1 is set=1.6894405269684752e-05, reg_alpha=0.5305753243887574 will be ignored. Current value: lambda_l1=1.6894405269684752e-05
[LightGBM] [Warning] lambda_l2 is set=2.5479861382375376e-06, reg_lambda=0.7834468951359224 will be ignored. Current value: lambda_l2=2.5479861382375376e-06
[LightGBM] [Warning] bagging_fraction is set=0.4298117967213491, subsample=0.0823584299376674 will be ignored. Current value: bagging_fraction=0.4298117967213491
[LightGBM] [Warning] bagging_freq is set=5, subsample_freq=0 will be ignored. Current value: bagging_freq=5
Code
print("--- Train ---")
performance_train_2
--- Train ---
Model                          n  No_info_rate  Accuracy   BAcc  BAcc_01     F1  F1_neg    TPR    TNR    PPV    NPV  ROC_AUC
LGBM (216 feat. | tuned)  215257         0.919     0.991  0.995    0.990  0.947   0.995  1.000  0.990  0.900  1.000    1.000
LGBM (216 features)       215257         0.919     0.737  0.747    0.493  0.317   0.837  0.759  0.735  0.201  0.972    0.827
LGBM (167 features)       215257         0.919     0.737  0.747    0.495  0.318   0.837  0.759  0.735  0.201  0.972    0.827
LGBM (FULL | 361 feat.)   215257         0.919     0.737  0.748    0.495  0.318   0.837  0.760  0.735  0.201  0.972    0.827
LGBM (183 features)       215257         0.919     0.737  0.748    0.496  0.318   0.837  0.761  0.735  0.201  0.972    0.827
LGBM (140 features)       215257         0.919     0.737  0.748    0.496  0.319   0.837  0.761  0.735  0.201  0.972    0.827
LGBM (120 features)       215257         0.919     0.737  0.746    0.492  0.317   0.837  0.757  0.735  0.201  0.972    0.826
LGBM (105 features)       215257         0.919     0.736  0.746    0.493  0.317   0.836  0.759  0.734  0.200  0.972    0.826
LGBM (84 features)        215257         0.919     0.734  0.744    0.488  0.315   0.835  0.755  0.733  0.199  0.971    0.824
LGBM (66 features)        215257         0.919     0.733  0.743    0.486  0.313   0.834  0.756  0.731  0.198  0.971    0.823
LGBM (47 features)        215257         0.919     0.730  0.741    0.481  0.311   0.832  0.753  0.728  0.196  0.971    0.819
LGBM (29 features)        215257         0.919     0.724  0.734    0.468  0.304   0.828  0.746  0.722  0.191  0.970    0.812
LGBM (20 features)        215257         0.919     0.720  0.731    0.462  0.300   0.825  0.744  0.718  0.188  0.970    0.807
LGBM (12 features)        215257         0.919     0.711  0.719    0.437  0.289   0.819  0.728  0.710  0.180  0.967    0.795
LGBM (5 features)         215257         0.919     0.701  0.701    0.403  0.275   0.812  0.702  0.701  0.171  0.964    0.777
Code
print("--- Validation ---")
performance_validation_2
--- Validation ---
Model                          n  No_info_rate  Accuracy   BAcc  BAcc_01     F1  F1_neg    TPR    TNR    PPV    NPV  ROC_AUC
LGBM (216 features)        46127         0.919     0.726  0.709    0.419  0.289   0.830  0.689  0.730  0.183  0.964    0.779
LGBM (183 features)        46127         0.919     0.724  0.705    0.411  0.286   0.829  0.682  0.728  0.181  0.963    0.779
LGBM (167 features)        46127         0.919     0.727  0.706    0.412  0.287   0.831  0.681  0.731  0.182  0.963    0.778
LGBM (140 features)        46127         0.919     0.724  0.706    0.411  0.286   0.829  0.684  0.728  0.181  0.963    0.778
LGBM (FULL | 361 feat.)    46127         0.919     0.724  0.704    0.409  0.285   0.829  0.680  0.728  0.180  0.963    0.778
LGBM (120 features)        46127         0.919     0.726  0.706    0.413  0.287   0.831  0.683  0.730  0.182  0.963    0.777
LGBM (105 features)        46127         0.919     0.725  0.704    0.408  0.285   0.829  0.679  0.729  0.180  0.963    0.777
LGBM (66 features)         46127         0.919     0.723  0.708    0.415  0.287   0.828  0.689  0.726  0.181  0.964    0.777
LGBM (84 features)         46127         0.919     0.722  0.706    0.412  0.285   0.828  0.687  0.726  0.180  0.963    0.776
LGBM (47 features)         46127         0.919     0.720  0.703    0.407  0.283   0.826  0.683  0.723  0.178  0.963    0.774
LGBM (29 features)         46127         0.919     0.714  0.700    0.399  0.278   0.822  0.682  0.717  0.175  0.963    0.770
LGBM (20 features)         46127         0.919     0.710  0.695    0.391  0.274   0.819  0.677  0.713  0.172  0.962    0.767
LGBM (12 features)         46127         0.919     0.705  0.693    0.386  0.271   0.815  0.679  0.707  0.169  0.962    0.760
LGBM (5 features)          46127         0.919     0.696  0.685    0.370  0.263   0.809  0.671  0.699  0.164  0.960    0.748
LGBM (216 feat. | tuned)   46127         0.919     0.893  0.596    0.193  0.268   0.942  0.243  0.950  0.298  0.935    0.746

7.7 Final Evaluation

After hyperparameter tuning, the trade-off between model complexity and performance was reconsidered. The tuned model overfitted the training data (train AUC = 1.000 vs. validation AUC = 0.746), so it was not used. Instead of the best-performing model based on 216 features (validation AUC = 0.779), a much simpler model based on 47 features with comparable performance (validation AUC = 0.774, lower by only 0.005) was chosen as the final model to be deployed.

On the test set, the final model based on these 47 features achieves AUC = 0.777. The best model that used no credit history data had AUC = 0.763, so adding this type of data improves AUC only slightly (by 0.014).
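
The comparisons above can be re-checked with a few lines of arithmetic. The sketch below is illustrative only: the AUC values are copied from the tables and text of this report, while the dictionary keys and variable names are made up for this example.

```python
# AUC values copied from this report (names are illustrative only).
auc = {
    "validation | 216 features": 0.779,
    "validation | 47 features": 0.774,
    "test | 47 features (final)": 0.777,
    "best model without credit history": 0.763,
}

# AUC given up by going from 216 to 47 features (validation set).
cost_of_simplification = round(
    auc["validation | 216 features"] - auc["validation | 47 features"], 3
)

# AUC gained by adding credit history data.
gain_from_history = round(
    auc["test | 47 features (final)"] - auc["best model without credit history"], 3
)

print(f"AUC lost by using 47 instead of 216 features: {cost_of_simplification}")
print(f"AUC gained by adding credit history data: {gain_from_history}")
```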

Code
features_47 = feature_importance_ext.head(47).col_name.to_list()

pipeline_final_2_with_47_feat = Pipeline(
    steps=[
        ("selector_1", ColumnSelector(cols_to_include_in_preprocessing)),
        ("preprocessor_2", clone(pre_processing)),
        ("selector_2", ColumnSelector(features_47)),
        ("classifier", clone(lgbm_classifier)),
    ]
)
Code
# For evaluation
X_credit_train_validation = pd.concat([X_credit_train, X_credit_validation])
y_credit_train_validation = pd.concat([y_credit_train, y_credit_validation])

models_final_2 = {}
models_final_2["LGBM (47 feat. | final)"] = pipeline_final_2_with_47_feat.fit(
    X_credit_train_validation, y_credit_train_validation
)
[LightGBM] [Info] Number of positive: 21101, number of negative: 240283
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 7853
[LightGBM] [Info] Number of data points in the train set: 261384, number of used features: 47
[LightGBM] [Info] Using GPU Device: NVIDIA GeForce GTX 950M, Vendor: NVIDIA Corporation
[LightGBM] [Info] Compiling OpenCL Kernel with 256 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 8
[LightGBM] [Info] 34 dense feature groups (8.97 MB) transferred to GPU in 0.026298 secs. 1 sparse feature groups
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=-0.000000
[LightGBM] [Info] Start training from score -0.000000
Code
print("--- Train ---")

ml.classification_scores(
    models_final_2,
    X_credit_train_validation,
    y_credit_train_validation,
    color="orange",
    sort_by="ROC_AUC",
)
--- Train ---
Model                          n  No_info_rate  Accuracy   BAcc  BAcc_01     F1  F1_neg    TPR    TNR    PPV    NPV  ROC_AUC
LGBM (47 feat. | final)   261384         0.919     0.726  0.734    0.469  0.305   0.829  0.744  0.724  0.192  0.970    0.813
Code of the figure
sns.set_style("white")
y_pred_train_val_2 = models_final_2["LGBM (47 feat. | final)"].predict(
    X_credit_train_validation
)
ml.plot_confusion_matrices(y_credit_train_validation, y_pred_train_val_2, figsize=(13, 3));
Fig. 7.1. Confusion matrices for the joint train and validation set.
Code
print("--- Test ---")

ml.classification_scores(
    models_final_2,
    X_credit_test,
    y_credit_test,
    sort_by="ROC_AUC",
)
--- Test ---
Model                          n  No_info_rate  Accuracy   BAcc  BAcc_01     F1  F1_neg    TPR    TNR    PPV    NPV  ROC_AUC
LGBM (47 feat. | final)    46127         0.919     0.722  0.710    0.420  0.288   0.827  0.697  0.724  0.181  0.964    0.777
Code of the figure
sns.set_style("white")
y_pred_test_2 = models_final_2["LGBM (47 feat. | final)"].predict(X_credit_test)
ml.plot_confusion_matrices(y_credit_test, y_pred_test_2, figsize=(13, 3));
Fig. 7.2. Confusion matrices for the test set.
Code
# SHAP values for the final model
@my.cache_results(dir_interim + "task-2--shap_lgbm_k=47-final.pickle")
def get_shap_values_lgbm_final_2():
    model = "LGBM (47 feat. | final)"
    preproc = Pipeline(steps=models_final_2[model].steps[:-1])
    classifier = models_final_2[model]["classifier"]
    X_test_preproc = preproc.transform(X_credit_test)

    tree_explainer = shap.TreeExplainer(classifier)
    shap_values = tree_explainer.shap_values(X_test_preproc)
    return shap_values, X_test_preproc


shap_values_lgbm_test_2, data_for_lgbm_test_2 = get_shap_values_lgbm_final_2()

feature_importance_test_2 = (
    pd.DataFrame(
        list(
            zip(
                data_for_lgbm_test_2.columns,
                np.abs(shap_values_lgbm_test_2).mean(0).mean(0),
            )
        ),
        columns=["col_name", "importance"],
    )
    .sort_values(by=["importance"], ascending=False)
    .reset_index()
)
LightGBM binary classifier with TreeExplainer shap values output has changed to a list of ndarray
Code
sns.set_style("whitegrid")
lgb.plot_importance(
    models_final_2["LGBM (47 feat. | final)"]["classifier"],
    max_num_features=50,
    figsize=(8, 9),
    height=0.8,
    title="LGBM Feature Importance (Final Model)",
);

Code
sns.set_style("whitegrid")
shap.summary_plot(
    shap_values_lgbm_test_2[1],
    data_for_lgbm_test_2,
    plot_type="bar",
    max_display=50,
    plot_size=(10, 6),
    show=False,
)
plt.tick_params(axis="y", labelsize=8)
plt.title("SHAP Feature Importance (Final Model)", fontsize=12)
plt.xlabel("Mean (|SHAP value|) \n(impact on model output)", fontsize=10);

Code
sns.set_style("whitegrid")
shap.summary_plot(
    shap_values_lgbm_test_2[1],
    data_for_lgbm_test_2,
    max_display=50,
    plot_size=(10, 6),
    show=False,
)
plt.title("SHAP Feature Importance (Final Model)", fontsize=12)
plt.tick_params(axis="y", labelsize=8)
plt.tick_params(axis="x", labelsize=8)

7.8 Model for Deployment (w/ Historical Data)

Merge all data to train the final model:

Code
# For deployment
X_credit_all = pd.concat([X_credit_train, X_credit_validation, X_credit_test], axis=0)
y_credit_all = pd.concat([y_credit_train, y_credit_validation, y_credit_test], axis=0)

pipeline_to_deploy_2 = clone(pipeline_final_2_with_47_feat)
pipeline_to_deploy_2 = pipeline_to_deploy_2.fit(X_credit_all, y_credit_all)
[LightGBM] [Info] Number of positive: 24825, number of negative: 282686
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 7851
[LightGBM] [Info] Number of data points in the train set: 307511, number of used features: 47
[LightGBM] [Info] Using GPU Device: NVIDIA GeForce GTX 950M, Vendor: NVIDIA Corporation
[LightGBM] [Info] Compiling OpenCL Kernel with 256 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 8
[LightGBM] [Info] 34 dense feature groups (10.56 MB) transferred to GPU in 0.029807 secs. 1 sparse feature groups
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=0.000000
[LightGBM] [Info] Start training from score 0.000000

For simplicity, the model will be deployed without the pre-processing pipeline, so only the fitted classifier is saved.

Code
# Extract and save classifier
classifier_to_deploy_2 = pipeline_to_deploy_2.named_steps["classifier"]

with open("models/classifier-2--with_credit_history.pickle", "wb") as f:
    joblib.dump(classifier_to_deploy_2, f)
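
Because only the classifier step is saved, whatever calls the deployed model presumably has to supply the 47 features already pre-processed and in the column order used during training. Below is a minimal sketch of such a check; it is not part of the original project. The function name `to_feature_row`, the shortened feature list, and the payload values are hypothetical. In practice, the expected names would have to come from the fitted model (LightGBM's scikit-learn wrapper exposes them as `feature_name_`).

```python
from typing import List, Mapping, Sequence


def to_feature_row(
    payload: Mapping[str, float], expected_features: Sequence[str]
) -> List[float]:
    """Order a request payload to match the training-time feature order."""
    missing = [name for name in expected_features if name not in payload]
    if missing:
        raise ValueError(f"Missing features: {missing}")
    # Extra keys in the payload are ignored; order follows expected_features.
    return [float(payload[name]) for name in expected_features]


# Hypothetical, shortened feature list (the real model expects 47 features).
expected = ["AMT_ANNUITY", "AMT_CREDIT", "AMT_INCOME_TOTAL"]
# Made-up values, given in a different order than the model expects.
payload = {"AMT_CREDIT": 450000, "AMT_INCOME_TOTAL": 135000, "AMT_ANNUITY": 22500}

row = to_feature_row(payload, expected)
print(row)
```

The resulting row could then be wrapped in a single-row 2D structure and passed to the `predict_proba` method of the loaded classifier.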

8 Final Remarks

  1. In binary classification, the default threshold of 0.5 was used.
    Threshold adjustment (e.g., via ROC curve analysis) might be beneficial.

  2. Hyperparameter tuning was not effective in this analysis: the tuned model overfitted the training data (train AUC = 1.000, validation AUC = 0.746). The overfitting should be addressed by limiting model complexity. Unfortunately, there was not enough time to do this.

  3. Only the LGBM model was used. Trying other model types (e.g., logistic regression, Naive Bayes, or neural networks) might be beneficial too.

  4. In some cases, more self-explanatory variable names could be used.

  5. There is a lot of repeated code between the modeling parts with and without historical data. This could be refactored, but doing so would have required much more time.

  6. Some results and lines of code could be described in more detail.

  7. Not all functions from the functions subfolder were used. They should be treated as a separate module.

  8. The requirements.txt file was created with the pip freeze > requirements.txt command. This is not the best way to create this file, as it lists more packages than are needed.

  9. A consultation with a field expert would be beneficial to understand the data better and to improve the model.
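
Remark 1 mentions threshold adjustment. One common option is to pick the threshold that maximizes Youden's J statistic (TPR + TNR - 1), which appears to correspond to the BAcc_01 score used in this report. The sketch below is a dependency-free illustration on made-up toy data, not the procedure used in this project; the function name `best_threshold_youden` is hypothetical.

```python
def best_threshold_youden(y_true, y_score):
    """Return (threshold, J) that maximizes Youden's J = TPR + TNR - 1."""
    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("Both classes must be present in y_true.")

    best_thr, best_j = 0.5, float("-inf")
    # Every distinct predicted score is a candidate threshold.
    for thr in sorted(set(y_score)):
        # A case is predicted positive when its score >= threshold.
        tp = sum(1 for y, s in zip(y_true, y_score) if y == 1 and s >= thr)
        tn = sum(1 for y, s in zip(y_true, y_score) if y == 0 and s < thr)
        j = tp / n_pos + tn / n_neg - 1
        if j > best_j:
            best_thr, best_j = thr, j
    return best_thr, best_j


# Toy data (made up): 1 = client with payment difficulties.
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
y_score = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.80]

threshold, j_stat = best_threshold_youden(y_true, y_score)
print(f"threshold = {threshold}, Youden's J = {j_stat:.3f}")
```

With a fitted pipeline, `y_score` would be the positive-class column of `predict_proba`, and the threshold should be chosen on validation data rather than on the test set. This loop is quadratic in the number of samples; for datasets of the size used here, `sklearn.metrics.roc_curve` computes the required TPR and FPR values far more efficiently.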