Lending Club Data Analysis

Data Analysis Project

Author

Vilmantas Gėgžna

Published

2023-11-30

Updated

2023-11-30

Lending Club project logo. Originally generated with Leonardo.Ai.

Annotation

In this project, a comprehensive analysis of Lending Club loan data was conducted. Changing over-time trends were identified thus to ensure relevance of the modeling phase, data from the most recent year, 2018, was exclusively utilized. The modeling process included two major tasks: predicting loan application status (accepted/rejected) and forecasting key attributes of accepted loans, including grade, sub-grade, and interest rate.

The development of four distinct models was executed with a thorough approach, addressing challenges such as group imbalance and data size. Rigorous procedures were employed to refine and optimize each model. Subsequently, the most effective models were selected for deployment on the Google Cloud Platform (GCP).

While the models have been successfully deployed and are currently accessible through an API, ongoing efforts for refinement and enhancement are acknowledged. Continuous improvement remains a priority to ensure the models’ accuracy and effectiveness over time.

1 Data

1.1 Explore Data Files

The dataset comes in two files: one for the accepted loans and one for the rejected ones. The dimensions and size of the data are as follows:

2.3M rows, 151 columns, and 1.6 GB of accepted loans’ data.
27.6M rows, 9 columns, and 1.7 GB of rejected loans’ data.

Regarding the size of data, some optimizations will be used to reduce the memory footprint (e.g., more efficient data types and only necessary columns will be used).

The data spans from 2007 to 2018.

As in different files the number of variables is different and variable names do not match, it seems that the best-matching variables are those, listed in Table 1.1.

Table 1.1. The names of matching variables in the datasets of accepted and rejected loans.

Variable	Name in “Accepted”	Name in “Rejected”
Loan Amount	`loan_amnt`	`Amount Requested`
Application Date	`issue_d`	`Application Date`
Loan Title	`title`	`Loan Title`
Risk Score	`fico_range_low` and `fico_range_high`	`Risk_Score`
Debt-To-Income Ratio	`dti`	`Debt-To-Income Ratio`
Zip Code	`zip_code`	`Zip Code`
State	`addr_state`	`State`
Employment Length	`emp_length`	`Employment Length`
Policy Code	`policy_code`	`Policy Code`

Details:

Code

!echo Data file names:
!ls data/raw/

Data file names:
accepted_2007_to_2018Q4.csv
rejected_2007_to_2018Q4.csv

Code

!echo File sizes:
!cd data/raw/ &&\
    du -m accepted_2007_to_2018Q4.csv rejected_2007_to_2018Q4.csv |\
    sed 's/\([0-9]\+\)/\1 MB /'

File sizes:
1598 MB     accepted_2007_to_2018Q4.csv
1700 MB     rejected_2007_to_2018Q4.csv

Code

# NOTE: header line is also included here
!echo Number of lines per file:
!cd data/raw/ &&\
    wc --lines accepted_2007_to_2018Q4.csv rejected_2007_to_2018Q4.csv

Number of lines per file:
   2260702 accepted_2007_to_2018Q4.csv
  27648742 rejected_2007_to_2018Q4.csv
  29909444 total

Code

!echo Number of columns per file:

!cd data/raw/ &&\
    (csvcut -n accepted_2007_to_2018Q4.csv | wc -l | xargs printf "%5d\n" &&\
          echo accepted_2007_to_2018Q4.csv) |\
    paste -s -d ' '

!cd data/raw/ &&\
    (csvcut -n rejected_2007_to_2018Q4.csv | wc -l | xargs printf "%5d\n" &&\
          echo rejected_2007_to_2018Q4.csv) |\
    paste -s -d ' '

Number of columns per file:
  151 accepted_2007_to_2018Q4.csv 
    9 rejected_2007_to_2018Q4.csv

A few top rows (formatted as table) of each file:

Code

!cd data/raw/ &&\
    head -n 5 accepted_2007_to_2018Q4.csv | csvlook

|         id | member_id | loan_amnt | funded_amnt | funded_amnt_inv |       term | int_rate | installment | grade | sub_grade | emp_title                   | emp_length | home_ownership | annual_inc | verification_status | issue_d  | loan_status | pymnt_plan | url                                                               | desc | purpose            | title              | zip_code | addr_state |   dti | delinq_2yrs | earliest_cr_line | fico_range_low | fico_range_high | inq_last_6mths | mths_since_last_delinq | mths_since_last_record | open_acc | pub_rec | revol_bal | revol_util | total_acc | initial_list_status | out_prncp | out_prncp_inv | total_pymnt | total_pymnt_inv | total_rec_prncp | total_rec_int | total_rec_late_fee | recoveries | collection_recovery_fee | last_pymnt_d | last_pymnt_amnt | next_pymnt_d | last_credit_pull_d | last_fico_range_high | last_fico_range_low | collections_12_mths_ex_med | mths_since_last_major_derog | policy_code | application_type | annual_inc_joint | dti_joint | verification_status_joint | acc_now_delinq | tot_coll_amt | tot_cur_bal | open_acc_6m | open_act_il | open_il_12m | open_il_24m | mths_since_rcnt_il | total_bal_il | il_util | open_rv_12m | open_rv_24m | max_bal_bc | all_util | total_rev_hi_lim | inq_fi | total_cu_tl | inq_last_12m | acc_open_past_24mths | avg_cur_bal | bc_open_to_buy | bc_util | chargeoff_within_12_mths | delinq_amnt | mo_sin_old_il_acct | mo_sin_old_rev_tl_op | mo_sin_rcnt_rev_tl_op | mo_sin_rcnt_tl | mort_acc | mths_since_recent_bc | mths_since_recent_bc_dlq | mths_since_recent_inq | mths_since_recent_revol_delinq | num_accts_ever_120_pd | num_actv_bc_tl | num_actv_rev_tl | num_bc_sats | num_bc_tl | num_il_tl | num_op_rev_tl | num_rev_accts | num_rev_tl_bal_gt_0 | num_sats | num_tl_120dpd_2m | num_tl_30dpd | num_tl_90g_dpd_24m | num_tl_op_past_12m | pct_tl_nvr_dlq | percent_bc_gt_75 | pub_rec_bankruptcies | tax_liens | tot_hi_cred_lim | total_bal_ex_mort | total_bc_limit | total_il_high_credit_limit | revol_bal_joint | sec_app_fico_range_low | sec_app_fico_range_high | sec_app_earliest_cr_line | sec_app_inq_last_6mths | sec_app_mort_acc | sec_app_open_acc | sec_app_revol_util | sec_app_open_act_il | sec_app_num_rev_accts | sec_app_chargeoff_within_12_mths | sec_app_collections_12_mths_ex_med | sec_app_mths_since_last_major_derog | hardship_flag | hardship_type | hardship_reason | hardship_status | deferral_term | hardship_amount | hardship_start_date | hardship_end_date | payment_plan_start_date | hardship_length | hardship_dpd | hardship_loan_status | orig_projected_additional_accrued_interest | hardship_payoff_balance_amount | hardship_last_payment_amount | disbursement_method | debt_settlement_flag | debt_settlement_flag_date | settlement_status | settlement_date | settlement_amount | settlement_percentage | settlement_term |
| ---------- | --------- | --------- | ----------- | --------------- | ---------- | -------- | ----------- | ----- | --------- | --------------------------- | ---------- | -------------- | ---------- | ------------------- | -------- | ----------- | ---------- | ----------------------------------------------------------------- | ---- | ------------------ | ------------------ | -------- | ---------- | ----- | ----------- | ---------------- | -------------- | --------------- | -------------- | ---------------------- | ---------------------- | -------- | ------- | --------- | ---------- | --------- | ------------------- | --------- | ------------- | ----------- | --------------- | --------------- | ------------- | ------------------ | ---------- | ----------------------- | ------------ | --------------- | ------------ | ------------------ | -------------------- | ------------------- | -------------------------- | --------------------------- | ----------- | ---------------- | ---------------- | --------- | ------------------------- | -------------- | ------------ | ----------- | ----------- | ----------- | ----------- | ----------- | ------------------ | ------------ | ------- | ----------- | ----------- | ---------- | -------- | ---------------- | ------ | ----------- | ------------ | -------------------- | ----------- | -------------- | ------- | ------------------------ | ----------- | ------------------ | -------------------- | --------------------- | -------------- | -------- | -------------------- | ------------------------ | --------------------- | ------------------------------ | --------------------- | -------------- | --------------- | ----------- | --------- | --------- | ------------- | ------------- | ------------------- | -------- | ---------------- | ------------ | ------------------ | ------------------ | -------------- | ---------------- | -------------------- | --------- | --------------- | ----------------- | -------------- | -------------------------- | --------------- | ---------------------- | ----------------------- | ------------------------ | ---------------------- | ---------------- | ---------------- | ------------------ | ------------------- | --------------------- | -------------------------------- | ---------------------------------- | ----------------------------------- | ------------- | ------------- | --------------- | --------------- | ------------- | --------------- | ------------------- | ----------------- | ----------------------- | --------------- | ------------ | -------------------- | ------------------------------------------ | ------------------------------ | ---------------------------- | ------------------- | -------------------- | ------------------------- | ----------------- | --------------- | ----------------- | --------------------- | --------------- |
| 68,407,277 |           |     3,600 |       3,600 |           3,600 | 0004-01-01 |    13.99 |      123.03 | C     | C4        | leadman                     | 10+ years  | MORTGAGE       |     55,000 | Not Verified        | Dec-2015 | Fully Paid  |      False | https://lendingclub.com/browse/loanDetail.action?loan_id=68407277 |      | debt_consolidation | Debt consolidation | 190xx    | PA         |  5.91 |           0 | Aug-2003         |            675 |             679 |              1 |                     30 |                        |        7 |       0 |     2,765 |       29.7 |        13 | w                   |      0.00 |          0.00 |  4,421.724… |        4,421.72 |        3,600.00 |        821.72 |                  0 |          0 |                       0 | Jan-2019     |          122.67 |              | Mar-2019           |                  564 |                 560 |                          0 |                          30 |           1 | Individual       |                  |           |                           |              0 |          722 |     144,904 |           2 |           2 |           0 |           1 |                 21 |        4,981 |      36 |           3 |           3 |        722 |       34 |            9,300 |      3 |           1 |            4 |                    4 |      20,701 |          1,506 |    37.2 |                        0 |           0 |                148 |                  128 |                     3 |              3 |        1 |                    4 |                       69 |                     4 |                             69 |                     2 |              2 |               4 |           2 |         5 |         3 |             4 |             9 |                   4 |        7 |                0 |            0 |                  0 |                  3 |           76.9 |              0.0 |                    0 |         0 |         178,050 |             7,746 |          2,400 |                     13,734 |                 |                        |                         |                          |                        |                  |                  |                    |                     |                       |                                  |                                    |                                     |         False |               |                 |                 |               |                 |                     |                   |                         |                 |              |                      |                                            |                                |                              | Cash                |                False |                           |                   |                 |                   |                       |                 |
| 68,355,089 |           |    24,700 |      24,700 |          24,700 | 0004-01-01 |    11.99 |      820.28 | C     | C1        | Engineer                    | 10+ years  | MORTGAGE       |     65,000 | Not Verified        | Dec-2015 | Fully Paid  |      False | https://lendingclub.com/browse/loanDetail.action?loan_id=68355089 |      | small_business     | Business           | 577xx    | SD         | 16.06 |           1 | Dec-1999         |            715 |             719 |              4 |                      6 |                        |       22 |       0 |    21,470 |       19.2 |        38 | w                   |      0.00 |          0.00 | 25,679.660… |       25,679.66 |       24,700.00 |        979.66 |                  0 |          0 |                       0 | Jun-2016     |          926.35 |              | Mar-2019           |                  699 |                 695 |                          0 |                             |           1 | Individual       |                  |           |                           |              0 |            0 |     204,396 |           1 |           1 |           0 |           1 |                 19 |       18,005 |      73 |           2 |           3 |      6,472 |       29 |          111,800 |      0 |           0 |            6 |                    4 |       9,733 |         57,830 |    27.1 |                        0 |           0 |                113 |                  192 |                     2 |              2 |        4 |                    2 |                          |                     0 |                              6 |                     0 |              5 |               5 |          13 |        17 |         6 |            20 |            27 |                   5 |       22 |                0 |            0 |                  0 |                  2 |           97.4 |              7.7 |                    0 |         0 |         314,017 |            39,475 |         79,300 |                     24,667 |                 |                        |                         |                          |                        |                  |                  |                    |                     |                       |                                  |                                    |                                     |         False |               |                 |                 |               |                 |                     |                   |                         |                 |              |                      |                                            |                                |                              | Cash                |                False |                           |                   |                 |                   |                       |                 |
| 68,341,763 |           |    20,000 |      20,000 |          20,000 | 0006-01-01 |    10.78 |      432.66 | B     | B4        | truck driver                | 10+ years  | MORTGAGE       |     63,000 | Not Verified        | Dec-2015 | Fully Paid  |      False | https://lendingclub.com/browse/loanDetail.action?loan_id=68341763 |      | home_improvement   |                    | 605xx    | IL         | 10.78 |           0 | Aug-2000         |            695 |             699 |              0 |                        |                        |        6 |       0 |     7,869 |       56.2 |        18 | w                   |      0.00 |          0.00 | 22,705.924… |       22,705.92 |       20,000.00 |      2,705.92 |                  0 |          0 |                       0 | Jun-2017     |       15,813.30 |              | Mar-2019           |                  704 |                 700 |                          0 |                             |           1 | Joint App        |           71,000 |     13.85 | Not Verified              |              0 |            0 |     189,699 |           0 |           1 |           0 |           4 |                 19 |       10,827 |      73 |           0 |           2 |      2,081 |       65 |           14,000 |      2 |           5 |            1 |                    6 |      31,617 |          2,737 |    55.9 |                        0 |           0 |                125 |                  184 |                    14 |             14 |        5 |                  101 |                          |                    10 |                                |                     0 |              2 |               3 |           2 |         4 |         6 |             4 |             7 |                   3 |        6 |                0 |            0 |                  0 |                  0 |          100.0 |             50.0 |                    0 |         0 |         218,418 |            18,696 |          6,200 |                     14,877 |                 |                        |                         |                          |                        |                  |                  |                    |                     |                       |                                  |                                    |                                     |         False |               |                 |                 |               |                 |                     |                   |                         |                 |              |                      |                                            |                                |                              | Cash                |                False |                           |                   |                 |                   |                       |                 |
| 66,310,712 |           |    35,000 |      35,000 |          35,000 | 0006-01-01 |    14.85 |      829.90 | C     | C5        | Information Systems Officer | 10+ years  | MORTGAGE       |    110,000 | Source Verified     | Dec-2015 | Current     |      False | https://lendingclub.com/browse/loanDetail.action?loan_id=66310712 |      | debt_consolidation | Debt consolidation | 076xx    | NJ         | 17.06 |           0 | Sep-2008         |            785 |             789 |              0 |                        |                        |       13 |       0 |     7,802 |       11.6 |        17 | w                   | 15,897.65 |     15,897.65 | 31,464.010… |       31,464.01 |       19,102.35 |     12,361.66 |                  0 |          0 |                       0 | Feb-2019     |          829.90 | Apr-2019     | Mar-2019           |                  679 |                 675 |                          0 |                             |           1 | Individual       |                  |           |                           |              0 |            0 |     301,500 |           1 |           1 |           0 |           1 |                 23 |       12,609 |      70 |           1 |           1 |      6,987 |       45 |           67,300 |      0 |           1 |            0 |                    2 |      23,192 |         54,962 |    12.1 |                        0 |           0 |                 36 |                   87 |                     2 |              2 |        1 |                    2 |                          |                       |                                |                     0 |              4 |               5 |           8 |        10 |         2 |            10 |            13 |                   5 |       13 |                0 |            0 |                  0 |                  1 |          100.0 |              0.0 |                    0 |         0 |         381,215 |            52,226 |         62,500 |                     18,000 |                 |                        |                         |                          |                        |                  |                  |                    |                     |                       |                                  |                                    |                                     |         False |               |                 |                 |               |                 |                     |                   |                         |                 |              |                      |                                            |                                |                              | Cash                |                False |                           |                   |                 |                   |                       |                 |

Code

!cd data/raw/ &&\
    head -n 5 rejected_2007_to_2018Q4.csv | csvlook

| Amount Requested | Application Date | Loan Title                       | Risk_Score | Debt-To-Income Ratio | Zip Code | State | Employment Length | Policy Code |
| ---------------- | ---------------- | -------------------------------- | ---------- | -------------------- | -------- | ----- | ----------------- | ----------- |
|            1,000 |       2007-05-26 | Wedding Covered but No Honeymoon |        693 |                10.00 | 481xx    | NM    | 4 years           |           0 |
|            1,000 |       2007-05-26 | Consolidating Debt               |        703 |                10.00 | 010xx    | MA    | < 1 year          |           0 |
|           11,000 |       2007-05-27 | Want to consolidate my debt      |        715 |                10.00 | 212xx    | MD    | 1 year            |           0 |
|            6,000 |       2007-05-27 | waksman                          |        698 |                38.64 | 017xx    | MA    | < 1 year          |           0 |

Only the matching variables (Table 1.1) from the dataset of accepted loans:

Code

!cd data/raw/ &&\
    head -n 5 accepted_2007_to_2018Q4.csv |\
    csvcut -c loan_amnt,issue_d,title,fico_range_low,fico_range_high,dti,zip_code,addr_state,emp_length,policy_code |\
    csvlook

| loan_amnt | issue_d  | title              | fico_range_low | fico_range_high |   dti | zip_code | addr_state | emp_length | policy_code |
| --------- | -------- | ------------------ | -------------- | --------------- | ----- | -------- | ---------- | ---------- | ----------- |
|     3,600 | Dec-2015 | Debt consolidation |            675 |             679 |  5.91 | 190xx    | PA         | 10+ years  |           1 |
|    24,700 | Dec-2015 | Business           |            715 |             719 | 16.06 | 577xx    | SD         | 10+ years  |           1 |
|    20,000 | Dec-2015 |                    |            695 |             699 | 10.78 | 605xx    | IL         | 10+ years  |           1 |
|    35,000 | Dec-2015 | Debt consolidation |            785 |             789 | 17.06 | 076xx    | NJ         | 10+ years  |           1 |

1.2 Variable Description

The description of abbreviated variable names can be found at https://figshare.com/articles/dataset/Lending_club_dataset_description/20016077 (last checked 2023-11-26).

Variables, such as ZIP code, will be explained below.

1.2.1 ZIP Codes

In USA, postal codes are called ZIP (stands for “zone improvement plan”) codes. The meaning of the first 5 digits of the ZIP code are illustrated below.

The meaning of USA ZIP (postal) code digits (source)

The illustration of what the first digit means is shown below.

The USA national ares defined by the 1st digit in ZIP code (source).

2 Python Packages and Functions

The next cells contain the main Python packages and functions, which will be used in the analysis.

# Automatically reload certain modules
%reload_ext autoreload
%autoreload 1

# Plotting
%matplotlib inline

# Packages and modules -------------------------------
# Utilities
import os
import re
import warnings
import numpy as np

# Dataframes
import pandas as pd

# EDA and plotting
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter

import sweetviz
import klib

# Data wrangling, maths, feature engineering
import numpy as np

# Patch sklearn with Intel's version
from sklearnex import patch_sklearn

patch_sklearn()  # Run this code before importing from sklearn

# Machine learning
import lightgbm as lgb
from sklearn import set_config

from sklearn.base import clone
from sklearn.pipeline import Pipeline
from sklearn.compose import (
    ColumnTransformer,
    make_column_selector,
)
from sklearn.preprocessing import (StandardScaler, OneHotEncoder)
from sklearn.impute import SimpleImputer

from sklearn.model_selection import train_test_split

# ML: classification models
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import VotingClassifier, RandomForestRegressor
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import SGDClassifier, SGDRegressor
from lightgbm import LGBMClassifier, LGBMRegressor

# ML: feature selection
from feature_engine.selection import DropFeatures

from feature_engine.creation import CyclicalFeatures

# ML: explainability
import shap

# Display
from IPython.display import display

# Custom functions
import functions.fun_utils as my
import functions.fun_analysis as an
import functions.fun_ml as ml
import functions.utils as utils
from functions.utils import (
    ColumnSelector,
    PreprocessorForGrades,
    PreprocessorForSubgrades,
    PreprocessorForInterestRates,
)

%aimport functions.fun_utils
%aimport functions.fun_analysis
%aimport functions.fun_ml
%aimport functions.utils

# Settings --------------------------------------------
# Default plot options
plt.rc("figure", titleweight="bold")
plt.rc("axes", labelweight="bold", titleweight="bold")
plt.rc("font", weight="normal", size=10)
plt.rc("figure", figsize=(10, 3))

# Pandas options
pd.set_option("display.max_rows", 1000)
pd.set_option("display.max_columns", 300)
pd.set_option("display.max_colwidth", 50)  # Possible option: None
pd.set_option("display.float_format", lambda x: f"{x:.2f}")
pd.set_option("styler.format.thousands", ",")

# Turn off the scientific notation for floating point numbers.
np.set_printoptions(suppress=True)

# Scikit-learn options
set_config(transform_output="pandas")

# Analysis parameters: use Sweetviz for eda?
do_eda = True

# For caching results ---------------------------------
dir_interim = "data/interim/"
os.mkdir(dir_interim) if not os.path.exists(dir_interim) else None

Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)

Many custom functions are defined in separate files and are imported as modules. Some functions specific to this project are defined in the next cell. There will be even more ad-hoc convenience functions defined in the analysis (as it seemed a more appropriate place to define them there).

def axis_formatter(format="M", precision=0):
    """Return a function for formatting axis labels.

    Args:
        format (str): The format to use. Either 'M', 'k', or ''.
        precision (int): The number of decimal places to use.

    """
    if format == "M":
        power = 1e-6
    elif format == "k":
        power = 1e-3
    elif format == "":
        power = 1
    else:
        raise ValueError("format must be either 'M', 'k', or ''.")

    def formatter(x, pos):
        return f"{x * power:1.{precision}f}{format}"

    return FuncFormatter(formatter)


def create_column_selector_pattern(column_names):
    """Creates a regex pattern that selects columns matching the column names.

    Given a list of column names, creates a regex pattern that selects columns
    matching the column names.

    Args:
    - column_names: list of str, names of columns to select

    Returns:
    - pattern: str, regex pattern that selects columns matching the column names

    Example:
    >>> column_names = ['age', 'gender', 'income']
    >>> pattern = create_column_selector_pattern(column_names)
    # pattern will be '^(?:age|gender|income)$'
    # This pattern can be used as the value of `pattern` argument of
        ScikitLearn's `make_column_selector()`.
    """
    if isinstance(column_names, str):
        column_names = [column_names]

    if not isinstance(column_names, list):
        raise TypeError("column_names must be either a list or a string.")

    pattern = "|".join(column_names)
    pattern = "^(?:" + pattern + ")$"
    return pattern


def str_to_list(x):
    """Convert a docstring to a list of words."""
    return re.findall(r"\w+", x)

3 Task 1: Predicting Loan Status

In this part, the focus is on predicting whether a loan will be accepted or rejected.

3.1 Import and Inspect Data

First, a subset of columns from the dataset of accepted loans will be imported as ds_accepted. This subset will contain the columns that match the rejected loans’ data (see Table 1.1) with an additional column on income information to perform verification related to the debt-to-income ratio column. Next, the dataset of rejected loans will be imported as ds_rejected. For both datasets, column data types will be pre-selected according to the preliminary investigation of data files. Non-matching data types for matching columns will be fixed after further inspection.

# fmt: off
# Columns for "Accepted" Dataset
columns_in_accepted_ds = [
    'loan_amnt',           # Loan Amount
    'issue_d',             # Application Date
    'title',               # Loan Title
    'fico_range_low',      # Risk Score (Low End)
    'fico_range_high',     # Risk Score (High End)
    'dti',                 # Debt-To-Income Ratio
    'zip_code',            # Zip Code
    'addr_state',          # State
    'emp_length',          # Employment Length
    'policy_code',         # Policy Code
    'annual_inc'           # Annual Income
]

# Define the data types for the selected columns
column_data_types_accepted_ds = {
    'loan_amnt': "float32",
    'issue_d': str,
    'title': str,
    'fico_range_low': "Int16",
    'fico_range_high': "Int16",
    'dti': "float32",
    'zip_code': "category",
    'addr_state': "category",
    'emp_length': "category",
    'policy_code': "Int8"
}

# Define a dictionary for column renaming
column_rename_dict = {
    'loan_amnt': 'loan_amount',
    'issue_d': 'date',
    'title': 'loan_title',
    'fico_range_low': 'risk_score_low',
    'fico_range_high': 'risk_score_high',
    'dti': 'debt_to_income_ratio',
    'zip_code': 'zip_code',
    'addr_state': 'state',
    'emp_length': 'employment_length',
    'policy_code': 'policy_code'
}
# fmt: on

# Read "accepted" dataset, reorder and rename columns
ds_accepted = pd.read_csv(
    "data/raw/accepted_2007_to_2018Q4.csv",
    usecols=columns_in_accepted_ds,
    dtype=column_data_types_accepted_ds,
)[columns_in_accepted_ds].rename(columns=column_rename_dict)


del columns_in_accepted_ds, column_data_types_accepted_ds, column_rename_dict

# Define the data types for the selected columns
column_data_types_rejected_ds = {
    "Amount Requested": "float32",
    "Application Date": str,
    "Loan Title": str,
    "Risk_Score": "Int16",
    "Debt-To-Income Ratio": str,
    "Zip Code": "category",
    "State": "category",
    "Employment Length": "category",
    "Policy Code": "Int8",
}

# Define a dictionary for column renaming
column_rename_dict = {
    "Amount Requested": "loan_amount",
    "Application Date": "date",
    "Loan Title": "loan_title",
    "Risk_Score": "risk_score",
    "Debt-To-Income Ratio": "debt_to_income_ratio",
    "Zip Code": "zip_code",
    "State": "state",
    "Employment Length": "employment_length",
    "Policy Code": "policy_code",
}

# Read the "rejected" dataset with data type conversion and column renaming
ds_rejected = pd.read_csv(
    "data/raw/rejected_2007_to_2018Q4.csv",
    dtype=column_data_types_rejected_ds,
).rename(columns=column_rename_dict)

del column_data_types_rejected_ds, column_rename_dict

The dataset of accepted loans currently has more columns as they are needed to do some calculations and verifications. The columns that are not needed will be dropped later.

Code

ds_accepted.shape

(2260701, 11)

Code

ds_rejected.shape

(27648741, 9)

Both datasets have duplicates:

Code

ds_accepted.duplicated().sum()

Code

ds_rejected.duplicated().sum()

Currently, dataset of accepted loans takes approximately 95 MB while dataset of rejected loans takes 976 MB of memory (~ 10x more). The columns with string data are the largest ones.

Code

ds_accepted.info(verbose=False)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2260701 entries, 0 to 2260700
Columns: 11 entries, loan_amount to annual_inc
dtypes: Int16(2), Int8(1), category(3), float32(2), float64(1), object(2)
memory usage: 94.9+ MB

Code

ds_rejected.info(verbose=False)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27648741 entries, 0 to 27648740
Columns: 9 entries, loan_amount to policy_code
dtypes: Int16(1), Int8(1), category(3), float32(1), object(3)
memory usage: 975.7+ MB

In both datasets, it is seen that policy_code is either a constant or almost constant variable.

In ds_accepted, the variable with the largest number of missing values is employment_length (6.5%), and in ds_rejected, risk_score has 66.9% of missing values.

Code

an.col_info(ds_accepted, style=True)

	column	data_type	memory_size	n_unique	p_unique	n_missing	p_missing	n_dominant	p_dominant	p_dom_excl_na	dominant
1	loan_amount	float32	9.0 MB	1,572	0.1%	33	<0.1%	187,236	8.3%	8.3%	10000.0
2	date	object	146.9 MB	139	<0.1%	33	<0.1%	61,992	2.7%	2.7%	Mar-2016
3	loan_title	object	167.9 MB	63,154	2.8%	23,359	1.0%	1,153,293	51.0%	51.5%	Debt consolidation
4	risk_score_low	Int16	6.8 MB	48	<0.1%	33	<0.1%	186,580	8.3%	8.3%	660
5	risk_score_high	Int16	6.8 MB	48	<0.1%	33	<0.1%	186,580	8.3%	8.3%	664
6	debt_to_income_ratio	float32	9.0 MB	10,845	0.5%	1,744	0.1%	1,732	0.1%	0.1%	0.0
7	zip_code	category	4.6 MB	956	<0.1%	34	<0.1%	23,908	1.1%	1.1%	112xx
8	state	category	2.3 MB	51	<0.1%	33	<0.1%	314,533	13.9%	13.9%	CA
9	employment_length	category	2.3 MB	11	<0.1%	146,940	6.5%	748,005	33.1%	35.4%	10+ years
10	policy_code	Int8	4.5 MB	1	<0.1%	33	<0.1%	2,260,668	>99.9%	100.0%	1
11	annual_inc	float64	18.1 MB	89,368	4.0%	37	<0.1%	87,189	3.9%	3.9%	60000.0

Code

an.col_info(ds_rejected, style=True)

	column	data_type	memory_size	n_unique	p_unique	n_missing	p_missing	n_dominant	p_dominant	p_dom_excl_na	dominant
1	loan_amount	float32	110.6 MB	3,640	<0.1%	0	0%	3,920,004	14.2%	14.2%	10000.0
2	date	object	1.9 GB	4,238	<0.1%	0	0%	42,112	0.2%	0.2%	2018-12-04
3	loan_title	object	2.0 GB	73,927	0.3%	1,305	<0.1%	6,418,016	23.2%	23.2%	Debt consolidation
4	risk_score	Int16	82.9 MB	692	<0.1%	18,497,630	66.9%	178,456	0.6%	2.0%	501
5	debt_to_income_ratio	object	1.7 GB	126,145	0.5%	0	0%	1,362,556	4.9%	4.9%	100%
6	zip_code	category	55.4 MB	1,001	<0.1%	293	<0.1%	267,102	1.0%	1.0%	112xx
7	state	category	27.7 MB	51	<0.1%	22	<0.1%	3,242,169	11.7%	11.7%	CA
8	employment_length	category	27.6 MB	11	<0.1%	951,355	3.4%	22,994,315	83.2%	86.1%	< 1 year
9	policy_code	Int8	55.3 MB	2	<0.1%	918	<0.1%	27,559,694	99.7%	99.7%	0

Code

ds_accepted.head()

	loan_amount	date	loan_title	risk_score_low	risk_score_high	debt_to_income_ratio	zip_code	state	employment_length	policy_code	annual_inc
0	3600.00	Dec-2015	Debt consolidation	675	679	5.91	190xx	PA	10+ years	1	55000.00
1	24700.00	Dec-2015	Business	715	719	16.06	577xx	SD	10+ years	1	65000.00
2	20000.00	Dec-2015	NaN	695	699	10.78	605xx	IL	10+ years	1	63000.00
3	35000.00	Dec-2015	Debt consolidation	785	789	17.06	076xx	NJ	10+ years	1	110000.00
4	10400.00	Dec-2015	Major purchase	695	699	25.37	174xx	PA	3 years	1	104433.00

Code

ds_rejected.head()

	loan_amount	date	loan_title	risk_score	debt_to_income_ratio	zip_code	state	employment_length
0	1000.00	2007-05-26	Wedding Covered but No Honeymoon	693	10%	481xx	NM	4 years
1	1000.00	2007-05-26	Consolidating Debt	703	10%	010xx	MA	< 1 year
2	11000.00	2007-05-27	Want to consolidate my debt	715	10%	212xx	MD	1 year
3	6000.00	2007-05-27	waksman	698	38.64%	017xx	MA	< 1 year
4	1500.00	2007-05-27	mdrigo	509	9.43%	209xx	MD	< 1 year

Dates in the accepted loans dataset are in the format of MMM-YYYY (e.g., Mar-2016) format, and in the rejected loans dataset are in YYYY-MM-DD (e.g., 2018-12-04). This should be unified.

Code

ds_accepted.date.unique()[:10]

array(['Dec-2015', 'Nov-2015', 'Oct-2015', 'Sep-2015', 'Aug-2015',
       'Jul-2015', 'Jun-2015', 'May-2015', 'Apr-2015', 'Mar-2015'],
      dtype=object)

Code

ds_rejected.date.unique()[:10]

array(['2007-05-26', '2007-05-27', '2007-05-28', '2007-05-29',
       '2007-05-30', '2007-05-31', '2007-06-01', '2007-06-02',
       '2007-06-03', '2007-06-04'], dtype=object)

In the dataset of accepted loans, relevant FICO (risk) score values range from 610 to 850. The difference between the lower and upper boundaries of this score in most cases is 4 points and just in rare cases it is 5. To express the score not as an interval but as a single value, an average of lower and upper boundaries will be used. In the rejected loans dataset, the risk score is expressed as a single value, ranging from 0 to 990.

Details: Risk score

Code

ds_accepted[["risk_score_low", "risk_score_high"]].describe()

	risk_score_low	risk_score_high
count	2260668.00	2260668.00
mean	698.59	702.59
std	33.01	33.01
min	610.00	614.00
25%	675.00	679.00
50%	690.00	694.00
75%	715.00	719.00
max	845.00	850.00

Code

risk_score_diff = ds_accepted.risk_score_high - ds_accepted.risk_score_low
display(risk_score_diff.describe())
an.summarize_discrete(pd.DataFrame({"risk_score_diff": risk_score_diff}))

count   2260668.00
mean          4.00
std           0.01
min           4.00
25%           4.00
50%           4.00
75%           4.00
max           5.00
dtype: Float64

risk_score_diff	n	percent
4	2,260,227	>99.9%
5	441	<0.1%

Code

ds_rejected.risk_score.describe()

count   9151111.00
mean        628.17
std          89.94
min           0.00
25%         591.00
50%         637.00
75%         675.00
max         990.00
Name: risk_score, dtype: Float64

In the accepted loans dataset, debt-to-income ratio units of measurement are not present. However, it seems that the debt-to-income ratio is in percent as values are comparable to percentage values of the loan-to-income ratio (the debt-to-income ratio is a more complex indicator so exact values do not match):

ds_accepted_10 = ds_accepted.head(n=10)
display(
    pd.DataFrame({
        "loan_to_income_ratio": ds_accepted_10.loan_amount
        / ds_accepted_10.annual_inc
        * 100,
        "debt_to_income_ratio": ds_accepted_10.debt_to_income_ratio,
    })
)
del ds_accepted_10

	loan_to_income_ratio	debt_to_income_ratio
0	6.55	5.91
1	38.00	16.06
2	31.75	10.78
3	31.82	17.06
4	9.96	25.37
5	35.15	10.20
6	11.11	14.67
7	23.53	17.61
8	11.76	13.07
9	19.05	34.80

It seems, that in ds_rejected, it is enough to remove the % sign from debt_to_income_ratio and convert it to numeric values.

What is more, some values are negative (-1%) which may mean that it is missing value indicator. This will be accounted for in the further analysis.

Code

ds_rejected.debt_to_income_ratio.value_counts()

debt_to_income_ratio
100%         1362556
-1%          1203063
0%           1045102
9999%          76984
1.2%           32659
              ...   
983.82%            1
1352.48%           1
3544.74%           1
5452.96%           1
21215.75%          1
Name: count, Length: 126145, dtype: int64

Code

ds_accepted.debt_to_income_ratio.value_counts()

debt_to_income_ratio
0.00      1732
18.00     1584
14.40     1577
16.80     1576
19.20     1566
          ... 
261.54       1
74.98        1
111.40       1
180.90       1
250.72       1
Name: count, Length: 10845, dtype: int64

The employment length categories are the same but in incorrect order.

Code

ds_accepted.employment_length.cat.categories

Index(['1 year', '10+ years', '2 years', '3 years', '4 years', '5 years',
       '6 years', '7 years', '8 years', '9 years', '< 1 year'],
      dtype='object')

Code

ds_rejected.employment_length.cat.categories

Index(['1 year', '10+ years', '2 years', '3 years', '4 years', '5 years',
       '6 years', '7 years', '8 years', '9 years', '< 1 year'],
      dtype='object')

The abbreviations of the states are the same in both datasets, but the order is different. “Rejected” also has missing values represented as empty strings. These should be accounted for.

Code

states_accepted = ds_accepted.state.cat.categories
print("n categories =", len(states_accepted))
states_accepted

n categories = 51

Index(['AK', 'AL', 'AR', 'AZ', 'CA', 'CO', 'CT', 'DC', 'DE', 'FL', 'GA', 'HI',
       'IL', 'IN', 'KS', 'KY', 'LA', 'MA', 'MD', 'ME', 'MI', 'MN', 'MO', 'MS',
       'MT', 'NC', 'ND', 'NE', 'NH', 'NJ', 'NM', 'NV', 'NY', 'OH', 'OK', 'OR',
       'PA', 'RI', 'SC', 'SD', 'TN', 'TX', 'UT', 'VA', 'VT', 'WA', 'WI', 'WV',
       'WY', 'ID', 'IA'],
      dtype='object')

Code

states_rejected = ds_rejected.state.cat.categories
print("n categories =", len(states_rejected))
states_rejected

n categories = 51

Index(['AK', 'AL', 'AR', 'AZ', 'CA', 'CO', 'CT', 'DC', 'DE', 'FL', 'GA', 'HI',
       'IA', 'ID', 'IL', 'IN', 'KS', 'KY', 'LA', 'MA', 'MD', 'ME', 'MI', 'MN',
       'MO', 'MS', 'MT', 'NC', 'ND', 'NE', 'NH', 'NJ', 'NM', 'NV', 'NY', 'OH',
       'OK', 'OR', 'PA', 'RI', 'SC', 'SD', 'TN', 'TX', 'UT', 'VA', 'VT', 'WA',
       'WI', 'WV', 'WY'],
      dtype='object')

ZIP codes are represented as 3-digit numbers. The meaning of these digits is illustrated in Section 1.2.1.

Code

zip_accepted = ds_accepted.zip_code.cat.categories
print("n categories =", len(zip_accepted))
zip_accepted

n categories = 956

Index(['010xx', '011xx', '012xx', '013xx', '014xx', '015xx', '016xx', '017xx',
       '018xx', '019xx',
       ...
       '733xx', '964xx', '375xx', '514xx', '698xx', '643xx', '202xx', '552xx',
       '055xx', '896xx'],
      dtype='object', length=956)

ZIP codes with less than 3 digits present indicate possible issues.

Code

# Number of ZIP codes with less than 3 digits:
(ds_rejected.zip_code.str.count(r"[0-9]") < 3).sum()

Code

zip_rejected = ds_rejected.zip_code.cat.categories
print("n categories =", len(zip_rejected))
zip_rejected

n categories = 1001

Index(['000xx', '002xx', '006xx', '007xx', '008xx', '009xx', '010xx', '011xx',
       '012xx', '013xx',
       ...
       '839xx', '695xx', '818xx', '866xx', '849xx', '694xx', '579xx', '518xx',
       '004xx', '699xx'],
      dtype='object', length=1001)

There are 1001 unique values while 1000 is expected. This means that there are mistakes in the data. Manual inspection showed that 09O (with the letter “O”) instead of 090 was present in the data. This will be fixed.

3.2 Pre-Process Data

First, datasets will be pre-processed to make them mergeable. Then, they will be merged and further pre-processing will be done.

3.2.1 Pre-Process and Merge

Prepare data types for categorical variables.

Code

# Work length categories are  "< 1 year", "1 year", ..., "9 years", "10+ years"
work_dtype = pd.CategoricalDtype(utils.work_categories, ordered=True)

# State abbreviation categories
state_dtype = pd.CategoricalDtype(
    sorted(ds_accepted.state.cat.categories), ordered=False
)

Apply pre-processing steps to both datasets. The names of pre-processed datasets will be suffixed with _2.

Code

ds_accepted_2 = (
    # fmt: off
    ds_accepted
    .dropna(subset=['loan_amount'])
    .drop_duplicates()
    .assign(
        date=lambda x: pd.to_datetime(x.date, format="%b-%Y").dt.floor("d"),
        risk_score=lambda x: (
            ((x.risk_score_high + x.risk_score_low) / 2).astype("float16")
        ),
        state=lambda x: x.state.astype(state_dtype),
        zip_area=lambda x: x.zip_code.str[0].astype("Int16"),
        zip_code=lambda x: x.zip_code.str[:3].astype("Int16"),
        employment_length=lambda x: x.employment_length.astype(work_dtype),
        loan_status=1,
    )
    # fmt: on
    .astype({"loan_status": "Int8"}).drop(
        columns=["risk_score_high", "risk_score_low", "annual_inc"]
    )
)
an.col_info(ds_accepted_2, style=True)

	column	data_type	memory_size	n_unique	p_unique	n_missing	p_missing	n_dominant	p_dominant	p_dom_excl_na	dominant
1	loan_amount	float32	9.0 MB	1,572	0.1%	0	0%	187,236	8.3%	8.3%	10000.0
2	date	datetime64[ns]	18.1 MB	139	<0.1%	0	0%	61,992	2.7%	2.7%	2016-03-01 00:00:00
3	loan_title	object	167.9 MB	63,154	2.8%	23,326	1.0%	1,153,293	51.0%	51.5%	Debt consolidation
4	debt_to_income_ratio	float32	9.0 MB	10,845	0.5%	1,711	0.1%	1,732	0.1%	0.1%	0.0
5	zip_code	Int16	6.8 MB	956	<0.1%	1	<0.1%	23,908	1.1%	1.1%	112
6	state	category	2.3 MB	51	<0.1%	0	0%	314,533	13.9%	13.9%	CA
7	employment_length	category	2.3 MB	11	<0.1%	146,907	6.5%	748,005	33.1%	35.4%	10+ years
8	policy_code	Int8	4.5 MB	1	<0.1%	0	0%	2,260,668	100.0%	100.0%	1
9	risk_score	float16	4.5 MB	48	<0.1%	0	0%	186,580	8.3%	8.3%	662.0
10	zip_area	Int16	6.8 MB	10	<0.1%	1	<0.1%	404,303	17.9%	17.9%	9
11	loan_status	Int8	4.5 MB	1	<0.1%	0	0%	2,260,668	100.0%	100.0%	1

Code

ds_rejected_2 = (
    # fmt: off
    ds_rejected
    .dropna(subset=['loan_amount'])
    .drop_duplicates()
    .assign(
        date=lambda x: pd.to_datetime(x.date, format="%Y-%m-%d").dt.floor("d"),
        debt_to_income_ratio=lambda x: (
            x.debt_to_income_ratio.str.rstrip("%").astype("float32")
        ),
        state=lambda x: x.state.astype(state_dtype),
        zip_area=lambda x: x.zip_code.str[0].astype("Int16"),
        zip_code=lambda x: (
            x.zip_code.str[:3].str.replace("O|o", "0", regex=True).astype("Int16")
        ),
        employment_length=lambda x: x.employment_length.astype(work_dtype),
        loan_status=0,
    )
    # fmt: on
    .astype({"loan_status": "Int8", "risk_score": "float16"})
)[ds_accepted_2.columns]

an.col_info(ds_rejected_2, style=True)

	column	data_type	memory_size	n_unique	p_unique	n_missing	p_missing	n_dominant	p_dominant	p_dom_excl_na	dominant
1	loan_amount	float32	110.0 MB	3,640	<0.1%	0	0%	3,889,772	14.1%	14.1%	10000.0
2	date	datetime64[ns]	219.9 MB	4,238	<0.1%	0	0%	41,696	0.2%	0.2%	2018-12-04 00:00:00
3	loan_title	object	2.0 GB	73,927	0.3%	1,285	<0.1%	6,362,745	23.1%	23.1%	Debt consolidation
4	debt_to_income_ratio	float32	110.0 MB	126,145	0.5%	0	0%	1,311,209	4.8%	4.8%	100.0
5	zip_code	Int16	82.5 MB	1,000	<0.1%	292	<0.1%	262,986	1.0%	1.0%	112
6	state	category	27.5 MB	51	<0.1%	22	<0.1%	3,218,415	11.7%	11.7%	CA
7	employment_length	category	27.5 MB	11	<0.1%	949,702	3.5%	22,841,895	83.1%	86.1%	< 1 year
8	policy_code	Int8	55.0 MB	2	<0.1%	918	<0.1%	27,401,835	99.7%	99.7%	0
9	risk_score	float16	55.0 MB	692	<0.1%	18,359,858	66.8%	178,272	0.6%	2.0%	501.0
10	zip_area	Int16	82.5 MB	10	<0.1%	292	<0.1%	4,568,853	16.6%	16.6%	3
11	loan_status	Int8	55.0 MB	1	<0.1%	0	0%	27,490,787	100.0%	100.0%	0

Merge datasets by binding rows. Call the resulting dataset loans and save it into a feather file as an intermediate result.

Code

# Bind rows
loans = pd.concat([ds_accepted_2, ds_rejected_2])

Code

# Save intermediate results
loans = loans.reset_index(drop=True)
loans.to_feather(dir_interim + "task-1-1_merged.feather")
del ds_accepted, ds_rejected, ds_accepted_2, ds_rejected_2

Code

loans.shape

(29751455, 11)

Code

an.col_info(loans, style=True)

	column	data_type	memory_size	n_unique	p_unique	n_missing	p_missing	n_dominant	p_dominant	p_dom_excl_na	dominant
1	loan_amount	float32	119.0 MB	3,640	<0.1%	0	0%	4,077,008	13.7%	13.7%	10000.0
2	date	datetime64[ns]	238.0 MB	4,238	<0.1%	0	0%	83,503	0.3%	0.3%	2018-10-01 00:00:00
3	loan_title	object	2.1 GB	127,125	0.4%	24,611	0.1%	7,516,038	25.3%	25.3%	Debt consolidation
4	debt_to_income_ratio	float32	119.0 MB	126,161	0.4%	1,711	<0.1%	1,311,215	4.4%	4.4%	100.0
5	zip_code	Int16	89.3 MB	1,000	<0.1%	293	<0.1%	286,894	1.0%	1.0%	112
6	state	category	29.8 MB	51	<0.1%	22	<0.1%	3,532,948	11.9%	11.9%	CA
7	employment_length	category	29.8 MB	11	<0.1%	1,096,609	3.7%	23,031,883	77.4%	80.4%	< 1 year
8	policy_code	Int8	59.5 MB	3	<0.1%	918	<0.1%	27,401,835	92.1%	92.1%	0
9	risk_score	float16	59.5 MB	693	<0.1%	18,359,858	61.7%	243,521	0.8%	2.1%	662.0
10	zip_area	Int16	89.3 MB	10	<0.1%	293	<0.1%	4,880,433	16.4%	16.4%	3
11	loan_status	Int8	59.5 MB	2	<0.1%	0	0%	27,490,787	92.4%	92.4%	0

3.2.2 Pre-Process Merged Data

Next, the merged dataset will be pre-processed further. Intermediate results will be saved into feather files as calculations are time-consuming.

Note

Pre-processing and EDA were iterative procedures: some pre-processing steps were invented after EDA on the training set. In this report, some of the steps are presented in a sequential (no cyclic) order.

Code

loans = pd.read_feather(dir_interim + "task-1-1_merged.feather")

Extract parts of dates and convert them to numeric values.

file = dir_interim + "task-1-2_preprocess_1_date_elements.feather"

if os.path.exists(file):
    loans = pd.read_feather(file)
else:
    loans = loans.assign(
        year=lambda df: df["date"].dt.year,
        month=lambda df: df["date"].dt.month,
    )

    loans.to_feather(file)

del file

Extract various technical features of titles like length, number of words, and number of certain characters.

file = dir_interim + "task-1-2_preprocess_2_title_stats.feather"

if os.path.exists(file):
    loans = pd.read_feather(file)
else:
    word_pattern = r"\w+((-\w)*\w)*"
    loans = loans.assign(
        title_len=lambda df: df["loan_title"].str.len(),
        title_n_capital_letters=lambda df: df["loan_title"].str.count(r"[A-Z]"),
        title_n_non_capital_letters=lambda df: df["loan_title"].str.count(r"[a-z]"),
        title_n_letters=lambda df: df.title_n_capital_letters
        + df.title_n_non_capital_letters,
        title_n_digits=lambda df: df["loan_title"].str.count(r"[0-9]"),
        title_n_punctuation=lambda df: df["loan_title"].str.count(r"[^\w\s]"),
        title_n_spaces=lambda df: df["loan_title"].str.count(r"[ ]"),
        title_n_words=lambda df: df["loan_title"].str.count(word_pattern),
        title_in_title_case=lambda df: df["loan_title"].str.istitle(),
        title_in_upper_case=lambda df: df["loan_title"].str.isupper(),
        title_in_lower_case=lambda df: df["loan_title"].str.islower(),
        title_is_missing_or_empty=lambda df: df["loan_title"].isna()
        | df["loan_title"].str.strip().eq(""),
    )

    loans.to_feather(file)

del file

The titles were manually inspected and it was noticed that some information was presented inconsistently (e.g., presence of abbreviations, inconsistent abbreviations, spelling mistakes). So the next step was to create a bit more unified version of titles for further pre-processing.

file = dir_interim + "task-1-2_preprocess_3_cleaner_title.feather"

if os.path.exists(file):
    loans = pd.read_feather(file)
else:
    loans = loans.assign(
        loan_title_unified=lambda df: (
            df["loan_title"]
            .str.lower()
            .str.replace(r"[-\t_.!/ ]+", " ", regex=True)
            .str.replace(r"for", "")
            .str.replace(r"(card|loan)s?", r" \1 ", regex=True)
            .str.replace(r"pay( )*off?s?", " payoff ", regex=True)
            .str.replace(r"(cc|creditcard|^c c )", " credit card ", regex=True)
            .str.replace(r"(de[bp]i?t)|(deb( |$))s?", " debt ", regex=True)
            .str.replace(r"debt( )*cons?( |$)", "debt consolidation", regex=True)
            .str.replace(r"re fi", "refi")
            .str.replace(
                r"consolidat(e|ing|ions|or)|"  # various endings
                r"conso(l(id)?(atio)?)?( |$)|"  # various abbreviations
                r"con[sc][oia]?l?[iao]?[dt]?a?[tc]ion",  # various misspellings
                "consolidation",
                regex=True,
            )
            .str.replace(r"[ ]+", " ", regex=True)
            .str.strip()
            .replace("", pd.NA)
            .replace(np.nan, pd.NA)
        )
    )

    loans.to_feather(file)

del file

Extract the presence/absence of certain words in the titles. These words were subjectively chosen based on the manual inspection of titles.

file = dir_interim + "task-1-2_preprocess_4_extract_title_features.feather"

if os.path.exists(file):
    loans = pd.read_feather(file)
else:
    # Extract features from the loan title
    loans = loans.assign(
        credit=lambda df: df["loan_title_unified"].str.contains(r"credit"),
        card=lambda df: df["loan_title_unified"].str.contains(r"card"),
        refinancing=lambda df: df["loan_title_unified"].str.contains(r"refi"),
        consolidation=lambda df: df["loan_title_unified"].str.contains(r"consolidat"),
        debt_related=lambda df: df["loan_title_unified"].str.contains(r"debt"),
        bills_taxes=lambda df: df["loan_title_unified"].str.contains(r"bill|tax|rent"),
        payoff=lambda df: df["loan_title_unified"].str.contains(r"payoff|payitoff"),
        home_upgrade=lambda df: df["loan_title_unified"].str.contains(
            r"home i[nm]p|home re[np]|home upgrade|renovation|"
            r"roof|furniture|kitchen|bathroom|pool|windows"
        ),
        home_related=lambda df: df["loan_title_unified"].str.contains(
            r"home|house|apartment|flat"
        ),
        home_buying=lambda df: df["loan_title_unified"].str.contains(
            r"home buying|new home|buy house|new house|my house"
        ),
        fixing=lambda df: df["loan_title_unified"].str.contains(
            r"fix|i[nm]prov|imp$|repair|upgrade|renovat"
        ),
        major_purchase_unspecified=lambda df: df["loan_title_unified"].str.contains(
            r"major purchase"
        ),
        relocation=lambda df: df["loan_title_unified"].str.contains(
            r"moving|move|relocate|relocation(?! forward)"
        ),
        weddings=lambda df: df["loan_title_unified"].str.contains(
            r"wedding|engagement|marr[iy]"
        ),
        car=lambda df: df["loan_title_unified"].str.contains(r"car|auto|jeep"),
        motorcycle=lambda df: df["loan_title_unified"].str.contains(
            r"motorcycle|harley"
        ),
        vehicle_unspecified_or_other=lambda df: df["loan_title_unified"].str.contains(
            r"truck|atv|boat|camper|vehicle"
        ),
        medical_expenses=lambda df: df["loan_title_unified"].str.contains(
            r"med|health|surgery|dental"
        ),
        education=lambda df: df["loan_title_unified"].str.contains(
            r"student|college|school|educat|learn"
        ),
        investment=lambda df: df["loan_title_unified"].str.contains(r"invest"),
        vocation=lambda df: df["loan_title_unified"].str.contains(
            r"vacation|travel|holiday|spa"
        ),
        renewable_energy=lambda df: df["loan_title_unified"].str.contains(
            r"renewable energy|green"
        ),
    )

    # Change data types
    data_types_dict = {
        "loan_status": "Int8",
        "year": "Int16",
        "month": "Int8",
        "risk_score": "float32",
        "policy_code": "Int8",
        "zip_area": "Int16",
        "zip_code": "Int16",
        "title_len": "float32",
        "title_n_capital_letters": "float32",
        "title_n_non_capital_letters": "float32",
        "title_n_letters": "float32",
        "title_n_digits": "float32",
        "title_n_punctuation": "float32",
        "title_n_spaces": "float32",
        "title_n_words": "float32",
        "title_in_title_case": "Int8",
        "title_in_upper_case": "Int8",
        "title_in_lower_case": "Int8",
        "title_is_missing_or_empty": "Int8",
        "credit": "Int8",
        "card": "Int8",
        "refinancing": "Int8",
        "consolidation": "Int8",
        "debt_related": "Int8",
        "bills_taxes": "Int8",
        "payoff": "Int8",
        "home_upgrade": "Int8",
        "home_related": "Int8",
        "home_buying": "Int8",
        "fixing": "Int8",
        "major_purchase_unspecified": "Int8",
        "relocation": "Int8",
        "weddings": "Int8",
        "car": "Int8",
        "motorcycle": "Int8",
        "vehicle_unspecified_or_other": "Int8",
        "medical_expenses": "Int8",
        "education": "Int8",
        "investment": "Int8",
        "vocation": "Int8",
        "renewable_energy": "Int8",
    }
    loans = loans.astype(data_types_dict)

    # Save
    loans.to_feather(file)

del file

The following variables are created after the first round of EDA on training data (the subset dedicated to EDA):

file = dir_interim + "task-1-2_preprocess_5_add_more_features.feather"

if os.path.exists(file):
    loans = pd.read_feather(file)
else:
    # Add more features
    loans = loans.assign(
        loan_amount_above_40k=lambda df: (df.loan_amount > 40000).astype("Int8"),
        loan_amount_log=lambda df: np.log1p(df.loan_amount),
        loan_amount_cap_40k=lambda df: df.loan_amount.clip(upper=40000),
        # Leave the original values as they are for EDA
        debt_to_income_ratio_original=lambda df: df.debt_to_income_ratio,
        debt_to_income_ratio_orig_cap_100=lambda df: df.debt_to_income_ratio.clip(
            upper=100
        ),
        debt_to_income_ratio_is_na_original=lambda df: df.debt_to_income_ratio_original.isna().astype(
            "Int8"
        ),
        # Replace -1 with NA
        debt_to_income_ratio=lambda df: df.debt_to_income_ratio.replace(
            -1, np.nan
        ).astype("float32"),
        debt_to_income_ratio_cap_100=lambda df: df.debt_to_income_ratio.clip(upper=100),
        employment_length_num=lambda df: df.employment_length.replace(
            dict(zip(utils.work_categories, range(len(utils.work_categories))))
        ).astype("Int8"),
        employment_length_is_na=lambda df: df.employment_length.isna().astype("Int8"),
        debt_to_income_ratio_is_na=lambda df: (
            df.debt_to_income_ratio.isna().astype("Int8")
        ),
        zip_code_is_na=lambda df: df.zip_code.isna().astype("Int8"),
        risk_score_is_na=lambda df: df.risk_score.isna().astype("Int8"),
    )

    # Sin/Cos transformation for months
    loans = (
        CyclicalFeatures("month", {"month": 12})
        .fit_transform(loans)
        .astype({"month_sin": "float32", "month_cos": "float32"})
    )

    # Re-order columns
    new_column_order = [
        "loan_status",
        "policy_code",
        "date",
        "year",
        "month",
        "month_sin",
        "month_cos",
        "loan_amount",
        "loan_amount_log",
        "loan_amount_cap_40k",
        "loan_amount_above_40k",
        "debt_to_income_ratio",
        "debt_to_income_ratio_cap_100",
        "debt_to_income_ratio_is_na",
        "debt_to_income_ratio_original",
        "debt_to_income_ratio_orig_cap_100",
        "debt_to_income_ratio_is_na_original",
        "risk_score",
        "risk_score_is_na",
        "employment_length",
        "employment_length_num",
        "employment_length_is_na",
        "state",
        "zip_area",
        "zip_code",
        "zip_code_is_na",
        "loan_title",
        "loan_title_unified",
        "title_is_missing_or_empty",
        "title_len",
        "title_n_letters",
        "title_n_capital_letters",
        "title_n_non_capital_letters",
        "title_n_digits",
        "title_n_punctuation",
        "title_n_spaces",
        "title_n_words",
        "title_in_title_case",
        "title_in_upper_case",
        "title_in_lower_case",
        "credit",
        "card",
        "refinancing",
        "consolidation",
        "debt_related",
        "bills_taxes",
        "payoff",
        "home_upgrade",
        "home_related",
        "home_buying",
        "fixing",
        "major_purchase_unspecified",
        "relocation",
        "weddings",
        "car",
        "motorcycle",
        "vehicle_unspecified_or_other",
        "medical_expenses",
        "education",
        "investment",
        "vocation",
        "renewable_energy",
    ]
    loans = loans[new_column_order]
    # Save intermediate results
    loans.to_feather(file)

del file
# Time: 1m 39.1s

3.2.3 Inspect After Pre-Processing

In this section, the dataset will be inspected for the most obvious issues before spiting it into training and testing subsets and performing EDA.

Code

loans.shape

(29751455, 62)

Code

loans.info(verbose=False)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29751455 entries, 0 to 29751454
Columns: 62 entries, loan_status to renewable_energy
dtypes: Int16(3), Int8(36), category(2), datetime64[ns](1), float32(18), object(2)
memory usage: 5.0+ GB

Code

an.col_info(loans, style=True)

	column	data_type	memory_size	n_unique	p_unique	n_missing	p_missing	n_dominant	p_dominant	p_dom_excl_na	dominant
1	loan_status	Int8	59.5 MB	2	<0.1%	0	0%	27,490,787	92.4%	92.4%	0
2	policy_code	Int8	59.5 MB	3	<0.1%	918	<0.1%	27,401,835	92.1%	92.1%	0
3	date	datetime64[ns]	238.0 MB	4,238	<0.1%	0	0%	83,503	0.3%	0.3%	2018-10-01 00:00:00
4	year	Int16	89.3 MB	12	<0.1%	0	0%	9,910,336	33.3%	33.3%	2018
5	month	Int8	59.5 MB	12	<0.1%	0	0%	2,928,639	9.8%	9.8%	10
6	month_sin	float32	119.0 MB	8	<0.1%	0	0%	5,715,577	19.2%	19.2%	-0.8660254
7	month_cos	float32	119.0 MB	8	<0.1%	0	0%	5,189,861	17.4%	17.4%	-0.8660254
8	loan_amount	float32	119.0 MB	3,640	<0.1%	0	0%	4,077,008	13.7%	13.7%	10000.0
9	loan_amount_log	float32	119.0 MB	3,640	<0.1%	0	0%	4,077,008	13.7%	13.7%	9.210441
10	loan_amount_cap_40k	float32	119.0 MB	2,074	<0.1%	0	0%	4,077,008	13.7%	13.7%	10000.0
11	loan_amount_above_40k	Int8	59.5 MB	2	<0.1%	0	0%	29,577,932	99.4%	99.4%	0
12	debt_to_income_ratio	float32	119.0 MB	126,160	0.4%	1,167,451	3.9%	1,311,215	4.4%	4.6%	100.0
13	debt_to_income_ratio_cap_100	float32	119.0 MB	10,001	<0.1%	1,167,451	3.9%	2,118,993	7.1%	7.4%	100.0
14	debt_to_income_ratio_is_na	Int8	59.5 MB	2	<0.1%	0	0%	28,584,004	96.1%	96.1%	0
15	debt_to_income_ratio_original	float32	119.0 MB	126,161	0.4%	1,711	<0.1%	1,311,215	4.4%	4.4%	100.0
16	debt_to_income_ratio_orig_cap_100	float32	119.0 MB	10,002	<0.1%	1,711	<0.1%	2,118,993	7.1%	7.1%	100.0
17	debt_to_income_ratio_is_na_original	Int8	59.5 MB	2	<0.1%	0	0%	29,749,744	>99.9%	>99.9%	0
18	risk_score	float32	119.0 MB	693	<0.1%	18,359,858	61.7%	243,521	0.8%	2.1%	662.0
19	risk_score_is_na	Int8	59.5 MB	2	<0.1%	0	0%	18,359,858	61.7%	61.7%	1
20	employment_length	category	29.8 MB	11	<0.1%	1,096,609	3.7%	23,031,883	77.4%	80.4%	< 1 year
21	employment_length_num	Int8	59.5 MB	11	<0.1%	1,096,609	3.7%	23,031,883	77.4%	80.4%	0
22	employment_length_is_na	Int8	59.5 MB	2	<0.1%	0	0%	28,654,846	96.3%	96.3%	0
23	state	category	29.8 MB	51	<0.1%	22	<0.1%	3,532,948	11.9%	11.9%	CA
24	zip_area	Int16	89.3 MB	10	<0.1%	293	<0.1%	4,880,433	16.4%	16.4%	3
25	zip_code	Int16	89.3 MB	1,000	<0.1%	293	<0.1%	286,894	1.0%	1.0%	112
26	zip_code_is_na	Int8	59.5 MB	2	<0.1%	0	0%	29,751,162	>99.9%	>99.9%	0
27	loan_title	object	2.1 GB	127,125	0.4%	24,611	0.1%	7,516,038	25.3%	25.3%	Debt consolidation
28	loan_title_unified	object	2.1 GB	99,875	0.3%	39,845	0.1%	13,424,529	45.1%	45.2%	debt consolidation
29	title_is_missing_or_empty	Int8	59.5 MB	2	<0.1%	0	0%	29,711,614	99.9%	99.9%	0
30	title_len	float32	119.0 MB	195	<0.1%	24,611	0.1%	13,434,558	45.2%	45.2%	18.0
31	title_n_letters	float32	119.0 MB	182	<0.1%	24,611	0.1%	13,439,150	45.2%	45.2%	17.0
32	title_n_capital_letters	float32	119.0 MB	51	<0.1%	24,611	0.1%	16,695,522	56.1%	56.2%	1.0
33	title_n_non_capital_letters	float32	119.0 MB	172	<0.1%	24,611	0.1%	7,532,772	25.3%	25.3%	16.0
34	title_n_digits	float32	119.0 MB	28	<0.1%	24,611	0.1%	29,711,627	99.9%	99.9%	0.0
35	title_n_punctuation	float32	119.0 MB	37	<0.1%	24,611	0.1%	29,701,231	99.8%	99.9%	0.0
36	title_n_spaces	float32	119.0 MB	96	<0.1%	24,611	0.1%	15,653,630	52.6%	52.7%	0.0
37	title_n_words	float32	119.0 MB	95	<0.1%	24,611	0.1%	15,656,042	52.6%	52.7%	1.0
38	title_in_title_case	Int8	59.5 MB	2	<0.1%	24,611	0.1%	26,103,081	87.7%	87.8%	0
39	title_in_upper_case	Int8	59.5 MB	2	<0.1%	24,611	0.1%	29,713,719	99.9%	>99.9%	0
40	title_in_lower_case	Int8	59.5 MB	2	<0.1%	24,611	0.1%	17,200,037	57.8%	57.9%	0
41	credit	Int8	59.5 MB	2	<0.1%	39,845	0.1%	25,539,395	85.8%	86.0%	0
42	card	Int8	59.5 MB	2	<0.1%	39,845	0.1%	25,544,566	85.9%	86.0%	0
43	refinancing	Int8	59.5 MB	2	<0.1%	39,845	0.1%	26,942,551	90.6%	90.7%	0
44	consolidation	Int8	59.5 MB	2	<0.1%	39,845	0.1%	16,231,183	54.6%	54.6%	0
45	debt_related	Int8	59.5 MB	2	<0.1%	39,845	0.1%	16,239,908	54.6%	54.7%	0
46	bills_taxes	Int8	59.5 MB	2	<0.1%	39,845	0.1%	29,701,513	99.8%	>99.9%	0
47	payoff	Int8	59.5 MB	2	<0.1%	39,845	0.1%	29,686,528	99.8%	99.9%	0
48	home_upgrade	Int8	59.5 MB	2	<0.1%	39,845	0.1%	28,374,364	95.4%	95.5%	0
49	home_related	Int8	59.5 MB	2	<0.1%	39,845	0.1%	27,681,412	93.0%	93.2%	0
50	home_buying	Int8	59.5 MB	2	<0.1%	39,845	0.1%	29,206,602	98.2%	98.3%	0
51	fixing	Int8	59.5 MB	2	<0.1%	39,845	0.1%	28,373,548	95.4%	95.5%	0
52	major_purchase_unspecified	Int8	59.5 MB	2	<0.1%	39,845	0.1%	28,711,014	96.5%	96.6%	0
53	relocation	Int8	59.5 MB	2	<0.1%	39,845	0.1%	28,995,543	97.5%	97.6%	0
54	weddings	Int8	59.5 MB	2	<0.1%	39,845	0.1%	29,688,728	99.8%	99.9%	0
55	car	Int8	59.5 MB	2	<0.1%	39,845	0.1%	24,230,788	81.4%	81.6%	0
56	motorcycle	Int8	59.5 MB	2	<0.1%	39,845	0.1%	29,710,218	99.9%	>99.9%	0
57	vehicle_unspecified_or_other	Int8	59.5 MB	2	<0.1%	39,845	0.1%	29,709,495	99.9%	>99.9%	0
58	medical_expenses	Int8	59.5 MB	2	<0.1%	39,845	0.1%	28,935,736	97.3%	97.4%	0
59	education	Int8	59.5 MB	2	<0.1%	39,845	0.1%	29,697,558	99.8%	>99.9%	0
60	investment	Int8	59.5 MB	2	<0.1%	39,845	0.1%	29,709,698	99.9%	>99.9%	0
61	vocation	Int8	59.5 MB	2	<0.1%	39,845	0.1%	29,387,506	98.8%	98.9%	0
62	renewable_energy	Int8	59.5 MB	2	<0.1%	39,845	0.1%	29,653,502	99.7%	99.8%	0

Code

loans.describe().T

	count	mean	min	25%	50%	75%	max	std
loan_status	29751455.00	0.08	0.00	0.00	0.00	0.00	1.00	0.26
policy_code	29750537.00	0.08	0.00	0.00	0.00	0.00	2.00	0.28
date	29751455	2016-12-17 10:41:14.614325760	2007-05-26 00:00:00	2016-01-23 00:00:00	2017-05-28 00:00:00	2018-04-20 00:00:00	2018-12-31 00:00:00	NaN
year	29751455.00	2016.43	2007.00	2016.00	2017.00	2018.00	2018.00	1.68
month	29751455.00	6.93	1.00	4.00	7.00	10.00	12.00	3.37
month_sin	29751455.00	-0.08	-1.00	-0.87	-0.00	0.50	1.00	0.69
month_cos	29751455.00	-0.01	-1.00	-0.87	-0.00	0.87	1.00	0.73
loan_amount	29751455.00	13251.77	0.00	5000.00	10000.00	20000.00	1400000.00	13716.20
loan_amount_log	29751455.00	9.04	0.00	8.52	9.21	9.90	14.15	0.98
loan_amount_cap_40k	29751455.00	12805.58	0.00	5000.00	10000.00	20000.00	40000.00	10199.33
loan_amount_above_40k	29751455.00	0.01	0.00	0.00	0.00	0.00	1.00	0.08
debt_to_income_ratio	28584004.00	139.46	0.00	10.00	20.58	35.88	50000032.00	10348.52
debt_to_income_ratio_cap_100	28584004.00	27.91	0.00	10.00	20.58	35.88	100.00	25.35
debt_to_income_ratio_is_na	29751455.00	0.04	0.00	0.00	0.00	0.00	1.00	0.19
debt_to_income_ratio_original	29749744.00	133.95	-1.00	8.58	19.67	35.03	50000032.00	10143.76
debt_to_income_ratio_orig_cap_100	29749744.00	26.77	-1.00	8.58	19.67	35.03	100.00	25.14
debt_to_income_ratio_is_na_original	29751455.00	0.00	0.00	0.00	0.00	0.00	1.00	0.01
risk_score	11391597.00	642.54	0.00	605.00	655.00	689.00	990.00	96.22
risk_score_is_na	29751455.00	0.62	0.00	0.00	1.00	1.00	1.00	0.49
employment_length_num	28654846.00	1.10	0.00	0.00	0.00	0.00	10.00	2.54
employment_length_is_na	29751455.00	0.04	0.00	0.00	0.00	0.00	1.00	0.19
zip_area	29751162.00	4.57	0.00	2.00	4.00	7.00	9.00	2.98
zip_code	29751162.00	502.90	0.00	274.00	452.00	781.00	999.00	299.15
zip_code_is_na	29751455.00	0.00	0.00	0.00	0.00	0.00	1.00	0.00
title_is_missing_or_empty	29751455.00	0.00	0.00	0.00	0.00	0.00	1.00	0.04
title_len	29726844.00	14.62	1.00	11.00	18.00	18.00	3572.00	5.99
title_n_letters	29726844.00	13.74	0.00	10.00	17.00	17.00	2655.00	5.13
title_n_capital_letters	29726844.00	0.60	0.00	0.00	1.00	1.00	283.00	0.54
title_n_non_capital_letters	29726844.00	13.14	0.00	9.00	16.00	17.00	2433.00	5.04
title_n_digits	29726844.00	0.00	0.00	0.00	0.00	0.00	175.00	0.08
title_n_punctuation	29726844.00	0.00	0.00	0.00	0.00	0.00	288.00	0.09
title_n_spaces	29726844.00	0.59	0.00	0.00	0.00	1.00	650.00	0.65
title_n_words	29726844.00	1.59	0.00	1.00	1.00	2.00	618.00	0.81
title_in_title_case	29726844.00	0.12	0.00	0.00	0.00	0.00	1.00	0.33
title_in_upper_case	29726844.00	0.00	0.00	0.00	0.00	0.00	1.00	0.02
title_in_lower_case	29726844.00	0.42	0.00	0.00	0.00	1.00	1.00	0.49
credit	29711610.00	0.14	0.00	0.00	0.00	0.00	1.00	0.35
card	29711610.00	0.14	0.00	0.00	0.00	0.00	1.00	0.35
refinancing	29711610.00	0.09	0.00	0.00	0.00	0.00	1.00	0.29
consolidation	29711610.00	0.45	0.00	0.00	0.00	1.00	1.00	0.50
debt_related	29711610.00	0.45	0.00	0.00	0.00	1.00	1.00	0.50
bills_taxes	29711610.00	0.00	0.00	0.00	0.00	0.00	1.00	0.02
payoff	29711610.00	0.00	0.00	0.00	0.00	0.00	1.00	0.03
home_upgrade	29711610.00	0.05	0.00	0.00	0.00	0.00	1.00	0.21
home_related	29711610.00	0.07	0.00	0.00	0.00	0.00	1.00	0.25
home_buying	29711610.00	0.02	0.00	0.00	0.00	0.00	1.00	0.13
fixing	29711610.00	0.05	0.00	0.00	0.00	0.00	1.00	0.21
major_purchase_unspecified	29711610.00	0.03	0.00	0.00	0.00	0.00	1.00	0.18
relocation	29711610.00	0.02	0.00	0.00	0.00	0.00	1.00	0.15
weddings	29711610.00	0.00	0.00	0.00	0.00	0.00	1.00	0.03
car	29711610.00	0.18	0.00	0.00	0.00	0.00	1.00	0.39
motorcycle	29711610.00	0.00	0.00	0.00	0.00	0.00	1.00	0.01
vehicle_unspecified_or_other	29711610.00	0.00	0.00	0.00	0.00	0.00	1.00	0.01
medical_expenses	29711610.00	0.03	0.00	0.00	0.00	0.00	1.00	0.16
education	29711610.00	0.00	0.00	0.00	0.00	0.00	1.00	0.02
investment	29711610.00	0.00	0.00	0.00	0.00	0.00	1.00	0.01
vocation	29711610.00	0.01	0.00	0.00	0.00	0.00	1.00	0.10
renewable_energy	29711610.00	0.00	0.00	0.00	0.00	0.00	1.00	0.04

3.2.4 Split Data into Train, Validation, and Test Sets

Code

# If the analysis is restored from this point, load the data
if "loans" not in locals():
    loans = pd.read_feather(
        dir_interim + "task-1-2_preprocess_5_add_more_features.feather"
    )

Remove columns that are not needed for the analysis and consume a lot of memory.

Code

loans = loans.drop(columns=["loan_title", "loan_title_unified"])

Split data into training, validation, and test sets (70%:15%:15%).
Stratification by loan status is used to ensure that the proportion of accepted and rejected loans is the same in all sets.

Code

# Train, validation, test split (stratified by loan status): 70%:15%:15%

path_train = dir_interim + "task-1-loans_train.feather"
path_validation = dir_interim + "task-1-loans_validation.feather"
path_test = dir_interim + "task-1-loans_test.feather"

if (
    os.path.exists(path_train)
    and os.path.exists(path_validation)
    and os.path.exists(path_test)
):
    # Load saved results, if they exist
    loans_train = pd.read_feather(path_train)
    loans_validation = pd.read_feather(path_validation)
    loans_test = pd.read_feather(path_test)
else:
    # Split data
    loans_train, loans_validation = train_test_split(
        loans, test_size=0.3, random_state=42, stratify=loans.loan_status
    )

    loans_validation, loans_test = train_test_split(
        loans_validation,
        test_size=0.5,
        random_state=42,
        stratify=loans_validation.loan_status,
    )

    # Save as feather files
    loans_train.to_feather(path_train)
    loans_validation.to_feather(path_validation)
    loans_test.to_feather(path_test)

Code

if "loans" in locals():
    del loans

3.3 EDA of All Years

To make EDA faster, only 10% of the training data will be used for EDA (which is 2.1 million cases).

Various trends seem to be different comparing years. E.g.:

The number of applications grows each year with the maximum in 2018 (the most recent year; see Figure 3.1).
The percentage of accepted loan applications differs significantly between years (chi-squared test, p-value < 0.001) with a maximum of 14.9% of accepted loans in 2013 and a minimum of 5.0% in 2018 (see Table 3.1).
The category “Employment length of 5 years” seems to be over-represented in 2014, 2015, and especially in 2016 and 2017 while in 2018 this over-representation is less obvious (if present at all).
Refinancing (Figure 3.5) and home buying (Figure 3.6) purposes show an increase in the year 2018 compared to previous years.
In the year 2018, no missing titles (Figure 3.4), no lower-case titles (Figure 3.3), and some other title properties suggest that there was some standardization in data collection procedures at Lending Club.

These findings may indicate that analyzing previous years may be less relevant for predicting future trends.

Find more EDA results below.

Code

file_path = dir_interim + "task-1-loans_eda.feather"

if os.path.exists(file_path):
    # Load from file, if present.
    loans_eda = pd.read_feather(file_path)
else:
    # Do calculations.
    loans_eda = loans_train.sample(frac=0.1, random_state=20)
    loans_eda.to_feather(file_path)

Code

print(f"{loans_train.shape[0]/1e6:.1f}M rows in training set.")
print(f"{loans_eda.shape[0]/1e6: .1f}M rows in EDA sub-set (from training set).")

20.8M rows in training set.
 2.1M rows in EDA sub-set (from training set).

Column info

Code

an.col_info(loans_eda, style=True)

	column	data_type	memory_size	n_unique	p_unique	n_missing	p_missing	n_dominant	p_dominant	p_dom_excl_na	dominant
1	loan_status	Int8	4.2 MB	2	<0.1%	0	0%	1,924,091	92.4%	92.4%	0
2	policy_code	Int8	4.2 MB	3	<0.1%	70	<0.1%	1,917,868	92.1%	92.1%	0
3	date	datetime64[ns]	16.7 MB	4,155	0.2%	0	0%	5,781	0.3%	0.3%	2018-10-01 00:00:00
4	year	Int16	6.2 MB	12	<0.1%	0	0%	694,223	33.3%	33.3%	2018
5	month	Int8	4.2 MB	12	<0.1%	0	0%	204,881	9.8%	9.8%	10
6	month_sin	float32	8.3 MB	8	<0.1%	0	0%	399,619	19.2%	19.2%	-0.8660254
7	month_cos	float32	8.3 MB	8	<0.1%	0	0%	363,560	17.5%	17.5%	-0.8660254
8	loan_amount	float32	8.3 MB	2,095	0.1%	0	0%	285,145	13.7%	13.7%	10000.0
9	loan_amount_log	float32	8.3 MB	2,095	0.1%	0	0%	285,145	13.7%	13.7%	9.210441
10	loan_amount_cap_40k	float32	8.3 MB	1,600	0.1%	0	0%	285,145	13.7%	13.7%	10000.0
11	loan_amount_above_40k	Int8	4.2 MB	2	<0.1%	0	0%	2,070,516	99.4%	99.4%	0
12	debt_to_income_ratio	float32	8.3 MB	39,432	1.9%	81,734	3.9%	91,623	4.4%	4.6%	100.0
13	debt_to_income_ratio_cap_100	float32	8.3 MB	9,994	0.5%	81,734	3.9%	148,622	7.1%	7.4%	100.0
14	debt_to_income_ratio_is_na	Int8	4.2 MB	2	<0.1%	0	0%	2,000,868	96.1%	96.1%	0
15	debt_to_income_ratio_original	float32	8.3 MB	39,433	1.9%	125	<0.1%	91,623	4.4%	4.4%	100.0
16	debt_to_income_ratio_orig_cap_100	float32	8.3 MB	9,995	0.5%	125	<0.1%	148,622	7.1%	7.1%	100.0
17	debt_to_income_ratio_is_na_original	Int8	4.2 MB	2	<0.1%	0	0%	2,082,477	>99.9%	>99.9%	0
18	risk_score	float32	8.3 MB	654	<0.1%	1,285,066	61.7%	16,923	0.8%	2.1%	662.0
19	risk_score_is_na	Int8	4.2 MB	2	<0.1%	0	0%	1,285,066	61.7%	61.7%	1
20	employment_length	category	2.1 MB	11	<0.1%	76,401	3.7%	1,612,460	77.4%	80.4%	< 1 year
21	employment_length_num	Int8	4.2 MB	11	<0.1%	76,401	3.7%	1,612,460	77.4%	80.4%	0
22	employment_length_is_na	Int8	4.2 MB	2	<0.1%	0	0%	2,006,201	96.3%	96.3%	0
23	state	category	2.1 MB	51	<0.1%	3	<0.1%	247,753	11.9%	11.9%	CA
24	zip_area	Int16	6.2 MB	10	<0.1%	27	<0.1%	340,617	16.4%	16.4%	3
25	zip_code	Int16	6.2 MB	986	<0.1%	27	<0.1%	20,130	1.0%	1.0%	112
26	zip_code_is_na	Int8	4.2 MB	2	<0.1%	0	0%	2,082,575	>99.9%	>99.9%	0
27	title_is_missing_or_empty	Int8	4.2 MB	2	<0.1%	0	0%	2,079,742	99.9%	99.9%	0
28	title_len	float32	8.3 MB	79	<0.1%	1,804	0.1%	940,825	45.2%	45.2%	18.0
29	title_n_letters	float32	8.3 MB	74	<0.1%	1,804	0.1%	941,060	45.2%	45.2%	17.0
30	title_n_capital_letters	float32	8.3 MB	38	<0.1%	1,804	0.1%	1,168,488	56.1%	56.2%	1.0
31	title_n_non_capital_letters	float32	8.3 MB	67	<0.1%	1,804	0.1%	527,687	25.3%	25.4%	16.0
32	title_n_digits	float32	8.3 MB	14	<0.1%	1,804	0.1%	2,079,718	99.9%	99.9%	0.0
33	title_n_punctuation	float32	8.3 MB	16	<0.1%	1,804	0.1%	2,078,989	99.8%	99.9%	0.0
34	title_n_spaces	float32	8.3 MB	28	<0.1%	1,804	0.1%	1,095,780	52.6%	52.7%	0.0
35	title_n_words	float32	8.3 MB	28	<0.1%	1,804	0.1%	1,095,949	52.6%	52.7%	1.0
36	title_in_title_case	Int8	4.2 MB	2	<0.1%	1,804	0.1%	1,827,590	87.8%	87.8%	0
37	title_in_upper_case	Int8	4.2 MB	2	<0.1%	1,804	0.1%	2,079,823	99.9%	>99.9%	0
38	title_in_lower_case	Int8	4.2 MB	2	<0.1%	1,804	0.1%	1,203,682	57.8%	57.8%	0
39	credit	Int8	4.2 MB	2	<0.1%	2,860	0.1%	1,787,690	85.8%	86.0%	0
40	card	Int8	4.2 MB	2	<0.1%	2,860	0.1%	1,788,076	85.9%	86.0%	0
41	refinancing	Int8	4.2 MB	2	<0.1%	2,860	0.1%	1,885,717	90.5%	90.7%	0
42	consolidation	Int8	4.2 MB	2	<0.1%	2,860	0.1%	1,135,780	54.5%	54.6%	0
43	debt_related	Int8	4.2 MB	2	<0.1%	2,860	0.1%	1,136,445	54.6%	54.6%	0
44	bills_taxes	Int8	4.2 MB	2	<0.1%	2,860	0.1%	2,079,041	99.8%	>99.9%	0
45	payoff	Int8	4.2 MB	2	<0.1%	2,860	0.1%	2,077,970	99.8%	99.9%	0
46	home_upgrade	Int8	4.2 MB	2	<0.1%	2,860	0.1%	1,986,705	95.4%	95.5%	0
47	home_related	Int8	4.2 MB	2	<0.1%	2,860	0.1%	1,938,471	93.1%	93.2%	0
48	home_buying	Int8	4.2 MB	2	<0.1%	2,860	0.1%	2,044,582	98.2%	98.3%	0
49	fixing	Int8	4.2 MB	2	<0.1%	2,860	0.1%	1,986,669	95.4%	95.5%	0
50	major_purchase_unspecified	Int8	4.2 MB	2	<0.1%	2,860	0.1%	2,009,575	96.5%	96.6%	0
51	relocation	Int8	4.2 MB	2	<0.1%	2,860	0.1%	2,029,413	97.4%	97.6%	0
52	weddings	Int8	4.2 MB	2	<0.1%	2,860	0.1%	2,078,101	99.8%	99.9%	0
53	car	Int8	4.2 MB	2	<0.1%	2,860	0.1%	1,696,235	81.4%	81.6%	0
54	motorcycle	Int8	4.2 MB	2	<0.1%	2,860	0.1%	2,079,642	99.9%	>99.9%	0
55	vehicle_unspecified_or_other	Int8	4.2 MB	2	<0.1%	2,860	0.1%	2,079,573	99.9%	>99.9%	0
56	medical_expenses	Int8	4.2 MB	2	<0.1%	2,860	0.1%	2,024,973	97.2%	97.4%	0
57	education	Int8	4.2 MB	2	<0.1%	2,860	0.1%	2,078,790	99.8%	>99.9%	0
58	investment	Int8	4.2 MB	2	<0.1%	2,860	0.1%	2,079,601	99.9%	>99.9%	0
59	vocation	Int8	4.2 MB	2	<0.1%	2,860	0.1%	2,056,743	98.8%	98.9%	0
60	renewable_energy	Int8	4.2 MB	2	<0.1%	2,860	0.1%	2,075,725	99.7%	99.8%	0

EDA: Sweetviz report

Code

if do_eda:
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")

        report_training_1 = sweetviz.analyze(
            [loans_eda, "Training data"],
            target_feat="loan_status",
            pairwise_analysis="off",
        )
        report_training_1.show_notebook()

crosstab_status_by_year = an.CrossTab("year", "loan_status", data=loans_eda)
crosstab_status_by_year().assign(
    percent_accepted=lambda df: df[1] / (df[1] + df[0]) * 100
)

Table 3.1. Percentage of accepted loans by year.

loan_status	0	1	percent_accepted
year
2007	401	46	10.29
2008	1833	189	9.35
2009	3954	351	8.15
2010	7859	876	10.03
2011	15226	1491	8.92
2012	23612	3722	13.62
2013	52973	9283	14.91
2014	134360	16686	11.05
2015	199956	29476	12.85
2016	333098	30465	8.38
2017	491675	30847	5.90
2018	659144	35079	5.05

fig, ax = plt.subplots(
    2, 1, figsize=(10, 6), sharex=True, gridspec_kw={"height_ratios": [3, 2]}
)

crosstab_status_by_year.barplot(ax=ax[0])
crosstab_status_by_year.barplot(normalize="rows", stacked=True, xlabel="Year", ax=ax[1])

# Get the limits of the x-axis and y-axis
xlim = ax[0].get_xlim()
ylim = ax[0].get_ylim()
# Set the position of the annotation
x_pos = xlim[1] - 0.25 * (xlim[1] - xlim[0])  # 25% offset from the right edge
y_pos = ylim[1] - 0.05 * (ylim[1] - ylim[0])  # 5% offset from the top edge

chi_sq_res_1 = crosstab_status_by_year.chisq_test("short")

ax[0].annotate(
    chi_sq_res_1.capitalize(),
    xy=(x_pos, y_pos),
    xycoords="data",
    ha="center",
    va="top",
)
ax[0].yaxis.set_major_formatter(axis_formatter("M", precision=1))
ax[0].get_legend().set_title("Loan status")
ax[1].get_legend().remove()

Fig. 3.1. Number of loan applications by year.

The next EDA plots and tables will be used to compare distributions of variables over the years. In a few cases, monthly trends will also be explored.

EDA: Financial, Risk-Related and Other Indicators

ct = an.CrossTab("year", "employment_length", data=loans_eda)
ct.barplot(colormap="Paired")
ct.barplot(
    normalize="rows", stacked=True, xlabel="Year", legend=False, colormap="Paired"
)
del ct

With “< 1 year” group removed:

ct = an.CrossTab(
    "year", "employment_length", data=loans_eda.query("employment_length != '< 1 year'")
)
ct.barplot(colormap="Paired")
ct.barplot(
    normalize="rows", stacked=True, xlabel="Year", legend=False, colormap="Paired"
)
del ct

ct = an.CrossTab("month", "employment_length", data=loans_eda)
ct.barplot(colormap="Paired")
ct.barplot(
    normalize="rows", stacked=True, xlabel="Month", legend=False, colormap="Paired"
)

del ct

With “< 1 year” group removed:

ct = an.CrossTab(
    "month",
    "employment_length",
    data=loans_eda.query("employment_length != '< 1 year'"),
)
ct.barplot(colormap="Paired")
ct.barplot(
    normalize="rows", stacked=True, xlabel="Month", legend=False, colormap="Paired"
)

del ct

ct = an.CrossTab("month", "employment_length_is_na", data=loans_eda)
ct.barplot(colormap="Paired")
ct.barplot(
    normalize="rows", stacked=True, xlabel="Month", legend=False, colormap="Paired"
)
del ct

sns.violinplot(x="year", y="loan_amount", hue="loan_status", data=loans_eda);

sns.violinplot(
    x="year", y="loan_amount_cap_40k", hue="loan_status", data=loans_eda, bw_adjust=3
);

sns.violinplot(
    x="year", y="debt_to_income_ratio_cap_100", hue="loan_status", data=loans_eda
);

sns.violinplot(x="year", y="risk_score", hue="loan_status", data=loans_eda);

ct = an.CrossTab("month", "risk_score_is_na", data=loans_eda)
ct.barplot(colormap="Paired")
ct.barplot(
    normalize="rows", stacked=True, xlabel="Month", legend=False, colormap="Paired"
)
del ct

EDA: Title Properties

x = "year"
y = "title_len"
hue = "loan_status"

data = loans_eda[[x, y, hue]].dropna()
sns.violinplot(x=x, y=y, hue=hue, data=data)
plt.gca().set_ylim(0, 100)

del x, y, hue, data

x = "year"
y = "title_n_capital_letters"
hue = "loan_status"

data = loans_eda[[x, y, hue]].dropna()
sns.violinplot(x=x, y=y, hue=hue, data=data)
plt.gca().set_ylim(0, 10)

del x, y, hue, data

x = "year"
y = "title_n_non_capital_letters"
hue = "loan_status"

data = loans_eda[[x, y, hue]].dropna()
sns.violinplot(x=x, y=y, hue=hue, data=data)
plt.gca().set_ylim(0, 40)

del x, y, hue, data

x = "year"
y = "title_n_letters"
hue = "loan_status"

data = loans_eda[[x, y, hue]].dropna()
sns.violinplot(x=x, y=y, hue=hue, data=data)
plt.gca().set_ylim(0, 40)

del x, y, hue, data

x = "year"
y = "title_n_digits"
hue = "loan_status"

data = loans_eda[[x, y, hue]].dropna()
sns.violinplot(x=x, y=y, hue=hue, data=data)

del x, y, hue, data

x = "year"
y = "title_n_punctuation"
hue = "loan_status"

data = loans_eda[[x, y, hue]].dropna()
sns.violinplot(x=x, y=y, hue=hue, data=data)
plt.gca().set_ylim(0, 2)

del x, y, hue, data

x = "year"
y = "title_n_spaces"
hue = "loan_status"

data = loans_eda[[x, y, hue]].dropna()
sns.violinplot(x=x, y=y, hue=hue, data=data)
plt.gca().set_ylim(0, 20)

del x, y, hue, data

x = "year"
y = "title_n_words"
hue = "loan_status"

data = loans_eda[[x, y, hue]].dropna()
sns.violinplot(x=x, y=y, hue=hue, data=data)
plt.gca().set_ylim(0, 13)

del x, y, hue, data

It seems that loan title got standardized in 2018, as all-lower-case titles are not present in 2018:

ct = an.CrossTab("year", "title_in_lower_case", data=loans_eda)
ct.barplot(colormap="Paired")
ct.barplot(
    normalize="rows", stacked=True, xlabel="Year", legend=False, colormap="Paired"
)
display(ct.counts.T)
del ct

Number of loan titles in lower case by year.
year	2007	2008	2009	2010	2011	2012	2013	2014	2015	2016	2017	2018
0	156	1378	3031	1368	2055	4141	10775	26229	48439	92373	319514	694223
1	291	644	1273	7366	14662	23193	51481	124817	180984	269479	202926	0

ct = an.CrossTab("year", "title_in_upper_case", data=loans_eda)
ct.barplot(colormap="Paired")
ct.barplot(
    normalize="rows", stacked=True, xlabel="Year", legend=False, colormap="Paired"
)
del ct

ct = an.CrossTab("year", "title_in_title_case", data=loans_eda)
ct.barplot(colormap="Paired")
ct.barplot(
    normalize="rows", stacked=True, xlabel="Year", legend=False, colormap="Paired"
)
del ct

ct = an.CrossTab("year", "title_is_missing_or_empty", data=loans_eda)
ct.barplot(colormap="Paired")
ct.barplot(
    normalize="rows", stacked=True, xlabel="Year", legend=False, colormap="Paired"
)
ct.counts.assign(percent_missing_or_empty=lambda df: df[1] / (df[1] + df[0]) * 100)
del ct

EDA: Title Contents (Loan Purpose)

Here, the presence of certain words, phrases, or abbreviations in the titles will be compared over the years explored.

ct = an.CrossTab("year", "credit", data=loans_eda)
ct.barplot(colormap="Paired")
ct.barplot(
    normalize="rows", stacked=True, xlabel="Year", legend=False, colormap="Paired"
)
del ct

ct = an.CrossTab("year", "card", data=loans_eda)
ct.barplot(colormap="Paired")
ct.barplot(
    normalize="rows", stacked=True, xlabel="Year", legend=False, colormap="Paired"
)
del ct

Refinancing cases seem to increase in 2018.

ct = an.CrossTab("year", "refinancing", data=loans_eda)
ct.barplot(colormap="Paired")
ct.barplot(
    normalize="rows", stacked=True, xlabel="Year", legend=False, colormap="Paired"
)
del ct

ct = an.CrossTab("year", "consolidation", data=loans_eda)
ct.barplot(colormap="Paired")
ct.barplot(
    normalize="rows", stacked=True, xlabel="Year", legend=False, colormap="Paired"
)
del ct

ct = an.CrossTab("year", "debt_related", data=loans_eda)
ct.barplot(colormap="Paired")
ct.barplot(
    normalize="rows", stacked=True, xlabel="Year", legend=False, colormap="Paired"
)
del ct

ct = an.CrossTab("year", "bills_taxes", data=loans_eda)
ct.barplot(colormap="Paired")
ct.barplot(
    normalize="rows", stacked=True, xlabel="Year", legend=False, colormap="Paired"
)
del ct

ct = an.CrossTab("year", "payoff", data=loans_eda)
ct.barplot(colormap="Paired")
ct.barplot(
    normalize="rows", stacked=True, xlabel="Year", legend=False, colormap="Paired"
)
del ct

ct = an.CrossTab("year", "home_upgrade", data=loans_eda)
ct.barplot(colormap="Paired")
ct.barplot(
    normalize="rows", stacked=True, xlabel="Year", legend=False, colormap="Paired"
)
del ct

ct = an.CrossTab("year", "home_related", data=loans_eda)
ct.barplot(colormap="Paired")
ct.barplot(
    normalize="rows", stacked=True, xlabel="Year", legend=False, colormap="Paired"
)
del ct

ct = an.CrossTab("year", "home_buying", data=loans_eda)
ct.barplot(colormap="Paired")
ct.barplot(
    normalize="rows", stacked=True, xlabel="Year", legend=False, colormap="Paired"
)
del ct

ct = an.CrossTab("year", "fixing", data=loans_eda)
ct.barplot(colormap="Paired")
ct.barplot(
    normalize="rows", stacked=True, xlabel="Year", legend=False, colormap="Paired"
)
del ct

ct = an.CrossTab("year", "major_purchase_unspecified", data=loans_eda)
ct.barplot(colormap="Paired")
ct.barplot(
    normalize="rows", stacked=True, xlabel="Year", legend=False, colormap="Paired"
)
del ct

ct = an.CrossTab("year", "relocation", data=loans_eda)
ct.barplot(colormap="Paired")
ct.barplot(
    normalize="rows", stacked=True, xlabel="Year", legend=False, colormap="Paired"
)
del ct

ct = an.CrossTab("year", "weddings", data=loans_eda)
ct.barplot(colormap="Paired")
ct.barplot(
    normalize="rows", stacked=True, xlabel="Year", legend=False, colormap="Paired"
)
del ct

ct = an.CrossTab("year", "car", data=loans_eda)
ct.barplot(colormap="Paired")
ct.barplot(
    normalize="rows", stacked=True, xlabel="Year", legend=False, colormap="Paired"
)
del ct

ct = an.CrossTab("year", "motorcycle", data=loans_eda)
ct.barplot(colormap="Paired")
ct.barplot(
    normalize="rows", stacked=True, xlabel="Year", legend=False, colormap="Paired"
)
del ct

ct = an.CrossTab("year", "vehicle_unspecified_or_other", data=loans_eda)
ct.barplot(colormap="Paired")
ct.barplot(
    normalize="rows", stacked=True, xlabel="Year", legend=False, colormap="Paired"
)
del ct

ct = an.CrossTab("year", "medical_expenses", data=loans_eda)
ct.barplot(colormap="Paired")
ct.barplot(
    normalize="rows", stacked=True, xlabel="Year", legend=False, colormap="Paired"
)
del ct

ct = an.CrossTab("year", "education", data=loans_eda)
ct.barplot(colormap="Paired")
ct.barplot(
    normalize="rows", stacked=True, xlabel="Year", legend=False, colormap="Paired"
)
del ct

ct = an.CrossTab("year", "investment", data=loans_eda)
ct.barplot(colormap="Paired")
ct.barplot(
    normalize="rows", stacked=True, xlabel="Year", legend=False, colormap="Paired"
)
del ct

ct = an.CrossTab("year", "vocation", data=loans_eda)
ct.barplot(colormap="Paired")
ct.barplot(
    normalize="rows", stacked=True, xlabel="Year", legend=False, colormap="Paired"
)
del ct

ct = an.CrossTab("year", "renewable_energy", data=loans_eda)
ct.barplot(colormap="Paired")
ct.barplot(
    normalize="rows", stacked=True, xlabel="Year", legend=False, colormap="Paired"
)
del ct

3.4 EDA of Most Recent Year (2018)

Based on the EDA of all years, I decided to focus the analysis only on data from the most recent year (2018).

3.4.1 General EDA

There are 694.2 thousand cases in the EDA subset. The general EDA suggested that the following variables can be excluded:

as irrelevant:
- date
- year
- policy_code
- debt_to_income_ratio_is_na_original (created for EDA only)
- debt_to_income_ratio_orig_cap_100 (created for EDA only)
- debt_to_income_ratio_original (created for EDA only)
constant:
- zip_code_is_na
- title_is_missing_or_empty
- title_n_digits
- title_n_punctuation
- title_in_upper_case
- title_in_lower_case
- bills_taxes
- payoff
- weddings
- motorcycle
- vehicle_unspecified_or_other
- investment
almost constant:
- education
- debt_to_income_ratio_is_na
- renewable_energy

Find the details below.

Code

file_path = dir_interim + "task-1-loans_eda.feather"

if os.path.exists(file_path):
    # Load from file, if present.
    loans_eda = pd.read_feather(file_path)

Code

loans_eda_2018 = loans_eda.query("year == 2018")
print(f"{loans_eda_2018.shape[0]/1e3:,.1f}k rows for year 2018")

694.2k rows for year 2018

Code

loans_eda_2018.info(verbose=False)

<class 'pandas.core.frame.DataFrame'>
Index: 694223 entries, 19248942 to 12767628
Columns: 60 entries, loan_status to renewable_energy
dtypes: Int16(3), Int8(36), category(2), datetime64[ns](1), float32(18)
memory usage: 113.2 MB

Code

an.col_info(loans_eda_2018, style=True)

	column	data_type	memory_size	n_unique	p_unique	n_missing	p_missing	n_dominant	p_dominant	p_dom_excl_na	dominant
1	loan_status	Int8	1.4 MB	2	<0.1%	0	0%	659,144	94.9%	94.9%	0
2	policy_code	Int8	1.4 MB	3	<0.1%	0	0%	656,360	94.5%	94.5%	0
3	date	datetime64[ns]	5.6 MB	365	0.1%	0	0%	5,781	0.8%	0.8%	2018-10-01 00:00:00
4	year	Int16	2.1 MB	1	<0.1%	0	0%	694,223	100.0%	100.0%	2018
5	month	Int8	1.4 MB	12	<0.1%	0	0%	66,932	9.6%	9.6%	10
6	month_sin	float32	2.8 MB	8	<0.1%	0	0%	132,879	19.1%	19.1%	-0.8660254
7	month_cos	float32	2.8 MB	8	<0.1%	0	0%	128,343	18.5%	18.5%	-0.8660254
8	loan_amount	float32	2.8 MB	1,849	0.3%	0	0%	114,672	16.5%	16.5%	10000.0
9	loan_amount_log	float32	2.8 MB	1,849	0.3%	0	0%	114,672	16.5%	16.5%	9.210441
10	loan_amount_cap_40k	float32	2.8 MB	1,550	0.2%	0	0%	114,672	16.5%	16.5%	10000.0
11	loan_amount_above_40k	Int8	1.4 MB	2	<0.1%	0	0%	691,679	99.6%	99.6%	0
12	debt_to_income_ratio	float32	2.8 MB	24,784	3.6%	4,556	0.7%	63,399	9.1%	9.2%	100.0
13	debt_to_income_ratio_cap_100	float32	2.8 MB	9,752	1.4%	4,556	0.7%	83,910	12.1%	12.2%	100.0
14	debt_to_income_ratio_is_na	Int8	1.4 MB	2	<0.1%	0	0%	689,667	99.3%	99.3%	0
15	debt_to_income_ratio_original	float32	2.8 MB	24,785	3.6%	85	<0.1%	63,399	9.1%	9.1%	100.0
16	debt_to_income_ratio_orig_cap_100	float32	2.8 MB	9,753	1.4%	85	<0.1%	83,910	12.1%	12.1%	100.0
17	debt_to_income_ratio_is_na_original	Int8	1.4 MB	2	<0.1%	0	0%	694,138	>99.9%	>99.9%	0
18	risk_score	float32	2.8 MB	462	0.1%	613,924	88.4%	2,555	0.4%	3.2%	682.0
19	risk_score_is_na	Int8	1.4 MB	2	<0.1%	0	0%	613,924	88.4%	88.4%	1
20	employment_length	category	695.2 kB	11	<0.1%	23,447	3.4%	600,178	86.5%	89.5%	< 1 year
21	employment_length_num	Int8	1.4 MB	11	<0.1%	23,447	3.4%	600,178	86.5%	89.5%	0
22	employment_length_is_na	Int8	1.4 MB	2	<0.1%	0	0%	670,776	96.6%	96.6%	0
23	state	category	699.3 kB	51	<0.1%	0	0%	78,026	11.2%	11.2%	CA
24	zip_area	Int16	2.1 MB	10	<0.1%	0	0%	117,537	16.9%	16.9%	3
25	zip_code	Int16	2.1 MB	944	0.1%	0	0%	6,719	1.0%	1.0%	770
26	zip_code_is_na	Int8	1.4 MB	1	<0.1%	0	0%	694,223	100.0%	100.0%	0
27	title_is_missing_or_empty	Int8	1.4 MB	1	<0.1%	0	0%	694,223	100.0%	100.0%	0
28	title_len	float32	2.8 MB	10	<0.1%	0	0%	289,103	41.6%	41.6%	18.0
29	title_n_letters	float32	2.8 MB	10	<0.1%	0	0%	289,103	41.6%	41.6%	17.0
30	title_n_capital_letters	float32	2.8 MB	2	<0.1%	0	0%	689,755	99.4%	99.4%	1.0
31	title_n_non_capital_letters	float32	2.8 MB	11	<0.1%	0	0%	289,103	41.6%	41.6%	16.0
32	title_n_digits	float32	2.8 MB	1	<0.1%	0	0%	694,223	100.0%	100.0%	0.0
33	title_n_punctuation	float32	2.8 MB	1	<0.1%	0	0%	694,223	100.0%	100.0%	0.0
34	title_n_spaces	float32	2.8 MB	3	<0.1%	0	0%	415,081	59.8%	59.8%	1.0
35	title_n_words	float32	2.8 MB	3	<0.1%	0	0%	415,081	59.8%	59.8%	2.0
36	title_in_title_case	Int8	1.4 MB	2	<0.1%	0	0%	543,485	78.3%	78.3%	0
37	title_in_upper_case	Int8	1.4 MB	1	<0.1%	0	0%	694,223	100.0%	100.0%	0
38	title_in_lower_case	Int8	1.4 MB	1	<0.1%	0	0%	694,223	100.0%	100.0%	0
39	credit	Int8	1.4 MB	2	<0.1%	0	0%	575,160	82.8%	82.8%	0
40	card	Int8	1.4 MB	2	<0.1%	0	0%	575,160	82.8%	82.8%	0
41	refinancing	Int8	1.4 MB	2	<0.1%	0	0%	575,160	82.8%	82.8%	0
42	consolidation	Int8	1.4 MB	2	<0.1%	0	0%	405,120	58.4%	58.4%	0
43	debt_related	Int8	1.4 MB	2	<0.1%	0	0%	405,120	58.4%	58.4%	0
44	bills_taxes	Int8	1.4 MB	1	<0.1%	0	0%	694,223	100.0%	100.0%	0
45	payoff	Int8	1.4 MB	1	<0.1%	0	0%	694,223	100.0%	100.0%	0
46	home_upgrade	Int8	1.4 MB	2	<0.1%	0	0%	676,116	97.4%	97.4%	0
47	home_related	Int8	1.4 MB	2	<0.1%	0	0%	649,006	93.5%	93.5%	0
48	home_buying	Int8	1.4 MB	2	<0.1%	0	0%	667,113	96.1%	96.1%	0
49	fixing	Int8	1.4 MB	2	<0.1%	0	0%	676,116	97.4%	97.4%	0
50	major_purchase_unspecified	Int8	1.4 MB	2	<0.1%	0	0%	668,980	96.4%	96.4%	0
51	relocation	Int8	1.4 MB	2	<0.1%	0	0%	680,419	98.0%	98.0%	0
52	weddings	Int8	1.4 MB	1	<0.1%	0	0%	694,223	100.0%	100.0%	0
53	car	Int8	1.4 MB	2	<0.1%	0	0%	541,696	78.0%	78.0%	0
54	motorcycle	Int8	1.4 MB	1	<0.1%	0	0%	694,223	100.0%	100.0%	0
55	vehicle_unspecified_or_other	Int8	1.4 MB	1	<0.1%	0	0%	694,223	100.0%	100.0%	0
56	medical_expenses	Int8	1.4 MB	2	<0.1%	0	0%	677,389	97.6%	97.6%	0
57	education	Int8	1.4 MB	2	<0.1%	0	0%	694,218	>99.9%	>99.9%	0
58	investment	Int8	1.4 MB	1	<0.1%	0	0%	694,223	100.0%	100.0%	0
59	vocation	Int8	1.4 MB	2	<0.1%	0	0%	688,563	99.2%	99.2%	0
60	renewable_energy	Int8	1.4 MB	2	<0.1%	0	0%	693,471	99.9%	99.9%	0

Code

# Columns to exclude
# (loan_title and loan_title_unified are already excluded)

to_exclude_1 = str_to_list(
    """
date
year
policy_code
debt_to_income_ratio_is_na_original
debt_to_income_ratio_orig_cap_100
debt_to_income_ratio_original
zip_code_is_na
title_is_missing_or_empty
title_n_digits
title_n_punctuation
title_in_upper_case
title_in_lower_case
bills_taxes
payoff
weddings
motorcycle
vehicle_unspecified_or_other
investment
education
renewable_energy
"""
)

Code

if do_eda:
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")

        report_training_2018 = sweetviz.analyze(
            [loans_eda_2018.drop(columns=to_exclude_1), "Training data"],
            target_feat="loan_status",
            pairwise_analysis="off",
        )
        report_training_2018.show_notebook()

3.4.2 EDA of Job and Finances-Related Variables

The loan amount is skewed. There are some loans larger than 40k USD which is the maximum amount allowed by Lending Club. It might be beneficial to create a too-large loan indicator variable. For some predictive models, it might be beneficial to cap the loan amount at 40k USD.
The debt-to-income ratio is also right skewed. In the original version of this variable, missing values were present only in the accepted loans data while -1 values were present only in the rejected loans. Thus, both -1 and missing values were treated as missing values in the pre-processed version of this variable.
There were no accepted loans with risk scores below 662 and above 847.5. According to the literature, the upper bound of the risk score (FICO) is 850. So the values above this threshold are probably errors. The mean of rejected loans was lower than the mean of accepted loans.
Risk scores were missing only in the rejected loans’ data.
The state of Iowa (IA) is underrepresented: there were only 3 cases in the EDA subset. So variable “states” will be dropped and instead, 10 national areas based on the first digit of the ZIP code will be used.

Find all EDA results below.

Code

def plot_kde(data, x, hue, bw_adjust=3, rug=True, common_norm=False, **kwargs):
    """Plot a kernel density estimate (KDE) plot with a rug plot."""
    return sns.displot(
        data,
        x=x,
        hue=hue,
        kind="kde",
        fill=True,
        height=3,
        aspect=3,
        common_norm=common_norm,
        bw_adjust=bw_adjust,
        rug=rug,
        **kwargs,
    )


def describe_numeric(data, x, by):
    """Describe numeric data by a categorical variable."""
    return data.groupby(hue)[x].describe().style.format("{:.1f}")

Code

ct = an.CrossTab("year", "employment_length", data=loans_eda_2018)
ct.barplot(colormap="Paired")

ct2 = an.CrossTab(
    "year",
    "employment_length",
    data=loans_eda_2018.query("employment_length != '< 1 year'"),
)
ct2.barplot(colormap="Paired")

del ct, ct2

Code

ct = an.CrossTab("year", "loan_status", data=loans_eda_2018)
ct.barplot(colormap="Paired")
del ct

Code

ct = an.CrossTab("month", "loan_status", data=loans_eda_2018)
ct.barplot(colormap="Paired")
ct.barplot(
    normalize="rows", stacked=True, xlabel="Month", legend=False, colormap="Paired"
)
display(ct.counts.assign(percent_accepted=lambda df: df[1] / (df[1] + df[0]) * 100).T)
del ct

month	1	2	3	4	5	6	7	8	9	10	11	12
loan_status
0	46628.00	38720.00	44958.00	51412.00	62477.00	55874.00	59517.00	62630.00	57023.00	63762.00	58123.00	58020.00
1	2539.00	2253.00	2779.00	2989.00	3262.00	3002.00	3087.00	3317.00	2858.00	3170.00	3022.00	2801.00
percent_accepted	5.16	5.50	5.82	5.49	4.96	5.10	4.93	5.03	4.77	4.74	4.94	4.61

Code

ct = an.CrossTab("loan_amount_above_40k", "loan_status", data=loans_eda_2018)
ct.barplot(colormap="Paired")
display(ct.counts.assign(percent_accepted=lambda df: df[1] / (df[1] + df[0]) * 100).T)
del ct

loan_amount_above_40k	0	1
loan_status
0	656600.00	2544.00
1	35079.00	0.00
percent_accepted	5.07	0.00

Code

x_1 = "loan_amount"
x_2 = "loan_amount_cap_40k"
x_3 = "loan_amount_log"
hue = "loan_status"
data = loans_eda_2018

plot_kde(data, x_1, hue)
plot_kde(data, x_2, hue)
plot_kde(data, x_3, hue)
display(describe_numeric(data, x_1, by=hue))
del x_1, x_2, x_3, hue, data

	count	mean	std	min	25%	50%	75%	max
loan_status
0	659144.0	13059.5	13528.5	100.0	5000.0	10000.0	20000.0	300000.0
1	35079.0	15966.1	10082.5	1000.0	8000.0	14000.0	21600.0	40000.0

In the original debt-to-income ratio variable (before treating -1 as a missing value), all the missing values were in the accepted loans data only, and all the -1 values were in the rejected loans data only.

Code

ct1 = an.CrossTab(
    "debt_to_income_ratio_is_minus_1",
    "loan_status",
    data=loans_eda_2018.assign(
        debt_to_income_ratio_is_minus_1=loans_eda_2018.debt_to_income_ratio_original
        == -1
    ),
)
display(ct1.counts)

ct2 = an.CrossTab(
    "debt_to_income_ratio_is_na_original", "loan_status", data=loans_eda_2018
)
display(ct2.counts)

del ct1, ct2

loan_status	0	1
debt_to_income_ratio_is_minus_1
False	654673	35079
True	4471	0

loan_status	0	1
debt_to_income_ratio_is_na_original
0	659144	34994
1	0	85

After treating -1 as a missing value, the distribution of the debt-to-income ratio is following:

Code

ct = an.CrossTab("debt_to_income_ratio_is_na", "loan_status", data=loans_eda_2018)
ct.barplot(normalize="rows", stacked=True, colormap="Paired")
display(ct.counts.assign(percent_accepted=lambda df: df[1] / (df[1] + df[0]) * 100))

ct2 = an.CrossTab("year", "debt_to_income_ratio_is_na", data=loans_eda_2018)
display(ct2.counts.assign(percent_missing=lambda df: df[1] / (df[1] + df[0]) * 100))

del ct, ct2

loan_status	0	1	percent_accepted
debt_to_income_ratio_is_na
0	654673	34994	5.07
1	4471	85	1.87

debt_to_income_ratio_is_na	0	1	percent_missing
year
2018	689667	4556	0.66

Plots with (a) full range of debt-to-income ratio and (b) range between 0 and 100:

Code

x = "debt_to_income_ratio"
hue = "loan_status"
data = loans_eda_2018

describe_numeric(data, x, by=hue)
plot_kde(data, x, hue)

plot_kde(data, "debt_to_income_ratio_cap_100", hue, bw_adjust=1);

Code

ct = an.CrossTab("employment_length", "loan_status", data=loans_eda_2018)
ct.barplot(colormap="Paired")
ct.barplot(normalize="rows", stacked=True, legend=False, colormap="Paired")
display(ct.counts.assign(percent_accepted=lambda df: df[1] / (df[1] + df[0]) * 100).T)
del ct

employment_length	< 1 year	1 year	2 years	3 years	4 years	5 years	6 years	7 years	8 years	9 years	10+ years
loan_status
0	597146.00	8644.00	4968.00	4995.00	2975.00	7271.00	1419.00	936.00	1751.00	1123.00	7513.00
1	3032.00	2321.00	3151.00	2812.00	2273.00	2199.00	1554.00	1280.00	1111.00	889.00	11413.00
percent_accepted	0.51	21.17	38.81	36.02	43.31	23.22	52.27	57.76	38.82	44.18	60.30

Code

ct = an.CrossTab("employment_length_is_na", "loan_status", data=loans_eda_2018)
ct.barplot(normalize="rows", stacked=True, legend=False, colormap="Paired")
display(ct.counts.assign(percent_accepted=lambda df: df[1] / (df[1] + df[0]) * 100))

ct2 = an.CrossTab("year", "employment_length_is_na", data=loans_eda_2018)
display(ct2.counts.assign(percent_missing=lambda df: df[1] / (df[1] + df[0]) * 100))

del ct, ct2

loan_status	0	1	percent_accepted
employment_length_is_na
0	638741	32035	4.78
1	20403	3044	12.98

employment_length_is_na	0	1	percent_missing
year
2018	670776	23447	3.38

Code

plot_kde(loans_eda_2018, "risk_score", "loan_status")
describe_numeric(loans_eda_2018, "risk_score", "loan_status")

	count	mean	std	min	25%	50%	75%	max
loan_status
0	45220.0	629.3	62.8	300.0	587.0	629.0	667.0	990.0
1	35079.0	708.4	36.2	662.0	682.0	702.0	727.0	847.5

Code

ct = an.CrossTab("risk_score_is_na", "loan_status", data=loans_eda_2018)
ct.barplot(normalize="rows", stacked=True, colormap="Paired")
display(ct.counts.assign(percent_accepted=lambda df: df[1] / (df[1] + df[0]) * 100))

ct2 = an.CrossTab("year", "risk_score_is_na", data=loans_eda_2018)
display(ct2.counts.assign(percent_missing=lambda df: df[1] / (df[1] + df[0]) * 100))

del ct, ct2

loan_status	0	1	percent_accepted
risk_score_is_na
0	45220	35079	43.69
1	613924	0	0.00

risk_score_is_na	0	1	percent_missing
year
2018	80299	613924	88.43

Code

ct = an.CrossTab("state", "loan_status", data=loans_eda_2018)
ct.barplot(normalize="rows", stacked=True, colormap="Paired")
display(
    ct.counts.assign(
        percent_accepted=lambda df: df[1] / (df[1] + df[0]) * 100
    ).sort_values("percent_accepted", ascending=False)
)
del ct

loan_status	0	1	percent_accepted
state
MN	8677	608	6.55
CT	8389	586	6.53
CO	11236	776	6.46
VT	1198	80	6.26
CA	73260	4766	6.11
NH	2876	181	5.92
NY	44559	2793	5.90
AZ	14218	882	5.84
NJ	19980	1225	5.78
MD	14052	851	5.71
WY	1104	66	5.64
OR	7108	424	5.63
MA	12834	765	5.63
NV	7870	466	5.59
WA	11853	698	5.56
IL	23499	1381	5.55
NE	3268	184	5.33
KS	5506	302	5.20
RI	2712	147	5.14
WV	3688	199	5.12
WI	8651	466	5.11
VA	16392	875	5.07
MT	1703	90	5.02
NM	4031	207	4.88
MI	18411	932	4.82
FL	53292	2659	4.75
ND	1437	70	4.64
AK	1398	68	4.64
TX	60780	2938	4.61
UT	4448	214	4.59
PA	24741	1188	4.58
MO	11797	562	4.55
ID	2690	127	4.51
OH	24178	1139	4.50
ME	2427	114	4.49
NC	20666	955	4.42
SD	1369	62	4.33
GA	26588	1199	4.31
IN	12853	572	4.26
OK	7276	313	4.12
SC	11573	494	4.09
TN	14114	596	4.05
DC	1193	49	3.95
HI	3439	141	3.94
KY	8303	338	3.91
DE	2337	90	3.71
LA	9776	374	3.68
AL	11261	422	3.61
AR	6911	233	3.26
MS	7219	212	2.85
IA	3	0	0.00

Code

ct = an.CrossTab("zip_area", "loan_status", data=loans_eda_2018)
ct.barplot(normalize="rows", stacked=True, colormap="Paired")
display(
    ct.counts.assign(
        percent_accepted=lambda df: df[1] / (df[1] + df[0]) * 100
    ).sort_values("percent_accepted", ascending=False)
)
del ct

loan_status	0	1	percent_accepted
zip_area
9	96796	6099	5.93
0	50506	3107	5.80
8	45585	2739	5.67
5	21901	1293	5.57
1	71819	4067	5.36
6	44024	2428	5.23
2	67693	3431	4.82
4	63728	2977	4.46
7	84644	3849	4.35
3	112448	5089	4.33

3.4.3 Relationships between Variables

Hierarchical clustering can be performed on a correlation matrix to cluster variables with similar correlation patterns (these clusters can be identified as squares created of 2 or more variables along the diagonal). Absolute values of correlation coefficients make it easier to identify the clusters. Thus, Figure 3.7 helped to identify variables that are highly correlated with each other and manually populate the list of variables to exclude from the analysis.

Looking into correlations of variables to the target, variables risk_score_is_na, risk_score, employment_length_num indicate the highest correlation to the target (Table 3.2).

Code

data_num = loans_eda_2018.drop(columns=to_exclude_1).select_dtypes("number")
corr_coefs = data_num.corr(method="pearson")
corr_coefs_abs = corr_coefs.fillna(0).abs()

g = sns.clustermap(
    corr_coefs_abs,
    method="ward",
    cmap="Greens",
    annot=True,
    annot_kws={"size": 8},
    vmin=0,
    vmax=1,
    figsize=(14, 11),
    cbar_pos=(0.94, 0.91, 0.03, 0.1),
    cbar_kws={"location": "right"},
    dendrogram_ratio=(0.075, 0),
)

g.fig.suptitle(
    "Absolute Values of Pearson Correlation (Matrix with Hierarchical Clustering)",
    fontsize=17,
    y=1.03,
    x=0.55,
);

Fig. 3.7. Matrix of absolute values of Pearson correlation coefficients.

The list of variables to exclude from the analysis based on this and previous explorations:

Code

to_exclude_2 = str_to_list(
    """
date
year
policy_code
debt_to_income_ratio_is_na_original
debt_to_income_ratio_orig_cap_100
debt_to_income_ratio_original
zip_code_is_na
title_is_missing_or_empty
title_n_digits
title_n_punctuation
title_in_upper_case
title_in_lower_case
bills_taxes
payoff
weddings
motorcycle
vehicle_unspecified_or_other
investment
education   
renewable_energy
title_n_spaces
title_n_non_capital_letters
title_n_letters
title_in_title_case
fixing
credit
card
debt_related
zip_code
state
employment_length
loan_amount_log
month
"""
)

an.get_pointbiserial_corr_scores(
    loans_eda_2018.drop(columns=to_exclude_2), target="loan_status"
)
# Time: 2m 41.0s

Table 3.2. Point-biserial correlation coefficients between the variables and the target.

	variable_1	variable_2	n	r_pb	p	p_adj
1	loan_status	risk_score_is_na	694223	-0.638	<0.001	<0.001
2	loan_status	risk_score	80299	0.596	<0.001	<0.001
3	loan_status	employment_length_num	670776	0.589	<0.001	<0.001
4	loan_status	debt_to_income_ratio_cap_100	689667	-0.093	<0.001	<0.001
5	loan_status	title_len	694223	0.092	<0.001	<0.001
6	loan_status	title_n_words	694223	0.070	<0.001	<0.001
7	loan_status	employment_length_is_na	694223	0.068	<0.001	<0.001
8	loan_status	loan_amount_cap_40k	694223	0.065	<0.001	<0.001
9	loan_status	home_upgrade	694223	0.060	<0.001	<0.001
10	loan_status	consolidation	694223	0.051	<0.001	<0.001
11	loan_status	refinancing	694223	0.051	<0.001	<0.001
12	loan_status	loan_amount	694223	0.048	<0.001	<0.001
13	loan_status	home_buying	694223	-0.034	<0.001	<0.001
14	loan_status	car	694223	0.025	<0.001	<0.001
15	loan_status	relocation	694223	-0.023	<0.001	<0.001
16	loan_status	title_n_capital_letters	694223	-0.019	<0.001	<0.001
17	loan_status	medical_expenses	694223	-0.016	<0.001	<0.001
18	loan_status	major_purchase_unspecified	694223	-0.016	<0.001	<0.001
19	loan_status	loan_amount_above_40k	694223	-0.014	<0.001	<0.001
20	loan_status	home_related	694223	0.012	<0.001	<0.001
21	loan_status	debt_to_income_ratio_is_na	694223	-0.012	<0.001	<0.001
22	loan_status	month_sin	694223	0.011	<0.001	<0.001
23	loan_status	debt_to_income_ratio	689667	-0.010	<0.001	<0.001
24	loan_status	zip_area	694223	0.007	<0.001	<0.001
25	loan_status	month_cos	694223	-0.003	0.014	0.028
26	loan_status	vocation	694223	-0.003	0.024	0.028

The last check before modeling:

Code

an.col_info(loans_eda_2018.drop(columns=to_exclude_2), style=True)

	column	data_type	memory_size	n_unique	p_unique	n_missing	p_missing	n_dominant	p_dominant	p_dom_excl_na	dominant
1	loan_status	Int8	1.4 MB	2	<0.1%	0	0%	659,144	94.9%	94.9%	0
2	month_sin	float32	2.8 MB	8	<0.1%	0	0%	132,879	19.1%	19.1%	-0.8660254
3	month_cos	float32	2.8 MB	8	<0.1%	0	0%	128,343	18.5%	18.5%	-0.8660254
4	loan_amount	float32	2.8 MB	1,849	0.3%	0	0%	114,672	16.5%	16.5%	10000.0
5	loan_amount_cap_40k	float32	2.8 MB	1,550	0.2%	0	0%	114,672	16.5%	16.5%	10000.0
6	loan_amount_above_40k	Int8	1.4 MB	2	<0.1%	0	0%	691,679	99.6%	99.6%	0
7	debt_to_income_ratio	float32	2.8 MB	24,784	3.6%	4,556	0.7%	63,399	9.1%	9.2%	100.0
8	debt_to_income_ratio_cap_100	float32	2.8 MB	9,752	1.4%	4,556	0.7%	83,910	12.1%	12.2%	100.0
9	debt_to_income_ratio_is_na	Int8	1.4 MB	2	<0.1%	0	0%	689,667	99.3%	99.3%	0
10	risk_score	float32	2.8 MB	462	0.1%	613,924	88.4%	2,555	0.4%	3.2%	682.0
11	risk_score_is_na	Int8	1.4 MB	2	<0.1%	0	0%	613,924	88.4%	88.4%	1
12	employment_length_num	Int8	1.4 MB	11	<0.1%	23,447	3.4%	600,178	86.5%	89.5%	0
13	employment_length_is_na	Int8	1.4 MB	2	<0.1%	0	0%	670,776	96.6%	96.6%	0
14	zip_area	Int16	2.1 MB	10	<0.1%	0	0%	117,537	16.9%	16.9%	3
15	title_len	float32	2.8 MB	10	<0.1%	0	0%	289,103	41.6%	41.6%	18.0
16	title_n_capital_letters	float32	2.8 MB	2	<0.1%	0	0%	689,755	99.4%	99.4%	1.0
17	title_n_words	float32	2.8 MB	3	<0.1%	0	0%	415,081	59.8%	59.8%	2.0
18	refinancing	Int8	1.4 MB	2	<0.1%	0	0%	575,160	82.8%	82.8%	0
19	consolidation	Int8	1.4 MB	2	<0.1%	0	0%	405,120	58.4%	58.4%	0
20	home_upgrade	Int8	1.4 MB	2	<0.1%	0	0%	676,116	97.4%	97.4%	0
21	home_related	Int8	1.4 MB	2	<0.1%	0	0%	649,006	93.5%	93.5%	0
22	home_buying	Int8	1.4 MB	2	<0.1%	0	0%	667,113	96.1%	96.1%	0
23	major_purchase_unspecified	Int8	1.4 MB	2	<0.1%	0	0%	668,980	96.4%	96.4%	0
24	relocation	Int8	1.4 MB	2	<0.1%	0	0%	680,419	98.0%	98.0%	0
25	car	Int8	1.4 MB	2	<0.1%	0	0%	541,696	78.0%	78.0%	0
26	medical_expenses	Int8	1.4 MB	2	<0.1%	0	0%	677,389	97.6%	97.6%	0
27	vocation	Int8	1.4 MB	2	<0.1%	0	0%	688,563	99.2%	99.2%	0

3.5 Modeling (Year 2018)

3.5.1 Prepare Training, Validation, and Test Sets

Code

path_train = dir_interim + "task-1-loans_train.feather"
path_validation = dir_interim + "task-1-loans_validation.feather"
path_test = dir_interim + "task-1-loans_test.feather"

if (
    os.path.exists(path_train)
    and os.path.exists(path_validation)
    and os.path.exists(path_test)
):
    # Load saved results
    loans_train = pd.read_feather(path_train)
    loans_validation = pd.read_feather(path_validation)
    loans_test = pd.read_feather(path_test)
else:
    # Save as feather files
    loans_train.to_feather(path_train)
    loans_validation.to_feather(path_validation)
    loans_test.to_feather(path_test)

Code

# Filter the data for year 2018 only
loans_train_2018 = loans_train.query("year == 2018")
loans_validation_2018 = loans_validation.query("year == 2018")
loans_test_2018 = loans_test.query("year == 2018")

# Create X_train, y_train, X_validation, y_validation, X_test, and y_test
X_train = loans_train_2018.drop(columns=to_exclude_2 + ["loan_status"])
y_train = loans_train_2018["loan_status"]

X_validation = loans_validation_2018.drop(columns=to_exclude_2 + ["loan_status"])
y_validation = loans_validation_2018["loan_status"]

X_test = loans_test_2018.drop(columns=to_exclude_2 + ["loan_status"])
y_test = loans_test_2018["loan_status"]

Code

# Remove some unnecessary variables
del (
    loans_train_2018,
    loans_validation_2018,
    loans_test_2018,
    loans_train,
    loans_validation,
    loans_test,
)

if "loans_eda_2018" in locals():
    del loans_eda_2018

if "loans_eda" in locals():
    del loans_eda

3.5.2 Create Pre-Processing Pipelines

Some steps of group-independent pre-processing have already been done in the previous chapters and will not be repeated here. Here, pipelines for group-dependent processing will be created.

Suffix _lr means that the pipeline (or other object) is for non-tree-based models (logistic regression and Naive Bayes) and _trees means that it is for tree-based models (LGBM):

_lr pipeline contains one extra step: numeric data scaling.

numeric_common = str_to_list(
    """
risk_score
employment_length_num
title_len
title_n_capital_letters
title_n_words
"""
)

# Impute median, scale:
numeric_features_for_lr = (
    str_to_list(
        """
loan_amount_cap_40k
debt_to_income_ratio_cap_100
"""
    )
    + numeric_common
)

exclude_for_lr = str_to_list(
    """
loan_amount
debt_to_income_ratio
"""
)

# Impute median:
numeric_features_for_trees = (
    str_to_list(
        """
loan_amount
debt_to_income_ratio
"""
    )
    + numeric_common
)

exclude_for_trees = str_to_list(
    """
loan_amount_cap_40k
debt_to_income_ratio_cap_100
"""
)

# One-Hot Encoding:
categorical_features = ["zip_area"]

# Impute 0:
binary_features = str_to_list(
    """
loan_amount_above_40k
refinancing
consolidation
home_upgrade
home_related
home_buying
major_purchase_unspecified
relocation
car
medical_expenses
vocation
"""
)

# Pass Through
indicators = str_to_list(
    """
debt_to_income_ratio_is_na
employment_length_is_na
risk_score_is_na
"""
)

Group-independent steps that can be performed before cross-validation:

select_binary_01 = make_column_selector(
    pattern=create_column_selector_pattern(binary_features)
)
select_categorical = make_column_selector(
    pattern=create_column_selector_pattern(categorical_features)
)

# Create the pipelines
binary_transformer = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="constant", fill_value=0))]
)
categorical_transformer = Pipeline(
    steps=[("onehot", OneHotEncoder(sparse_output=False, handle_unknown="ignore"))]
)


group_independent_transforms = ColumnTransformer(
    transformers=[
        ("binary", binary_transformer, select_binary_01),
        ("categorical", categorical_transformer, select_categorical),
    ],
    verbose_feature_names_out=False,
    remainder="passthrough",
)

group_independent_preprocessor_for_lr = Pipeline(
    steps=[
        ("dropper", DropFeatures(exclude_for_lr)),
        ("transformer", group_independent_transforms),
    ]
)

group_independent_preprocessor_for_trees = Pipeline(
    steps=[
        ("dropper", DropFeatures(exclude_for_trees)),
        ("transformer", group_independent_transforms),
    ]
)

For logistic regression and Naive Bayes, loan_amount, debt_to_income_ratio will be dropped, and capped variables will be used instead:

group_independent_preprocessor_for_lr

Pipeline(steps=[('dropper',
                 DropFeatures(features_to_drop=['loan_amount',
                                                'debt_to_income_ratio'])),
                ('transformer',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('binary',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(fill_value=0,
                                                                                 strategy='constant'))]),
                                                  <sklearn.compose._column_transformer.make_column_selector object at 0x0000026409B7ABD0>),
                                                 ('categorical',
                                                  Pipeline(steps=[('onehot',
                                                                   OneHotEncoder(handle_unknown='ignore',
                                                                                 sparse_output=False))]),
                                                  <sklearn.compose._column_transformer.make_column_selector object at 0x000002640D7AFAD0>)],
                                   verbose_feature_names_out=False))])

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Pipeline

Pipeline(steps=[('dropper',
                 DropFeatures(features_to_drop=['loan_amount',
                                                'debt_to_income_ratio'])),
                ('transformer',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('binary',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(fill_value=0,
                                                                                 strategy='constant'))]),
                                                  <sklearn.compose._column_transformer.make_column_selector object at 0x0000026409B7ABD0>),
                                                 ('categorical',
                                                  Pipeline(steps=[('onehot',
                                                                   OneHotEncoder(handle_unknown='ignore',
                                                                                 sparse_output=False))]),
                                                  <sklearn.compose._column_transformer.make_column_selector object at 0x000002640D7AFAD0>)],
                                   verbose_feature_names_out=False))])

DropFeatures

DropFeatures(features_to_drop=['loan_amount', 'debt_to_income_ratio'])

transformer: ColumnTransformer

ColumnTransformer(remainder='passthrough',
                  transformers=[('binary',
                                 Pipeline(steps=[('imputer',
                                                  SimpleImputer(fill_value=0,
                                                                strategy='constant'))]),
                                 <sklearn.compose._column_transformer.make_column_selector object at 0x0000026409B7ABD0>),
                                ('categorical',
                                 Pipeline(steps=[('onehot',
                                                  OneHotEncoder(handle_unknown='ignore',
                                                                sparse_output=False))]),
                                 <sklearn.compose._column_transformer.make_column_selector object at 0x000002640D7AFAD0>)],
                  verbose_feature_names_out=False)

binary

<sklearn.compose._column_transformer.make_column_selector object at 0x0000026409B7ABD0>

SimpleImputer

SimpleImputer(fill_value=0, strategy='constant')

categorical

<sklearn.compose._column_transformer.make_column_selector object at 0x000002640D7AFAD0>

OneHotEncoder

OneHotEncoder(handle_unknown='ignore', sparse_output=False)

remainder

passthrough

passthrough

For tree-based models, non-capped versions of the same variables will be used:

group_independent_preprocessor_for_trees

Pipeline(steps=[('dropper',
                 DropFeatures(features_to_drop=['loan_amount_cap_40k',
                                                'debt_to_income_ratio_cap_100'])),
                ('transformer',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('binary',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(fill_value=0,
                                                                                 strategy='constant'))]),
                                                  <sklearn.compose._column_transformer.make_column_selector object at 0x0000026409B7ABD0>),
                                                 ('categorical',
                                                  Pipeline(steps=[('onehot',
                                                                   OneHotEncoder(handle_unknown='ignore',
                                                                                 sparse_output=False))]),
                                                  <sklearn.compose._column_transformer.make_column_selector object at 0x000002640D7AFAD0>)],
                                   verbose_feature_names_out=False))])

Pipeline

Pipeline(steps=[('dropper',
                 DropFeatures(features_to_drop=['loan_amount_cap_40k',
                                                'debt_to_income_ratio_cap_100'])),
                ('transformer',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('binary',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(fill_value=0,
                                                                                 strategy='constant'))]),
                                                  <sklearn.compose._column_transformer.make_column_selector object at 0x0000026409B7ABD0>),
                                                 ('categorical',
                                                  Pipeline(steps=[('onehot',
                                                                   OneHotEncoder(handle_unknown='ignore',
                                                                                 sparse_output=False))]),
                                                  <sklearn.compose._column_transformer.make_column_selector object at 0x000002640D7AFAD0>)],
                                   verbose_feature_names_out=False))])

DropFeatures

DropFeatures(features_to_drop=['loan_amount_cap_40k',
                               'debt_to_income_ratio_cap_100'])

transformer: ColumnTransformer

ColumnTransformer(remainder='passthrough',
                  transformers=[('binary',
                                 Pipeline(steps=[('imputer',
                                                  SimpleImputer(fill_value=0,
                                                                strategy='constant'))]),
                                 <sklearn.compose._column_transformer.make_column_selector object at 0x0000026409B7ABD0>),
                                ('categorical',
                                 Pipeline(steps=[('onehot',
                                                  OneHotEncoder(handle_unknown='ignore',
                                                                sparse_output=False))]),
                                 <sklearn.compose._column_transformer.make_column_selector object at 0x000002640D7AFAD0>)],
                  verbose_feature_names_out=False)

binary

<sklearn.compose._column_transformer.make_column_selector object at 0x0000026409B7ABD0>

SimpleImputer

SimpleImputer(fill_value=0, strategy='constant')

categorical

<sklearn.compose._column_transformer.make_column_selector object at 0x000002640D7AFAD0>

OneHotEncoder

OneHotEncoder(handle_unknown='ignore', sparse_output=False)

remainder

passthrough

passthrough

Group-dependent pre-processing steps that should be performed after cross-validation splits before each model re-fitting:

# Group-dependent pre-processing steps
# that will be performed before each model re-fitting

# Select numeric variables by data type and name pattern
select_numeric_for_lr = make_column_selector(
    dtype_include="number",
    pattern=create_column_selector_pattern(numeric_features_for_lr),
)
select_numeric_for_trees = make_column_selector(
    dtype_include="number",
    pattern=create_column_selector_pattern(numeric_features_for_trees),
)

# Create the pipelines
numeric_transformer_for_lr = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="median")), ("scaler", StandardScaler())]
)
numeric_transformer_for_trees = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="median"))]
)

group_dependent_preprocessor_for_lr = ColumnTransformer(
    transformers=[
        ("numeric", numeric_transformer_for_lr, select_numeric_for_lr),
    ],
    verbose_feature_names_out=False,
    remainder="passthrough",
)

group_dependent_preprocessor_for_trees = ColumnTransformer(
    transformers=[
        ("numeric", numeric_transformer_for_trees, select_numeric_for_trees),
    ],
    verbose_feature_names_out=False,
    remainder="passthrough",
)

For logistic regression and Naive Bayes:

# For LR and NB
group_dependent_preprocessor_for_lr

ColumnTransformer(remainder='passthrough',
                  transformers=[('numeric',
                                 Pipeline(steps=[('imputer',
                                                  SimpleImputer(strategy='median')),
                                                 ('scaler', StandardScaler())]),
                                 <sklearn.compose._column_transformer.make_column_selector object at 0x000002640D7BBFD0>)],
                  verbose_feature_names_out=False)

For tree-based models:

# For tree-based models
group_dependent_preprocessor_for_trees

ColumnTransformer(remainder='passthrough',
                  transformers=[('numeric',
                                 Pipeline(steps=[('imputer',
                                                  SimpleImputer(strategy='median'))]),
                                 <sklearn.compose._column_transformer.make_column_selector object at 0x000002640D7BBE50>)],
                  verbose_feature_names_out=False)

3.5.3 Apply Group-Independent Pre-Processing

Apply group-independent pre-processing steps to the training, validation, and test sets and save the results as separate datasets.

Code

print(X_train.shape)
print(X_validation.shape)
print(X_test.shape)

(6939617, 26)
(1484768, 26)
(1485951, 26)

Next, apply group-independent pre-processing and cache the results.

Code

# For LR and NB
path_train = dir_interim + "task-1-X_train_lr.feather"
path_validation = dir_interim + "task-1-X_validation_lr.feather"
path_test = dir_interim + "task-1-X_test_lr.feather"

if (
    os.path.exists(path_train)
    and os.path.exists(path_validation)
    and os.path.exists(path_test)
):
    # Load saved results
    X_train_lr = pd.read_feather(path_train)
    X_validation_lr = pd.read_feather(path_validation)
    X_test_lr = pd.read_feather(path_test)
else:
    # Pre-process
    X_train_lr = group_independent_preprocessor_for_lr.fit_transform(X_train)
    X_validation_lr = group_independent_preprocessor_for_lr.transform(X_validation)
    X_test_lr = group_independent_preprocessor_for_lr.transform(X_test)

    # Save as feather files
    X_train_lr.to_feather(path_train)
    X_validation_lr.to_feather(path_validation)
    X_test_lr.to_feather(path_test)

Code

# For LGBM
path_train_trees = dir_interim + "task-1-X_train_trees.feather"
path_validation_trees = dir_interim + "task-1-X_validation_trees.feather"
path_test_trees = dir_interim + "task-1-X_test_trees.feather"

if (
    os.path.exists(path_train_trees)
    and os.path.exists(path_validation_trees)
    and os.path.exists(path_test_trees)
):
    # Load saved results
    X_train_trees = pd.read_feather(path_train_trees)
    X_validation_trees = pd.read_feather(path_validation_trees)
    X_test_trees = pd.read_feather(path_test_trees)
else:
    # Pre-process
    X_train_trees = group_independent_preprocessor_for_trees.fit_transform(X_train)
    X_validation_trees = group_independent_preprocessor_for_trees.transform(
        X_validation
    )
    X_test_trees = group_independent_preprocessor_for_trees.transform(X_test)

    # Save as feather files
    X_train_lr.to_feather(path_train_trees)
    X_validation_lr.to_feather(path_validation_trees)
    X_test_lr.to_feather(path_test_trees)

3.5.4 Train Models

Code

# Dictionary to collect results (logistic regression and Naive Bayes)
models_lr = {}

Code

@my.cache_results(dir_interim + "models_1_01_naive_bayes-2.pickle")
def fit_nb():
    """Fit a Naive Bayes model."""
    pipeline = Pipeline(
        steps=[
            ("preprocessor", clone(group_dependent_preprocessor_for_lr)),
            ("classifier", GaussianNB()),
        ]
    )
    pipeline.fit(X_train_lr, y_train)

    return pipeline


models_lr["Naive Bayes"] = fit_nb()
# Time 1m 38.1s

Code

@my.cache_results(dir_interim + "models_1_02_logistic_regression_sgd-2.pickle")
def fit_lr_sgd():
    """Fit a Logistic Regression model (via SGD training)."""
    pipeline = Pipeline(
        steps=[
            ("preprocessor", clone(group_dependent_preprocessor_for_lr)),
            (
                "classifier",
                SGDClassifier(
                    random_state=1, loss="log_loss", n_jobs=-1, class_weight="balanced"
                ),
            ),
        ]
    )
    pipeline.fit(X_train_lr, y_train)
    return pipeline


models_lr["Logistic Regression"] = fit_lr_sgd()
# 51.5s

Code

# Dictionary to collect results (tree-based models)
models_trees = {}

Code

@my.cache_results(dir_interim + "models_1_03_lgbm-2.pickle")
def fit_lgbm():
    """Fit a LGBM model."""
    pipeline = Pipeline(
        steps=[
            ("preprocessor", clone(group_dependent_preprocessor_for_trees)),
            (
                "classifier",
                LGBMClassifier(
                    random_state=1,
                    class_weight="balanced",
                    objective="binary",
                    metric="binary_logloss",
                    n_jobs=-1,
                    device="gpu",
                    verbosity=1,
                ),
            ),
        ]
    )

    pipeline.fit(X_train_trees, y_train)

    return pipeline


# Time: 57.2s
models_trees["LGBM"] = fit_lgbm()

3.5.5 Evaluate Models

print("--- Train ---")

trees_performance_train = ml.classification_scores(
    models_trees,
    X_train_trees,
    y_train,
    style=False,
)
ml.classification_scores(
    models_lr,
    X_train_lr,
    y_train,
    add=trees_performance_train,
    color="orange",
    sort_by="ROC_AUC",
)

--- Train ---

Table 3.3. Classification scores for the train set. The rows are sorted by ROC AUC score. The best values in each column are highlighted.

	n	No_info_rate	Accuracy	BAcc	BAcc_01	F1	F1_neg	TPR	TNR	PPV	NPV	ROC_AUC
LGBM	6939617	0.950	0.997	0.998	0.996	0.967	0.998	0.999	0.996	0.936	1.000	1.000
Logistic Regression	6939617	0.950	0.981	0.989	0.977	0.840	0.990	0.997	0.980	0.725	1.000	0.998
Naive Bayes	6939617	0.950	0.938	0.963	0.926	0.614	0.966	0.991	0.935	0.445	0.999	0.987

print("--- Validation ---")

trees_performance_val = ml.classification_scores(
    models_trees,
    X_validation_trees,
    y_validation,
    style=False,
)
ml.classification_scores(
    models_lr,
    X_validation_lr,
    y_validation,
    add=trees_performance_val,
    sort_by="ROC_AUC",
)

--- Validation ---

Table 3.4. Classification scores for the validation set. The rows are sorted by ROC AUC score. The best values in each column are highlighted.

	n	No_info_rate	Accuracy	BAcc	BAcc_01	F1	F1_neg	TPR	TNR	PPV	NPV	ROC_AUC
LGBM	1484768	0.950	0.997	0.998	0.995	0.966	0.998	0.999	0.996	0.935	1.000	1.000
Logistic Regression	1484768	0.950	0.981	0.989	0.978	0.841	0.990	0.997	0.980	0.726	1.000	0.998
Naive Bayes	1484768	0.950	0.938	0.963	0.927	0.614	0.966	0.991	0.935	0.445	1.000	0.988

Code

y_pred_validation_nb = models_lr["Naive Bayes"].predict(X_validation_lr)
ml.plot_confusion_matrices(y_validation, y_pred_validation_nb);

Code

y_pred_validation_lr = models_lr["Logistic Regression"].predict(X_validation_lr)
ml.plot_confusion_matrices(y_validation, y_pred_validation_lr);

Code

y_pred_validation_lgbm = models_trees["LGBM"].predict(X_validation_trees)
ml.plot_confusion_matrices(y_validation, y_pred_validation_lgbm);

For the next round of analysis, the LGBM model is selected as the best-performing one.

3.5.6 Feature Importance (Best Model)

Feature importance for the LGBM model will be evaluated. Both internal LGBM and SHAP feature importance will be used.

Both methods indicate, that there are 9 most important features, just suggest different order of features.

Code

@my.cache_results(dir_interim + "task-1-shap_lgbm_k=all.pkl")
def get_shap_values_lgbm():
    model = "LGBM"
    preproc = Pipeline(steps=models_trees[model].steps[:-1])
    classifier = models_trees[model]["classifier"]
    X_validation_preproc = preproc.transform(X_validation_trees)

    tree_explainer = shap.TreeExplainer(classifier)
    shap_values = tree_explainer.shap_values(X_validation_preproc)
    return shap_values, X_validation_preproc


shap_values_lgbm, data_for_lgbm = get_shap_values_lgbm()
# Time:7m 49.3s

LightGBM binary classifier with TreeExplainer shap values output has changed to a list of ndarray

Code

lgb.plot_importance(
    models_trees["LGBM"]["classifier"],
    max_num_features=50,
    figsize=(10, 6),
    height=0.8,
    title="LGBM Feature Importance",
);

Code

shap.summary_plot(shap_values_lgbm[1], data_for_lgbm, plot_type="bar")

Code

shap.summary_plot(shap_values_lgbm[1], data_for_lgbm)

No data for colormapping provided via 'c'. Parameters 'vmin', 'vmax' will be ignored

3.5.7 LGBM Re-Evaluation with Most Important Variables

The LGBM model was re-evaluated with 9 most important features only.

Validation accuracy, F1 score (of accepted loans), and PPV (positive predictive value) slightly dropped while others remained the same at the precision of 3 decimal digits (Table 3.6). Due to lower complexity, the model with 9 features will be used in the next steps.

The details of modeling are presented below.

@my.cache_results(dir_interim + "models_1_03_lgbm_k=9.pkl")
def fit_lgbm_k9():
    """Fit a LGBM model."""

    to_include = str_to_list(
        """
    risk_score_is_na
    risk_score
    employment_length_num
    debt_to_income_ratio
    month_cos
    month_sin
    employment_length_is_na
    title_len
    loan_amount
    """
    )

    pipeline = Pipeline(
        steps=[
            ("selector", ColumnSelector(to_include)),
            ("preprocessor", clone(group_dependent_preprocessor_for_trees)),
            (
                "classifier",
                LGBMClassifier(
                    random_state=1,
                    class_weight="balanced",
                    objective="binary",
                    metric="binary_logloss",
                    n_jobs=-1,
                    device="gpu",
                    verbosity=1,
                ),
            ),
        ]
    )
    pipeline.fit(X_train_trees, y_train)
    return pipeline


models_trees["LGBM (feat=9)"] = fit_lgbm_k9()
# Time 35.4s

print("--- Train ---")

trees_performance_train = ml.classification_scores(
    models_trees,
    X_train_trees,
    y_train,
    style=False,
)
train_performance = ml.classification_scores(
    models_lr,
    X_train_lr,
    y_train,
    add=trees_performance_train,
    color="orange",
    sort_by="ROC_AUC",
)
train_performance

--- Train ---

Table 3.5. Classification scores for the train set. The rows are sorted by ROC-AUC score. The best values in each column are highlighted.

	n	No_info_rate	Accuracy	BAcc	BAcc_01	F1	F1_neg	TPR	TNR	PPV	NPV	ROC_AUC
LGBM	6939617	0.950	0.997	0.998	0.996	0.967	0.998	0.999	0.996	0.936	1.000	1.000
LGBM (feat=9)	6939617	0.950	0.996	0.998	0.995	0.965	0.998	0.999	0.996	0.934	1.000	1.000
Logistic Regression	6939617	0.950	0.981	0.989	0.977	0.840	0.990	0.997	0.980	0.725	1.000	0.998
Naive Bayes	6939617	0.950	0.938	0.963	0.926	0.614	0.966	0.991	0.935	0.445	0.999	0.987

print("--- Validation ---")

trees_performance_val = ml.classification_scores(
    models_trees,
    X_validation_trees,
    y_validation,
    style=False,
)
validation_performance = ml.classification_scores(
    models_lr,
    X_validation_lr,
    y_validation,
    add=trees_performance_val,
    sort_by="ROC_AUC",
)
validation_performance
# Time: 8m 50.6s

--- Validation ---

Table 3.6. Classification scores for the validation set. The rows are sorted by ROC-AUC score. The best values in each column are highlighted.

	n	No_info_rate	Accuracy	BAcc	BAcc_01	F1	F1_neg	TPR	TNR	PPV	NPV	ROC_AUC
LGBM	1484768	0.950	0.997	0.998	0.995	0.966	0.998	0.999	0.996	0.935	1.000	1.000
LGBM (feat=9)	1484768	0.950	0.996	0.998	0.995	0.965	0.998	0.999	0.996	0.933	1.000	1.000
Logistic Regression	1484768	0.950	0.981	0.989	0.978	0.841	0.990	0.997	0.980	0.726	1.000	0.998
Naive Bayes	1484768	0.950	0.938	0.963	0.927	0.614	0.966	0.991	0.935	0.445	1.000	0.988

Code

y_pred_validation_lgbm = models_trees["LGBM"].predict(X_validation_trees)
ml.plot_confusion_matrices(y_validation, y_pred_validation_lgbm);

Code

y_pred_validation_lgbm = models_trees["LGBM (feat=9)"].predict(X_validation_trees)
ml.plot_confusion_matrices(y_validation, y_pred_validation_lgbm);

Code

lgb.plot_importance(
    models_trees["LGBM (feat=9)"]["classifier"],
    max_num_features=50,
    figsize=(10, 4),
    height=0.8,
    title="LGBM Feature Importance",
);

Code

@my.cache_results(dir_interim + "task-1-shap_lgbm_k=9.pkl")
def get_shap_values_lgbm_k8():
    model = "LGBM (feat=9)"
    preproc = Pipeline(steps=models_trees[model].steps[:-1])
    classifier = models_trees[model]["classifier"]
    X_validation_preproc = preproc.transform(X_validation_trees)

    tree_explainer = shap.TreeExplainer(classifier)
    shap_values = tree_explainer.shap_values(X_validation_preproc)
    return shap_values, X_validation_preproc


shap_values_lgbm_k9, data_for_lgbm_k9 = get_shap_values_lgbm_k8()
# Time: 7m 13.7s

LightGBM binary classifier with TreeExplainer shap values output has changed to a list of ndarray

Code

shap.summary_plot(shap_values_lgbm_k9[1], data_for_lgbm_k9, plot_type="bar")

Code

shap.summary_plot(shap_values_lgbm_k9[1], data_for_lgbm_k9)

No data for colormapping provided via 'c'. Parameters 'vmin', 'vmax' will be ignored

3.6 Final Model for Deployment

The final model will be trained on the whole dataset and deployed on Google Cloud Platform (GCP).

Code

to_include = str_to_list(
    """
risk_score_is_na
risk_score
employment_length_num
debt_to_income_ratio
month_cos
month_sin
employment_length_is_na
title_len
loan_amount
"""
)

X = pd.concat(
    [
        X_train_trees[to_include],
        X_validation_trees[to_include],
        X_test_trees[to_include],
    ],
    axis="index",
)

y = pd.concat([y_train, y_validation, y_test], axis="index")

Code

@my.cache_results(dir_interim + "01--model_predict_loan_status.pkl")
def fit_lgbm_final():
    """Fit the final LGBM model."""
    pipeline = Pipeline(
        steps=[
            ("selector", ColumnSelector(to_include)),
            ("preprocessor", clone(group_dependent_preprocessor_for_trees)),
            (
                "classifier",
                LGBMClassifier(
                    random_state=1,
                    class_weight="balanced",
                    objective="binary",
                    metric="binary_logloss",
                    n_jobs=-1,
                    device="gpu",
                    verbosity=1,
                ),
            ),
        ]
    )
    pipeline.fit(X, y)
    return pipeline


loan_status_predictor_final = fit_lgbm_final()

4 Task 2: Predicting Loan Grade, Subgrade, and Interest Rate

4.1 Inspection, EDA and Pre-Processing

4.1.1 Import and Inspect Data

For this task, only data from the most recent loan issue year (2018) will be used to reflect the most recent trends. This decision is made based on findings comparing trends over the years in Task 1.

To remember what data in the file looks like, the first 10 rows will be printed:

Code

!cd data/raw/ &&\
    head -n 10 accepted_2007_to_2018Q4.csv

id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,url,desc,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,fico_range_low,fico_range_high,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,next_pymnt_d,last_credit_pull_d,last_fico_range_high,last_fico_range_low,collections_12_mths_ex_med,mths_since_last_major_derog,policy_code,application_type,annual_inc_joint,dti_joint,verification_status_joint,acc_now_delinq,tot_coll_amt,tot_cur_bal,open_acc_6m,open_act_il,open_il_12m,open_il_24m,mths_since_rcnt_il,total_bal_il,il_util,open_rv_12m,open_rv_24m,max_bal_bc,all_util,total_rev_hi_lim,inq_fi,total_cu_tl,inq_last_12m,acc_open_past_24mths,avg_cur_bal,bc_open_to_buy,bc_util,chargeoff_within_12_mths,delinq_amnt,mo_sin_old_il_acct,mo_sin_old_rev_tl_op,mo_sin_rcnt_rev_tl_op,mo_sin_rcnt_tl,mort_acc,mths_since_recent_bc,mths_since_recent_bc_dlq,mths_since_recent_inq,mths_since_recent_revol_delinq,num_accts_ever_120_pd,num_actv_bc_tl,num_actv_rev_tl,num_bc_sats,num_bc_tl,num_il_tl,num_op_rev_tl,num_rev_accts,num_rev_tl_bal_gt_0,num_sats,num_tl_120dpd_2m,num_tl_30dpd,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,revol_bal_joint,sec_app_fico_range_low,sec_app_fico_range_high,sec_app_earliest_cr_line,sec_app_inq_last_6mths,sec_app_mort_acc,sec_app_open_acc,sec_app_revol_util,sec_app_open_act_il,sec_app_num_rev_accts,sec_app_chargeoff_within_12_mths,sec_app_collections_12_mths_ex_med,sec_app_mths_since_last_major_derog,hardship_flag,hardship_type,hardship_reason,hardship_status,deferral_term,hardship_amount,hardship_start_date,hardship_end_date,payment_plan_start_date,hardship_length,hardship_dpd,hardship_loan_status,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,disbursement_method,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term
68407277,,3600.0,3600.0,3600.0, 36 months,13.99,123.03,C,C4,leadman,10+ years,MORTGAGE,55000.0,Not Verified,Dec-2015,Fully Paid,n,https://lendingclub.com/browse/loanDetail.action?loan_id=68407277,,debt_consolidation,Debt consolidation,190xx,PA,5.91,0.0,Aug-2003,675.0,679.0,1.0,30.0,,7.0,0.0,2765.0,29.7,13.0,w,0.0,0.0,4421.723916800001,4421.72,3600.0,821.72,0.0,0.0,0.0,Jan-2019,122.67,,Mar-2019,564.0,560.0,0.0,30.0,1.0,Individual,,,,0.0,722.0,144904.0,2.0,2.0,0.0,1.0,21.0,4981.0,36.0,3.0,3.0,722.0,34.0,9300.0,3.0,1.0,4.0,4.0,20701.0,1506.0,37.2,0.0,0.0,148.0,128.0,3.0,3.0,1.0,4.0,69.0,4.0,69.0,2.0,2.0,4.0,2.0,5.0,3.0,4.0,9.0,4.0,7.0,0.0,0.0,0.0,3.0,76.9,0.0,0.0,0.0,178050.0,7746.0,2400.0,13734.0,,,,,,,,,,,,,,N,,,,,,,,,,,,,,,Cash,N,,,,,,
68355089,,24700.0,24700.0,24700.0, 36 months,11.99,820.28,C,C1,Engineer,10+ years,MORTGAGE,65000.0,Not Verified,Dec-2015,Fully Paid,n,https://lendingclub.com/browse/loanDetail.action?loan_id=68355089,,small_business,Business,577xx,SD,16.06,1.0,Dec-1999,715.0,719.0,4.0,6.0,,22.0,0.0,21470.0,19.2,38.0,w,0.0,0.0,25679.66,25679.66,24700.0,979.66,0.0,0.0,0.0,Jun-2016,926.35,,Mar-2019,699.0,695.0,0.0,,1.0,Individual,,,,0.0,0.0,204396.0,1.0,1.0,0.0,1.0,19.0,18005.0,73.0,2.0,3.0,6472.0,29.0,111800.0,0.0,0.0,6.0,4.0,9733.0,57830.0,27.1,0.0,0.0,113.0,192.0,2.0,2.0,4.0,2.0,,0.0,6.0,0.0,5.0,5.0,13.0,17.0,6.0,20.0,27.0,5.0,22.0,0.0,0.0,0.0,2.0,97.4,7.7,0.0,0.0,314017.0,39475.0,79300.0,24667.0,,,,,,,,,,,,,,N,,,,,,,,,,,,,,,Cash,N,,,,,,
68341763,,20000.0,20000.0,20000.0, 60 months,10.78,432.66,B,B4,truck driver,10+ years,MORTGAGE,63000.0,Not Verified,Dec-2015,Fully Paid,n,https://lendingclub.com/browse/loanDetail.action?loan_id=68341763,,home_improvement,,605xx,IL,10.78,0.0,Aug-2000,695.0,699.0,0.0,,,6.0,0.0,7869.0,56.2,18.0,w,0.0,0.0,22705.924293878397,22705.92,20000.0,2705.92,0.0,0.0,0.0,Jun-2017,15813.3,,Mar-2019,704.0,700.0,0.0,,1.0,Joint App,71000.0,13.85,Not Verified,0.0,0.0,189699.0,0.0,1.0,0.0,4.0,19.0,10827.0,73.0,0.0,2.0,2081.0,65.0,14000.0,2.0,5.0,1.0,6.0,31617.0,2737.0,55.9,0.0,0.0,125.0,184.0,14.0,14.0,5.0,101.0,,10.0,,0.0,2.0,3.0,2.0,4.0,6.0,4.0,7.0,3.0,6.0,0.0,0.0,0.0,0.0,100.0,50.0,0.0,0.0,218418.0,18696.0,6200.0,14877.0,,,,,,,,,,,,,,N,,,,,,,,,,,,,,,Cash,N,,,,,,
66310712,,35000.0,35000.0,35000.0, 60 months,14.85,829.9,C,C5,Information Systems Officer,10+ years,MORTGAGE,110000.0,Source Verified,Dec-2015,Current,n,https://lendingclub.com/browse/loanDetail.action?loan_id=66310712,,debt_consolidation,Debt consolidation,076xx,NJ,17.06,0.0,Sep-2008,785.0,789.0,0.0,,,13.0,0.0,7802.0,11.6,17.0,w,15897.65,15897.65,31464.01,31464.01,19102.35,12361.66,0.0,0.0,0.0,Feb-2019,829.9,Apr-2019,Mar-2019,679.0,675.0,0.0,,1.0,Individual,,,,0.0,0.0,301500.0,1.0,1.0,0.0,1.0,23.0,12609.0,70.0,1.0,1.0,6987.0,45.0,67300.0,0.0,1.0,0.0,2.0,23192.0,54962.0,12.1,0.0,0.0,36.0,87.0,2.0,2.0,1.0,2.0,,,,0.0,4.0,5.0,8.0,10.0,2.0,10.0,13.0,5.0,13.0,0.0,0.0,0.0,1.0,100.0,0.0,0.0,0.0,381215.0,52226.0,62500.0,18000.0,,,,,,,,,,,,,,N,,,,,,,,,,,,,,,Cash,N,,,,,,
68476807,,10400.0,10400.0,10400.0, 60 months,22.45,289.91,F,F1,Contract Specialist,3 years,MORTGAGE,104433.0,Source Verified,Dec-2015,Fully Paid,n,https://lendingclub.com/browse/loanDetail.action?loan_id=68476807,,major_purchase,Major purchase,174xx,PA,25.37,1.0,Jun-1998,695.0,699.0,3.0,12.0,,12.0,0.0,21929.0,64.5,35.0,w,0.0,0.0,11740.5,11740.5,10400.0,1340.5,0.0,0.0,0.0,Jul-2016,10128.96,,Mar-2018,704.0,700.0,0.0,,1.0,Individual,,,,0.0,0.0,331730.0,1.0,3.0,0.0,3.0,14.0,73839.0,84.0,4.0,7.0,9702.0,78.0,34000.0,2.0,1.0,3.0,10.0,27644.0,4567.0,77.5,0.0,0.0,128.0,210.0,4.0,4.0,6.0,4.0,12.0,1.0,12.0,0.0,4.0,6.0,5.0,9.0,10.0,7.0,19.0,6.0,12.0,0.0,0.0,0.0,4.0,96.6,60.0,0.0,0.0,439570.0,95768.0,20300.0,88097.0,,,,,,,,,,,,,,N,,,,,,,,,,,,,,,Cash,N,,,,,,
68426831,,11950.0,11950.0,11950.0, 36 months,13.44,405.18,C,C3,Veterinary Tecnician,4 years,RENT,34000.0,Source Verified,Dec-2015,Fully Paid,n,https://lendingclub.com/browse/loanDetail.action?loan_id=68426831,,debt_consolidation,Debt consolidation,300xx,GA,10.2,0.0,Oct-1987,690.0,694.0,0.0,,,5.0,0.0,8822.0,68.4,6.0,w,0.0,0.0,13708.9485297572,13708.95,11950.0,1758.95,0.0,0.0,0.0,May-2017,7653.56,,May-2017,759.0,755.0,0.0,,1.0,Individual,,,,0.0,0.0,12798.0,0.0,1.0,0.0,0.0,338.0,3976.0,99.0,0.0,0.0,4522.0,76.0,12900.0,0.0,0.0,0.0,0.0,2560.0,844.0,91.0,0.0,0.0,338.0,54.0,32.0,32.0,0.0,36.0,,,,0.0,2.0,3.0,2.0,2.0,2.0,4.0,4.0,3.0,5.0,0.0,0.0,0.0,0.0,100.0,100.0,0.0,0.0,16900.0,12798.0,9400.0,4000.0,,,,,,,,,,,,,,N,,,,,,,,,,,,,,,Cash,N,,,,,,
68476668,,20000.0,20000.0,20000.0, 36 months,9.17,637.58,B,B2,Vice President of Recruiting Operations,10+ years,MORTGAGE,180000.0,Not Verified,Dec-2015,Fully Paid,n,https://lendingclub.com/browse/loanDetail.action?loan_id=68476668,,debt_consolidation,Debt consolidation,550xx,MN,14.67,0.0,Jun-1990,680.0,684.0,0.0,49.0,,12.0,0.0,87329.0,84.5,27.0,f,0.0,0.0,21393.800000011,21393.8,20000.0,1393.8,0.0,0.0,0.0,Nov-2016,15681.05,,Mar-2019,654.0,650.0,0.0,,1.0,Individual,,,,0.0,0.0,360358.0,0.0,2.0,0.0,2.0,18.0,29433.0,63.0,2.0,3.0,13048.0,74.0,94200.0,1.0,0.0,1.0,6.0,30030.0,0.0,102.9,0.0,0.0,142.0,306.0,10.0,10.0,4.0,12.0,,10.0,,0.0,4.0,6.0,4.0,5.0,7.0,9.0,16.0,6.0,12.0,0.0,0.0,0.0,2.0,96.3,100.0,0.0,0.0,388852.0,116762.0,31500.0,46452.0,,,,,,,,,,,,,,N,,,,,,,,,,,,,,,Cash,N,,,,,,
67275481,,20000.0,20000.0,20000.0, 36 months,8.49,631.26,B,B1,road driver,10+ years,MORTGAGE,85000.0,Not Verified,Dec-2015,Fully Paid,n,https://lendingclub.com/browse/loanDetail.action?loan_id=67275481,,major_purchase,Major purchase,293xx,SC,17.61,1.0,Feb-1999,705.0,709.0,0.0,3.0,,8.0,0.0,826.0,5.7,15.0,w,0.0,0.0,21538.508976797,21538.51,20000.0,1538.51,0.0,0.0,0.0,Jan-2017,14618.23,,Mar-2019,674.0,670.0,0.0,3.0,1.0,Individual,,,,0.0,0.0,141601.0,0.0,3.0,0.0,4.0,13.0,27111.0,75.0,0.0,0.0,640.0,55.0,14500.0,1.0,0.0,2.0,4.0,17700.0,13674.0,5.7,0.0,0.0,149.0,55.0,32.0,13.0,3.0,32.0,,8.0,,1.0,2.0,2.0,3.0,3.0,9.0,3.0,3.0,2.0,8.0,0.0,0.0,1.0,0.0,93.3,0.0,0.0,0.0,193390.0,27937.0,14500.0,36144.0,,,,,,,,,,,,,,N,,,,,,,,,,,,,,,Cash,N,,,,,,
68466926,,10000.0,10000.0,10000.0, 36 months,6.49,306.45,A,A2,SERVICE MANAGER,6 years,RENT,85000.0,Not Verified,Dec-2015,Fully Paid,n,https://lendingclub.com/browse/loanDetail.action?loan_id=68466926,,credit_card,Credit card refinancing,160xx,PA,13.07,0.0,Apr-2002,685.0,689.0,1.0,,106.0,14.0,1.0,10464.0,34.5,23.0,w,0.0,0.0,10998.9715749644,10998.97,10000.0,998.97,0.0,0.0,0.0,Aug-2018,1814.48,,Mar-2019,719.0,715.0,0.0,,1.0,Individual,,,,0.0,8341.0,27957.0,2.0,1.0,0.0,0.0,35.0,17493.0,57.0,2.0,7.0,2524.0,46.0,30300.0,2.0,0.0,1.0,7.0,1997.0,8182.0,50.1,0.0,0.0,164.0,129.0,1.0,1.0,1.0,4.0,,1.0,,0.0,6.0,9.0,7.0,10.0,3.0,13.0,19.0,9.0,14.0,0.0,0.0,0.0,2.0,95.7,28.6,1.0,0.0,61099.0,27957.0,16400.0,30799.0,,,,,,,,,,,,,,N,,,,,,,,,,,,,,,Cash,N,,,,,,

Code

!cd data/raw/ &&\
    head -n 10 accepted_2007_to_2018Q4.csv | csvlook

|         id | member_id | loan_amnt | funded_amnt | funded_amnt_inv |       term | int_rate | installment | grade | sub_grade | emp_title                               | emp_length | home_ownership | annual_inc | verification_status | issue_d  | loan_status | pymnt_plan | url                                                               | desc | purpose            | title                   | zip_code | addr_state |   dti | delinq_2yrs | earliest_cr_line | fico_range_low | fico_range_high | inq_last_6mths | mths_since_last_delinq | mths_since_last_record | open_acc | pub_rec | revol_bal | revol_util | total_acc | initial_list_status | out_prncp | out_prncp_inv | total_pymnt | total_pymnt_inv | total_rec_prncp | total_rec_int | total_rec_late_fee | recoveries | collection_recovery_fee | last_pymnt_d | last_pymnt_amnt | next_pymnt_d | last_credit_pull_d | last_fico_range_high | last_fico_range_low | collections_12_mths_ex_med | mths_since_last_major_derog | policy_code | application_type | annual_inc_joint | dti_joint | verification_status_joint | acc_now_delinq | tot_coll_amt | tot_cur_bal | open_acc_6m | open_act_il | open_il_12m | open_il_24m | mths_since_rcnt_il | total_bal_il | il_util | open_rv_12m | open_rv_24m | max_bal_bc | all_util | total_rev_hi_lim | inq_fi | total_cu_tl | inq_last_12m | acc_open_past_24mths | avg_cur_bal | bc_open_to_buy | bc_util | chargeoff_within_12_mths | delinq_amnt | mo_sin_old_il_acct | mo_sin_old_rev_tl_op | mo_sin_rcnt_rev_tl_op | mo_sin_rcnt_tl | mort_acc | mths_since_recent_bc | mths_since_recent_bc_dlq | mths_since_recent_inq | mths_since_recent_revol_delinq | num_accts_ever_120_pd | num_actv_bc_tl | num_actv_rev_tl | num_bc_sats | num_bc_tl | num_il_tl | num_op_rev_tl | num_rev_accts | num_rev_tl_bal_gt_0 | num_sats | num_tl_120dpd_2m | num_tl_30dpd | num_tl_90g_dpd_24m | num_tl_op_past_12m | pct_tl_nvr_dlq | percent_bc_gt_75 | pub_rec_bankruptcies | tax_liens | tot_hi_cred_lim | total_bal_ex_mort | total_bc_limit | total_il_high_credit_limit | revol_bal_joint | sec_app_fico_range_low | sec_app_fico_range_high | sec_app_earliest_cr_line | sec_app_inq_last_6mths | sec_app_mort_acc | sec_app_open_acc | sec_app_revol_util | sec_app_open_act_il | sec_app_num_rev_accts | sec_app_chargeoff_within_12_mths | sec_app_collections_12_mths_ex_med | sec_app_mths_since_last_major_derog | hardship_flag | hardship_type | hardship_reason | hardship_status | deferral_term | hardship_amount | hardship_start_date | hardship_end_date | payment_plan_start_date | hardship_length | hardship_dpd | hardship_loan_status | orig_projected_additional_accrued_interest | hardship_payoff_balance_amount | hardship_last_payment_amount | disbursement_method | debt_settlement_flag | debt_settlement_flag_date | settlement_status | settlement_date | settlement_amount | settlement_percentage | settlement_term |
| ---------- | --------- | --------- | ----------- | --------------- | ---------- | -------- | ----------- | ----- | --------- | --------------------------------------- | ---------- | -------------- | ---------- | ------------------- | -------- | ----------- | ---------- | ----------------------------------------------------------------- | ---- | ------------------ | ----------------------- | -------- | ---------- | ----- | ----------- | ---------------- | -------------- | --------------- | -------------- | ---------------------- | ---------------------- | -------- | ------- | --------- | ---------- | --------- | ------------------- | --------- | ------------- | ----------- | --------------- | --------------- | ------------- | ------------------ | ---------- | ----------------------- | ------------ | --------------- | ------------ | ------------------ | -------------------- | ------------------- | -------------------------- | --------------------------- | ----------- | ---------------- | ---------------- | --------- | ------------------------- | -------------- | ------------ | ----------- | ----------- | ----------- | ----------- | ----------- | ------------------ | ------------ | ------- | ----------- | ----------- | ---------- | -------- | ---------------- | ------ | ----------- | ------------ | -------------------- | ----------- | -------------- | ------- | ------------------------ | ----------- | ------------------ | -------------------- | --------------------- | -------------- | -------- | -------------------- | ------------------------ | --------------------- | ------------------------------ | --------------------- | -------------- | --------------- | ----------- | --------- | --------- | ------------- | ------------- | ------------------- | -------- | ---------------- | ------------ | ------------------ | ------------------ | -------------- | ---------------- | -------------------- | --------- | --------------- | ----------------- | -------------- | -------------------------- | --------------- | ---------------------- | ----------------------- | ------------------------ | ---------------------- | ---------------- | ---------------- | ------------------ | ------------------- | --------------------- | -------------------------------- | ---------------------------------- | ----------------------------------- | ------------- | ------------- | --------------- | --------------- | ------------- | --------------- | ------------------- | ----------------- | ----------------------- | --------------- | ------------ | -------------------- | ------------------------------------------ | ------------------------------ | ---------------------------- | ------------------- | -------------------- | ------------------------- | ----------------- | --------------- | ----------------- | --------------------- | --------------- |
| 68,407,277 |           |     3,600 |       3,600 |           3,600 | 0004-01-01 |    13.99 |      123.03 | C     | C4        | leadman                                 | 10+ years  | MORTGAGE       |     55,000 | Not Verified        | Dec-2015 | Fully Paid  |      False | https://lendingclub.com/browse/loanDetail.action?loan_id=68407277 |      | debt_consolidation | Debt consolidation      | 190xx    | PA         |  5.91 |           0 | Aug-2003         |            675 |             679 |              1 |                     30 |                        |        7 |       0 |     2,765 |       29.7 |        13 | w                   |      0.00 |          0.00 |  4,421.724… |        4,421.72 |        3,600.00 |        821.72 |                  0 |          0 |                       0 | Jan-2019     |          122.67 |              | Mar-2019           |                  564 |                 560 |                          0 |                          30 |           1 | Individual       |                  |           |                           |              0 |          722 |     144,904 |           2 |           2 |           0 |           1 |                 21 |        4,981 |      36 |           3 |           3 |        722 |       34 |            9,300 |      3 |           1 |            4 |                    4 |      20,701 |          1,506 |    37.2 |                        0 |           0 |                148 |                  128 |                     3 |              3 |        1 |                    4 |                       69 |                     4 |                             69 |                     2 |              2 |               4 |           2 |         5 |         3 |             4 |             9 |                   4 |        7 |                0 |            0 |                  0 |                  3 |           76.9 |              0.0 |                    0 |         0 |         178,050 |             7,746 |          2,400 |                     13,734 |                 |                        |                         |                          |                        |                  |                  |                    |                     |                       |                                  |                                    |                                     |         False |               |                 |                 |               |                 |                     |                   |                         |                 |              |                      |                                            |                                |                              | Cash                |                False |                           |                   |                 |                   |                       |                 |
| 68,355,089 |           |    24,700 |      24,700 |          24,700 | 0004-01-01 |    11.99 |      820.28 | C     | C1        | Engineer                                | 10+ years  | MORTGAGE       |     65,000 | Not Verified        | Dec-2015 | Fully Paid  |      False | https://lendingclub.com/browse/loanDetail.action?loan_id=68355089 |      | small_business     | Business                | 577xx    | SD         | 16.06 |           1 | Dec-1999         |            715 |             719 |              4 |                      6 |                        |       22 |       0 |    21,470 |       19.2 |        38 | w                   |      0.00 |          0.00 | 25,679.660… |       25,679.66 |       24,700.00 |        979.66 |                  0 |          0 |                       0 | Jun-2016     |          926.35 |              | Mar-2019           |                  699 |                 695 |                          0 |                             |           1 | Individual       |                  |           |                           |              0 |            0 |     204,396 |           1 |           1 |           0 |           1 |                 19 |       18,005 |      73 |           2 |           3 |      6,472 |       29 |          111,800 |      0 |           0 |            6 |                    4 |       9,733 |         57,830 |    27.1 |                        0 |           0 |                113 |                  192 |                     2 |              2 |        4 |                    2 |                          |                     0 |                              6 |                     0 |              5 |               5 |          13 |        17 |         6 |            20 |            27 |                   5 |       22 |                0 |            0 |                  0 |                  2 |           97.4 |              7.7 |                    0 |         0 |         314,017 |            39,475 |         79,300 |                     24,667 |                 |                        |                         |                          |                        |                  |                  |                    |                     |                       |                                  |                                    |                                     |         False |               |                 |                 |               |                 |                     |                   |                         |                 |              |                      |                                            |                                |                              | Cash                |                False |                           |                   |                 |                   |                       |                 |
| 68,341,763 |           |    20,000 |      20,000 |          20,000 | 0006-01-01 |    10.78 |      432.66 | B     | B4        | truck driver                            | 10+ years  | MORTGAGE       |     63,000 | Not Verified        | Dec-2015 | Fully Paid  |      False | https://lendingclub.com/browse/loanDetail.action?loan_id=68341763 |      | home_improvement   |                         | 605xx    | IL         | 10.78 |           0 | Aug-2000         |            695 |             699 |              0 |                        |                        |        6 |       0 |     7,869 |       56.2 |        18 | w                   |      0.00 |          0.00 | 22,705.924… |       22,705.92 |       20,000.00 |      2,705.92 |                  0 |          0 |                       0 | Jun-2017     |       15,813.30 |              | Mar-2019           |                  704 |                 700 |                          0 |                             |           1 | Joint App        |           71,000 |     13.85 | Not Verified              |              0 |            0 |     189,699 |           0 |           1 |           0 |           4 |                 19 |       10,827 |      73 |           0 |           2 |      2,081 |       65 |           14,000 |      2 |           5 |            1 |                    6 |      31,617 |          2,737 |    55.9 |                        0 |           0 |                125 |                  184 |                    14 |             14 |        5 |                  101 |                          |                    10 |                                |                     0 |              2 |               3 |           2 |         4 |         6 |             4 |             7 |                   3 |        6 |                0 |            0 |                  0 |                  0 |          100.0 |             50.0 |                    0 |         0 |         218,418 |            18,696 |          6,200 |                     14,877 |                 |                        |                         |                          |                        |                  |                  |                    |                     |                       |                                  |                                    |                                     |         False |               |                 |                 |               |                 |                     |                   |                         |                 |              |                      |                                            |                                |                              | Cash                |                False |                           |                   |                 |                   |                       |                 |
| 66,310,712 |           |    35,000 |      35,000 |          35,000 | 0006-01-01 |    14.85 |      829.90 | C     | C5        | Information Systems Officer             | 10+ years  | MORTGAGE       |    110,000 | Source Verified     | Dec-2015 | Current     |      False | https://lendingclub.com/browse/loanDetail.action?loan_id=66310712 |      | debt_consolidation | Debt consolidation      | 076xx    | NJ         | 17.06 |           0 | Sep-2008         |            785 |             789 |              0 |                        |                        |       13 |       0 |     7,802 |       11.6 |        17 | w                   | 15,897.65 |     15,897.65 | 31,464.010… |       31,464.01 |       19,102.35 |     12,361.66 |                  0 |          0 |                       0 | Feb-2019     |          829.90 | Apr-2019     | Mar-2019           |                  679 |                 675 |                          0 |                             |           1 | Individual       |                  |           |                           |              0 |            0 |     301,500 |           1 |           1 |           0 |           1 |                 23 |       12,609 |      70 |           1 |           1 |      6,987 |       45 |           67,300 |      0 |           1 |            0 |                    2 |      23,192 |         54,962 |    12.1 |                        0 |           0 |                 36 |                   87 |                     2 |              2 |        1 |                    2 |                          |                       |                                |                     0 |              4 |               5 |           8 |        10 |         2 |            10 |            13 |                   5 |       13 |                0 |            0 |                  0 |                  1 |          100.0 |              0.0 |                    0 |         0 |         381,215 |            52,226 |         62,500 |                     18,000 |                 |                        |                         |                          |                        |                  |                  |                    |                     |                       |                                  |                                    |                                     |         False |               |                 |                 |               |                 |                     |                   |                         |                 |              |                      |                                            |                                |                              | Cash                |                False |                           |                   |                 |                   |                       |                 |
| 68,476,807 |           |    10,400 |      10,400 |          10,400 | 0006-01-01 |    22.45 |      289.91 | F     | F1        | Contract Specialist                     | 3 years    | MORTGAGE       |    104,433 | Source Verified     | Dec-2015 | Fully Paid  |      False | https://lendingclub.com/browse/loanDetail.action?loan_id=68476807 |      | major_purchase     | Major purchase          | 174xx    | PA         | 25.37 |           1 | Jun-1998         |            695 |             699 |              3 |                     12 |                        |       12 |       0 |    21,929 |       64.5 |        35 | w                   |      0.00 |          0.00 | 11,740.500… |       11,740.50 |       10,400.00 |      1,340.50 |                  0 |          0 |                       0 | Jul-2016     |       10,128.96 |              | Mar-2018           |                  704 |                 700 |                          0 |                             |           1 | Individual       |                  |           |                           |              0 |            0 |     331,730 |           1 |           3 |           0 |           3 |                 14 |       73,839 |      84 |           4 |           7 |      9,702 |       78 |           34,000 |      2 |           1 |            3 |                   10 |      27,644 |          4,567 |    77.5 |                        0 |           0 |                128 |                  210 |                     4 |              4 |        6 |                    4 |                       12 |                     1 |                             12 |                     0 |              4 |               6 |           5 |         9 |        10 |             7 |            19 |                   6 |       12 |                0 |            0 |                  0 |                  4 |           96.6 |             60.0 |                    0 |         0 |         439,570 |            95,768 |         20,300 |                     88,097 |                 |                        |                         |                          |                        |                  |                  |                    |                     |                       |                                  |                                    |                                     |         False |               |                 |                 |               |                 |                     |                   |                         |                 |              |                      |                                            |                                |                              | Cash                |                False |                           |                   |                 |                   |                       |                 |
| 68,426,831 |           |    11,950 |      11,950 |          11,950 | 0004-01-01 |    13.44 |      405.18 | C     | C3        | Veterinary Tecnician                    | 4 years    | RENT           |     34,000 | Source Verified     | Dec-2015 | Fully Paid  |      False | https://lendingclub.com/browse/loanDetail.action?loan_id=68426831 |      | debt_consolidation | Debt consolidation      | 300xx    | GA         | 10.20 |           0 | Oct-1987         |            690 |             694 |              0 |                        |                        |        5 |       0 |     8,822 |       68.4 |         6 | w                   |      0.00 |          0.00 | 13,708.949… |       13,708.95 |       11,950.00 |      1,758.95 |                  0 |          0 |                       0 | May-2017     |        7,653.56 |              | May-2017           |                  759 |                 755 |                          0 |                             |           1 | Individual       |                  |           |                           |              0 |            0 |      12,798 |           0 |           1 |           0 |           0 |                338 |        3,976 |      99 |           0 |           0 |      4,522 |       76 |           12,900 |      0 |           0 |            0 |                    0 |       2,560 |            844 |    91.0 |                        0 |           0 |                338 |                   54 |                    32 |             32 |        0 |                   36 |                          |                       |                                |                     0 |              2 |               3 |           2 |         2 |         2 |             4 |             4 |                   3 |        5 |                0 |            0 |                  0 |                  0 |          100.0 |            100.0 |                    0 |         0 |          16,900 |            12,798 |          9,400 |                      4,000 |                 |                        |                         |                          |                        |                  |                  |                    |                     |                       |                                  |                                    |                                     |         False |               |                 |                 |               |                 |                     |                   |                         |                 |              |                      |                                            |                                |                              | Cash                |                False |                           |                   |                 |                   |                       |                 |
| 68,476,668 |           |    20,000 |      20,000 |          20,000 | 0004-01-01 |     9.17 |      637.58 | B     | B2        | Vice President of Recruiting Operations | 10+ years  | MORTGAGE       |    180,000 | Not Verified        | Dec-2015 | Fully Paid  |      False | https://lendingclub.com/browse/loanDetail.action?loan_id=68476668 |      | debt_consolidation | Debt consolidation      | 550xx    | MN         | 14.67 |           0 | Jun-1990         |            680 |             684 |              0 |                     49 |                        |       12 |       0 |    87,329 |       84.5 |        27 | f                   |      0.00 |          0.00 | 21,393.800… |       21,393.80 |       20,000.00 |      1,393.80 |                  0 |          0 |                       0 | Nov-2016     |       15,681.05 |              | Mar-2019           |                  654 |                 650 |                          0 |                             |           1 | Individual       |                  |           |                           |              0 |            0 |     360,358 |           0 |           2 |           0 |           2 |                 18 |       29,433 |      63 |           2 |           3 |     13,048 |       74 |           94,200 |      1 |           0 |            1 |                    6 |      30,030 |              0 |   102.9 |                        0 |           0 |                142 |                  306 |                    10 |             10 |        4 |                   12 |                          |                    10 |                                |                     0 |              4 |               6 |           4 |         5 |         7 |             9 |            16 |                   6 |       12 |                0 |            0 |                  0 |                  2 |           96.3 |            100.0 |                    0 |         0 |         388,852 |           116,762 |         31,500 |                     46,452 |                 |                        |                         |                          |                        |                  |                  |                    |                     |                       |                                  |                                    |                                     |         False |               |                 |                 |               |                 |                     |                   |                         |                 |              |                      |                                            |                                |                              | Cash                |                False |                           |                   |                 |                   |                       |                 |
| 67,275,481 |           |    20,000 |      20,000 |          20,000 | 0004-01-01 |     8.49 |      631.26 | B     | B1        | road driver                             | 10+ years  | MORTGAGE       |     85,000 | Not Verified        | Dec-2015 | Fully Paid  |      False | https://lendingclub.com/browse/loanDetail.action?loan_id=67275481 |      | major_purchase     | Major purchase          | 293xx    | SC         | 17.61 |           1 | Feb-1999         |            705 |             709 |              0 |                      3 |                        |        8 |       0 |       826 |        5.7 |        15 | w                   |      0.00 |          0.00 | 21,538.509… |       21,538.51 |       20,000.00 |      1,538.51 |                  0 |          0 |                       0 | Jan-2017     |       14,618.23 |              | Mar-2019           |                  674 |                 670 |                          0 |                           3 |           1 | Individual       |                  |           |                           |              0 |            0 |     141,601 |           0 |           3 |           0 |           4 |                 13 |       27,111 |      75 |           0 |           0 |        640 |       55 |           14,500 |      1 |           0 |            2 |                    4 |      17,700 |         13,674 |     5.7 |                        0 |           0 |                149 |                   55 |                    32 |             13 |        3 |                   32 |                          |                     8 |                                |                     1 |              2 |               2 |           3 |         3 |         9 |             3 |             3 |                   2 |        8 |                0 |            0 |                  1 |                  0 |           93.3 |              0.0 |                    0 |         0 |         193,390 |            27,937 |         14,500 |                     36,144 |                 |                        |                         |                          |                        |                  |                  |                    |                     |                       |                                  |                                    |                                     |         False |               |                 |                 |               |                 |                     |                   |                         |                 |              |                      |                                            |                                |                              | Cash                |                False |                           |                   |                 |                   |                       |                 |
| 68,466,926 |           |    10,000 |      10,000 |          10,000 | 0004-01-01 |     6.49 |      306.45 | A     | A2        | SERVICE MANAGER                         | 6 years    | RENT           |     85,000 | Not Verified        | Dec-2015 | Fully Paid  |      False | https://lendingclub.com/browse/loanDetail.action?loan_id=68466926 |      | credit_card        | Credit card refinancing | 160xx    | PA         | 13.07 |           0 | Apr-2002         |            685 |             689 |              1 |                        |                    106 |       14 |       1 |    10,464 |       34.5 |        23 | w                   |      0.00 |          0.00 | 10,998.972… |       10,998.97 |       10,000.00 |        998.97 |                  0 |          0 |                       0 | Aug-2018     |        1,814.48 |              | Mar-2019           |                  719 |                 715 |                          0 |                             |           1 | Individual       |                  |           |                           |              0 |        8,341 |      27,957 |           2 |           1 |           0 |           0 |                 35 |       17,493 |      57 |           2 |           7 |      2,524 |       46 |           30,300 |      2 |           0 |            1 |                    7 |       1,997 |          8,182 |    50.1 |                        0 |           0 |                164 |                  129 |                     1 |              1 |        1 |                    4 |                          |                     1 |                                |                     0 |              6 |               9 |           7 |        10 |         3 |            13 |            19 |                   9 |       14 |                0 |            0 |                  0 |                  2 |           95.7 |             28.6 |                    1 |         0 |          61,099 |            27,957 |         16,400 |                     30,799 |                 |                        |                         |                          |                        |                  |                  |                    |                     |                       |                                  |                                    |                                     |         False |               |                 |                 |               |                 |                     |                   |                         |                 |              |                      |                                            |                                |                              | Cash                |                False |                           |                   |                 |                   |                       |                 |

Next, data will be imported, the year 2018 will be kept and data types will be automatically adjusted to take less memory. The results will be saved into a feather file.

Code

file_path = dir_interim + "task-2--1-accepted_loans_2018--raw.feather"
if os.path.exists(file_path):
    # Restore from file, if present.
    accepted_2018 = pd.read_feather(file_path)
else:
    # Use Pyarrow backend to read data.
    # Leave only data of loans issued in year 2018.
    accepted_all = pd.read_csv("data/raw/accepted_2007_to_2018Q4.csv", engine="pyarrow")
    accepted_2018 = accepted_all[accepted_all["issue_d"].str.contains("2018", na=False)]
    accepted_2018 = klib.convert_datatypes(accepted_2018)
    accepted_2018.to_feather(file_path)
    del accepted_all

del file_path

The whole dataset took 2.5+ GB of memory (code is not shown here). After leaving only the year 2018, it used 574.3+ MB (code is not shown here). After automatic data type adjustment, the memory usage dropped almost 2 times to 267.4 MB.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2260701 entries, 0 to 2260700
Columns: 151 entries, id to settlement_term
dtypes: float64(113), object(38)
memory usage: 2.5+ GB

<class 'pandas.core.frame.DataFrame'>
Index: 495242 entries, 421097 to 1611876
Columns: 151 entries, id to settlement_term
dtypes: float64(113), object(38)
memory usage: 574.3+ MB

Code

accepted_2018.info(verbose=False)

<class 'pandas.core.frame.DataFrame'>
Index: 495242 entries, 421097 to 1611876
Columns: 151 entries, id to settlement_term
dtypes: category(35), float32(102), float64(11), string(3)
memory usage: 267.4 MB

Note

In this section, a general overview and inspection will be done to get more familiar with the data and to catch the most obvious discrepancies. A detailed EDA will be performed on the training set only.

To spot possible discrepancies, the columns were inspected and the results were sorted by the number of unique values (n_unique). Some columns have no non-missing values (member ID, description), some have only 1 non-missing value (hardship_type, deferral_term, etc.), and these should be excluded from the analysis.

Code

an.col_info(accepted_2018).sort_values("n_unique").pipe(an.style_col_info)

	column	data_type	memory_size	n_unique	p_unique	n_missing	p_missing	n_dominant	p_dominant	p_dom_excl_na	dominant
2	member_id	float32	2.0 MB	0	0%	495,242	100.0%	0	0%	nan%
20	desc	category	495.4 kB	0	0%	495,242	100.0%	0	0%	nan%
130	hardship_type	category	495.4 kB	1	<0.1%	494,874	99.9%	368	0.1%	100.0%	INTEREST ONLY-3 MONTHS DEFERRAL
133	deferral_term	float32	2.0 MB	1	<0.1%	494,874	99.9%	368	0.1%	100.0%	3.0
138	hardship_length	float32	2.0 MB	1	<0.1%	494,874	99.9%	368	0.1%	100.0%	3.0
104	num_tl_120dpd_2m	float32	2.0 MB	1	<0.1%	12,404	2.5%	482,838	97.5%	100.0%	0.0
56	policy_code	float32	2.0 MB	1	<0.1%	0	0%	495,242	100.0%	100.0%	1.0
145	debt_settlement_flag	category	495.5 kB	2	<0.1%	0	0%	494,762	99.9%	99.9%	N
144	disbursement_method	category	495.5 kB	2	<0.1%	0	0%	423,884	85.6%	85.6%	Cash
105	num_tl_30dpd	float32	2.0 MB	2	<0.1%	0	0%	495,219	>99.9%	>99.9%	0.0
38	initial_list_status	category	495.5 kB	2	<0.1%	0	0%	427,183	86.3%	86.3%	w
129	hardship_flag	category	495.5 kB	2	<0.1%	0	0%	495,053	>99.9%	>99.9%	N
57	application_type	category	495.5 kB	2	<0.1%	0	0%	426,257	86.1%	86.1%	Individual
61	acc_now_delinq	float32	2.0 MB	2	<0.1%	0	0%	495,216	>99.9%	>99.9%	0.0
6	term	category	495.5 kB	2	<0.1%	0	0%	344,671	69.6%	69.6%	36 months
18	pymnt_plan	category	495.5 kB	2	<0.1%	0	0%	495,091	>99.9%	>99.9%	n
50	next_pymnt_d	category	495.5 kB	3	<0.1%	56,311	11.4%	438,776	88.6%	>99.9%	Apr-2019
132	hardship_status	category	495.5 kB	3	<0.1%	494,874	99.9%	189	<0.1%	51.4%	ACTIVE
60	verification_status_joint	category	495.6 kB	3	<0.1%	431,231	87.1%	28,298	5.7%	44.2%	Not Verified
147	settlement_status	category	495.5 kB	3	<0.1%	494,762	99.9%	391	0.1%	81.5%	ACTIVE
15	verification_status	category	495.6 kB	3	<0.1%	0	0%	199,934	40.4%	40.4%	Not Verified
13	home_ownership	category	495.7 kB	4	<0.1%	0	0%	239,220	48.3%	48.3%	MORTGAGE
140	hardship_loan_status	category	495.8 kB	5	<0.1%	494,874	99.9%	150	<0.1%	40.8%	Late (16-30 days)
30	inq_last_6mths	float32	2.0 MB	6	<0.1%	0	0%	332,652	67.2%	67.2%	0.0
17	loan_status	category	496.0 kB	7	<0.1%	0	0%	427,181	86.3%	86.3%	Current
9	grade	category	495.9 kB	7	<0.1%	0	0%	141,365	28.5%	28.5%	B
120	sec_app_inq_last_6mths	float32	2.0 MB	7	<0.1%	426,257	86.1%	42,442	8.6%	61.5%	0.0
66	open_il_12m	float32	2.0 MB	8	<0.1%	0	0%	272,591	55.0%	55.0%	0.0
110	pub_rec_bankruptcies	float32	2.0 MB	8	<0.1%	0	0%	434,943	87.8%	87.8%	0.0
54	collections_12_mths_ex_med	float32	2.0 MB	9	<0.1%	0	0%	487,215	98.4%	98.4%	0.0
131	hardship_reason	category	496.2 kB	9	<0.1%	494,874	99.9%	96	<0.1%	26.1%	UNEMPLOYMENT
136	hardship_end_date	category	496.1 kB	9	<0.1%	494,874	99.9%	87	<0.1%	23.6%	May-2019
135	hardship_start_date	category	496.1 kB	9	<0.1%	494,874	99.9%	98	<0.1%	26.6%	Mar-2019
83	chargeoff_within_12_mths	float32	2.0 MB	9	<0.1%	0	0%	492,165	99.4%	99.4%	0.0
137	payment_plan_start_date	category	496.1 kB	9	<0.1%	494,874	99.9%	91	<0.1%	24.7%	Mar-2019
146	debt_settlement_flag_date	category	496.2 kB	10	<0.1%	494,762	99.9%	142	<0.1%	29.6%	Mar-2019
148	settlement_date	category	496.3 kB	11	<0.1%	494,762	99.9%	103	<0.1%	21.5%	Jan-2019
12	emp_length	category	496.2 kB	11	<0.1%	41,987	8.5%	160,382	32.4%	35.4%	10+ years
16	issue_d	category	496.3 kB	12	<0.1%	0	0%	46,311	9.4%	9.4%	May-2018
22	title	category	496.4 kB	12	<0.1%	0	0%	259,642	52.4%	52.4%	Debt consolidation
21	purpose	category	496.7 kB	13	<0.1%	0	0%	259,642	52.4%	52.4%	debt_consolidation
64	open_acc_6m	float32	2.0 MB	15	<0.1%	0	0%	231,308	46.7%	46.7%	0.0
48	last_pymnt_d	category	496.8 kB	15	<0.1%	640	0.1%	407,215	82.2%	82.3%	Mar-2019
34	pub_rec	float32	2.0 MB	16	<0.1%	0	0%	432,258	87.3%	87.3%	0.0
111	tax_liens	float32	2.0 MB	16	<0.1%	0	0%	491,903	99.3%	99.3%	0.0
127	sec_app_collections_12_mths_ex_med	float32	2.0 MB	17	<0.1%	426,257	86.1%	65,305	13.2%	94.7%	0.0
51	last_credit_pull_d	category	497.0 kB	18	<0.1%	5	<0.1%	458,186	92.5%	92.5%	Mar-2019
151	settlement_term	float32	2.0 MB	19	<0.1%	494,762	99.9%	177	<0.1%	36.9%	24.0
67	open_il_24m	float32	2.0 MB	21	<0.1%	0	0%	155,968	31.5%	31.5%	1.0
126	sec_app_chargeoff_within_12_mths	float32	2.0 MB	21	<0.1%	426,257	86.1%	67,272	13.6%	97.5%	0.0
121	sec_app_mort_acc	float32	2.0 MB	22	<0.1%	426,257	86.1%	27,029	5.5%	39.2%	0.0
71	open_rv_12m	float32	2.0 MB	24	<0.1%	0	0%	195,241	39.4%	39.4%	0.0
106	num_tl_90g_dpd_24m	float32	2.0 MB	25	<0.1%	0	0%	475,280	96.0%	96.0%	0.0
26	delinq_2yrs	float32	2.0 MB	26	<0.1%	0	0%	422,718	85.4%	85.4%	0.0
107	num_tl_op_past_12m	float32	2.0 MB	26	<0.1%	0	0%	127,126	25.7%	25.7%	1.0
92	mths_since_recent_inq	float32	2.0 MB	26	<0.1%	61,305	12.4%	41,585	8.4%	9.6%	1.0
139	hardship_dpd	float32	2.0 MB	30	<0.1%	494,874	99.9%	61	<0.1%	16.6%	0.0
89	mort_acc	float32	2.0 MB	30	<0.1%	0	0%	226,841	45.8%	45.8%	0.0
76	inq_fi	float32	2.0 MB	30	<0.1%	0	0%	227,365	45.9%	45.9%	0.0
10	sub_grade	category	498.4 kB	35	<0.1%	0	0%	31,728	6.4%	6.4%	B4
95	num_actv_bc_tl	float32	2.0 MB	35	<0.1%	0	0%	101,502	20.5%	20.5%	2.0
124	sec_app_open_act_il	float32	2.0 MB	37	<0.1%	426,257	86.1%	15,286	3.1%	22.2%	1.0
94	num_accts_ever_120_pd	float32	2.0 MB	38	<0.1%	0	0%	391,216	79.0%	79.0%	0.0
29	fico_range_high	float32	2.0 MB	38	<0.1%	0	0%	32,103	6.5%	6.5%	684.0
28	fico_range_low	float32	2.0 MB	38	<0.1%	0	0%	32,103	6.5%	6.5%	680.0
72	open_rv_24m	float32	2.0 MB	41	<0.1%	0	0%	110,850	22.4%	22.4%	1.0
102	num_rev_tl_bal_gt_0	float32	2.0 MB	45	<0.1%	0	0%	74,540	15.1%	15.1%	4.0
79	acc_open_past_24mths	float32	2.0 MB	45	<0.1%	0	0%	74,308	15.0%	15.0%	3.0
78	inq_last_12m	float32	2.0 MB	45	<0.1%	0	0%	153,733	31.0%	31.0%	0.0
150	settlement_percentage	float32	2.0 MB	45	<0.1%	494,762	99.9%	145	<0.1%	30.2%	45.0
65	open_act_il	float32	2.0 MB	48	<0.1%	0	0%	126,903	25.6%	25.6%	1.0
96	num_actv_rev_tl	float32	2.0 MB	49	<0.1%	0	0%	74,082	15.0%	15.0%	4.0
24	addr_state	category	500.3 kB	50	<0.1%	0	0%	67,267	13.6%	13.6%	CA
77	total_cu_tl	float32	2.0 MB	50	<0.1%	0	0%	265,016	53.5%	53.5%	0.0
97	num_bc_sats	float32	2.0 MB	51	<0.1%	0	0%	82,277	16.6%	16.6%	3.0
118	sec_app_fico_range_high	float32	2.0 MB	62	<0.1%	426,257	86.1%	3,843	0.8%	5.6%	674.0
117	sec_app_fico_range_low	float32	2.0 MB	62	<0.1%	426,257	86.1%	3,843	0.8%	5.6%	670.0
122	sec_app_open_acc	float32	2.0 MB	63	<0.1%	426,257	86.1%	4,906	1.0%	7.1%	9.0
98	num_bc_tl	float32	2.0 MB	65	<0.1%	0	0%	56,940	11.5%	11.5%	4.0
53	last_fico_range_low	float32	2.0 MB	71	<0.1%	0	0%	22,249	4.5%	4.5%	695.0
52	last_fico_range_high	float32	2.0 MB	72	<0.1%	0	0%	22,249	4.5%	4.5%	699.0
100	num_op_rev_tl	float32	2.0 MB	73	<0.1%	0	0%	51,758	10.5%	10.5%	5.0
103	num_sats	float32	2.0 MB	75	<0.1%	0	0%	40,989	8.3%	8.3%	9.0
33	open_acc	float32	2.0 MB	76	<0.1%	0	0%	40,942	8.3%	8.3%	9.0
125	sec_app_num_rev_accts	float32	2.0 MB	81	<0.1%	426,257	86.1%	4,222	0.9%	6.1%	7.0
99	num_il_tl	float32	2.0 MB	99	<0.1%	0	0%	44,138	8.9%	8.9%	3.0
101	num_rev_accts	float32	2.0 MB	101	<0.1%	0	0%	32,567	6.6%	6.6%	8.0
7	int_rate	float32	2.0 MB	110	<0.1%	0	0%	13,718	2.8%	2.8%	13.56
32	mths_since_last_record	float32	2.0 MB	127	<0.1%	432,258	87.3%	1,168	0.2%	1.9%	94.0
37	total_acc	float32	2.0 MB	130	<0.1%	0	0%	18,677	3.8%	3.8%	16.0
128	sec_app_mths_since_last_major_derog	float32	2.0 MB	136	<0.1%	472,865	95.5%	614	0.1%	2.7%	1.0
31	mths_since_last_delinq	float32	2.0 MB	158	<0.1%	276,652	55.9%	3,892	0.8%	1.8%	12.0
84	delinq_amnt	float32	2.0 MB	159	<0.1%	0	0%	495,082	>99.9%	>99.9%	0.0
91	mths_since_recent_bc_dlq	float32	2.0 MB	159	<0.1%	397,132	80.2%	1,724	0.3%	1.8%	44.0
55	mths_since_last_major_derog	float32	2.0 MB	164	<0.1%	380,409	76.8%	1,875	0.4%	1.6%	45.0
93	mths_since_recent_revol_delinq	float32	2.0 MB	165	<0.1%	352,552	71.2%	2,535	0.5%	1.8%	24.0
74	all_util	float32	2.0 MB	166	<0.1%	129	<0.1%	9,374	1.9%	1.9%	60.0
88	mo_sin_rcnt_tl	float32	2.0 MB	204	<0.1%	0	0%	51,311	10.4%	10.4%	2.0
109	percent_bc_gt_75	float32	2.0 MB	204	<0.1%	6,596	1.3%	188,963	38.2%	38.7%	0.0
70	il_util	float32	2.0 MB	235	<0.1%	80,824	16.3%	7,747	1.6%	1.9%	78.0
87	mo_sin_rcnt_rev_tl_op	float32	2.0 MB	286	0.1%	0	0%	35,630	7.2%	7.2%	2.0
141	orig_projected_additional_accrued_interest	float32	2.0 MB	321	0.1%	494,921	99.9%	1	<0.1%	0.3%	191.64
68	mths_since_rcnt_il	float32	2.0 MB	352	0.1%	18,410	3.7%	21,698	4.4%	4.6%	7.0
143	hardship_last_payment_amount	float32	2.0 MB	361	0.1%	494,874	99.9%	2	<0.1%	0.5%	0.81
134	hardship_amount	float32	2.0 MB	367	0.1%	494,874	99.9%	2	<0.1%	0.5%	164.39
142	hardship_payoff_balance_amount	float64	4.0 MB	368	0.1%	494,874	99.9%	1	<0.1%	0.3%	4238.9
90	mths_since_recent_bc	float32	2.0 MB	475	0.1%	6,198	1.3%	22,244	4.5%	4.5%	3.0
149	settlement_amount	float32	2.0 MB	477	0.1%	494,762	99.9%	2	<0.1%	0.4%	4272.0
85	mo_sin_old_il_acct	float32	2.0 MB	507	0.1%	18,410	3.7%	5,195	1.0%	1.1%	130.0
108	pct_tl_nvr_dlq	float32	2.0 MB	588	0.1%	2	<0.1%	276,697	55.9%	55.9%	100.0
119	sec_app_earliest_cr_line	category	1.0 MB	645	0.1%	426,257	86.1%	641	0.1%	0.9%	Aug-2006
27	earliest_cr_line	category	1.1 MB	684	0.1%	0	0%	4,265	0.9%	0.9%	Aug-2006
86	mo_sin_old_rev_tl_op	float32	2.0 MB	731	0.1%	0	0%	3,040	0.6%	0.6%	136.0
23	zip_code	category	1.1 MB	897	0.2%	0	0%	5,364	1.1%	1.1%	112xx
36	revol_util	float32	2.0 MB	1,136	0.2%	592	0.1%	4,942	1.0%	1.0%	0.0
123	sec_app_revol_util	float32	2.0 MB	1,164	0.2%	427,454	86.3%	813	0.2%	1.2%	0.0
82	bc_util	float32	2.0 MB	1,210	0.2%	6,803	1.4%	9,601	1.9%	2.0%	0.0
3	loan_amnt	float32	2.0 MB	1,559	0.3%	0	0%	55,645	11.2%	11.2%	10000.0
4	funded_amnt	float32	2.0 MB	1,559	0.3%	0	0%	55,645	11.2%	11.2%	10000.0
5	funded_amnt_inv	float64	4.0 MB	1,571	0.3%	0	0%	53,986	10.9%	10.9%	10000.0
46	recoveries	float64	4.0 MB	1,998	0.4%	0	0%	493,047	99.6%	99.6%	0.0
47	collection_recovery_fee	float32	2.0 MB	2,000	0.4%	0	0%	493,047	99.6%	99.6%	0.0
45	total_rec_late_fee	float32	2.0 MB	2,334	0.5%	0	0%	488,297	98.6%	98.6%	0.0
59	dti_joint	float32	2.0 MB	3,943	0.8%	426,257	86.1%	49	<0.1%	0.1%	21.64
114	total_bc_limit	float32	2.0 MB	4,585	0.9%	0	0%	6,596	1.3%	1.3%	0.0
62	tot_coll_amt	float32	2.0 MB	7,608	1.5%	0	0%	425,302	85.9%	85.9%	0.0
75	total_rev_hi_lim	float32	2.0 MB	8,303	1.7%	0	0%	1,493	0.3%	0.3%	10000.0
25	dti	float32	2.0 MB	9,464	1.9%	1,132	0.2%	819	0.2%	0.2%	0.0
58	annual_inc_joint	float64	4.0 MB	10,952	2.2%	426,257	86.1%	1,216	0.2%	1.8%	100000.0
73	max_bal_bc	float32	2.0 MB	28,148	5.7%	0	0%	15,355	3.1%	3.1%	0.0
14	annual_inc	float64	4.0 MB	30,071	6.1%	0	0%	18,932	3.8%	3.8%	60000.0
8	installment	float32	2.0 MB	35,303	7.1%	0	0%	1,463	0.3%	0.3%	304.72
116	revol_bal_joint	float32	2.0 MB	44,730	9.0%	426,257	86.1%	69	<0.1%	0.1%	0.0
80	avg_cur_bal	float32	2.0 MB	62,125	12.5%	40	<0.1%	540	0.1%	0.1%	0.0
35	revol_bal	float32	2.0 MB	64,413	13.0%	0	0%	4,742	1.0%	1.0%	0.0
81	bc_open_to_buy	float32	2.0 MB	68,060	13.7%	6,588	1.3%	3,297	0.7%	0.7%	0.0
49	last_pymnt_amnt	float64	4.0 MB	86,195	17.4%	0	0%	1,382	0.3%	0.3%	304.72
69	total_bal_il	float32	2.0 MB	115,553	23.3%	0	0%	62,158	12.6%	12.6%	0.0
115	total_il_high_credit_limit	float32	2.0 MB	128,691	26.0%	0	0%	62,160	12.6%	12.6%	0.0
11	emp_title	string	35.4 MB	129,449	26.1%	54,659	11.0%	8,679	1.8%	2.0%	Teacher
43	total_rec_prncp	float64	4.0 MB	129,889	26.2%	0	0%	4,736	1.0%	1.0%	10000.0
39	out_prncp	float64	4.0 MB	136,867	27.6%	0	0%	57,348	11.6%	11.6%	0.0
113	total_bal_ex_mort	float32	2.0 MB	139,900	28.2%	0	0%	839	0.2%	0.2%	0.0
40	out_prncp_inv	float64	4.0 MB	143,128	28.9%	0	0%	57,348	11.6%	11.6%	0.0
44	total_rec_int	float32	2.0 MB	172,941	34.9%	0	0%	732	0.1%	0.1%	0.0
41	total_pymnt	float64	4.0 MB	234,997	47.5%	0	0%	448	0.1%	0.1%	0.0
42	total_pymnt_inv	float64	4.0 MB	236,121	47.7%	0	0%	448	0.1%	0.1%	0.0
63	tot_cur_bal	float32	2.0 MB	254,794	51.4%	0	0%	575	0.1%	0.1%	0.0
112	tot_hi_cred_lim	float32	2.0 MB	266,282	53.8%	0	0%	207	<0.1%	<0.1%	12500.0
19	url	string	60.9 MB	495,242	100.0%	0	0%	1	<0.1%	<0.1%	https://lendingclub.com/browse/loanDetail.action?loan_id=130954621
1	id	string	32.7 MB	495,242	100.0%	0	0%	1	<0.1%	<0.1%	130954621

In the categorical target columns: first grades and sub-grades (e.g., A, B) seem to be more common than the last ones (e.g., F, G). This should be accounted for in the analysis. E.g., by using a stratified split.

Code

grade_counts = accepted_2018["grade"].value_counts().sort_index()
grade_counts.plot.bar(ylabel="counts")
grade_counts

grade
A    135177
B    141365
C    126850
D     69046
E     18958
F      3175
G       671
Name: count, dtype: int64

Code

sub_grade_counts = accepted_2018["sub_grade"].value_counts().sort_index()
sub_grade_counts.plot.bar(ylabel="counts")
pd.concat([sub_grade_counts.head(), pd.Series({"...": "..."}), sub_grade_counts.tail()])

A1     28887
A2     22871
A3     25264
A4     31648
A5     26507
...      ...
G1       487
G2        88
G3        44
G4        25
G5        27
dtype: object

Loan title and purpose contain similar information. Some categories are very rare so should be merged. One of the variables should be dropped.

Code

ct = an.CrossTab("title", "purpose", data=accepted_2018, margins=True)
ct.counts = (
    ct.counts.sort_values(by="All", axis=0, ascending=False)
    .sort_values(by="All", axis=1, ascending=False)
    .drop("All", axis=0)
    .drop("All", axis=1)
)
ct.heatmap(vmax=1, cbar=False, cmap="crest");

4.1.2 Group-Independent Pre-Processing

In this pre-processing phase:

columns with more than 70% of missing values will be dropped;
extract features (months from issue date, national area code zip_area from the first digit of the ZIP code, title length);
add missing value indicators.
some other columns after manual inspection will be dropped
- irrelevant variables or variables after feature extraction;
- date variables;
- categorical variables with too many categories;
- etc.

Code

file_path = dir_interim + "task-2--2-accepted_loans-2018--preprocessed.feather"

# Drop columns based on manual inspection
drop_manually = [
    # Irrelevant
    "url",
    "id",
    "policy_code",
    "loan_status",
    # Almost duplicates of other columns:
    "title",
    # Dates
    "last_pymnt_d",
    "next_pymnt_d",
    "last_credit_pull_d",
    "issue_d",
    "earliest_cr_line",
    # Too many categories
    "zip_code",
    "addr_state",
    "emp_title",
    # Extremely low variability
    "num_tl_120dpd_2m",
    "acc_now_delinq",
]


if os.path.exists(file_path):
    accepted_2018_preproc = pd.read_feather(file_path)
else:
    # Remove columns with more than 70% of missing values
    accepted_2018_preproc = accepted_2018.loc[:, accepted_2018.isnull().mean() <= 0.7]

    # Pre-process the data
    accepted_2018_preproc = accepted_2018_preproc.assign(
        # Extract features
        title_len=lambda df: df["title"].str.len().fillna(0).astype("float32"),
        issue_d=lambda df: pd.to_datetime(df["issue_d"], format="%b-%Y"),
        issue_month=lambda df: df["issue_d"].dt.month.astype("Int8"),
        zip_area=lambda x: x.zip_code.str[0].astype("category"),
        # Encode variables:
        # - Employment length classes as integers
        emp_length=lambda df: df["emp_length"]
        .replace(dict(zip(utils.work_categories, range(len(utils.work_categories)))))
        .astype("Int8"),
        # Add missing value indicators
        emp_length_is_na=lambda df: df["emp_length"].isna().astype("Int8"),
        dti_is_na=lambda df: df["dti"].isna().astype("Int8"),
        mths_since_last_delinq_is_na=(
            lambda df: df["mths_since_last_delinq"].isna().astype("Int8")
        ),
        revol_util_is_na=lambda df: df["revol_util"].isna().astype("Int8"),
        mths_since_rcnt_il_is_na=lambda df: df["mths_since_rcnt_il"]
        .isna()
        .astype("Int8"),
        il_util_is_na=lambda df: df["il_util"].isna().astype("Int8"),
        all_util_is_na=lambda df: df["all_util"].isna().astype("Int8"),
        bc_open_to_buy_is_na=lambda df: df["bc_open_to_buy"].isna().astype("Int8"),
        bc_util_is_na=lambda df: df["bc_util"].isna().astype("Int8"),
        mo_sin_old_il_acct_is_na=lambda df: df["mo_sin_old_il_acct"]
        .isna()
        .astype("Int8"),
        mths_since_recent_bc_is_na=(
            lambda df: df["mths_since_recent_bc"].isna().astype("Int8")
        ),
        mths_since_recent_inq_is_na=(
            lambda df: df["mths_since_recent_inq"].isna().astype("Int8")
        ),
        percent_bc_gt_75_is_na=lambda df: df["percent_bc_gt_75"].isna().astype("Int8"),
    )

    # Remove columns after manual inspection
    accepted_2018_preproc = accepted_2018_preproc.drop(columns=drop_manually)

    # Save as feather file
    accepted_2018_preproc.to_feather(file_path)

del file_path
if "accepted_2018" in locals():
    del accepted_2018

4.1.3 Create Pre-Processing Pipelines

In this section, pipelines for further processing of data will be created. Suffix _lr means that the pipeline (or other object) is for non-tree-based models (logistic regression and Naive Bayes) and _trees means that it is for tree-based models (LGBM):

_lr pipeline contains one extra step: numeric data scaling.

Code

# Numeric variables except missing value indicators
select_numeric = make_column_selector(dtype_include="number", pattern=".*(?!_is_na$)")
# Categorical variables
select_categorical = make_column_selector(dtype_include="category")

# Create the pipelines
numeric_transformer_lr = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="median")), ("scaler", StandardScaler())]
)
numeric_transformer_trees = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="median"))]
)
categorical_transformer = Pipeline(
    steps=[("onehot", OneHotEncoder(sparse_output=False, handle_unknown="ignore"))]
)

# Merge pipelines of numeric and categorical variables
pre_processing_lr = ColumnTransformer(
    transformers=[
        ("numeric", numeric_transformer_lr, select_numeric),
        ("categorical", categorical_transformer, select_categorical),
    ],
    verbose_feature_names_out=False,
    remainder="passthrough",
)
pre_processing_trees = ColumnTransformer(
    transformers=[
        ("numeric", numeric_transformer_trees, select_numeric),
        ("categorical", categorical_transformer, select_categorical),
    ],
    verbose_feature_names_out=False,
    remainder="passthrough",
)

Code

pre_processing_lr

ColumnTransformer(remainder='passthrough',
                  transformers=[('numeric',
                                 Pipeline(steps=[('imputer',
                                                  SimpleImputer(strategy='median')),
                                                 ('scaler', StandardScaler())]),
                                 <sklearn.compose._column_transformer.make_column_selector object at 0x000001878499E590>),
                                ('categorical',
                                 Pipeline(steps=[('onehot',
                                                  OneHotEncoder(handle_unknown='ignore',
                                                                sparse_output=False))]),
                                 <sklearn.compose._column_transformer.make_column_selector object at 0x00000187803F2450>)],
                  verbose_feature_names_out=False)

Code

pre_processing_trees

ColumnTransformer(remainder='passthrough',
                  transformers=[('numeric',
                                 Pipeline(steps=[('imputer',
                                                  SimpleImputer(strategy='median'))]),
                                 <sklearn.compose._column_transformer.make_column_selector object at 0x000001878499E590>),
                                ('categorical',
                                 Pipeline(steps=[('onehot',
                                                  OneHotEncoder(handle_unknown='ignore',
                                                                sparse_output=False))]),
                                 <sklearn.compose._column_transformer.make_column_selector object at 0x00000187803F2450>)],
                  verbose_feature_names_out=False)

4.1.4 Split to Train, Validation, and Test Sets

Split data into training, validation, and test sets (70%:15%:15%) stratified by loan sub-grade to better reflect this type of data diversity in each set.

Code

file_path = dir_interim + "task-2--2-accepted_loans-2018--preprocessed.feather"

if os.path.exists(file_path):
    accepted_2018_preproc = pd.read_feather(file_path)

del file_path

Code

# Train, validation, test split (stratified by sub-grade): 70%:15%:15%
accepted_train, accepted_validation = train_test_split(
    accepted_2018_preproc,
    test_size=0.3,
    random_state=42,
    stratify=accepted_2018_preproc.sub_grade,
)

accepted_validation, accepted_test = train_test_split(
    accepted_validation,
    test_size=0.5,
    random_state=42,
    stratify=accepted_validation.sub_grade,
)

The sizes of these sets are as follows:

Code

print(f"{accepted_train.shape[0]/1e3:.1f}k rows in training set.")
print(f"{accepted_validation.shape[0]/1e3: .1f}k rows in validation set.")
print(f"{accepted_test.shape[0]/1e3: .1f}k rows in test set.")

346.7k rows in training set.
 74.3k rows in validation set.
 74.3k rows in test set.

4.1.5 EDA on Training Set

EDA in this section will cover 2 parts: EDA of target variables and EDA of other variables.

No essential discrepancies were found. Some variables (e.g., pymnt_plan) have rare values (are almost constant), but some target variables also have rare values (e.g., G5 grade). So these variables will be kept for now and models will “decide” if they are important enough.

Variables such as emp_length, verification_status, or fico_range_low show a potential to be good predictors.

Please, find EDA details below.

Target variables

Correspondence between loan grade and sub-grade (blue cells indicate that the sub-grade is included in the grade) show no discrepancies:

Code

ct = an.CrossTab("grade", "sub_grade", data=accepted_train)
ct.heatmap(vmax=1, cbar=False, cmap="crest", annot=False)
plt.xlabel("Sub-grade")
plt.ylabel("Grade")
del ct

Code

counts_g = (
    accepted_train["grade"]
    .value_counts()
    .sort_index()
    .to_frame()
    .reset_index()
    .assign(
        percent=lambda df: my.format_percent(df["count"] / df["count"].sum() * 100),
    )
)

my.plot_counts_with_labels(counts_g, x="grade", y="count", label="percent", rot=0);

Code

subgrade_count_series = accepted_train["sub_grade"].value_counts().sort_index()
counts_s = (
    subgrade_count_series.to_frame()
    .reset_index()
    .assign(
        percent=lambda df: df["count"] / df["count"].sum() * 100,
        label=lambda df: my.format_percent(df["percent"]),
    )
)

my.plot_counts_with_labels(
    counts_s,
    x="sub_grade",
    y="count",
    label="label",
    label_rotation=90,
    y_lim_max=27_000,
);

Code

print("Counts of sub-grades for F and G grades:")
pd.concat([pd.Series({"Group": "Count"}), subgrade_count_series.tail(10)])

Counts of sub-grades for F and G grades:

Group    Count
F1        1063
F2         403
F3         305
F4         220
F5         231
G1         341
G2          62
G3          31
G4          17
G5          19
dtype: object

In most cases, grades/sub-grades are in alignment with interest rates: lower grades have higher interest rates and vice versa. But there are some exceptions: that not some loans have a 6% interest rate disregarding the grade/sub-grade (this is the case for all grades except grade G).

Please, find the details below.

Code

sns.boxplot(x="grade", y="int_rate", data=accepted_train)
plt.xlabel("Grade")
plt.ylabel("Interest rate");

Code

sns.boxplot(x="sub_grade", y="int_rate", data=accepted_train)
plt.xlabel("Sub-grade")
plt.ylabel("Interest rate");

Let’s look into the outlying visible in the plot above:

Code

accepted_train.query("sub_grade == 'B4'")["int_rate"].value_counts().sort_index()

int_rate
6.00        3
10.90    4935
10.91    2102
11.05    3300
11.55    8370
11.80    3499
Name: count, dtype: int64

Code

accepted_train.query("sub_grade == 'D5'")["int_rate"].value_counts().sort_index()

int_rate
6.00        1
21.45    1093
21.85    3018
22.35    3474
Name: count, dtype: int64

It seems that those values are 6%. Let’s look deeper into it:

Code

ct = an.CrossTab(
    "grade",
    "int_rate_is_6",
    data=accepted_train.assign(int_rate_is_6=lambda df: (df["int_rate"] == 6)),
)
ct.heatmap(vmax=1, cbar=False, cmap="crest")
plt.xlabel("Interest rate is 6%")
plt.ylabel("Grade")
del ct

General EDA

Code

an.col_info(accepted_train, style=True)

	column	data_type	memory_size	n_unique	p_unique	n_missing	p_missing	n_dominant	p_dominant	p_dom_excl_na	dominant
1	loan_amnt	float32	1.4 MB	1,555	0.4%	0	0%	38,907	11.2%	11.2%	10000.0
2	funded_amnt	float32	1.4 MB	1,555	0.4%	0	0%	38,907	11.2%	11.2%	10000.0
3	funded_amnt_inv	float64	2.8 MB	1,569	0.5%	0	0%	37,718	10.9%	10.9%	10000.0
4	term	category	346.9 kB	2	<0.1%	0	0%	241,239	69.6%	69.6%	36 months
5	int_rate	float32	1.4 MB	110	<0.1%	0	0%	9,643	2.8%	2.8%	6.11
6	installment	float32	1.4 MB	29,669	8.6%	0	0%	1,049	0.3%	0.3%	304.72
7	grade	category	347.4 kB	7	<0.1%	0	0%	98,955	28.5%	28.5%	B
8	sub_grade	category	349.8 kB	35	<0.1%	0	0%	22,209	6.4%	6.4%	B4
9	emp_length	Int8	693.3 kB	11	<0.1%	29,368	8.5%	112,097	32.3%	35.3%	10
10	home_ownership	category	347.1 kB	4	<0.1%	0	0%	167,124	48.2%	48.2%	MORTGAGE
11	annual_inc	float64	2.8 MB	22,950	6.6%	0	0%	13,265	3.8%	3.8%	60000.0
12	verification_status	category	347.0 kB	3	<0.1%	0	0%	140,120	40.4%	40.4%	Not Verified
13	pymnt_plan	category	346.9 kB	2	<0.1%	0	0%	346,572	>99.9%	>99.9%	n
14	purpose	category	348.1 kB	13	<0.1%	0	0%	181,886	52.5%	52.5%	debt_consolidation
15	dti	float32	1.4 MB	8,528	2.5%	795	0.2%	595	0.2%	0.2%	0.0
16	delinq_2yrs	float32	1.4 MB	24	<0.1%	0	0%	296,006	85.4%	85.4%	0.0
17	fico_range_low	float32	1.4 MB	38	<0.1%	0	0%	22,587	6.5%	6.5%	680.0
18	fico_range_high	float32	1.4 MB	38	<0.1%	0	0%	22,587	6.5%	6.5%	684.0
19	inq_last_6mths	float32	1.4 MB	6	<0.1%	0	0%	232,437	67.0%	67.0%	0.0
20	mths_since_last_delinq	float32	1.4 MB	151	<0.1%	193,704	55.9%	2,758	0.8%	1.8%	12.0
21	open_acc	float32	1.4 MB	72	<0.1%	0	0%	28,519	8.2%	8.2%	9.0
22	pub_rec	float32	1.4 MB	15	<0.1%	0	0%	302,407	87.2%	87.2%	0.0
23	revol_bal	float32	1.4 MB	57,488	16.6%	0	0%	3,336	1.0%	1.0%	0.0
24	revol_util	float32	1.4 MB	1,116	0.3%	429	0.1%	3,448	1.0%	1.0%	0.0
25	total_acc	float32	1.4 MB	128	<0.1%	0	0%	13,194	3.8%	3.8%	16.0
26	initial_list_status	category	346.9 kB	2	<0.1%	0	0%	298,847	86.2%	86.2%	w
27	out_prncp	float64	2.8 MB	104,998	30.3%	0	0%	40,203	11.6%	11.6%	0.0
28	out_prncp_inv	float64	2.8 MB	109,695	31.6%	0	0%	40,203	11.6%	11.6%	0.0
29	total_pymnt	float64	2.8 MB	178,207	51.4%	0	0%	295	0.1%	0.1%	0.0
30	total_pymnt_inv	float64	2.8 MB	179,921	51.9%	0	0%	295	0.1%	0.1%	0.0
31	total_rec_prncp	float64	2.8 MB	101,938	29.4%	0	0%	3,321	1.0%	1.0%	10000.0
32	total_rec_int	float32	1.4 MB	141,484	40.8%	0	0%	493	0.1%	0.1%	0.0
33	total_rec_late_fee	float32	1.4 MB	1,901	0.5%	0	0%	341,735	98.6%	98.6%	0.0
34	recoveries	float64	2.8 MB	1,405	0.4%	0	0%	345,148	99.6%	99.6%	0.0
35	collection_recovery_fee	float32	1.4 MB	1,406	0.4%	0	0%	345,148	99.6%	99.6%	0.0
36	last_pymnt_amnt	float64	2.8 MB	65,820	19.0%	0	0%	994	0.3%	0.3%	304.72
37	last_fico_range_high	float32	1.4 MB	72	<0.1%	0	0%	15,672	4.5%	4.5%	699.0
38	last_fico_range_low	float32	1.4 MB	71	<0.1%	0	0%	15,672	4.5%	4.5%	695.0
39	collections_12_mths_ex_med	float32	1.4 MB	8	<0.1%	0	0%	341,024	98.4%	98.4%	0.0
40	application_type	category	346.9 kB	2	<0.1%	0	0%	298,502	86.1%	86.1%	Individual
41	tot_coll_amt	float32	1.4 MB	6,504	1.9%	0	0%	297,695	85.9%	85.9%	0.0
42	tot_cur_bal	float32	1.4 MB	204,761	59.1%	0	0%	428	0.1%	0.1%	0.0
43	open_acc_6m	float32	1.4 MB	15	<0.1%	0	0%	161,945	46.7%	46.7%	0.0
44	open_act_il	float32	1.4 MB	47	<0.1%	0	0%	88,854	25.6%	25.6%	1.0
45	open_il_12m	float32	1.4 MB	7	<0.1%	0	0%	190,997	55.1%	55.1%	0.0
46	open_il_24m	float32	1.4 MB	21	<0.1%	0	0%	109,259	31.5%	31.5%	1.0
47	mths_since_rcnt_il	float32	1.4 MB	334	0.1%	12,828	3.7%	15,250	4.4%	4.6%	7.0
48	total_bal_il	float32	1.4 MB	101,091	29.2%	0	0%	43,612	12.6%	12.6%	0.0
49	il_util	float32	1.4 MB	223	0.1%	56,717	16.4%	5,435	1.6%	1.9%	78.0
50	open_rv_12m	float32	1.4 MB	23	<0.1%	0	0%	136,493	39.4%	39.4%	0.0
51	open_rv_24m	float32	1.4 MB	39	<0.1%	0	0%	77,516	22.4%	22.4%	1.0
52	max_bal_bc	float32	1.4 MB	26,163	7.5%	0	0%	10,764	3.1%	3.1%	0.0
53	all_util	float32	1.4 MB	159	<0.1%	99	<0.1%	6,566	1.9%	1.9%	60.0
54	total_rev_hi_lim	float32	1.4 MB	6,799	2.0%	0	0%	1,063	0.3%	0.3%	17000.0
55	inq_fi	float32	1.4 MB	29	<0.1%	0	0%	159,036	45.9%	45.9%	0.0
56	total_cu_tl	float32	1.4 MB	50	<0.1%	0	0%	185,685	53.6%	53.6%	0.0
57	inq_last_12m	float32	1.4 MB	42	<0.1%	0	0%	107,324	31.0%	31.0%	0.0
58	acc_open_past_24mths	float32	1.4 MB	44	<0.1%	0	0%	52,169	15.0%	15.0%	3.0
59	avg_cur_bal	float32	1.4 MB	55,981	16.1%	31	<0.1%	400	0.1%	0.1%	0.0
60	bc_open_to_buy	float32	1.4 MB	60,535	17.5%	4,642	1.3%	2,278	0.7%	0.7%	0.0
61	bc_util	float32	1.4 MB	1,178	0.3%	4,794	1.4%	6,709	1.9%	2.0%	0.0
62	chargeoff_within_12_mths	float32	1.4 MB	8	<0.1%	0	0%	344,527	99.4%	99.4%	0.0
63	delinq_amnt	float32	1.4 MB	109	<0.1%	0	0%	346,561	>99.9%	>99.9%	0.0
64	mo_sin_old_il_acct	float32	1.4 MB	489	0.1%	12,828	3.7%	3,683	1.1%	1.1%	130.0
65	mo_sin_old_rev_tl_op	float32	1.4 MB	715	0.2%	0	0%	2,203	0.6%	0.6%	136.0
66	mo_sin_rcnt_rev_tl_op	float32	1.4 MB	267	0.1%	0	0%	24,897	7.2%	7.2%	2.0
67	mo_sin_rcnt_tl	float32	1.4 MB	190	0.1%	0	0%	35,838	10.3%	10.3%	2.0
68	mort_acc	float32	1.4 MB	29	<0.1%	0	0%	159,064	45.9%	45.9%	0.0
69	mths_since_recent_bc	float32	1.4 MB	445	0.1%	4,380	1.3%	15,561	4.5%	4.5%	3.0
70	mths_since_recent_inq	float32	1.4 MB	26	<0.1%	42,734	12.3%	29,234	8.4%	9.6%	1.0
71	num_accts_ever_120_pd	float32	1.4 MB	34	<0.1%	0	0%	273,859	79.0%	79.0%	0.0
72	num_actv_bc_tl	float32	1.4 MB	33	<0.1%	0	0%	70,996	20.5%	20.5%	2.0
73	num_actv_rev_tl	float32	1.4 MB	45	<0.1%	0	0%	51,745	14.9%	14.9%	4.0
74	num_bc_sats	float32	1.4 MB	49	<0.1%	0	0%	57,711	16.6%	16.6%	3.0
75	num_bc_tl	float32	1.4 MB	63	<0.1%	0	0%	39,957	11.5%	11.5%	4.0
76	num_il_tl	float32	1.4 MB	98	<0.1%	0	0%	30,943	8.9%	8.9%	3.0
77	num_op_rev_tl	float32	1.4 MB	69	<0.1%	0	0%	36,283	10.5%	10.5%	5.0
78	num_rev_accts	float32	1.4 MB	99	<0.1%	0	0%	22,799	6.6%	6.6%	8.0
79	num_rev_tl_bal_gt_0	float32	1.4 MB	41	<0.1%	0	0%	52,026	15.0%	15.0%	4.0
80	num_sats	float32	1.4 MB	72	<0.1%	0	0%	28,551	8.2%	8.2%	9.0
81	num_tl_30dpd	float32	1.4 MB	2	<0.1%	0	0%	346,653	>99.9%	>99.9%	0.0
82	num_tl_90g_dpd_24m	float32	1.4 MB	24	<0.1%	0	0%	332,722	96.0%	96.0%	0.0
83	num_tl_op_past_12m	float32	1.4 MB	26	<0.1%	0	0%	89,094	25.7%	25.7%	1.0
84	pct_tl_nvr_dlq	float32	1.4 MB	561	0.2%	2	<0.1%	193,695	55.9%	55.9%	100.0
85	percent_bc_gt_75	float32	1.4 MB	193	0.1%	4,648	1.3%	132,349	38.2%	38.7%	0.0
86	pub_rec_bankruptcies	float32	1.4 MB	8	<0.1%	0	0%	304,300	87.8%	87.8%	0.0
87	tax_liens	float32	1.4 MB	15	<0.1%	0	0%	344,309	99.3%	99.3%	0.0
88	tot_hi_cred_lim	float32	1.4 MB	209,235	60.4%	0	0%	150	<0.1%	<0.1%	17000.0
89	total_bal_ex_mort	float32	1.4 MB	123,038	35.5%	0	0%	611	0.2%	0.2%	0.0
90	total_bc_limit	float32	1.4 MB	3,914	1.1%	0	0%	4,648	1.3%	1.3%	0.0
91	total_il_high_credit_limit	float32	1.4 MB	111,710	32.2%	0	0%	43,614	12.6%	12.6%	0.0
92	hardship_flag	category	346.9 kB	2	<0.1%	0	0%	346,544	>99.9%	>99.9%	N
93	disbursement_method	category	346.9 kB	2	<0.1%	0	0%	296,679	85.6%	85.6%	Cash
94	debt_settlement_flag	category	346.9 kB	2	<0.1%	0	0%	346,353	99.9%	99.9%	N
95	title_len	float32	1.4 MB	10	<0.1%	0	0%	181,886	52.5%	52.5%	18.0
96	issue_month	Int8	693.3 kB	12	<0.1%	0	0%	32,520	9.4%	9.4%	10
97	zip_area	category	347.5 kB	10	<0.1%	0	0%	60,894	17.6%	17.6%	9
98	emp_length_is_na	Int8	693.3 kB	2	<0.1%	0	0%	317,301	91.5%	91.5%	0
99	dti_is_na	Int8	693.3 kB	2	<0.1%	0	0%	345,874	99.8%	99.8%	0
100	mths_since_last_delinq_is_na	Int8	693.3 kB	2	<0.1%	0	0%	193,704	55.9%	55.9%	1
101	revol_util_is_na	Int8	693.3 kB	2	<0.1%	0	0%	346,240	99.9%	99.9%	0
102	mths_since_rcnt_il_is_na	Int8	693.3 kB	2	<0.1%	0	0%	333,841	96.3%	96.3%	0
103	il_util_is_na	Int8	693.3 kB	2	<0.1%	0	0%	289,952	83.6%	83.6%	0
104	all_util_is_na	Int8	693.3 kB	2	<0.1%	0	0%	346,570	>99.9%	>99.9%	0
105	bc_open_to_buy_is_na	Int8	693.3 kB	2	<0.1%	0	0%	342,027	98.7%	98.7%	0
106	bc_util_is_na	Int8	693.3 kB	2	<0.1%	0	0%	341,875	98.6%	98.6%	0
107	mo_sin_old_il_acct_is_na	Int8	693.3 kB	2	<0.1%	0	0%	333,841	96.3%	96.3%	0
108	mths_since_recent_bc_is_na	Int8	693.3 kB	2	<0.1%	0	0%	342,289	98.7%	98.7%	0
109	mths_since_recent_inq_is_na	Int8	693.3 kB	2	<0.1%	0	0%	303,935	87.7%	87.7%	0
110	percent_bc_gt_75_is_na	Int8	693.3 kB	2	<0.1%	0	0%	342,021	98.7%	98.7%	0

Code

if do_eda:
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")

        report_accepted_2018 = sweetviz.analyze(
            [accepted_train, "Accepted loans (train)"],
            target_feat="int_rate",
            pairwise_analysis="off",
        )
        report_accepted_2018.show_notebook()

4.1.6 Separate Target Variables

As there are 3 sub-tasks (predict grade, sub-grade, and interest rate), 3 separate target variables for each set of predictors will be created.

Code

X_train = accepted_train.drop(columns=["grade", "sub_grade", "int_rate"])
y_train_grades = accepted_train["grade"]
y_train_subgrades = accepted_train["sub_grade"]
y_train_int_rate = accepted_train["int_rate"]

X_validation = accepted_validation.drop(columns=["grade", "sub_grade", "int_rate"])
y_validation_grades = accepted_validation["grade"]
y_validation_subgrades = accepted_validation["sub_grade"]
y_validation_int_rate = accepted_validation["int_rate"]

X_test = accepted_test.drop(columns=["grade", "sub_grade", "int_rate"])
y_test_grades = accepted_test["grade"]
y_test_subgrades = accepted_test["sub_grade"]
y_test_int_rate = accepted_test["int_rate"]

4.2 Loan Grade Prediction

This section focuses on predicting loan grades.

4.2.1 Train Models

Here 5 models will be trained:

logistic regression (LR);
Naive Bayes (NB);
LGBM;
Combination of all 3 models (LR, NB, and LGBM) with soft voting;
Combination of all 3 models (LR, NB, and LGBM) with hard voting.

Code

# Dictionary to collect the results
models_grades = {}

Code

@my.cache_results(dir_interim + "models_2_01_naive_bayes.pickle")
def fit_nb_grades():
    """Fit a Naive Bayes model."""
    pipeline = Pipeline(
        steps=[
            ("preprocessor", clone(pre_processing_lr)),
            ("classifier", GaussianNB()),
        ]
    )
    pipeline.fit(X_train, y_train_grades)

    return pipeline


models_grades["Naive Bayes"] = fit_nb_grades()

Code

@my.cache_results(dir_interim + "models_2_02_logistic_regression_sgd.pickle")
def fit_lr_sgd_grades():
    """Fit a Logistic Regression model (via SGD training)."""
    pipeline = Pipeline(
        steps=[
            ("preprocessor", clone(pre_processing_lr)),
            (
                "classifier",
                SGDClassifier(
                    random_state=1, loss="log_loss", n_jobs=-1, class_weight="balanced"
                ),
            ),
        ]
    )
    pipeline.fit(X_train, y_train_grades)

    return pipeline


models_grades["Logistic Regression"] = fit_lr_sgd_grades()

Code

@my.cache_results(dir_interim + "models_2_03_lgbm--test.pickle")
def fit_lgbm_grades():
    """Fit a LGBM model."""
    pipeline = Pipeline(
        steps=[
            ("preprocessor", clone(pre_processing_trees)),
            (
                "classifier",
                LGBMClassifier(
                    random_state=1,
                    class_weight="balanced",
                    objective="multiclass",
                    n_jobs=-1,
                    device="gpu",
                    verbosity=1,
                ),
            ),
        ]
    )
    pipeline.fit(X_train, y_train_grades)
    return pipeline


# Time: 1m 1.9s
models_grades["LGBM"] = fit_lgbm_grades()

Code

@my.cache_results(dir_interim + "models_2_04_voting_soft.pickle")
def fit_voting_soft_grades():
    """Fit a voting classifier with soft voting."""

    classifiers = [
        ("GaussianNB", GaussianNB()),
        (
            "SGDClassifier",
            SGDClassifier(
                random_state=1, loss="log_loss", n_jobs=-1, class_weight="balanced"
            ),
        ),
        (
            "LGBMClassifier",
            LGBMClassifier(
                random_state=1,
                class_weight="balanced",
                objective="multiclass",
                n_jobs=-1,
                device="gpu",
                verbosity=1,
            ),
        ),
    ]

    voting_classifier = VotingClassifier(classifiers, voting="soft", n_jobs=-1)

    pipeline = Pipeline(
        steps=[
            ("preprocessor", clone(pre_processing_lr)),
            ("classifier", voting_classifier),
        ]
    )
    pipeline.fit(X_train, y_train_grades)
    return pipeline


# Time: 1m 7.9s
models_grades["Voting (soft)"] = fit_voting_soft_grades()

Code

@my.cache_results(dir_interim + "models_2_05_voting_hard.pickle")
def fit_voting_hard_grades():
    """Fit a voting classifier with hard voting."""

    classifiers = [
        ("GaussianNB", GaussianNB()),
        (
            "SGDClassifier",
            SGDClassifier(
                random_state=1, loss="log_loss", n_jobs=-1, class_weight="balanced"
            ),
        ),
        (
            "LGBMClassifier",
            LGBMClassifier(
                random_state=1,
                class_weight="balanced",
                objective="multiclass",
                n_jobs=-1,
                device="gpu",
                verbosity=1,
            ),
        ),
    ]

    voting_classifier = VotingClassifier(classifiers, voting="hard", n_jobs=-1)

    pipeline = Pipeline(
        steps=[
            ("preprocessor", clone(pre_processing_lr)),
            ("classifier", voting_classifier),
        ]
    )
    pipeline.fit(X_train, y_train_grades)
    return pipeline


# Time: 1m 7.6s
models_grades["Voting (hard)"] = fit_voting_hard_grades()

4.2.2 Evaluation

Next, the models will be evaluated. Macro-averaged ROC AUC (and F1 scores where ROC AUC is unavailable) will be used as the main performance metric. The results will be compared to the baseline model that always predicts the most frequent class. Two performance metrics are needed as for hard-voting ROC AUC cannot be calculated (it requires probabilities).

Results: the best-performing model was LGBM with ROC AUC of 0.9950 and F1 of 0.83 in the validation set.

Please, find the details below.

# Report that includes ROC AUC
def print_classification_report_for_grades(models, model_name, include_roc_auc=True):
    """Make a classification report for a given model.

    Creates classification reports on training and validation sets.
    Args:
        models (dict|scikit learn model): Either a dictionary with models or a model.
        model_name (str): The name of the model to be evaluated.
        include_roc_auc (bool): Whether to include ROC AUC in the report.
    """
    ml.print_classification_report(
        models,
        model_name,
        X_train,
        X_validation,
        y_train_grades,
        y_validation_grades,
        label_train_set="Train",
        label_test_set="Validation",
        label_model=model_name,
        include_roc_auc=include_roc_auc,
    )

Code

dummy_grades = DummyClassifier(strategy="most_frequent")
dummy_grades.fit(X_train, y_train_grades)

print_classification_report_for_grades(dummy_grades, "baseline")

--- Train (baseline) ---
ROC AUC: 0.5000 (one-versus-rest macro average)

              precision    recall  f1-score   support

           A       0.00      0.00      0.00     94625
           B       0.29      1.00      0.44     98955
           C       0.00      0.00      0.00     88794
           D       0.00      0.00      0.00     48333
           E       0.00      0.00      0.00     13270
           F       0.00      0.00      0.00      2222
           G       0.00      0.00      0.00       470

    accuracy                           0.29    346669
   macro avg       0.04      0.14      0.06    346669
weighted avg       0.08      0.29      0.13    346669


--- Validation (baseline) ---
ROC AUC: 0.5000 (one-versus-rest macro average)

              precision    recall  f1-score   support

           A       0.00      0.00      0.00     20275
           B       0.29      1.00      0.44     21203
           C       0.00      0.00      0.00     19027
           D       0.00      0.00      0.00     10355
           E       0.00      0.00      0.00      2846
           F       0.00      0.00      0.00       479
           G       0.00      0.00      0.00       101

    accuracy                           0.29     74286
   macro avg       0.04      0.14      0.06     74286
weighted avg       0.08      0.29      0.13     74286

Evaluation of Naive Bayes model:

Code

print_classification_report_for_grades(models_grades, "Naive Bayes")

--- Train (Naive Bayes) ---
ROC AUC: 0.7387 (one-versus-rest macro average)

              precision    recall  f1-score   support

           A       0.58      0.71      0.64     94625
           B       0.32      0.57      0.41     98955
           C       0.35      0.02      0.03     88794
           D       0.38      0.09      0.14     48333
           E       0.14      0.06      0.08     13270
           F       0.04      0.31      0.07      2222
           G       0.01      0.29      0.02       470

    accuracy                           0.38    346669
   macro avg       0.26      0.29      0.20    346669
weighted avg       0.40      0.38      0.32    346669


--- Validation (Naive Bayes) ---
ROC AUC: 0.7440 (one-versus-rest macro average)

              precision    recall  f1-score   support

           A       0.58      0.71      0.64     20275
           B       0.32      0.57      0.41     21203
           C       0.36      0.02      0.04     19027
           D       0.39      0.09      0.15     10355
           E       0.14      0.06      0.09      2846
           F       0.04      0.32      0.08       479
           G       0.01      0.24      0.01       101

    accuracy                           0.38     74286
   macro avg       0.26      0.29      0.20     74286
weighted avg       0.40      0.38      0.32     74286

Evaluation of Logistic Regression model:

Code

print_classification_report_for_grades(models_grades, "Logistic Regression")

--- Train (Logistic Regression) ---
ROC AUC: 0.9062 (one-versus-rest macro average)

              precision    recall  f1-score   support

           A       0.76      0.89      0.82     94625
           B       0.61      0.50      0.55     98955
           C       0.56      0.49      0.52     88794
           D       0.53      0.51      0.52     48333
           E       0.33      0.57      0.42     13270
           F       0.16      0.14      0.15      2222
           G       0.07      0.72      0.12       470

    accuracy                           0.61    346669
   macro avg       0.43      0.55      0.44    346669
weighted avg       0.61      0.61      0.61    346669


--- Validation (Logistic Regression) ---
ROC AUC: 0.9036 (one-versus-rest macro average)

              precision    recall  f1-score   support

           A       0.76      0.89      0.82     20275
           B       0.60      0.50      0.55     21203
           C       0.56      0.48      0.52     19027
           D       0.53      0.51      0.52     10355
           E       0.33      0.56      0.42      2846
           F       0.17      0.15      0.16       479
           G       0.07      0.76      0.13       101

    accuracy                           0.60     74286
   macro avg       0.43      0.55      0.44     74286
weighted avg       0.61      0.60      0.60     74286

Evaluation of LGBM model:

Code

print_classification_report_for_grades(models_grades, "LGBM")

--- Train (LGBM) ---
ROC AUC: 0.9963 (one-versus-rest macro average)

              precision    recall  f1-score   support

           A       0.96      0.98      0.97     94625
           B       0.93      0.92      0.92     98955
           C       0.94      0.90      0.92     88794
           D       0.93      0.92      0.92     48333
           E       0.85      0.96      0.90     13270
           F       0.91      1.00      0.95      2222
           G       0.80      1.00      0.89       470

    accuracy                           0.93    346669
   macro avg       0.90      0.95      0.92    346669
weighted avg       0.93      0.93      0.93    346669


--- Validation (LGBM) ---
ROC AUC: 0.9950 (one-versus-rest macro average)

              precision    recall  f1-score   support

           A       0.96      0.98      0.97     20275
           B       0.92      0.92      0.92     21203
           C       0.93      0.90      0.91     19027
           D       0.92      0.91      0.92     10355
           E       0.84      0.93      0.88      2846
           F       0.74      0.86      0.80       479
           G       0.38      0.45      0.41       101

    accuracy                           0.93     74286
   macro avg       0.81      0.85      0.83     74286
weighted avg       0.93      0.93      0.93     74286

Evaluate the soft voting ensemble model:

Code

print_classification_report_for_grades(models_grades, "Voting (soft)")

--- Train (Voting (soft)) ---
ROC AUC: 0.9858 (one-versus-rest macro average)

              precision    recall  f1-score   support

           A       0.76      0.95      0.84     94625
           B       0.98      0.65      0.78     98955
           C       0.98      0.63      0.77     88794
           D       0.90      0.65      0.75     48333
           E       0.78      0.64      0.70     13270
           F       0.54      0.22      0.31      2222
           G       0.01      0.99      0.02       470

    accuracy                           0.72    346669
   macro avg       0.70      0.67      0.60    346669
weighted avg       0.89      0.72      0.78    346669


--- Validation (Voting (soft)) ---
ROC AUC: 0.9823 (one-versus-rest macro average)

              precision    recall  f1-score   support

           A       0.75      0.94      0.84     20275
           B       0.98      0.64      0.77     21203
           C       0.98      0.63      0.76     19027
           D       0.90      0.64      0.75     10355
           E       0.76      0.61      0.68      2846
           F       0.47      0.16      0.24       479
           G       0.01      0.97      0.02       101

    accuracy                           0.71     74286
   macro avg       0.69      0.66      0.58     74286
weighted avg       0.89      0.71      0.78     74286

Evaluate the hard voting ensemble model (no AUC is calculated):

Code

# NOTE: for hard voting `.predict_proba` is not available.
print_classification_report_for_grades(
    models_grades, "Voting (hard)", include_roc_auc=False
)

--- Train (Voting (hard)) ---
              precision    recall  f1-score   support

           A       0.67      0.99      0.80     94625
           B       0.75      0.67      0.71     98955
           C       0.83      0.64      0.73     88794
           D       0.86      0.59      0.70     48333
           E       0.75      0.62      0.68     13270
           F       0.89      0.15      0.26      2222
           G       0.09      0.99      0.16       470

    accuracy                           0.73    346669
   macro avg       0.69      0.66      0.58    346669
weighted avg       0.76      0.73      0.73    346669


--- Validation (Voting (hard)) ---
              precision    recall  f1-score   support

           A       0.67      0.99      0.80     20275
           B       0.74      0.66      0.70     21203
           C       0.82      0.64      0.72     19027
           D       0.85      0.59      0.70     10355
           E       0.75      0.60      0.67      2846
           F       0.84      0.15      0.25       479
           G       0.08      0.83      0.14       101

    accuracy                           0.73     74286
   macro avg       0.68      0.64      0.57     74286
weighted avg       0.76      0.73      0.73     74286

The best-performing model is LGBM. It will be explored in more detail.

Code

y_pred_validation_lgbm_grades = models_grades["LGBM"].predict(X_validation)

Code

sns.set_style("white")
ml.plot_confusion_matrices(
    y_validation_grades,
    y_pred_validation_lgbm_grades,
    figsize=(15, 4),
    text_kw={"size": 9},
);

It seems that the highest confusion is between F and G grades.

4.2.3 Feature Importance (General)

Internal LGBM method as well as SHAP values were used to evaluate feature importance. Variables total_rec_int, total_rec_prncp and installment were identified as the most important ones by both methods just in different order.

Please, find the details below.

Code

@my.cache_results(dir_interim + "task-2-grades--shap_lgbm_k=all.pkl")
def get_shap_values_grades_lgbm():
    model = "LGBM"
    preproc = Pipeline(steps=models_grades[model].steps[:-1])
    classifier = models_grades[model]["classifier"]
    X_validation_preproc = preproc.transform(X_validation)

    tree_explainer = shap.TreeExplainer(classifier)
    shap_values = tree_explainer.shap_values(X_validation_preproc)

    return shap_values, X_validation_preproc


shap_values_lgbm_grades, data_for_lgbm_grades = get_shap_values_grades_lgbm()
# Time: 3m 17.8s

Code

sns.set_style("white")
lgb.plot_importance(
    models_grades["LGBM"]["classifier"],
    max_num_features=50,
    figsize=(10, 10),
    height=0.8,
    title="LGBM Feature Importance (Grades Prediction)",
);

Code

vals = np.abs(shap_values_lgbm_grades).mean(0).mean(0)
feature_importance = (
    pd.DataFrame(
        list(zip(data_for_lgbm_grades.columns, vals)),
        columns=["col_name", "importance"],
    )
    .sort_values(by=["importance"], ascending=False)
    .reset_index()
)

feature_importance.query("importance > 0.01").head().style.format(precision=4)

	index	col_name	importance
0	22	total_rec_int	0.7640
1	21	total_rec_prncp	0.4286
2	3	installment	0.3728
3	2	funded_amnt_inv	0.1921
4	120	initial_list_status_f	0.1906

Code

shap.summary_plot(
    shap_values_lgbm_grades, data_for_lgbm_grades, plot_type="bar", max_display=150
)

4.2.4 Feature Importance (by Grade)

Analyzing feature importance by grade, it seems that for higher grade prediction, total interest received (total_rec_int) is the most important feature, while for lower grade prediction, the most important feature monthly installment size (installment).

Find the details below.

def shap_summary_plot__grades(gr_index):
    """Convenience function to plot SHAP summary plot for a given grade.

    Args:
        gr_index (int): The index of the grade to plot.
    """
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        shap.summary_plot(
            shap_values_lgbm_grades[gr_index],
            data_for_lgbm_grades,
            plot_type="dot",
            max_display=20,
            plot_size=(10, 6),
        )

Feature importances for Grade A:

Code

shap_summary_plot__grades(0)

No data for colormapping provided via 'c'. Parameters 'vmin', 'vmax' will be ignored

Feature importances for Grade B:

Code

shap_summary_plot__grades(1)

No data for colormapping provided via 'c'. Parameters 'vmin', 'vmax' will be ignored

Feature importances for Grade C:

Code

shap_summary_plot__grades(2)

No data for colormapping provided via 'c'. Parameters 'vmin', 'vmax' will be ignored

Feature importances for Grade D:

Code

shap_summary_plot__grades(3)

No data for colormapping provided via 'c'. Parameters 'vmin', 'vmax' will be ignored

Feature importances for Grade E:

Code

shap_summary_plot__grades(4)

No data for colormapping provided via 'c'. Parameters 'vmin', 'vmax' will be ignored

Feature importances for Grade F:

Code

shap_summary_plot__grades(5)

No data for colormapping provided via 'c'. Parameters 'vmin', 'vmax' will be ignored

Feature importances for Grade G:

Code

shap_summary_plot__grades(6)

No data for colormapping provided via 'c'. Parameters 'vmin', 'vmax' will be ignored

4.2.5 Re-Train Models on the Most Important Features

The LGBM model will be trained on a smaller number of features (41, 30, and 13). This number of features was selected using 0.010, 0.025, and 0.090 cut-off thresholds of SHAP feature importance. The model with 13 features showed the best overall performance (see Table 4.1) and improvement of classification accuracy for most of the grades except grade F (see Table 4.2).

Table 4.1. Overall classification performance (validation set) of the LGBM models with all 140 (all) and 13 features.

Number of features	ROC AUC (validation)
140 (all)	0.9950
45	0.9950
30	0.9951
13	0.9967

Table 4.2. Classification performance by grade (validation set) of the LGBM models with all 140 (all) and 13 features.

Grade	F1 (all features)	F1 (13 features)
A	0.97	0.98
B	0.92	0.94
C	0.91	0.93
D	0.92	0.94
E	0.88	0.90
F	0.80	0.78
G	0.41	0.49

Please, find more details below.

def fit_lgbm_grades_with_selection(features):
    """Template to fit a LGBM model with a smalled number of features."""
    pipeline = Pipeline(
        steps=[
            ("preprocessor", clone(pre_processing_trees)),
            ("selector", ColumnSelector(features)),
            (
                "classifier",
                LGBMClassifier(
                    random_state=1,
                    class_weight="balanced",
                    objective="multiclass",
                    n_jobs=-1,
                    device="gpu",
                    verbosity=1,
                ),
            ),
        ]
    )
    pipeline.fit(X_train, y_train_grades)
    return pipeline


@my.cache_results(dir_interim + "models_2_03_lgbm-45-features.pickle")
def fit_lgbm_grades_45():
    features = feature_importance.query("importance > 0.010").col_name.to_list()
    return fit_lgbm_grades_with_selection(features)


@my.cache_results(dir_interim + "models_2_03_lgbm-30-features.pickle")
def fit_lgbm_grades_30():
    features = feature_importance.query("importance > 0.025").col_name.to_list()
    return fit_lgbm_grades_with_selection(features)


@my.cache_results(dir_interim + "models_2_03_lgbm-13-features.pickle")
def fit_lgbm_grades_13():
    features = feature_importance.query("importance > 0.090").col_name.to_list()
    return fit_lgbm_grades_with_selection(features)

Code

models_grades["LGBM (45 features)"] = fit_lgbm_grades_45()

Code

models_grades["LGBM (30 features)"] = fit_lgbm_grades_30()

Code

models_grades["LGBM (13 features)"] = fit_lgbm_grades_13()

Code

print_classification_report_for_grades(models_grades, "LGBM (45 features)")

--- Train (LGBM (45 features)) ---
ROC AUC: 0.9964 (one-versus-rest macro average)

              precision    recall  f1-score   support

           A       0.96      0.98      0.97     94625
           B       0.93      0.92      0.92     98955
           C       0.94      0.91      0.92     88794
           D       0.93      0.92      0.92     48333
           E       0.85      0.96      0.90     13270
           F       0.91      1.00      0.95      2222
           G       0.77      1.00      0.87       470

    accuracy                           0.94    346669
   macro avg       0.90      0.96      0.92    346669
weighted avg       0.94      0.94      0.94    346669

--- Validation (LGBM (45 features)) ---
ROC AUC: 0.9950 (one-versus-rest macro average)

              precision    recall  f1-score   support

           A       0.96      0.98      0.97     20275
           B       0.92      0.92      0.92     21203
           C       0.93      0.90      0.91     19027
           D       0.92      0.92      0.92     10355
           E       0.84      0.94      0.89      2846
           F       0.77      0.85      0.81       479
           G       0.39      0.49      0.43       101

    accuracy                           0.93     74286
   macro avg       0.82      0.86      0.84     74286
weighted avg       0.93      0.93      0.93     74286

Code

print_classification_report_for_grades(models_grades, "LGBM (30 features)")

--- Train (LGBM (30 features)) ---
ROC AUC: 0.9965 (one-versus-rest macro average)

              precision    recall  f1-score   support

           A       0.96      0.98      0.97     94625
           B       0.93      0.93      0.93     98955
           C       0.94      0.91      0.92     88794
           D       0.93      0.92      0.93     48333
           E       0.85      0.96      0.90     13270
           F       0.87      0.99      0.93      2222
           G       0.72      1.00      0.84       470

    accuracy                           0.94    346669
   macro avg       0.89      0.95      0.92    346669
weighted avg       0.94      0.94      0.94    346669

--- Validation (LGBM (30 features)) ---
ROC AUC: 0.9951 (one-versus-rest macro average)

              precision    recall  f1-score   support

           A       0.96      0.98      0.97     20275
           B       0.92      0.92      0.92     21203
           C       0.93      0.90      0.92     19027
           D       0.93      0.92      0.92     10355
           E       0.84      0.93      0.88      2846
           F       0.74      0.85      0.79       479
           G       0.35      0.52      0.42       101

    accuracy                           0.93     74286
   macro avg       0.81      0.86      0.83     74286
weighted avg       0.93      0.93      0.93     74286

Code

print_classification_report_for_grades(models_grades, "LGBM (13 features)")

--- Train (LGBM (13 features)) ---
ROC AUC: 0.9976 (one-versus-rest macro average)

              precision    recall  f1-score   support

           A       0.97      0.99      0.98     94625
           B       0.95      0.94      0.94     98955
           C       0.95      0.92      0.94     88794
           D       0.95      0.93      0.94     48333
           E       0.88      0.96      0.92     13270
           F       0.87      0.94      0.90      2222
           G       0.64      1.00      0.78       470

    accuracy                           0.95    346669
   macro avg       0.89      0.96      0.91    346669
weighted avg       0.95      0.95      0.95    346669

--- Validation (LGBM (13 features)) ---
ROC AUC: 0.9967 (one-versus-rest macro average)

              precision    recall  f1-score   support

           A       0.97      0.99      0.98     20275
           B       0.94      0.94      0.94     21203
           C       0.95      0.92      0.93     19027
           D       0.95      0.93      0.94     10355
           E       0.87      0.94      0.90      2846
           F       0.74      0.83      0.78       479
           G       0.40      0.62      0.49       101

    accuracy                           0.95     74286
   macro avg       0.83      0.88      0.85     74286
weighted avg       0.95      0.95      0.95     74286

The model with 13 features performed best. Let’s analyze it in more detail.

The 13 features are:

Code

feature_importance.head(13).col_name.to_list()

['total_rec_int',
 'total_rec_prncp',
 'installment',
 'funded_amnt_inv',
 'initial_list_status_f',
 'out_prncp_inv',
 'issue_month',
 'loan_amnt',
 'funded_amnt',
 'disbursement_method_Cash',
 'out_prncp',
 'term_ 36 months',
 'fico_range_high']

Code

y_pred_validation_lgbm_grades_13 = models_grades["LGBM (13 features)"].predict(
    X_validation
)
sns.set_style("white")
ml.plot_confusion_matrices(
    y_validation_grades,
    y_pred_validation_lgbm_grades_13,
    figsize=(15, 4),
    text_kw={"size": 9},
);

LGBM model with 13 features showed the best ROC AUC value of 0.9967 (validation). Comparing each grade separately, in the 13-feature model, F1 scores improved in all grades except Grade F (see Table 4.2).

4.2.6 Evaluation on Test Set

The final model evaluation on the test set shows ROC AUC of 0.9968.

Please, find the details below.

Code

# Model input variables
grade_model_input = [
    "total_rec_int",
    "total_rec_prncp",
    "installment",
    "funded_amnt_inv",
    "initial_list_status_f",
    "out_prncp_inv",
    "issue_month",
    "loan_amnt",
    "funded_amnt",
    "disbursement_method_Cash",
    "out_prncp",
    "term_ 36 months",
    "fico_range_high",
]

Code

X_train_validation = pd.concat([X_train, X_validation], axis="index")
y_train_validation_grades = pd.concat(
    [y_train_grades, y_validation_grades], axis="index"
)


@my.cache_results(dir_interim + "models_2_03_lgbm--final-evaluation.pickle")
def fit_lgbm_grades_final_evaluation():
    """Fit a LGBM model for final testing."""
    pipeline = Pipeline(
        steps=[
            ("preprocessor", clone(pre_processing_trees)),
            ("selector", ColumnSelector(grade_model_input)),
            (
                "classifier",
                LGBMClassifier(
                    random_state=1,
                    class_weight="balanced",
                    objective="multiclass",
                    n_jobs=-1,
                    device="gpu",
                    verbosity=1,
                ),
            ),
        ]
    )
    pipeline.fit(X_train_validation, y_train_validation_grades)
    return pipeline


models_grades["LGBM (final evaluation)"] = fit_lgbm_grades_final_evaluation()

Code

ml.print_classification_report(
    models=models_grades,
    model_name="LGBM (final evaluation)",
    X_train=X_train_validation,
    X_test=X_test,
    y_train=y_train_validation_grades,
    y_test=y_test_grades,
    label_train_set="Train + Validation",
    label_test_set="Test",
    label_model="Final evaluation",
)

--- Train + Validation (Final evaluation) ---
ROC AUC: 0.9976 (one-versus-rest macro average)

              precision    recall  f1-score   support

           A       0.97      0.99      0.98    114900
           B       0.94      0.94      0.94    120158
           C       0.95      0.92      0.93    107821
           D       0.95      0.93      0.94     58688
           E       0.88      0.96      0.92     16116
           F       0.87      0.92      0.90      2701
           G       0.60      1.00      0.75       571

    accuracy                           0.95    420955
   macro avg       0.88      0.95      0.91    420955
weighted avg       0.95      0.95      0.95    420955


--- Test (Final evaluation) ---
ROC AUC: 0.9968 (one-versus-rest macro average)

              precision    recall  f1-score   support

           A       0.97      0.99      0.98     20277
           B       0.94      0.94      0.94     21207
           C       0.94      0.92      0.93     19029
           D       0.94      0.93      0.93     10358
           E       0.87      0.95      0.91      2842
           F       0.78      0.80      0.79       474
           G       0.38      0.66      0.49       100

    accuracy                           0.94     74287
   macro avg       0.83      0.88      0.85     74287
weighted avg       0.94      0.94      0.94     74287

Confusion matrices for the test set:

Code

y_pred_test = models_grades["LGBM (final evaluation)"].predict(X_test)

sns.set_style("white")
ml.plot_confusion_matrices(y_test, y_pred_test, figsize=(15, 4), text_kw={"size": 9});

4.2.7 Final Model for Deployment

In this section, the final pre-processing and prediction pipeline will be created and trained on the whole dataset. The model will be saved into a pickle file and deployed on Google Cloud Platform (GCP).

Code

# Pipeline input variables
grade_pipeline_input = [
    "total_rec_int",
    "total_rec_prncp",
    "installment",
    "funded_amnt_inv",
    "initial_list_status",
    "out_prncp_inv",
    "issue_d",
    "loan_amnt",
    "funded_amnt",
    "disbursement_method",
    "out_prncp",
    "term",
    "fico_range_high",
]


@my.cache_results(dir_interim + "02--model_predict_grade.pkl")
def fit_lgbm_grades_final():
    """Final LGBM model for grades prediction."""
    pipeline = Pipeline(
        steps=[
            ("selector_1", ColumnSelector(grade_pipeline_input)),
            ("preprocessor_1", PreprocessorForGrades()),
            ("preprocessor_2", clone(pre_processing_trees)),
            ("selector_2", ColumnSelector(grade_model_input)),
            (
                "classifier",
                LGBMClassifier(
                    random_state=1,
                    class_weight="balanced",
                    objective="multiclass",
                    n_jobs=-1,
                    device="gpu",
                    verbosity=1,
                ),
            ),
        ]
    )

    # Load required dataset
    data_file_path = dir_interim + "task-2--1-accepted_loans_2018--raw.feather"
    accepted_2018 = pd.read_feather(data_file_path)

    pipeline.set_output(transform="pandas")
    pipeline.fit(accepted_2018, accepted_2018["grade"])
    return pipeline


# Fit the model
loan_grade_predictor_final = fit_lgbm_grades_final()
loan_grade_predictor_final

4.3 Loan Sub-Grade Prediction

This section focuses on predicting loan sub-grades.

4.3.1 Train Models

Here 3 models will be trained:

logistic regression (LR);
Naive Bayes (NB);
LGBM.

The voting ensemble model will not be trained as it did not improve the performance in the grade prediction task.

Code

# Dictionary to collect the results
models_subgrades = {}

Code

@my.cache_results(dir_interim + "models_3_01_naive_bayes.pickle")
def fit_nb_subgrades():
    """Fit a Naive Bayes model."""
    pipeline = Pipeline(
        steps=[
            ("preprocessor", clone(pre_processing_lr)),
            ("classifier", GaussianNB()),
        ]
    )
    pipeline.fit(X_train, y_train_subgrades)
    return pipeline


models_subgrades["Naive Bayes"] = fit_nb_subgrades()
# Time: 13.3s

Code

@my.cache_results(dir_interim + "models_3_02_logistic_regression_sgd.pickle")
def fit_lr_sgd_subgrades():
    """Fit a Logistic Regression model (via SGD training)."""
    pipeline = Pipeline(
        steps=[
            ("preprocessor", clone(pre_processing_lr)),
            (
                "classifier",
                SGDClassifier(
                    random_state=1, loss="log_loss", n_jobs=-1, class_weight="balanced"
                ),
            ),
        ]
    )
    pipeline.fit(X_train, y_train_subgrades)
    return pipeline


models_subgrades["Logistic Regression"] = fit_lr_sgd_subgrades()
# Time: 50.5s

Code

@my.cache_results(dir_interim + "models_3_03_lgbm.pickle")
def fit_lgbm_subgrades():
    """Fit a LGBM model."""
    pipeline = Pipeline(
        steps=[
            ("preprocessor", clone(pre_processing_trees)),
            (
                "classifier",
                LGBMClassifier(
                    random_state=1,
                    class_weight="balanced",
                    objective="multiclass",
                    n_jobs=-1,
                    device="gpu",
                    verbosity=1,
                ),
            ),
        ]
    )
    pipeline.fit(X_train, y_train_subgrades)
    return pipeline


models_subgrades["LGBM"] = fit_lgbm_subgrades()
# Time: 3m 23.9s

4.3.2 Evaluation

Next, the models will be evaluated. Macro-averaged ROC AUC will be used as the main performance metric and F1 score as an additional one. The results will be compared to the baseline model that always predicts the most frequent class.

Results: the best-performing model was LGBM with ROC AUC of 0.9668 and F1 of 0.58 in the validation set. Most correctly classified are subgrades from A1 to D5 and the highest confusion is between the lowest grades (F4 to G5), and for sug-brades from G2 to G5 there were no correct predictions at all.

In the baseline model, ROC AUC was 0.5 and F1 was 0.0.

Find the details below.

def print_classification_report_for_subgrades(models, model_name, include_roc_auc=True):
    """Make a classification report for a given model.

    Creates classification reports on training and validation sets.
    Args:
        models (dict|scikit learn model): Either a dictionary with models or a model.
        model_name (str): The name of the model to be evaluated.
        include_roc_auc (bool): Whether to include ROC AUC in the report.
    """
    ml.print_classification_report(
        models,
        model_name,
        X_train,
        X_validation,
        y_train_subgrades,
        y_validation_subgrades,
        label_train_set="Train",
        label_test_set="Validation",
        label_model=model_name,
        include_roc_auc=include_roc_auc,
    )

Code

dummy_subgrades = DummyClassifier(strategy="most_frequent")
dummy_subgrades.fit(X_train, y_train_subgrades)

print_classification_report_for_subgrades(dummy_subgrades, "baseline")

--- Train (baseline) ---
ROC AUC: 0.5000 (one-versus-rest macro average)

              precision    recall  f1-score   support

          A1       0.00      0.00      0.00     20221
          A2       0.00      0.00      0.00     16010
          A3       0.00      0.00      0.00     17685
          A4       0.00      0.00      0.00     22154
          A5       0.00      0.00      0.00     18555
          B1       0.00      0.00      0.00     19626
          B2       0.00      0.00      0.00     20671
          B3       0.00      0.00      0.00     16223
          B4       0.06      1.00      0.12     22209
          B5       0.00      0.00      0.00     20226
          C1       0.00      0.00      0.00     19874
          C2       0.00      0.00      0.00     17635
          C3       0.00      0.00      0.00     18397
          C4       0.00      0.00      0.00     16885
          C5       0.00      0.00      0.00     16003
          D1       0.00      0.00      0.00     11457
          D2       0.00      0.00      0.00     10801
          D3       0.00      0.00      0.00      9859
          D4       0.00      0.00      0.00      8630
          D5       0.00      0.00      0.00      7586
          E1       0.00      0.00      0.00      2796
          E2       0.00      0.00      0.00      2234
          E3       0.00      0.00      0.00      2663
          E4       0.00      0.00      0.00      2299
          E5       0.00      0.00      0.00      3278
          F1       0.00      0.00      0.00      1063
          F2       0.00      0.00      0.00       403
          F3       0.00      0.00      0.00       305
          F4       0.00      0.00      0.00       220
          F5       0.00      0.00      0.00       231
          G1       0.00      0.00      0.00       341
          G2       0.00      0.00      0.00        62
          G3       0.00      0.00      0.00        31
          G4       0.00      0.00      0.00        17
          G5       0.00      0.00      0.00        19

    accuracy                           0.06    346669
   macro avg       0.00      0.03      0.00    346669
weighted avg       0.00      0.06      0.01    346669


--- Validation (baseline) ---
ROC AUC: 0.5000 (one-versus-rest macro average)

              precision    recall  f1-score   support

          A1       0.00      0.00      0.00      4333
          A2       0.00      0.00      0.00      3430
          A3       0.00      0.00      0.00      3789
          A4       0.00      0.00      0.00      4747
          A5       0.00      0.00      0.00      3976
          B1       0.00      0.00      0.00      4205
          B2       0.00      0.00      0.00      4429
          B3       0.00      0.00      0.00      3476
          B4       0.06      1.00      0.12      4759
          B5       0.00      0.00      0.00      4334
          C1       0.00      0.00      0.00      4259
          C2       0.00      0.00      0.00      3779
          C3       0.00      0.00      0.00      3942
          C4       0.00      0.00      0.00      3618
          C5       0.00      0.00      0.00      3429
          D1       0.00      0.00      0.00      2455
          D2       0.00      0.00      0.00      2314
          D3       0.00      0.00      0.00      2112
          D4       0.00      0.00      0.00      1849
          D5       0.00      0.00      0.00      1625
          E1       0.00      0.00      0.00       600
          E2       0.00      0.00      0.00       479
          E3       0.00      0.00      0.00       571
          E4       0.00      0.00      0.00       493
          E5       0.00      0.00      0.00       703
          F1       0.00      0.00      0.00       228
          F2       0.00      0.00      0.00        87
          F3       0.00      0.00      0.00        66
          F4       0.00      0.00      0.00        48
          F5       0.00      0.00      0.00        50
          G1       0.00      0.00      0.00        73
          G2       0.00      0.00      0.00        13
          G3       0.00      0.00      0.00         7
          G4       0.00      0.00      0.00         4
          G5       0.00      0.00      0.00         4

    accuracy                           0.06     74286
   macro avg       0.00      0.03      0.00     74286
weighted avg       0.00      0.06      0.01     74286

Evaluation of Naive Bayes model:

Code

print_classification_report_for_subgrades(models_subgrades, "Naive Bayes")

--- Train (Naive Bayes) ---
ROC AUC: 0.5804 (one-versus-rest macro average)

              precision    recall  f1-score   support

          A1       0.14      0.19      0.16     20221
          A2       0.15      0.01      0.02     16010
          A3       0.08      0.00      0.00     17685
          A4       0.00      0.00      0.00     22154
          A5       0.00      0.00      0.00     18555
          B1       0.06      0.00      0.00     19626
          B2       0.00      0.00      0.00     20671
          B3       0.00      0.00      0.00     16223
          B4       0.07      0.00      0.00     22209
          B5       0.00      0.00      0.00     20226
          C1       0.04      0.00      0.00     19874
          C2       0.03      0.00      0.00     17635
          C3       0.00      0.00      0.00     18397
          C4       0.05      0.00      0.00     16885
          C5       0.33      0.00      0.00     16003
          D1       0.04      0.00      0.00     11457
          D2       0.03      0.00      0.00     10801
          D3       0.00      0.00      0.00      9859
          D4       0.00      0.00      0.00      8630
          D5       0.03      0.00      0.00      7586
          E1       0.06      0.03      0.04      2796
          E2       0.05      0.00      0.01      2234
          E3       0.04      0.00      0.01      2663
          E4       0.05      0.01      0.01      2299
          E5       0.03      0.21      0.05      3278
          F1       0.01      0.00      0.00      1063
          F2       0.01      0.00      0.01       403
          F3       0.00      0.01      0.00       305
          F4       0.00      0.05      0.01       220
          F5       0.00      0.00      0.00       231
          G1       0.00      0.04      0.01       341
          G2       0.00      0.24      0.00        62
          G3       0.00      0.19      0.00        31
          G4       0.00      0.76      0.00        17
          G5       0.00      1.00      0.00        19

    accuracy                           0.01    346669
   macro avg       0.04      0.08      0.01    346669
weighted avg       0.05      0.01      0.01    346669

--- Validation (Naive Bayes) ---
ROC AUC: 0.5474 (one-versus-rest macro average)

              precision    recall  f1-score   support

          A1       0.14      0.20      0.17      4333
          A2       0.20      0.01      0.03      3430
          A3       0.08      0.00      0.00      3789
          A4       0.00      0.00      0.00      4747
          A5       0.00      0.00      0.00      3976
          B1       0.06      0.00      0.00      4205
          B2       0.00      0.00      0.00      4429
          B3       0.00      0.00      0.00      3476
          B4       0.08      0.00      0.00      4759
          B5       0.00      0.00      0.00      4334
          C1       0.09      0.00      0.00      4259
          C2       0.10      0.00      0.00      3779
          C3       0.00      0.00      0.00      3942
          C4       0.00      0.00      0.00      3618
          C5       0.00      0.00      0.00      3429
          D1       0.09      0.00      0.00      2455
          D2       0.04      0.00      0.00      2314
          D3       0.00      0.00      0.00      2112
          D4       0.00      0.00      0.00      1849
          D5       0.02      0.00      0.00      1625
          E1       0.08      0.03      0.05       600
          E2       0.00      0.00      0.00       479
          E3       0.04      0.01      0.01       571
          E4       0.00      0.00      0.00       493
          E5       0.03      0.20      0.05       703
          F1       0.00      0.00      0.00       228
          F2       0.00      0.00      0.00        87
          F3       0.00      0.00      0.00        66
          F4       0.00      0.02      0.00        48
          F5       0.00      0.00      0.00        50
          G1       0.00      0.03      0.01        73
          G2       0.00      0.08      0.00        13
          G3       0.00      0.00      0.00         7
          G4       0.00      0.00      0.00         4
          G5       0.00      0.50      0.00         4

    accuracy                           0.01     74286
   macro avg       0.03      0.03      0.01     74286
weighted avg       0.05      0.01      0.01     74286

Evaluation of Logistic Regression model:

Code

print_classification_report_for_subgrades(models_subgrades, "Logistic Regression")

--- Train (Logistic Regression) ---
ROC AUC: 0.8560 (one-versus-rest macro average)

              precision    recall  f1-score   support

          A1       0.46      0.76      0.58     20221
          A2       0.31      0.06      0.10     16010
          A3       0.21      0.30      0.24     17685
          A4       0.21      0.25      0.23     22154
          A5       0.17      0.30      0.22     18555
          B1       0.20      0.05      0.08     19626
          B2       0.18      0.16      0.17     20671
          B3       0.11      0.20      0.14     16223
          B4       0.17      0.04      0.07     22209
          B5       0.15      0.07      0.10     20226
          C1       0.15      0.18      0.16     19874
          C2       0.12      0.03      0.05     17635
          C3       0.14      0.01      0.02     18397
          C4       0.13      0.02      0.03     16885
          C5       0.11      0.15      0.13     16003
          D1       0.09      0.10      0.09     11457
          D2       0.09      0.02      0.03     10801
          D3       0.09      0.15      0.11      9859
          D4       0.09      0.09      0.09      8630
          D5       0.11      0.17      0.14      7586
          E1       0.07      0.34      0.12      2796
          E2       0.06      0.26      0.09      2234
          E3       0.05      0.02      0.03      2663
          E4       0.12      0.07      0.08      2299
          E5       0.06      0.15      0.09      3278
          F1       0.08      0.16      0.11      1063
          F2       0.02      0.19      0.04       403
          F3       0.05      0.20      0.09       305
          F4       0.02      0.13      0.04       220
          F5       0.05      0.09      0.06       231
          G1       0.09      0.06      0.07       341
          G2       0.01      0.69      0.03        62
          G3       0.01      0.45      0.02        31
          G4       0.01      0.65      0.02        17
          G5       0.01      0.68      0.01        19

    accuracy                           0.17    346669
   macro avg       0.11      0.21      0.11    346669
weighted avg       0.17      0.17      0.15    346669

--- Validation (Logistic Regression) ---
ROC AUC: 0.8422 (one-versus-rest macro average)

              precision    recall  f1-score   support

          A1       0.46      0.75      0.57      4333
          A2       0.29      0.06      0.10      3430
          A3       0.21      0.30      0.24      3789
          A4       0.21      0.24      0.22      4747
          A5       0.17      0.30      0.22      3976
          B1       0.21      0.05      0.09      4205
          B2       0.18      0.16      0.17      4429
          B3       0.11      0.21      0.15      3476
          B4       0.16      0.04      0.06      4759
          B5       0.14      0.07      0.09      4334
          C1       0.15      0.18      0.16      4259
          C2       0.11      0.03      0.05      3779
          C3       0.13      0.01      0.02      3942
          C4       0.12      0.02      0.03      3618
          C5       0.11      0.14      0.12      3429
          D1       0.08      0.09      0.08      2455
          D2       0.08      0.02      0.03      2314
          D3       0.08      0.13      0.10      2112
          D4       0.09      0.09      0.09      1849
          D5       0.12      0.19      0.15      1625
          E1       0.08      0.36      0.13       600
          E2       0.05      0.22      0.08       479
          E3       0.04      0.02      0.02       571
          E4       0.10      0.05      0.07       493
          E5       0.05      0.14      0.08       703
          F1       0.04      0.07      0.05       228
          F2       0.02      0.15      0.03        87
          F3       0.05      0.18      0.07        66
          F4       0.03      0.17      0.06        48
          F5       0.02      0.04      0.03        50
          G1       0.10      0.07      0.08        73
          G2       0.01      0.38      0.01        13
          G3       0.01      0.29      0.01         7
          G4       0.00      0.00      0.00         4
          G5       0.00      0.25      0.00         4

    accuracy                           0.16     74286
   macro avg       0.11      0.16      0.10     74286
weighted avg       0.17      0.16      0.14     74286

Evaluation of LGBM model:

Code

print_classification_report_for_subgrades(models_subgrades, "LGBM")

--- Train (LGBM) ---
ROC AUC: 0.9933 (one-versus-rest macro average)

              precision    recall  f1-score   support

          A1       0.86      0.91      0.89     20221
          A2       0.72      0.83      0.77     16010
          A3       0.75      0.79      0.77     17685
          A4       0.82      0.72      0.77     22154
          A5       0.74      0.85      0.79     18555
          B1       0.78      0.74      0.76     19626
          B2       0.77      0.70      0.73     20671
          B3       0.65      0.75      0.70     16223
          B4       0.80      0.75      0.78     22209
          B5       0.83      0.78      0.81     20226
          C1       0.82      0.80      0.81     19874
          C2       0.78      0.75      0.77     17635
          C3       0.80      0.78      0.79     18397
          C4       0.83      0.78      0.80     16885
          C5       0.85      0.80      0.83     16003
          D1       0.83      0.84      0.83     11457
          D2       0.84      0.79      0.82     10801
          D3       0.86      0.81      0.83      9859
          D4       0.88      0.82      0.85      8630
          D5       0.87      0.89      0.88      7586
          E1       0.55      0.95      0.70      2796
          E2       0.65      0.94      0.77      2234
          E3       0.89      0.95      0.91      2663
          E4       0.87      0.95      0.91      2299
          E5       0.93      0.95      0.94      3278
          F1       0.95      1.00      0.97      1063
          F2       0.91      1.00      0.95       403
          F3       0.97      1.00      0.99       305
          F4       0.96      1.00      0.98       220
          F5       0.99      1.00      1.00       231
          G1       0.97      1.00      0.98       341
          G2       1.00      1.00      1.00        62
          G3       1.00      1.00      1.00        31
          G4       1.00      1.00      1.00        17
          G5       1.00      1.00      1.00        19

    accuracy                           0.80    346669
   macro avg       0.85      0.87      0.86    346669
weighted avg       0.80      0.80      0.80    346669

--- Validation (LGBM) ---
ROC AUC: 0.9668 (one-versus-rest macro average)

              precision    recall  f1-score   support

          A1       0.84      0.89      0.86      4333
          A2       0.67      0.78      0.72      3430
          A3       0.70      0.75      0.73      3789
          A4       0.79      0.68      0.73      4747
          A5       0.71      0.82      0.76      3976
          B1       0.74      0.69      0.72      4205
          B2       0.72      0.66      0.69      4429
          B3       0.60      0.69      0.64      3476
          B4       0.76      0.73      0.74      4759
          B5       0.79      0.75      0.77      4334
          C1       0.78      0.75      0.77      4259
          C2       0.74      0.70      0.72      3779
          C3       0.74      0.74      0.74      3942
          C4       0.79      0.74      0.76      3618
          C5       0.82      0.74      0.78      3429
          D1       0.78      0.78      0.78      2455
          D2       0.78      0.70      0.73      2314
          D3       0.76      0.71      0.73      2112
          D4       0.79      0.73      0.76      1849
          D5       0.75      0.80      0.78      1625
          E1       0.37      0.69      0.48       600
          E2       0.37      0.60      0.46       479
          E3       0.62      0.60      0.61       571
          E4       0.58      0.59      0.58       493
          E5       0.69      0.71      0.70       703
          F1       0.65      0.85      0.74       228
          F2       0.49      0.54      0.52        87
          F3       0.53      0.48      0.51        66
          F4       0.29      0.17      0.21        48
          F5       0.21      0.08      0.12        50
          G1       0.39      0.53      0.45        73
          G2       0.00      0.00      0.00        13
          G3       0.00      0.00      0.00         7
          G4       0.00      0.00      0.00         4
          G5       0.00      0.00      0.00         4

    accuracy                           0.74     74286
   macro avg       0.58      0.59      0.58     74286
weighted avg       0.74      0.74      0.74     74286

The best-performing model is LGBM. It will be explored in more detail.

Code

y_pred_subgrades_validation_lgbm = models_subgrades["LGBM"].predict(X_validation)

Code

ml.plot_confusion_matrices(
    y_validation_subgrades,
    y_pred_subgrades_validation_lgbm,
    figsize=(15, 30),
    text_kw={"size": 5},
    layout="vertical",
);

It seems that the highest confusion is between the lowest grades: F4 to G5. In the group of the lowest grades from G2 to G5 no correct prediction at all.

4.3.3 Feature Importance

Internal LGBM method as well as SHAP values were used to evaluate feature importance. This time it was decided to keep 27 features with the highest SHAP values.

Please, find the details below.

Code

@my.cache_results(dir_interim + "task-2-subgrades--shap_lgbm_k=all.pkl")
def get_shap_values_subgrades_lgbm():
    model = "LGBM"
    preproc = Pipeline(steps=models_subgrades[model].steps[:-1])
    classifier = models_subgrades[model]["classifier"]
    X_validation_preproc = preproc.transform(X_validation)

    tree_explainer = shap.TreeExplainer(classifier)
    shap_values = tree_explainer.shap_values(X_validation_preproc)

    return shap_values, X_validation_preproc


shap_values_lgbm_subgrades, data_for_lgbm_subgrades = get_shap_values_subgrades_lgbm()
# Time: 18m 30.0s

Code

lgb.plot_importance(
    models_subgrades["LGBM"]["classifier"],
    max_num_features=50,
    figsize=(10, 10),
    height=0.8,
    title="LGBM Feature Importance (Sub-Grade Prediction)",
);

Code

vals_subgrades = np.abs(shap_values_lgbm_subgrades).mean(0).mean(0)
feature_importance_subgrades = (
    pd.DataFrame(
        list(zip(data_for_lgbm_subgrades.columns, vals_subgrades)),
        columns=["col_name", "importance"],
    )
    .sort_values(by=["importance"], ascending=False)
    .reset_index()
)

Code

feature_importance_subgrades.head(30).style.format(precision=4)

	index	col_name	importance
0	22	total_rec_int	0.6425
1	21	total_rec_prncp	0.3997
2	3	installment	0.3133
3	2	funded_amnt_inv	0.1600
4	0	loan_amnt	0.1322
5	1	funded_amnt	0.1273
6	20	total_pymnt_inv	0.1253
7	126	disbursement_method_Cash	0.1206
8	120	initial_list_status_f	0.1086
9	18	out_prncp_inv	0.1041
10	8	fico_range_low	0.1038
11	19	total_pymnt	0.1011
12	9	fico_range_high	0.0967
13	82	issue_month	0.0951
14	96	term_ 36 months	0.0949
15	17	out_prncp	0.0908
16	97	term_ 60 months	0.0802
17	26	last_pymnt_amnt	0.0652
18	6	dti	0.0626
19	49	bc_open_to_buy	0.0625
20	28	last_fico_range_low	0.0557
21	27	last_fico_range_high	0.0492
22	42	all_util	0.0378
23	74	percent_bc_gt_75	0.0370
24	81	title_len	0.0358
25	5	annual_inc	0.0297
26	57	mort_acc	0.0237
27	36	mths_since_rcnt_il	0.0227
28	43	total_rev_hi_lim	0.0222
29	58	mths_since_recent_bc	0.0174

Code

shap.summary_plot(
    shap_values_lgbm_subgrades, data_for_lgbm_subgrades, plot_type="bar", max_display=30
)

4.3.4 Re-Train Model on the Most Important Features

By using 27 features instead of 140, the classification performance slightly increased in ROC AUC (validation) from 0.9668 to 0.9688 and in F1 from 0.58 to 0.60 Still, the issues with the lowest subgrades persist.

Find the details below.

def fit_lgbm_subgrades_with_selection(features):
    """Template to fit a LGBM model with a smaller number of features."""
    pipeline = Pipeline(
        steps=[
            ("preprocessor", clone(pre_processing_trees)),
            ("selector", ColumnSelector(features)),
            (
                "classifier",
                LGBMClassifier(
                    random_state=1,
                    class_weight="balanced",
                    objective="multiclass",
                    n_jobs=-1,
                    device="gpu",
                    verbosity=1,
                ),
            ),
        ]
    )
    pipeline.fit(X_train, y_train_subgrades)
    return pipeline


@my.cache_results(dir_interim + "models_3_03_lgbm-27-features.pickle")
def fit_lgbm_subgrades_27():
    features = feature_importance.head(27).col_name.to_list()
    return fit_lgbm_subgrades_with_selection(features)

Code

models_subgrades["LGBM (27 features)"] = fit_lgbm_subgrades_27()
# Time: 2m 5.0 s

Code

print_classification_report_for_subgrades(models_subgrades, "LGBM (27 features)")

--- Train (LGBM (27 features)) ---
ROC AUC: 0.9934 (one-versus-rest macro average)

              precision    recall  f1-score   support

          A1       0.88      0.92      0.90     20221
          A2       0.74      0.84      0.79     16010
          A3       0.77      0.81      0.79     17685
          A4       0.82      0.74      0.78     22154
          A5       0.76      0.86      0.81     18555
          B1       0.78      0.74      0.76     19626
          B2       0.77      0.72      0.75     20671
          B3       0.65      0.77      0.70     16223
          B4       0.82      0.75      0.78     22209
          B5       0.84      0.78      0.81     20226
          C1       0.81      0.80      0.80     19874
          C2       0.78      0.75      0.77     17635
          C3       0.80      0.78      0.79     18397
          C4       0.83      0.80      0.82     16885
          C5       0.86      0.80      0.83     16003
          D1       0.85      0.83      0.84     11457
          D2       0.86      0.81      0.83     10801
          D3       0.85      0.81      0.83      9859
          D4       0.86      0.81      0.84      8630
          D5       0.87      0.88      0.88      7586
          E1       0.53      0.92      0.67      2796
          E2       0.64      0.91      0.75      2234
          E3       0.87      0.90      0.88      2663
          E4       0.87      0.92      0.89      2299
          E5       0.93      0.92      0.93      3278
          F1       0.93      1.00      0.96      1063
          F2       0.86      1.00      0.92       403
          F3       0.95      1.00      0.97       305
          F4       0.92      1.00      0.96       220
          F5       0.92      1.00      0.96       231
          G1       0.93      1.00      0.97       341
          G2       1.00      1.00      1.00        62
          G3       1.00      1.00      1.00        31
          G4       1.00      1.00      1.00        17
          G5       1.00      1.00      1.00        19

    accuracy                           0.80    346669
   macro avg       0.84      0.87      0.86    346669
weighted avg       0.80      0.80      0.80    346669

--- Validation (LGBM (27 features)) ---
ROC AUC: 0.9688 (one-versus-rest macro average)

              precision    recall  f1-score   support

          A1       0.86      0.90      0.88      4333
          A2       0.70      0.81      0.75      3430
          A3       0.75      0.79      0.77      3789
          A4       0.79      0.71      0.74      4747
          A5       0.74      0.84      0.78      3976
          B1       0.76      0.72      0.74      4205
          B2       0.73      0.68      0.71      4429
          B3       0.60      0.71      0.65      3476
          B4       0.79      0.73      0.76      4759
          B5       0.80      0.75      0.78      4334
          C1       0.78      0.76      0.77      4259
          C2       0.74      0.72      0.73      3779
          C3       0.76      0.74      0.75      3942
          C4       0.80      0.76      0.78      3618
          C5       0.82      0.75      0.79      3429
          D1       0.81      0.78      0.79      2455
          D2       0.81      0.73      0.77      2314
          D3       0.76      0.74      0.75      2112
          D4       0.80      0.75      0.77      1849
          D5       0.81      0.82      0.81      1625
          E1       0.40      0.72      0.52       600
          E2       0.42      0.66      0.51       479
          E3       0.64      0.64      0.64       571
          E4       0.63      0.62      0.63       493
          E5       0.74      0.71      0.73       703
          F1       0.67      0.82      0.74       228
          F2       0.39      0.53      0.45        87
          F3       0.54      0.52      0.53        66
          F4       0.38      0.31      0.34        48
          F5       0.14      0.14      0.14        50
          G1       0.37      0.52      0.43        73
          G2       0.17      0.08      0.11        13
          G3       0.00      0.00      0.00         7
          G4       0.00      0.00      0.00         4
          G5       0.00      0.00      0.00         4

    accuracy                           0.75     74286
   macro avg       0.60      0.61      0.60     74286
weighted avg       0.76      0.75      0.75     74286

Confusion matrices for the test set:

Code

y_pred_subgrades_validation_lgbm_27 = models_subgrades["LGBM (27 features)"].predict(
    X_validation
)
ml.plot_confusion_matrices(
    y_validation_subgrades,
    y_pred_subgrades_validation_lgbm_27,
    figsize=(15, 30),
    text_kw={"size": 5},
    layout="vertical",
);

4.3.5 Evaluation on Test Set

The final model evaluation on the test set shows ROC AUC of 0.9645 and F1 score of 0.61.

Please, find the details below.

Code

# Model input variables
subgrade_model_input = [
    "total_rec_int",
    "total_rec_prncp",
    "installment",
    "funded_amnt_inv",
    "loan_amnt",
    "funded_amnt",
    "total_pymnt_inv",
    "disbursement_method_Cash",
    "initial_list_status_f",
    "out_prncp_inv",
    "fico_range_low",
    "total_pymnt",
    "fico_range_high",
    "issue_month",
    "term_ 36 months",
    "out_prncp",
    "term_ 60 months",
    "last_pymnt_amnt",
    "dti",
    "bc_open_to_buy",
    "last_fico_range_low",
    "last_fico_range_high",
    "all_util",
    "percent_bc_gt_75",
    "title_len",
    "annual_inc",
    "mort_acc",
]

Code

X_train_validation = pd.concat([X_train, X_validation], axis="index")
y_train_validation_subgrades = pd.concat(
    [y_train_subgrades, y_validation_subgrades], axis="index"
)


@my.cache_results(dir_interim + "models_3_03_lgbm--final-evaluation.pickle")
def fit_lgbm_subgrades_final_evaluation():
    """Fit a LGBM model for final testing."""
    pipeline = Pipeline(
        steps=[
            ("preprocessor", clone(pre_processing_trees)),
            ("selector", ColumnSelector(subgrade_model_input)),
            (
                "classifier",
                LGBMClassifier(
                    random_state=1,
                    class_weight="balanced",
                    objective="multiclass",
                    n_jobs=-1,
                    device="gpu",
                    verbosity=1,
                ),
            ),
        ]
    )
    pipeline.fit(X_train_validation, y_train_validation_subgrades)
    return pipeline


models_subgrades["LGBM (final evaluation)"] = fit_lgbm_subgrades_final_evaluation()

Code

ml.print_classification_report(
    models=models_subgrades,
    model_name="LGBM (final evaluation)",
    X_train=X_train_validation,
    X_test=X_test,
    y_train=y_train_validation_subgrades,
    y_test=y_test_subgrades,
    label_train_set="Train + Validation",
    label_test_set="Test",
    label_model="Final evaluation",
)

--- Train + Validation (Final evaluation) ---
ROC AUC: 0.9932 (one-versus-rest macro average)

              precision    recall  f1-score   support

          A1       0.87      0.92      0.89     24554
          A2       0.73      0.83      0.77     19440
          A3       0.76      0.80      0.78     21474
          A4       0.82      0.74      0.78     26901
          A5       0.74      0.84      0.79     22531
          B1       0.79      0.74      0.76     23831
          B2       0.74      0.72      0.73     25100
          B3       0.64      0.76      0.69     19699
          B4       0.83      0.74      0.79     26968
          B5       0.83      0.78      0.80     24560
          C1       0.83      0.77      0.80     24133
          C2       0.75      0.75      0.75     21414
          C3       0.79      0.76      0.77     22339
          C4       0.84      0.78      0.81     20503
          C5       0.85      0.82      0.84     19432
          D1       0.84      0.83      0.83     13912
          D2       0.85      0.82      0.83     13115
          D3       0.85      0.80      0.82     11971
          D4       0.86      0.82      0.84     10479
          D5       0.87      0.87      0.87      9211
          E1       0.52      0.90      0.66      3396
          E2       0.62      0.90      0.73      2713
          E3       0.87      0.88      0.88      3234
          E4       0.85      0.89      0.87      2792
          E5       0.93      0.91      0.92      3981
          F1       0.92      1.00      0.96      1291
          F2       0.87      1.00      0.93       490
          F3       0.93      1.00      0.96       371
          F4       0.90      1.00      0.95       268
          F5       0.90      1.00      0.95       281
          G1       0.92      1.00      0.96       414
          G2       0.96      1.00      0.98        75
          G3       1.00      1.00      1.00        38
          G4       1.00      1.00      1.00        21
          G5       1.00      1.00      1.00        23

    accuracy                           0.79    420955
   macro avg       0.84      0.87      0.85    420955
weighted avg       0.80      0.79      0.79    420955


--- Test (Final evaluation) ---
ROC AUC: 0.9645 (one-versus-rest macro average)

              precision    recall  f1-score   support

          A1       0.86      0.89      0.88      4333
          A2       0.69      0.81      0.75      3431
          A3       0.73      0.77      0.75      3790
          A4       0.80      0.73      0.76      4747
          A5       0.73      0.81      0.77      3976
          B1       0.76      0.71      0.74      4206
          B2       0.71      0.69      0.70      4430
          B3       0.60      0.71      0.65      3477
          B4       0.80      0.72      0.76      4760
          B5       0.80      0.76      0.78      4334
          C1       0.81      0.75      0.78      4259
          C2       0.72      0.73      0.73      3779
          C3       0.75      0.73      0.74      3943
          C4       0.82      0.75      0.78      3618
          C5       0.83      0.79      0.81      3430
          D1       0.80      0.78      0.79      2455
          D2       0.78      0.77      0.77      2315
          D3       0.80      0.73      0.77      2113
          D4       0.81      0.74      0.77      1849
          D5       0.80      0.81      0.81      1626
          E1       0.40      0.73      0.52       599
          E2       0.42      0.63      0.50       479
          E3       0.67      0.64      0.65       570
          E4       0.61      0.62      0.61       492
          E5       0.78      0.75      0.77       702
          F1       0.73      0.85      0.78       227
          F2       0.41      0.43      0.42        86
          F3       0.54      0.62      0.58        65
          F4       0.42      0.38      0.40        47
          F5       0.26      0.33      0.29        49
          G1       0.34      0.48      0.40        73
          G2       0.25      0.15      0.19        13
          G3       0.00      0.00      0.00         6
          G4       0.00      0.00      0.00         4
          G5       0.00      0.00      0.00         4

    accuracy                           0.75     74287
   macro avg       0.61      0.62      0.61     74287
weighted avg       0.76      0.75      0.75     74287

Confusion matrices for the test set:

Code

y_pred_test = models_grades["LGBM (final evaluation)"].predict(X_test)

sns.set_style("white")
ml.plot_confusion_matrices(y_test, y_pred_test, figsize=(15, 4), text_kw={"size": 9});

4.3.6 Final Model for Deployment

Code

# Pipeline input variables
subgrade_pipeline_input = [
    "total_rec_int",
    "total_rec_prncp",
    "installment",
    "funded_amnt_inv",
    "loan_amnt",
    "funded_amnt",
    "total_pymnt_inv",
    "disbursement_method",
    "initial_list_status",
    "out_prncp_inv",
    "fico_range_low",
    "total_pymnt",
    "fico_range_high",
    "issue_d",
    "term",
    "out_prncp",
    "last_pymnt_amnt",
    "dti",
    "bc_open_to_buy",
    "last_fico_range_low",
    "last_fico_range_high",
    "all_util",
    "percent_bc_gt_75",
    "title",
    "annual_inc",
    "mort_acc",
]


@my.cache_results(dir_interim + "03--model_predict_subgrade.pkl")
def fit_lgbm_subgrades_final():
    """Final LGBM model for sub-grades prediction."""
    pipeline = Pipeline(
        steps=[
            ("selector_1", ColumnSelector(subgrade_pipeline_input)),
            ("preprocessor_1", PreprocessorForSubgrades()),
            ("preprocessor_2", clone(pre_processing_trees)),
            ("selector_2", ColumnSelector(subgrade_model_input)),
            (
                "classifier",
                LGBMClassifier(
                    random_state=1,
                    class_weight="balanced",
                    objective="multiclass",
                    n_jobs=-1,
                    device="gpu",
                    verbosity=1,
                ),
            ),
        ]
    )

    # Load required dataset
    data_file_path = dir_interim + "task-2--1-accepted_loans_2018--raw.feather"
    accepted_2018 = pd.read_feather(data_file_path)

    pipeline.set_output(transform="pandas")
    pipeline.fit(accepted_2018, accepted_2018["sub_grade"])
    return pipeline


# Fit the model
loan_subgrade_predictor_final = fit_lgbm_subgrades_final()
loan_subgrade_predictor_final

# Time: 5m 25.1s

Code

data_file_path = dir_interim + "task-2--1-accepted_loans_2018--raw.feather"
accepted_2018 = pd.read_feather(data_file_path)

accepted_2018[subgrade_pipeline_input].iloc[[23486, 4, 300717]].to_dict(orient="list")

{'total_rec_int': [728.280029296875, 11.579999923706055, 1397.8199462890625],
 'total_rec_prncp': [2723.12, 3000.0, 1399.58],
 'installment': [288.010009765625, 93.0999984741211, 705.3800048828125],
 'funded_amnt_inv': [9000.0, 3000.0, 30000.0],
 'loan_amnt': [9000.0, 3000.0, 30000.0],
 'funded_amnt': [9000.0, 3000.0, 30000.0],
 'total_pymnt_inv': [3451.4, 3011.58, 2797.4],
 'disbursement_method': ['Cash', 'Cash', 'Cash'],
 'initial_list_status': ['w', 'w', 'w'],
 'out_prncp_inv': [6276.88, 0.0, 28600.42],
 'fico_range_low': [715.0, 760.0, 715.0],
 'total_pymnt': [3451.4, 3011.5772850636, 2797.4],
 'fico_range_high': [719.0, 764.0, 719.0],
 'issue_d': ['Mar-2018', 'Mar-2018', 'Nov-2018'],
 'term': [' 36 months', ' 36 months', ' 60 months'],
 'out_prncp': [6276.88, 0.0, 28600.42],
 'last_pymnt_amnt': [288.01, 614.03, 705.38],
 'dti': [14.510000228881836, 0.5799999833106995, 32.220001220703125],
 'bc_open_to_buy': [455.0, 30359.0, 17345.0],
 'last_fico_range_low': [805.0, 760.0, 755.0],
 'last_fico_range_high': [809.0, 764.0, 759.0],
 'all_util': [74.0, 1.0, 89.0],
 'percent_bc_gt_75': [100.0, 0.0, 25.0],
 'title': ['Credit card refinancing', 'Major purchase', 'Debt consolidation'],
 'annual_inc': [67000.0, 52000.0, 60000.0],
 'mort_acc': [0.0, 4.0, 0.0]}

4.4 Interest Rate Prediction

This section focuses on predicting loan sub-grades.

4.4.1 Train Models

Here 4 models will be trained:

linear regression;
LASSO regression;
Ridge regression;
LGBM (regular gradient boosting).

Code

# Dictionary to collect the results
models_int_rate = {}

Code

@my.cache_results(dir_interim + "models_4_01_linear_regression_sgd.pickle")
def fit_lr_sgd_int_rate():
    """Fit a Linear Regression model (via SGD training)."""
    pipeline = Pipeline(
        steps=[
            ("preprocessor", clone(pre_processing_lr)),
            (
                "regressor",
                SGDRegressor(
                    random_state=1,
                    loss="squared_error",  # Use 'squared_error' for regression
                ),
            ),
        ]
    )
    pipeline.fit(X_train, y_train_int_rate)
    return pipeline


models_int_rate["Linear Regression"] = fit_lr_sgd_int_rate()

Code

# Function for LASSO Regression
@my.cache_results(dir_interim + "models_4_02_lasso_regression_sgd.pickle")
def fit_lasso_regression():
    """Fit a LASSO Regression model (via SGD training)."""
    pipeline = Pipeline(
        steps=[
            ("preprocessor", clone(pre_processing_lr)),
            (
                "regressor",
                SGDRegressor(
                    random_state=1,
                    penalty="l1",  # L1 regularization for LASSO
                    alpha=1.0,
                    loss="squared_error",
                ),
            ),
        ]
    )
    pipeline.fit(X_train, y_train_int_rate)
    return pipeline


models_int_rate["LASSO Regression"] = fit_lasso_regression()

Code

# Function for Ridge Regression
@my.cache_results(dir_interim + "models_4_03_ridge_regression_sgd.pickle")
def fit_ridge_regression():
    """Fit a Ridge Regression model (via SGD training)."""
    pipeline = Pipeline(
        steps=[
            ("preprocessor", clone(pre_processing_lr)),
            (
                "regressor",
                SGDRegressor(
                    random_state=1,
                    penalty="l2",  # L2 regularization for Ridge
                    alpha=1.0,
                    loss="squared_error",
                ),
            ),
        ]
    )
    pipeline.fit(X_train, y_train_int_rate)
    return pipeline


models_int_rate["Ridge Regression"] = fit_ridge_regression()

Code

@my.cache_results(dir_interim + "models_4_04_lgbm_regression.pickle")
def fit_lgbm_int_rate_regression():
    """Fit a LGBM regression model."""
    pipeline = Pipeline(
        steps=[
            ("preprocessor", clone(pre_processing_trees)),
            (
                "regressor",
                LGBMRegressor(
                    random_state=1,
                    objective="regression",
                    n_jobs=-1,
                    device="gpu",
                    verbosity=1,
                ),
            ),
        ]
    )
    pipeline.fit(X_train, y_train_int_rate)
    return pipeline


models_int_rate["LGBM"] = fit_lgbm_int_rate_regression()

4.4.2 Evaluation

The LGBM model outperformed the remaining models with RMSE=1.416 (while SD=5.153) and MAE=0.897 (while median absolute deviation, MedianAD, is 3.610).

For linear regression, errors are extremely large and R² is calculated incorrectly (negative value while should be in the range between 0 and 1).

Code

print("--- Train ---")
ml.regression_scores(
    models_int_rate, X_train, y_train_int_rate, sort_by="RMSE", color="orange"
)

--- Train ---

	n	SD	RMSE	MedianAD	MAE	R²	SD_RMSE_ratio	MedianAD_MAE_ratio
LGBM	346669	5.149	1.382	3.610	0.881	0.928	3.727	4.096
Ridge Regression	346669	5.149	3.682	3.610	2.811	0.489	1.398	1.284
LASSO Regression	346669	5.149	4.329	3.610	3.401	0.293	1.189	1.061
Linear Regression	346669	5.149	53939835148.674	3.610	1450519710.172	-109724639462755778560.000	0.000	0.000

Code

print("--- Validation ---")
ml.regression_scores(
    models_int_rate, X_validation, y_validation_int_rate, sort_by="RMSE"
)

--- Validation ---

	n	SD	RMSE	MedianAD	MAE	R²	SD_RMSE_ratio	MedianAD_MAE_ratio
LGBM	74286	5.153	1.416	3.610	0.897	0.925	3.640	4.024
Ridge Regression	74286	5.153	3.692	3.610	2.822	0.487	1.396	1.279
LASSO Regression	74286	5.153	4.338	3.610	3.410	0.291	1.188	1.059
Linear Regression	74286	5.153	34670485474.882	3.610	1397356309.915	-45272654153225068544.000	0.000	0.000

4.4.3 Feature Importance

Code

@my.cache_results(dir_interim + "task-2-int_rate--shap_lgbm_k=all.pkl")
def get_shap_values_int_rate_lgbm():
    model = "LGBM"
    preproc = Pipeline(steps=models_int_rate[model].steps[:-1])
    regressor = models_int_rate[model]["regressor"]
    X_validation_preproc = preproc.transform(X_validation)

    tree_explainer = shap.TreeExplainer(regressor)
    shap_values = tree_explainer.shap_values(X_validation_preproc)

    return shap_values, X_validation_preproc


shap_values_lgbm_int_rate, data_for_lgbm_int_rate = get_shap_values_int_rate_lgbm()
# Time: 21.0s

Code

sns.set_style("white")
lgb.plot_importance(
    models_int_rate["LGBM"]["regressor"],
    max_num_features=50,
    figsize=(10, 10),
    height=0.8,
    title="LGBM Feature Importance (Interest Rate Prediction)",
);

Code

shap.summary_plot(
    shap_values_lgbm_int_rate, data_for_lgbm_int_rate, plot_type="bar", max_display=50
)

Code

shap.summary_plot(shap_values_lgbm_int_rate, data_for_lgbm_int_rate, max_display=30)

No data for colormapping provided via 'c'. Parameters 'vmin', 'vmax' will be ignored

Code

vals_int_rate = np.abs(shap_values_lgbm_int_rate).mean(0)
feature_importance_int_rate = (
    pd.DataFrame(
        list(zip(data_for_lgbm_int_rate.columns, vals_int_rate)),
        columns=["col_name", "importance"],
    )
    .sort_values(by=["importance"], ascending=False)
    .reset_index()
)

Code

feature_importance_int_rate.head(40).style.format(precision=3)

	index	col_name	importance
0	22	total_rec_int	2.881
1	21	total_rec_prncp	1.945
2	82	issue_month	0.501
3	9	fico_range_high	0.468
4	3	installment	0.463
5	17	out_prncp	0.452
6	1	funded_amnt	0.348
7	126	disbursement_method_Cash	0.340
8	0	loan_amnt	0.307
9	49	bc_open_to_buy	0.269
10	27	last_fico_range_high	0.244
11	18	out_prncp_inv	0.209
12	97	term_ 60 months	0.207
13	2	funded_amnt_inv	0.193
14	96	term_ 36 months	0.186
15	6	dti	0.178
16	8	fico_range_low	0.173
17	57	mort_acc	0.099
18	26	last_pymnt_amnt	0.093
19	74	percent_bc_gt_75	0.092
20	28	last_fico_range_low	0.083
21	5	annual_inc	0.081
22	10	inq_last_6mths	0.070
23	36	mths_since_rcnt_il	0.065
24	42	all_util	0.055
25	43	total_rev_hi_lim	0.042
26	120	initial_list_status_f	0.042
27	81	title_len	0.037
28	108	purpose_credit_card	0.032
29	58	mths_since_recent_bc	0.031
30	38	il_util	0.030
31	72	num_tl_op_past_12m	0.027
32	11	mths_since_last_delinq	0.023
33	54	mo_sin_old_rev_tl_op	0.021
34	104	verification_status_Verified	0.020
35	13	pub_rec	0.018
36	20	total_pymnt_inv	0.016
37	19	total_pymnt	0.016
38	121	initial_list_status_w	0.015
39	127	disbursement_method_DirectPay	0.015

4.4.4 Re-Train Models on the Most Important Features

The LGBM model was trained on a smaller number of features based on various SHAP value thresholds. Threshold 0.10 resulted in 17 features which performed best (validation RMSE = 1.364, MAE = 0.850).

Please, find the details below.

Code

# Function with the analysis template
def fit_lgbm_int_rate_on_certain_features(features):
    """Template to fit LGBM regression model on certain features."""
    pipeline = Pipeline(
        steps=[
            ("preprocessor", clone(pre_processing_trees)),
            ("selector", ColumnSelector(features)),
            (
                "regressor",
                LGBMRegressor(
                    random_state=1,
                    objective="regression",
                    n_jobs=-1,
                    device="gpu",
                    verbosity=1,
                ),
            ),
        ]
    )
    pipeline.fit(X_train, y_train_int_rate)
    return pipeline


def fit_lgbm_int_rate_by_shap(importance_threshold):
    """Function for feature selection based on SHAP values"""
    features = feature_importance_int_rate.query(
        f"importance > {importance_threshold}"
    ).col_name.to_list()

    k = len(features)

    return f"LGBM ({k} features)", fit_lgbm_int_rate_on_certain_features(features)

Code

for threshold in [0.02, 0.03, 0.05, 0.08, 0.10, 0.30, 0.40, 1.00]:
    model_name, model = fit_lgbm_int_rate_by_shap(threshold)
    models_int_rate[model_name] = model
# Time: 2m 52.0s

Code

print("--- Train ---")
ml.regression_scores(
    models_int_rate,
    X_train,
    y_train_int_rate,
    sort_by="RMSE",
    color="orange",
    incorrect_as_nan=True,
)

--- Train ---

	n	SD	RMSE	MedianAD	MAE	R²	SD_RMSE_ratio	MedianAD_MAE_ratio
LGBM (17 features)	346669	5.149	1.333	3.610	0.835	0.933	3.862	4.323
LGBM (22 features)	346669	5.149	1.341	3.610	0.846	0.932	3.839	4.269
LGBM (25 features)	346669	5.149	1.344	3.610	0.851	0.932	3.831	4.243
LGBM (30 features)	346669	5.149	1.362	3.610	0.867	0.930	3.781	4.165
LGBM (34 features)	346669	5.149	1.369	3.610	0.872	0.929	3.762	4.140
LGBM	346669	5.149	1.382	3.610	0.881	0.928	3.727	4.096
LGBM (9 features)	346669	5.149	1.479	3.610	0.859	0.917	3.481	4.204
LGBM (6 features)	346669	5.149	1.595	3.610	0.894	0.904	3.228	4.039
LGBM (2 features)	346669	5.149	2.809	3.610	1.826	0.702	1.833	1.977
Ridge Regression	346669	5.149	3.682	3.610	2.811	0.489	1.398	1.284
LASSO Regression	346669	5.149	4.329	3.610	3.401	0.293	1.189	1.061
Linear Regression	346669	5.149	53939835148.674	3.610	1450519710.172	nan	0.000	0.000

Code

print("--- Validation ---")
ml.regression_scores(
    models_int_rate,
    X_validation,
    y_validation_int_rate,
    sort_by="RMSE",
    incorrect_as_nan=True,
)

--- Validation ---

	n	SD	RMSE	MedianAD	MAE	R²	SD_RMSE_ratio	MedianAD_MAE_ratio
LGBM (17 features)	74286	5.153	1.364	3.610	0.850	0.930	3.778	4.246
LGBM (22 features)	74286	5.153	1.371	3.610	0.859	0.929	3.758	4.201
LGBM (25 features)	74286	5.153	1.373	3.610	0.865	0.929	3.754	4.173
LGBM (30 features)	74286	5.153	1.393	3.610	0.881	0.927	3.700	4.099
LGBM (34 features)	74286	5.153	1.402	3.610	0.888	0.926	3.676	4.067
LGBM	74286	5.153	1.416	3.610	0.897	0.925	3.640	4.024
LGBM (9 features)	74286	5.153	1.511	3.610	0.868	0.914	3.410	4.159
LGBM (6 features)	74286	5.153	1.623	3.610	0.906	0.901	3.174	3.983
LGBM (2 features)	74286	5.153	2.819	3.610	1.824	0.701	1.828	1.979
Ridge Regression	74286	5.153	3.692	3.610	2.822	0.487	1.396	1.279
LASSO Regression	74286	5.153	4.338	3.610	3.410	0.291	1.188	1.059
Linear Regression	74286	5.153	34670485474.882	3.610	1397356309.915	nan	0.000	0.000

4.4.5 Evaluation on Test Set

On the test set, RMSE = 1.347 and MAE = 0.838 (even a bit better than before which might be a consequence that the model was trained on a larger dataset).

Please, find the details below.

Code

# Model input variables
int_rate_model_input = [
    "total_rec_int",
    "total_rec_prncp",
    "issue_month",
    "fico_range_high",
    "installment",
    "out_prncp",
    "funded_amnt",
    "disbursement_method_Cash",
    "loan_amnt",
    "bc_open_to_buy",
    "last_fico_range_high",
    "out_prncp_inv",
    "term_ 60 months",
    "funded_amnt_inv",
    "term_ 36 months",
    "dti",
    "fico_range_low",
]

Code

X_train_validation = pd.concat([X_train, X_validation], axis="index")
y_train_validation_int_rate = pd.concat(
    [y_train_int_rate, y_validation_int_rate], axis="index"
)


@my.cache_results(dir_interim + "models_4_03_lgbm--final-evaluation.pickle")
def fit_lgbm_int_rate_final_evaluation():
    """Fit a LGBM model for final testing."""
    pipeline = Pipeline(
        steps=[
            ("preprocessor", clone(pre_processing_trees)),
            ("selector", ColumnSelector(int_rate_model_input)),
            (
                "regressor",
                LGBMRegressor(
                    random_state=1,
                    objective="regression",
                    n_jobs=-1,
                    device="gpu",
                    verbosity=1,
                ),
            ),
        ]
    )
    pipeline.fit(X_train_validation, y_train_validation_int_rate)
    return pipeline


model_int_rate_final = {}
model_int_rate_final["LGBM (final evaluation)"] = fit_lgbm_int_rate_final_evaluation()

Code

print("--- Train + Validation ---")
display(
    ml.regression_scores(
        model_int_rate_final,
        X_train_validation,
        y_train_validation_int_rate,
        sort_by="RMSE",
        color="orange",
    )
)

--- Train + Validation ---

	n	SD	RMSE	MedianAD	MAE	R²	SD_RMSE_ratio	MedianAD_MAE_ratio
LGBM (final evaluation)	420955	5.151	1.332	3.610	0.831	0.933	3.868	4.346

Code

print("--- Test ---")
display(
    ml.regression_scores(model_int_rate_final, X_test, y_test_int_rate, sort_by="RMSE")
)

--- Test ---

	n	SD	RMSE	MedianAD	MAE	R²	SD_RMSE_ratio	MedianAD_MAE_ratio
LGBM (final evaluation)	74287	5.152	1.347	3.610	0.838	0.932	3.825	4.307

4.4.6 Final Model for Deployment

Code

# Pipeline input variables
int_rate_pipeline_input = [
    "total_rec_int",
    "total_rec_prncp",
    "issue_d",
    "fico_range_high",
    "installment",
    "out_prncp",
    "funded_amnt",
    "disbursement_method",
    "loan_amnt",
    "bc_open_to_buy",
    "last_fico_range_high",
    "out_prncp_inv",
    "term",
    "funded_amnt_inv",
    "dti",
    "fico_range_low",
]


@my.cache_results(dir_interim + "04--model_predict_int_rate.pkl")
def fit_lgbm_int_rate_for_deployment():
    """Final LGBM model for sub-grades prediction."""
    pipeline = Pipeline(
        steps=[
            ("selector_1", ColumnSelector(int_rate_pipeline_input)),
            ("preprocessor_1", PreprocessorForInterestRates()),
            ("preprocessor_2", clone(pre_processing_trees)),
            ("selector_2", ColumnSelector(int_rate_model_input)),
            (
                "regressor",
                LGBMRegressor(
                    random_state=1,
                    objective="regression",
                    n_jobs=-1,
                    device="gpu",
                    verbosity=1,
                ),
            ),
        ]
    )

    # Load required dataset
    data_file_path = dir_interim + "task-2--1-accepted_loans_2018--raw.feather"
    accepted_2018 = pd.read_feather(data_file_path)

    pipeline.set_output(transform="pandas")
    pipeline.fit(accepted_2018, accepted_2018["int_rate"])
    return pipeline


# Fit the model
loan_int_rate_predictor_final = fit_lgbm_int_rate_for_deployment()
loan_int_rate_predictor_final

5 Deployment

Currently, the models are deployed on the Google Cloud Platform (GCP) as a single Flask app that is available at https://lending-club-app-x6jg32rquq-ew.a.run.app/. More details and examples on how to use the models are available in the README file of the app subdirectory of this project’s GitHub repository.

The app folder on the GitHub repository of the project: app.

6 Final Remarks

In binary classification, the default threshold of 0.5 was used.
Threshold adjustment (e.g., via ROC curve analysis) might be beneficial.
Hyperparameter tuning was not performed in this analysis due to the lack of time assigned to the task. Models with tuned parameters might lead to better performance. Hope, the Scikit Learn’s defaults work well enough.
In loan status prediction, for logistic regression and Naive Bayes, capped versions of ‘loan_amount’ and ‘debt_to_income_ratio’ were used. And for LGBM, non-capped. It could have been tried to check which type of pre-processing works better instead of making assumptions.
Feature selection after initial modeling was based on SHAP values and was done manually. It helped to achieve almost the same or even better performance with significantly fewer features. More rigorous procedures like recursive feature elimination (RFE) might lead to an even better subset of variables.
This notebook from time to time started generating underflow issues, especially in some plots and predict_proba() method. So the analysis was not performed in one go. (This also leads to the next point.)
Many intermediate results were saved to files:
1. Better and more systematic naming conventions should make it easier to track the results.
2. Saving so many results on a disc may save time program crashes but needs an excessive amount of space to save the results. If disc space is an issue, some more clever way may be invented to remove older versions of the results that are not needed.
3. Due to constant saving some parts of code became more dense and, thus, harder to read.
In some cases (especially model training) there is a lot of repeated code that could have been generalized. Unfortunately, this would have required much more time.
I explored 3 libraries of AutoML. Unfortunately, none of them were included in the analysis:
- auto-sklearn works only on Linux (I used Windows) and has issues with the newest versions of some packages (I cannot remember which ones: pandas, numpy, scikit-learn, or the newest Python itself (I used 3.11).
- ludwig dependencies conflict with the dependencies of jupyter;
- tplot2 has poor documentation and tpot does not support the newest versions of pandas. I managed to install it into a separate environment and run some toy examples but decided not to include it in the analysis due lack of available time.
Plots:
- plot themes are not unified throughout the project (the same type of plot might have a different theme). I am not really sure, which functions silently change the themes.
- some plots may be better labeled.
- some pots (especially SHAP values) are large and could be made smaller.
Some results are poorly described or come without explanation. I used them to make decisions for further analysis or inspect whether pre-processing or other steps of analysis were performed as expected.
Not all functions from functions subfolder were used. They should be treated as a separate module.
Models were created and deployed. Unfortunately, I do not have enough field knowledge if the selected variables make sense and do not leak information. A good discussion with a field specialist would be beneficial.