Lending Club Data Analysis

Data Analysis Project

Author

Vilmantas Gėgžna

Published

2023-11-30

Updated

2023-11-30

Lending Club project logo. Originally generated with Leonardo.Ai.

Annotation

In this project, a comprehensive analysis of Lending Club loan data was conducted. Because trends were found to change over time, only data from the most recent year, 2018, was used in the modeling phase to ensure its relevance. The modeling process included two major tasks: predicting loan application status (accepted/rejected) and forecasting key attributes of accepted loans, including grade, sub-grade, and interest rate.

Four distinct models were developed with a thorough approach, addressing challenges such as class imbalance and data size. Rigorous procedures were employed to refine and optimize each model. Subsequently, the most effective models were selected for deployment on the Google Cloud Platform (GCP).

While the models have been successfully deployed and are currently accessible through an API, ongoing efforts for refinement and enhancement are acknowledged. Continuous improvement remains a priority to ensure the models’ accuracy and effectiveness over time.

1 Data

1.1 Explore Data Files

The dataset comes in two files: one for the accepted loans and one for the rejected ones. The dimensions and size of the data are as follows:

  • 2.3M rows, 151 columns, and 1.6 GB of accepted loans’ data.
  • 27.6M rows, 9 columns, and 1.7 GB of rejected loans’ data.

Given the size of the data, some optimizations will be used to reduce the memory footprint (e.g., more efficient data types will be chosen, and only the necessary columns will be loaded).
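
As a minimal sketch of the kind of optimization meant here (illustrative values only, not the project's actual loading code), converting a repetitive string column to pandas' `category` dtype shrinks its memory footprint considerably:

```python
import pandas as pd

# Illustrative sketch: a repetitive string column stored as `object`
# vs. the more memory-efficient `category` dtype.
states = pd.Series(["CA", "NY", "TX", "CA", "NY"] * 100_000)

bytes_object = states.memory_usage(deep=True)
bytes_category = states.astype("category").memory_usage(deep=True)

print(f"object:   {bytes_object / 1e6:.1f} MB")
print(f"category: {bytes_category / 1e6:.1f} MB")
```

The same idea applies to `float32` instead of `float64` and nullable `Int16`/`Int8` instead of `float64`-with-NaN for small integers.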

The data spans from 2007 to 2018.

Because the two files differ in the number of variables and the variable names do not match, the variables were paired manually; the best-matching ones are listed in Table 1.1.

Table 1.1. The names of matching variables in the datasets of accepted and rejected loans.

| Variable             | Name in “Accepted”                 | Name in “Rejected”   |
| -------------------- | ---------------------------------- | -------------------- |
| Loan Amount          | loan_amnt                          | Amount Requested     |
| Application Date     | issue_d                            | Application Date     |
| Loan Title           | title                              | Loan Title           |
| Risk Score           | fico_range_low and fico_range_high | Risk_Score           |
| Debt-To-Income Ratio | dti                                | Debt-To-Income Ratio |
| Zip Code             | zip_code                           | Zip Code             |
| State                | addr_state                         | State                |
| Employment Length    | emp_length                         | Employment Length    |
| Policy Code          | policy_code                        | Policy Code          |

Details:

Code
!echo Data file names:
!ls data/raw/
Data file names:
accepted_2007_to_2018Q4.csv
rejected_2007_to_2018Q4.csv
Code
!echo File sizes:
!cd data/raw/ &&\
    du -m accepted_2007_to_2018Q4.csv rejected_2007_to_2018Q4.csv |\
    sed 's/\([0-9]\+\)/\1 MB /'
File sizes:
1598 MB     accepted_2007_to_2018Q4.csv
1700 MB     rejected_2007_to_2018Q4.csv
Code
# NOTE: header line is also included here
!echo Number of lines per file:
!cd data/raw/ &&\
    wc --lines accepted_2007_to_2018Q4.csv rejected_2007_to_2018Q4.csv
Number of lines per file:
   2260702 accepted_2007_to_2018Q4.csv
  27648742 rejected_2007_to_2018Q4.csv
  29909444 total
Code
!echo Number of columns per file:

!cd data/raw/ &&\
    (csvcut -n accepted_2007_to_2018Q4.csv | wc -l | xargs printf "%5d\n" &&\
          echo accepted_2007_to_2018Q4.csv) |\
    paste -s -d ' '

!cd data/raw/ &&\
    (csvcut -n rejected_2007_to_2018Q4.csv | wc -l | xargs printf "%5d\n" &&\
          echo rejected_2007_to_2018Q4.csv) |\
    paste -s -d ' '
Number of columns per file:
  151 accepted_2007_to_2018Q4.csv 
    9 rejected_2007_to_2018Q4.csv 

The top few rows of each file (formatted as a table):

Code
!cd data/raw/ &&\
    head -n 5 accepted_2007_to_2018Q4.csv | csvlook
|         id | member_id | loan_amnt | funded_amnt | funded_amnt_inv |       term | int_rate | installment | grade | sub_grade | emp_title                   | emp_length | home_ownership | annual_inc | verification_status | issue_d  | loan_status | pymnt_plan | url                                                               | desc | purpose            | title              | zip_code | addr_state |   dti | delinq_2yrs | earliest_cr_line | fico_range_low | fico_range_high | inq_last_6mths | mths_since_last_delinq | mths_since_last_record | open_acc | pub_rec | revol_bal | revol_util | total_acc | initial_list_status | out_prncp | out_prncp_inv | total_pymnt | total_pymnt_inv | total_rec_prncp | total_rec_int | total_rec_late_fee | recoveries | collection_recovery_fee | last_pymnt_d | last_pymnt_amnt | next_pymnt_d | last_credit_pull_d | last_fico_range_high | last_fico_range_low | collections_12_mths_ex_med | mths_since_last_major_derog | policy_code | application_type | annual_inc_joint | dti_joint | verification_status_joint | acc_now_delinq | tot_coll_amt | tot_cur_bal | open_acc_6m | open_act_il | open_il_12m | open_il_24m | mths_since_rcnt_il | total_bal_il | il_util | open_rv_12m | open_rv_24m | max_bal_bc | all_util | total_rev_hi_lim | inq_fi | total_cu_tl | inq_last_12m | acc_open_past_24mths | avg_cur_bal | bc_open_to_buy | bc_util | chargeoff_within_12_mths | delinq_amnt | mo_sin_old_il_acct | mo_sin_old_rev_tl_op | mo_sin_rcnt_rev_tl_op | mo_sin_rcnt_tl | mort_acc | mths_since_recent_bc | mths_since_recent_bc_dlq | mths_since_recent_inq | mths_since_recent_revol_delinq | num_accts_ever_120_pd | num_actv_bc_tl | num_actv_rev_tl | num_bc_sats | num_bc_tl | num_il_tl | num_op_rev_tl | num_rev_accts | num_rev_tl_bal_gt_0 | num_sats | num_tl_120dpd_2m | num_tl_30dpd | num_tl_90g_dpd_24m | num_tl_op_past_12m | pct_tl_nvr_dlq | percent_bc_gt_75 | pub_rec_bankruptcies | tax_liens | tot_hi_cred_lim | total_bal_ex_mort | total_bc_limit | 
total_il_high_credit_limit | revol_bal_joint | sec_app_fico_range_low | sec_app_fico_range_high | sec_app_earliest_cr_line | sec_app_inq_last_6mths | sec_app_mort_acc | sec_app_open_acc | sec_app_revol_util | sec_app_open_act_il | sec_app_num_rev_accts | sec_app_chargeoff_within_12_mths | sec_app_collections_12_mths_ex_med | sec_app_mths_since_last_major_derog | hardship_flag | hardship_type | hardship_reason | hardship_status | deferral_term | hardship_amount | hardship_start_date | hardship_end_date | payment_plan_start_date | hardship_length | hardship_dpd | hardship_loan_status | orig_projected_additional_accrued_interest | hardship_payoff_balance_amount | hardship_last_payment_amount | disbursement_method | debt_settlement_flag | debt_settlement_flag_date | settlement_status | settlement_date | settlement_amount | settlement_percentage | settlement_term |
| ---------- | --------- | --------- | ----------- | --------------- | ---------- | -------- | ----------- | ----- | --------- | --------------------------- | ---------- | -------------- | ---------- | ------------------- | -------- | ----------- | ---------- | ----------------------------------------------------------------- | ---- | ------------------ | ------------------ | -------- | ---------- | ----- | ----------- | ---------------- | -------------- | --------------- | -------------- | ---------------------- | ---------------------- | -------- | ------- | --------- | ---------- | --------- | ------------------- | --------- | ------------- | ----------- | --------------- | --------------- | ------------- | ------------------ | ---------- | ----------------------- | ------------ | --------------- | ------------ | ------------------ | -------------------- | ------------------- | -------------------------- | --------------------------- | ----------- | ---------------- | ---------------- | --------- | ------------------------- | -------------- | ------------ | ----------- | ----------- | ----------- | ----------- | ----------- | ------------------ | ------------ | ------- | ----------- | ----------- | ---------- | -------- | ---------------- | ------ | ----------- | ------------ | -------------------- | ----------- | -------------- | ------- | ------------------------ | ----------- | ------------------ | -------------------- | --------------------- | -------------- | -------- | -------------------- | ------------------------ | --------------------- | ------------------------------ | --------------------- | -------------- | --------------- | ----------- | --------- | --------- | ------------- | ------------- | ------------------- | -------- | ---------------- | ------------ | ------------------ | ------------------ | -------------- | ---------------- | -------------------- | --------- | --------------- | ----------------- | -------------- | 
-------------------------- | --------------- | ---------------------- | ----------------------- | ------------------------ | ---------------------- | ---------------- | ---------------- | ------------------ | ------------------- | --------------------- | -------------------------------- | ---------------------------------- | ----------------------------------- | ------------- | ------------- | --------------- | --------------- | ------------- | --------------- | ------------------- | ----------------- | ----------------------- | --------------- | ------------ | -------------------- | ------------------------------------------ | ------------------------------ | ---------------------------- | ------------------- | -------------------- | ------------------------- | ----------------- | --------------- | ----------------- | --------------------- | --------------- |
| 68,407,277 |           |     3,600 |       3,600 |           3,600 | 0004-01-01 |    13.99 |      123.03 | C     | C4        | leadman                     | 10+ years  | MORTGAGE       |     55,000 | Not Verified        | Dec-2015 | Fully Paid  |      False | https://lendingclub.com/browse/loanDetail.action?loan_id=68407277 |      | debt_consolidation | Debt consolidation | 190xx    | PA         |  5.91 |           0 | Aug-2003         |            675 |             679 |              1 |                     30 |                        |        7 |       0 |     2,765 |       29.7 |        13 | w                   |      0.00 |          0.00 |  4,421.724… |        4,421.72 |        3,600.00 |        821.72 |                  0 |          0 |                       0 | Jan-2019     |          122.67 |              | Mar-2019           |                  564 |                 560 |                          0 |                          30 |           1 | Individual       |                  |           |                           |              0 |          722 |     144,904 |           2 |           2 |           0 |           1 |                 21 |        4,981 |      36 |           3 |           3 |        722 |       34 |            9,300 |      3 |           1 |            4 |                    4 |      20,701 |          1,506 |    37.2 |                        0 |           0 |                148 |                  128 |                     3 |              3 |        1 |                    4 |                       69 |                     4 |                             69 |                     2 |              2 |               4 |           2 |         5 |         3 |             4 |             9 |                   4 |        7 |                0 |            0 |                  0 |                  3 |           76.9 |              0.0 |                    0 |         0 |         178,050 |             7,746 |          2,400 |                     
13,734 |                 |                        |                         |                          |                        |                  |                  |                    |                     |                       |                                  |                                    |                                     |         False |               |                 |                 |               |                 |                     |                   |                         |                 |              |                      |                                            |                                |                              | Cash                |                False |                           |                   |                 |                   |                       |                 |
| 68,355,089 |           |    24,700 |      24,700 |          24,700 | 0004-01-01 |    11.99 |      820.28 | C     | C1        | Engineer                    | 10+ years  | MORTGAGE       |     65,000 | Not Verified        | Dec-2015 | Fully Paid  |      False | https://lendingclub.com/browse/loanDetail.action?loan_id=68355089 |      | small_business     | Business           | 577xx    | SD         | 16.06 |           1 | Dec-1999         |            715 |             719 |              4 |                      6 |                        |       22 |       0 |    21,470 |       19.2 |        38 | w                   |      0.00 |          0.00 | 25,679.660… |       25,679.66 |       24,700.00 |        979.66 |                  0 |          0 |                       0 | Jun-2016     |          926.35 |              | Mar-2019           |                  699 |                 695 |                          0 |                             |           1 | Individual       |                  |           |                           |              0 |            0 |     204,396 |           1 |           1 |           0 |           1 |                 19 |       18,005 |      73 |           2 |           3 |      6,472 |       29 |          111,800 |      0 |           0 |            6 |                    4 |       9,733 |         57,830 |    27.1 |                        0 |           0 |                113 |                  192 |                     2 |              2 |        4 |                    2 |                          |                     0 |                              6 |                     0 |              5 |               5 |          13 |        17 |         6 |            20 |            27 |                   5 |       22 |                0 |            0 |                  0 |                  2 |           97.4 |              7.7 |                    0 |         0 |         314,017 |            39,475 |         79,300 |                     
24,667 |                 |                        |                         |                          |                        |                  |                  |                    |                     |                       |                                  |                                    |                                     |         False |               |                 |                 |               |                 |                     |                   |                         |                 |              |                      |                                            |                                |                              | Cash                |                False |                           |                   |                 |                   |                       |                 |
| 68,341,763 |           |    20,000 |      20,000 |          20,000 | 0006-01-01 |    10.78 |      432.66 | B     | B4        | truck driver                | 10+ years  | MORTGAGE       |     63,000 | Not Verified        | Dec-2015 | Fully Paid  |      False | https://lendingclub.com/browse/loanDetail.action?loan_id=68341763 |      | home_improvement   |                    | 605xx    | IL         | 10.78 |           0 | Aug-2000         |            695 |             699 |              0 |                        |                        |        6 |       0 |     7,869 |       56.2 |        18 | w                   |      0.00 |          0.00 | 22,705.924… |       22,705.92 |       20,000.00 |      2,705.92 |                  0 |          0 |                       0 | Jun-2017     |       15,813.30 |              | Mar-2019           |                  704 |                 700 |                          0 |                             |           1 | Joint App        |           71,000 |     13.85 | Not Verified              |              0 |            0 |     189,699 |           0 |           1 |           0 |           4 |                 19 |       10,827 |      73 |           0 |           2 |      2,081 |       65 |           14,000 |      2 |           5 |            1 |                    6 |      31,617 |          2,737 |    55.9 |                        0 |           0 |                125 |                  184 |                    14 |             14 |        5 |                  101 |                          |                    10 |                                |                     0 |              2 |               3 |           2 |         4 |         6 |             4 |             7 |                   3 |        6 |                0 |            0 |                  0 |                  0 |          100.0 |             50.0 |                    0 |         0 |         218,418 |            18,696 |          6,200 |                     
14,877 |                 |                        |                         |                          |                        |                  |                  |                    |                     |                       |                                  |                                    |                                     |         False |               |                 |                 |               |                 |                     |                   |                         |                 |              |                      |                                            |                                |                              | Cash                |                False |                           |                   |                 |                   |                       |                 |
| 66,310,712 |           |    35,000 |      35,000 |          35,000 | 0006-01-01 |    14.85 |      829.90 | C     | C5        | Information Systems Officer | 10+ years  | MORTGAGE       |    110,000 | Source Verified     | Dec-2015 | Current     |      False | https://lendingclub.com/browse/loanDetail.action?loan_id=66310712 |      | debt_consolidation | Debt consolidation | 076xx    | NJ         | 17.06 |           0 | Sep-2008         |            785 |             789 |              0 |                        |                        |       13 |       0 |     7,802 |       11.6 |        17 | w                   | 15,897.65 |     15,897.65 | 31,464.010… |       31,464.01 |       19,102.35 |     12,361.66 |                  0 |          0 |                       0 | Feb-2019     |          829.90 | Apr-2019     | Mar-2019           |                  679 |                 675 |                          0 |                             |           1 | Individual       |                  |           |                           |              0 |            0 |     301,500 |           1 |           1 |           0 |           1 |                 23 |       12,609 |      70 |           1 |           1 |      6,987 |       45 |           67,300 |      0 |           1 |            0 |                    2 |      23,192 |         54,962 |    12.1 |                        0 |           0 |                 36 |                   87 |                     2 |              2 |        1 |                    2 |                          |                       |                                |                     0 |              4 |               5 |           8 |        10 |         2 |            10 |            13 |                   5 |       13 |                0 |            0 |                  0 |                  1 |          100.0 |              0.0 |                    0 |         0 |         381,215 |            52,226 |         62,500 |                     
18,000 |                 |                        |                         |                          |                        |                  |                  |                    |                     |                       |                                  |                                    |                                     |         False |               |                 |                 |               |                 |                     |                   |                         |                 |              |                      |                                            |                                |                              | Cash                |                False |                           |                   |                 |                   |                       |                 |
Code
!cd data/raw/ &&\
    head -n 5 rejected_2007_to_2018Q4.csv | csvlook
| Amount Requested | Application Date | Loan Title                       | Risk_Score | Debt-To-Income Ratio | Zip Code | State | Employment Length | Policy Code |
| ---------------- | ---------------- | -------------------------------- | ---------- | -------------------- | -------- | ----- | ----------------- | ----------- |
|            1,000 |       2007-05-26 | Wedding Covered but No Honeymoon |        693 |                10.00 | 481xx    | NM    | 4 years           |           0 |
|            1,000 |       2007-05-26 | Consolidating Debt               |        703 |                10.00 | 010xx    | MA    | < 1 year          |           0 |
|           11,000 |       2007-05-27 | Want to consolidate my debt      |        715 |                10.00 | 212xx    | MD    | 1 year            |           0 |
|            6,000 |       2007-05-27 | waksman                          |        698 |                38.64 | 017xx    | MA    | < 1 year          |           0 |

Only the matching variables (Table 1.1) from the dataset of accepted loans:

Code
!cd data/raw/ &&\
    head -n 5 accepted_2007_to_2018Q4.csv |\
    csvcut -c loan_amnt,issue_d,title,fico_range_low,fico_range_high,dti,zip_code,addr_state,emp_length,policy_code |\
    csvlook
| loan_amnt | issue_d  | title              | fico_range_low | fico_range_high |   dti | zip_code | addr_state | emp_length | policy_code |
| --------- | -------- | ------------------ | -------------- | --------------- | ----- | -------- | ---------- | ---------- | ----------- |
|     3,600 | Dec-2015 | Debt consolidation |            675 |             679 |  5.91 | 190xx    | PA         | 10+ years  |           1 |
|    24,700 | Dec-2015 | Business           |            715 |             719 | 16.06 | 577xx    | SD         | 10+ years  |           1 |
|    20,000 | Dec-2015 |                    |            695 |             699 | 10.78 | 605xx    | IL         | 10+ years  |           1 |
|    35,000 | Dec-2015 | Debt consolidation |            785 |             789 | 17.06 | 076xx    | NJ         | 10+ years  |           1 |

1.2 Variable Description

The description of abbreviated variable names can be found at https://figshare.com/articles/dataset/Lending_club_dataset_description/20016077 (last checked 2023-11-26).

Some variables, such as the ZIP code, are explained below.

1.2.1 ZIP Codes

In the USA, postal codes are called ZIP codes (ZIP stands for “Zone Improvement Plan”). The meaning of the first five digits of a ZIP code is illustrated below.

The meaning of USA ZIP (postal) code digits (source)

What the first digit means is illustrated below.

The USA national areas defined by the 1st digit of the ZIP code (source).
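
Since the ZIP codes in this dataset are masked to their first three digits (e.g., 190xx), only the national area and the sectional center facility can be recovered. A small sketch of extracting them (hypothetical series, not the project's code):

```python
import pandas as pd

# Masked ZIP codes keep only the first 3 digits, so just the national area
# (1st digit) and the sectional center facility (digits 1-3) are recoverable.
zips = pd.Series(["190xx", "577xx", "605xx", "076xx"], name="zip_code")

national_area = zips.str[0]
sectional_center = zips.str[:3]

print(national_area.tolist())     # ['1', '5', '6', '0']
print(sectional_center.tolist())  # ['190', '577', '605', '076']
```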

2 Python Packages and Functions

The next cells import the main Python packages and define the functions that will be used in the analysis.

# Automatically reload certain modules
%reload_ext autoreload
%autoreload 1

# Plotting
%matplotlib inline

# Packages and modules -------------------------------
# Utilities
import os
import re
import warnings
import numpy as np

# Dataframes
import pandas as pd

# EDA and plotting
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter

import sweetviz
import klib

# Patch sklearn with Intel's version
from sklearnex import patch_sklearn

patch_sklearn()  # Run this code before importing from sklearn

# Machine learning
import lightgbm as lgb
from sklearn import set_config

from sklearn.base import clone
from sklearn.pipeline import Pipeline
from sklearn.compose import (
    ColumnTransformer,
    make_column_selector,
)
from sklearn.preprocessing import (StandardScaler, OneHotEncoder)
from sklearn.impute import SimpleImputer

from sklearn.model_selection import train_test_split

# ML: classification and regression models
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import VotingClassifier, RandomForestRegressor
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import SGDClassifier, SGDRegressor
from lightgbm import LGBMClassifier, LGBMRegressor

# ML: feature selection
from feature_engine.selection import DropFeatures

from feature_engine.creation import CyclicalFeatures

# ML: explainability
import shap

# Display
from IPython.display import display

# Custom functions
import functions.fun_utils as my
import functions.fun_analysis as an
import functions.fun_ml as ml
import functions.utils as utils
from functions.utils import (
    ColumnSelector,
    PreprocessorForGrades,
    PreprocessorForSubgrades,
    PreprocessorForInterestRates,
)

%aimport functions.fun_utils
%aimport functions.fun_analysis
%aimport functions.fun_ml
%aimport functions.utils

# Settings --------------------------------------------
# Default plot options
plt.rc("figure", titleweight="bold")
plt.rc("axes", labelweight="bold", titleweight="bold")
plt.rc("font", weight="normal", size=10)
plt.rc("figure", figsize=(10, 3))

# Pandas options
pd.set_option("display.max_rows", 1000)
pd.set_option("display.max_columns", 300)
pd.set_option("display.max_colwidth", 50)  # Possible option: None
pd.set_option("display.float_format", lambda x: f"{x:.2f}")
pd.set_option("styler.format.thousands", ",")

# Turn off the scientific notation for floating point numbers.
np.set_printoptions(suppress=True)

# Scikit-learn options
set_config(transform_output="pandas")

# Analysis parameters: use Sweetviz for eda?
do_eda = True

# For caching results ---------------------------------
dir_interim = "data/interim/"
os.makedirs(dir_interim, exist_ok=True)
Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)

Many custom functions are defined in separate files and imported as modules. Some functions specific to this project are defined in the next cell. Additional ad-hoc convenience functions are defined throughout the analysis, at the point where they are used.

def axis_formatter(format="M", precision=0):
    """Return a function for formatting axis labels.

    Args:
        format (str): The format to use. Either 'M', 'k', or ''.
        precision (int): The number of decimal places to use.

    """
    if format == "M":
        power = 1e-6
    elif format == "k":
        power = 1e-3
    elif format == "":
        power = 1
    else:
        raise ValueError("format must be either 'M', 'k', or ''.")

    def formatter(x, pos):
        return f"{x * power:1.{precision}f}{format}"

    return FuncFormatter(formatter)


def create_column_selector_pattern(column_names):
    """Creates a regex pattern that selects columns matching the column names.

    Given a list of column names, creates a regex pattern that selects columns
    matching the column names.

    Args:
    - column_names: list of str, names of columns to select

    Returns:
    - pattern: str, regex pattern that selects columns matching the column names

    Example:
    >>> column_names = ['age', 'gender', 'income']
    >>> pattern = create_column_selector_pattern(column_names)
    # pattern will be '^(?:age|gender|income)$'
    # This pattern can be used as the value of `pattern` argument of
        ScikitLearn's `make_column_selector()`.
    """
    if isinstance(column_names, str):
        column_names = [column_names]

    if not isinstance(column_names, list):
        raise TypeError("column_names must be either a list or a string.")

    pattern = "|".join(column_names)
    pattern = "^(?:" + pattern + ")$"
    return pattern


def str_to_list(x):
    """Convert a docstring to a list of words."""
    return re.findall(r"\w+", x)
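
To illustrate what `create_column_selector_pattern()` produces, the anchored alternation below matches a column name exactly and rejects partial matches (a self-contained sketch with the pattern inlined):

```python
import re

# The helper above builds an anchored alternation like this one; anchoring
# with ^...$ prevents partial matches such as "age" matching "age_group".
pattern = "^(?:" + "|".join(["age", "gender", "income"]) + ")$"

print(pattern)                               # ^(?:age|gender|income)$
print(bool(re.match(pattern, "income")))     # True
print(bool(re.match(pattern, "age_group")))  # False
```

Such a pattern can be passed as the `regex` argument of scikit-learn's `make_column_selector()` inside a `ColumnTransformer`.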

3 Task 1: Predicting Loan Status

In this part, the focus is on predicting whether a loan will be accepted or rejected.

3.1 Import and Inspect Data

First, a subset of columns from the dataset of accepted loans will be imported as ds_accepted. This subset contains the columns that match the rejected loans’ data (see Table 1.1), plus an annual income column used to verify the debt-to-income ratio. Next, the dataset of rejected loans will be imported as ds_rejected. For both datasets, column data types are pre-selected based on the preliminary investigation of the data files. Non-matching data types of matching columns will be fixed after further inspection.

# fmt: off
# Columns for "Accepted" Dataset
columns_in_accepted_ds = [
    'loan_amnt',           # Loan Amount
    'issue_d',             # Application Date
    'title',               # Loan Title
    'fico_range_low',      # Risk Score (Low End)
    'fico_range_high',     # Risk Score (High End)
    'dti',                 # Debt-To-Income Ratio
    'zip_code',            # Zip Code
    'addr_state',          # State
    'emp_length',          # Employment Length
    'policy_code',         # Policy Code
    'annual_inc'           # Annual Income
]

# Define the data types for the selected columns
column_data_types_accepted_ds = {
    'loan_amnt': "float32",
    'issue_d': str,
    'title': str,
    'fico_range_low': "Int16",
    'fico_range_high': "Int16",
    'dti': "float32",
    'zip_code': "category",
    'addr_state': "category",
    'emp_length': "category",
    'policy_code': "Int8"
}

# Define a dictionary for column renaming
column_rename_dict = {
    'loan_amnt': 'loan_amount',
    'issue_d': 'date',
    'title': 'loan_title',
    'fico_range_low': 'risk_score_low',
    'fico_range_high': 'risk_score_high',
    'dti': 'debt_to_income_ratio',
    'zip_code': 'zip_code',
    'addr_state': 'state',
    'emp_length': 'employment_length',
    'policy_code': 'policy_code'
}
# fmt: on

# Read "accepted" dataset, reorder and rename columns
ds_accepted = pd.read_csv(
    "data/raw/accepted_2007_to_2018Q4.csv",
    usecols=columns_in_accepted_ds,
    dtype=column_data_types_accepted_ds,
)[columns_in_accepted_ds].rename(columns=column_rename_dict)


del columns_in_accepted_ds, column_data_types_accepted_ds, column_rename_dict
# Define the data types for the selected columns
column_data_types_rejected_ds = {
    "Amount Requested": "float32",
    "Application Date": str,
    "Loan Title": str,
    "Risk_Score": "Int16",
    "Debt-To-Income Ratio": str,
    "Zip Code": "category",
    "State": "category",
    "Employment Length": "category",
    "Policy Code": "Int8",
}

# Define a dictionary for column renaming
column_rename_dict = {
    "Amount Requested": "loan_amount",
    "Application Date": "date",
    "Loan Title": "loan_title",
    "Risk_Score": "risk_score",
    "Debt-To-Income Ratio": "debt_to_income_ratio",
    "Zip Code": "zip_code",
    "State": "state",
    "Employment Length": "employment_length",
    "Policy Code": "policy_code",
}

# Read the "rejected" dataset with data type conversion and column renaming
ds_rejected = pd.read_csv(
    "data/raw/rejected_2007_to_2018Q4.csv",
    dtype=column_data_types_rejected_ds,
).rename(columns=column_rename_dict)

del column_data_types_rejected_ds, column_rename_dict
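
One of the non-matching formats mentioned above is the date column: the accepted file stores it as “Dec-2015” while the rejected file uses “2007-05-26”. A minimal sketch of harmonizing them (illustrative values, not the project's exact code):

```python
import pandas as pd

# Each file stores dates in a different format, so each needs its own
# explicit format string before the datasets can be compared over time.
accepted_dates = pd.to_datetime(pd.Series(["Dec-2015", "Mar-2016"]), format="%b-%Y")
rejected_dates = pd.to_datetime(pd.Series(["2007-05-26", "2018-12-31"]), format="%Y-%m-%d")

print(accepted_dates.dt.year.tolist())  # [2015, 2016]
print(rejected_dates.dt.year.tolist())  # [2007, 2018]
```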

The dataset of accepted loans currently has extra columns, which are needed for some calculations and verifications; the columns that are not needed will be dropped later.

Code
ds_accepted.shape
(2260701, 11)
Code
ds_rejected.shape
(27648741, 9)

Both datasets have duplicates:

Code
ds_accepted.duplicated().sum()
32
Code
ds_rejected.duplicated().sum()
157954

Currently, the dataset of accepted loans takes approximately 95 MB of memory, while the dataset of rejected loans takes 976 MB (~10× more). The columns with string data are the largest ones.

Code
ds_accepted.info(verbose=False)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2260701 entries, 0 to 2260700
Columns: 11 entries, loan_amount to annual_inc
dtypes: Int16(2), Int8(1), category(3), float32(2), float64(1), object(2)
memory usage: 94.9+ MB
Code
ds_rejected.info(verbose=False)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27648741 entries, 0 to 27648740
Columns: 9 entries, loan_amount to policy_code
dtypes: Int16(1), Int8(1), category(3), float32(1), object(3)
memory usage: 975.7+ MB
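The memory savings mentioned earlier come largely from the category dtype. A minimal stand-alone sketch (toy data, not the actual dataset) of why categorical storage shrinks low-cardinality string columns such as state:

```python
import pandas as pd

# Toy low-cardinality string column, similar in spirit to `state`
states = pd.Series(["CA"] * 1000 + ["NY"] * 1000)

bytes_object = states.memory_usage(deep=True)
bytes_category = states.astype("category").memory_usage(deep=True)
# The categorical version stores each label once plus small integer codes,
# so it uses far less memory than repeating the strings
```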

In both datasets, policy_code is constant or nearly constant.

In ds_accepted, the variable with the largest number of missing values is employment_length (6.5% missing); in ds_rejected, risk_score has 66.9% missing values.

Code
an.col_info(ds_accepted, style=True)
  column data_type memory_size n_unique p_unique n_missing p_missing n_dominant p_dominant p_dom_excl_na dominant
1 loan_amount float32 9.0 MB 1,572 0.1% 33 <0.1% 187,236 8.3% 8.3% 10000.0
2 date object 146.9 MB 139 <0.1% 33 <0.1% 61,992 2.7% 2.7% Mar-2016
3 loan_title object 167.9 MB 63,154 2.8% 23,359 1.0% 1,153,293 51.0% 51.5% Debt consolidation
4 risk_score_low Int16 6.8 MB 48 <0.1% 33 <0.1% 186,580 8.3% 8.3% 660
5 risk_score_high Int16 6.8 MB 48 <0.1% 33 <0.1% 186,580 8.3% 8.3% 664
6 debt_to_income_ratio float32 9.0 MB 10,845 0.5% 1,744 0.1% 1,732 0.1% 0.1% 0.0
7 zip_code category 4.6 MB 956 <0.1% 34 <0.1% 23,908 1.1% 1.1% 112xx
8 state category 2.3 MB 51 <0.1% 33 <0.1% 314,533 13.9% 13.9% CA
9 employment_length category 2.3 MB 11 <0.1% 146,940 6.5% 748,005 33.1% 35.4% 10+ years
10 policy_code Int8 4.5 MB 1 <0.1% 33 <0.1% 2,260,668 >99.9% 100.0% 1
11 annual_inc float64 18.1 MB 89,368 4.0% 37 <0.1% 87,189 3.9% 3.9% 60000.0
Code
an.col_info(ds_rejected, style=True)
  column data_type memory_size n_unique p_unique n_missing p_missing n_dominant p_dominant p_dom_excl_na dominant
1 loan_amount float32 110.6 MB 3,640 <0.1% 0 0% 3,920,004 14.2% 14.2% 10000.0
2 date object 1.9 GB 4,238 <0.1% 0 0% 42,112 0.2% 0.2% 2018-12-04
3 loan_title object 2.0 GB 73,927 0.3% 1,305 <0.1% 6,418,016 23.2% 23.2% Debt consolidation
4 risk_score Int16 82.9 MB 692 <0.1% 18,497,630 66.9% 178,456 0.6% 2.0% 501
5 debt_to_income_ratio object 1.7 GB 126,145 0.5% 0 0% 1,362,556 4.9% 4.9% 100%
6 zip_code category 55.4 MB 1,001 <0.1% 293 <0.1% 267,102 1.0% 1.0% 112xx
7 state category 27.7 MB 51 <0.1% 22 <0.1% 3,242,169 11.7% 11.7% CA
8 employment_length category 27.6 MB 11 <0.1% 951,355 3.4% 22,994,315 83.2% 86.1% < 1 year
9 policy_code Int8 55.3 MB 2 <0.1% 918 <0.1% 27,559,694 99.7% 99.7% 0
Code
ds_accepted.head()
loan_amount date loan_title risk_score_low risk_score_high debt_to_income_ratio zip_code state employment_length policy_code annual_inc
0 3600.00 Dec-2015 Debt consolidation 675 679 5.91 190xx PA 10+ years 1 55000.00
1 24700.00 Dec-2015 Business 715 719 16.06 577xx SD 10+ years 1 65000.00
2 20000.00 Dec-2015 NaN 695 699 10.78 605xx IL 10+ years 1 63000.00
3 35000.00 Dec-2015 Debt consolidation 785 789 17.06 076xx NJ 10+ years 1 110000.00
4 10400.00 Dec-2015 Major purchase 695 699 25.37 174xx PA 3 years 1 104433.00
Code
ds_rejected.head()
loan_amount date loan_title risk_score debt_to_income_ratio zip_code state employment_length policy_code
0 1000.00 2007-05-26 Wedding Covered but No Honeymoon 693 10% 481xx NM 4 years 0
1 1000.00 2007-05-26 Consolidating Debt 703 10% 010xx MA < 1 year 0
2 11000.00 2007-05-27 Want to consolidate my debt 715 10% 212xx MD 1 year 0
3 6000.00 2007-05-27 waksman 698 38.64% 017xx MA < 1 year 0
4 1500.00 2007-05-27 mdrigo 509 9.43% 209xx MD < 1 year 0

Dates in the accepted loans dataset are in MMM-YYYY format (e.g., Mar-2016), while those in the rejected loans dataset are in YYYY-MM-DD format (e.g., 2018-12-04). This should be unified.

Code
ds_accepted.date.unique()[:10]
array(['Dec-2015', 'Nov-2015', 'Oct-2015', 'Sep-2015', 'Aug-2015',
       'Jul-2015', 'Jun-2015', 'May-2015', 'Apr-2015', 'Mar-2015'],
      dtype=object)
Code
ds_rejected.date.unique()[:10]
array(['2007-05-26', '2007-05-27', '2007-05-28', '2007-05-29',
       '2007-05-30', '2007-05-31', '2007-06-01', '2007-06-02',
       '2007-06-03', '2007-06-04'], dtype=object)
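Unifying the two date formats amounts to parsing each column with its own explicit format string, after which both share the datetime64[ns] dtype. A minimal sketch on toy values:

```python
import pandas as pd

# Parse each dataset's dates with its own explicit format
accepted_dates = pd.to_datetime(
    pd.Series(["Dec-2015", "Mar-2016"]), format="%b-%Y"
)
rejected_dates = pd.to_datetime(
    pd.Series(["2018-12-04", "2007-05-26"]), format="%Y-%m-%d"
)
# Both columns now share the same datetime64[ns] dtype
```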

In the dataset of accepted loans, the relevant FICO (risk) score values range from 610 to 850. The difference between the lower and upper boundaries of this score is 4 points in most cases and 5 in rare ones. To express the score as a single value rather than an interval, the average of the lower and upper boundaries will be used. In the rejected loans dataset, the risk score is already a single value, ranging from 0 to 990.

Details: Risk score
Code
ds_accepted[["risk_score_low", "risk_score_high"]].describe()
risk_score_low risk_score_high
count 2260668.00 2260668.00
mean 698.59 702.59
std 33.01 33.01
min 610.00 614.00
25% 675.00 679.00
50% 690.00 694.00
75% 715.00 719.00
max 845.00 850.00
Code
risk_score_diff = ds_accepted.risk_score_high - ds_accepted.risk_score_low
display(risk_score_diff.describe())
an.summarize_discrete(pd.DataFrame({"risk_score_diff": risk_score_diff}))
count   2260668.00
mean          4.00
std           0.01
min           4.00
25%           4.00
50%           4.00
75%           4.00
max           5.00
dtype: Float64
risk_score_diff n percent
4 2,260,227 >99.9%
5 441 <0.1%
Code
ds_rejected.risk_score.describe()
count   9151111.00
mean        628.17
std          89.94
min           0.00
25%         591.00
50%         637.00
75%         675.00
max         990.00
Name: risk_score, dtype: Float64
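Collapsing the FICO interval to a single value is a simple midpoint calculation. A minimal sketch with toy boundaries (in the data the spread is 4, rarely 5, points):

```python
import pandas as pd

# Toy lower and upper FICO boundaries
low = pd.Series([675, 715], dtype="Int16")
high = pd.Series([679, 719], dtype="Int16")

# Single-value risk score as the midpoint of the interval
risk_score = (low + high) / 2
```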

In the accepted loans dataset, no units of measurement are given for the debt-to-income ratio. However, it appears to be expressed in percent, as its values are comparable to percentage values of the loan-to-income ratio (the debt-to-income ratio is a more complex indicator, so the exact values do not match):

ds_accepted_10 = ds_accepted.head(n=10)
display(
    pd.DataFrame({
        "loan_to_income_ratio": ds_accepted_10.loan_amount
        / ds_accepted_10.annual_inc
        * 100,
        "debt_to_income_ratio": ds_accepted_10.debt_to_income_ratio,
    })
)
del ds_accepted_10
loan_to_income_ratio debt_to_income_ratio
0 6.55 5.91
1 38.00 16.06
2 31.75 10.78
3 31.82 17.06
4 9.96 25.37
5 35.15 10.20
6 11.11 14.67
7 23.53 17.61
8 11.76 13.07
9 19.05 34.80

It seems that in ds_rejected, it is enough to remove the % sign from debt_to_income_ratio and convert the result to numeric values.

Moreover, some values are negative (-1%), which suggests that -1 serves as a missing-value indicator. This will be accounted for in the further analysis.

Code
ds_rejected.debt_to_income_ratio.value_counts()
debt_to_income_ratio
100%         1362556
-1%          1203063
0%           1045102
9999%          76984
1.2%           32659
              ...   
983.82%            1
1352.48%           1
3544.74%           1
5452.96%           1
21215.75%          1
Name: count, Length: 126145, dtype: int64
Code
ds_accepted.debt_to_income_ratio.value_counts()
debt_to_income_ratio
0.00      1732
18.00     1584
14.40     1577
16.80     1576
19.20     1566
          ... 
261.54       1
74.98        1
111.40       1
180.90       1
250.72       1
Name: count, Length: 10845, dtype: int64
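The two fixes described above (stripping the % sign and treating -1 as missing) can be sketched on toy values as follows:

```python
import numpy as np
import pandas as pd

dti_raw = pd.Series(["10%", "-1%", "38.64%"])

# Strip the trailing % sign and convert to numeric
dti = dti_raw.str.rstrip("%").astype("float32")
# Treat -1 as a missing-value indicator
dti = dti.replace(-1, np.nan)
```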

The employment length categories are the same in both datasets, but they are sorted alphabetically rather than in their natural order.

Code
ds_accepted.employment_length.cat.categories
Index(['1 year', '10+ years', '2 years', '3 years', '4 years', '5 years',
       '6 years', '7 years', '8 years', '9 years', '< 1 year'],
      dtype='object')
Code
ds_rejected.employment_length.cat.categories
Index(['1 year', '10+ years', '2 years', '3 years', '4 years', '5 years',
       '6 years', '7 years', '8 years', '9 years', '< 1 year'],
      dtype='object')
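Fixing the category order means assigning an ordered CategoricalDtype with the categories in their natural sequence (the project stores this order in utils.work_categories; here it is spelled out for a self-contained sketch):

```python
import pandas as pd

# Natural order of the employment-length categories
order = ["< 1 year", "1 year", "2 years", "3 years", "4 years", "5 years",
         "6 years", "7 years", "8 years", "9 years", "10+ years"]

emp = pd.Series(["10+ years", "< 1 year", "3 years"]).astype(
    pd.CategoricalDtype(categories=order, ordered=True)
)
# Comparisons, sorting, and min/max now respect the natural order
```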

The abbreviations of the states are the same in both datasets, but their order differs. The "rejected" dataset also has missing values represented as empty strings. These issues should be accounted for.

Code
states_accepted = ds_accepted.state.cat.categories
print("n categories =", len(states_accepted))
states_accepted
n categories = 51
Index(['AK', 'AL', 'AR', 'AZ', 'CA', 'CO', 'CT', 'DC', 'DE', 'FL', 'GA', 'HI',
       'IL', 'IN', 'KS', 'KY', 'LA', 'MA', 'MD', 'ME', 'MI', 'MN', 'MO', 'MS',
       'MT', 'NC', 'ND', 'NE', 'NH', 'NJ', 'NM', 'NV', 'NY', 'OH', 'OK', 'OR',
       'PA', 'RI', 'SC', 'SD', 'TN', 'TX', 'UT', 'VA', 'VT', 'WA', 'WI', 'WV',
       'WY', 'ID', 'IA'],
      dtype='object')
Code
states_rejected = ds_rejected.state.cat.categories
print("n categories =", len(states_rejected))
states_rejected
n categories = 51
Index(['AK', 'AL', 'AR', 'AZ', 'CA', 'CO', 'CT', 'DC', 'DE', 'FL', 'GA', 'HI',
       'IA', 'ID', 'IL', 'IN', 'KS', 'KY', 'LA', 'MA', 'MD', 'ME', 'MI', 'MN',
       'MO', 'MS', 'MT', 'NC', 'ND', 'NE', 'NH', 'NJ', 'NM', 'NV', 'NY', 'OH',
       'OK', 'OR', 'PA', 'RI', 'SC', 'SD', 'TN', 'TX', 'UT', 'VA', 'VT', 'WA',
       'WI', 'WV', 'WY'],
      dtype='object')

ZIP codes are represented by their first 3 digits only (e.g., 112xx). The meaning of these digits is illustrated in Section 1.2.1.

Code
zip_accepted = ds_accepted.zip_code.cat.categories
print("n categories =", len(zip_accepted))
zip_accepted
n categories = 956
Index(['010xx', '011xx', '012xx', '013xx', '014xx', '015xx', '016xx', '017xx',
       '018xx', '019xx',
       ...
       '733xx', '964xx', '375xx', '514xx', '698xx', '643xx', '202xx', '552xx',
       '055xx', '896xx'],
      dtype='object', length=956)

ZIP codes with fewer than 3 digits indicate possible issues.

Code
# Number of ZIP codes with less than 3 digits:
(ds_rejected.zip_code.str.count(r"[0-9]") < 3).sum()
1
Code
zip_rejected = ds_rejected.zip_code.cat.categories
print("n categories =", len(zip_rejected))
zip_rejected
n categories = 1001
Index(['000xx', '002xx', '006xx', '007xx', '008xx', '009xx', '010xx', '011xx',
       '012xx', '013xx',
       ...
       '839xx', '695xx', '818xx', '866xx', '849xx', '694xx', '579xx', '518xx',
       '004xx', '699xx'],
      dtype='object', length=1001)

There are 1001 unique values while at most 1000 are expected, which means the data contains mistakes. Manual inspection showed that 09O (with the letter "O") was present in the data instead of 090. This will be fixed.
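The fix, extracting the 3-digit prefix and replacing the mistyped letter "O" with a zero, can be sketched on toy values:

```python
import pandas as pd

# One prefix mistyped with the letter "O" instead of zero
zips = pd.Series(["090xx", "09Oxx", "112xx"])

# Keep the 3-digit prefix, repair the typo, and store as nullable Int16
fixed = zips.str[:3].str.replace("O", "0", regex=False).astype("Int16")
```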

3.2 Pre-Process Data

First, datasets will be pre-processed to make them mergeable. Then, they will be merged and further pre-processing will be done.

3.2.1 Pre-Process and Merge

Prepare data types for categorical variables.

Code
# Work length categories are  "< 1 year", "1 year", ..., "9 years", "10+ years"
work_dtype = pd.CategoricalDtype(utils.work_categories, ordered=True)

# State abbreviation categories
state_dtype = pd.CategoricalDtype(
    sorted(ds_accepted.state.cat.categories), ordered=False
)

Apply pre-processing steps to both datasets. The names of pre-processed datasets will be suffixed with _2.

Code
ds_accepted_2 = (
    # fmt: off
    ds_accepted
    .dropna(subset=['loan_amount'])
    .drop_duplicates()
    .assign(
        date=lambda x: pd.to_datetime(x.date, format="%b-%Y").dt.floor("d"),
        risk_score=lambda x: (
            ((x.risk_score_high + x.risk_score_low) / 2).astype("float16")
        ),
        state=lambda x: x.state.astype(state_dtype),
        zip_area=lambda x: x.zip_code.str[0].astype("Int16"),
        zip_code=lambda x: x.zip_code.str[:3].astype("Int16"),
        employment_length=lambda x: x.employment_length.astype(work_dtype),
        loan_status=1,
    )
    # fmt: on
    .astype({"loan_status": "Int8"}).drop(
        columns=["risk_score_high", "risk_score_low", "annual_inc"]
    )
)
an.col_info(ds_accepted_2, style=True)
  column data_type memory_size n_unique p_unique n_missing p_missing n_dominant p_dominant p_dom_excl_na dominant
1 loan_amount float32 9.0 MB 1,572 0.1% 0 0% 187,236 8.3% 8.3% 10000.0
2 date datetime64[ns] 18.1 MB 139 <0.1% 0 0% 61,992 2.7% 2.7% 2016-03-01 00:00:00
3 loan_title object 167.9 MB 63,154 2.8% 23,326 1.0% 1,153,293 51.0% 51.5% Debt consolidation
4 debt_to_income_ratio float32 9.0 MB 10,845 0.5% 1,711 0.1% 1,732 0.1% 0.1% 0.0
5 zip_code Int16 6.8 MB 956 <0.1% 1 <0.1% 23,908 1.1% 1.1% 112
6 state category 2.3 MB 51 <0.1% 0 0% 314,533 13.9% 13.9% CA
7 employment_length category 2.3 MB 11 <0.1% 146,907 6.5% 748,005 33.1% 35.4% 10+ years
8 policy_code Int8 4.5 MB 1 <0.1% 0 0% 2,260,668 100.0% 100.0% 1
9 risk_score float16 4.5 MB 48 <0.1% 0 0% 186,580 8.3% 8.3% 662.0
10 zip_area Int16 6.8 MB 10 <0.1% 1 <0.1% 404,303 17.9% 17.9% 9
11 loan_status Int8 4.5 MB 1 <0.1% 0 0% 2,260,668 100.0% 100.0% 1
Code
ds_rejected_2 = (
    # fmt: off
    ds_rejected
    .dropna(subset=['loan_amount'])
    .drop_duplicates()
    .assign(
        date=lambda x: pd.to_datetime(x.date, format="%Y-%m-%d").dt.floor("d"),
        debt_to_income_ratio=lambda x: (
            x.debt_to_income_ratio.str.rstrip("%").astype("float32")
        ),
        state=lambda x: x.state.astype(state_dtype),
        zip_area=lambda x: x.zip_code.str[0].astype("Int16"),
        zip_code=lambda x: (
            x.zip_code.str[:3].str.replace("O|o", "0", regex=True).astype("Int16")
        ),
        employment_length=lambda x: x.employment_length.astype(work_dtype),
        loan_status=0,
    )
    # fmt: on
    .astype({"loan_status": "Int8", "risk_score": "float16"})
)[ds_accepted_2.columns]

an.col_info(ds_rejected_2, style=True)
  column data_type memory_size n_unique p_unique n_missing p_missing n_dominant p_dominant p_dom_excl_na dominant
1 loan_amount float32 110.0 MB 3,640 <0.1% 0 0% 3,889,772 14.1% 14.1% 10000.0
2 date datetime64[ns] 219.9 MB 4,238 <0.1% 0 0% 41,696 0.2% 0.2% 2018-12-04 00:00:00
3 loan_title object 2.0 GB 73,927 0.3% 1,285 <0.1% 6,362,745 23.1% 23.1% Debt consolidation
4 debt_to_income_ratio float32 110.0 MB 126,145 0.5% 0 0% 1,311,209 4.8% 4.8% 100.0
5 zip_code Int16 82.5 MB 1,000 <0.1% 292 <0.1% 262,986 1.0% 1.0% 112
6 state category 27.5 MB 51 <0.1% 22 <0.1% 3,218,415 11.7% 11.7% CA
7 employment_length category 27.5 MB 11 <0.1% 949,702 3.5% 22,841,895 83.1% 86.1% < 1 year
8 policy_code Int8 55.0 MB 2 <0.1% 918 <0.1% 27,401,835 99.7% 99.7% 0
9 risk_score float16 55.0 MB 692 <0.1% 18,359,858 66.8% 178,272 0.6% 2.0% 501.0
10 zip_area Int16 82.5 MB 10 <0.1% 292 <0.1% 4,568,853 16.6% 16.6% 3
11 loan_status Int8 55.0 MB 1 <0.1% 0 0% 27,490,787 100.0% 100.0% 0

Merge datasets by binding rows. Call the resulting dataset loans and save it into a feather file as an intermediate result.

Code
# Bind rows
loans = pd.concat([ds_accepted_2, ds_rejected_2])
Code
# Save intermediate results
loans = loans.reset_index(drop=True)
loans.to_feather(dir_interim + "task-1-1_merged.feather")
del ds_accepted, ds_rejected, ds_accepted_2, ds_rejected_2
Code
loans.shape
(29751455, 11)
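Binding rows with pd.concat stacks the two frames vertically; because the columns were aligned beforehand, no columns are added or lost. A toy sketch:

```python
import pandas as pd

# Two frames with matching columns, as after the pre-processing step
accepted = pd.DataFrame({"loan_amount": [3600.0], "loan_status": [1]})
rejected = pd.DataFrame({"loan_amount": [1000.0], "loan_status": [0]})

# Row-bind and discard the original indices
loans_toy = pd.concat([accepted, rejected], ignore_index=True)
```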
Code
an.col_info(loans, style=True)
  column data_type memory_size n_unique p_unique n_missing p_missing n_dominant p_dominant p_dom_excl_na dominant
1 loan_amount float32 119.0 MB 3,640 <0.1% 0 0% 4,077,008 13.7% 13.7% 10000.0
2 date datetime64[ns] 238.0 MB 4,238 <0.1% 0 0% 83,503 0.3% 0.3% 2018-10-01 00:00:00
3 loan_title object 2.1 GB 127,125 0.4% 24,611 0.1% 7,516,038 25.3% 25.3% Debt consolidation
4 debt_to_income_ratio float32 119.0 MB 126,161 0.4% 1,711 <0.1% 1,311,215 4.4% 4.4% 100.0
5 zip_code Int16 89.3 MB 1,000 <0.1% 293 <0.1% 286,894 1.0% 1.0% 112
6 state category 29.8 MB 51 <0.1% 22 <0.1% 3,532,948 11.9% 11.9% CA
7 employment_length category 29.8 MB 11 <0.1% 1,096,609 3.7% 23,031,883 77.4% 80.4% < 1 year
8 policy_code Int8 59.5 MB 3 <0.1% 918 <0.1% 27,401,835 92.1% 92.1% 0
9 risk_score float16 59.5 MB 693 <0.1% 18,359,858 61.7% 243,521 0.8% 2.1% 662.0
10 zip_area Int16 89.3 MB 10 <0.1% 293 <0.1% 4,880,433 16.4% 16.4% 3
11 loan_status Int8 59.5 MB 2 <0.1% 0 0% 27,490,787 92.4% 92.4% 0

3.2.2 Pre-Process Merged Data

Next, the merged dataset will be pre-processed further. Intermediate results will be saved into feather files as calculations are time-consuming.

Note

Pre-processing and EDA were iterative procedures: some pre-processing steps were devised only after EDA on the training set. In this report, the steps are presented in a sequential (non-cyclic) order.

Code
loans = pd.read_feather(dir_interim + "task-1-1_merged.feather")

Extract parts of dates and convert them to numeric values.

file = dir_interim + "task-1-2_preprocess_1_date_elements.feather"

if os.path.exists(file):
    loans = pd.read_feather(file)
else:
    loans = loans.assign(
        year=lambda df: df["date"].dt.year,
        month=lambda df: df["date"].dt.month,
    )

    loans.to_feather(file)

del file

Extract various technical features of the titles, such as length, number of words, and counts of certain characters.

file = dir_interim + "task-1-2_preprocess_2_title_stats.feather"

if os.path.exists(file):
    loans = pd.read_feather(file)
else:
    word_pattern = r"\w+((-\w)*\w)*"
    loans = loans.assign(
        title_len=lambda df: df["loan_title"].str.len(),
        title_n_capital_letters=lambda df: df["loan_title"].str.count(r"[A-Z]"),
        title_n_non_capital_letters=lambda df: df["loan_title"].str.count(r"[a-z]"),
        title_n_letters=lambda df: df.title_n_capital_letters
        + df.title_n_non_capital_letters,
        title_n_digits=lambda df: df["loan_title"].str.count(r"[0-9]"),
        title_n_punctuation=lambda df: df["loan_title"].str.count(r"[^\w\s]"),
        title_n_spaces=lambda df: df["loan_title"].str.count(r"[ ]"),
        title_n_words=lambda df: df["loan_title"].str.count(word_pattern),
        title_in_title_case=lambda df: df["loan_title"].str.istitle(),
        title_in_upper_case=lambda df: df["loan_title"].str.isupper(),
        title_in_lower_case=lambda df: df["loan_title"].str.islower(),
        title_is_missing_or_empty=lambda df: df["loan_title"].isna()
        | df["loan_title"].str.strip().eq(""),
    )

    loans.to_feather(file)

del file
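The string-based feature extraction above can be illustrated on toy titles; note how missing and whitespace-only titles are both flagged:

```python
import pandas as pd

titles = pd.Series(["Debt consolidation", None, "  "])

stats = pd.DataFrame({
    "title_len": titles.str.len(),
    # Same word pattern as used in the pre-processing step
    "title_n_words": titles.str.count(r"\w+((-\w)*\w)*"),
    # Missing titles and titles that are only whitespace
    "is_missing_or_empty": titles.isna() | titles.str.strip().eq(""),
})
```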

The titles were manually inspected, and it was noticed that some information was presented inconsistently (e.g., presence of abbreviations, inconsistent abbreviations, spelling mistakes). So the next step was to create a more unified version of the titles for further pre-processing.

file = dir_interim + "task-1-2_preprocess_3_cleaner_title.feather"

if os.path.exists(file):
    loans = pd.read_feather(file)
else:
    loans = loans.assign(
        loan_title_unified=lambda df: (
            df["loan_title"]
            .str.lower()
            .str.replace(r"[-\t_.!/ ]+", " ", regex=True)
            .str.replace(r"for", "")
            .str.replace(r"(card|loan)s?", r" \1 ", regex=True)
            .str.replace(r"pay( )*off?s?", " payoff ", regex=True)
            .str.replace(r"(cc|creditcard|^c c )", " credit card ", regex=True)
            .str.replace(r"(de[bp]i?t)|(deb( |$))s?", " debt ", regex=True)
            .str.replace(r"debt( )*cons?( |$)", "debt consolidation", regex=True)
            .str.replace(r"re fi", "refi")
            .str.replace(
                r"consolidat(e|ing|ions|or)|"  # various endings
                r"conso(l(id)?(atio)?)?( |$)|"  # various abbreviations
                r"con[sc][oia]?l?[iao]?[dt]?a?[tc]ion",  # various misspellings
                "consolidation",
                regex=True,
            )
            .str.replace(r"[ ]+", " ", regex=True)
            .str.strip()
            .replace("", pd.NA)
            .replace(np.nan, pd.NA)
        )
    )

    loans.to_feather(file)

del file
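A cut-down version of the cleaning chain shows the general idea on toy titles (only a few of the replacement rules are reproduced here):

```python
import pandas as pd

titles = pd.Series(["Debt-Consolidation!!", "CC payoff"])

unified = (
    titles.str.lower()
    # Turn separators and punctuation into single spaces
    .str.replace(r"[-\t_.!/ ]+", " ", regex=True)
    # Unify spellings such as "pay off" / "payof"
    .str.replace(r"pay( )*off?s?", " payoff ", regex=True)
    # Collapse repeated spaces and trim
    .str.replace(r"[ ]+", " ", regex=True)
    .str.strip()
)
```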

Extract the presence/absence of certain words in the titles. These words were chosen subjectively, based on manual inspection of the titles.

file = dir_interim + "task-1-2_preprocess_4_extract_title_features.feather"

if os.path.exists(file):
    loans = pd.read_feather(file)
else:
    # Extract features from the loan title
    loans = loans.assign(
        credit=lambda df: df["loan_title_unified"].str.contains(r"credit"),
        card=lambda df: df["loan_title_unified"].str.contains(r"card"),
        refinancing=lambda df: df["loan_title_unified"].str.contains(r"refi"),
        consolidation=lambda df: df["loan_title_unified"].str.contains(r"consolidat"),
        debt_related=lambda df: df["loan_title_unified"].str.contains(r"debt"),
        bills_taxes=lambda df: df["loan_title_unified"].str.contains(r"bill|tax|rent"),
        payoff=lambda df: df["loan_title_unified"].str.contains(r"payoff|payitoff"),
        home_upgrade=lambda df: df["loan_title_unified"].str.contains(
            r"home i[nm]p|home re[np]|home upgrade|renovation|"
            r"roof|furniture|kitchen|bathroom|pool|windows"
        ),
        home_related=lambda df: df["loan_title_unified"].str.contains(
            r"home|house|apartment|flat"
        ),
        home_buying=lambda df: df["loan_title_unified"].str.contains(
            r"home buying|new home|buy house|new house|my house"
        ),
        fixing=lambda df: df["loan_title_unified"].str.contains(
            r"fix|i[nm]prov|imp$|repair|upgrade|renovat"
        ),
        major_purchase_unspecified=lambda df: df["loan_title_unified"].str.contains(
            r"major purchase"
        ),
        relocation=lambda df: df["loan_title_unified"].str.contains(
            r"moving|move|relocate|relocation(?! forward)"
        ),
        weddings=lambda df: df["loan_title_unified"].str.contains(
            r"wedding|engagement|marr[iy]"
        ),
        car=lambda df: df["loan_title_unified"].str.contains(r"car|auto|jeep"),
        motorcycle=lambda df: df["loan_title_unified"].str.contains(
            r"motorcycle|harley"
        ),
        vehicle_unspecified_or_other=lambda df: df["loan_title_unified"].str.contains(
            r"truck|atv|boat|camper|vehicle"
        ),
        medical_expenses=lambda df: df["loan_title_unified"].str.contains(
            r"med|health|surgery|dental"
        ),
        education=lambda df: df["loan_title_unified"].str.contains(
            r"student|college|school|educat|learn"
        ),
        investment=lambda df: df["loan_title_unified"].str.contains(r"invest"),
        vocation=lambda df: df["loan_title_unified"].str.contains(
            r"vacation|travel|holiday|spa"
        ),
        renewable_energy=lambda df: df["loan_title_unified"].str.contains(
            r"renewable energy|green"
        ),
    )

    # Change data types
    data_types_dict = {
        "loan_status": "Int8",
        "year": "Int16",
        "month": "Int8",
        "risk_score": "float32",
        "policy_code": "Int8",
        "zip_area": "Int16",
        "zip_code": "Int16",
        "title_len": "float32",
        "title_n_capital_letters": "float32",
        "title_n_non_capital_letters": "float32",
        "title_n_letters": "float32",
        "title_n_digits": "float32",
        "title_n_punctuation": "float32",
        "title_n_spaces": "float32",
        "title_n_words": "float32",
        "title_in_title_case": "Int8",
        "title_in_upper_case": "Int8",
        "title_in_lower_case": "Int8",
        "title_is_missing_or_empty": "Int8",
        "credit": "Int8",
        "card": "Int8",
        "refinancing": "Int8",
        "consolidation": "Int8",
        "debt_related": "Int8",
        "bills_taxes": "Int8",
        "payoff": "Int8",
        "home_upgrade": "Int8",
        "home_related": "Int8",
        "home_buying": "Int8",
        "fixing": "Int8",
        "major_purchase_unspecified": "Int8",
        "relocation": "Int8",
        "weddings": "Int8",
        "car": "Int8",
        "motorcycle": "Int8",
        "vehicle_unspecified_or_other": "Int8",
        "medical_expenses": "Int8",
        "education": "Int8",
        "investment": "Int8",
        "vocation": "Int8",
        "renewable_energy": "Int8",
    }
    loans = loans.astype(data_types_dict)

    # Save
    loans.to_feather(file)

del file

The following variables were created after the first round of EDA on the training data (the subset dedicated to EDA):

file = dir_interim + "task-1-2_preprocess_5_add_more_features.feather"

if os.path.exists(file):
    loans = pd.read_feather(file)
else:
    # Add more features
    loans = loans.assign(
        loan_amount_above_40k=lambda df: (df.loan_amount > 40000).astype("Int8"),
        loan_amount_log=lambda df: np.log1p(df.loan_amount),
        loan_amount_cap_40k=lambda df: df.loan_amount.clip(upper=40000),
        # Leave the original values as they are for EDA
        debt_to_income_ratio_original=lambda df: df.debt_to_income_ratio,
        debt_to_income_ratio_orig_cap_100=lambda df: df.debt_to_income_ratio.clip(
            upper=100
        ),
        debt_to_income_ratio_is_na_original=lambda df: df.debt_to_income_ratio_original.isna().astype(
            "Int8"
        ),
        # Replace -1 with NA
        debt_to_income_ratio=lambda df: df.debt_to_income_ratio.replace(
            -1, np.nan
        ).astype("float32"),
        debt_to_income_ratio_cap_100=lambda df: df.debt_to_income_ratio.clip(upper=100),
        employment_length_num=lambda df: df.employment_length.replace(
            dict(zip(utils.work_categories, range(len(utils.work_categories))))
        ).astype("Int8"),
        employment_length_is_na=lambda df: df.employment_length.isna().astype("Int8"),
        debt_to_income_ratio_is_na=lambda df: (
            df.debt_to_income_ratio.isna().astype("Int8")
        ),
        zip_code_is_na=lambda df: df.zip_code.isna().astype("Int8"),
        risk_score_is_na=lambda df: df.risk_score.isna().astype("Int8"),
    )

    # Sin/Cos transformation for months
    loans = (
        CyclicalFeatures("month", {"month": 12})
        .fit_transform(loans)
        .astype({"month_sin": "float32", "month_cos": "float32"})
    )

    # Re-order columns
    new_column_order = [
        "loan_status",
        "policy_code",
        "date",
        "year",
        "month",
        "month_sin",
        "month_cos",
        "loan_amount",
        "loan_amount_log",
        "loan_amount_cap_40k",
        "loan_amount_above_40k",
        "debt_to_income_ratio",
        "debt_to_income_ratio_cap_100",
        "debt_to_income_ratio_is_na",
        "debt_to_income_ratio_original",
        "debt_to_income_ratio_orig_cap_100",
        "debt_to_income_ratio_is_na_original",
        "risk_score",
        "risk_score_is_na",
        "employment_length",
        "employment_length_num",
        "employment_length_is_na",
        "state",
        "zip_area",
        "zip_code",
        "zip_code_is_na",
        "loan_title",
        "loan_title_unified",
        "title_is_missing_or_empty",
        "title_len",
        "title_n_letters",
        "title_n_capital_letters",
        "title_n_non_capital_letters",
        "title_n_digits",
        "title_n_punctuation",
        "title_n_spaces",
        "title_n_words",
        "title_in_title_case",
        "title_in_upper_case",
        "title_in_lower_case",
        "credit",
        "card",
        "refinancing",
        "consolidation",
        "debt_related",
        "bills_taxes",
        "payoff",
        "home_upgrade",
        "home_related",
        "home_buying",
        "fixing",
        "major_purchase_unspecified",
        "relocation",
        "weddings",
        "car",
        "motorcycle",
        "vehicle_unspecified_or_other",
        "medical_expenses",
        "education",
        "investment",
        "vocation",
        "renewable_energy",
    ]
    loans = loans[new_column_order]
    # Save intermediate results
    loans.to_feather(file)

del file
# Time: 1m 39.1s
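Assuming the standard cyclical encoding (which, to the best of my understanding, is what CyclicalFeatures from the feature_engine package computes), the month_sin/month_cos columns map months onto a circle so that December sits next to January. A minimal numpy sketch:

```python
import numpy as np

months = np.array([1, 6, 12])

# Standard sin/cos cyclical encoding with period 12
month_sin = np.sin(2 * np.pi * months / 12)
month_cos = np.cos(2 * np.pi * months / 12)
# Each month lands on the unit circle; month 12 coincides with angle 0
```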

3.2.3 Inspect After Pre-Processing

In this section, the dataset will be inspected for the most obvious issues before splitting it into training and testing subsets and performing EDA.

Code
loans.shape
(29751455, 62)
Code
loans.info(verbose=False)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29751455 entries, 0 to 29751454
Columns: 62 entries, loan_status to renewable_energy
dtypes: Int16(3), Int8(36), category(2), datetime64[ns](1), float32(18), object(2)
memory usage: 5.0+ GB
Code
an.col_info(loans, style=True)
  column data_type memory_size n_unique p_unique n_missing p_missing n_dominant p_dominant p_dom_excl_na dominant
1 loan_status Int8 59.5 MB 2 <0.1% 0 0% 27,490,787 92.4% 92.4% 0
2 policy_code Int8 59.5 MB 3 <0.1% 918 <0.1% 27,401,835 92.1% 92.1% 0
3 date datetime64[ns] 238.0 MB 4,238 <0.1% 0 0% 83,503 0.3% 0.3% 2018-10-01 00:00:00
4 year Int16 89.3 MB 12 <0.1% 0 0% 9,910,336 33.3% 33.3% 2018
5 month Int8 59.5 MB 12 <0.1% 0 0% 2,928,639 9.8% 9.8% 10
6 month_sin float32 119.0 MB 8 <0.1% 0 0% 5,715,577 19.2% 19.2% -0.8660254
7 month_cos float32 119.0 MB 8 <0.1% 0 0% 5,189,861 17.4% 17.4% -0.8660254
8 loan_amount float32 119.0 MB 3,640 <0.1% 0 0% 4,077,008 13.7% 13.7% 10000.0
9 loan_amount_log float32 119.0 MB 3,640 <0.1% 0 0% 4,077,008 13.7% 13.7% 9.210441
10 loan_amount_cap_40k float32 119.0 MB 2,074 <0.1% 0 0% 4,077,008 13.7% 13.7% 10000.0
11 loan_amount_above_40k Int8 59.5 MB 2 <0.1% 0 0% 29,577,932 99.4% 99.4% 0
12 debt_to_income_ratio float32 119.0 MB 126,160 0.4% 1,167,451 3.9% 1,311,215 4.4% 4.6% 100.0
13 debt_to_income_ratio_cap_100 float32 119.0 MB 10,001 <0.1% 1,167,451 3.9% 2,118,993 7.1% 7.4% 100.0
14 debt_to_income_ratio_is_na Int8 59.5 MB 2 <0.1% 0 0% 28,584,004 96.1% 96.1% 0
15 debt_to_income_ratio_original float32 119.0 MB 126,161 0.4% 1,711 <0.1% 1,311,215 4.4% 4.4% 100.0
16 debt_to_income_ratio_orig_cap_100 float32 119.0 MB 10,002 <0.1% 1,711 <0.1% 2,118,993 7.1% 7.1% 100.0
17 debt_to_income_ratio_is_na_original Int8 59.5 MB 2 <0.1% 0 0% 29,749,744 >99.9% >99.9% 0
18 risk_score float32 119.0 MB 693 <0.1% 18,359,858 61.7% 243,521 0.8% 2.1% 662.0
19 risk_score_is_na Int8 59.5 MB 2 <0.1% 0 0% 18,359,858 61.7% 61.7% 1
20 employment_length category 29.8 MB 11 <0.1% 1,096,609 3.7% 23,031,883 77.4% 80.4% < 1 year
21 employment_length_num Int8 59.5 MB 11 <0.1% 1,096,609 3.7% 23,031,883 77.4% 80.4% 0
22 employment_length_is_na Int8 59.5 MB 2 <0.1% 0 0% 28,654,846 96.3% 96.3% 0
23 state category 29.8 MB 51 <0.1% 22 <0.1% 3,532,948 11.9% 11.9% CA
24 zip_area Int16 89.3 MB 10 <0.1% 293 <0.1% 4,880,433 16.4% 16.4% 3
25 zip_code Int16 89.3 MB 1,000 <0.1% 293 <0.1% 286,894 1.0% 1.0% 112
26 zip_code_is_na Int8 59.5 MB 2 <0.1% 0 0% 29,751,162 >99.9% >99.9% 0
27 loan_title object 2.1 GB 127,125 0.4% 24,611 0.1% 7,516,038 25.3% 25.3% Debt consolidation
28 loan_title_unified object 2.1 GB 99,875 0.3% 39,845 0.1% 13,424,529 45.1% 45.2% debt consolidation
29 title_is_missing_or_empty Int8 59.5 MB 2 <0.1% 0 0% 29,711,614 99.9% 99.9% 0
30 title_len float32 119.0 MB 195 <0.1% 24,611 0.1% 13,434,558 45.2% 45.2% 18.0
31 title_n_letters float32 119.0 MB 182 <0.1% 24,611 0.1% 13,439,150 45.2% 45.2% 17.0
32 title_n_capital_letters float32 119.0 MB 51 <0.1% 24,611 0.1% 16,695,522 56.1% 56.2% 1.0
33 title_n_non_capital_letters float32 119.0 MB 172 <0.1% 24,611 0.1% 7,532,772 25.3% 25.3% 16.0
34 title_n_digits float32 119.0 MB 28 <0.1% 24,611 0.1% 29,711,627 99.9% 99.9% 0.0
35 title_n_punctuation float32 119.0 MB 37 <0.1% 24,611 0.1% 29,701,231 99.8% 99.9% 0.0
36 title_n_spaces float32 119.0 MB 96 <0.1% 24,611 0.1% 15,653,630 52.6% 52.7% 0.0
37 title_n_words float32 119.0 MB 95 <0.1% 24,611 0.1% 15,656,042 52.6% 52.7% 1.0
38 title_in_title_case Int8 59.5 MB 2 <0.1% 24,611 0.1% 26,103,081 87.7% 87.8% 0
39 title_in_upper_case Int8 59.5 MB 2 <0.1% 24,611 0.1% 29,713,719 99.9% >99.9% 0
40 title_in_lower_case Int8 59.5 MB 2 <0.1% 24,611 0.1% 17,200,037 57.8% 57.9% 0
41 credit Int8 59.5 MB 2 <0.1% 39,845 0.1% 25,539,395 85.8% 86.0% 0
42 card Int8 59.5 MB 2 <0.1% 39,845 0.1% 25,544,566 85.9% 86.0% 0
43 refinancing Int8 59.5 MB 2 <0.1% 39,845 0.1% 26,942,551 90.6% 90.7% 0
44 consolidation Int8 59.5 MB 2 <0.1% 39,845 0.1% 16,231,183 54.6% 54.6% 0
45 debt_related Int8 59.5 MB 2 <0.1% 39,845 0.1% 16,239,908 54.6% 54.7% 0
46 bills_taxes Int8 59.5 MB 2 <0.1% 39,845 0.1% 29,701,513 99.8% >99.9% 0
47 payoff Int8 59.5 MB 2 <0.1% 39,845 0.1% 29,686,528 99.8% 99.9% 0
48 home_upgrade Int8 59.5 MB 2 <0.1% 39,845 0.1% 28,374,364 95.4% 95.5% 0
49 home_related Int8 59.5 MB 2 <0.1% 39,845 0.1% 27,681,412 93.0% 93.2% 0
50 home_buying Int8 59.5 MB 2 <0.1% 39,845 0.1% 29,206,602 98.2% 98.3% 0
51 fixing Int8 59.5 MB 2 <0.1% 39,845 0.1% 28,373,548 95.4% 95.5% 0
52 major_purchase_unspecified Int8 59.5 MB 2 <0.1% 39,845 0.1% 28,711,014 96.5% 96.6% 0
53 relocation Int8 59.5 MB 2 <0.1% 39,845 0.1% 28,995,543 97.5% 97.6% 0
54 weddings Int8 59.5 MB 2 <0.1% 39,845 0.1% 29,688,728 99.8% 99.9% 0
55 car Int8 59.5 MB 2 <0.1% 39,845 0.1% 24,230,788 81.4% 81.6% 0
56 motorcycle Int8 59.5 MB 2 <0.1% 39,845 0.1% 29,710,218 99.9% >99.9% 0
57 vehicle_unspecified_or_other Int8 59.5 MB 2 <0.1% 39,845 0.1% 29,709,495 99.9% >99.9% 0
58 medical_expenses Int8 59.5 MB 2 <0.1% 39,845 0.1% 28,935,736 97.3% 97.4% 0
59 education Int8 59.5 MB 2 <0.1% 39,845 0.1% 29,697,558 99.8% >99.9% 0
60 investment Int8 59.5 MB 2 <0.1% 39,845 0.1% 29,709,698 99.9% >99.9% 0
61 vocation Int8 59.5 MB 2 <0.1% 39,845 0.1% 29,387,506 98.8% 98.9% 0
62 renewable_energy Int8 59.5 MB 2 <0.1% 39,845 0.1% 29,653,502 99.7% 99.8% 0
Code
loans.describe().T
count mean min 25% 50% 75% max std
loan_status 29751455.00 0.08 0.00 0.00 0.00 0.00 1.00 0.26
policy_code 29750537.00 0.08 0.00 0.00 0.00 0.00 2.00 0.28
date 29751455 2016-12-17 10:41:14.614325760 2007-05-26 00:00:00 2016-01-23 00:00:00 2017-05-28 00:00:00 2018-04-20 00:00:00 2018-12-31 00:00:00 NaN
year 29751455.00 2016.43 2007.00 2016.00 2017.00 2018.00 2018.00 1.68
month 29751455.00 6.93 1.00 4.00 7.00 10.00 12.00 3.37
month_sin 29751455.00 -0.08 -1.00 -0.87 -0.00 0.50 1.00 0.69
month_cos 29751455.00 -0.01 -1.00 -0.87 -0.00 0.87 1.00 0.73
loan_amount 29751455.00 13251.77 0.00 5000.00 10000.00 20000.00 1400000.00 13716.20
loan_amount_log 29751455.00 9.04 0.00 8.52 9.21 9.90 14.15 0.98
loan_amount_cap_40k 29751455.00 12805.58 0.00 5000.00 10000.00 20000.00 40000.00 10199.33
loan_amount_above_40k 29751455.00 0.01 0.00 0.00 0.00 0.00 1.00 0.08
debt_to_income_ratio 28584004.00 139.46 0.00 10.00 20.58 35.88 50000032.00 10348.52
debt_to_income_ratio_cap_100 28584004.00 27.91 0.00 10.00 20.58 35.88 100.00 25.35
debt_to_income_ratio_is_na 29751455.00 0.04 0.00 0.00 0.00 0.00 1.00 0.19
debt_to_income_ratio_original 29749744.00 133.95 -1.00 8.58 19.67 35.03 50000032.00 10143.76
debt_to_income_ratio_orig_cap_100 29749744.00 26.77 -1.00 8.58 19.67 35.03 100.00 25.14
debt_to_income_ratio_is_na_original 29751455.00 0.00 0.00 0.00 0.00 0.00 1.00 0.01
risk_score 11391597.00 642.54 0.00 605.00 655.00 689.00 990.00 96.22
risk_score_is_na 29751455.00 0.62 0.00 0.00 1.00 1.00 1.00 0.49
employment_length_num 28654846.00 1.10 0.00 0.00 0.00 0.00 10.00 2.54
employment_length_is_na 29751455.00 0.04 0.00 0.00 0.00 0.00 1.00 0.19
zip_area 29751162.00 4.57 0.00 2.00 4.00 7.00 9.00 2.98
zip_code 29751162.00 502.90 0.00 274.00 452.00 781.00 999.00 299.15
zip_code_is_na 29751455.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00
title_is_missing_or_empty 29751455.00 0.00 0.00 0.00 0.00 0.00 1.00 0.04
title_len 29726844.00 14.62 1.00 11.00 18.00 18.00 3572.00 5.99
title_n_letters 29726844.00 13.74 0.00 10.00 17.00 17.00 2655.00 5.13
title_n_capital_letters 29726844.00 0.60 0.00 0.00 1.00 1.00 283.00 0.54
title_n_non_capital_letters 29726844.00 13.14 0.00 9.00 16.00 17.00 2433.00 5.04
title_n_digits 29726844.00 0.00 0.00 0.00 0.00 0.00 175.00 0.08
title_n_punctuation 29726844.00 0.00 0.00 0.00 0.00 0.00 288.00 0.09
title_n_spaces 29726844.00 0.59 0.00 0.00 0.00 1.00 650.00 0.65
title_n_words 29726844.00 1.59 0.00 1.00 1.00 2.00 618.00 0.81
title_in_title_case 29726844.00 0.12 0.00 0.00 0.00 0.00 1.00 0.33
title_in_upper_case 29726844.00 0.00 0.00 0.00 0.00 0.00 1.00 0.02
title_in_lower_case 29726844.00 0.42 0.00 0.00 0.00 1.00 1.00 0.49
credit 29711610.00 0.14 0.00 0.00 0.00 0.00 1.00 0.35
card 29711610.00 0.14 0.00 0.00 0.00 0.00 1.00 0.35
refinancing 29711610.00 0.09 0.00 0.00 0.00 0.00 1.00 0.29
consolidation 29711610.00 0.45 0.00 0.00 0.00 1.00 1.00 0.50
debt_related 29711610.00 0.45 0.00 0.00 0.00 1.00 1.00 0.50
bills_taxes 29711610.00 0.00 0.00 0.00 0.00 0.00 1.00 0.02
payoff 29711610.00 0.00 0.00 0.00 0.00 0.00 1.00 0.03
home_upgrade 29711610.00 0.05 0.00 0.00 0.00 0.00 1.00 0.21
home_related 29711610.00 0.07 0.00 0.00 0.00 0.00 1.00 0.25
home_buying 29711610.00 0.02 0.00 0.00 0.00 0.00 1.00 0.13
fixing 29711610.00 0.05 0.00 0.00 0.00 0.00 1.00 0.21
major_purchase_unspecified 29711610.00 0.03 0.00 0.00 0.00 0.00 1.00 0.18
relocation 29711610.00 0.02 0.00 0.00 0.00 0.00 1.00 0.15
weddings 29711610.00 0.00 0.00 0.00 0.00 0.00 1.00 0.03
car 29711610.00 0.18 0.00 0.00 0.00 0.00 1.00 0.39
motorcycle 29711610.00 0.00 0.00 0.00 0.00 0.00 1.00 0.01
vehicle_unspecified_or_other 29711610.00 0.00 0.00 0.00 0.00 0.00 1.00 0.01
medical_expenses 29711610.00 0.03 0.00 0.00 0.00 0.00 1.00 0.16
education 29711610.00 0.00 0.00 0.00 0.00 0.00 1.00 0.02
investment 29711610.00 0.00 0.00 0.00 0.00 0.00 1.00 0.01
vocation 29711610.00 0.01 0.00 0.00 0.00 0.00 1.00 0.10
renewable_energy 29711610.00 0.00 0.00 0.00 0.00 0.00 1.00 0.04
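The summaries of month_sin and month_cos above (8 distinct values each, dominant value -0.8660254) are consistent with the standard cyclical encoding of month. This is an assumption about how these features were derived, sketched below for illustration:

```python
# Assumed cyclical encoding of month: sin(2*pi*m/12), cos(2*pi*m/12) in float32.
# The 8 distinct values per feature match the n_unique counts reported above.
import numpy as np

months = np.arange(1, 13)
month_sin = np.sin(2 * np.pi * months / 12).astype(np.float32)
month_cos = np.cos(2 * np.pi * months / 12).astype(np.float32)

print(len(np.unique(month_sin)), len(np.unique(month_cos)))  # 8 8
```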

3.2.4 Split Data into Train, Validation, and Test Sets

Code
# If the analysis is restored from this point, load the data
if "loans" not in locals():
    loans = pd.read_feather(
        dir_interim + "task-1-2_preprocess_5_add_more_features.feather"
    )

Remove columns that are not needed for the analysis and that consume a lot of memory.

Code
loans = loans.drop(columns=["loan_title", "loan_title_unified"])
  • Split data into training, validation, and test sets (70%:15%:15%).
  • Stratification by loan status is used to ensure that the proportion of accepted and rejected loans is the same in all sets.
Code
# Train, validation, test split (stratified by loan status): 70%:15%:15%

path_train = dir_interim + "task-1-loans_train.feather"
path_validation = dir_interim + "task-1-loans_validation.feather"
path_test = dir_interim + "task-1-loans_test.feather"

if (
    os.path.exists(path_train)
    and os.path.exists(path_validation)
    and os.path.exists(path_test)
):
    # Load saved results, if they exist
    loans_train = pd.read_feather(path_train)
    loans_validation = pd.read_feather(path_validation)
    loans_test = pd.read_feather(path_test)
else:
    # Split data
    loans_train, loans_validation = train_test_split(
        loans, test_size=0.3, random_state=42, stratify=loans.loan_status
    )

    loans_validation, loans_test = train_test_split(
        loans_validation,
        test_size=0.5,
        random_state=42,
        stratify=loans_validation.loan_status,
    )

    # Save as feather files
    loans_train.to_feather(path_train)
    loans_validation.to_feather(path_validation)
    loans_test.to_feather(path_test)
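The effect of the stratify argument can be verified on synthetic data (a sketch, since the real loans frame is project-specific; ~8% accepted, as in the actual data): the class proportion stays essentially identical across all three sets.

```python
# Sanity check of the stratified 70%:15%:15% split on a synthetic stand-in
# for `loans` with an ~8% positive class.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"loan_status": [1] * 800 + [0] * 9200})

train, rest = train_test_split(
    df, test_size=0.3, random_state=42, stratify=df.loan_status
)
validation, test = train_test_split(
    rest, test_size=0.5, random_state=42, stratify=rest.loan_status
)

for name, part in [("train", train), ("validation", validation), ("test", test)]:
    print(f"{name}: {len(part)} rows, {part.loan_status.mean():.1%} accepted")
```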
Code
if "loans" in locals():
    del loans

3.3 EDA of All Years

To make it faster, only 10% of the training data (2.1 million cases) will be used for EDA.

Several trends differ between years. For example:

  • The number of applications grows each year, peaking in 2018 (the most recent year; see Figure 3.1).
  • The percentage of accepted loan applications differs significantly between years (chi-squared test, p-value < 0.001) with a maximum of 14.9% of accepted loans in 2013 and a minimum of 5.0% in 2018 (see Table 3.1).
  • The category “Employment length of 5 years” seems to be over-represented in 2014, 2015, and especially in 2016 and 2017, while in 2018 this over-representation is less obvious (if present at all).
  • Refinancing (Figure 3.5) and home buying (Figure 3.6) purposes show an increase in the year 2018 compared to previous years.
  • In the year 2018, no missing titles (Figure 3.4), no lower-case titles (Figure 3.3), and some other title properties suggest that there was some standardization in data collection procedures at Lending Club.

These findings suggest that data from previous years may be less relevant for predicting future trends.

Find more EDA results below.

Code
file_path = dir_interim + "task-1-loans_eda.feather"

if os.path.exists(file_path):
    # Load from file, if present.
    loans_eda = pd.read_feather(file_path)
else:
    # Do calculations.
    loans_eda = loans_train.sample(frac=0.1, random_state=20)
    loans_eda.to_feather(file_path)
Code
print(f"{loans_train.shape[0]/1e6:.1f}M rows in training set.")
print(f"{loans_eda.shape[0]/1e6: .1f}M rows in EDA sub-set (from training set).")
20.8M rows in training set.
 2.1M rows in EDA sub-set (from training set).
Column info
Code
an.col_info(loans_eda, style=True)
  column data_type memory_size n_unique p_unique n_missing p_missing n_dominant p_dominant p_dom_excl_na dominant
1 loan_status Int8 4.2 MB 2 <0.1% 0 0% 1,924,091 92.4% 92.4% 0
2 policy_code Int8 4.2 MB 3 <0.1% 70 <0.1% 1,917,868 92.1% 92.1% 0
3 date datetime64[ns] 16.7 MB 4,155 0.2% 0 0% 5,781 0.3% 0.3% 2018-10-01 00:00:00
4 year Int16 6.2 MB 12 <0.1% 0 0% 694,223 33.3% 33.3% 2018
5 month Int8 4.2 MB 12 <0.1% 0 0% 204,881 9.8% 9.8% 10
6 month_sin float32 8.3 MB 8 <0.1% 0 0% 399,619 19.2% 19.2% -0.8660254
7 month_cos float32 8.3 MB 8 <0.1% 0 0% 363,560 17.5% 17.5% -0.8660254
8 loan_amount float32 8.3 MB 2,095 0.1% 0 0% 285,145 13.7% 13.7% 10000.0
9 loan_amount_log float32 8.3 MB 2,095 0.1% 0 0% 285,145 13.7% 13.7% 9.210441
10 loan_amount_cap_40k float32 8.3 MB 1,600 0.1% 0 0% 285,145 13.7% 13.7% 10000.0
11 loan_amount_above_40k Int8 4.2 MB 2 <0.1% 0 0% 2,070,516 99.4% 99.4% 0
12 debt_to_income_ratio float32 8.3 MB 39,432 1.9% 81,734 3.9% 91,623 4.4% 4.6% 100.0
13 debt_to_income_ratio_cap_100 float32 8.3 MB 9,994 0.5% 81,734 3.9% 148,622 7.1% 7.4% 100.0
14 debt_to_income_ratio_is_na Int8 4.2 MB 2 <0.1% 0 0% 2,000,868 96.1% 96.1% 0
15 debt_to_income_ratio_original float32 8.3 MB 39,433 1.9% 125 <0.1% 91,623 4.4% 4.4% 100.0
16 debt_to_income_ratio_orig_cap_100 float32 8.3 MB 9,995 0.5% 125 <0.1% 148,622 7.1% 7.1% 100.0
17 debt_to_income_ratio_is_na_original Int8 4.2 MB 2 <0.1% 0 0% 2,082,477 >99.9% >99.9% 0
18 risk_score float32 8.3 MB 654 <0.1% 1,285,066 61.7% 16,923 0.8% 2.1% 662.0
19 risk_score_is_na Int8 4.2 MB 2 <0.1% 0 0% 1,285,066 61.7% 61.7% 1
20 employment_length category 2.1 MB 11 <0.1% 76,401 3.7% 1,612,460 77.4% 80.4% < 1 year
21 employment_length_num Int8 4.2 MB 11 <0.1% 76,401 3.7% 1,612,460 77.4% 80.4% 0
22 employment_length_is_na Int8 4.2 MB 2 <0.1% 0 0% 2,006,201 96.3% 96.3% 0
23 state category 2.1 MB 51 <0.1% 3 <0.1% 247,753 11.9% 11.9% CA
24 zip_area Int16 6.2 MB 10 <0.1% 27 <0.1% 340,617 16.4% 16.4% 3
25 zip_code Int16 6.2 MB 986 <0.1% 27 <0.1% 20,130 1.0% 1.0% 112
26 zip_code_is_na Int8 4.2 MB 2 <0.1% 0 0% 2,082,575 >99.9% >99.9% 0
27 title_is_missing_or_empty Int8 4.2 MB 2 <0.1% 0 0% 2,079,742 99.9% 99.9% 0
28 title_len float32 8.3 MB 79 <0.1% 1,804 0.1% 940,825 45.2% 45.2% 18.0
29 title_n_letters float32 8.3 MB 74 <0.1% 1,804 0.1% 941,060 45.2% 45.2% 17.0
30 title_n_capital_letters float32 8.3 MB 38 <0.1% 1,804 0.1% 1,168,488 56.1% 56.2% 1.0
31 title_n_non_capital_letters float32 8.3 MB 67 <0.1% 1,804 0.1% 527,687 25.3% 25.4% 16.0
32 title_n_digits float32 8.3 MB 14 <0.1% 1,804 0.1% 2,079,718 99.9% 99.9% 0.0
33 title_n_punctuation float32 8.3 MB 16 <0.1% 1,804 0.1% 2,078,989 99.8% 99.9% 0.0
34 title_n_spaces float32 8.3 MB 28 <0.1% 1,804 0.1% 1,095,780 52.6% 52.7% 0.0
35 title_n_words float32 8.3 MB 28 <0.1% 1,804 0.1% 1,095,949 52.6% 52.7% 1.0
36 title_in_title_case Int8 4.2 MB 2 <0.1% 1,804 0.1% 1,827,590 87.8% 87.8% 0
37 title_in_upper_case Int8 4.2 MB 2 <0.1% 1,804 0.1% 2,079,823 99.9% >99.9% 0
38 title_in_lower_case Int8 4.2 MB 2 <0.1% 1,804 0.1% 1,203,682 57.8% 57.8% 0
39 credit Int8 4.2 MB 2 <0.1% 2,860 0.1% 1,787,690 85.8% 86.0% 0
40 card Int8 4.2 MB 2 <0.1% 2,860 0.1% 1,788,076 85.9% 86.0% 0
41 refinancing Int8 4.2 MB 2 <0.1% 2,860 0.1% 1,885,717 90.5% 90.7% 0
42 consolidation Int8 4.2 MB 2 <0.1% 2,860 0.1% 1,135,780 54.5% 54.6% 0
43 debt_related Int8 4.2 MB 2 <0.1% 2,860 0.1% 1,136,445 54.6% 54.6% 0
44 bills_taxes Int8 4.2 MB 2 <0.1% 2,860 0.1% 2,079,041 99.8% >99.9% 0
45 payoff Int8 4.2 MB 2 <0.1% 2,860 0.1% 2,077,970 99.8% 99.9% 0
46 home_upgrade Int8 4.2 MB 2 <0.1% 2,860 0.1% 1,986,705 95.4% 95.5% 0
47 home_related Int8 4.2 MB 2 <0.1% 2,860 0.1% 1,938,471 93.1% 93.2% 0
48 home_buying Int8 4.2 MB 2 <0.1% 2,860 0.1% 2,044,582 98.2% 98.3% 0
49 fixing Int8 4.2 MB 2 <0.1% 2,860 0.1% 1,986,669 95.4% 95.5% 0
50 major_purchase_unspecified Int8 4.2 MB 2 <0.1% 2,860 0.1% 2,009,575 96.5% 96.6% 0
51 relocation Int8 4.2 MB 2 <0.1% 2,860 0.1% 2,029,413 97.4% 97.6% 0
52 weddings Int8 4.2 MB 2 <0.1% 2,860 0.1% 2,078,101 99.8% 99.9% 0
53 car Int8 4.2 MB 2 <0.1% 2,860 0.1% 1,696,235 81.4% 81.6% 0
54 motorcycle Int8 4.2 MB 2 <0.1% 2,860 0.1% 2,079,642 99.9% >99.9% 0
55 vehicle_unspecified_or_other Int8 4.2 MB 2 <0.1% 2,860 0.1% 2,079,573 99.9% >99.9% 0
56 medical_expenses Int8 4.2 MB 2 <0.1% 2,860 0.1% 2,024,973 97.2% 97.4% 0
57 education Int8 4.2 MB 2 <0.1% 2,860 0.1% 2,078,790 99.8% >99.9% 0
58 investment Int8 4.2 MB 2 <0.1% 2,860 0.1% 2,079,601 99.9% >99.9% 0
59 vocation Int8 4.2 MB 2 <0.1% 2,860 0.1% 2,056,743 98.8% 98.9% 0
60 renewable_energy Int8 4.2 MB 2 <0.1% 2,860 0.1% 2,075,725 99.7% 99.8% 0
EDA: Sweetviz report
Code
if do_eda:
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")

        report_training_1 = sweetviz.analyze(
            [loans_eda, "Training data"],
            target_feat="loan_status",
            pairwise_analysis="off",
        )
        report_training_1.show_notebook()
crosstab_status_by_year = an.CrossTab("year", "loan_status", data=loans_eda)
crosstab_status_by_year().assign(
    percent_accepted=lambda df: df[1] / (df[1] + df[0]) * 100
)
Table 3.1. Percentage of accepted loans by year.
loan_status 0 1 percent_accepted
year
2007 401 46 10.29
2008 1833 189 9.35
2009 3954 351 8.15
2010 7859 876 10.03
2011 15226 1491 8.92
2012 23612 3722 13.62
2013 52973 9283 14.91
2014 134360 16686 11.05
2015 199956 29476 12.85
2016 333098 30465 8.38
2017 491675 30847 5.90
2018 659144 35079 5.05
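The chi-squared result reported in the annotation of Figure 3.1 (produced by the project's chisq_test helper) can be cross-checked directly with scipy on the counts from Table 3.1:

```python
# Chi-squared test of independence between year and loan status,
# using the rejected (0) and accepted (1) counts from Table 3.1.
import numpy as np
from scipy.stats import chi2_contingency

rejected = [401, 1833, 3954, 7859, 15226, 23612, 52973,
            134360, 199956, 333098, 491675, 659144]
accepted = [46, 189, 351, 876, 1491, 3722, 9283,
            16686, 29476, 30465, 30847, 35079]

chi2, p, dof, _ = chi2_contingency(np.column_stack([rejected, accepted]))
print(f"chi-squared = {chi2:,.0f}, dof = {dof}, p < 0.001: {p < 0.001}")
```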
fig, ax = plt.subplots(
    2, 1, figsize=(10, 6), sharex=True, gridspec_kw={"height_ratios": [3, 2]}
)

crosstab_status_by_year.barplot(ax=ax[0])
crosstab_status_by_year.barplot(normalize="rows", stacked=True, xlabel="Year", ax=ax[1])

# Get the limits of the x-axis and y-axis
xlim = ax[0].get_xlim()
ylim = ax[0].get_ylim()
# Set the position of the annotation
x_pos = xlim[1] - 0.25 * (xlim[1] - xlim[0])  # 25% offset from the right edge
y_pos = ylim[1] - 0.05 * (ylim[1] - ylim[0])  # 5% offset from the top edge

chi_sq_res_1 = crosstab_status_by_year.chisq_test("short")

ax[0].annotate(
    chi_sq_res_1.capitalize(),
    xy=(x_pos, y_pos),
    xycoords="data",
    ha="center",
    va="top",
)
ax[0].yaxis.set_major_formatter(axis_formatter("M", precision=1))
ax[0].get_legend().set_title("Loan status")
ax[1].get_legend().remove()
Fig. 3.1. Number of loan applications by year.

The next EDA plots and tables will be used to compare distributions of variables over the years. In a few cases, monthly trends will also be explored.

EDA: Financial, Risk-Related and Other Indicators
ct = an.CrossTab("year", "employment_length", data=loans_eda)
ct.barplot(colormap="Paired")
ct.barplot(
    normalize="rows", stacked=True, xlabel="Year", legend=False, colormap="Paired"
)
del ct

With the “< 1 year” group removed:

ct = an.CrossTab(
    "year", "employment_length", data=loans_eda.query("employment_length != '< 1 year'")
)
ct.barplot(colormap="Paired")
ct.barplot(
    normalize="rows", stacked=True, xlabel="Year", legend=False, colormap="Paired"
)
del ct
(a) Counts
(b) Percentages
Fig. 3.2. Employment length by year.
ct = an.CrossTab("month", "employment_length", data=loans_eda)
ct.barplot(colormap="Paired")
ct.barplot(
    normalize="rows", stacked=True, xlabel="Month", legend=False, colormap="Paired"
)

del ct

With the “< 1 year” group removed:

ct = an.CrossTab(
    "month",
    "employment_length",
    data=loans_eda.query("employment_length != '< 1 year'"),
)
ct.barplot(colormap="Paired")
ct.barplot(
    normalize="rows", stacked=True, xlabel="Month", legend=False, colormap="Paired"
)

del ct

ct = an.CrossTab("month", "employment_length_is_na", data=loans_eda)
ct.barplot(colormap="Paired")
ct.barplot(
    normalize="rows", stacked=True, xlabel="Month", legend=False, colormap="Paired"
)
del ct

sns.violinplot(x="year", y="loan_amount", hue="loan_status", data=loans_eda);

sns.violinplot(
    x="year", y="loan_amount_cap_40k", hue="loan_status", data=loans_eda, bw_adjust=3
);

sns.violinplot(
    x="year", y="debt_to_income_ratio_cap_100", hue="loan_status", data=loans_eda
);

sns.violinplot(x="year", y="risk_score", hue="loan_status", data=loans_eda);

ct = an.CrossTab("month", "risk_score_is_na", data=loans_eda)
ct.barplot(colormap="Paired")
ct.barplot(
    normalize="rows", stacked=True, xlabel="Month", legend=False, colormap="Paired"
)
del ct

EDA: Title Properties
x = "year"
y = "title_len"
hue = "loan_status"

data = loans_eda[[x, y, hue]].dropna()
sns.violinplot(x=x, y=y, hue=hue, data=data)
plt.gca().set_ylim(0, 100)

del x, y, hue, data

x = "year"
y = "title_n_capital_letters"
hue = "loan_status"

data = loans_eda[[x, y, hue]].dropna()
sns.violinplot(x=x, y=y, hue=hue, data=data)
plt.gca().set_ylim(0, 10)

del x, y, hue, data

x = "year"
y = "title_n_non_capital_letters"
hue = "loan_status"

data = loans_eda[[x, y, hue]].dropna()
sns.violinplot(x=x, y=y, hue=hue, data=data)
plt.gca().set_ylim(0, 40)

del x, y, hue, data

x = "year"
y = "title_n_letters"
hue = "loan_status"

data = loans_eda[[x, y, hue]].dropna()
sns.violinplot(x=x, y=y, hue=hue, data=data)
plt.gca().set_ylim(0, 40)

del x, y, hue, data

x = "year"
y = "title_n_digits"
hue = "loan_status"

data = loans_eda[[x, y, hue]].dropna()
sns.violinplot(x=x, y=y, hue=hue, data=data)

del x, y, hue, data

x = "year"
y = "title_n_punctuation"
hue = "loan_status"

data = loans_eda[[x, y, hue]].dropna()
sns.violinplot(x=x, y=y, hue=hue, data=data)
plt.gca().set_ylim(0, 2)

del x, y, hue, data

x = "year"
y = "title_n_spaces"
hue = "loan_status"

data = loans_eda[[x, y, hue]].dropna()
sns.violinplot(x=x, y=y, hue=hue, data=data)
plt.gca().set_ylim(0, 20)

del x, y, hue, data

x = "year"
y = "title_n_words"
hue = "loan_status"

data = loans_eda[[x, y, hue]].dropna()
sns.violinplot(x=x, y=y, hue=hue, data=data)
plt.gca().set_ylim(0, 13)

del x, y, hue, data

It seems that loan titles were standardized in 2018, as all-lower-case titles are absent in that year:

ct = an.CrossTab("year", "title_in_lower_case", data=loans_eda)
ct.barplot(colormap="Paired")
ct.barplot(
    normalize="rows", stacked=True, xlabel="Year", legend=False, colormap="Paired"
)
display(ct.counts.T)
del ct
Number of loan titles in lower case by year.
year 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018
title_in_lower_case
0 156 1378 3031 1368 2055 4141 10775 26229 48439 92373 319514 694223
1 291 644 1273 7366 14662 23193 51481 124817 180984 269479 202926 0
(a) As table (↑)
(b) Counts
(c) Percentages
Fig. 3.3. Number of loan titles in lower case by year.
ct = an.CrossTab("year", "title_in_upper_case", data=loans_eda)
ct.barplot(colormap="Paired")
ct.barplot(
    normalize="rows", stacked=True, xlabel="Year", legend=False, colormap="Paired"
)
del ct

ct = an.CrossTab("year", "title_in_title_case", data=loans_eda)
ct.barplot(colormap="Paired")
ct.barplot(
    normalize="rows", stacked=True, xlabel="Year", legend=False, colormap="Paired"
)
del ct

ct = an.CrossTab("year", "title_is_missing_or_empty", data=loans_eda)
ct.barplot(colormap="Paired")
ct.barplot(
    normalize="rows", stacked=True, xlabel="Year", legend=False, colormap="Paired"
)
ct.counts.assign(percent_missing_or_empty=lambda df: df[1] / (df[1] + df[0]) * 100)
del ct
(a) As table (↑)
(b) Counts
Fig. 3.4. Missing or empty loan titles by year.
EDA: Title Contents (Loan Purpose)

Here, the presence of certain words, phrases, or abbreviations in the loan titles is compared across the explored years.

ct = an.CrossTab("year", "credit", data=loans_eda)
ct.barplot(colormap="Paired")
ct.barplot(
    normalize="rows", stacked=True, xlabel="Year", legend=False, colormap="Paired"
)
del ct

ct = an.CrossTab("year", "card", data=loans_eda)
ct.barplot(colormap="Paired")
ct.barplot(
    normalize="rows", stacked=True, xlabel="Year", legend=False, colormap="Paired"
)
del ct

Refinancing cases seem to increase in 2018.

ct = an.CrossTab("year", "refinancing", data=loans_eda)
ct.barplot(colormap="Paired")
ct.barplot(
    normalize="rows", stacked=True, xlabel="Year", legend=False, colormap="Paired"
)
del ct
(a) Counts
(b) Percentages
Fig. 3.5. Number of loan titles related to refinancing by year.
ct = an.CrossTab("year", "consolidation", data=loans_eda)
ct.barplot(colormap="Paired")
ct.barplot(
    normalize="rows", stacked=True, xlabel="Year", legend=False, colormap="Paired"
)
del ct

ct = an.CrossTab("year", "debt_related", data=loans_eda)
ct.barplot(colormap="Paired")
ct.barplot(
    normalize="rows", stacked=True, xlabel="Year", legend=False, colormap="Paired"
)
del ct

ct = an.CrossTab("year", "bills_taxes", data=loans_eda)
ct.barplot(colormap="Paired")
ct.barplot(
    normalize="rows", stacked=True, xlabel="Year", legend=False, colormap="Paired"
)
del ct

ct = an.CrossTab("year", "payoff", data=loans_eda)
ct.barplot(colormap="Paired")
ct.barplot(
    normalize="rows", stacked=True, xlabel="Year", legend=False, colormap="Paired"
)
del ct

ct = an.CrossTab("year", "home_upgrade", data=loans_eda)
ct.barplot(colormap="Paired")
ct.barplot(
    normalize="rows", stacked=True, xlabel="Year", legend=False, colormap="Paired"
)
del ct

ct = an.CrossTab("year", "home_related", data=loans_eda)
ct.barplot(colormap="Paired")
ct.barplot(
    normalize="rows", stacked=True, xlabel="Year", legend=False, colormap="Paired"
)
del ct

ct = an.CrossTab("year", "home_buying", data=loans_eda)
ct.barplot(colormap="Paired")
ct.barplot(
    normalize="rows", stacked=True, xlabel="Year", legend=False, colormap="Paired"
)
del ct
(a) Counts
(b) Percentages
Fig. 3.6. Number of loan titles related to home buying by year.
ct = an.CrossTab("year", "fixing", data=loans_eda)
ct.barplot(colormap="Paired")
ct.barplot(
    normalize="rows", stacked=True, xlabel="Year", legend=False, colormap="Paired"
)
del ct

ct = an.CrossTab("year", "major_purchase_unspecified", data=loans_eda)
ct.barplot(colormap="Paired")
ct.barplot(
    normalize="rows", stacked=True, xlabel="Year", legend=False, colormap="Paired"
)
del ct

ct = an.CrossTab("year", "relocation", data=loans_eda)
ct.barplot(colormap="Paired")
ct.barplot(
    normalize="rows", stacked=True, xlabel="Year", legend=False, colormap="Paired"
)
del ct

ct = an.CrossTab("year", "weddings", data=loans_eda)
ct.barplot(colormap="Paired")
ct.barplot(
    normalize="rows", stacked=True, xlabel="Year", legend=False, colormap="Paired"
)
del ct

ct = an.CrossTab("year", "car", data=loans_eda)
ct.barplot(colormap="Paired")
ct.barplot(
    normalize="rows", stacked=True, xlabel="Year", legend=False, colormap="Paired"
)
del ct

ct = an.CrossTab("year", "motorcycle", data=loans_eda)
ct.barplot(colormap="Paired")
ct.barplot(
    normalize="rows", stacked=True, xlabel="Year", legend=False, colormap="Paired"
)
del ct

ct = an.CrossTab("year", "vehicle_unspecified_or_other", data=loans_eda)
ct.barplot(colormap="Paired")
ct.barplot(
    normalize="rows", stacked=True, xlabel="Year", legend=False, colormap="Paired"
)
del ct

ct = an.CrossTab("year", "medical_expenses", data=loans_eda)
ct.barplot(colormap="Paired")
ct.barplot(
    normalize="rows", stacked=True, xlabel="Year", legend=False, colormap="Paired"
)
del ct

ct = an.CrossTab("year", "education", data=loans_eda)
ct.barplot(colormap="Paired")
ct.barplot(
    normalize="rows", stacked=True, xlabel="Year", legend=False, colormap="Paired"
)
del ct

ct = an.CrossTab("year", "investment", data=loans_eda)
ct.barplot(colormap="Paired")
ct.barplot(
    normalize="rows", stacked=True, xlabel="Year", legend=False, colormap="Paired"
)
del ct

ct = an.CrossTab("year", "vocation", data=loans_eda)
ct.barplot(colormap="Paired")
ct.barplot(
    normalize="rows", stacked=True, xlabel="Year", legend=False, colormap="Paired"
)
del ct

ct = an.CrossTab("year", "renewable_energy", data=loans_eda)
ct.barplot(colormap="Paired")
ct.barplot(
    normalize="rows", stacked=True, xlabel="Year", legend=False, colormap="Paired"
)
del ct

3.4 EDA of Most Recent Year (2018)

Based on the EDA of all years, I decided to focus the analysis only on data from the most recent year (2018).

3.4.1 General EDA

There are 694.2 thousand cases in the EDA subset. The general EDA suggested that the following variables can be excluded:

  • as irrelevant:
    • date
    • year
    • policy_code
    • debt_to_income_ratio_is_na_original (created for EDA only)
    • debt_to_income_ratio_orig_cap_100 (created for EDA only)
    • debt_to_income_ratio_original (created for EDA only)
  • constant:
    • zip_code_is_na
    • title_is_missing_or_empty
    • title_n_digits
    • title_n_punctuation
    • title_in_upper_case
    • title_in_lower_case
    • bills_taxes
    • payoff
    • weddings
    • motorcycle
    • vehicle_unspecified_or_other
    • investment
  • almost constant:
    • education
    • debt_to_income_ratio_is_na
    • renewable_energy
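Constant and almost-constant columns like those listed above can also be flagged programmatically. A minimal sketch (with an assumed 99.9% dominance threshold and illustrative column names):

```python
# Flag columns whose most frequent non-missing value covers at least
# `threshold` of the rows (constant or almost constant).
import pandas as pd

def near_constant_columns(df: pd.DataFrame, threshold: float = 0.999) -> list:
    flagged = []
    for col in df.columns:
        counts = df[col].value_counts(dropna=True, normalize=True)
        if counts.empty or counts.iloc[0] >= threshold:
            flagged.append(col)
    return flagged

demo = pd.DataFrame({
    "zip_code_is_na": [0] * 1000,       # constant
    "education": [0] * 999 + [1],       # almost constant
    "loan_amount": range(1000),         # informative
})
print(near_constant_columns(demo))  # ['zip_code_is_na', 'education']
```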

Find the details below.

Code
file_path = dir_interim + "task-1-loans_eda.feather"

if os.path.exists(file_path):
    # Load from file, if present.
    loans_eda = pd.read_feather(file_path)
Code
loans_eda_2018 = loans_eda.query("year == 2018")
print(f"{loans_eda_2018.shape[0]/1e3:,.1f}k rows for year 2018")
694.2k rows for year 2018
Code
loans_eda_2018.info(verbose=False)
<class 'pandas.core.frame.DataFrame'>
Index: 694223 entries, 19248942 to 12767628
Columns: 60 entries, loan_status to renewable_energy
dtypes: Int16(3), Int8(36), category(2), datetime64[ns](1), float32(18)
memory usage: 113.2 MB
Code
an.col_info(loans_eda_2018, style=True)
  column data_type memory_size n_unique p_unique n_missing p_missing n_dominant p_dominant p_dom_excl_na dominant
1 loan_status Int8 1.4 MB 2 <0.1% 0 0% 659,144 94.9% 94.9% 0
2 policy_code Int8 1.4 MB 3 <0.1% 0 0% 656,360 94.5% 94.5% 0
3 date datetime64[ns] 5.6 MB 365 0.1% 0 0% 5,781 0.8% 0.8% 2018-10-01 00:00:00
4 year Int16 2.1 MB 1 <0.1% 0 0% 694,223 100.0% 100.0% 2018
5 month Int8 1.4 MB 12 <0.1% 0 0% 66,932 9.6% 9.6% 10
6 month_sin float32 2.8 MB 8 <0.1% 0 0% 132,879 19.1% 19.1% -0.8660254
7 month_cos float32 2.8 MB 8 <0.1% 0 0% 128,343 18.5% 18.5% -0.8660254
8 loan_amount float32 2.8 MB 1,849 0.3% 0 0% 114,672 16.5% 16.5% 10000.0
9 loan_amount_log float32 2.8 MB 1,849 0.3% 0 0% 114,672 16.5% 16.5% 9.210441
10 loan_amount_cap_40k float32 2.8 MB 1,550 0.2% 0 0% 114,672 16.5% 16.5% 10000.0
11 loan_amount_above_40k Int8 1.4 MB 2 <0.1% 0 0% 691,679 99.6% 99.6% 0
12 debt_to_income_ratio float32 2.8 MB 24,784 3.6% 4,556 0.7% 63,399 9.1% 9.2% 100.0
13 debt_to_income_ratio_cap_100 float32 2.8 MB 9,752 1.4% 4,556 0.7% 83,910 12.1% 12.2% 100.0
14 debt_to_income_ratio_is_na Int8 1.4 MB 2 <0.1% 0 0% 689,667 99.3% 99.3% 0
15 debt_to_income_ratio_original float32 2.8 MB 24,785 3.6% 85 <0.1% 63,399 9.1% 9.1% 100.0
16 debt_to_income_ratio_orig_cap_100 float32 2.8 MB 9,753 1.4% 85 <0.1% 83,910 12.1% 12.1% 100.0
17 debt_to_income_ratio_is_na_original Int8 1.4 MB 2 <0.1% 0 0% 694,138 >99.9% >99.9% 0
18 risk_score float32 2.8 MB 462 0.1% 613,924 88.4% 2,555 0.4% 3.2% 682.0
19 risk_score_is_na Int8 1.4 MB 2 <0.1% 0 0% 613,924 88.4% 88.4% 1
20 employment_length category 695.2 kB 11 <0.1% 23,447 3.4% 600,178 86.5% 89.5% < 1 year
21 employment_length_num Int8 1.4 MB 11 <0.1% 23,447 3.4% 600,178 86.5% 89.5% 0
22 employment_length_is_na Int8 1.4 MB 2 <0.1% 0 0% 670,776 96.6% 96.6% 0
23 state category 699.3 kB 51 <0.1% 0 0% 78,026 11.2% 11.2% CA
24 zip_area Int16 2.1 MB 10 <0.1% 0 0% 117,537 16.9% 16.9% 3
25 zip_code Int16 2.1 MB 944 0.1% 0 0% 6,719 1.0% 1.0% 770
26 zip_code_is_na Int8 1.4 MB 1 <0.1% 0 0% 694,223 100.0% 100.0% 0
27 title_is_missing_or_empty Int8 1.4 MB 1 <0.1% 0 0% 694,223 100.0% 100.0% 0
28 title_len float32 2.8 MB 10 <0.1% 0 0% 289,103 41.6% 41.6% 18.0
29 title_n_letters float32 2.8 MB 10 <0.1% 0 0% 289,103 41.6% 41.6% 17.0
30 title_n_capital_letters float32 2.8 MB 2 <0.1% 0 0% 689,755 99.4% 99.4% 1.0
31 title_n_non_capital_letters float32 2.8 MB 11 <0.1% 0 0% 289,103 41.6% 41.6% 16.0
32 title_n_digits float32 2.8 MB 1 <0.1% 0 0% 694,223 100.0% 100.0% 0.0
33 title_n_punctuation float32 2.8 MB 1 <0.1% 0 0% 694,223 100.0% 100.0% 0.0
34 title_n_spaces float32 2.8 MB 3 <0.1% 0 0% 415,081 59.8% 59.8% 1.0
35 title_n_words float32 2.8 MB 3 <0.1% 0 0% 415,081 59.8% 59.8% 2.0
36 title_in_title_case Int8 1.4 MB 2 <0.1% 0 0% 543,485 78.3% 78.3% 0
37 title_in_upper_case Int8 1.4 MB 1 <0.1% 0 0% 694,223 100.0% 100.0% 0
38 title_in_lower_case Int8 1.4 MB 1 <0.1% 0 0% 694,223 100.0% 100.0% 0
39 credit Int8 1.4 MB 2 <0.1% 0 0% 575,160 82.8% 82.8% 0
40 card Int8 1.4 MB 2 <0.1% 0 0% 575,160 82.8% 82.8% 0
41 refinancing Int8 1.4 MB 2 <0.1% 0 0% 575,160 82.8% 82.8% 0
42 consolidation Int8 1.4 MB 2 <0.1% 0 0% 405,120 58.4% 58.4% 0
43 debt_related Int8 1.4 MB 2 <0.1% 0 0% 405,120 58.4% 58.4% 0
44 bills_taxes Int8 1.4 MB 1 <0.1% 0 0% 694,223 100.0% 100.0% 0
45 payoff Int8 1.4 MB 1 <0.1% 0 0% 694,223 100.0% 100.0% 0
46 home_upgrade Int8 1.4 MB 2 <0.1% 0 0% 676,116 97.4% 97.4% 0
47 home_related Int8 1.4 MB 2 <0.1% 0 0% 649,006 93.5% 93.5% 0
48 home_buying Int8 1.4 MB 2 <0.1% 0 0% 667,113 96.1% 96.1% 0
49 fixing Int8 1.4 MB 2 <0.1% 0 0% 676,116 97.4% 97.4% 0
50 major_purchase_unspecified Int8 1.4 MB 2 <0.1% 0 0% 668,980 96.4% 96.4% 0
51 relocation Int8 1.4 MB 2 <0.1% 0 0% 680,419 98.0% 98.0% 0
52 weddings Int8 1.4 MB 1 <0.1% 0 0% 694,223 100.0% 100.0% 0
53 car Int8 1.4 MB 2 <0.1% 0 0% 541,696 78.0% 78.0% 0
54 motorcycle Int8 1.4 MB 1 <0.1% 0 0% 694,223 100.0% 100.0% 0
55 vehicle_unspecified_or_other Int8 1.4 MB 1 <0.1% 0 0% 694,223 100.0% 100.0% 0
56 medical_expenses Int8 1.4 MB 2 <0.1% 0 0% 677,389 97.6% 97.6% 0
57 education Int8 1.4 MB 2 <0.1% 0 0% 694,218 >99.9% >99.9% 0
58 investment Int8 1.4 MB 1 <0.1% 0 0% 694,223 100.0% 100.0% 0
59 vocation Int8 1.4 MB 2 <0.1% 0 0% 688,563 99.2% 99.2% 0
60 renewable_energy Int8 1.4 MB 2 <0.1% 0 0% 693,471 99.9% 99.9% 0
Code
# Columns to exclude
# (loan_title and loan_title_unified are already excluded)

to_exclude_1 = str_to_list(
    """
date
year
policy_code
debt_to_income_ratio_is_na_original
debt_to_income_ratio_orig_cap_100
debt_to_income_ratio_original
zip_code_is_na
title_is_missing_or_empty
title_n_digits
title_n_punctuation
title_in_upper_case
title_in_lower_case
bills_taxes
payoff
weddings
motorcycle
vehicle_unspecified_or_other
investment
education
renewable_energy
"""
)
Code
if do_eda:
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")

        report_training_2018 = sweetviz.analyze(
            [loans_eda_2018.drop(columns=to_exclude_1), "Training data"],
            target_feat="loan_status",
            pairwise_analysis="off",
        )
        report_training_2018.show_notebook()

3.4.3 Relationships between Variables

Hierarchical clustering can be performed on a correlation matrix to group variables with similar correlation patterns (the clusters appear as squares of two or more variables along the diagonal). Using absolute values of the correlation coefficients makes these clusters easier to spot. Figure 3.7 thus helped to identify variables that are highly correlated with each other and to manually populate the list of variables to exclude from the analysis.

Among the correlations with the target, the variables risk_score_is_na, risk_score, and employment_length_num show the strongest relationships (Table 3.2).

Code
data_num = loans_eda_2018.drop(columns=to_exclude_1).select_dtypes("number")
corr_coefs = data_num.corr(method="pearson")
corr_coefs_abs = corr_coefs.fillna(0).abs()
g = sns.clustermap(
    corr_coefs_abs,
    method="ward",
    cmap="Greens",
    annot=True,
    annot_kws={"size": 8},
    vmin=0,
    vmax=1,
    figsize=(14, 11),
    cbar_pos=(0.94, 0.91, 0.03, 0.1),
    cbar_kws={"location": "right"},
    dendrogram_ratio=(0.075, 0),
)

g.fig.suptitle(
    "Absolute Values of Pearson Correlation (Matrix with Hierarchical Clustering)",
    fontsize=17,
    y=1.03,
    x=0.55,
);
Fig. 3.7. Matrix of absolute values of Pearson correlation coefficients.

The list of variables to exclude from the analysis based on this and previous explorations:

Code
to_exclude_2 = str_to_list(
    """
date
year
policy_code
debt_to_income_ratio_is_na_original
debt_to_income_ratio_orig_cap_100
debt_to_income_ratio_original
zip_code_is_na
title_is_missing_or_empty
title_n_digits
title_n_punctuation
title_in_upper_case
title_in_lower_case
bills_taxes
payoff
weddings
motorcycle
vehicle_unspecified_or_other
investment
education   
renewable_energy
title_n_spaces
title_n_non_capital_letters
title_n_letters
title_in_title_case
fixing
credit
card
debt_related
zip_code
state
employment_length
loan_amount_log
month
"""
)
an.get_pointbiserial_corr_scores(
    loans_eda_2018.drop(columns=to_exclude_2), target="loan_status"
)
# Time: 2m 41.0s
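`an.get_pointbiserial_corr_scores` is a project helper; a hedged sketch of the kind of computation it likely performs, built on `scipy.stats.pointbiserialr` (the function name and output columns here are illustrative):

```python
import pandas as pd
from scipy import stats


def pointbiserial_scores(df: pd.DataFrame, target: str) -> pd.DataFrame:
    """Correlate a binary 0/1 target with every other numeric column."""
    rows = []
    for col in df.select_dtypes("number").columns:
        if col == target:
            continue
        pair = df[[target, col]].dropna()  # pairwise-complete observations
        r_pb, p = stats.pointbiserialr(pair[target], pair[col])
        rows.append({"variable": col, "n": len(pair), "r_pb": r_pb, "p": p})
    return pd.DataFrame(rows).sort_values(
        "r_pb", key=abs, ascending=False, ignore_index=True
    )
```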
Table 3.2. Point-biserial correlation coefficients between the variables and the target.
  variable_1 variable_2 n r_pb p p_adj
1 loan_status risk_score_is_na 694223 -0.638 <0.001 <0.001
2 loan_status risk_score 80299 0.596 <0.001 <0.001
3 loan_status employment_length_num 670776 0.589 <0.001 <0.001
4 loan_status debt_to_income_ratio_cap_100 689667 -0.093 <0.001 <0.001
5 loan_status title_len 694223 0.092 <0.001 <0.001
6 loan_status title_n_words 694223 0.070 <0.001 <0.001
7 loan_status employment_length_is_na 694223 0.068 <0.001 <0.001
8 loan_status loan_amount_cap_40k 694223 0.065 <0.001 <0.001
9 loan_status home_upgrade 694223 0.060 <0.001 <0.001
10 loan_status consolidation 694223 0.051 <0.001 <0.001
11 loan_status refinancing 694223 0.051 <0.001 <0.001
12 loan_status loan_amount 694223 0.048 <0.001 <0.001
13 loan_status home_buying 694223 -0.034 <0.001 <0.001
14 loan_status car 694223 0.025 <0.001 <0.001
15 loan_status relocation 694223 -0.023 <0.001 <0.001
16 loan_status title_n_capital_letters 694223 -0.019 <0.001 <0.001
17 loan_status medical_expenses 694223 -0.016 <0.001 <0.001
18 loan_status major_purchase_unspecified 694223 -0.016 <0.001 <0.001
19 loan_status loan_amount_above_40k 694223 -0.014 <0.001 <0.001
20 loan_status home_related 694223 0.012 <0.001 <0.001
21 loan_status debt_to_income_ratio_is_na 694223 -0.012 <0.001 <0.001
22 loan_status month_sin 694223 0.011 <0.001 <0.001
23 loan_status debt_to_income_ratio 689667 -0.010 <0.001 <0.001
24 loan_status zip_area 694223 0.007 <0.001 <0.001
25 loan_status month_cos 694223 -0.003 0.014 0.028
26 loan_status vocation 694223 -0.003 0.024 0.028

The last check before modeling:

Code
an.col_info(loans_eda_2018.drop(columns=to_exclude_2), style=True)
  column data_type memory_size n_unique p_unique n_missing p_missing n_dominant p_dominant p_dom_excl_na dominant
1 loan_status Int8 1.4 MB 2 <0.1% 0 0% 659,144 94.9% 94.9% 0
2 month_sin float32 2.8 MB 8 <0.1% 0 0% 132,879 19.1% 19.1% -0.8660254
3 month_cos float32 2.8 MB 8 <0.1% 0 0% 128,343 18.5% 18.5% -0.8660254
4 loan_amount float32 2.8 MB 1,849 0.3% 0 0% 114,672 16.5% 16.5% 10000.0
5 loan_amount_cap_40k float32 2.8 MB 1,550 0.2% 0 0% 114,672 16.5% 16.5% 10000.0
6 loan_amount_above_40k Int8 1.4 MB 2 <0.1% 0 0% 691,679 99.6% 99.6% 0
7 debt_to_income_ratio float32 2.8 MB 24,784 3.6% 4,556 0.7% 63,399 9.1% 9.2% 100.0
8 debt_to_income_ratio_cap_100 float32 2.8 MB 9,752 1.4% 4,556 0.7% 83,910 12.1% 12.2% 100.0
9 debt_to_income_ratio_is_na Int8 1.4 MB 2 <0.1% 0 0% 689,667 99.3% 99.3% 0
10 risk_score float32 2.8 MB 462 0.1% 613,924 88.4% 2,555 0.4% 3.2% 682.0
11 risk_score_is_na Int8 1.4 MB 2 <0.1% 0 0% 613,924 88.4% 88.4% 1
12 employment_length_num Int8 1.4 MB 11 <0.1% 23,447 3.4% 600,178 86.5% 89.5% 0
13 employment_length_is_na Int8 1.4 MB 2 <0.1% 0 0% 670,776 96.6% 96.6% 0
14 zip_area Int16 2.1 MB 10 <0.1% 0 0% 117,537 16.9% 16.9% 3
15 title_len float32 2.8 MB 10 <0.1% 0 0% 289,103 41.6% 41.6% 18.0
16 title_n_capital_letters float32 2.8 MB 2 <0.1% 0 0% 689,755 99.4% 99.4% 1.0
17 title_n_words float32 2.8 MB 3 <0.1% 0 0% 415,081 59.8% 59.8% 2.0
18 refinancing Int8 1.4 MB 2 <0.1% 0 0% 575,160 82.8% 82.8% 0
19 consolidation Int8 1.4 MB 2 <0.1% 0 0% 405,120 58.4% 58.4% 0
20 home_upgrade Int8 1.4 MB 2 <0.1% 0 0% 676,116 97.4% 97.4% 0
21 home_related Int8 1.4 MB 2 <0.1% 0 0% 649,006 93.5% 93.5% 0
22 home_buying Int8 1.4 MB 2 <0.1% 0 0% 667,113 96.1% 96.1% 0
23 major_purchase_unspecified Int8 1.4 MB 2 <0.1% 0 0% 668,980 96.4% 96.4% 0
24 relocation Int8 1.4 MB 2 <0.1% 0 0% 680,419 98.0% 98.0% 0
25 car Int8 1.4 MB 2 <0.1% 0 0% 541,696 78.0% 78.0% 0
26 medical_expenses Int8 1.4 MB 2 <0.1% 0 0% 677,389 97.6% 97.6% 0
27 vocation Int8 1.4 MB 2 <0.1% 0 0% 688,563 99.2% 99.2% 0
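The month_sin and month_cos columns above are consistent with a standard cyclic encoding of the loan issue month, which keeps December and January adjacent on a circle (a sketch, assuming that is how the features were derived):

```python
import numpy as np
import pandas as pd

# Map months 1-12 onto a circle via sin/cos of the month angle
months = pd.Series([1, 3, 6, 10])
month_sin = np.sin(2 * np.pi * months / 12)
month_cos = np.cos(2 * np.pi * months / 12)
```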

3.5 Modeling (Year 2018)

3.5.1 Prepare Training, Validation, and Test Sets

Code
path_train = dir_interim + "task-1-loans_train.feather"
path_validation = dir_interim + "task-1-loans_validation.feather"
path_test = dir_interim + "task-1-loans_test.feather"

if (
    os.path.exists(path_train)
    and os.path.exists(path_validation)
    and os.path.exists(path_test)
):
    # Load saved results
    loans_train = pd.read_feather(path_train)
    loans_validation = pd.read_feather(path_validation)
    loans_test = pd.read_feather(path_test)
else:
    # Save the previously created splits as feather files
    loans_train.to_feather(path_train)
    loans_validation.to_feather(path_validation)
    loans_test.to_feather(path_test)
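The `else` branch above assumes `loans_train`, `loans_validation`, and `loans_test` already exist in memory. One way such a split could have been produced (a hedged sketch using sklearn's `train_test_split` on a hypothetical combined `loans` frame; the 70/15/15 ratio and stratification are illustrative):

```python
from sklearn.model_selection import train_test_split


def split_loans(loans, target="loan_status", seed=1):
    """Split into ~70/15/15 train/validation/test sets,
    stratified by the target to preserve the class ratio."""
    loans_train, rest = train_test_split(
        loans, test_size=0.30, stratify=loans[target], random_state=seed
    )
    loans_validation, loans_test = train_test_split(
        rest, test_size=0.50, stratify=rest[target], random_state=seed
    )
    return loans_train, loans_validation, loans_test
```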
Code
# Filter the data for year 2018 only
loans_train_2018 = loans_train.query("year == 2018")
loans_validation_2018 = loans_validation.query("year == 2018")
loans_test_2018 = loans_test.query("year == 2018")

# Create X_train, y_train, X_validation, y_validation, X_test, and y_test
X_train = loans_train_2018.drop(columns=to_exclude_2 + ["loan_status"])
y_train = loans_train_2018["loan_status"]

X_validation = loans_validation_2018.drop(columns=to_exclude_2 + ["loan_status"])
y_validation = loans_validation_2018["loan_status"]

X_test = loans_test_2018.drop(columns=to_exclude_2 + ["loan_status"])
y_test = loans_test_2018["loan_status"]
Code
# Remove some unnecessary variables
del (
    loans_train_2018,
    loans_validation_2018,
    loans_test_2018,
    loans_train,
    loans_validation,
    loans_test,
)

if "loans_eda_2018" in locals():
    del loans_eda_2018

if "loans_eda" in locals():
    del loans_eda

3.5.2 Create Pre-Processing Pipelines

Some steps of group-independent pre-processing have already been performed in the previous chapters and will not be repeated here. Below, pipelines for the remaining group-independent steps and for group-dependent processing are created.

The suffix _lr marks a pipeline (or other object) intended for the non-tree-based models (logistic regression and Naive Bayes), while _trees marks one for the tree-based models (LGBM):

  • The _lr pipeline contains one extra step: numeric data scaling.
numeric_common = str_to_list(
    """
risk_score
employment_length_num
title_len
title_n_capital_letters
title_n_words
"""
)

# Impute median, scale:
numeric_features_for_lr = (
    str_to_list(
        """
loan_amount_cap_40k
debt_to_income_ratio_cap_100
"""
    )
    + numeric_common
)

exclude_for_lr = str_to_list(
    """
loan_amount
debt_to_income_ratio
"""
)

# Impute median:
numeric_features_for_trees = (
    str_to_list(
        """
loan_amount
debt_to_income_ratio
"""
    )
    + numeric_common
)

exclude_for_trees = str_to_list(
    """
loan_amount_cap_40k
debt_to_income_ratio_cap_100
"""
)

# One-Hot Encoding:
categorical_features = ["zip_area"]

# Impute 0:
binary_features = str_to_list(
    """
loan_amount_above_40k
refinancing
consolidation
home_upgrade
home_related
home_buying
major_purchase_unspecified
relocation
car
medical_expenses
vocation
"""
)

# Pass Through
indicators = str_to_list(
    """
debt_to_income_ratio_is_na
employment_length_is_na
risk_score_is_na
"""
)

Group-independent steps that can be performed before cross-validation:

select_binary_01 = make_column_selector(
    pattern=create_column_selector_pattern(binary_features)
)
select_categorical = make_column_selector(
    pattern=create_column_selector_pattern(categorical_features)
)

# Create the pipelines
binary_transformer = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="constant", fill_value=0))]
)
categorical_transformer = Pipeline(
    steps=[("onehot", OneHotEncoder(sparse_output=False, handle_unknown="ignore"))]
)


group_independent_transforms = ColumnTransformer(
    transformers=[
        ("binary", binary_transformer, select_binary_01),
        ("categorical", categorical_transformer, select_categorical),
    ],
    verbose_feature_names_out=False,
    remainder="passthrough",
)

group_independent_preprocessor_for_lr = Pipeline(
    steps=[
        ("dropper", DropFeatures(exclude_for_lr)),
        ("transformer", group_independent_transforms),
    ]
)

group_independent_preprocessor_for_trees = Pipeline(
    steps=[
        ("dropper", DropFeatures(exclude_for_trees)),
        ("transformer", group_independent_transforms),
    ]
)
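`create_column_selector_pattern` is another project helper; a minimal sketch, assuming it builds an anchored regex alternation so that `make_column_selector` matches the listed names exactly:

```python
import re


def create_column_selector_pattern(columns):
    """Regex matching exactly the given column names (and nothing longer)."""
    return r"^(?:" + "|".join(re.escape(c) for c in columns) + r")$"
```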
  • For logistic regression and Naive Bayes, loan_amount and debt_to_income_ratio will be dropped, and their capped versions will be used instead:
group_independent_preprocessor_for_lr
Pipeline(steps=[('dropper',
                 DropFeatures(features_to_drop=['loan_amount',
                                                'debt_to_income_ratio'])),
                ('transformer',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('binary',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(fill_value=0,
                                                                                 strategy='constant'))]),
                                                  <sklearn.compose._column_transformer.make_column_selector object at 0x0000026409B7ABD0>),
                                                 ('categorical',
                                                  Pipeline(steps=[('onehot',
                                                                   OneHotEncoder(handle_unknown='ignore',
                                                                                 sparse_output=False))]),
                                                  <sklearn.compose._column_transformer.make_column_selector object at 0x000002640D7AFAD0>)],
                                   verbose_feature_names_out=False))])
  • For tree-based models, non-capped versions of the same variables will be used:
group_independent_preprocessor_for_trees
Pipeline(steps=[('dropper',
                 DropFeatures(features_to_drop=['loan_amount_cap_40k',
                                                'debt_to_income_ratio_cap_100'])),
                ('transformer',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('binary',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(fill_value=0,
                                                                                 strategy='constant'))]),
                                                  <sklearn.compose._column_transformer.make_column_selector object at 0x0000026409B7ABD0>),
                                                 ('categorical',
                                                  Pipeline(steps=[('onehot',
                                                                   OneHotEncoder(handle_unknown='ignore',
                                                                                 sparse_output=False))]),
                                                  <sklearn.compose._column_transformer.make_column_selector object at 0x000002640D7AFAD0>)],
                                   verbose_feature_names_out=False))])

Group-dependent pre-processing steps that should be performed after the cross-validation split, before each model re-fit:

# Group-dependent pre-processing steps
# that will be performed before each model re-fitting

# Select numeric variables by data type and name pattern
select_numeric_for_lr = make_column_selector(
    dtype_include="number",
    pattern=create_column_selector_pattern(numeric_features_for_lr),
)
select_numeric_for_trees = make_column_selector(
    dtype_include="number",
    pattern=create_column_selector_pattern(numeric_features_for_trees),
)

# Create the pipelines
numeric_transformer_for_lr = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="median")), ("scaler", StandardScaler())]
)
numeric_transformer_for_trees = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="median"))]
)

group_dependent_preprocessor_for_lr = ColumnTransformer(
    transformers=[
        ("numeric", numeric_transformer_for_lr, select_numeric_for_lr),
    ],
    verbose_feature_names_out=False,
    remainder="passthrough",
)

group_dependent_preprocessor_for_trees = ColumnTransformer(
    transformers=[
        ("numeric", numeric_transformer_for_trees, select_numeric_for_trees),
    ],
    verbose_feature_names_out=False,
    remainder="passthrough",
)
  • For logistic regression and Naive Bayes:
# For LR and NB
group_dependent_preprocessor_for_lr
ColumnTransformer(remainder='passthrough',
                  transformers=[('numeric',
                                 Pipeline(steps=[('imputer',
                                                  SimpleImputer(strategy='median')),
                                                 ('scaler', StandardScaler())]),
                                 <sklearn.compose._column_transformer.make_column_selector object at 0x000002640D7BBFD0>)],
                  verbose_feature_names_out=False)
  • For tree-based models:
# For tree-based models
group_dependent_preprocessor_for_trees
ColumnTransformer(remainder='passthrough',
                  transformers=[('numeric',
                                 Pipeline(steps=[('imputer',
                                                  SimpleImputer(strategy='median'))]),
                                 <sklearn.compose._column_transformer.make_column_selector object at 0x000002640D7BBE50>)],
                  verbose_feature_names_out=False)

3.5.3 Apply Group-Independent Pre-Processing

Apply group-independent pre-processing steps to the training, validation, and test sets and save the results as separate datasets.

Code
print(X_train.shape)
print(X_validation.shape)
print(X_test.shape)
(6939617, 26)
(1484768, 26)
(1485951, 26)

Next, apply group-independent pre-processing and cache the results.

Code
# For LR and NB
path_train = dir_interim + "task-1-X_train_lr.feather"
path_validation = dir_interim + "task-1-X_validation_lr.feather"
path_test = dir_interim + "task-1-X_test_lr.feather"

if (
    os.path.exists(path_train)
    and os.path.exists(path_validation)
    and os.path.exists(path_test)
):
    # Load saved results
    X_train_lr = pd.read_feather(path_train)
    X_validation_lr = pd.read_feather(path_validation)
    X_test_lr = pd.read_feather(path_test)
else:
    # Pre-process
    X_train_lr = group_independent_preprocessor_for_lr.fit_transform(X_train)
    X_validation_lr = group_independent_preprocessor_for_lr.transform(X_validation)
    X_test_lr = group_independent_preprocessor_for_lr.transform(X_test)

    # Save as feather files
    X_train_lr.to_feather(path_train)
    X_validation_lr.to_feather(path_validation)
    X_test_lr.to_feather(path_test)
Code
# For LGBM
path_train_trees = dir_interim + "task-1-X_train_trees.feather"
path_validation_trees = dir_interim + "task-1-X_validation_trees.feather"
path_test_trees = dir_interim + "task-1-X_test_trees.feather"

if (
    os.path.exists(path_train_trees)
    and os.path.exists(path_validation_trees)
    and os.path.exists(path_test_trees)
):
    # Load saved results
    X_train_trees = pd.read_feather(path_train_trees)
    X_validation_trees = pd.read_feather(path_validation_trees)
    X_test_trees = pd.read_feather(path_test_trees)
else:
    # Pre-process
    X_train_trees = group_independent_preprocessor_for_trees.fit_transform(X_train)
    X_validation_trees = group_independent_preprocessor_for_trees.transform(
        X_validation
    )
    X_test_trees = group_independent_preprocessor_for_trees.transform(X_test)

    # Save as feather files
    X_train_trees.to_feather(path_train_trees)
    X_validation_trees.to_feather(path_validation_trees)
    X_test_trees.to_feather(path_test_trees)

3.5.4 Train Models

Code
# Dictionary to collect results (logistic regression and Naive Bayes)
models_lr = {}
Code
@my.cache_results(dir_interim + "models_1_01_naive_bayes-2.pickle")
def fit_nb():
    """Fit a Naive Bayes model."""
    pipeline = Pipeline(
        steps=[
            ("preprocessor", clone(group_dependent_preprocessor_for_lr)),
            ("classifier", GaussianNB()),
        ]
    )
    pipeline.fit(X_train_lr, y_train)

    return pipeline


models_lr["Naive Bayes"] = fit_nb()
# Time 1m 38.1s
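`my.cache_results` is a project utility; a plausible sketch (assuming it pickles the function's return value on the first call and loads it on subsequent calls):

```python
import functools
import os
import pickle


def cache_results(path):
    """Cache a function's return value as a pickle at `path`."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if os.path.exists(path):
                # A cached result exists: load it instead of re-running
                with open(path, "rb") as file:
                    return pickle.load(file)
            result = func(*args, **kwargs)
            with open(path, "wb") as file:
                pickle.dump(result, file)
            return result
        return wrapper
    return decorator
```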
Code
@my.cache_results(dir_interim + "models_1_02_logistic_regression_sgd-2.pickle")
def fit_lr_sgd():
    """Fit a Logistic Regression model (via SGD training)."""
    pipeline = Pipeline(
        steps=[
            ("preprocessor", clone(group_dependent_preprocessor_for_lr)),
            (
                "classifier",
                SGDClassifier(
                    random_state=1, loss="log_loss", n_jobs=-1, class_weight="balanced"
                ),
            ),
        ]
    )
    pipeline.fit(X_train_lr, y_train)
    return pipeline


models_lr["Logistic Regression"] = fit_lr_sgd()
# 51.5s
Code
# Dictionary to collect results (tree-based models)
models_trees = {}
Code
@my.cache_results(dir_interim + "models_1_03_lgbm-2.pickle")
def fit_lgbm():
    """Fit a LGBM model."""
    pipeline = Pipeline(
        steps=[
            ("preprocessor", clone(group_dependent_preprocessor_for_trees)),
            (
                "classifier",
                LGBMClassifier(
                    random_state=1,
                    class_weight="balanced",
                    objective="binary",
                    metric="binary_logloss",
                    n_jobs=-1,
                    device="gpu",
                    verbosity=1,
                ),
            ),
        ]
    )

    pipeline.fit(X_train_trees, y_train)

    return pipeline


# Time: 57.2s
models_trees["LGBM"] = fit_lgbm()

3.5.5 Evaluate Models

print("--- Train ---")

trees_performance_train = ml.classification_scores(
    models_trees,
    X_train_trees,
    y_train,
    style=False,
)
ml.classification_scores(
    models_lr,
    X_train_lr,
    y_train,
    add=trees_performance_train,
    color="orange",
    sort_by="ROC_AUC",
)
--- Train ---
Table 3.3. Classification scores for the train set. The rows are sorted by ROC AUC score. The best values in each column are highlighted.
  n No_info_rate Accuracy BAcc BAcc_01 F1 F1_neg TPR TNR PPV NPV ROC_AUC
LGBM 6939617 0.950 0.997 0.998 0.996 0.967 0.998 0.999 0.996 0.936 1.000 1.000
Logistic Regression 6939617 0.950 0.981 0.989 0.977 0.840 0.990 0.997 0.980 0.725 1.000 0.998
Naive Bayes 6939617 0.950 0.938 0.963 0.926 0.614 0.966 0.991 0.935 0.445 0.999 0.987
print("--- Validation ---")

trees_performance_val = ml.classification_scores(
    models_trees,
    X_validation_trees,
    y_validation,
    style=False,
)
ml.classification_scores(
    models_lr,
    X_validation_lr,
    y_validation,
    add=trees_performance_val,
    sort_by="ROC_AUC",
)
--- Validation ---
Table 3.4. Classification scores for the validation set. The rows are sorted by ROC AUC score. The best values in each column are highlighted.
  n No_info_rate Accuracy BAcc BAcc_01 F1 F1_neg TPR TNR PPV NPV ROC_AUC
LGBM 1484768 0.950 0.997 0.998 0.995 0.966 0.998 0.999 0.996 0.935 1.000 1.000
Logistic Regression 1484768 0.950 0.981 0.989 0.978 0.841 0.990 0.997 0.980 0.726 1.000 0.998
Naive Bayes 1484768 0.950 0.938 0.963 0.927 0.614 0.966 0.991 0.935 0.445 1.000 0.988
Code
y_pred_validation_nb = models_lr["Naive Bayes"].predict(X_validation_lr)
ml.plot_confusion_matrices(y_validation, y_pred_validation_nb);

Code
y_pred_validation_lr = models_lr["Logistic Regression"].predict(X_validation_lr)
ml.plot_confusion_matrices(y_validation, y_pred_validation_lr);

Code
y_pred_validation_lgbm = models_trees["LGBM"].predict(X_validation_trees)
ml.plot_confusion_matrices(y_validation, y_pred_validation_lgbm);

For the next round of analysis, the LGBM model is selected as the best-performing one.
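`ml.classification_scores` is a project helper; several of the metrics reported in Tables 3.3 and 3.4 can be reproduced with sklearn.metrics (a sketch only; the exact columns, e.g. BAcc_01, are project-specific):

```python
from sklearn.metrics import (
    accuracy_score,
    balanced_accuracy_score,
    f1_score,
    roc_auc_score,
)


def basic_scores(y_true, y_pred, y_score):
    """A few of the metrics from Tables 3.3-3.4 for one model."""
    return {
        "Accuracy": accuracy_score(y_true, y_pred),
        "BAcc": balanced_accuracy_score(y_true, y_pred),
        "F1": f1_score(y_true, y_pred),
        "ROC_AUC": roc_auc_score(y_true, y_score),
    }
```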

3.5.6 Feature Importance (Best Model)

Feature importance of the LGBM model is evaluated using both LGBM's built-in importance and SHAP values.

Both methods point to the same 9 most important features, though they rank them in a different order.

Code
@my.cache_results(dir_interim + "task-1-shap_lgbm_k=all.pkl")
def get_shap_values_lgbm():
    model = "LGBM"
    preproc = Pipeline(steps=models_trees[model].steps[:-1])
    classifier = models_trees[model]["classifier"]
    X_validation_preproc = preproc.transform(X_validation_trees)

    tree_explainer = shap.TreeExplainer(classifier)
    shap_values = tree_explainer.shap_values(X_validation_preproc)
    return shap_values, X_validation_preproc


shap_values_lgbm, data_for_lgbm = get_shap_values_lgbm()
# Time: 7m 49.3s
LightGBM binary classifier with TreeExplainer shap values output has changed to a list of ndarray
Code
lgb.plot_importance(
    models_trees["LGBM"]["classifier"],
    max_num_features=50,
    figsize=(10, 6),
    height=0.8,
    title="LGBM Feature Importance",
);

Code
shap.summary_plot(shap_values_lgbm[1], data_for_lgbm, plot_type="bar")

Code
shap.summary_plot(shap_values_lgbm[1], data_for_lgbm)
No data for colormapping provided via 'c'. Parameters 'vmin', 'vmax' will be ignored

3.5.7 LGBM Re-Evaluation with Most Important Variables

The LGBM model was re-evaluated with only the 9 most important features.

Validation accuracy, the F1 score (of accepted loans), and PPV (positive predictive value) dropped slightly, while the remaining metrics were unchanged to three decimal places (Table 3.6). Due to its lower complexity, the 9-feature model will be used in the next steps.

The details of modeling are presented below.

@my.cache_results(dir_interim + "models_1_03_lgbm_k=9.pkl")
def fit_lgbm_k9():
    """Fit a LGBM model."""

    to_include = str_to_list(
        """
    risk_score_is_na
    risk_score
    employment_length_num
    debt_to_income_ratio
    month_cos
    month_sin
    employment_length_is_na
    title_len
    loan_amount
    """
    )

    pipeline = Pipeline(
        steps=[
            ("selector", ColumnSelector(to_include)),
            ("preprocessor", clone(group_dependent_preprocessor_for_trees)),
            (
                "classifier",
                LGBMClassifier(
                    random_state=1,
                    class_weight="balanced",
                    objective="binary",
                    metric="binary_logloss",
                    n_jobs=-1,
                    device="gpu",
                    verbosity=1,
                ),
            ),
        ]
    )
    pipeline.fit(X_train_trees, y_train)
    return pipeline


models_trees["LGBM (feat=9)"] = fit_lgbm_k9()
# Time 35.4s
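`ColumnSelector` is not a scikit-learn class; a minimal sketch of such a transformer (assuming it simply keeps the listed columns of a DataFrame):

```python
from sklearn.base import BaseEstimator, TransformerMixin


class ColumnSelector(BaseEstimator, TransformerMixin):
    """Keep only the listed columns of a DataFrame."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        # Stateless: nothing to learn from the data
        return self

    def transform(self, X):
        return X[self.columns]
```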
print("--- Train ---")

trees_performance_train = ml.classification_scores(
    models_trees,
    X_train_trees,
    y_train,
    style=False,
)
train_performance = ml.classification_scores(
    models_lr,
    X_train_lr,
    y_train,
    add=trees_performance_train,
    color="orange",
    sort_by="ROC_AUC",
)
train_performance
--- Train ---
Table 3.5. Classification scores for the train set. The rows are sorted by ROC-AUC score. The best values in each column are highlighted.
  n No_info_rate Accuracy BAcc BAcc_01 F1 F1_neg TPR TNR PPV NPV ROC_AUC
LGBM 6939617 0.950 0.997 0.998 0.996 0.967 0.998 0.999 0.996 0.936 1.000 1.000
LGBM (feat=9) 6939617 0.950 0.996 0.998 0.995 0.965 0.998 0.999 0.996 0.934 1.000 1.000
Logistic Regression 6939617 0.950 0.981 0.989 0.977 0.840 0.990 0.997 0.980 0.725 1.000 0.998
Naive Bayes 6939617 0.950 0.938 0.963 0.926 0.614 0.966 0.991 0.935 0.445 0.999 0.987
print("--- Validation ---")

trees_performance_val = ml.classification_scores(
    models_trees,
    X_validation_trees,
    y_validation,
    style=False,
)
validation_performance = ml.classification_scores(
    models_lr,
    X_validation_lr,
    y_validation,
    add=trees_performance_val,
    sort_by="ROC_AUC",
)
validation_performance
# Time: 8m 50.6s
--- Validation ---
Table 3.6. Classification scores for the validation set. The rows are sorted by ROC-AUC score. The best values in each column are highlighted.
  n No_info_rate Accuracy BAcc BAcc_01 F1 F1_neg TPR TNR PPV NPV ROC_AUC
LGBM 1484768 0.950 0.997 0.998 0.995 0.966 0.998 0.999 0.996 0.935 1.000 1.000
LGBM (feat=9) 1484768 0.950 0.996 0.998 0.995 0.965 0.998 0.999 0.996 0.933 1.000 1.000
Logistic Regression 1484768 0.950 0.981 0.989 0.978 0.841 0.990 0.997 0.980 0.726 1.000 0.998
Naive Bayes 1484768 0.950 0.938 0.963 0.927 0.614 0.966 0.991 0.935 0.445 1.000 0.988
Code
y_pred_validation_lgbm = models_trees["LGBM"].predict(X_validation_trees)
ml.plot_confusion_matrices(y_validation, y_pred_validation_lgbm);

Code
y_pred_validation_lgbm = models_trees["LGBM (feat=9)"].predict(X_validation_trees)
ml.plot_confusion_matrices(y_validation, y_pred_validation_lgbm);

Code
lgb.plot_importance(
    models_trees["LGBM (feat=9)"]["classifier"],
    max_num_features=50,
    figsize=(10, 4),
    height=0.8,
    title="LGBM Feature Importance",
);

Code
@my.cache_results(dir_interim + "task-1-shap_lgbm_k=9.pkl")
def get_shap_values_lgbm_k9():
    model = "LGBM (feat=9)"
    preproc = Pipeline(steps=models_trees[model].steps[:-1])
    classifier = models_trees[model]["classifier"]
    X_validation_preproc = preproc.transform(X_validation_trees)

    tree_explainer = shap.TreeExplainer(classifier)
    shap_values = tree_explainer.shap_values(X_validation_preproc)
    return shap_values, X_validation_preproc


shap_values_lgbm_k9, data_for_lgbm_k9 = get_shap_values_lgbm_k9()
# Time: 7m 13.7s
LightGBM binary classifier with TreeExplainer shap values output has changed to a list of ndarray
Code
shap.summary_plot(shap_values_lgbm_k9[1], data_for_lgbm_k9, plot_type="bar")

Code
shap.summary_plot(shap_values_lgbm_k9[1], data_for_lgbm_k9)
No data for colormapping provided via 'c'. Parameters 'vmin', 'vmax' will be ignored

3.6 Final Model for Deployment

The final model will be trained on the whole dataset and deployed on Google Cloud Platform (GCP).

Code
to_include = str_to_list(
    """
risk_score_is_na
risk_score
employment_length_num
debt_to_income_ratio
month_cos
month_sin
employment_length_is_na
title_len
loan_amount
"""
)

X = pd.concat(
    [
        X_train_trees[to_include],
        X_validation_trees[to_include],
        X_test_trees[to_include],
    ],
    axis="index",
)

y = pd.concat([y_train, y_validation, y_test], axis="index")
Code
@my.cache_results(dir_interim + "01--model_predict_loan_status.pkl")
def fit_lgbm_final():
    """Fit the final LGBM model."""
    pipeline = Pipeline(
        steps=[
            ("selector", ColumnSelector(to_include)),
            ("preprocessor", clone(group_dependent_preprocessor_for_trees)),
            (
                "classifier",
                LGBMClassifier(
                    random_state=1,
                    class_weight="balanced",
                    objective="binary",
                    metric="binary_logloss",
                    n_jobs=-1,
                    device="gpu",
                    verbosity=1,
                ),
            ),
        ]
    )
    pipeline.fit(X, y)
    return pipeline


loan_status_predictor_final = fit_lgbm_final()

4 Task 2: Predicting Loan Grade, Subgrade, and Interest Rate

4.1 Inspection, EDA and Pre-Processing

4.1.1 Import and Inspect Data

For this task, only data from the most recent loan issue year (2018) will be used to reflect the most recent trends. This decision is based on the comparison of trends across years in Task 1.

To recall what the data in the file looks like, the first 10 rows are printed:

Code
!cd data/raw/ &&\
    head -n 10 accepted_2007_to_2018Q4.csv
id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,url,desc,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,fico_range_low,fico_range_high,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,next_pymnt_d,last_credit_pull_d,last_fico_range_high,last_fico_range_low,collections_12_mths_ex_med,mths_since_last_major_derog,policy_code,application_type,annual_inc_joint,dti_joint,verification_status_joint,acc_now_delinq,tot_coll_amt,tot_cur_bal,open_acc_6m,open_act_il,open_il_12m,open_il_24m,mths_since_rcnt_il,total_bal_il,il_util,open_rv_12m,open_rv_24m,max_bal_bc,all_util,total_rev_hi_lim,inq_fi,total_cu_tl,inq_last_12m,acc_open_past_24mths,avg_cur_bal,bc_open_to_buy,bc_util,chargeoff_within_12_mths,delinq_amnt,mo_sin_old_il_acct,mo_sin_old_rev_tl_op,mo_sin_rcnt_rev_tl_op,mo_sin_rcnt_tl,mort_acc,mths_since_recent_bc,mths_since_recent_bc_dlq,mths_since_recent_inq,mths_since_recent_revol_delinq,num_accts_ever_120_pd,num_actv_bc_tl,num_actv_rev_tl,num_bc_sats,num_bc_tl,num_il_tl,num_op_rev_tl,num_rev_accts,num_rev_tl_bal_gt_0,num_sats,num_tl_120dpd_2m,num_tl_30dpd,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,revol_bal_joint,sec_app_fico_range_low,sec_app_fico_range_high,sec_app_earliest_cr_line,sec_app_inq_last_6mths,sec_app_mort_acc,sec_app_open_acc,sec_app_revol_util,sec_app_open_act_il,sec_app_num_rev_accts,sec_app_chargeoff_within_12_mths,sec_app_collections_12_mths_ex_med,sec_app_mths_since_last_major_derog,hardship_flag,hardship_type,hardship_rea
son,hardship_status,deferral_term,hardship_amount,hardship_start_date,hardship_end_date,payment_plan_start_date,hardship_length,hardship_dpd,hardship_loan_status,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,disbursement_method,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term
68407277,,3600.0,3600.0,3600.0, 36 months,13.99,123.03,C,C4,leadman,10+ years,MORTGAGE,55000.0,Not Verified,Dec-2015,Fully Paid,n,https://lendingclub.com/browse/loanDetail.action?loan_id=68407277,,debt_consolidation,Debt consolidation,190xx,PA,5.91,0.0,Aug-2003,675.0,679.0,1.0,30.0,,7.0,0.0,2765.0,29.7,13.0,w,0.0,0.0,4421.723916800001,4421.72,3600.0,821.72,0.0,0.0,0.0,Jan-2019,122.67,,Mar-2019,564.0,560.0,0.0,30.0,1.0,Individual,,,,0.0,722.0,144904.0,2.0,2.0,0.0,1.0,21.0,4981.0,36.0,3.0,3.0,722.0,34.0,9300.0,3.0,1.0,4.0,4.0,20701.0,1506.0,37.2,0.0,0.0,148.0,128.0,3.0,3.0,1.0,4.0,69.0,4.0,69.0,2.0,2.0,4.0,2.0,5.0,3.0,4.0,9.0,4.0,7.0,0.0,0.0,0.0,3.0,76.9,0.0,0.0,0.0,178050.0,7746.0,2400.0,13734.0,,,,,,,,,,,,,,N,,,,,,,,,,,,,,,Cash,N,,,,,,
68355089,,24700.0,24700.0,24700.0, 36 months,11.99,820.28,C,C1,Engineer,10+ years,MORTGAGE,65000.0,Not Verified,Dec-2015,Fully Paid,n,https://lendingclub.com/browse/loanDetail.action?loan_id=68355089,,small_business,Business,577xx,SD,16.06,1.0,Dec-1999,715.0,719.0,4.0,6.0,,22.0,0.0,21470.0,19.2,38.0,w,0.0,0.0,25679.66,25679.66,24700.0,979.66,0.0,0.0,0.0,Jun-2016,926.35,,Mar-2019,699.0,695.0,0.0,,1.0,Individual,,,,0.0,0.0,204396.0,1.0,1.0,0.0,1.0,19.0,18005.0,73.0,2.0,3.0,6472.0,29.0,111800.0,0.0,0.0,6.0,4.0,9733.0,57830.0,27.1,0.0,0.0,113.0,192.0,2.0,2.0,4.0,2.0,,0.0,6.0,0.0,5.0,5.0,13.0,17.0,6.0,20.0,27.0,5.0,22.0,0.0,0.0,0.0,2.0,97.4,7.7,0.0,0.0,314017.0,39475.0,79300.0,24667.0,,,,,,,,,,,,,,N,,,,,,,,,,,,,,,Cash,N,,,,,,
68341763,,20000.0,20000.0,20000.0, 60 months,10.78,432.66,B,B4,truck driver,10+ years,MORTGAGE,63000.0,Not Verified,Dec-2015,Fully Paid,n,https://lendingclub.com/browse/loanDetail.action?loan_id=68341763,,home_improvement,,605xx,IL,10.78,0.0,Aug-2000,695.0,699.0,0.0,,,6.0,0.0,7869.0,56.2,18.0,w,0.0,0.0,22705.924293878397,22705.92,20000.0,2705.92,0.0,0.0,0.0,Jun-2017,15813.3,,Mar-2019,704.0,700.0,0.0,,1.0,Joint App,71000.0,13.85,Not Verified,0.0,0.0,189699.0,0.0,1.0,0.0,4.0,19.0,10827.0,73.0,0.0,2.0,2081.0,65.0,14000.0,2.0,5.0,1.0,6.0,31617.0,2737.0,55.9,0.0,0.0,125.0,184.0,14.0,14.0,5.0,101.0,,10.0,,0.0,2.0,3.0,2.0,4.0,6.0,4.0,7.0,3.0,6.0,0.0,0.0,0.0,0.0,100.0,50.0,0.0,0.0,218418.0,18696.0,6200.0,14877.0,,,,,,,,,,,,,,N,,,,,,,,,,,,,,,Cash,N,,,,,,
66310712,,35000.0,35000.0,35000.0, 60 months,14.85,829.9,C,C5,Information Systems Officer,10+ years,MORTGAGE,110000.0,Source Verified,Dec-2015,Current,n,https://lendingclub.com/browse/loanDetail.action?loan_id=66310712,,debt_consolidation,Debt consolidation,076xx,NJ,17.06,0.0,Sep-2008,785.0,789.0,0.0,,,13.0,0.0,7802.0,11.6,17.0,w,15897.65,15897.65,31464.01,31464.01,19102.35,12361.66,0.0,0.0,0.0,Feb-2019,829.9,Apr-2019,Mar-2019,679.0,675.0,0.0,,1.0,Individual,,,,0.0,0.0,301500.0,1.0,1.0,0.0,1.0,23.0,12609.0,70.0,1.0,1.0,6987.0,45.0,67300.0,0.0,1.0,0.0,2.0,23192.0,54962.0,12.1,0.0,0.0,36.0,87.0,2.0,2.0,1.0,2.0,,,,0.0,4.0,5.0,8.0,10.0,2.0,10.0,13.0,5.0,13.0,0.0,0.0,0.0,1.0,100.0,0.0,0.0,0.0,381215.0,52226.0,62500.0,18000.0,,,,,,,,,,,,,,N,,,,,,,,,,,,,,,Cash,N,,,,,,
68476807,,10400.0,10400.0,10400.0, 60 months,22.45,289.91,F,F1,Contract Specialist,3 years,MORTGAGE,104433.0,Source Verified,Dec-2015,Fully Paid,n,https://lendingclub.com/browse/loanDetail.action?loan_id=68476807,,major_purchase,Major purchase,174xx,PA,25.37,1.0,Jun-1998,695.0,699.0,3.0,12.0,,12.0,0.0,21929.0,64.5,35.0,w,0.0,0.0,11740.5,11740.5,10400.0,1340.5,0.0,0.0,0.0,Jul-2016,10128.96,,Mar-2018,704.0,700.0,0.0,,1.0,Individual,,,,0.0,0.0,331730.0,1.0,3.0,0.0,3.0,14.0,73839.0,84.0,4.0,7.0,9702.0,78.0,34000.0,2.0,1.0,3.0,10.0,27644.0,4567.0,77.5,0.0,0.0,128.0,210.0,4.0,4.0,6.0,4.0,12.0,1.0,12.0,0.0,4.0,6.0,5.0,9.0,10.0,7.0,19.0,6.0,12.0,0.0,0.0,0.0,4.0,96.6,60.0,0.0,0.0,439570.0,95768.0,20300.0,88097.0,,,,,,,,,,,,,,N,,,,,,,,,,,,,,,Cash,N,,,,,,
68426831,,11950.0,11950.0,11950.0, 36 months,13.44,405.18,C,C3,Veterinary Tecnician,4 years,RENT,34000.0,Source Verified,Dec-2015,Fully Paid,n,https://lendingclub.com/browse/loanDetail.action?loan_id=68426831,,debt_consolidation,Debt consolidation,300xx,GA,10.2,0.0,Oct-1987,690.0,694.0,0.0,,,5.0,0.0,8822.0,68.4,6.0,w,0.0,0.0,13708.9485297572,13708.95,11950.0,1758.95,0.0,0.0,0.0,May-2017,7653.56,,May-2017,759.0,755.0,0.0,,1.0,Individual,,,,0.0,0.0,12798.0,0.0,1.0,0.0,0.0,338.0,3976.0,99.0,0.0,0.0,4522.0,76.0,12900.0,0.0,0.0,0.0,0.0,2560.0,844.0,91.0,0.0,0.0,338.0,54.0,32.0,32.0,0.0,36.0,,,,0.0,2.0,3.0,2.0,2.0,2.0,4.0,4.0,3.0,5.0,0.0,0.0,0.0,0.0,100.0,100.0,0.0,0.0,16900.0,12798.0,9400.0,4000.0,,,,,,,,,,,,,,N,,,,,,,,,,,,,,,Cash,N,,,,,,
68476668,,20000.0,20000.0,20000.0, 36 months,9.17,637.58,B,B2,Vice President of Recruiting Operations,10+ years,MORTGAGE,180000.0,Not Verified,Dec-2015,Fully Paid,n,https://lendingclub.com/browse/loanDetail.action?loan_id=68476668,,debt_consolidation,Debt consolidation,550xx,MN,14.67,0.0,Jun-1990,680.0,684.0,0.0,49.0,,12.0,0.0,87329.0,84.5,27.0,f,0.0,0.0,21393.800000011,21393.8,20000.0,1393.8,0.0,0.0,0.0,Nov-2016,15681.05,,Mar-2019,654.0,650.0,0.0,,1.0,Individual,,,,0.0,0.0,360358.0,0.0,2.0,0.0,2.0,18.0,29433.0,63.0,2.0,3.0,13048.0,74.0,94200.0,1.0,0.0,1.0,6.0,30030.0,0.0,102.9,0.0,0.0,142.0,306.0,10.0,10.0,4.0,12.0,,10.0,,0.0,4.0,6.0,4.0,5.0,7.0,9.0,16.0,6.0,12.0,0.0,0.0,0.0,2.0,96.3,100.0,0.0,0.0,388852.0,116762.0,31500.0,46452.0,,,,,,,,,,,,,,N,,,,,,,,,,,,,,,Cash,N,,,,,,
67275481,,20000.0,20000.0,20000.0, 36 months,8.49,631.26,B,B1,road driver,10+ years,MORTGAGE,85000.0,Not Verified,Dec-2015,Fully Paid,n,https://lendingclub.com/browse/loanDetail.action?loan_id=67275481,,major_purchase,Major purchase,293xx,SC,17.61,1.0,Feb-1999,705.0,709.0,0.0,3.0,,8.0,0.0,826.0,5.7,15.0,w,0.0,0.0,21538.508976797,21538.51,20000.0,1538.51,0.0,0.0,0.0,Jan-2017,14618.23,,Mar-2019,674.0,670.0,0.0,3.0,1.0,Individual,,,,0.0,0.0,141601.0,0.0,3.0,0.0,4.0,13.0,27111.0,75.0,0.0,0.0,640.0,55.0,14500.0,1.0,0.0,2.0,4.0,17700.0,13674.0,5.7,0.0,0.0,149.0,55.0,32.0,13.0,3.0,32.0,,8.0,,1.0,2.0,2.0,3.0,3.0,9.0,3.0,3.0,2.0,8.0,0.0,0.0,1.0,0.0,93.3,0.0,0.0,0.0,193390.0,27937.0,14500.0,36144.0,,,,,,,,,,,,,,N,,,,,,,,,,,,,,,Cash,N,,,,,,
68466926,,10000.0,10000.0,10000.0, 36 months,6.49,306.45,A,A2,SERVICE MANAGER,6 years,RENT,85000.0,Not Verified,Dec-2015,Fully Paid,n,https://lendingclub.com/browse/loanDetail.action?loan_id=68466926,,credit_card,Credit card refinancing,160xx,PA,13.07,0.0,Apr-2002,685.0,689.0,1.0,,106.0,14.0,1.0,10464.0,34.5,23.0,w,0.0,0.0,10998.9715749644,10998.97,10000.0,998.97,0.0,0.0,0.0,Aug-2018,1814.48,,Mar-2019,719.0,715.0,0.0,,1.0,Individual,,,,0.0,8341.0,27957.0,2.0,1.0,0.0,0.0,35.0,17493.0,57.0,2.0,7.0,2524.0,46.0,30300.0,2.0,0.0,1.0,7.0,1997.0,8182.0,50.1,0.0,0.0,164.0,129.0,1.0,1.0,1.0,4.0,,1.0,,0.0,6.0,9.0,7.0,10.0,3.0,13.0,19.0,9.0,14.0,0.0,0.0,0.0,2.0,95.7,28.6,1.0,0.0,61099.0,27957.0,16400.0,30799.0,,,,,,,,,,,,,,N,,,,,,,,,,,,,,,Cash,N,,,,,,
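Given the size of the accepted-loans file (~1.6 GB, 151 columns), the planned memory optimizations — reading only the necessary columns and using more efficient data types — can be prototyped on a small sample before touching the full file. The sketch below is only an illustration of the idea: the column subset and dtype choices are assumptions for demonstration, not the project's final selection, and a tiny in-memory stand-in replaces `accepted_2007_to_2018Q4.csv`.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for accepted_2007_to_2018Q4.csv
# (the real file has ~2.3M rows and 151 columns).
csv_sample = io.StringIO(
    "loan_amnt,term,int_rate,grade,issue_d\n"
    "3600.0, 36 months,13.99,C,Dec-2015\n"
    "24700.0, 36 months,11.99,C,Dec-2015\n"
    "20000.0, 60 months,10.78,B,Dec-2015\n"
)

# Read only the needed columns with compact dtypes:
# - float32 halves the footprint of the default float64;
# - `category` is far cheaper than `object` for
#   low-cardinality strings such as `term` and `grade`.
loans = pd.read_csv(
    csv_sample,
    usecols=["loan_amnt", "term", "int_rate", "grade", "issue_d"],
    dtype={
        "loan_amnt": "float32",
        "int_rate": "float32",
        "term": "category",
        "grade": "category",
    },
    parse_dates=["issue_d"],
)

print(loans.dtypes)
print(f"{loans.memory_usage(deep=True).sum()} bytes")
```

On the full file, the same `usecols`/`dtype` arguments apply unchanged; only the path replaces the `StringIO` buffer.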
Code
!cd data/raw/ &&\
    head -n 10 accepted_2007_to_2018Q4.csv | csvlook
|         id | member_id | loan_amnt | funded_amnt | funded_amnt_inv |       term | int_rate | installment | grade | sub_grade | emp_title                               | emp_length | home_ownership | annual_inc | verification_status | issue_d  | loan_status | pymnt_plan | url                                                               | desc | purpose            | title                   | zip_code | addr_state |   dti | delinq_2yrs | earliest_cr_line | fico_range_low | fico_range_high | inq_last_6mths | mths_since_last_delinq | mths_since_last_record | open_acc | pub_rec | revol_bal | revol_util | total_acc | initial_list_status | out_prncp | out_prncp_inv | total_pymnt | total_pymnt_inv | total_rec_prncp | total_rec_int | total_rec_late_fee | recoveries | collection_recovery_fee | last_pymnt_d | last_pymnt_amnt | next_pymnt_d | last_credit_pull_d | last_fico_range_high | last_fico_range_low | collections_12_mths_ex_med | mths_since_last_major_derog | policy_code | application_type | annual_inc_joint | dti_joint | verification_status_joint | acc_now_delinq | tot_coll_amt | tot_cur_bal | open_acc_6m | open_act_il | open_il_12m | open_il_24m | mths_since_rcnt_il | total_bal_il | il_util | open_rv_12m | open_rv_24m | max_bal_bc | all_util | total_rev_hi_lim | inq_fi | total_cu_tl | inq_last_12m | acc_open_past_24mths | avg_cur_bal | bc_open_to_buy | bc_util | chargeoff_within_12_mths | delinq_amnt | mo_sin_old_il_acct | mo_sin_old_rev_tl_op | mo_sin_rcnt_rev_tl_op | mo_sin_rcnt_tl | mort_acc | mths_since_recent_bc | mths_since_recent_bc_dlq | mths_since_recent_inq | mths_since_recent_revol_delinq | num_accts_ever_120_pd | num_actv_bc_tl | num_actv_rev_tl | num_bc_sats | num_bc_tl | num_il_tl | num_op_rev_tl | num_rev_accts | num_rev_tl_bal_gt_0 | num_sats | num_tl_120dpd_2m | num_tl_30dpd | num_tl_90g_dpd_24m | num_tl_op_past_12m | pct_tl_nvr_dlq | percent_bc_gt_75 | pub_rec_bankruptcies | tax_liens | tot_hi_cred_lim | total_bal_ex_mort | total_bc_limit | 
total_il_high_credit_limit | revol_bal_joint | sec_app_fico_range_low | sec_app_fico_range_high | sec_app_earliest_cr_line | sec_app_inq_last_6mths | sec_app_mort_acc | sec_app_open_acc | sec_app_revol_util | sec_app_open_act_il | sec_app_num_rev_accts | sec_app_chargeoff_within_12_mths | sec_app_collections_12_mths_ex_med | sec_app_mths_since_last_major_derog | hardship_flag | hardship_type | hardship_reason | hardship_status | deferral_term | hardship_amount | hardship_start_date | hardship_end_date | payment_plan_start_date | hardship_length | hardship_dpd | hardship_loan_status | orig_projected_additional_accrued_interest | hardship_payoff_balance_amount | hardship_last_payment_amount | disbursement_method | debt_settlement_flag | debt_settlement_flag_date | settlement_status | settlement_date | settlement_amount | settlement_percentage | settlement_term |
| ---------- | --------- | --------- | ----------- | --------------- | ---------- | -------- | ----------- | ----- | --------- | --------------------------------------- | ---------- | -------------- | ---------- | ------------------- | -------- | ----------- | ---------- | ----------------------------------------------------------------- | ---- | ------------------ | ----------------------- | -------- | ---------- | ----- | ----------- | ---------------- | -------------- | --------------- | -------------- | ---------------------- | ---------------------- | -------- | ------- | --------- | ---------- | --------- | ------------------- | --------- | ------------- | ----------- | --------------- | --------------- | ------------- | ------------------ | ---------- | ----------------------- | ------------ | --------------- | ------------ | ------------------ | -------------------- | ------------------- | -------------------------- | --------------------------- | ----------- | ---------------- | ---------------- | --------- | ------------------------- | -------------- | ------------ | ----------- | ----------- | ----------- | ----------- | ----------- | ------------------ | ------------ | ------- | ----------- | ----------- | ---------- | -------- | ---------------- | ------ | ----------- | ------------ | -------------------- | ----------- | -------------- | ------- | ------------------------ | ----------- | ------------------ | -------------------- | --------------------- | -------------- | -------- | -------------------- | ------------------------ | --------------------- | ------------------------------ | --------------------- | -------------- | --------------- | ----------- | --------- | --------- | ------------- | ------------- | ------------------- | -------- | ---------------- | ------------ | ------------------ | ------------------ | -------------- | ---------------- | -------------------- | --------- | --------------- | ----------------- | -------------- | 
-------------------------- | --------------- | ---------------------- | ----------------------- | ------------------------ | ---------------------- | ---------------- | ---------------- | ------------------ | ------------------- | --------------------- | -------------------------------- | ---------------------------------- | ----------------------------------- | ------------- | ------------- | --------------- | --------------- | ------------- | --------------- | ------------------- | ----------------- | ----------------------- | --------------- | ------------ | -------------------- | ------------------------------------------ | ------------------------------ | ---------------------------- | ------------------- | -------------------- | ------------------------- | ----------------- | --------------- | ----------------- | --------------------- | --------------- |
| 68,407,277 |           |     3,600 |       3,600 |           3,600 | 0004-01-01 |    13.99 |      123.03 | C     | C4        | leadman                                 | 10+ years  | MORTGAGE       |     55,000 | Not Verified        | Dec-2015 | Fully Paid  |      False | https://lendingclub.com/browse/loanDetail.action?loan_id=68407277 |      | debt_consolidation | Debt consolidation      | 190xx    | PA         |  5.91 |           0 | Aug-2003         |            675 |             679 |              1 |                     30 |                        |        7 |       0 |     2,765 |       29.7 |        13 | w                   |      0.00 |          0.00 |  4,421.724… |        4,421.72 |        3,600.00 |        821.72 |                  0 |          0 |                       0 | Jan-2019     |          122.67 |              | Mar-2019           |                  564 |                 560 |                          0 |                          30 |           1 | Individual       |                  |           |                           |              0 |          722 |     144,904 |           2 |           2 |           0 |           1 |                 21 |        4,981 |      36 |           3 |           3 |        722 |       34 |            9,300 |      3 |           1 |            4 |                    4 |      20,701 |          1,506 |    37.2 |                        0 |           0 |                148 |                  128 |                     3 |              3 |        1 |                    4 |                       69 |                     4 |                             69 |                     2 |              2 |               4 |           2 |         5 |         3 |             4 |             9 |                   4 |        7 |                0 |            0 |                  0 |                  3 |           76.9 |              0.0 |                    0 |         0 |         178,050 |             7,746 |          2,400 |          
           13,734 |                 |                        |                         |                          |                        |                  |                  |                    |                     |                       |                                  |                                    |                                     |         False |               |                 |                 |               |                 |                     |                   |                         |                 |              |                      |                                            |                                |                              | Cash                |                False |                           |                   |                 |                   |                       |                 |
| 68,355,089 |           |    24,700 |      24,700 |          24,700 | 0004-01-01 |    11.99 |      820.28 | C     | C1        | Engineer                                | 10+ years  | MORTGAGE       |     65,000 | Not Verified        | Dec-2015 | Fully Paid  |      False | https://lendingclub.com/browse/loanDetail.action?loan_id=68355089 |      | small_business     | Business                | 577xx    | SD         | 16.06 |           1 | Dec-1999         |            715 |             719 |              4 |                      6 |                        |       22 |       0 |    21,470 |       19.2 |        38 | w                   |      0.00 |          0.00 | 25,679.660… |       25,679.66 |       24,700.00 |        979.66 |                  0 |          0 |                       0 | Jun-2016     |          926.35 |              | Mar-2019           |                  699 |                 695 |                          0 |                             |           1 | Individual       |                  |           |                           |              0 |            0 |     204,396 |           1 |           1 |           0 |           1 |                 19 |       18,005 |      73 |           2 |           3 |      6,472 |       29 |          111,800 |      0 |           0 |            6 |                    4 |       9,733 |         57,830 |    27.1 |                        0 |           0 |                113 |                  192 |                     2 |              2 |        4 |                    2 |                          |                     0 |                              6 |                     0 |              5 |               5 |          13 |        17 |         6 |            20 |            27 |                   5 |       22 |                0 |            0 |                  0 |                  2 |           97.4 |              7.7 |                    0 |         0 |         314,017 |            39,475 |         79,300 |          
           24,667 |                 |                        |                         |                          |                        |                  |                  |                    |                     |                       |                                  |                                    |                                     |         False |               |                 |                 |               |                 |                     |                   |                         |                 |              |                      |                                            |                                |                              | Cash                |                False |                           |                   |                 |                   |                       |                 |
| 68,341,763 |           |    20,000 |      20,000 |          20,000 | 0006-01-01 |    10.78 |      432.66 | B     | B4        | truck driver                            | 10+ years  | MORTGAGE       |     63,000 | Not Verified        | Dec-2015 | Fully Paid  |      False | https://lendingclub.com/browse/loanDetail.action?loan_id=68341763 |      | home_improvement   |                         | 605xx    | IL         | 10.78 |           0 | Aug-2000         |            695 |             699 |              0 |                        |                        |        6 |       0 |     7,869 |       56.2 |        18 | w                   |      0.00 |          0.00 | 22,705.924… |       22,705.92 |       20,000.00 |      2,705.92 |                  0 |          0 |                       0 | Jun-2017     |       15,813.30 |              | Mar-2019           |                  704 |                 700 |                          0 |                             |           1 | Joint App        |           71,000 |     13.85 | Not Verified              |              0 |            0 |     189,699 |           0 |           1 |           0 |           4 |                 19 |       10,827 |      73 |           0 |           2 |      2,081 |       65 |           14,000 |      2 |           5 |            1 |                    6 |      31,617 |          2,737 |    55.9 |                        0 |           0 |                125 |                  184 |                    14 |             14 |        5 |                  101 |                          |                    10 |                                |                     0 |              2 |               3 |           2 |         4 |         6 |             4 |             7 |                   3 |        6 |                0 |            0 |                  0 |                  0 |          100.0 |             50.0 |                    0 |         0 |         218,418 |            18,696 |          6,200 |          
           14,877 |                 |                        |                         |                          |                        |                  |                  |                    |                     |                       |                                  |                                    |                                     |         False |               |                 |                 |               |                 |                     |                   |                         |                 |              |                      |                                            |                                |                              | Cash                |                False |                           |                   |                 |                   |                       |                 |
| 66,310,712 |           |    35,000 |      35,000 |          35,000 | 0006-01-01 |    14.85 |      829.90 | C     | C5        | Information Systems Officer             | 10+ years  | MORTGAGE       |    110,000 | Source Verified     | Dec-2015 | Current     |      False | https://lendingclub.com/browse/loanDetail.action?loan_id=66310712 |      | debt_consolidation | Debt consolidation      | 076xx    | NJ         | 17.06 |           0 | Sep-2008         |            785 |             789 |              0 |                        |                        |       13 |       0 |     7,802 |       11.6 |        17 | w                   | 15,897.65 |     15,897.65 | 31,464.010… |       31,464.01 |       19,102.35 |     12,361.66 |                  0 |          0 |                       0 | Feb-2019     |          829.90 | Apr-2019     | Mar-2019           |                  679 |                 675 |                          0 |                             |           1 | Individual       |                  |           |                           |              0 |            0 |     301,500 |           1 |           1 |           0 |           1 |                 23 |       12,609 |      70 |           1 |           1 |      6,987 |       45 |           67,300 |      0 |           1 |            0 |                    2 |      23,192 |         54,962 |    12.1 |                        0 |           0 |                 36 |                   87 |                     2 |              2 |        1 |                    2 |                          |                       |                                |                     0 |              4 |               5 |           8 |        10 |         2 |            10 |            13 |                   5 |       13 |                0 |            0 |                  0 |                  1 |          100.0 |              0.0 |                    0 |         0 |         381,215 |            52,226 |         62,500 |          
           18,000 |                 |                        |                         |                          |                        |                  |                  |                    |                     |                       |                                  |                                    |                                     |         False |               |                 |                 |               |                 |                     |                   |                         |                 |              |                      |                                            |                                |                              | Cash                |                False |                           |                   |                 |                   |                       |                 |
| 68,476,807 |           |    10,400 |      10,400 |          10,400 | 0006-01-01 |    22.45 |      289.91 | F     | F1        | Contract Specialist                     | 3 years    | MORTGAGE       |    104,433 | Source Verified     | Dec-2015 | Fully Paid  |      False | https://lendingclub.com/browse/loanDetail.action?loan_id=68476807 |      | major_purchase     | Major purchase          | 174xx    | PA         | 25.37 |           1 | Jun-1998         |            695 |             699 |              3 |                     12 |                        |       12 |       0 |    21,929 |       64.5 |        35 | w                   |      0.00 |          0.00 | 11,740.500… |       11,740.50 |       10,400.00 |      1,340.50 |                  0 |          0 |                       0 | Jul-2016     |       10,128.96 |              | Mar-2018           |                  704 |                 700 |                          0 |                             |           1 | Individual       |                  |           |                           |              0 |            0 |     331,730 |           1 |           3 |           0 |           3 |                 14 |       73,839 |      84 |           4 |           7 |      9,702 |       78 |           34,000 |      2 |           1 |            3 |                   10 |      27,644 |          4,567 |    77.5 |                        0 |           0 |                128 |                  210 |                     4 |              4 |        6 |                    4 |                       12 |                     1 |                             12 |                     0 |              4 |               6 |           5 |         9 |        10 |             7 |            19 |                   6 |       12 |                0 |            0 |                  0 |                  4 |           96.6 |             60.0 |                    0 |         0 |         439,570 |            95,768 |         20,300 |          
           88,097 |                 |                        |                         |                          |                        |                  |                  |                    |                     |                       |                                  |                                    |                                     |         False |               |                 |                 |               |                 |                     |                   |                         |                 |              |                      |                                            |                                |                              | Cash                |                False |                           |                   |                 |                   |                       |                 |
| 68,426,831 |           |    11,950 |      11,950 |          11,950 | 0004-01-01 |    13.44 |      405.18 | C     | C3        | Veterinary Tecnician                    | 4 years    | RENT           |     34,000 | Source Verified     | Dec-2015 | Fully Paid  |      False | https://lendingclub.com/browse/loanDetail.action?loan_id=68426831 |      | debt_consolidation | Debt consolidation      | 300xx    | GA         | 10.20 |           0 | Oct-1987         |            690 |             694 |              0 |                        |                        |        5 |       0 |     8,822 |       68.4 |         6 | w                   |      0.00 |          0.00 | 13,708.949… |       13,708.95 |       11,950.00 |      1,758.95 |                  0 |          0 |                       0 | May-2017     |        7,653.56 |              | May-2017           |                  759 |                 755 |                          0 |                             |           1 | Individual       |                  |           |                           |              0 |            0 |      12,798 |           0 |           1 |           0 |           0 |                338 |        3,976 |      99 |           0 |           0 |      4,522 |       76 |           12,900 |      0 |           0 |            0 |                    0 |       2,560 |            844 |    91.0 |                        0 |           0 |                338 |                   54 |                    32 |             32 |        0 |                   36 |                          |                       |                                |                     0 |              2 |               3 |           2 |         2 |         2 |             4 |             4 |                   3 |        5 |                0 |            0 |                  0 |                  0 |          100.0 |            100.0 |                    0 |         0 |          16,900 |            12,798 |          9,400 |          
            4,000 |                 |                        |                         |                          |                        |                  |                  |                    |                     |                       |                                  |                                    |                                     |         False |               |                 |                 |               |                 |                     |                   |                         |                 |              |                      |                                            |                                |                              | Cash                |                False |                           |                   |                 |                   |                       |                 |
| 68,476,668 |           |    20,000 |      20,000 |          20,000 | 0004-01-01 |     9.17 |      637.58 | B     | B2        | Vice President of Recruiting Operations | 10+ years  | MORTGAGE       |    180,000 | Not Verified        | Dec-2015 | Fully Paid  |      False | https://lendingclub.com/browse/loanDetail.action?loan_id=68476668 |      | debt_consolidation | Debt consolidation      | 550xx    | MN         | 14.67 |           0 | Jun-1990         |            680 |             684 |              0 |                     49 |                        |       12 |       0 |    87,329 |       84.5 |        27 | f                   |      0.00 |          0.00 | 21,393.800… |       21,393.80 |       20,000.00 |      1,393.80 |                  0 |          0 |                       0 | Nov-2016     |       15,681.05 |              | Mar-2019           |                  654 |                 650 |                          0 |                             |           1 | Individual       |                  |           |                           |              0 |            0 |     360,358 |           0 |           2 |           0 |           2 |                 18 |       29,433 |      63 |           2 |           3 |     13,048 |       74 |           94,200 |      1 |           0 |            1 |                    6 |      30,030 |              0 |   102.9 |                        0 |           0 |                142 |                  306 |                    10 |             10 |        4 |                   12 |                          |                    10 |                                |                     0 |              4 |               6 |           4 |         5 |         7 |             9 |            16 |                   6 |       12 |                0 |            0 |                  0 |                  2 |           96.3 |            100.0 |                    0 |         0 |         388,852 |           116,762 |         31,500 |          
           46,452 |                 |                        |                         |                          |                        |                  |                  |                    |                     |                       |                                  |                                    |                                     |         False |               |                 |                 |               |                 |                     |                   |                         |                 |              |                      |                                            |                                |                              | Cash                |                False |                           |                   |                 |                   |                       |                 |
| 67,275,481 |           |    20,000 |      20,000 |          20,000 | 0004-01-01 |     8.49 |      631.26 | B     | B1        | road driver                             | 10+ years  | MORTGAGE       |     85,000 | Not Verified        | Dec-2015 | Fully Paid  |      False | https://lendingclub.com/browse/loanDetail.action?loan_id=67275481 |      | major_purchase     | Major purchase          | 293xx    | SC         | 17.61 |           1 | Feb-1999         |            705 |             709 |              0 |                      3 |                        |        8 |       0 |       826 |        5.7 |        15 | w                   |      0.00 |          0.00 | 21,538.509… |       21,538.51 |       20,000.00 |      1,538.51 |                  0 |          0 |                       0 | Jan-2017     |       14,618.23 |              | Mar-2019           |                  674 |                 670 |                          0 |                           3 |           1 | Individual       |                  |           |                           |              0 |            0 |     141,601 |           0 |           3 |           0 |           4 |                 13 |       27,111 |      75 |           0 |           0 |        640 |       55 |           14,500 |      1 |           0 |            2 |                    4 |      17,700 |         13,674 |     5.7 |                        0 |           0 |                149 |                   55 |                    32 |             13 |        3 |                   32 |                          |                     8 |                                |                     1 |              2 |               2 |           3 |         3 |         9 |             3 |             3 |                   2 |        8 |                0 |            0 |                  1 |                  0 |           93.3 |              0.0 |                    0 |         0 |         193,390 |            27,937 |         14,500 |          
           36,144 |                 |                        |                         |                          |                        |                  |                  |                    |                     |                       |                                  |                                    |                                     |         False |               |                 |                 |               |                 |                     |                   |                         |                 |              |                      |                                            |                                |                              | Cash                |                False |                           |                   |                 |                   |                       |                 |
| 68,466,926 |           |    10,000 |      10,000 |          10,000 | 0004-01-01 |     6.49 |      306.45 | A     | A2        | SERVICE MANAGER                         | 6 years    | RENT           |     85,000 | Not Verified        | Dec-2015 | Fully Paid  |      False | https://lendingclub.com/browse/loanDetail.action?loan_id=68466926 |      | credit_card        | Credit card refinancing | 160xx    | PA         | 13.07 |           0 | Apr-2002         |            685 |             689 |              1 |                        |                    106 |       14 |       1 |    10,464 |       34.5 |        23 | w                   |      0.00 |          0.00 | 10,998.972… |       10,998.97 |       10,000.00 |        998.97 |                  0 |          0 |                       0 | Aug-2018     |        1,814.48 |              | Mar-2019           |                  719 |                 715 |                          0 |                             |           1 | Individual       |                  |           |                           |              0 |        8,341 |      27,957 |           2 |           1 |           0 |           0 |                 35 |       17,493 |      57 |           2 |           7 |      2,524 |       46 |           30,300 |      2 |           0 |            1 |                    7 |       1,997 |          8,182 |    50.1 |                        0 |           0 |                164 |                  129 |                     1 |              1 |        1 |                    4 |                          |                     1 |                                |                     0 |              6 |               9 |           7 |        10 |         3 |            13 |            19 |                   9 |       14 |                0 |            0 |                  0 |                  2 |           95.7 |             28.6 |                    1 |         0 |          61,099 |            27,957 |         16,400 |          
           30,799 |                 |                        |                         |                          |                        |                  |                  |                    |                     |                       |                                  |                                    |                                     |         False |               |                 |                 |               |                 |                     |                   |                         |                 |              |                      |                                            |                                |                              | Cash                |                False |                           |                   |                 |                   |                       |                 |

Next, the data will be imported, only loans issued in 2018 will be kept, and data types will be automatically adjusted to reduce the memory footprint. The results will be saved into a feather file.

Code
file_path = dir_interim + "task-2--1-accepted_loans_2018--raw.feather"
if os.path.exists(file_path):
    # Restore from file, if present.
    accepted_2018 = pd.read_feather(file_path)
else:
    # Use Pyarrow backend to read data.
    # Leave only data of loans issued in year 2018.
    accepted_all = pd.read_csv("data/raw/accepted_2007_to_2018Q4.csv", engine="pyarrow")
    accepted_2018 = accepted_all[accepted_all["issue_d"].str.contains("2018", na=False)]
    accepted_2018 = klib.convert_datatypes(accepted_2018)
    accepted_2018.to_feather(file_path)
    del accepted_all

del file_path

The whole dataset took 2.5+ GB of memory (code not shown here). After keeping only the year 2018, it used 574.3+ MB (code not shown here). After automatic data type adjustment, the memory usage dropped almost 2-fold to 267.4 MB.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2260701 entries, 0 to 2260700
Columns: 151 entries, id to settlement_term
dtypes: float64(113), object(38)
memory usage: 2.5+ GB
<class 'pandas.core.frame.DataFrame'>
Index: 495242 entries, 421097 to 1611876
Columns: 151 entries, id to settlement_term
dtypes: float64(113), object(38)
memory usage: 574.3+ MB
Code
accepted_2018.info(verbose=False)
<class 'pandas.core.frame.DataFrame'>
Index: 495242 entries, 421097 to 1611876
Columns: 151 entries, id to settlement_term
dtypes: category(35), float32(102), float64(11), string(3)
memory usage: 267.4 MB
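The effect of such data type adjustments can be reproduced on a small synthetic frame (a sketch with made-up columns; the exact savings depend on the data, but downcasting floats and converting low-cardinality strings to `category` typically shrinks memory severalfold):

```python
import numpy as np
import pandas as pd

# Synthetic frame mimicking the raw dtypes: float64 numbers, object strings.
n = 100_000
rng = np.random.default_rng(42)
raw = pd.DataFrame({
    "int_rate": rng.uniform(5, 30, n),                              # float64
    "grade": rng.choice(list("ABCDEFG"), n).astype(object),         # object
})

# Manual equivalent of the automatic adjustment: downcast floats,
# convert low-cardinality strings to the category dtype.
optimized = raw.assign(
    int_rate=raw["int_rate"].astype("float32"),
    grade=raw["grade"].astype("category"),
)

mb = lambda df: df.memory_usage(deep=True).sum() / 1e6
print(f"raw: {mb(raw):.1f} MB, optimized: {mb(optimized):.1f} MB")
```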
Note

In this section, a general overview and inspection will be done to get more familiar with the data and to catch the most obvious discrepancies. A detailed EDA will be performed on the training set only.

To spot possible discrepancies, the columns were inspected and the results were sorted by the number of unique values (n_unique). Some columns have no non-missing values at all (member_id, desc), and some have only one distinct non-missing value (hardship_type, deferral_term, etc.); such columns carry no information and should be excluded from the analysis.
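The inspection below uses a project-specific helper (`an.col_info`), but the empty/constant-column check itself can be sketched with plain pandas on a toy frame:

```python
import numpy as np
import pandas as pd

# Toy frame: one empty column, one constant column, one informative column.
df = pd.DataFrame({
    "member_id": [np.nan, np.nan, np.nan],   # no non-missing values
    "policy_code": [1.0, 1.0, 1.0],          # a single constant value
    "grade": ["A", "B", "C"],                # informative
})

# nunique() ignores missing values, so both empty and constant
# columns end up with n_unique <= 1 and get flagged for removal.
n_unique = df.nunique(dropna=True)
to_drop = n_unique[n_unique <= 1].index.tolist()
print(to_drop)  # ['member_id', 'policy_code']
```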

Code
an.col_info(accepted_2018).sort_values("n_unique").pipe(an.style_col_info)
  column data_type memory_size n_unique p_unique n_missing p_missing n_dominant p_dominant p_dom_excl_na dominant
2 member_id float32 2.0 MB 0 0% 495,242 100.0% 0 0% nan%
20 desc category 495.4 kB 0 0% 495,242 100.0% 0 0% nan%
130 hardship_type category 495.4 kB 1 <0.1% 494,874 99.9% 368 0.1% 100.0% INTEREST ONLY-3 MONTHS DEFERRAL
133 deferral_term float32 2.0 MB 1 <0.1% 494,874 99.9% 368 0.1% 100.0% 3.0
138 hardship_length float32 2.0 MB 1 <0.1% 494,874 99.9% 368 0.1% 100.0% 3.0
104 num_tl_120dpd_2m float32 2.0 MB 1 <0.1% 12,404 2.5% 482,838 97.5% 100.0% 0.0
56 policy_code float32 2.0 MB 1 <0.1% 0 0% 495,242 100.0% 100.0% 1.0
145 debt_settlement_flag category 495.5 kB 2 <0.1% 0 0% 494,762 99.9% 99.9% N
144 disbursement_method category 495.5 kB 2 <0.1% 0 0% 423,884 85.6% 85.6% Cash
105 num_tl_30dpd float32 2.0 MB 2 <0.1% 0 0% 495,219 >99.9% >99.9% 0.0
38 initial_list_status category 495.5 kB 2 <0.1% 0 0% 427,183 86.3% 86.3% w
129 hardship_flag category 495.5 kB 2 <0.1% 0 0% 495,053 >99.9% >99.9% N
57 application_type category 495.5 kB 2 <0.1% 0 0% 426,257 86.1% 86.1% Individual
61 acc_now_delinq float32 2.0 MB 2 <0.1% 0 0% 495,216 >99.9% >99.9% 0.0
6 term category 495.5 kB 2 <0.1% 0 0% 344,671 69.6% 69.6% 36 months
18 pymnt_plan category 495.5 kB 2 <0.1% 0 0% 495,091 >99.9% >99.9% n
50 next_pymnt_d category 495.5 kB 3 <0.1% 56,311 11.4% 438,776 88.6% >99.9% Apr-2019
132 hardship_status category 495.5 kB 3 <0.1% 494,874 99.9% 189 <0.1% 51.4% ACTIVE
60 verification_status_joint category 495.6 kB 3 <0.1% 431,231 87.1% 28,298 5.7% 44.2% Not Verified
147 settlement_status category 495.5 kB 3 <0.1% 494,762 99.9% 391 0.1% 81.5% ACTIVE
15 verification_status category 495.6 kB 3 <0.1% 0 0% 199,934 40.4% 40.4% Not Verified
13 home_ownership category 495.7 kB 4 <0.1% 0 0% 239,220 48.3% 48.3% MORTGAGE
140 hardship_loan_status category 495.8 kB 5 <0.1% 494,874 99.9% 150 <0.1% 40.8% Late (16-30 days)
30 inq_last_6mths float32 2.0 MB 6 <0.1% 0 0% 332,652 67.2% 67.2% 0.0
17 loan_status category 496.0 kB 7 <0.1% 0 0% 427,181 86.3% 86.3% Current
9 grade category 495.9 kB 7 <0.1% 0 0% 141,365 28.5% 28.5% B
120 sec_app_inq_last_6mths float32 2.0 MB 7 <0.1% 426,257 86.1% 42,442 8.6% 61.5% 0.0
66 open_il_12m float32 2.0 MB 8 <0.1% 0 0% 272,591 55.0% 55.0% 0.0
110 pub_rec_bankruptcies float32 2.0 MB 8 <0.1% 0 0% 434,943 87.8% 87.8% 0.0
54 collections_12_mths_ex_med float32 2.0 MB 9 <0.1% 0 0% 487,215 98.4% 98.4% 0.0
131 hardship_reason category 496.2 kB 9 <0.1% 494,874 99.9% 96 <0.1% 26.1% UNEMPLOYMENT
136 hardship_end_date category 496.1 kB 9 <0.1% 494,874 99.9% 87 <0.1% 23.6% May-2019
135 hardship_start_date category 496.1 kB 9 <0.1% 494,874 99.9% 98 <0.1% 26.6% Mar-2019
83 chargeoff_within_12_mths float32 2.0 MB 9 <0.1% 0 0% 492,165 99.4% 99.4% 0.0
137 payment_plan_start_date category 496.1 kB 9 <0.1% 494,874 99.9% 91 <0.1% 24.7% Mar-2019
146 debt_settlement_flag_date category 496.2 kB 10 <0.1% 494,762 99.9% 142 <0.1% 29.6% Mar-2019
148 settlement_date category 496.3 kB 11 <0.1% 494,762 99.9% 103 <0.1% 21.5% Jan-2019
12 emp_length category 496.2 kB 11 <0.1% 41,987 8.5% 160,382 32.4% 35.4% 10+ years
16 issue_d category 496.3 kB 12 <0.1% 0 0% 46,311 9.4% 9.4% May-2018
22 title category 496.4 kB 12 <0.1% 0 0% 259,642 52.4% 52.4% Debt consolidation
21 purpose category 496.7 kB 13 <0.1% 0 0% 259,642 52.4% 52.4% debt_consolidation
64 open_acc_6m float32 2.0 MB 15 <0.1% 0 0% 231,308 46.7% 46.7% 0.0
48 last_pymnt_d category 496.8 kB 15 <0.1% 640 0.1% 407,215 82.2% 82.3% Mar-2019
34 pub_rec float32 2.0 MB 16 <0.1% 0 0% 432,258 87.3% 87.3% 0.0
111 tax_liens float32 2.0 MB 16 <0.1% 0 0% 491,903 99.3% 99.3% 0.0
127 sec_app_collections_12_mths_ex_med float32 2.0 MB 17 <0.1% 426,257 86.1% 65,305 13.2% 94.7% 0.0
51 last_credit_pull_d category 497.0 kB 18 <0.1% 5 <0.1% 458,186 92.5% 92.5% Mar-2019
151 settlement_term float32 2.0 MB 19 <0.1% 494,762 99.9% 177 <0.1% 36.9% 24.0
67 open_il_24m float32 2.0 MB 21 <0.1% 0 0% 155,968 31.5% 31.5% 1.0
126 sec_app_chargeoff_within_12_mths float32 2.0 MB 21 <0.1% 426,257 86.1% 67,272 13.6% 97.5% 0.0
121 sec_app_mort_acc float32 2.0 MB 22 <0.1% 426,257 86.1% 27,029 5.5% 39.2% 0.0
71 open_rv_12m float32 2.0 MB 24 <0.1% 0 0% 195,241 39.4% 39.4% 0.0
106 num_tl_90g_dpd_24m float32 2.0 MB 25 <0.1% 0 0% 475,280 96.0% 96.0% 0.0
26 delinq_2yrs float32 2.0 MB 26 <0.1% 0 0% 422,718 85.4% 85.4% 0.0
107 num_tl_op_past_12m float32 2.0 MB 26 <0.1% 0 0% 127,126 25.7% 25.7% 1.0
92 mths_since_recent_inq float32 2.0 MB 26 <0.1% 61,305 12.4% 41,585 8.4% 9.6% 1.0
139 hardship_dpd float32 2.0 MB 30 <0.1% 494,874 99.9% 61 <0.1% 16.6% 0.0
89 mort_acc float32 2.0 MB 30 <0.1% 0 0% 226,841 45.8% 45.8% 0.0
76 inq_fi float32 2.0 MB 30 <0.1% 0 0% 227,365 45.9% 45.9% 0.0
10 sub_grade category 498.4 kB 35 <0.1% 0 0% 31,728 6.4% 6.4% B4
95 num_actv_bc_tl float32 2.0 MB 35 <0.1% 0 0% 101,502 20.5% 20.5% 2.0
124 sec_app_open_act_il float32 2.0 MB 37 <0.1% 426,257 86.1% 15,286 3.1% 22.2% 1.0
94 num_accts_ever_120_pd float32 2.0 MB 38 <0.1% 0 0% 391,216 79.0% 79.0% 0.0
29 fico_range_high float32 2.0 MB 38 <0.1% 0 0% 32,103 6.5% 6.5% 684.0
28 fico_range_low float32 2.0 MB 38 <0.1% 0 0% 32,103 6.5% 6.5% 680.0
72 open_rv_24m float32 2.0 MB 41 <0.1% 0 0% 110,850 22.4% 22.4% 1.0
102 num_rev_tl_bal_gt_0 float32 2.0 MB 45 <0.1% 0 0% 74,540 15.1% 15.1% 4.0
79 acc_open_past_24mths float32 2.0 MB 45 <0.1% 0 0% 74,308 15.0% 15.0% 3.0
78 inq_last_12m float32 2.0 MB 45 <0.1% 0 0% 153,733 31.0% 31.0% 0.0
150 settlement_percentage float32 2.0 MB 45 <0.1% 494,762 99.9% 145 <0.1% 30.2% 45.0
65 open_act_il float32 2.0 MB 48 <0.1% 0 0% 126,903 25.6% 25.6% 1.0
96 num_actv_rev_tl float32 2.0 MB 49 <0.1% 0 0% 74,082 15.0% 15.0% 4.0
24 addr_state category 500.3 kB 50 <0.1% 0 0% 67,267 13.6% 13.6% CA
77 total_cu_tl float32 2.0 MB 50 <0.1% 0 0% 265,016 53.5% 53.5% 0.0
97 num_bc_sats float32 2.0 MB 51 <0.1% 0 0% 82,277 16.6% 16.6% 3.0
118 sec_app_fico_range_high float32 2.0 MB 62 <0.1% 426,257 86.1% 3,843 0.8% 5.6% 674.0
117 sec_app_fico_range_low float32 2.0 MB 62 <0.1% 426,257 86.1% 3,843 0.8% 5.6% 670.0
122 sec_app_open_acc float32 2.0 MB 63 <0.1% 426,257 86.1% 4,906 1.0% 7.1% 9.0
98 num_bc_tl float32 2.0 MB 65 <0.1% 0 0% 56,940 11.5% 11.5% 4.0
53 last_fico_range_low float32 2.0 MB 71 <0.1% 0 0% 22,249 4.5% 4.5% 695.0
52 last_fico_range_high float32 2.0 MB 72 <0.1% 0 0% 22,249 4.5% 4.5% 699.0
100 num_op_rev_tl float32 2.0 MB 73 <0.1% 0 0% 51,758 10.5% 10.5% 5.0
103 num_sats float32 2.0 MB 75 <0.1% 0 0% 40,989 8.3% 8.3% 9.0
33 open_acc float32 2.0 MB 76 <0.1% 0 0% 40,942 8.3% 8.3% 9.0
125 sec_app_num_rev_accts float32 2.0 MB 81 <0.1% 426,257 86.1% 4,222 0.9% 6.1% 7.0
99 num_il_tl float32 2.0 MB 99 <0.1% 0 0% 44,138 8.9% 8.9% 3.0
101 num_rev_accts float32 2.0 MB 101 <0.1% 0 0% 32,567 6.6% 6.6% 8.0
7 int_rate float32 2.0 MB 110 <0.1% 0 0% 13,718 2.8% 2.8% 13.56
32 mths_since_last_record float32 2.0 MB 127 <0.1% 432,258 87.3% 1,168 0.2% 1.9% 94.0
37 total_acc float32 2.0 MB 130 <0.1% 0 0% 18,677 3.8% 3.8% 16.0
128 sec_app_mths_since_last_major_derog float32 2.0 MB 136 <0.1% 472,865 95.5% 614 0.1% 2.7% 1.0
31 mths_since_last_delinq float32 2.0 MB 158 <0.1% 276,652 55.9% 3,892 0.8% 1.8% 12.0
84 delinq_amnt float32 2.0 MB 159 <0.1% 0 0% 495,082 >99.9% >99.9% 0.0
91 mths_since_recent_bc_dlq float32 2.0 MB 159 <0.1% 397,132 80.2% 1,724 0.3% 1.8% 44.0
55 mths_since_last_major_derog float32 2.0 MB 164 <0.1% 380,409 76.8% 1,875 0.4% 1.6% 45.0
93 mths_since_recent_revol_delinq float32 2.0 MB 165 <0.1% 352,552 71.2% 2,535 0.5% 1.8% 24.0
74 all_util float32 2.0 MB 166 <0.1% 129 <0.1% 9,374 1.9% 1.9% 60.0
88 mo_sin_rcnt_tl float32 2.0 MB 204 <0.1% 0 0% 51,311 10.4% 10.4% 2.0
109 percent_bc_gt_75 float32 2.0 MB 204 <0.1% 6,596 1.3% 188,963 38.2% 38.7% 0.0
70 il_util float32 2.0 MB 235 <0.1% 80,824 16.3% 7,747 1.6% 1.9% 78.0
87 mo_sin_rcnt_rev_tl_op float32 2.0 MB 286 0.1% 0 0% 35,630 7.2% 7.2% 2.0
141 orig_projected_additional_accrued_interest float32 2.0 MB 321 0.1% 494,921 99.9% 1 <0.1% 0.3% 191.64
68 mths_since_rcnt_il float32 2.0 MB 352 0.1% 18,410 3.7% 21,698 4.4% 4.6% 7.0
143 hardship_last_payment_amount float32 2.0 MB 361 0.1% 494,874 99.9% 2 <0.1% 0.5% 0.81
134 hardship_amount float32 2.0 MB 367 0.1% 494,874 99.9% 2 <0.1% 0.5% 164.39
142 hardship_payoff_balance_amount float64 4.0 MB 368 0.1% 494,874 99.9% 1 <0.1% 0.3% 4238.9
90 mths_since_recent_bc float32 2.0 MB 475 0.1% 6,198 1.3% 22,244 4.5% 4.5% 3.0
149 settlement_amount float32 2.0 MB 477 0.1% 494,762 99.9% 2 <0.1% 0.4% 4272.0
85 mo_sin_old_il_acct float32 2.0 MB 507 0.1% 18,410 3.7% 5,195 1.0% 1.1% 130.0
108 pct_tl_nvr_dlq float32 2.0 MB 588 0.1% 2 <0.1% 276,697 55.9% 55.9% 100.0
119 sec_app_earliest_cr_line category 1.0 MB 645 0.1% 426,257 86.1% 641 0.1% 0.9% Aug-2006
27 earliest_cr_line category 1.1 MB 684 0.1% 0 0% 4,265 0.9% 0.9% Aug-2006
86 mo_sin_old_rev_tl_op float32 2.0 MB 731 0.1% 0 0% 3,040 0.6% 0.6% 136.0
23 zip_code category 1.1 MB 897 0.2% 0 0% 5,364 1.1% 1.1% 112xx
36 revol_util float32 2.0 MB 1,136 0.2% 592 0.1% 4,942 1.0% 1.0% 0.0
123 sec_app_revol_util float32 2.0 MB 1,164 0.2% 427,454 86.3% 813 0.2% 1.2% 0.0
82 bc_util float32 2.0 MB 1,210 0.2% 6,803 1.4% 9,601 1.9% 2.0% 0.0
3 loan_amnt float32 2.0 MB 1,559 0.3% 0 0% 55,645 11.2% 11.2% 10000.0
4 funded_amnt float32 2.0 MB 1,559 0.3% 0 0% 55,645 11.2% 11.2% 10000.0
5 funded_amnt_inv float64 4.0 MB 1,571 0.3% 0 0% 53,986 10.9% 10.9% 10000.0
46 recoveries float64 4.0 MB 1,998 0.4% 0 0% 493,047 99.6% 99.6% 0.0
47 collection_recovery_fee float32 2.0 MB 2,000 0.4% 0 0% 493,047 99.6% 99.6% 0.0
45 total_rec_late_fee float32 2.0 MB 2,334 0.5% 0 0% 488,297 98.6% 98.6% 0.0
59 dti_joint float32 2.0 MB 3,943 0.8% 426,257 86.1% 49 <0.1% 0.1% 21.64
114 total_bc_limit float32 2.0 MB 4,585 0.9% 0 0% 6,596 1.3% 1.3% 0.0
62 tot_coll_amt float32 2.0 MB 7,608 1.5% 0 0% 425,302 85.9% 85.9% 0.0
75 total_rev_hi_lim float32 2.0 MB 8,303 1.7% 0 0% 1,493 0.3% 0.3% 10000.0
25 dti float32 2.0 MB 9,464 1.9% 1,132 0.2% 819 0.2% 0.2% 0.0
58 annual_inc_joint float64 4.0 MB 10,952 2.2% 426,257 86.1% 1,216 0.2% 1.8% 100000.0
73 max_bal_bc float32 2.0 MB 28,148 5.7% 0 0% 15,355 3.1% 3.1% 0.0
14 annual_inc float64 4.0 MB 30,071 6.1% 0 0% 18,932 3.8% 3.8% 60000.0
8 installment float32 2.0 MB 35,303 7.1% 0 0% 1,463 0.3% 0.3% 304.72
116 revol_bal_joint float32 2.0 MB 44,730 9.0% 426,257 86.1% 69 <0.1% 0.1% 0.0
80 avg_cur_bal float32 2.0 MB 62,125 12.5% 40 <0.1% 540 0.1% 0.1% 0.0
35 revol_bal float32 2.0 MB 64,413 13.0% 0 0% 4,742 1.0% 1.0% 0.0
81 bc_open_to_buy float32 2.0 MB 68,060 13.7% 6,588 1.3% 3,297 0.7% 0.7% 0.0
49 last_pymnt_amnt float64 4.0 MB 86,195 17.4% 0 0% 1,382 0.3% 0.3% 304.72
69 total_bal_il float32 2.0 MB 115,553 23.3% 0 0% 62,158 12.6% 12.6% 0.0
115 total_il_high_credit_limit float32 2.0 MB 128,691 26.0% 0 0% 62,160 12.6% 12.6% 0.0
11 emp_title string 35.4 MB 129,449 26.1% 54,659 11.0% 8,679 1.8% 2.0% Teacher
43 total_rec_prncp float64 4.0 MB 129,889 26.2% 0 0% 4,736 1.0% 1.0% 10000.0
39 out_prncp float64 4.0 MB 136,867 27.6% 0 0% 57,348 11.6% 11.6% 0.0
113 total_bal_ex_mort float32 2.0 MB 139,900 28.2% 0 0% 839 0.2% 0.2% 0.0
40 out_prncp_inv float64 4.0 MB 143,128 28.9% 0 0% 57,348 11.6% 11.6% 0.0
44 total_rec_int float32 2.0 MB 172,941 34.9% 0 0% 732 0.1% 0.1% 0.0
41 total_pymnt float64 4.0 MB 234,997 47.5% 0 0% 448 0.1% 0.1% 0.0
42 total_pymnt_inv float64 4.0 MB 236,121 47.7% 0 0% 448 0.1% 0.1% 0.0
63 tot_cur_bal float32 2.0 MB 254,794 51.4% 0 0% 575 0.1% 0.1% 0.0
112 tot_hi_cred_lim float32 2.0 MB 266,282 53.8% 0 0% 207 <0.1% <0.1% 12500.0
19 url string 60.9 MB 495,242 100.0% 0 0% 1 <0.1% <0.1% https://lendingclub.com/browse/loanDetail.action?loan_id=130954621
1 id string 32.7 MB 495,242 100.0% 0 0% 1 <0.1% <0.1% 130954621

In the categorical target columns, the first grades and sub-grades (e.g., A, B) are much more common than the last ones (e.g., F, G). This imbalance should be accounted for in the analysis, e.g., by using a stratified split.

Code
grade_counts = accepted_2018["grade"].value_counts().sort_index()
grade_counts.plot.bar(ylabel="counts")
grade_counts
grade
A    135177
B    141365
C    126850
D     69046
E     18958
F      3175
G       671
Name: count, dtype: int64

Code
sub_grade_counts = accepted_2018["sub_grade"].value_counts().sort_index()
sub_grade_counts.plot.bar(ylabel="counts")
pd.concat([sub_grade_counts.head(), pd.Series({"...": "..."}), sub_grade_counts.tail()])
A1     28887
A2     22871
A3     25264
A4     31648
A5     26507
...      ...
G1       487
G2        88
G3        44
G4        25
G5        27
dtype: object
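The stratified split suggested above for this kind of imbalance can be illustrated on toy labels (a sketch, not the project's actual split): with `stratify=`, a rare class keeps roughly the same share in both resulting parts.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Imbalanced toy labels: 90 common "B4" vs 10 rare "G5".
y = pd.Series(["B4"] * 90 + ["G5"] * 10)
X = pd.DataFrame({"x": range(100)})

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Both parts preserve the original 90%/10% class proportions.
print(y_tr.value_counts(normalize=True).round(1).to_dict())
print(y_te.value_counts(normalize=True).round(1).to_dict())
```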

Loan title and purpose contain similar information, so one of the two variables should be dropped. Some categories are also very rare and should be merged.
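One way to merge rare categories is to lump every level below a frequency threshold into a single "other" level (a sketch with toy data and a hypothetical threshold of 5 observations):

```python
import pandas as pd

# Toy purpose-like column with two common and two rare categories.
purpose = pd.Series(
    ["debt_consolidation"] * 50
    + ["credit_card"] * 30
    + ["wedding"] * 2
    + ["educational"] * 1
)

# Lump categories rarer than the threshold into "other".
threshold = 5
counts = purpose.value_counts()
rare = counts[counts < threshold].index
merged = purpose.where(~purpose.isin(rare), "other")

print(merged.value_counts().to_dict())
# {'debt_consolidation': 50, 'credit_card': 30, 'other': 3}
```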

Code
ct = an.CrossTab("title", "purpose", data=accepted_2018, margins=True)
ct.counts = (
    ct.counts.sort_values(by="All", axis=0, ascending=False)
    .sort_values(by="All", axis=1, ascending=False)
    .drop("All", axis=0)
    .drop("All", axis=1)
)
ct.heatmap(vmax=1, cbar=False, cmap="crest");

4.1.2 Group-Independent Pre-Processing

In this pre-processing phase:

  • columns with more than 70% of missing values will be dropped;
  • features will be extracted (months from the issue date, the national area code zip_area from the first digit of the ZIP code, title length);
  • missing value indicators will be added;
  • some other columns will be dropped after manual inspection:
    • irrelevant variables and variables already used for feature extraction;
    • date variables;
    • categorical variables with too many categories;
    • etc.
Code
file_path = dir_interim + "task-2--2-accepted_loans-2018--preprocessed.feather"

# Drop columns based on manual inspection
drop_manually = [
    # Irrelevant
    "url",
    "id",
    "policy_code",
    "loan_status",
    # Almost duplicates of other columns:
    "title",
    # Dates
    "last_pymnt_d",
    "next_pymnt_d",
    "last_credit_pull_d",
    "issue_d",
    "earliest_cr_line",
    # Too many categories
    "zip_code",
    "addr_state",
    "emp_title",
    # Extremely low variability
    "num_tl_120dpd_2m",
    "acc_now_delinq",
]


if os.path.exists(file_path):
    accepted_2018_preproc = pd.read_feather(file_path)
else:
    # Remove columns with more than 70% of missing values
    accepted_2018_preproc = accepted_2018.loc[:, accepted_2018.isnull().mean() <= 0.7]

    # Pre-process the data
    accepted_2018_preproc = accepted_2018_preproc.assign(
        # Extract features
        title_len=lambda df: df["title"].str.len().fillna(0).astype("float32"),
        issue_d=lambda df: pd.to_datetime(df["issue_d"], format="%b-%Y"),
        issue_month=lambda df: df["issue_d"].dt.month.astype("Int8"),
        zip_area=lambda x: x.zip_code.str[0].astype("category"),
        # Encode variables:
        # - Employment length classes as integers
        emp_length=lambda df: df["emp_length"]
        .replace(dict(zip(utils.work_categories, range(len(utils.work_categories)))))
        .astype("Int8"),
        # Add missing value indicators
        emp_length_is_na=lambda df: df["emp_length"].isna().astype("Int8"),
        dti_is_na=lambda df: df["dti"].isna().astype("Int8"),
        mths_since_last_delinq_is_na=(
            lambda df: df["mths_since_last_delinq"].isna().astype("Int8")
        ),
        revol_util_is_na=lambda df: df["revol_util"].isna().astype("Int8"),
        mths_since_rcnt_il_is_na=lambda df: df["mths_since_rcnt_il"]
        .isna()
        .astype("Int8"),
        il_util_is_na=lambda df: df["il_util"].isna().astype("Int8"),
        all_util_is_na=lambda df: df["all_util"].isna().astype("Int8"),
        bc_open_to_buy_is_na=lambda df: df["bc_open_to_buy"].isna().astype("Int8"),
        bc_util_is_na=lambda df: df["bc_util"].isna().astype("Int8"),
        mo_sin_old_il_acct_is_na=lambda df: df["mo_sin_old_il_acct"]
        .isna()
        .astype("Int8"),
        mths_since_recent_bc_is_na=(
            lambda df: df["mths_since_recent_bc"].isna().astype("Int8")
        ),
        mths_since_recent_inq_is_na=(
            lambda df: df["mths_since_recent_inq"].isna().astype("Int8")
        ),
        percent_bc_gt_75_is_na=lambda df: df["percent_bc_gt_75"].isna().astype("Int8"),
    )

    # Remove columns after manual inspection
    accepted_2018_preproc = accepted_2018_preproc.drop(columns=drop_manually)

    # Save as feather file
    accepted_2018_preproc.to_feather(file_path)

del file_path
if "accepted_2018" in locals():
    del accepted_2018

4.1.3 Create Pre-Processing Pipelines

In this section, pipelines for further processing of data will be created. Suffix _lr means that the pipeline (or other object) is for non-tree-based models (logistic regression and Naive Bayes) and _trees means that it is for tree-based models (LGBM):

  • _lr pipeline contains one extra step: numeric data scaling.
Code
# Numeric variables except the *_is_na missing value indicators
# (negative lookahead: the column name must not end in "_is_na")
select_numeric = make_column_selector(dtype_include="number", pattern="^(?!.*_is_na$)")
# Categorical variables
select_categorical = make_column_selector(dtype_include="category")

# Create the pipelines
numeric_transformer_lr = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="median")), ("scaler", StandardScaler())]
)
numeric_transformer_trees = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="median"))]
)
categorical_transformer = Pipeline(
    steps=[("onehot", OneHotEncoder(sparse_output=False, handle_unknown="ignore"))]
)

# Merge pipelines of numeric and categorical variables
pre_processing_lr = ColumnTransformer(
    transformers=[
        ("numeric", numeric_transformer_lr, select_numeric),
        ("categorical", categorical_transformer, select_categorical),
    ],
    verbose_feature_names_out=False,
    remainder="passthrough",
)
pre_processing_trees = ColumnTransformer(
    transformers=[
        ("numeric", numeric_transformer_trees, select_numeric),
        ("categorical", categorical_transformer, select_categorical),
    ],
    verbose_feature_names_out=False,
    remainder="passthrough",
)
Code
pre_processing_lr
ColumnTransformer(remainder='passthrough',
                  transformers=[('numeric',
                                 Pipeline(steps=[('imputer',
                                                  SimpleImputer(strategy='median')),
                                                 ('scaler', StandardScaler())]),
                                 <sklearn.compose._column_transformer.make_column_selector object at 0x000001878499E590>),
                                ('categorical',
                                 Pipeline(steps=[('onehot',
                                                  OneHotEncoder(handle_unknown='ignore',
                                                                sparse_output=False))]),
                                 <sklearn.compose._column_transformer.make_column_selector object at 0x00000187803F2450>)],
                  verbose_feature_names_out=False)
Code
pre_processing_trees
ColumnTransformer(remainder='passthrough',
                  transformers=[('numeric',
                                 Pipeline(steps=[('imputer',
                                                  SimpleImputer(strategy='median'))]),
                                 <sklearn.compose._column_transformer.make_column_selector object at 0x000001878499E590>),
                                ('categorical',
                                 Pipeline(steps=[('onehot',
                                                  OneHotEncoder(handle_unknown='ignore',
                                                                sparse_output=False))]),
                                 <sklearn.compose._column_transformer.make_column_selector object at 0x00000187803F2450>)],
                  verbose_feature_names_out=False)

4.1.4 Split to Train, Validation, and Test Sets

Split the data into training, validation, and test sets (70%:15%:15%), stratified by loan sub-grade so that each set reflects the sub-grade distribution of the full dataset.

Code
file_path = dir_interim + "task-2--2-accepted_loans-2018--preprocessed.feather"

if os.path.exists(file_path):
    accepted_2018_preproc = pd.read_feather(file_path)

del file_path
Code
# Train, validation, test split (stratified by sub-grade): 70%:15%:15%
accepted_train, accepted_validation = train_test_split(
    accepted_2018_preproc,
    test_size=0.3,
    random_state=42,
    stratify=accepted_2018_preproc.sub_grade,
)

accepted_validation, accepted_test = train_test_split(
    accepted_validation,
    test_size=0.5,
    random_state=42,
    stratify=accepted_validation.sub_grade,
)

The sizes of these sets are as follows:

Code
print(f"{accepted_train.shape[0]/1e3:.1f}k rows in training set.")
print(f"{accepted_validation.shape[0]/1e3: .1f}k rows in validation set.")
print(f"{accepted_test.shape[0]/1e3: .1f}k rows in test set.")
346.7k rows in training set.
 74.3k rows in validation set.
 74.3k rows in test set.

4.1.5 EDA on Training Set

EDA in this section will cover 2 parts: EDA of target variables and EDA of other variables.

No essential discrepancies were found. Some variables (e.g., pymnt_plan) have rare values (they are almost constant), but some target variables also have rare values (e.g., the G5 sub-grade). So these variables will be kept for now, and the models will “decide” whether they are important enough.

Variables such as emp_length, verification_status, or fico_range_low show a potential to be good predictors.

Please find the EDA details below.

Target variables

The correspondence between loan grade and sub-grade (blue cells indicate that the sub-grade belongs to the grade) shows no discrepancies:

Code
ct = an.CrossTab("grade", "sub_grade", data=accepted_train)
ct.heatmap(vmax=1, cbar=False, cmap="crest", annot=False)
plt.xlabel("Sub-grade")
plt.ylabel("Grade")
del ct

Code
counts_g = (
    accepted_train["grade"]
    .value_counts()
    .sort_index()
    .to_frame()
    .reset_index()
    .assign(
        percent=lambda df: my.format_percent(df["count"] / df["count"].sum() * 100),
    )
)

my.plot_counts_with_labels(counts_g, x="grade", y="count", label="percent", rot=0);

Code
subgrade_count_series = accepted_train["sub_grade"].value_counts().sort_index()
counts_s = (
    subgrade_count_series.to_frame()
    .reset_index()
    .assign(
        percent=lambda df: df["count"] / df["count"].sum() * 100,
        label=lambda df: my.format_percent(df["percent"]),
    )
)

my.plot_counts_with_labels(
    counts_s,
    x="sub_grade",
    y="count",
    label="label",
    label_rotation=90,
    y_lim_max=27_000,
);

Code
print("Counts of sub-grades for F and G grades:")
pd.concat([pd.Series({"Group": "Count"}), subgrade_count_series.tail(10)])
Counts of sub-grades for F and G grades:
Group    Count
F1        1063
F2         403
F3         305
F4         220
F5         231
G1         341
G2          62
G3          31
G4          17
G5          19
dtype: object

In most cases, grades/sub-grades are in alignment with interest rates: lower grades have higher interest rates and vice versa. But there is an exception: some loans have a 6% interest rate regardless of the grade/sub-grade (this is the case for all grades except grade G).

Please find the details below.

Code
sns.boxplot(x="grade", y="int_rate", data=accepted_train)
plt.xlabel("Grade")
plt.ylabel("Interest rate");

Code
sns.boxplot(x="sub_grade", y="int_rate", data=accepted_train)
plt.xlabel("Sub-grade")
plt.ylabel("Interest rate");

Let’s look into the outlying values visible in the plot above:

Code
accepted_train.query("sub_grade == 'B4'")["int_rate"].value_counts().sort_index()
int_rate
6.00        3
10.90    4935
10.91    2102
11.05    3300
11.55    8370
11.80    3499
Name: count, dtype: int64
Code
accepted_train.query("sub_grade == 'D5'")["int_rate"].value_counts().sort_index()
int_rate
6.00        1
21.45    1093
21.85    3018
22.35    3474
Name: count, dtype: int64

It seems that those outlying values are exactly 6%. Let’s look deeper into this:

Code
ct = an.CrossTab(
    "grade",
    "int_rate_is_6",
    data=accepted_train.assign(int_rate_is_6=lambda df: (df["int_rate"] == 6)),
)
ct.heatmap(vmax=1, cbar=False, cmap="crest")
plt.xlabel("Interest rate is 6%")
plt.ylabel("Grade")
del ct

General EDA

Code
an.col_info(accepted_train, style=True)
  column data_type memory_size n_unique p_unique n_missing p_missing n_dominant p_dominant p_dom_excl_na dominant
1 loan_amnt float32 1.4 MB 1,555 0.4% 0 0% 38,907 11.2% 11.2% 10000.0
2 funded_amnt float32 1.4 MB 1,555 0.4% 0 0% 38,907 11.2% 11.2% 10000.0
3 funded_amnt_inv float64 2.8 MB 1,569 0.5% 0 0% 37,718 10.9% 10.9% 10000.0
4 term category 346.9 kB 2 <0.1% 0 0% 241,239 69.6% 69.6% 36 months
5 int_rate float32 1.4 MB 110 <0.1% 0 0% 9,643 2.8% 2.8% 6.11
6 installment float32 1.4 MB 29,669 8.6% 0 0% 1,049 0.3% 0.3% 304.72
7 grade category 347.4 kB 7 <0.1% 0 0% 98,955 28.5% 28.5% B
8 sub_grade category 349.8 kB 35 <0.1% 0 0% 22,209 6.4% 6.4% B4
9 emp_length Int8 693.3 kB 11 <0.1% 29,368 8.5% 112,097 32.3% 35.3% 10
10 home_ownership category 347.1 kB 4 <0.1% 0 0% 167,124 48.2% 48.2% MORTGAGE
11 annual_inc float64 2.8 MB 22,950 6.6% 0 0% 13,265 3.8% 3.8% 60000.0
12 verification_status category 347.0 kB 3 <0.1% 0 0% 140,120 40.4% 40.4% Not Verified
13 pymnt_plan category 346.9 kB 2 <0.1% 0 0% 346,572 >99.9% >99.9% n
14 purpose category 348.1 kB 13 <0.1% 0 0% 181,886 52.5% 52.5% debt_consolidation
15 dti float32 1.4 MB 8,528 2.5% 795 0.2% 595 0.2% 0.2% 0.0
16 delinq_2yrs float32 1.4 MB 24 <0.1% 0 0% 296,006 85.4% 85.4% 0.0
17 fico_range_low float32 1.4 MB 38 <0.1% 0 0% 22,587 6.5% 6.5% 680.0
18 fico_range_high float32 1.4 MB 38 <0.1% 0 0% 22,587 6.5% 6.5% 684.0
19 inq_last_6mths float32 1.4 MB 6 <0.1% 0 0% 232,437 67.0% 67.0% 0.0
20 mths_since_last_delinq float32 1.4 MB 151 <0.1% 193,704 55.9% 2,758 0.8% 1.8% 12.0
21 open_acc float32 1.4 MB 72 <0.1% 0 0% 28,519 8.2% 8.2% 9.0
22 pub_rec float32 1.4 MB 15 <0.1% 0 0% 302,407 87.2% 87.2% 0.0
23 revol_bal float32 1.4 MB 57,488 16.6% 0 0% 3,336 1.0% 1.0% 0.0
24 revol_util float32 1.4 MB 1,116 0.3% 429 0.1% 3,448 1.0% 1.0% 0.0
25 total_acc float32 1.4 MB 128 <0.1% 0 0% 13,194 3.8% 3.8% 16.0
26 initial_list_status category 346.9 kB 2 <0.1% 0 0% 298,847 86.2% 86.2% w
27 out_prncp float64 2.8 MB 104,998 30.3% 0 0% 40,203 11.6% 11.6% 0.0
28 out_prncp_inv float64 2.8 MB 109,695 31.6% 0 0% 40,203 11.6% 11.6% 0.0
29 total_pymnt float64 2.8 MB 178,207 51.4% 0 0% 295 0.1% 0.1% 0.0
30 total_pymnt_inv float64 2.8 MB 179,921 51.9% 0 0% 295 0.1% 0.1% 0.0
31 total_rec_prncp float64 2.8 MB 101,938 29.4% 0 0% 3,321 1.0% 1.0% 10000.0
32 total_rec_int float32 1.4 MB 141,484 40.8% 0 0% 493 0.1% 0.1% 0.0
33 total_rec_late_fee float32 1.4 MB 1,901 0.5% 0 0% 341,735 98.6% 98.6% 0.0
34 recoveries float64 2.8 MB 1,405 0.4% 0 0% 345,148 99.6% 99.6% 0.0
35 collection_recovery_fee float32 1.4 MB 1,406 0.4% 0 0% 345,148 99.6% 99.6% 0.0
36 last_pymnt_amnt float64 2.8 MB 65,820 19.0% 0 0% 994 0.3% 0.3% 304.72
37 last_fico_range_high float32 1.4 MB 72 <0.1% 0 0% 15,672 4.5% 4.5% 699.0
38 last_fico_range_low float32 1.4 MB 71 <0.1% 0 0% 15,672 4.5% 4.5% 695.0
39 collections_12_mths_ex_med float32 1.4 MB 8 <0.1% 0 0% 341,024 98.4% 98.4% 0.0
40 application_type category 346.9 kB 2 <0.1% 0 0% 298,502 86.1% 86.1% Individual
41 tot_coll_amt float32 1.4 MB 6,504 1.9% 0 0% 297,695 85.9% 85.9% 0.0
42 tot_cur_bal float32 1.4 MB 204,761 59.1% 0 0% 428 0.1% 0.1% 0.0
43 open_acc_6m float32 1.4 MB 15 <0.1% 0 0% 161,945 46.7% 46.7% 0.0
44 open_act_il float32 1.4 MB 47 <0.1% 0 0% 88,854 25.6% 25.6% 1.0
45 open_il_12m float32 1.4 MB 7 <0.1% 0 0% 190,997 55.1% 55.1% 0.0
46 open_il_24m float32 1.4 MB 21 <0.1% 0 0% 109,259 31.5% 31.5% 1.0
47 mths_since_rcnt_il float32 1.4 MB 334 0.1% 12,828 3.7% 15,250 4.4% 4.6% 7.0
48 total_bal_il float32 1.4 MB 101,091 29.2% 0 0% 43,612 12.6% 12.6% 0.0
49 il_util float32 1.4 MB 223 0.1% 56,717 16.4% 5,435 1.6% 1.9% 78.0
50 open_rv_12m float32 1.4 MB 23 <0.1% 0 0% 136,493 39.4% 39.4% 0.0
51 open_rv_24m float32 1.4 MB 39 <0.1% 0 0% 77,516 22.4% 22.4% 1.0
52 max_bal_bc float32 1.4 MB 26,163 7.5% 0 0% 10,764 3.1% 3.1% 0.0
53 all_util float32 1.4 MB 159 <0.1% 99 <0.1% 6,566 1.9% 1.9% 60.0
54 total_rev_hi_lim float32 1.4 MB 6,799 2.0% 0 0% 1,063 0.3% 0.3% 17000.0
55 inq_fi float32 1.4 MB 29 <0.1% 0 0% 159,036 45.9% 45.9% 0.0
56 total_cu_tl float32 1.4 MB 50 <0.1% 0 0% 185,685 53.6% 53.6% 0.0
57 inq_last_12m float32 1.4 MB 42 <0.1% 0 0% 107,324 31.0% 31.0% 0.0
58 acc_open_past_24mths float32 1.4 MB 44 <0.1% 0 0% 52,169 15.0% 15.0% 3.0
59 avg_cur_bal float32 1.4 MB 55,981 16.1% 31 <0.1% 400 0.1% 0.1% 0.0
60 bc_open_to_buy float32 1.4 MB 60,535 17.5% 4,642 1.3% 2,278 0.7% 0.7% 0.0
61 bc_util float32 1.4 MB 1,178 0.3% 4,794 1.4% 6,709 1.9% 2.0% 0.0
62 chargeoff_within_12_mths float32 1.4 MB 8 <0.1% 0 0% 344,527 99.4% 99.4% 0.0
63 delinq_amnt float32 1.4 MB 109 <0.1% 0 0% 346,561 >99.9% >99.9% 0.0
64 mo_sin_old_il_acct float32 1.4 MB 489 0.1% 12,828 3.7% 3,683 1.1% 1.1% 130.0
65 mo_sin_old_rev_tl_op float32 1.4 MB 715 0.2% 0 0% 2,203 0.6% 0.6% 136.0
66 mo_sin_rcnt_rev_tl_op float32 1.4 MB 267 0.1% 0 0% 24,897 7.2% 7.2% 2.0
67 mo_sin_rcnt_tl float32 1.4 MB 190 0.1% 0 0% 35,838 10.3% 10.3% 2.0
68 mort_acc float32 1.4 MB 29 <0.1% 0 0% 159,064 45.9% 45.9% 0.0
69 mths_since_recent_bc float32 1.4 MB 445 0.1% 4,380 1.3% 15,561 4.5% 4.5% 3.0
70 mths_since_recent_inq float32 1.4 MB 26 <0.1% 42,734 12.3% 29,234 8.4% 9.6% 1.0
71 num_accts_ever_120_pd float32 1.4 MB 34 <0.1% 0 0% 273,859 79.0% 79.0% 0.0
72 num_actv_bc_tl float32 1.4 MB 33 <0.1% 0 0% 70,996 20.5% 20.5% 2.0
73 num_actv_rev_tl float32 1.4 MB 45 <0.1% 0 0% 51,745 14.9% 14.9% 4.0
74 num_bc_sats float32 1.4 MB 49 <0.1% 0 0% 57,711 16.6% 16.6% 3.0
75 num_bc_tl float32 1.4 MB 63 <0.1% 0 0% 39,957 11.5% 11.5% 4.0
76 num_il_tl float32 1.4 MB 98 <0.1% 0 0% 30,943 8.9% 8.9% 3.0
77 num_op_rev_tl float32 1.4 MB 69 <0.1% 0 0% 36,283 10.5% 10.5% 5.0
78 num_rev_accts float32 1.4 MB 99 <0.1% 0 0% 22,799 6.6% 6.6% 8.0
79 num_rev_tl_bal_gt_0 float32 1.4 MB 41 <0.1% 0 0% 52,026 15.0% 15.0% 4.0
80 num_sats float32 1.4 MB 72 <0.1% 0 0% 28,551 8.2% 8.2% 9.0
81 num_tl_30dpd float32 1.4 MB 2 <0.1% 0 0% 346,653 >99.9% >99.9% 0.0
82 num_tl_90g_dpd_24m float32 1.4 MB 24 <0.1% 0 0% 332,722 96.0% 96.0% 0.0
83 num_tl_op_past_12m float32 1.4 MB 26 <0.1% 0 0% 89,094 25.7% 25.7% 1.0
84 pct_tl_nvr_dlq float32 1.4 MB 561 0.2% 2 <0.1% 193,695 55.9% 55.9% 100.0
85 percent_bc_gt_75 float32 1.4 MB 193 0.1% 4,648 1.3% 132,349 38.2% 38.7% 0.0
86 pub_rec_bankruptcies float32 1.4 MB 8 <0.1% 0 0% 304,300 87.8% 87.8% 0.0
87 tax_liens float32 1.4 MB 15 <0.1% 0 0% 344,309 99.3% 99.3% 0.0
88 tot_hi_cred_lim float32 1.4 MB 209,235 60.4% 0 0% 150 <0.1% <0.1% 17000.0
89 total_bal_ex_mort float32 1.4 MB 123,038 35.5% 0 0% 611 0.2% 0.2% 0.0
90 total_bc_limit float32 1.4 MB 3,914 1.1% 0 0% 4,648 1.3% 1.3% 0.0
91 total_il_high_credit_limit float32 1.4 MB 111,710 32.2% 0 0% 43,614 12.6% 12.6% 0.0
92 hardship_flag category 346.9 kB 2 <0.1% 0 0% 346,544 >99.9% >99.9% N
93 disbursement_method category 346.9 kB 2 <0.1% 0 0% 296,679 85.6% 85.6% Cash
94 debt_settlement_flag category 346.9 kB 2 <0.1% 0 0% 346,353 99.9% 99.9% N
95 title_len float32 1.4 MB 10 <0.1% 0 0% 181,886 52.5% 52.5% 18.0
96 issue_month Int8 693.3 kB 12 <0.1% 0 0% 32,520 9.4% 9.4% 10
97 zip_area category 347.5 kB 10 <0.1% 0 0% 60,894 17.6% 17.6% 9
98 emp_length_is_na Int8 693.3 kB 2 <0.1% 0 0% 317,301 91.5% 91.5% 0
99 dti_is_na Int8 693.3 kB 2 <0.1% 0 0% 345,874 99.8% 99.8% 0
100 mths_since_last_delinq_is_na Int8 693.3 kB 2 <0.1% 0 0% 193,704 55.9% 55.9% 1
101 revol_util_is_na Int8 693.3 kB 2 <0.1% 0 0% 346,240 99.9% 99.9% 0
102 mths_since_rcnt_il_is_na Int8 693.3 kB 2 <0.1% 0 0% 333,841 96.3% 96.3% 0
103 il_util_is_na Int8 693.3 kB 2 <0.1% 0 0% 289,952 83.6% 83.6% 0
104 all_util_is_na Int8 693.3 kB 2 <0.1% 0 0% 346,570 >99.9% >99.9% 0
105 bc_open_to_buy_is_na Int8 693.3 kB 2 <0.1% 0 0% 342,027 98.7% 98.7% 0
106 bc_util_is_na Int8 693.3 kB 2 <0.1% 0 0% 341,875 98.6% 98.6% 0
107 mo_sin_old_il_acct_is_na Int8 693.3 kB 2 <0.1% 0 0% 333,841 96.3% 96.3% 0
108 mths_since_recent_bc_is_na Int8 693.3 kB 2 <0.1% 0 0% 342,289 98.7% 98.7% 0
109 mths_since_recent_inq_is_na Int8 693.3 kB 2 <0.1% 0 0% 303,935 87.7% 87.7% 0
110 percent_bc_gt_75_is_na Int8 693.3 kB 2 <0.1% 0 0% 342,021 98.7% 98.7% 0
Code
if do_eda:
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")

        report_accepted_2018 = sweetviz.analyze(
            [accepted_train, "Accepted loans (train)"],
            target_feat="int_rate",
            pairwise_analysis="off",
        )
        report_accepted_2018.show_notebook()

4.1.6 Separate Target Variables

As there are 3 sub-tasks (predicting grade, sub-grade, and interest rate), 3 separate target variables will be created for each set of predictors.

Code
X_train = accepted_train.drop(columns=["grade", "sub_grade", "int_rate"])
y_train_grades = accepted_train["grade"]
y_train_subgrades = accepted_train["sub_grade"]
y_train_int_rate = accepted_train["int_rate"]

X_validation = accepted_validation.drop(columns=["grade", "sub_grade", "int_rate"])
y_validation_grades = accepted_validation["grade"]
y_validation_subgrades = accepted_validation["sub_grade"]
y_validation_int_rate = accepted_validation["int_rate"]

X_test = accepted_test.drop(columns=["grade", "sub_grade", "int_rate"])
y_test_grades = accepted_test["grade"]
y_test_subgrades = accepted_test["sub_grade"]
y_test_int_rate = accepted_test["int_rate"]

4.2 Loan Grade Prediction

This section focuses on predicting loan grades.

4.2.1 Train Models

Here 5 models will be trained:

  • Logistic Regression (LR);
  • Naive Bayes (NB);
  • LGBM;
  • a combination of all 3 models (LR, NB, and LGBM) with soft voting;
  • a combination of all 3 models (LR, NB, and LGBM) with hard voting.
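To illustrate the difference between the two voting schemes, here is a small self-contained sketch with toy data (a decision tree stands in for LGBM): soft voting averages the predicted class probabilities across models, while hard voting takes a majority vote over the predicted labels.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Toy 3-class data (a stand-in for the real loan features).
X, y = make_classification(
    n_samples=300, n_classes=3, n_informative=5, random_state=1
)

estimators = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("nb", GaussianNB()),
    ("dt", DecisionTreeClassifier(random_state=1)),
]

# Soft voting: average the predicted class probabilities across models.
soft = VotingClassifier(estimators, voting="soft").fit(X, y)
# Hard voting: majority vote over predicted class labels; no
# probabilities are produced, so ROC AUC cannot be computed for it.
hard = VotingClassifier(estimators, voting="hard").fit(X, y)

print(soft.predict_proba(X[:3]).round(2))  # class probabilities
print(hard.predict(X[:3]))                 # labels only
```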
Code
# Dictionary to collect the results
models_grades = {}
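The `my.cache_results` decorator used throughout this section is a project-specific helper whose implementation is not shown here. A minimal pickle-based caching decorator with the assumed behavior (run the function once, save the result to disk, and load it on subsequent calls) could look like this:

```python
import os
import pickle
from functools import wraps


def cache_results(path):
    """Cache a function's return value to `path` as a pickle.

    Sketch only; the project's `my.cache_results` is assumed to
    behave similarly.
    """
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            # Load the cached result if it exists...
            if os.path.exists(path):
                with open(path, "rb") as f:
                    return pickle.load(f)
            # ...otherwise compute and save it.
            result = func(*args, **kwargs)
            with open(path, "wb") as f:
                pickle.dump(result, f)
            return result
        return wrapper
    return decorator
```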
Code
@my.cache_results(dir_interim + "models_2_01_naive_bayes.pickle")
def fit_nb_grades():
    """Fit a Naive Bayes model."""
    pipeline = Pipeline(
        steps=[
            ("preprocessor", clone(pre_processing_lr)),
            ("classifier", GaussianNB()),
        ]
    )
    pipeline.fit(X_train, y_train_grades)

    return pipeline


models_grades["Naive Bayes"] = fit_nb_grades()
Code
@my.cache_results(dir_interim + "models_2_02_logistic_regression_sgd.pickle")
def fit_lr_sgd_grades():
    """Fit a Logistic Regression model (via SGD training)."""
    pipeline = Pipeline(
        steps=[
            ("preprocessor", clone(pre_processing_lr)),
            (
                "classifier",
                SGDClassifier(
                    random_state=1, loss="log_loss", n_jobs=-1, class_weight="balanced"
                ),
            ),
        ]
    )
    pipeline.fit(X_train, y_train_grades)

    return pipeline


models_grades["Logistic Regression"] = fit_lr_sgd_grades()
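Both SGDClassifier and LGBMClassifier above are fitted with `class_weight="balanced"`, which weights each class inversely to its frequency (n_samples / (n_classes * class_count)) to counteract the strong grade imbalance seen earlier. A minimal sketch of the computation:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Imbalanced toy labels mimicking the grade distribution (B common, G rare).
y = np.array(["B"] * 90 + ["G"] * 10)
classes = np.array(["B", "G"])

weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)

# balanced weight = n_samples / (n_classes * count(class)):
# B: 100 / (2 * 90) ≈ 0.556;  G: 100 / (2 * 10) = 5.0
print(dict(zip(classes, weights)))
```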
Code
@my.cache_results(dir_interim + "models_2_03_lgbm--test.pickle")
def fit_lgbm_grades():
    """Fit a LGBM model."""
    pipeline = Pipeline(
        steps=[
            ("preprocessor", clone(pre_processing_trees)),
            (
                "classifier",
                LGBMClassifier(
                    random_state=1,
                    class_weight="balanced",
                    objective="multiclass",
                    n_jobs=-1,
                    device="gpu",
                    verbosity=1,
                ),
            ),
        ]
    )
    pipeline.fit(X_train, y_train_grades)
    return pipeline


# Time: 1m 1.9s
models_grades["LGBM"] = fit_lgbm_grades()
Code
@my.cache_results(dir_interim + "models_2_04_voting_soft.pickle")
def fit_voting_soft_grades():
    """Fit a voting classifier with soft voting."""

    classifiers = [
        ("GaussianNB", GaussianNB()),
        (
            "SGDClassifier",
            SGDClassifier(
                random_state=1, loss="log_loss", n_jobs=-1, class_weight="balanced"
            ),
        ),
        (
            "LGBMClassifier",
            LGBMClassifier(
                random_state=1,
                class_weight="balanced",
                objective="multiclass",
                n_jobs=-1,
                device="gpu",
                verbosity=1,
            ),
        ),
    ]

    voting_classifier = VotingClassifier(classifiers, voting="soft", n_jobs=-1)

    pipeline = Pipeline(
        steps=[
            ("preprocessor", clone(pre_processing_lr)),
            ("classifier", voting_classifier),
        ]
    )
    pipeline.fit(X_train, y_train_grades)
    return pipeline


# Time: 1m 7.9s
models_grades["Voting (soft)"] = fit_voting_soft_grades()
Code
@my.cache_results(dir_interim + "models_2_05_voting_hard.pickle")
def fit_voting_hard_grades():
    """Fit a voting classifier with hard voting."""

    classifiers = [
        ("GaussianNB", GaussianNB()),
        (
            "SGDClassifier",
            SGDClassifier(
                random_state=1, loss="log_loss", n_jobs=-1, class_weight="balanced"
            ),
        ),
        (
            "LGBMClassifier",
            LGBMClassifier(
                random_state=1,
                class_weight="balanced",
                objective="multiclass",
                n_jobs=-1,
                device="gpu",
                verbosity=1,
            ),
        ),
    ]

    voting_classifier = VotingClassifier(classifiers, voting="hard", n_jobs=-1)

    pipeline = Pipeline(
        steps=[
            ("preprocessor", clone(pre_processing_lr)),
            ("classifier", voting_classifier),
        ]
    )
    pipeline.fit(X_train, y_train_grades)
    return pipeline


# Time: 1m 7.6s
models_grades["Voting (hard)"] = fit_voting_hard_grades()

4.2.2 Evaluation

Next, the models will be evaluated. Macro-averaged ROC AUC (and the F1 score where ROC AUC is unavailable) will be used as the main performance metric. The results will be compared to a baseline model that always predicts the most frequent class. Two metrics are needed because ROC AUC cannot be calculated for hard voting (it requires predicted probabilities).
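As a reference for how these metrics are computed, here is a minimal sketch with toy data (the project's actual report code is wrapped in `ml.print_classification_report` and is not shown here):

```python
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score

# Toy 3-class example: true labels and predicted class probabilities.
y_true = np.array([0, 1, 2, 1, 0, 2])
y_proba = np.array([
    [0.8, 0.1, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.2, 0.7],
    [0.3, 0.5, 0.2],
    [0.6, 0.3, 0.1],
    [0.2, 0.2, 0.6],
])

# One-versus-rest, macro-averaged ROC AUC (requires probabilities).
auc = roc_auc_score(y_true, y_proba, multi_class="ovr", average="macro")

# Macro F1 needs only predicted labels, so it also works for hard voting.
y_pred = y_proba.argmax(axis=1)
f1 = f1_score(y_true, y_pred, average="macro")
print(round(auc, 4), round(f1, 4))
```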

Results: the best-performing model was LGBM, with a ROC AUC of 0.9950 and a macro-averaged F1 score of 0.83 on the validation set.

Please find the details below.

# Report that includes ROC AUC
def print_classification_report_for_grades(models, model_name, include_roc_auc=True):
    """Make a classification report for a given model.

    Creates classification reports on training and validation sets.
    Args:
        models (dict|scikit learn model): Either a dictionary with models or a model.
        model_name (str): The name of the model to be evaluated.
        include_roc_auc (bool): Whether to include ROC AUC in the report.
    """
    ml.print_classification_report(
        models,
        model_name,
        X_train,
        X_validation,
        y_train_grades,
        y_validation_grades,
        label_train_set="Train",
        label_test_set="Validation",
        label_model=model_name,
        include_roc_auc=include_roc_auc,
    )
Code
dummy_grades = DummyClassifier(strategy="most_frequent")
dummy_grades.fit(X_train, y_train_grades)

print_classification_report_for_grades(dummy_grades, "baseline")
--- Train (baseline) ---
ROC AUC: 0.5000 (one-versus-rest macro average)

              precision    recall  f1-score   support

           A       0.00      0.00      0.00     94625
           B       0.29      1.00      0.44     98955
           C       0.00      0.00      0.00     88794
           D       0.00      0.00      0.00     48333
           E       0.00      0.00      0.00     13270
           F       0.00      0.00      0.00      2222
           G       0.00      0.00      0.00       470

    accuracy                           0.29    346669
   macro avg       0.04      0.14      0.06    346669
weighted avg       0.08      0.29      0.13    346669


--- Validation (baseline) ---
ROC AUC: 0.5000 (one-versus-rest macro average)

              precision    recall  f1-score   support

           A       0.00      0.00      0.00     20275
           B       0.29      1.00      0.44     21203
           C       0.00      0.00      0.00     19027
           D       0.00      0.00      0.00     10355
           E       0.00      0.00      0.00      2846
           F       0.00      0.00      0.00       479
           G       0.00      0.00      0.00       101

    accuracy                           0.29     74286
   macro avg       0.04      0.14      0.06     74286
weighted avg       0.08      0.29      0.13     74286

Evaluation of Naive Bayes model:

Code
print_classification_report_for_grades(models_grades, "Naive Bayes")
--- Train (Naive Bayes) ---
ROC AUC: 0.7387 (one-versus-rest macro average)

              precision    recall  f1-score   support

           A       0.58      0.71      0.64     94625
           B       0.32      0.57      0.41     98955
           C       0.35      0.02      0.03     88794
           D       0.38      0.09      0.14     48333
           E       0.14      0.06      0.08     13270
           F       0.04      0.31      0.07      2222
           G       0.01      0.29      0.02       470

    accuracy                           0.38    346669
   macro avg       0.26      0.29      0.20    346669
weighted avg       0.40      0.38      0.32    346669


--- Validation (Naive Bayes) ---
ROC AUC: 0.7440 (one-versus-rest macro average)

              precision    recall  f1-score   support

           A       0.58      0.71      0.64     20275
           B       0.32      0.57      0.41     21203
           C       0.36      0.02      0.04     19027
           D       0.39      0.09      0.15     10355
           E       0.14      0.06      0.09      2846
           F       0.04      0.32      0.08       479
           G       0.01      0.24      0.01       101

    accuracy                           0.38     74286
   macro avg       0.26      0.29      0.20     74286
weighted avg       0.40      0.38      0.32     74286

Evaluation of Logistic Regression model:

Code
print_classification_report_for_grades(models_grades, "Logistic Regression")
--- Train (Logistic Regression) ---
ROC AUC: 0.9062 (one-versus-rest macro average)

              precision    recall  f1-score   support

           A       0.76      0.89      0.82     94625
           B       0.61      0.50      0.55     98955
           C       0.56      0.49      0.52     88794
           D       0.53      0.51      0.52     48333
           E       0.33      0.57      0.42     13270
           F       0.16      0.14      0.15      2222
           G       0.07      0.72      0.12       470

    accuracy                           0.61    346669
   macro avg       0.43      0.55      0.44    346669
weighted avg       0.61      0.61      0.61    346669


--- Validation (Logistic Regression) ---
ROC AUC: 0.9036 (one-versus-rest macro average)

              precision    recall  f1-score   support

           A       0.76      0.89      0.82     20275
           B       0.60      0.50      0.55     21203
           C       0.56      0.48      0.52     19027
           D       0.53      0.51      0.52     10355
           E       0.33      0.56      0.42      2846
           F       0.17      0.15      0.16       479
           G       0.07      0.76      0.13       101

    accuracy                           0.60     74286
   macro avg       0.43      0.55      0.44     74286
weighted avg       0.61      0.60      0.60     74286

Evaluation of LGBM model:

Code
print_classification_report_for_grades(models_grades, "LGBM")
--- Train (LGBM) ---
ROC AUC: 0.9963 (one-versus-rest macro average)

              precision    recall  f1-score   support

           A       0.96      0.98      0.97     94625
           B       0.93      0.92      0.92     98955
           C       0.94      0.90      0.92     88794
           D       0.93      0.92      0.92     48333
           E       0.85      0.96      0.90     13270
           F       0.91      1.00      0.95      2222
           G       0.80      1.00      0.89       470

    accuracy                           0.93    346669
   macro avg       0.90      0.95      0.92    346669
weighted avg       0.93      0.93      0.93    346669


--- Validation (LGBM) ---
ROC AUC: 0.9950 (one-versus-rest macro average)

              precision    recall  f1-score   support

           A       0.96      0.98      0.97     20275
           B       0.92      0.92      0.92     21203
           C       0.93      0.90      0.91     19027
           D       0.92      0.91      0.92     10355
           E       0.84      0.93      0.88      2846
           F       0.74      0.86      0.80       479
           G       0.38      0.45      0.41       101

    accuracy                           0.93     74286
   macro avg       0.81      0.85      0.83     74286
weighted avg       0.93      0.93      0.93     74286

Evaluate the soft voting ensemble model:

Code
print_classification_report_for_grades(models_grades, "Voting (soft)")
--- Train (Voting (soft)) ---
ROC AUC: 0.9858 (one-versus-rest macro average)

              precision    recall  f1-score   support

           A       0.76      0.95      0.84     94625
           B       0.98      0.65      0.78     98955
           C       0.98      0.63      0.77     88794
           D       0.90      0.65      0.75     48333
           E       0.78      0.64      0.70     13270
           F       0.54      0.22      0.31      2222
           G       0.01      0.99      0.02       470

    accuracy                           0.72    346669
   macro avg       0.70      0.67      0.60    346669
weighted avg       0.89      0.72      0.78    346669


--- Validation (Voting (soft)) ---
ROC AUC: 0.9823 (one-versus-rest macro average)

              precision    recall  f1-score   support

           A       0.75      0.94      0.84     20275
           B       0.98      0.64      0.77     21203
           C       0.98      0.63      0.76     19027
           D       0.90      0.64      0.75     10355
           E       0.76      0.61      0.68      2846
           F       0.47      0.16      0.24       479
           G       0.01      0.97      0.02       101

    accuracy                           0.71     74286
   macro avg       0.69      0.66      0.58     74286
weighted avg       0.89      0.71      0.78     74286

Evaluate the hard voting ensemble model (no AUC is calculated):

Code
# NOTE: for hard voting `.predict_proba` is not available.
print_classification_report_for_grades(
    models_grades, "Voting (hard)", include_roc_auc=False
)
--- Train (Voting (hard)) ---
              precision    recall  f1-score   support

           A       0.67      0.99      0.80     94625
           B       0.75      0.67      0.71     98955
           C       0.83      0.64      0.73     88794
           D       0.86      0.59      0.70     48333
           E       0.75      0.62      0.68     13270
           F       0.89      0.15      0.26      2222
           G       0.09      0.99      0.16       470

    accuracy                           0.73    346669
   macro avg       0.69      0.66      0.58    346669
weighted avg       0.76      0.73      0.73    346669


--- Validation (Voting (hard)) ---
              precision    recall  f1-score   support

           A       0.67      0.99      0.80     20275
           B       0.74      0.66      0.70     21203
           C       0.82      0.64      0.72     19027
           D       0.85      0.59      0.70     10355
           E       0.75      0.60      0.67      2846
           F       0.84      0.15      0.25       479
           G       0.08      0.83      0.14       101

    accuracy                           0.73     74286
   macro avg       0.68      0.64      0.57     74286
weighted avg       0.76      0.73      0.73     74286

The best-performing model is LGBM. It will be explored in more detail.

Code
y_pred_validation_lgbm_grades = models_grades["LGBM"].predict(X_validation)
Code
sns.set_style("white")
ml.plot_confusion_matrices(
    y_validation_grades,
    y_pred_validation_lgbm_grades,
    figsize=(15, 4),
    text_kw={"size": 9},
);

It seems that the highest confusion is between F and G grades.

4.2.3 Feature Importance (General)

Both the internal LGBM method and SHAP values were used to evaluate feature importance. The variables total_rec_int, total_rec_prncp, and installment were identified as the most important ones by both methods, just in a different order.

Please find the details below.

Code
@my.cache_results(dir_interim + "task-2-grades--shap_lgbm_k=all.pkl")
def get_shap_values_grades_lgbm():
    model = "LGBM"
    preproc = Pipeline(steps=models_grades[model].steps[:-1])
    classifier = models_grades[model]["classifier"]
    X_validation_preproc = preproc.transform(X_validation)

    tree_explainer = shap.TreeExplainer(classifier)
    shap_values = tree_explainer.shap_values(X_validation_preproc)

    return shap_values, X_validation_preproc


shap_values_lgbm_grades, data_for_lgbm_grades = get_shap_values_grades_lgbm()
# Time: 3m 17.8s
Code
sns.set_style("white")
lgb.plot_importance(
    models_grades["LGBM"]["classifier"],
    max_num_features=50,
    figsize=(10, 10),
    height=0.8,
    title="LGBM Feature Importance (Grades Prediction)",
);

Code
vals = np.abs(shap_values_lgbm_grades).mean(0).mean(0)
feature_importance = (
    pd.DataFrame(
        list(zip(data_for_lgbm_grades.columns, vals)),
        columns=["col_name", "importance"],
    )
    .sort_values(by=["importance"], ascending=False)
    .reset_index()
)

feature_importance.query("importance > 0.01").head().style.format(precision=4)
  index col_name importance
0 22 total_rec_int 0.7640
1 21 total_rec_prncp 0.4286
2 3 installment 0.3728
3 2 funded_amnt_inv 0.1921
4 120 initial_list_status_f 0.1906
Code
shap.summary_plot(
    shap_values_lgbm_grades, data_for_lgbm_grades, plot_type="bar", max_display=150
)

4.2.4 Feature Importance (by Grade)

Analyzing feature importance by grade, it seems that for higher-grade prediction, the total interest received (total_rec_int) is the most important feature, while for lower-grade prediction, the most important feature is the monthly installment size (installment).

Find the details below.

def shap_summary_plot__grades(gr_index):
    """Convenience function to plot SHAP summary plot for a given grade.

    Args:
        gr_index (int): The index of the grade to plot.
    """
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        shap.summary_plot(
            shap_values_lgbm_grades[gr_index],
            data_for_lgbm_grades,
            plot_type="dot",
            max_display=20,
            plot_size=(10, 6),
        )

Feature importances for Grade A:

Code
shap_summary_plot__grades(0)

Feature importances for Grade B:

Code
shap_summary_plot__grades(1)

Feature importances for Grade C:

Code
shap_summary_plot__grades(2)

Feature importances for Grade D:

Code
shap_summary_plot__grades(3)

Feature importances for Grade E:

Code
shap_summary_plot__grades(4)

Feature importances for Grade F:

Code
shap_summary_plot__grades(5)

Feature importances for Grade G:

Code
shap_summary_plot__grades(6)

4.2.5 Re-Train Models on the Most Important Features

The LGBM model will be trained on a smaller number of features (45, 30, and 13). These feature sets were selected using 0.010, 0.025, and 0.090 cut-off thresholds of SHAP feature importance. The model with 13 features showed the best overall performance (see Table 4.1) and improved classification performance for most grades, grade F being the exception (see Table 4.2).

Table 4.1. Overall classification performance (validation set) of the LGBM models with all 140 features and with 45, 30, and 13 features.
Number of features ROC AUC (validation)
140 (all) 0.9950
45 0.9950
30 0.9951
13 0.9967
Table 4.2. Classification performance by grade (validation set) of the LGBM models with all 140 features and with 13 features.
Grade F1 (all features) F1 (13 features)
A 0.97 0.98
B 0.92 0.94
C 0.91 0.93
D 0.92 0.94
E 0.88 0.90
F 0.80 0.78
G 0.41 0.49

Please find more details below.
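The `ColumnSelector` used in the pipelines below is a project helper; a minimal scikit-learn-compatible transformer with the assumed behavior (keep only the listed columns of a DataFrame) might look like this:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class ColumnSelector(BaseEstimator, TransformerMixin):
    """Keep only the listed columns of a DataFrame.

    Sketch only; the project's `ColumnSelector` is assumed
    to behave similarly.
    """

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        # Stateless transformer: nothing to learn.
        return self

    def transform(self, X):
        return X[self.columns]
```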

def fit_lgbm_grades_with_selection(features):
    """Template to fit an LGBM model with a smaller number of features."""
    pipeline = Pipeline(
        steps=[
            ("preprocessor", clone(pre_processing_trees)),
            ("selector", ColumnSelector(features)),
            (
                "classifier",
                LGBMClassifier(
                    random_state=1,
                    class_weight="balanced",
                    objective="multiclass",
                    n_jobs=-1,
                    device="gpu",
                    verbosity=1,
                ),
            ),
        ]
    )
    pipeline.fit(X_train, y_train_grades)
    return pipeline


@my.cache_results(dir_interim + "models_2_03_lgbm-45-features.pickle")
def fit_lgbm_grades_45():
    features = feature_importance.query("importance > 0.010").col_name.to_list()
    return fit_lgbm_grades_with_selection(features)


@my.cache_results(dir_interim + "models_2_03_lgbm-30-features.pickle")
def fit_lgbm_grades_30():
    features = feature_importance.query("importance > 0.025").col_name.to_list()
    return fit_lgbm_grades_with_selection(features)


@my.cache_results(dir_interim + "models_2_03_lgbm-13-features.pickle")
def fit_lgbm_grades_13():
    features = feature_importance.query("importance > 0.090").col_name.to_list()
    return fit_lgbm_grades_with_selection(features)
Code
models_grades["LGBM (45 features)"] = fit_lgbm_grades_45()
Code
models_grades["LGBM (30 features)"] = fit_lgbm_grades_30()
Code
models_grades["LGBM (13 features)"] = fit_lgbm_grades_13()
Code
print_classification_report_for_grades(models_grades, "LGBM (45 features)")
--- Train (LGBM (45 features)) ---
ROC AUC: 0.9964 (one-versus-rest macro average)

              precision    recall  f1-score   support

           A       0.96      0.98      0.97     94625
           B       0.93      0.92      0.92     98955
           C       0.94      0.91      0.92     88794
           D       0.93      0.92      0.92     48333
           E       0.85      0.96      0.90     13270
           F       0.91      1.00      0.95      2222
           G       0.77      1.00      0.87       470

    accuracy                           0.94    346669
   macro avg       0.90      0.96      0.92    346669
weighted avg       0.94      0.94      0.94    346669

--- Validation (LGBM (45 features)) ---
ROC AUC: 0.9950 (one-versus-rest macro average)

              precision    recall  f1-score   support

           A       0.96      0.98      0.97     20275
           B       0.92      0.92      0.92     21203
           C       0.93      0.90      0.91     19027
           D       0.92      0.92      0.92     10355
           E       0.84      0.94      0.89      2846
           F       0.77      0.85      0.81       479
           G       0.39      0.49      0.43       101

    accuracy                           0.93     74286
   macro avg       0.82      0.86      0.84     74286
weighted avg       0.93      0.93      0.93     74286
Code
print_classification_report_for_grades(models_grades, "LGBM (30 features)")
--- Train (LGBM (30 features)) ---
ROC AUC: 0.9965 (one-versus-rest macro average)

              precision    recall  f1-score   support

           A       0.96      0.98      0.97     94625
           B       0.93      0.93      0.93     98955
           C       0.94      0.91      0.92     88794
           D       0.93      0.92      0.93     48333
           E       0.85      0.96      0.90     13270
           F       0.87      0.99      0.93      2222
           G       0.72      1.00      0.84       470

    accuracy                           0.94    346669
   macro avg       0.89      0.95      0.92    346669
weighted avg       0.94      0.94      0.94    346669

--- Validation (LGBM (30 features)) ---
ROC AUC: 0.9951 (one-versus-rest macro average)

              precision    recall  f1-score   support

           A       0.96      0.98      0.97     20275
           B       0.92      0.92      0.92     21203
           C       0.93      0.90      0.92     19027
           D       0.93      0.92      0.92     10355
           E       0.84      0.93      0.88      2846
           F       0.74      0.85      0.79       479
           G       0.35      0.52      0.42       101

    accuracy                           0.93     74286
   macro avg       0.81      0.86      0.83     74286
weighted avg       0.93      0.93      0.93     74286
Code
print_classification_report_for_grades(models_grades, "LGBM (13 features)")
--- Train (LGBM (13 features)) ---
ROC AUC: 0.9976 (one-versus-rest macro average)

              precision    recall  f1-score   support

           A       0.97      0.99      0.98     94625
           B       0.95      0.94      0.94     98955
           C       0.95      0.92      0.94     88794
           D       0.95      0.93      0.94     48333
           E       0.88      0.96      0.92     13270
           F       0.87      0.94      0.90      2222
           G       0.64      1.00      0.78       470

    accuracy                           0.95    346669
   macro avg       0.89      0.96      0.91    346669
weighted avg       0.95      0.95      0.95    346669

--- Validation (LGBM (13 features)) ---
ROC AUC: 0.9967 (one-versus-rest macro average)

              precision    recall  f1-score   support

           A       0.97      0.99      0.98     20275
           B       0.94      0.94      0.94     21203
           C       0.95      0.92      0.93     19027
           D       0.95      0.93      0.94     10355
           E       0.87      0.94      0.90      2846
           F       0.74      0.83      0.78       479
           G       0.40      0.62      0.49       101

    accuracy                           0.95     74286
   macro avg       0.83      0.88      0.85     74286
weighted avg       0.95      0.95      0.95     74286

The model with 13 features performed best. Let’s analyze it in more detail.

The 13 features are:

Code
feature_importance.head(13).col_name.to_list()
['total_rec_int',
 'total_rec_prncp',
 'installment',
 'funded_amnt_inv',
 'initial_list_status_f',
 'out_prncp_inv',
 'issue_month',
 'loan_amnt',
 'funded_amnt',
 'disbursement_method_Cash',
 'out_prncp',
 'term_ 36 months',
 'fico_range_high']
Code
y_pred_validation_lgbm_grades_13 = models_grades["LGBM (13 features)"].predict(
    X_validation
)
sns.set_style("white")
ml.plot_confusion_matrices(
    y_validation_grades,
    y_pred_validation_lgbm_grades_13,
    figsize=(15, 4),
    text_kw={"size": 9},
);

The LGBM model with 13 features showed the best validation ROC AUC, 0.9967. Comparing the grades individually, F1 scores in the 13-feature model improved for all grades except grade F (see Table 4.2).
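For reference, the one-versus-rest macro-averaged ROC AUC reported throughout these tables can be computed with scikit-learn's `roc_auc_score` on predicted class probabilities. A minimal, self-contained sketch on synthetic multi-class data (the dataset and classifier here are illustrative stand-ins, not the project's):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic multi-class data as a stand-in for the loan grades
X, y = make_classification(
    n_samples=1000, n_classes=4, n_informative=8, random_state=1
)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=1)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# The metric needs per-class probabilities, not hard labels
auc = roc_auc_score(
    y_va, clf.predict_proba(X_va), multi_class="ovr", average="macro"
)
```

With `multi_class="ovr"` and `average="macro"`, each class is scored against the rest and the per-class AUCs are averaged with equal weight, which is why rare grades such as G influence this metric as much as the frequent ones.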

4.2.6 Evaluation on Test Set

The final model evaluation on the test set shows ROC AUC of 0.9968.

Please find the details below.

Code
# Model input variables
grade_model_input = [
    "total_rec_int",
    "total_rec_prncp",
    "installment",
    "funded_amnt_inv",
    "initial_list_status_f",
    "out_prncp_inv",
    "issue_month",
    "loan_amnt",
    "funded_amnt",
    "disbursement_method_Cash",
    "out_prncp",
    "term_ 36 months",
    "fico_range_high",
]
Code
X_train_validation = pd.concat([X_train, X_validation], axis="index")
y_train_validation_grades = pd.concat(
    [y_train_grades, y_validation_grades], axis="index"
)


@my.cache_results(dir_interim + "models_2_03_lgbm--final-evaluation.pickle")
def fit_lgbm_grades_final_evaluation():
    """Fit a LGBM model for final testing."""
    pipeline = Pipeline(
        steps=[
            ("preprocessor", clone(pre_processing_trees)),
            ("selector", ColumnSelector(grade_model_input)),
            (
                "classifier",
                LGBMClassifier(
                    random_state=1,
                    class_weight="balanced",
                    objective="multiclass",
                    n_jobs=-1,
                    device="gpu",
                    verbosity=1,
                ),
            ),
        ]
    )
    pipeline.fit(X_train_validation, y_train_validation_grades)
    return pipeline


models_grades["LGBM (final evaluation)"] = fit_lgbm_grades_final_evaluation()
Code
ml.print_classification_report(
    models=models_grades,
    model_name="LGBM (final evaluation)",
    X_train=X_train_validation,
    X_test=X_test,
    y_train=y_train_validation_grades,
    y_test=y_test_grades,
    label_train_set="Train + Validation",
    label_test_set="Test",
    label_model="Final evaluation",
)
--- Train + Validation (Final evaluation) ---
ROC AUC: 0.9976 (one-versus-rest macro average)

              precision    recall  f1-score   support

           A       0.97      0.99      0.98    114900
           B       0.94      0.94      0.94    120158
           C       0.95      0.92      0.93    107821
           D       0.95      0.93      0.94     58688
           E       0.88      0.96      0.92     16116
           F       0.87      0.92      0.90      2701
           G       0.60      1.00      0.75       571

    accuracy                           0.95    420955
   macro avg       0.88      0.95      0.91    420955
weighted avg       0.95      0.95      0.95    420955


--- Test (Final evaluation) ---
ROC AUC: 0.9968 (one-versus-rest macro average)

              precision    recall  f1-score   support

           A       0.97      0.99      0.98     20277
           B       0.94      0.94      0.94     21207
           C       0.94      0.92      0.93     19029
           D       0.94      0.93      0.93     10358
           E       0.87      0.95      0.91      2842
           F       0.78      0.80      0.79       474
           G       0.38      0.66      0.49       100

    accuracy                           0.94     74287
   macro avg       0.83      0.88      0.85     74287
weighted avg       0.94      0.94      0.94     74287

Confusion matrices for the test set:

Code
y_pred_test = models_grades["LGBM (final evaluation)"].predict(X_test)

sns.set_style("white")
ml.plot_confusion_matrices(
    y_test_grades, y_pred_test, figsize=(15, 4), text_kw={"size": 9}
);

4.2.7 Final Model for Deployment

In this section, the final pre-processing and prediction pipeline will be created and trained on the whole dataset. The model will be saved into a pickle file and deployed on Google Cloud Platform (GCP).
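Deployment boils down to serializing the fitted pipeline and restoring it inside the API service. A minimal sketch of that pickle round trip, using a toy pipeline as a stand-in for `loan_grade_predictor_final`:

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the fitted grade-prediction pipeline
X, y = make_classification(n_samples=200, random_state=1)
pipeline = Pipeline(
    steps=[("scaler", StandardScaler()), ("classifier", LogisticRegression())]
).fit(X, y)

# Serialize the whole pipeline (pre-processing + model) to bytes;
# in the project, the bytes are written to a .pkl file instead
blob = pickle.dumps(pipeline)

# Inside the deployed API, the pipeline is restored and used directly
restored = pickle.loads(blob)
predictions = restored.predict(X[:5])
```

Pickling the entire `Pipeline` object, rather than the classifier alone, keeps the column selection and pre-processing steps bundled with the model, so the API can accept raw input columns.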

Code
# Pipeline input variables
grade_pipeline_input = [
    "total_rec_int",
    "total_rec_prncp",
    "installment",
    "funded_amnt_inv",
    "initial_list_status",
    "out_prncp_inv",
    "issue_d",
    "loan_amnt",
    "funded_amnt",
    "disbursement_method",
    "out_prncp",
    "term",
    "fico_range_high",
]


@my.cache_results(dir_interim + "02--model_predict_grade.pkl")
def fit_lgbm_grades_final():
    """Final LGBM model for grades prediction."""
    pipeline = Pipeline(
        steps=[
            ("selector_1", ColumnSelector(grade_pipeline_input)),
            ("preprocessor_1", PreprocessorForGrades()),
            ("preprocessor_2", clone(pre_processing_trees)),
            ("selector_2", ColumnSelector(grade_model_input)),
            (
                "classifier",
                LGBMClassifier(
                    random_state=1,
                    class_weight="balanced",
                    objective="multiclass",
                    n_jobs=-1,
                    device="gpu",
                    verbosity=1,
                ),
            ),
        ]
    )

    # Load required dataset
    data_file_path = dir_interim + "task-2--1-accepted_loans_2018--raw.feather"
    accepted_2018 = pd.read_feather(data_file_path)

    pipeline.set_output(transform="pandas")
    pipeline.fit(accepted_2018, accepted_2018["grade"])
    return pipeline


# Fit the model
loan_grade_predictor_final = fit_lgbm_grades_final()
loan_grade_predictor_final

4.3 Loan Sub-Grade Prediction

This section focuses on predicting loan sub-grades.

4.3.1 Train Models

Here, three models will be trained:

  • logistic regression (LR);
  • Naive Bayes (NB);
  • LGBM.

The voting ensemble model will not be trained as it did not improve the performance in the grade prediction task.

Code
# Dictionary to collect the results
models_subgrades = {}
Code
@my.cache_results(dir_interim + "models_3_01_naive_bayes.pickle")
def fit_nb_subgrades():
    """Fit a Naive Bayes model."""
    pipeline = Pipeline(
        steps=[
            ("preprocessor", clone(pre_processing_lr)),
            ("classifier", GaussianNB()),
        ]
    )
    pipeline.fit(X_train, y_train_subgrades)
    return pipeline


models_subgrades["Naive Bayes"] = fit_nb_subgrades()
# Time: 13.3s
Code
@my.cache_results(dir_interim + "models_3_02_logistic_regression_sgd.pickle")
def fit_lr_sgd_subgrades():
    """Fit a Logistic Regression model (via SGD training)."""
    pipeline = Pipeline(
        steps=[
            ("preprocessor", clone(pre_processing_lr)),
            (
                "classifier",
                SGDClassifier(
                    random_state=1, loss="log_loss", n_jobs=-1, class_weight="balanced"
                ),
            ),
        ]
    )
    pipeline.fit(X_train, y_train_subgrades)
    return pipeline


models_subgrades["Logistic Regression"] = fit_lr_sgd_subgrades()
# Time: 50.5s
Code
@my.cache_results(dir_interim + "models_3_03_lgbm.pickle")
def fit_lgbm_subgrades():
    """Fit a LGBM model."""
    pipeline = Pipeline(
        steps=[
            ("preprocessor", clone(pre_processing_trees)),
            (
                "classifier",
                LGBMClassifier(
                    random_state=1,
                    class_weight="balanced",
                    objective="multiclass",
                    n_jobs=-1,
                    device="gpu",
                    verbosity=1,
                ),
            ),
        ]
    )
    pipeline.fit(X_train, y_train_subgrades)
    return pipeline


models_subgrades["LGBM"] = fit_lgbm_subgrades()
# Time: 3m 23.9s

4.3.2 Evaluation

Next, the models will be evaluated. Macro-averaged ROC AUC will be used as the main performance metric and the F1 score as an additional one. The results will be compared to a baseline model that always predicts the most frequent class.

Results: the best-performing model was LGBM, with a ROC AUC of 0.9668 and an F1 score of 0.58 on the validation set. Sub-grades from A1 to D5 are classified most correctly, the highest confusion is between the lowest sub-grades (F4 to G5), and for sub-grades from G2 to G5 there were no correct predictions at all.

In the baseline model, ROC AUC was 0.5 and F1 was 0.0.
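The baseline figures follow directly from how a most-frequent classifier scores: its probability estimates are the same constant row for every sample. A small sketch on synthetic data (3 classes instead of the 35 sub-grades, so the macro F1 here is small but not exactly 0.0 as in the 35-class report):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score, roc_auc_score

# Toy 3-class dataset standing in for the 35 sub-grade classes
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
y = rng.choice(list("ABC"), size=300, p=[0.5, 0.3, 0.2])

baseline = DummyClassifier(strategy="most_frequent").fit(X, y)

# Constant probabilities make every per-class ROC curve the diagonal,
# so the one-versus-rest macro AUC is exactly 0.5
auc = roc_auc_score(
    y, baseline.predict_proba(X), multi_class="ovr", average="macro"
)

# Macro F1 is non-zero only for the majority class; averaged over many
# classes it shrinks toward 0.0
f1 = f1_score(y, baseline.predict(X), average="macro", zero_division=0)
```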

Find the details below.

Code
def print_classification_report_for_subgrades(models, model_name, include_roc_auc=True):
    """Make a classification report for a given model.

    Creates classification reports on training and validation sets.
    Args:
        models (dict|scikit learn model): Either a dictionary with models or a model.
        model_name (str): The name of the model to be evaluated.
        include_roc_auc (bool): Whether to include ROC AUC in the report.
    """
    ml.print_classification_report(
        models,
        model_name,
        X_train,
        X_validation,
        y_train_subgrades,
        y_validation_subgrades,
        label_train_set="Train",
        label_test_set="Validation",
        label_model=model_name,
        include_roc_auc=include_roc_auc,
    )
Code
dummy_subgrades = DummyClassifier(strategy="most_frequent")
dummy_subgrades.fit(X_train, y_train_subgrades)

print_classification_report_for_subgrades(dummy_subgrades, "baseline")
--- Train (baseline) ---
ROC AUC: 0.5000 (one-versus-rest macro average)

              precision    recall  f1-score   support

          A1       0.00      0.00      0.00     20221
          A2       0.00      0.00      0.00     16010
          A3       0.00      0.00      0.00     17685
          A4       0.00      0.00      0.00     22154
          A5       0.00      0.00      0.00     18555
          B1       0.00      0.00      0.00     19626
          B2       0.00      0.00      0.00     20671
          B3       0.00      0.00      0.00     16223
          B4       0.06      1.00      0.12     22209
          B5       0.00      0.00      0.00     20226
          C1       0.00      0.00      0.00     19874
          C2       0.00      0.00      0.00     17635
          C3       0.00      0.00      0.00     18397
          C4       0.00      0.00      0.00     16885
          C5       0.00      0.00      0.00     16003
          D1       0.00      0.00      0.00     11457
          D2       0.00      0.00      0.00     10801
          D3       0.00      0.00      0.00      9859
          D4       0.00      0.00      0.00      8630
          D5       0.00      0.00      0.00      7586
          E1       0.00      0.00      0.00      2796
          E2       0.00      0.00      0.00      2234
          E3       0.00      0.00      0.00      2663
          E4       0.00      0.00      0.00      2299
          E5       0.00      0.00      0.00      3278
          F1       0.00      0.00      0.00      1063
          F2       0.00      0.00      0.00       403
          F3       0.00      0.00      0.00       305
          F4       0.00      0.00      0.00       220
          F5       0.00      0.00      0.00       231
          G1       0.00      0.00      0.00       341
          G2       0.00      0.00      0.00        62
          G3       0.00      0.00      0.00        31
          G4       0.00      0.00      0.00        17
          G5       0.00      0.00      0.00        19

    accuracy                           0.06    346669
   macro avg       0.00      0.03      0.00    346669
weighted avg       0.00      0.06      0.01    346669


--- Validation (baseline) ---
ROC AUC: 0.5000 (one-versus-rest macro average)

              precision    recall  f1-score   support

          A1       0.00      0.00      0.00      4333
          A2       0.00      0.00      0.00      3430
          A3       0.00      0.00      0.00      3789
          A4       0.00      0.00      0.00      4747
          A5       0.00      0.00      0.00      3976
          B1       0.00      0.00      0.00      4205
          B2       0.00      0.00      0.00      4429
          B3       0.00      0.00      0.00      3476
          B4       0.06      1.00      0.12      4759
          B5       0.00      0.00      0.00      4334
          C1       0.00      0.00      0.00      4259
          C2       0.00      0.00      0.00      3779
          C3       0.00      0.00      0.00      3942
          C4       0.00      0.00      0.00      3618
          C5       0.00      0.00      0.00      3429
          D1       0.00      0.00      0.00      2455
          D2       0.00      0.00      0.00      2314
          D3       0.00      0.00      0.00      2112
          D4       0.00      0.00      0.00      1849
          D5       0.00      0.00      0.00      1625
          E1       0.00      0.00      0.00       600
          E2       0.00      0.00      0.00       479
          E3       0.00      0.00      0.00       571
          E4       0.00      0.00      0.00       493
          E5       0.00      0.00      0.00       703
          F1       0.00      0.00      0.00       228
          F2       0.00      0.00      0.00        87
          F3       0.00      0.00      0.00        66
          F4       0.00      0.00      0.00        48
          F5       0.00      0.00      0.00        50
          G1       0.00      0.00      0.00        73
          G2       0.00      0.00      0.00        13
          G3       0.00      0.00      0.00         7
          G4       0.00      0.00      0.00         4
          G5       0.00      0.00      0.00         4

    accuracy                           0.06     74286
   macro avg       0.00      0.03      0.00     74286
weighted avg       0.00      0.06      0.01     74286

Evaluation of Naive Bayes model:

Code
print_classification_report_for_subgrades(models_subgrades, "Naive Bayes")
--- Train (Naive Bayes) ---
ROC AUC: 0.5804 (one-versus-rest macro average)

              precision    recall  f1-score   support

          A1       0.14      0.19      0.16     20221
          A2       0.15      0.01      0.02     16010
          A3       0.08      0.00      0.00     17685
          A4       0.00      0.00      0.00     22154
          A5       0.00      0.00      0.00     18555
          B1       0.06      0.00      0.00     19626
          B2       0.00      0.00      0.00     20671
          B3       0.00      0.00      0.00     16223
          B4       0.07      0.00      0.00     22209
          B5       0.00      0.00      0.00     20226
          C1       0.04      0.00      0.00     19874
          C2       0.03      0.00      0.00     17635
          C3       0.00      0.00      0.00     18397
          C4       0.05      0.00      0.00     16885
          C5       0.33      0.00      0.00     16003
          D1       0.04      0.00      0.00     11457
          D2       0.03      0.00      0.00     10801
          D3       0.00      0.00      0.00      9859
          D4       0.00      0.00      0.00      8630
          D5       0.03      0.00      0.00      7586
          E1       0.06      0.03      0.04      2796
          E2       0.05      0.00      0.01      2234
          E3       0.04      0.00      0.01      2663
          E4       0.05      0.01      0.01      2299
          E5       0.03      0.21      0.05      3278
          F1       0.01      0.00      0.00      1063
          F2       0.01      0.00      0.01       403
          F3       0.00      0.01      0.00       305
          F4       0.00      0.05      0.01       220
          F5       0.00      0.00      0.00       231
          G1       0.00      0.04      0.01       341
          G2       0.00      0.24      0.00        62
          G3       0.00      0.19      0.00        31
          G4       0.00      0.76      0.00        17
          G5       0.00      1.00      0.00        19

    accuracy                           0.01    346669
   macro avg       0.04      0.08      0.01    346669
weighted avg       0.05      0.01      0.01    346669

--- Validation (Naive Bayes) ---
ROC AUC: 0.5474 (one-versus-rest macro average)

              precision    recall  f1-score   support

          A1       0.14      0.20      0.17      4333
          A2       0.20      0.01      0.03      3430
          A3       0.08      0.00      0.00      3789
          A4       0.00      0.00      0.00      4747
          A5       0.00      0.00      0.00      3976
          B1       0.06      0.00      0.00      4205
          B2       0.00      0.00      0.00      4429
          B3       0.00      0.00      0.00      3476
          B4       0.08      0.00      0.00      4759
          B5       0.00      0.00      0.00      4334
          C1       0.09      0.00      0.00      4259
          C2       0.10      0.00      0.00      3779
          C3       0.00      0.00      0.00      3942
          C4       0.00      0.00      0.00      3618
          C5       0.00      0.00      0.00      3429
          D1       0.09      0.00      0.00      2455
          D2       0.04      0.00      0.00      2314
          D3       0.00      0.00      0.00      2112
          D4       0.00      0.00      0.00      1849
          D5       0.02      0.00      0.00      1625
          E1       0.08      0.03      0.05       600
          E2       0.00      0.00      0.00       479
          E3       0.04      0.01      0.01       571
          E4       0.00      0.00      0.00       493
          E5       0.03      0.20      0.05       703
          F1       0.00      0.00      0.00       228
          F2       0.00      0.00      0.00        87
          F3       0.00      0.00      0.00        66
          F4       0.00      0.02      0.00        48
          F5       0.00      0.00      0.00        50
          G1       0.00      0.03      0.01        73
          G2       0.00      0.08      0.00        13
          G3       0.00      0.00      0.00         7
          G4       0.00      0.00      0.00         4
          G5       0.00      0.50      0.00         4

    accuracy                           0.01     74286
   macro avg       0.03      0.03      0.01     74286
weighted avg       0.05      0.01      0.01     74286

Evaluation of Logistic Regression model:

Code
print_classification_report_for_subgrades(models_subgrades, "Logistic Regression")
--- Train (Logistic Regression) ---
ROC AUC: 0.8560 (one-versus-rest macro average)

              precision    recall  f1-score   support

          A1       0.46      0.76      0.58     20221
          A2       0.31      0.06      0.10     16010
          A3       0.21      0.30      0.24     17685
          A4       0.21      0.25      0.23     22154
          A5       0.17      0.30      0.22     18555
          B1       0.20      0.05      0.08     19626
          B2       0.18      0.16      0.17     20671
          B3       0.11      0.20      0.14     16223
          B4       0.17      0.04      0.07     22209
          B5       0.15      0.07      0.10     20226
          C1       0.15      0.18      0.16     19874
          C2       0.12      0.03      0.05     17635
          C3       0.14      0.01      0.02     18397
          C4       0.13      0.02      0.03     16885
          C5       0.11      0.15      0.13     16003
          D1       0.09      0.10      0.09     11457
          D2       0.09      0.02      0.03     10801
          D3       0.09      0.15      0.11      9859
          D4       0.09      0.09      0.09      8630
          D5       0.11      0.17      0.14      7586
          E1       0.07      0.34      0.12      2796
          E2       0.06      0.26      0.09      2234
          E3       0.05      0.02      0.03      2663
          E4       0.12      0.07      0.08      2299
          E5       0.06      0.15      0.09      3278
          F1       0.08      0.16      0.11      1063
          F2       0.02      0.19      0.04       403
          F3       0.05      0.20      0.09       305
          F4       0.02      0.13      0.04       220
          F5       0.05      0.09      0.06       231
          G1       0.09      0.06      0.07       341
          G2       0.01      0.69      0.03        62
          G3       0.01      0.45      0.02        31
          G4       0.01      0.65      0.02        17
          G5       0.01      0.68      0.01        19

    accuracy                           0.17    346669
   macro avg       0.11      0.21      0.11    346669
weighted avg       0.17      0.17      0.15    346669

--- Validation (Logistic Regression) ---
ROC AUC: 0.8422 (one-versus-rest macro average)

              precision    recall  f1-score   support

          A1       0.46      0.75      0.57      4333
          A2       0.29      0.06      0.10      3430
          A3       0.21      0.30      0.24      3789
          A4       0.21      0.24      0.22      4747
          A5       0.17      0.30      0.22      3976
          B1       0.21      0.05      0.09      4205
          B2       0.18      0.16      0.17      4429
          B3       0.11      0.21      0.15      3476
          B4       0.16      0.04      0.06      4759
          B5       0.14      0.07      0.09      4334
          C1       0.15      0.18      0.16      4259
          C2       0.11      0.03      0.05      3779
          C3       0.13      0.01      0.02      3942
          C4       0.12      0.02      0.03      3618
          C5       0.11      0.14      0.12      3429
          D1       0.08      0.09      0.08      2455
          D2       0.08      0.02      0.03      2314
          D3       0.08      0.13      0.10      2112
          D4       0.09      0.09      0.09      1849
          D5       0.12      0.19      0.15      1625
          E1       0.08      0.36      0.13       600
          E2       0.05      0.22      0.08       479
          E3       0.04      0.02      0.02       571
          E4       0.10      0.05      0.07       493
          E5       0.05      0.14      0.08       703
          F1       0.04      0.07      0.05       228
          F2       0.02      0.15      0.03        87
          F3       0.05      0.18      0.07        66
          F4       0.03      0.17      0.06        48
          F5       0.02      0.04      0.03        50
          G1       0.10      0.07      0.08        73
          G2       0.01      0.38      0.01        13
          G3       0.01      0.29      0.01         7
          G4       0.00      0.00      0.00         4
          G5       0.00      0.25      0.00         4

    accuracy                           0.16     74286
   macro avg       0.11      0.16      0.10     74286
weighted avg       0.17      0.16      0.14     74286

Evaluation of LGBM model:

Code
print_classification_report_for_subgrades(models_subgrades, "LGBM")
--- Train (LGBM) ---
ROC AUC: 0.9933 (one-versus-rest macro average)

              precision    recall  f1-score   support

          A1       0.86      0.91      0.89     20221
          A2       0.72      0.83      0.77     16010
          A3       0.75      0.79      0.77     17685
          A4       0.82      0.72      0.77     22154
          A5       0.74      0.85      0.79     18555
          B1       0.78      0.74      0.76     19626
          B2       0.77      0.70      0.73     20671
          B3       0.65      0.75      0.70     16223
          B4       0.80      0.75      0.78     22209
          B5       0.83      0.78      0.81     20226
          C1       0.82      0.80      0.81     19874
          C2       0.78      0.75      0.77     17635
          C3       0.80      0.78      0.79     18397
          C4       0.83      0.78      0.80     16885
          C5       0.85      0.80      0.83     16003
          D1       0.83      0.84      0.83     11457
          D2       0.84      0.79      0.82     10801
          D3       0.86      0.81      0.83      9859
          D4       0.88      0.82      0.85      8630
          D5       0.87      0.89      0.88      7586
          E1       0.55      0.95      0.70      2796
          E2       0.65      0.94      0.77      2234
          E3       0.89      0.95      0.91      2663
          E4       0.87      0.95      0.91      2299
          E5       0.93      0.95      0.94      3278
          F1       0.95      1.00      0.97      1063
          F2       0.91      1.00      0.95       403
          F3       0.97      1.00      0.99       305
          F4       0.96      1.00      0.98       220
          F5       0.99      1.00      1.00       231
          G1       0.97      1.00      0.98       341
          G2       1.00      1.00      1.00        62
          G3       1.00      1.00      1.00        31
          G4       1.00      1.00      1.00        17
          G5       1.00      1.00      1.00        19

    accuracy                           0.80    346669
   macro avg       0.85      0.87      0.86    346669
weighted avg       0.80      0.80      0.80    346669

--- Validation (LGBM) ---
ROC AUC: 0.9668 (one-versus-rest macro average)

              precision    recall  f1-score   support

          A1       0.84      0.89      0.86      4333
          A2       0.67      0.78      0.72      3430
          A3       0.70      0.75      0.73      3789
          A4       0.79      0.68      0.73      4747
          A5       0.71      0.82      0.76      3976
          B1       0.74      0.69      0.72      4205
          B2       0.72      0.66      0.69      4429
          B3       0.60      0.69      0.64      3476
          B4       0.76      0.73      0.74      4759
          B5       0.79      0.75      0.77      4334
          C1       0.78      0.75      0.77      4259
          C2       0.74      0.70      0.72      3779
          C3       0.74      0.74      0.74      3942
          C4       0.79      0.74      0.76      3618
          C5       0.82      0.74      0.78      3429
          D1       0.78      0.78      0.78      2455
          D2       0.78      0.70      0.73      2314
          D3       0.76      0.71      0.73      2112
          D4       0.79      0.73      0.76      1849
          D5       0.75      0.80      0.78      1625
          E1       0.37      0.69      0.48       600
          E2       0.37      0.60      0.46       479
          E3       0.62      0.60      0.61       571
          E4       0.58      0.59      0.58       493
          E5       0.69      0.71      0.70       703
          F1       0.65      0.85      0.74       228
          F2       0.49      0.54      0.52        87
          F3       0.53      0.48      0.51        66
          F4       0.29      0.17      0.21        48
          F5       0.21      0.08      0.12        50
          G1       0.39      0.53      0.45        73
          G2       0.00      0.00      0.00        13
          G3       0.00      0.00      0.00         7
          G4       0.00      0.00      0.00         4
          G5       0.00      0.00      0.00         4

    accuracy                           0.74     74286
   macro avg       0.58      0.59      0.58     74286
weighted avg       0.74      0.74      0.74     74286

The best-performing model is LGBM. It will be explored in more detail.

Code
y_pred_subgrades_validation_lgbm = models_subgrades["LGBM"].predict(X_validation)
Code
ml.plot_confusion_matrices(
    y_validation_subgrades,
    y_pred_subgrades_validation_lgbm,
    figsize=(15, 30),
    text_kw={"size": 5},
    layout="vertical",
);

It seems that the highest confusion is between the lowest sub-grades, F4 to G5. For the lowest sub-grades, from G2 to G5, there were no correct predictions at all.

4.3.3 Feature Importance

The internal LGBM feature-importance method as well as SHAP values were used to evaluate feature importance. This time, it was decided to keep the 27 features with the highest SHAP values.

Please find the details below.

Code
@my.cache_results(dir_interim + "task-2-subgrades--shap_lgbm_k=all.pkl")
def get_shap_values_subgrades_lgbm():
    model = "LGBM"
    preproc = Pipeline(steps=models_subgrades[model].steps[:-1])
    classifier = models_subgrades[model]["classifier"]
    X_validation_preproc = preproc.transform(X_validation)

    tree_explainer = shap.TreeExplainer(classifier)
    shap_values = tree_explainer.shap_values(X_validation_preproc)

    return shap_values, X_validation_preproc


shap_values_lgbm_subgrades, data_for_lgbm_subgrades = get_shap_values_subgrades_lgbm()
# Time: 18m 30.0s
Code
lgb.plot_importance(
    models_subgrades["LGBM"]["classifier"],
    max_num_features=50,
    figsize=(10, 10),
    height=0.8,
    title="LGBM Feature Importance (Sub-Grade Prediction)",
);

Code
vals_subgrades = np.abs(shap_values_lgbm_subgrades).mean(0).mean(0)
feature_importance_subgrades = (
    pd.DataFrame(
        list(zip(data_for_lgbm_subgrades.columns, vals_subgrades)),
        columns=["col_name", "importance"],
    )
    .sort_values(by=["importance"], ascending=False)
    .reset_index()
)
Code
feature_importance_subgrades.head(30).style.format(precision=4)
  index col_name importance
0 22 total_rec_int 0.6425
1 21 total_rec_prncp 0.3997
2 3 installment 0.3133
3 2 funded_amnt_inv 0.1600
4 0 loan_amnt 0.1322
5 1 funded_amnt 0.1273
6 20 total_pymnt_inv 0.1253
7 126 disbursement_method_Cash 0.1206
8 120 initial_list_status_f 0.1086
9 18 out_prncp_inv 0.1041
10 8 fico_range_low 0.1038
11 19 total_pymnt 0.1011
12 9 fico_range_high 0.0967
13 82 issue_month 0.0951
14 96 term_ 36 months 0.0949
15 17 out_prncp 0.0908
16 97 term_ 60 months 0.0802
17 26 last_pymnt_amnt 0.0652
18 6 dti 0.0626
19 49 bc_open_to_buy 0.0625
20 28 last_fico_range_low 0.0557
21 27 last_fico_range_high 0.0492
22 42 all_util 0.0378
23 74 percent_bc_gt_75 0.0370
24 81 title_len 0.0358
25 5 annual_inc 0.0297
26 57 mort_acc 0.0237
27 36 mths_since_rcnt_il 0.0227
28 43 total_rev_hi_lim 0.0222
29 58 mths_since_recent_bc 0.0174
Code
shap.summary_plot(
    shap_values_lgbm_subgrades, data_for_lgbm_subgrades, plot_type="bar", max_display=30
)

4.3.4 Re-Train Model on the Most Important Features

By using 27 features instead of 140, the classification performance slightly increased: ROC AUC (validation) rose from 0.9668 to 0.9688 and F1 from 0.58 to 0.60. Still, the issues with the lowest sub-grades persist.

Find the details below.

def fit_lgbm_subgrades_with_selection(features):
    """Template to fit a LGBM model with a smaller number of features."""
    pipeline = Pipeline(
        steps=[
            ("preprocessor", clone(pre_processing_trees)),
            ("selector", ColumnSelector(features)),
            (
                "classifier",
                LGBMClassifier(
                    random_state=1,
                    class_weight="balanced",
                    objective="multiclass",
                    n_jobs=-1,
                    device="gpu",
                    verbosity=1,
                ),
            ),
        ]
    )
    pipeline.fit(X_train, y_train_subgrades)
    return pipeline


@my.cache_results(dir_interim + "models_3_03_lgbm-27-features.pickle")
def fit_lgbm_subgrades_27():
    features = feature_importance_subgrades.head(27).col_name.to_list()
    return fit_lgbm_subgrades_with_selection(features)
Code
models_subgrades["LGBM (27 features)"] = fit_lgbm_subgrades_27()
# Time: 2m 5.0 s
Code
print_classification_report_for_subgrades(models_subgrades, "LGBM (27 features)")
--- Train (LGBM (27 features)) ---
ROC AUC: 0.9934 (one-versus-rest macro average)

              precision    recall  f1-score   support

          A1       0.88      0.92      0.90     20221
          A2       0.74      0.84      0.79     16010
          A3       0.77      0.81      0.79     17685
          A4       0.82      0.74      0.78     22154
          A5       0.76      0.86      0.81     18555
          B1       0.78      0.74      0.76     19626
          B2       0.77      0.72      0.75     20671
          B3       0.65      0.77      0.70     16223
          B4       0.82      0.75      0.78     22209
          B5       0.84      0.78      0.81     20226
          C1       0.81      0.80      0.80     19874
          C2       0.78      0.75      0.77     17635
          C3       0.80      0.78      0.79     18397
          C4       0.83      0.80      0.82     16885
          C5       0.86      0.80      0.83     16003
          D1       0.85      0.83      0.84     11457
          D2       0.86      0.81      0.83     10801
          D3       0.85      0.81      0.83      9859
          D4       0.86      0.81      0.84      8630
          D5       0.87      0.88      0.88      7586
          E1       0.53      0.92      0.67      2796
          E2       0.64      0.91      0.75      2234
          E3       0.87      0.90      0.88      2663
          E4       0.87      0.92      0.89      2299
          E5       0.93      0.92      0.93      3278
          F1       0.93      1.00      0.96      1063
          F2       0.86      1.00      0.92       403
          F3       0.95      1.00      0.97       305
          F4       0.92      1.00      0.96       220
          F5       0.92      1.00      0.96       231
          G1       0.93      1.00      0.97       341
          G2       1.00      1.00      1.00        62
          G3       1.00      1.00      1.00        31
          G4       1.00      1.00      1.00        17
          G5       1.00      1.00      1.00        19

    accuracy                           0.80    346669
   macro avg       0.84      0.87      0.86    346669
weighted avg       0.80      0.80      0.80    346669

--- Validation (LGBM (27 features)) ---
ROC AUC: 0.9688 (one-versus-rest macro average)

              precision    recall  f1-score   support

          A1       0.86      0.90      0.88      4333
          A2       0.70      0.81      0.75      3430
          A3       0.75      0.79      0.77      3789
          A4       0.79      0.71      0.74      4747
          A5       0.74      0.84      0.78      3976
          B1       0.76      0.72      0.74      4205
          B2       0.73      0.68      0.71      4429
          B3       0.60      0.71      0.65      3476
          B4       0.79      0.73      0.76      4759
          B5       0.80      0.75      0.78      4334
          C1       0.78      0.76      0.77      4259
          C2       0.74      0.72      0.73      3779
          C3       0.76      0.74      0.75      3942
          C4       0.80      0.76      0.78      3618
          C5       0.82      0.75      0.79      3429
          D1       0.81      0.78      0.79      2455
          D2       0.81      0.73      0.77      2314
          D3       0.76      0.74      0.75      2112
          D4       0.80      0.75      0.77      1849
          D5       0.81      0.82      0.81      1625
          E1       0.40      0.72      0.52       600
          E2       0.42      0.66      0.51       479
          E3       0.64      0.64      0.64       571
          E4       0.63      0.62      0.63       493
          E5       0.74      0.71      0.73       703
          F1       0.67      0.82      0.74       228
          F2       0.39      0.53      0.45        87
          F3       0.54      0.52      0.53        66
          F4       0.38      0.31      0.34        48
          F5       0.14      0.14      0.14        50
          G1       0.37      0.52      0.43        73
          G2       0.17      0.08      0.11        13
          G3       0.00      0.00      0.00         7
          G4       0.00      0.00      0.00         4
          G5       0.00      0.00      0.00         4

    accuracy                           0.75     74286
   macro avg       0.60      0.61      0.60     74286
weighted avg       0.76      0.75      0.75     74286

Confusion matrices for the validation set:

Code
y_pred_subgrades_validation_lgbm_27 = models_subgrades["LGBM (27 features)"].predict(
    X_validation
)
ml.plot_confusion_matrices(
    y_validation_subgrades,
    y_pred_subgrades_validation_lgbm_27,
    figsize=(15, 30),
    text_kw={"size": 5},
    layout="vertical",
);

4.3.5 Evaluation on Test Set

The final model evaluation on the test set shows a ROC AUC of 0.9645 and an F1 score of 0.61.

Please find the details below.

Code
# Model input variables
subgrade_model_input = [
    "total_rec_int",
    "total_rec_prncp",
    "installment",
    "funded_amnt_inv",
    "loan_amnt",
    "funded_amnt",
    "total_pymnt_inv",
    "disbursement_method_Cash",
    "initial_list_status_f",
    "out_prncp_inv",
    "fico_range_low",
    "total_pymnt",
    "fico_range_high",
    "issue_month",
    "term_ 36 months",
    "out_prncp",
    "term_ 60 months",
    "last_pymnt_amnt",
    "dti",
    "bc_open_to_buy",
    "last_fico_range_low",
    "last_fico_range_high",
    "all_util",
    "percent_bc_gt_75",
    "title_len",
    "annual_inc",
    "mort_acc",
]
Code
X_train_validation = pd.concat([X_train, X_validation], axis="index")
y_train_validation_subgrades = pd.concat(
    [y_train_subgrades, y_validation_subgrades], axis="index"
)


@my.cache_results(dir_interim + "models_3_03_lgbm--final-evaluation.pickle")
def fit_lgbm_subgrades_final_evaluation():
    """Fit a LGBM model for final testing."""
    pipeline = Pipeline(
        steps=[
            ("preprocessor", clone(pre_processing_trees)),
            ("selector", ColumnSelector(subgrade_model_input)),
            (
                "classifier",
                LGBMClassifier(
                    random_state=1,
                    class_weight="balanced",
                    objective="multiclass",
                    n_jobs=-1,
                    device="gpu",
                    verbosity=1,
                ),
            ),
        ]
    )
    pipeline.fit(X_train_validation, y_train_validation_subgrades)
    return pipeline


models_subgrades["LGBM (final evaluation)"] = fit_lgbm_subgrades_final_evaluation()
Code
ml.print_classification_report(
    models=models_subgrades,
    model_name="LGBM (final evaluation)",
    X_train=X_train_validation,
    X_test=X_test,
    y_train=y_train_validation_subgrades,
    y_test=y_test_subgrades,
    label_train_set="Train + Validation",
    label_test_set="Test",
    label_model="Final evaluation",
)
--- Train + Validation (Final evaluation) ---
ROC AUC: 0.9932 (one-versus-rest macro average)

              precision    recall  f1-score   support

          A1       0.87      0.92      0.89     24554
          A2       0.73      0.83      0.77     19440
          A3       0.76      0.80      0.78     21474
          A4       0.82      0.74      0.78     26901
          A5       0.74      0.84      0.79     22531
          B1       0.79      0.74      0.76     23831
          B2       0.74      0.72      0.73     25100
          B3       0.64      0.76      0.69     19699
          B4       0.83      0.74      0.79     26968
          B5       0.83      0.78      0.80     24560
          C1       0.83      0.77      0.80     24133
          C2       0.75      0.75      0.75     21414
          C3       0.79      0.76      0.77     22339
          C4       0.84      0.78      0.81     20503
          C5       0.85      0.82      0.84     19432
          D1       0.84      0.83      0.83     13912
          D2       0.85      0.82      0.83     13115
          D3       0.85      0.80      0.82     11971
          D4       0.86      0.82      0.84     10479
          D5       0.87      0.87      0.87      9211
          E1       0.52      0.90      0.66      3396
          E2       0.62      0.90      0.73      2713
          E3       0.87      0.88      0.88      3234
          E4       0.85      0.89      0.87      2792
          E5       0.93      0.91      0.92      3981
          F1       0.92      1.00      0.96      1291
          F2       0.87      1.00      0.93       490
          F3       0.93      1.00      0.96       371
          F4       0.90      1.00      0.95       268
          F5       0.90      1.00      0.95       281
          G1       0.92      1.00      0.96       414
          G2       0.96      1.00      0.98        75
          G3       1.00      1.00      1.00        38
          G4       1.00      1.00      1.00        21
          G5       1.00      1.00      1.00        23

    accuracy                           0.79    420955
   macro avg       0.84      0.87      0.85    420955
weighted avg       0.80      0.79      0.79    420955


--- Test (Final evaluation) ---
ROC AUC: 0.9645 (one-versus-rest macro average)

              precision    recall  f1-score   support

          A1       0.86      0.89      0.88      4333
          A2       0.69      0.81      0.75      3431
          A3       0.73      0.77      0.75      3790
          A4       0.80      0.73      0.76      4747
          A5       0.73      0.81      0.77      3976
          B1       0.76      0.71      0.74      4206
          B2       0.71      0.69      0.70      4430
          B3       0.60      0.71      0.65      3477
          B4       0.80      0.72      0.76      4760
          B5       0.80      0.76      0.78      4334
          C1       0.81      0.75      0.78      4259
          C2       0.72      0.73      0.73      3779
          C3       0.75      0.73      0.74      3943
          C4       0.82      0.75      0.78      3618
          C5       0.83      0.79      0.81      3430
          D1       0.80      0.78      0.79      2455
          D2       0.78      0.77      0.77      2315
          D3       0.80      0.73      0.77      2113
          D4       0.81      0.74      0.77      1849
          D5       0.80      0.81      0.81      1626
          E1       0.40      0.73      0.52       599
          E2       0.42      0.63      0.50       479
          E3       0.67      0.64      0.65       570
          E4       0.61      0.62      0.61       492
          E5       0.78      0.75      0.77       702
          F1       0.73      0.85      0.78       227
          F2       0.41      0.43      0.42        86
          F3       0.54      0.62      0.58        65
          F4       0.42      0.38      0.40        47
          F5       0.26      0.33      0.29        49
          G1       0.34      0.48      0.40        73
          G2       0.25      0.15      0.19        13
          G3       0.00      0.00      0.00         6
          G4       0.00      0.00      0.00         4
          G5       0.00      0.00      0.00         4

    accuracy                           0.75     74287
   macro avg       0.61      0.62      0.61     74287
weighted avg       0.76      0.75      0.75     74287

Confusion matrices for the test set:

Code
y_pred_test = models_subgrades["LGBM (final evaluation)"].predict(X_test)

sns.set_style("white")
ml.plot_confusion_matrices(y_test_subgrades, y_pred_test, figsize=(15, 4), text_kw={"size": 9});

4.3.6 Final Model for Deployment

In this section, the final pre-processing and prediction pipeline will be created and trained on the whole dataset. The model will be saved into a pickle file and deployed on Google Cloud Platform (GCP).
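The deployment step relies on a pickle round-trip: the fitted pipeline is serialized once, and the API only has to unpickle the file and call `.predict()` on incoming rows. A minimal sketch, with a `DummyClassifier` standing in for the real LGBM pipeline and its custom pre-processing steps:

```python
import pickle

import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Tiny stand-in training set and pipeline.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([0, 0, 0, 1])

pipe = Pipeline(
    steps=[
        ("scaler", StandardScaler()),
        ("classifier", DummyClassifier(strategy="most_frequent")),
    ]
).fit(X, y)

# Serialize the whole pipeline (pre-processing included), then restore it
# as the deployed service would and predict on new rows.
blob = pickle.dumps(pipe)
restored = pickle.loads(blob)
preds = restored.predict(X)
```

Because the pipeline bundles pre-processing with the classifier, the service receives raw feature values and never has to reimplement the transformations.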

Code
# Pipeline input variables
subgrade_pipeline_input = [
    "total_rec_int",
    "total_rec_prncp",
    "installment",
    "funded_amnt_inv",
    "loan_amnt",
    "funded_amnt",
    "total_pymnt_inv",
    "disbursement_method",
    "initial_list_status",
    "out_prncp_inv",
    "fico_range_low",
    "total_pymnt",
    "fico_range_high",
    "issue_d",
    "term",
    "out_prncp",
    "last_pymnt_amnt",
    "dti",
    "bc_open_to_buy",
    "last_fico_range_low",
    "last_fico_range_high",
    "all_util",
    "percent_bc_gt_75",
    "title",
    "annual_inc",
    "mort_acc",
]


@my.cache_results(dir_interim + "03--model_predict_subgrade.pkl")
def fit_lgbm_subgrades_final():
    """Final LGBM model for sub-grades prediction."""
    pipeline = Pipeline(
        steps=[
            ("selector_1", ColumnSelector(subgrade_pipeline_input)),
            ("preprocessor_1", PreprocessorForSubgrades()),
            ("preprocessor_2", clone(pre_processing_trees)),
            ("selector_2", ColumnSelector(subgrade_model_input)),
            (
                "classifier",
                LGBMClassifier(
                    random_state=1,
                    class_weight="balanced",
                    objective="multiclass",
                    n_jobs=-1,
                    device="gpu",
                    verbosity=1,
                ),
            ),
        ]
    )

    # Load required dataset
    data_file_path = dir_interim + "task-2--1-accepted_loans_2018--raw.feather"
    accepted_2018 = pd.read_feather(data_file_path)

    pipeline.set_output(transform="pandas")
    pipeline.fit(accepted_2018, accepted_2018["sub_grade"])
    return pipeline


# Fit the model
loan_subgrade_predictor_final = fit_lgbm_subgrades_final()
loan_subgrade_predictor_final

# Time: 5m 25.1s
Pipeline(steps=[('selector_1',
                 ColumnSelector(keep=['total_rec_int', 'total_rec_prncp',
                                      'installment', 'funded_amnt_inv',
                                      'loan_amnt', 'funded_amnt',
                                      'total_pymnt_inv', 'disbursement_method',
                                      'initial_list_status', 'out_prncp_inv',
                                      'fico_range_low', 'total_pymnt',
                                      'fico_range_high', 'issue_d', 'term',
                                      'out_prncp', 'last_pymnt_amnt', 'dti',
                                      'bc_open_to_buy', 'last_fico...
                                      'fico_range_high', 'issue_month',
                                      'term_ 36 months', 'out_prncp',
                                      'term_ 60 months', 'last_pymnt_amnt',
                                      'dti', 'bc_open_to_buy',
                                      'last_fico_range_low',
                                      'last_fico_range_high', 'all_util',
                                      'percent_bc_gt_75', 'title_len',
                                      'annual_inc', 'mort_acc'])),
                ('classifier',
                 LGBMClassifier(class_weight='balanced', device='gpu',
                                n_jobs=-1, objective='multiclass',
                                random_state=1, verbosity=1))])
Code
data_file_path = dir_interim + "task-2--1-accepted_loans_2018--raw.feather"
accepted_2018 = pd.read_feather(data_file_path)

accepted_2018[subgrade_pipeline_input].iloc[[23486, 4, 300717]].to_dict(orient="list")
{'total_rec_int': [728.280029296875, 11.579999923706055, 1397.8199462890625],
 'total_rec_prncp': [2723.12, 3000.0, 1399.58],
 'installment': [288.010009765625, 93.0999984741211, 705.3800048828125],
 'funded_amnt_inv': [9000.0, 3000.0, 30000.0],
 'loan_amnt': [9000.0, 3000.0, 30000.0],
 'funded_amnt': [9000.0, 3000.0, 30000.0],
 'total_pymnt_inv': [3451.4, 3011.58, 2797.4],
 'disbursement_method': ['Cash', 'Cash', 'Cash'],
 'initial_list_status': ['w', 'w', 'w'],
 'out_prncp_inv': [6276.88, 0.0, 28600.42],
 'fico_range_low': [715.0, 760.0, 715.0],
 'total_pymnt': [3451.4, 3011.5772850636, 2797.4],
 'fico_range_high': [719.0, 764.0, 719.0],
 'issue_d': ['Mar-2018', 'Mar-2018', 'Nov-2018'],
 'term': [' 36 months', ' 36 months', ' 60 months'],
 'out_prncp': [6276.88, 0.0, 28600.42],
 'last_pymnt_amnt': [288.01, 614.03, 705.38],
 'dti': [14.510000228881836, 0.5799999833106995, 32.220001220703125],
 'bc_open_to_buy': [455.0, 30359.0, 17345.0],
 'last_fico_range_low': [805.0, 760.0, 755.0],
 'last_fico_range_high': [809.0, 764.0, 759.0],
 'all_util': [74.0, 1.0, 89.0],
 'percent_bc_gt_75': [100.0, 0.0, 25.0],
 'title': ['Credit card refinancing', 'Major purchase', 'Debt consolidation'],
 'annual_inc': [67000.0, 52000.0, 60000.0],
 'mort_acc': [0.0, 4.0, 0.0]}

4.4 Interest Rate Prediction

This section focuses on predicting loan interest rates.

4.4.1 Train Models

Here, four models will be trained:

  • linear regression;
  • LASSO regression;
  • Ridge regression;
  • LGBM (regular gradient boosting).
Code
# Dictionary to collect the results
models_int_rate = {}
Code
@my.cache_results(dir_interim + "models_4_01_linear_regression_sgd.pickle")
def fit_lr_sgd_int_rate():
    """Fit a Linear Regression model (via SGD training)."""
    pipeline = Pipeline(
        steps=[
            ("preprocessor", clone(pre_processing_lr)),
            (
                "regressor",
                SGDRegressor(
                    random_state=1,
                    loss="squared_error",  # Use 'squared_error' for regression
                ),
            ),
        ]
    )
    pipeline.fit(X_train, y_train_int_rate)
    return pipeline


models_int_rate["Linear Regression"] = fit_lr_sgd_int_rate()
Code
# Function for LASSO Regression
@my.cache_results(dir_interim + "models_4_02_lasso_regression_sgd.pickle")
def fit_lasso_regression():
    """Fit a LASSO Regression model (via SGD training)."""
    pipeline = Pipeline(
        steps=[
            ("preprocessor", clone(pre_processing_lr)),
            (
                "regressor",
                SGDRegressor(
                    random_state=1,
                    penalty="l1",  # L1 regularization for LASSO
                    alpha=1.0,
                    loss="squared_error",
                ),
            ),
        ]
    )
    pipeline.fit(X_train, y_train_int_rate)
    return pipeline


models_int_rate["LASSO Regression"] = fit_lasso_regression()
Code
# Function for Ridge Regression
@my.cache_results(dir_interim + "models_4_03_ridge_regression_sgd.pickle")
def fit_ridge_regression():
    """Fit a Ridge Regression model (via SGD training)."""
    pipeline = Pipeline(
        steps=[
            ("preprocessor", clone(pre_processing_lr)),
            (
                "regressor",
                SGDRegressor(
                    random_state=1,
                    penalty="l2",  # L2 regularization for Ridge
                    alpha=1.0,
                    loss="squared_error",
                ),
            ),
        ]
    )
    pipeline.fit(X_train, y_train_int_rate)
    return pipeline


models_int_rate["Ridge Regression"] = fit_ridge_regression()
Code
@my.cache_results(dir_interim + "models_4_04_lgbm_regression.pickle")
def fit_lgbm_int_rate_regression():
    """Fit a LGBM regression model."""
    pipeline = Pipeline(
        steps=[
            ("preprocessor", clone(pre_processing_trees)),
            (
                "regressor",
                LGBMRegressor(
                    random_state=1,
                    objective="regression",
                    n_jobs=-1,
                    device="gpu",
                    verbosity=1,
                ),
            ),
        ]
    )
    pipeline.fit(X_train, y_train_int_rate)
    return pipeline


models_int_rate["LGBM"] = fit_lgbm_int_rate_regression()

4.4.2 Evaluation

The LGBM model outperformed the remaining models with RMSE = 1.416 (while SD = 5.153) and MAE = 0.897 (while the median absolute deviation, MedianAD, is 3.610).

For linear regression, the errors are extremely large and R² is hugely negative. A negative R² is not a calculation error: it means the model performs worse than simply predicting the mean, which here indicates that the plain SGD-trained linear regression diverged.
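These baseline comparisons can be illustrated with synthetic data (made-up numbers, not the loan data): R² scores a model against the always-predict-the-mean baseline, which itself scores exactly 0 and whose RMSE equals the target's SD, the quantity reported next to RMSE in the tables below.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(1)
y_true = rng.normal(loc=13.0, scale=5.0, size=1000)  # int_rate-like target

# A "model" that diverged: huge constant predictions.
y_pred_bad = np.full_like(y_true, 1e6)

# The naive baseline that always predicts the mean.
y_pred_mean = np.full_like(y_true, y_true.mean())

# R² of the baseline is 0; anything worse goes negative (unboundedly so).
r2_bad = r2_score(y_true, y_pred_bad)
r2_mean = r2_score(y_true, y_pred_mean)

# RMSE of the mean baseline equals the SD of the target.
rmse_mean = mean_squared_error(y_true, y_pred_mean) ** 0.5
```

So an SD_RMSE_ratio above 1 means the model beats the trivial mean predictor, and the further above 1, the better.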

Code
print("--- Train ---")
ml.regression_scores(
    models_int_rate, X_train, y_train_int_rate, sort_by="RMSE", color="orange"
)
--- Train ---
  n SD RMSE MedianAD MAE R² SD_RMSE_ratio MedianAD_MAE_ratio
LGBM 346669 5.149 1.382 3.610 0.881 0.928 3.727 4.096
Ridge Regression 346669 5.149 3.682 3.610 2.811 0.489 1.398 1.284
LASSO Regression 346669 5.149 4.329 3.610 3.401 0.293 1.189 1.061
Linear Regression 346669 5.149 53939835148.674 3.610 1450519710.172 -109724639462755778560.000 0.000 0.000
Code
print("--- Validation ---")
ml.regression_scores(
    models_int_rate, X_validation, y_validation_int_rate, sort_by="RMSE"
)
--- Validation ---
  n SD RMSE MedianAD MAE R² SD_RMSE_ratio MedianAD_MAE_ratio
LGBM 74286 5.153 1.416 3.610 0.897 0.925 3.640 4.024
Ridge Regression 74286 5.153 3.692 3.610 2.822 0.487 1.396 1.279
LASSO Regression 74286 5.153 4.338 3.610 3.410 0.291 1.188 1.059
Linear Regression 74286 5.153 34670485474.882 3.610 1397356309.915 -45272654153225068544.000 0.000 0.000

4.4.3 Feature Importance

Code
@my.cache_results(dir_interim + "task-2-int_rate--shap_lgbm_k=all.pkl")
def get_shap_values_int_rate_lgbm():
    model = "LGBM"
    preproc = Pipeline(steps=models_int_rate[model].steps[:-1])
    regressor = models_int_rate[model]["regressor"]
    X_validation_preproc = preproc.transform(X_validation)

    tree_explainer = shap.TreeExplainer(regressor)
    shap_values = tree_explainer.shap_values(X_validation_preproc)

    return shap_values, X_validation_preproc


shap_values_lgbm_int_rate, data_for_lgbm_int_rate = get_shap_values_int_rate_lgbm()
# Time: 21.0s
Code
sns.set_style("white")
lgb.plot_importance(
    models_int_rate["LGBM"]["regressor"],
    max_num_features=50,
    figsize=(10, 10),
    height=0.8,
    title="LGBM Feature Importance (Interest Rate Prediction)",
);

Code
shap.summary_plot(
    shap_values_lgbm_int_rate, data_for_lgbm_int_rate, plot_type="bar", max_display=50
)

Code
shap.summary_plot(shap_values_lgbm_int_rate, data_for_lgbm_int_rate, max_display=30)

Code
vals_int_rate = np.abs(shap_values_lgbm_int_rate).mean(0)
feature_importance_int_rate = (
    pd.DataFrame(
        list(zip(data_for_lgbm_int_rate.columns, vals_int_rate)),
        columns=["col_name", "importance"],
    )
    .sort_values(by=["importance"], ascending=False)
    .reset_index()
)
Code
feature_importance_int_rate.head(40).style.format(precision=3)
  index col_name importance
0 22 total_rec_int 2.881
1 21 total_rec_prncp 1.945
2 82 issue_month 0.501
3 9 fico_range_high 0.468
4 3 installment 0.463
5 17 out_prncp 0.452
6 1 funded_amnt 0.348
7 126 disbursement_method_Cash 0.340
8 0 loan_amnt 0.307
9 49 bc_open_to_buy 0.269
10 27 last_fico_range_high 0.244
11 18 out_prncp_inv 0.209
12 97 term_ 60 months 0.207
13 2 funded_amnt_inv 0.193
14 96 term_ 36 months 0.186
15 6 dti 0.178
16 8 fico_range_low 0.173
17 57 mort_acc 0.099
18 26 last_pymnt_amnt 0.093
19 74 percent_bc_gt_75 0.092
20 28 last_fico_range_low 0.083
21 5 annual_inc 0.081
22 10 inq_last_6mths 0.070
23 36 mths_since_rcnt_il 0.065
24 42 all_util 0.055
25 43 total_rev_hi_lim 0.042
26 120 initial_list_status_f 0.042
27 81 title_len 0.037
28 108 purpose_credit_card 0.032
29 58 mths_since_recent_bc 0.031
30 38 il_util 0.030
31 72 num_tl_op_past_12m 0.027
32 11 mths_since_last_delinq 0.023
33 54 mo_sin_old_rev_tl_op 0.021
34 104 verification_status_Verified 0.020
35 13 pub_rec 0.018
36 20 total_pymnt_inv 0.016
37 19 total_pymnt 0.016
38 121 initial_list_status_w 0.015
39 127 disbursement_method_DirectPay 0.015

4.4.4 Re-Train Models on the Most Important Features

The LGBM model was retrained on smaller feature sets selected at various SHAP-importance thresholds. A threshold of 0.10 resulted in 17 features, which performed best (validation RMSE = 1.364, MAE = 0.850).

Please find the details below.

Code
# Function with the analysis template
def fit_lgbm_int_rate_on_certain_features(features):
    """Template to fit LGBM regression model on certain features."""
    pipeline = Pipeline(
        steps=[
            ("preprocessor", clone(pre_processing_trees)),
            ("selector", ColumnSelector(features)),
            (
                "regressor",
                LGBMRegressor(
                    random_state=1,
                    objective="regression",
                    n_jobs=-1,
                    device="gpu",
                    verbosity=1,
                ),
            ),
        ]
    )
    pipeline.fit(X_train, y_train_int_rate)
    return pipeline


def fit_lgbm_int_rate_by_shap(importance_threshold):
    """Function for feature selection based on SHAP values"""
    features = feature_importance_int_rate.query(
        f"importance > {importance_threshold}"
    ).col_name.to_list()

    k = len(features)

    return f"LGBM ({k} features)", fit_lgbm_int_rate_on_certain_features(features)
Code
for threshold in [0.02, 0.03, 0.05, 0.08, 0.10, 0.30, 0.40, 1.00]:
    model_name, model = fit_lgbm_int_rate_by_shap(threshold)
    models_int_rate[model_name] = model
# Time: 2m 52.0s
Code
print("--- Train ---")
ml.regression_scores(
    models_int_rate,
    X_train,
    y_train_int_rate,
    sort_by="RMSE",
    color="orange",
    incorrect_as_nan=True,
)
--- Train ---
  n SD RMSE MedianAD MAE R² SD_RMSE_ratio MedianAD_MAE_ratio
LGBM (17 features) 346669 5.149 1.333 3.610 0.835 0.933 3.862 4.323
LGBM (22 features) 346669 5.149 1.341 3.610 0.846 0.932 3.839 4.269
LGBM (25 features) 346669 5.149 1.344 3.610 0.851 0.932 3.831 4.243
LGBM (30 features) 346669 5.149 1.362 3.610 0.867 0.930 3.781 4.165
LGBM (34 features) 346669 5.149 1.369 3.610 0.872 0.929 3.762 4.140
LGBM 346669 5.149 1.382 3.610 0.881 0.928 3.727 4.096
LGBM (9 features) 346669 5.149 1.479 3.610 0.859 0.917 3.481 4.204
LGBM (6 features) 346669 5.149 1.595 3.610 0.894 0.904 3.228 4.039
LGBM (2 features) 346669 5.149 2.809 3.610 1.826 0.702 1.833 1.977
Ridge Regression 346669 5.149 3.682 3.610 2.811 0.489 1.398 1.284
LASSO Regression 346669 5.149 4.329 3.610 3.401 0.293 1.189 1.061
Linear Regression 346669 5.149 53939835148.674 3.610 1450519710.172 nan 0.000 0.000
Code
print("--- Validation ---")
ml.regression_scores(
    models_int_rate,
    X_validation,
    y_validation_int_rate,
    sort_by="RMSE",
    incorrect_as_nan=True,
)
--- Validation ---
  n SD RMSE MedianAD MAE R² SD_RMSE_ratio MedianAD_MAE_ratio
LGBM (17 features) 74286 5.153 1.364 3.610 0.850 0.930 3.778 4.246
LGBM (22 features) 74286 5.153 1.371 3.610 0.859 0.929 3.758 4.201
LGBM (25 features) 74286 5.153 1.373 3.610 0.865 0.929 3.754 4.173
LGBM (30 features) 74286 5.153 1.393 3.610 0.881 0.927 3.700 4.099
LGBM (34 features) 74286 5.153 1.402 3.610 0.888 0.926 3.676 4.067
LGBM 74286 5.153 1.416 3.610 0.897 0.925 3.640 4.024
LGBM (9 features) 74286 5.153 1.511 3.610 0.868 0.914 3.410 4.159
LGBM (6 features) 74286 5.153 1.623 3.610 0.906 0.901 3.174 3.983
LGBM (2 features) 74286 5.153 2.819 3.610 1.824 0.701 1.828 1.979
Ridge Regression 74286 5.153 3.692 3.610 2.822 0.487 1.396 1.279
LASSO Regression 74286 5.153 4.338 3.610 3.410 0.291 1.188 1.059
Linear Regression 74286 5.153 34670485474.882 3.610 1397356309.915 nan 0.000 0.000

4.4.5 Evaluation on Test Set

On the test set, RMSE = 1.347 and MAE = 0.838, slightly better than on the validation set, which might be a consequence of the final model being trained on a larger dataset (train + validation).

Please find the details below.

Code
# Model input variables
int_rate_model_input = [
    "total_rec_int",
    "total_rec_prncp",
    "issue_month",
    "fico_range_high",
    "installment",
    "out_prncp",
    "funded_amnt",
    "disbursement_method_Cash",
    "loan_amnt",
    "bc_open_to_buy",
    "last_fico_range_high",
    "out_prncp_inv",
    "term_ 60 months",
    "funded_amnt_inv",
    "term_ 36 months",
    "dti",
    "fico_range_low",
]
Code
X_train_validation = pd.concat([X_train, X_validation], axis="index")
y_train_validation_int_rate = pd.concat(
    [y_train_int_rate, y_validation_int_rate], axis="index"
)


@my.cache_results(dir_interim + "models_4_03_lgbm--final-evaluation.pickle")
def fit_lgbm_int_rate_final_evaluation():
    """Fit a LGBM model for final testing."""
    pipeline = Pipeline(
        steps=[
            ("preprocessor", clone(pre_processing_trees)),
            ("selector", ColumnSelector(int_rate_model_input)),
            (
                "regressor",
                LGBMRegressor(
                    random_state=1,
                    objective="regression",
                    n_jobs=-1,
                    device="gpu",
                    verbosity=1,
                ),
            ),
        ]
    )
    pipeline.fit(X_train_validation, y_train_validation_int_rate)
    return pipeline


model_int_rate_final = {}
model_int_rate_final["LGBM (final evaluation)"] = fit_lgbm_int_rate_final_evaluation()
Code
print("--- Train + Validation ---")
display(
    ml.regression_scores(
        model_int_rate_final,
        X_train_validation,
        y_train_validation_int_rate,
        sort_by="RMSE",
        color="orange",
    )
)
--- Train + Validation ---
  n SD RMSE MedianAD MAE R² SD_RMSE_ratio MedianAD_MAE_ratio
LGBM (final evaluation) 420955 5.151 1.332 3.610 0.831 0.933 3.868 4.346
Code
print("--- Test ---")
display(
    ml.regression_scores(model_int_rate_final, X_test, y_test_int_rate, sort_by="RMSE")
)
--- Test ---
| Model | n | SD | RMSE | MedianAD | MAE | R² | SD_RMSE_ratio | MedianAD_MAE_ratio |
|---|---|---|---|---|---|---|---|---|
| LGBM (final evaluation) | 74287 | 5.152 | 1.347 | 3.610 | 0.838 | 0.932 | 3.825 | 4.307 |
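For reference, the columns reported above can be recomputed from raw predictions. The sketch below uses an illustrative helper name (the exact definitions inside `ml.regression_scores` may differ in detail) and assumes SD and MedianAD describe the spread of the target itself, so the two ratio columns compare those baseline spreads against the model's RMSE and MAE:

```python
import math
import statistics


def regression_summary(y_true, y_pred):
    """Illustrative re-computation of the reported metrics: SD of the
    target, RMSE, MedianAD (median absolute deviation of the target
    from its median), MAE, and the ratios of baseline spread to model
    error. A ratio above 1 means the model beats the trivial baseline."""
    n = len(y_true)
    resid = [t - p for t, p in zip(y_true, y_pred)]
    rmse = math.sqrt(sum(r * r for r in resid) / n)
    mae = sum(abs(r) for r in resid) / n
    sd = statistics.pstdev(y_true)
    med = statistics.median(y_true)
    median_ad = statistics.median(abs(t - med) for t in y_true)
    return {
        "n": n,
        "SD": sd,
        "RMSE": rmse,
        "MedianAD": median_ad,
        "MAE": mae,
        "SD_RMSE_ratio": sd / rmse,
        "MedianAD_MAE_ratio": median_ad / mae,
    }
```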

4.4.6 Final Model for Deployment

In this section, the final pre-processing and prediction pipeline will be created and trained on the whole dataset. The model will be saved into a pickle file and deployed on Google Cloud Platform (GCP).

Code
# Pipeline input variables
int_rate_pipeline_input = [
    "total_rec_int",
    "total_rec_prncp",
    "issue_d",
    "fico_range_high",
    "installment",
    "out_prncp",
    "funded_amnt",
    "disbursement_method",
    "loan_amnt",
    "bc_open_to_buy",
    "last_fico_range_high",
    "out_prncp_inv",
    "term",
    "funded_amnt_inv",
    "dti",
    "fico_range_low",
]


@my.cache_results(dir_interim + "04--model_predict_int_rate.pkl")
def fit_lgbm_int_rate_for_deployment():
    """Final LGBM model for sub-grades prediction."""
    pipeline = Pipeline(
        steps=[
            ("selector_1", ColumnSelector(int_rate_pipeline_input)),
            ("preprocessor_1", PreprocessorForInterestRates()),
            ("preprocessor_2", clone(pre_processing_trees)),
            ("selector_2", ColumnSelector(int_rate_model_input)),
            (
                "regressor",
                LGBMRegressor(
                    random_state=1,
                    objective="regression",
                    n_jobs=-1,
                    device="gpu",
                    verbosity=1,
                ),
            ),
        ]
    )

    # Load required dataset
    data_file_path = dir_interim + "task-2--1-accepted_loans_2018--raw.feather"
    accepted_2018 = pd.read_feather(data_file_path)

    pipeline.set_output(transform="pandas")
    pipeline.fit(accepted_2018, accepted_2018["int_rate"])
    return pipeline


# Fit the model
loan_int_rate_predictor_final = fit_lgbm_int_rate_for_deployment()
loan_int_rate_predictor_final
Pipeline(steps=[('selector_1',
                 ColumnSelector(keep=['total_rec_int', 'total_rec_prncp',
                                      'issue_d', 'fico_range_high',
                                      'installment', 'out_prncp', 'funded_amnt',
                                      'disbursement_method', 'loan_amnt',
                                      'bc_open_to_buy', 'last_fico_range_high',
                                      'out_prncp_inv', 'term',
                                      'funded_amnt_inv', 'dti',
                                      'fico_range_low'])),
                ('preprocessor_1', PreprocessorForInterestRates()),
                ('prep...
                 ColumnSelector(keep=['total_rec_int', 'total_rec_prncp',
                                      'issue_month', 'fico_range_high',
                                      'installment', 'out_prncp', 'funded_amnt',
                                      'disbursement_method_Cash', 'loan_amnt',
                                      'bc_open_to_buy', 'last_fico_range_high',
                                      'out_prncp_inv', 'term_ 60 months',
                                      'funded_amnt_inv', 'term_ 36 months',
                                      'dti', 'fico_range_low'])),
                ('regressor',
                 LGBMRegressor(device='gpu', n_jobs=-1, objective='regression',
                               random_state=1, verbosity=1))])
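The fitted pipeline is persisted via pickle (here through `@my.cache_results`) and restored at serving time. The save/load round trip can be sketched with a stand-in object; the class and file location below are illustrative, not the project's actual code:

```python
import os
import pickle
import tempfile


class StandInPredictor:
    """Stand-in for the fitted pipeline: any object with a
    predict() method can be serialized the same way."""

    def predict(self, rows):
        return [7.5 for _ in rows]  # placeholder interest rate


# Persist the fitted object to disk (illustrative path).
path = os.path.join(tempfile.gettempdir(), "04--model_predict_int_rate.pkl")
with open(path, "wb") as f:
    pickle.dump(StandInPredictor(), f)

# At serving time: restore the object and predict on incoming records.
with open(path, "rb") as f:
    model = pickle.load(f)
print(model.predict([{"loan_amnt": 10_000}]))  # -> [7.5]
```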

5 Deployment

Currently, the models are deployed on the Google Cloud Platform (GCP) as a single Flask app that is available at https://lending-club-app-x6jg32rquq-ew.a.run.app/. More details and examples on how to use the models are available in the README file of the app subdirectory of this project’s GitHub repository.

  • The app folder in the project’s GitHub repository: app.
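A request to the deployed API typically looks like the sketch below. The endpoint path and the field names in the example record are hypothetical; the actual request schema is documented in the app's README:

```python
import json
from urllib import request

BASE_URL = "https://lending-club-app-x6jg32rquq-ew.a.run.app"


def build_request(record: dict, endpoint: str = "/predict_int_rate") -> request.Request:
    """Build a POST request for the deployed Flask app.
    NOTE: the endpoint path and the field names in `record` are
    hypothetical; see the app's README for the real schema."""
    return request.Request(
        BASE_URL + endpoint,
        data=json.dumps(record).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict_int_rate(record: dict) -> dict:
    """Send the record and return the parsed JSON response."""
    with request.urlopen(build_request(record)) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # Example record (field names are illustrative only):
    example = {"loan_amnt": 10_000, "term": " 36 months", "dti": 18.2}
    # predict_int_rate(example)  # requires network access
```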

6 Final Remarks

  1. In binary classification, the default threshold of 0.5 was used.
    Threshold adjustment (e.g., via ROC curve analysis) might be beneficial.

  2. Hyperparameter tuning was not performed in this analysis due to the limited time assigned to the task. Tuned hyperparameters might yield better performance; hopefully, scikit-learn’s defaults work well enough.

  3. In loan status prediction, capped versions of ‘loan_amount’ and ‘debt_to_income_ratio’ were used for logistic regression and Naive Bayes, while non-capped versions were used for LGBM. Instead of relying on this assumption, both types of pre-processing could have been compared empirically.

  4. Feature selection after initial modeling was based on SHAP values and was done manually. It helped to achieve almost the same or even better performance with significantly fewer features. More rigorous procedures like recursive feature elimination (RFE) might lead to an even better subset of variables.

  5. From time to time, this notebook generated underflow issues, especially in some plots and in the predict_proba() method, so the analysis was not performed in one go. (This also leads to the next point.)

  6. Many intermediate results were saved to files:

    1. Better and more systematic naming conventions should make it easier to track the results.
    2. Saving so many results to disk saves time when the program crashes, but it requires an excessive amount of disk space. If disk space is an issue, a cleverer scheme could be devised for removing older versions of results that are no longer needed.
    3. Due to the constant saving, some parts of the code became denser and thus harder to read.
  7. In some cases (especially model training) there is a lot of repeated code that could have been generalized. Unfortunately, this would have required much more time.

  8. I explored three AutoML libraries. Unfortunately, none of them were included in the analysis:

    • auto-sklearn works only on Linux (I used Windows) and has issues with the newest versions of some packages (I cannot remember which ones: pandas, numpy, scikit-learn, or the newest Python itself; I used Python 3.11);
    • ludwig dependencies conflict with the dependencies of jupyter;
    • tpot2 has poor documentation, and tpot does not support the newest versions of pandas. I managed to install it in a separate environment and run some toy examples, but decided not to include it in the analysis due to a lack of time.
  9. Plots:

    • plot themes are not unified throughout the project (the same type of plot might have a different theme), and I am not sure which functions silently change the themes.
    • some plots may be better labeled.
    • some plots (especially SHAP value plots) are large and could be made smaller.
  10. Some results are poorly described or presented without explanation. I used them to make decisions for further analysis or to check whether pre-processing and other analysis steps were performed as expected.

  11. Not all functions from the functions subfolder were used; it should be treated as a separate module.

  12. Models were created and deployed. Unfortunately, I do not have enough domain knowledge to judge whether the selected variables make sense and do not leak information. A discussion with a domain specialist would be beneficial.
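The threshold adjustment mentioned in point 1 can be sketched as follows: instead of the fixed 0.5 cutoff, pick the probability threshold that maximizes Youden's J statistic (TPR - FPR) on a validation set. Below is a minimal pure-Python version; in practice, `sklearn.metrics.roc_curve` on `predict_proba` outputs would be used:

```python
def best_threshold(y_true, scores):
    """Choose the classification threshold that maximizes Youden's
    J = TPR - FPR. Assumes both classes are present in y_true."""
    pos = sum(1 for y in y_true if y == 1)
    neg = len(y_true) - pos
    best_t, best_j = 0.5, float("-inf")
    # Evaluate every distinct predicted score as a candidate cutoff.
    for t in sorted(set(scores)):
        tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= t)
        fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= t)
        j = tp / pos - fp / neg
        if j > best_j:
            best_t, best_j = t, j
    return best_t
```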