In this project, a comprehensive analysis of the travel insurance dataset was conducted, delving deep into its intricacies. A sophisticated classification model was methodically developed, fine-tuned, and rigorously evaluated using advanced machine learning techniques. The chosen model, a Random Forest algorithm, exhibited an impressive accuracy of 82.3% (a balanced accuracy of 76.7%). This model effectively predicts whether a customer will opt for the insurance or not, based on the customer’s annual income, age, and the number of family members. Notably, annual income emerged as the most influential feature in making accurate predictions.
1 Introduction
1.1 The Issue
A company in the travel industry is introducing a travel insurance package that includes coverage for COVID-19. To determine, which customers would be interested in buying it, the company needs to analyze its database history. Data from approximately 2000 previous customers, who were offered the insurance package in 2019, has been provided for this purpose. The task is to build a predictive model that can determine whether a customer is likely to purchase the travel insurance package.
1.2 Notes on Methodology
In statistical inference, significance level is 0.05, confidence level is 95%.
In machine learning, balanced accuracy (BAcc) is used as the main performance metric.
1.3 Setup and Instructions
Create and setup virtual environment (VE) for Python
conda, mamba, pip and other required command line tools must be installed.
Create a virtual environment and install the required packages:
Every time before working on the project, make sure that the working directory (WD) of the Jupyter Notebook matches the working directory set in the terminal.
To get your current WD in Jupyter Notebook, run the following code in a Python cell:
!echo '%cd%'# under windows
In the terminal, to set WD to the project’s directory, use cd and output of the above code, e.g.:
# This is just an example, you should choose the correct path in your casecd'd:/Data Science/projects/proj-travel-insurance'
Next, activate the virtual environment (VE) for this project: In Jupyter Notebook, use the kernel from this environment.
Download data
To download the dataset:
If you do not have it, create Kaggle API token and save it locally:
Log in to Kaggle.
Go to https://www.kaggle.com/settings
Under API section, click “Create New Token” and it will be suggested to download kaggle.json file.
Save the kaggle.json locally in the location ~/.kaggle/kaggle.json.
The data file will be called TravelInsurancePrediction.csv.
Create/Update HTML report
First, run all the code cells in the notebook.
Next, create HTML report (command line tool quarto must be installed):
quarto render travel-insurance.ipynb --to html --output index.html
Last, open the HTML file index.html in browser. On Windows, you can use the following command in terminal:
start index.html
Code: The main Python setup
# Automatically reload certain modules%reload_ext autoreload%autoreload 1# Plotting%matplotlib inline# Packages and modules -------------------------------import osimport picklefrom pprint import pprint# Dataframesimport pandas as pd# EDAimport ydata_profiling as edaimport sweetviz as svfrom skimpy import skim# Data wrangling, mathsimport numpy as npimport janitor # imports additional Pandas methods# Statistical analysisimport scipy.stats as spsfrom statsmodels.stats.multitest import multipletests as p_adjust# Machine learningfrom sklearn.base import BaseEstimator, TransformerMixinfrom sklearn.pipeline import Pipelinefrom sklearn.compose import ColumnTransformerfrom sklearn.model_selection import train_test_splitfrom sklearn.preprocessing import FunctionTransformer, StandardScalerfrom sklearn.linear_model import LogisticRegressionfrom sklearn.neighbors import KNeighborsClassifierfrom sklearn.svm import SVCfrom sklearn.ensemble import RandomForestClassifierfrom sklearn.naive_bayes import GaussianNBfrom sklearn.model_selection import GridSearchCV, RandomizedSearchCVfrom mlxtend.feature_selection import SequentialFeatureSelector, ColumnSelectorfrom sklearn.metrics import ( classification_report, confusion_matrix, ConfusionMatrixDisplay,)from imblearn.under_sampling import RandomUnderSamplerfrom imblearn.pipeline import Pipeline as ImbPipeline# Visualizationimport matplotlib.pyplot as plt# Custom functionsimport functions.fun_utils as myimport functions.pandas_methodsimport functions.fun_analysis as animport functions.fun_ml as ml%aimport functions.fun_utils%aimport functions.pandas_methods%aimport functions.fun_analysis%aimport functions.fun_ml# Settings --------------------------------------------# Default plot optionsplt.rc("figure", titleweight="bold")plt.rc("axes", labelweight="bold", titleweight="bold")plt.rc("font", weight="normal", size=10)plt.rc("figure", figsize=(7, 3))# Pandas optionspd.set_option("display.max_rows", 1000)pd.set_option("display.max_columns", 300)pd.set_option("display.max_colwidth", 50) # Possible option: Nonepd.set_option("display.float_format", lambda x: f"{x:.2f}")pd.set_option("styler.format.thousands", ",")# Turn off the scientific notation for floating point numbers.np.set_printoptions(suppress=True)# Analysis parametersdo_eda =True# For caching results ---------------------------------dir_cache =".saved_results/"os.mkdir(dir_cache) ifnot os.path.exists(dir_cache) elseNone
Code: Custom ad-hoc functions
# NOTE: Some ad-hoc functions are defined in other places of the file# when it seemed that it makes the analysis easier to follow.# Pre-processing pipeline ================================================class DataFrameAssertion(BaseEstimator, TransformerMixin):"""Custom transformer for asserting the input data as a Pandas DataFrame. This transformer checks if the input data is an instance of a Pandas DataFrame. If the input data is not a DataFrame, an AssertionError is raised. Attributes: None Methods: fit(X, y=None): Fit method (Returns self). transform(X): Transform method to perform the DataFrame assertion. """def fit(self, X, y=None):returnselfdef transform(self, X):assertisinstance( X, pd.DataFrame ), f"Expected a Pandas DataFrame, got {str(type(X))}"return Xclass ColnamesToSnakeCase(BaseEstimator, TransformerMixin):"""Custom transformer for renaming variables to snake case. This transformer renames the columns of a DataFrame to snake case. Attributes: None Methods: fit(X, y=None): Fit method (Returns self). transform(X): Transform method to rename variables to snake case. """def fit(self, X, y=None):returnselfdef transform(self, X): X = X.rename(columns=my.to_snake_case)return Xclass FeatureSelector(BaseEstimator, TransformerMixin):"""Selects the indicated features from a DataFrame Attributes: feature_names (list): List of feature names to select. Methods: fit(X, y=None): Fit method (Returns self). transform(X): Transform method to select columns of interest. Returns a DataFrame with the selected features """def__init__(self, feature_names):"""Constructor Args.: feature_names (list): List of feature names to select. """self.feature_names = feature_namesdef fit(self, X, y=None):returnselfdef transform(self, X):# Select the indicated features from the input DataFrame X selected_features = X[self.feature_names]return pd.DataFrame(selected_features, columns=self.feature_names)class ColumnPresenceAssertion(BaseEstimator, TransformerMixin):"""Custom transformer for column name presence assertion. This transformer checks if the input data has the required columns. It compares the columns of the input data with a list of required columns. If any required columns are missing or there are additional columns not specified in the required list, an AssertionError is raised. Attributes: required_columns (list or array-like): The list of required column names. Methods: fit(X, y=None): Fit method (Returns self). transform(X): Transform method to perform column name assertion. """def__init__(self, required_columns):"""Constructor for ColumnPresenceAssertion. Parameters: - required_columns: List of required column names. """self.required_columns = required_columnsdef fit(self, X, y=None):returnselfdef transform(self, X): required_columns =set(self.required_columns) current_columns =set(X.columns) missing_columns = required_columns - current_columns not_used_columns = current_columns - required_columnsassertnot missing_columns, (f"\nMissing columns: {missing_columns}\n"f"Present but not used columns: {not_used_columns}" )return Xclass CreateIncomePerFamilyMember(BaseEstimator, TransformerMixin):"""Custom transformer for creating the 'income_per_family_member' variable. This transformer calculates the income per family member by dividing the 'annual_income' column by the 'family_members' column. Attributes: None Methods: fit(X, y=None): Fit method (Returns self). transform(X): Transform method to create the 'income_per_family_member' variable. """def fit(self, X, y=None):returnselfdef transform(self, X): X["income_per_family_member"] = X["annual_income"] / X["family_members"]return Xclass CreateAgeGroups(BaseEstimator, TransformerMixin):"""Custom transformer for creating the 'age_group' variable. This transformer creates the 'age_group' variable based on the 'age' column, using predefined age group bins. It also creates an additional binary variable 'age_group_01' indicating if the age group is '30-35'. Attributes: only_01 (bool): if True, only groups encoded as 0/1 are returned. Methods: fit(X, y=None): Fit method (Returns self). transform(X): Transform method to create the 'age_group' and 'age_group_01' variables. """def__init__(self, only_01=False):"""Constructor for ColumnPresenceAssertion. Parameters: - only_01: if True, only groups encoded as 0/1 are returned. """self.only_01 = only_01def fit(self, X, y=None):returnselfdef transform(self, X): X["age_group"] = pd.cut( X["age"], bins=[24.5, 29.5, 35.5], labels=["25-29", "30-35"], ) X["age_group_01"] = (X["age_group"] =="30-35").astype(int)ifself.only_01: X.drop(columns=["age_group"], inplace=True)return Xclass EncodeEmployment01(BaseEstimator, TransformerMixin):"""Custom transformer for encoding the 'employment_type' variable. This transformer creates a binary variable 'employment_type_01' where "1" indicates that employment type is 'Private Sector/Self Employed' and "0" indicates that employment type is 'Government Sector'. Attributes: None Methods: fit(X, y=None): Fit method (Returns self). transform(X): Transform method to create the 'employment_type_01' variable. """def fit(self, X, y=None):returnselfdef transform(self, X): X["employment_type_01"] = ( X["employment_type"] =="Private Sector/Self Employed" ).astype(int)return Xclass CreateTravelExperience(BaseEstimator, TransformerMixin):"""Custom transformer for creating the 'experienced_traveler' variable. This transformer creates a binary variable 'experienced_traveler' based on the values of 'ever_travelled_abroad' and 'frequent_flyer' columns. Attributes: None Methods: fit(X, y=None): Fit method (Returns self). transform(X): Transform method to create the 'experienced_traveler' variable. """def fit(self, X, y=None):returnselfdef transform(self, X): X["experienced_traveler"] = ( (X["ever_travelled_abroad"] =="Yes")& (X["frequent_flyer"] =="Yes") ).astype(int)return Xclass SortColumns(BaseEstimator, TransformerMixin):"""Custom transformer to sort columns alphabetically and move the target to the first position. This transformer sorts the columns of the input DataFrame in alphabetical order and moves the target column (if specified) to the first position. Attributes: target (str): The name of the target column. Default is None. Methods: fit(X, y=None): Fit method (Returns self). transform(X): Transform method to sort columns. """def__init__(self, target=None):"""Constructor for SortColumns. Parameters: - target: Name of the target variable to be moved. """assertisinstance(target, str), "Target variable name must be a string"self.target = targetdef fit(self, X, y=None):returnselfdef transform(self, X): X = X[sorted(X.columns)]ifself.target isnotNone: target_column = X.pop(self.target) X.insert(0, self.target, target_column)return Xclass YesNoEncoder(BaseEstimator, TransformerMixin):"""Custom transformer to encode 'Yes' as 1, 'No' as 0, and handle other values."""def fit(self, X, y=None): my.assert_values(X, ["Yes", "No"])self.columns_ = X.columnsreturnselfdef transform(self, X): my.assert_values(X, ["Yes", "No"])return X[self.columns_].apply(yes_no_to_1_0)def get_feature_names_out(self, *args, **params):returnself.columns_def yes_no_to_1_0(x: pd.Series):"""Function to convert 'Yes' to 1, 'No' to 0, and handle other values. Parameters: - x: Series of values. Returns: - x: Series with 'Yes' and 'No' values converted to 1 and 0, respectively. """# Convert to lowercase and trim whitespace x = x.str.lower().str.strip()# Map values to 1, 0 or NaN x = x.map({"yes": 1, "no": 0}).astype(int)return x# Classification ===========================================================def classification_report_with_confusion_matrix( best_models, X, y, labels=[1, 0]):""" Function to print a classification report with a confusion matrix. Reference to interpret confusion matrix: - rows: true class - columns: predicted class - cells contain counts. Usually, the meaning is as follows: ``` TN FP FN TP ``` (T - true, F - false, P - positive, N - negative) Args: best_models (dict): Dictionary containing the best models. X (pd.DataFrame): Dataframe containing the features. y (pd.Series): Series containing the target. labels (list): List containing the labels for confusion matrix. """ [ [ y_pred := model.predict(X),print("<<< "+ name +" >>>","\n", classification_report(y, y_pred),"\n","Confusion matrix (order of labels is "+str(labels) +"): \n","rows = true labels, columns = predicted labels \n", confusion_matrix(y, y_pred, labels=labels),"\n","\n\n", ), ]for name, model in best_models.items() ]
External files with code
The following functions, methods and classes are present in external files (Python modules). They are displayed here for convenience
Code
"""Various functions for data pre-processing, analysis and plotting."""# OS moduleimport os# Other Python libraries and modulesimport pathlibimport pickleimport functoolsimport pandas as pdimport matplotlib.pyplot as pltfrom typing import Unionfrom IPython.display import displayfrom matplotlib.ticker import MaxNLocator# Utilities ==================================================================# Check/Assertdef index_has_names(obj: Union[pd.Series, pd.DataFrame]) ->bool:"""Check if index of a Pandas object (Series of DataFrame) has names. Args: obj: Object that has `.index` attribute. Returns: bool: True if index has names, False otherwise. Examples: >>> import pandas as pd >>> series1 = pd.Series([1, 2], index=pd.Index(['A', 'B'], name='Letters')) >>> index_has_names(series1) True >>> series2 = pd.Series([1, 2], index=pd.Index(['A', 'B'])) >>> index_has_names(series2) False >>> dataframe1 = pd.DataFrame( ... [[1, 2], [3, 4]], ... index=pd.Index(['A', 'B'], name='Rows'), ... columns=['X', 'Y'] ... ) >>> index_has_names(dataframe1) True >>> dataframe2 = pd.DataFrame([[1, 2], [3, 4]]) >>> index_has_names(dataframe2) False """returnNonenotinlist(obj.index.names)def assert_values(df: pd.DataFrame, expected_values: list[str]) ->None:"""Assert that the values of each column in a Pandas DataFrame are among the expected values. Args: df (pd.DataFrame): The input DataFrame to check for expected values. expected_values (list[str]): The list of expected values. Raises: AssertionError: If any column in the DataFrame contains values not present in the expected values. Examples: >>> data = pd.DataFrame({ >>> 'col1': ['Yes', 'No', 'Yes'], >>> 'col2': ['Yes', 'Yes', 'Yes'] >>> }) >>> assert_values(data, ['Yes', 'No']) # No AssertionError is raised >>> data = pd.DataFrame({ >>> 'col1': ['Yes', 'No', 'Yes'], >>> 'col2': ['Yes', 'Maybe', 'no'] >>> }) >>> assert_values(data, ['Yes', 'No']) AssertionError: Only ['Yes', 'No'] values were expected in the following columns (Column name [unexpected values]): col2: ['Maybe', 'no'] """ non_matching_values = {}for column in df.columns: non_matching = df[~df[column].isin(expected_values)][column].tolist()if non_matching: non_matching_values[column] = non_matchingif non_matching_values: error_message = (f"\nOnly {expected_values} values were expected in the following ""columns\n(Column name [unexpected values]):\n" )for col_name, unexpected_values in non_matching_values.items(): error_message +=f"{col_name}: {unexpected_values}\n"raiseAssertionError(error_message)# Cachedef cache_results(file: str):"""Decorator to cache results of a function and save them to a file in the pickle format. Args.: file (str): File name. """def decorator_cache(fun):@functools.wraps(fun)def wrapper(*args, **kwargs):if pathlib.Path(file).is_file():withopen(file, "rb") as f: results = pickle.load(f)else: results = fun(*args, **kwargs)withopen(file, "wb") as f: pickle.dump(results, f)return resultsreturn wrapperreturn decorator_cache# Format values ------------------------------------------------------------def to_snake_case(text: str):"""Convert a string to the snake case. Args: text (str): The input string to change to the snake case. Returns: str: The string converted to the snake case. Examples: >>> to_snake_case("Some Text") 'some_text' >>> to_snake_case("SomeText2") 'some_text_2' """assertisinstance(text, str), "Input must be a string."return ( pd.Series(text) .str.replace("(?<=[a-z])(?=[A-Z0-9])", "_", regex=True) .str.replace(r"[ ]", "_", regex=True) .str.replace(r"_+", "_", regex=True) .str.lower() .to_string(index=False) )def format_p(p: float, digits: int=3, add_p: bool=True) ->str:"""Format p values at 3 decimal places. Args: p (float): p value (number between 0 and 1). digits (int, optional): Number of decimal places to round to. Defaults to 3. add_p (bool, optional): Should the string start with "p"? Examples: >>> format_p(1) 'p > 0.999' >>> format_p(0.12345) 'p = 0.123' >>> format_p(0.00001) 'p < 0.001' >>> format_p(0.00001, digits=2) 'p < 0.01' >>> format_p(1, digits=5) 'p > 0.99999' """ precision =10** (-digits)if add_p: prefix = ["p < ", "p > ", "p = "]else: prefix = ["<", ">", ""]if p < precision:returnf"{prefix[0]}{precision}"elif p > (1- precision):returnf"{prefix[1]}{1- precision}"else:returnf"{prefix[2]}{p:.{digits}f}"def format_percent(x: Union[float, list[float], pd.Series]) -> pd.Series:"""Round percentages to 1 decimal place and format as strings Values between 0 and 0.05 are printed as <0.1% Values between 99.95 and 100 are printed as >99.9% Args: x: (A sequence of) percentage values ranging from 0 to 100. Returns: pd.Series[str]: Pandas series of formatted values. Values equal to 0 are formatted as "0%", values between 0 and 0.05 are formatted as "<0.1%", values between 99.95 and 100 are formatted as ">99.9%", and values equal to 100 are formatted as "100%". Examples: >>> format_percent(0) ['0%'] >>> format_percent(0.01) ['<0.1%'] >>> format_percent(1) ['1.0%'] >>> format_percent(10) ['10.0%'] >>> format_percent(99.986) ['>99.9%'] >>> format_percent(100) ['100.0%'] >>> format_percent([100, 0, 0.2]) ['100.0%', '0%', '0.2%'] Author: Vilmantas Gėgžna """ifnotisinstance(x, (list, pd.Series)): x = [x] x_formatted = ["0%"if i ==0else"<0.1%"if i <0.05else">99.9%"if99.95<= i <100elsef"{i:.1f}%"for i in x ]ifisinstance(x, pd.Series):return pd.Series(x_formatted, index=x.index)else:return x_formatteddef counts_to_percentages(x: pd.Series, name: str="percent") -> pd.Series:"""Express counts as percentages. The sum of count values is treated as 100%. Args: x (int, float): Counts data as pandas.Series. name (str, optional): The name for output pandas.Series with percentage values. Defaults to "percent". Returns: str: pandas.Series object with `x` values expressed as percentages and rounded to 1 decimal place, e.g., "0.2%". Values equal to 0 are formatted as "0%", values between 0 and 0.1 are formatted as "<0.1%", values between 99.9 and 100 are formatted as ">99.9%". Examples: >>> import pandas as pd >>> counts_to_percentages(pd.Series([1, 0, 1000, 2000, 1000, 5000, 1000])) 0 <0.1% 1 0% 2 10.0% 3 20.0% 4 10.0% 5 50.0% 6 10.0% Name: percent, dtype: object >>> counts_to_percentages(pd.Series([1, 0, 10000])) 0 <0.1% 1 0% 2 >99.9% Name: percent, dtype: object Author: Vilmantas Gėgžna """return format_percent(x / x.sum() *100).rename(name)# Display --------------------------------------------------------def highlight_max(s, color="green"):"""Helper function to highlight the maximum in a Series or DataFrame Args: s: Numeric values one of which will be highlighted. color (str, optional): Text highlight color. Defaults to 'green'. Examples: >>> import pandas as pd >>> data_frame = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]}) >>> data_frame.style.apply(highlight_max) >>> data_frame.style.format(precision=2).apply(highlight_max) """ is_max = s == s.max()return [f"color: {color}"if cell else""for cell in is_max]def highlight_rows_by_index(x, values, color="green"):"""Highlight rows/columns with certain index/column name. Args.: x (pandas.DataFrame): Dataframe to highlight. values (list): List of index/column names to highlight. color (str, optional): Text highlight color. Defaults to 'green'. Examples: >>> iris.head(10).style.apply(highlight_rows, values=[8, 9], axis=1) """return [f"color: {color}"if (x.name in values) else""for i in x]# Plot counts ---------------------------------------------------------------def plot_counts_with_labels( counts, title="", x=None, y="n", x_lab=None, y_lab="Count", label="percent", label_rotation=0, title_fontsize=13, legend=False, ec="black", y_lim_max=None, ax=None,**kwargs,):"""Plot count data as bar plots with labels. Args: counts (pandas.DataFrame): Data frame with counts data. title (str, optional): Figure title. Defaults to "". x (str, optional): Column name from `counts` to plot on x axis. Defaults to None: first column. y (str, optional): Column name from `counts` to plot on y axis. Defaults to "n". x_lab (str, optional): X axis label. Defaults to value of `x` with capitalized first letter. y_lab (str, optional): Y axis label. Defaults to "Count". label (str, None, optional): Column name from `counts` for value labels. Defaults to "percent". If None, label is not added. label_rotation (int, optional): Angle of label rotation. Defaults to 0. legend (bool, optional): Should legend be shown?. Defaults to False. ec (str, optional): Edge color. Defaults to "black". y_lim_max (float, optional): Upper limit for Y axis. Defaults to None: do not change. ax (matplotlib.axes.Axes, optional): Axes object. Defaults to None. **kwargs: further arguments to pandas.DataFrame.plot.bar() Returns: matplotlib.axes.Axes: Axes object of the generate plot. Author: Vilmantas Gėgžna """if x isNone: x = counts.columns[0]if x_lab isNone: x_lab = x.capitalize()if y_lim_max isNone: y_lim_max = counts[y].max() *1.15 ax = counts.plot.bar(x=x, y=y, legend=legend, ax=ax, ec=ec, **kwargs) ax.set_title(title, fontsize=title_fontsize) ax.set_xlabel(x_lab) ax.set_ylabel(y_lab)if label isnotNone: ax_add_value_labels_ab( ax, labels=counts[label], rotation=label_rotation ) ax.set_ylim(0, y_lim_max)return axdef ax_xaxis_integer_ticks(min_n_ticks: int, rot: int=0):"""Ensure that x axis ticks has integer values Args: min_n_ticks (int): Minimal number of ticks to use. rot (int, optional): Rotation angle of x axis tick labels. Defaults to 0. """ ax = plt.gca() ax.xaxis.set_major_locator( MaxNLocator(min_n_ticks=min_n_ticks, integer=True) ) plt.xticks(rotation=rot)def ax_axis_comma_format(axis: str="xy", ax=None):"""Write values of X axis ticks with comma as thousands separator Args: axis (str, optional): which axis should be formatted: "x" X axis, "y" Y axis or "xy" (default) both axes. ax (axis object, None, optional):Axis of plot. Defaults to None: current axis. """if ax isNone: ax = plt.gca() fmt ="{x:,.0f}" formatter = plt.matplotlib.ticker.StrMethodFormatter(fmt)if"x"in axis: ax.xaxis.set_major_formatter(formatter)if"y"in axis: ax.yaxis.set_major_formatter(formatter)def ax_add_value_labels_ab( ax, labels=None, spacing=2, size=9, weight="bold", **kwargs):"""Add value labels above/below each bar in a bar chart. Arguments: ax (matplotlib.Axes): Plot (axes) to annotate. label (str or similar): Values to be used as labels. spacing (int): Number of points between bar and label. size (int): font size. weight (str): font weight. **kwargs: further arguments to axis.annotate. Source: This function is based on https://stackoverflow.com/a/48372659/4783029 """# For each bar: Place a labelfor rect, label inzip(ax.patches, labels):# Get X and Y placement of label from rect. y_value = rect.get_height() x_value = rect.get_x() + rect.get_width() /2 space = spacing# Vertical alignment for positive values va ="bottom"# If the value of a bar is negative: Place label below the barif y_value <0:# Invert space to place label below space *=-1# Vertical alignment va ="top"# Use Y value as label and format number with one decimal placeif labels isNone: label ="{:.1f}".format(y_value)# Create annotation ax.annotate( label, (x_value, y_value), xytext=(0, space), textcoords="offset points", ha="center", va=va, fontsize=size, fontweight=weight,**kwargs, )
Code
"""Additional methods for Pandas Series and DataFrames"""# Setup -----------------------------------------------------------------import pandas as pdimport pandas_flavor as pffrom typing import Unionimport functions.fun_utils as my # Custom module# Series methods --------------------------------------------------------@pf.register_series_methoddef to_df(self: pd.Series, values_name: str=None, key_name: Union[str, list[str], tuple[str, ...]] =None,) -> pd.DataFrame:"""Convert Series to DataFrame with desired or default column names. Similar to `pandas.Series.to_frame()`, but the main purpose of this method is to be used with the result of `.value_counts()`. So appropriate default column names are pre-defined. And index is always reset. Args: self (pandas.Series): The object the method is applied to. values_name (str): Name for series values (applied before conversion to DataFrame). Defaults "count". key_name (str or sequence of str): New name for the columns, that are created from Series index that was present before the conversion to DataFrame. Defaults to `self.index.names`, if index has names, to `self.name` if index has no names but series has name, or to "value" otherwise. Return: pandas.DataFrame Examples: >>> import pandas as pd >>> df = pd.Series({'right': 138409, 'left': 44733}).rename("foot") >>> df.to_df() >>> # Compared to .to_frame() >>> df.to_frame() """ k_name =None v_name =None# Check if defaults can be set based on non-missing attribute valuesif my.index_has_names(self): k_name =self.index.namesifself.name isnotNone: v_name =self.nameelse: k_name =self.name# Set user-defined values or defaultsif key_name isnotNone: k_name = key_nameelif k_name isNone: k_name ="value"# Defaultif values_name isnotNone: v_name = values_nameelif v_name isNone: v_name ="count"# Default# Outputreturnself.rename_axis(k_name).rename(v_name).reset_index()# DataFrame methods --------------------------------------------------------@pf.register_dataframe_methoddef index_start_at(self, start=1):"""Create a new sequential index that starts at indicated number. Args.: self (pd.DataFrame): The object the method is applied to. start (int): The start of an index Return: pandas.DataFrame """ i =self.indexself.index =range(start, len(i) + start)returnself
Code
"""Functions to perform statistical analysis and output the results."""import pandas as pdimport statsmodels.stats.api as smsimport scipy.stats as spsimport scikit_posthocs as spimport functions.fun_utils as my # Custom modulefrom typing import Optional, Union# Functions =================================================================# Exploratory analysis ------------------------------------------------------def get_columns_overview(data: pd.DataFrame) -> pd.DataFrame:"""Get overview of data frame columns: data types, number of unique values, and number of missing values. Args: data (pd.DataFrame): Data frame to analyze. Returns: pd.DataFrame: Data frame with columns `column`, `data_type`, `n_unique`, and `n_missing`. """return pd.DataFrame( {"column": data.columns,"data_type": data.dtypes,"n_unique": data.nunique(),"n_missing": data.isna().sum(), } )# Inferential statistics -----------------------------------------------------def ci_proportion_binomial( counts, method: str="wilson", n_label: str="n", percent_label: str="percent",**kwargs,) -> pd.DataFrame:"""Calculate confidence intervals for binomial proportion. Calculates confidence intervals for each category separately on "category counts / total counts" basis. Wrapper around statsmodels.stats.proportion.proportion_confint() More information in documentation of statsmodels's proportion_confint(). Args: x (int): ps.Series, list or tuple with count data. method (str, optional): Method. Defaults to "wilson". n_label (str, optional): Name for column for counts. percent_label (str, optional): Name for column for percentage values. **kwargs: Additional arguments passed to proportion_confint(). Returns: pd.DataFrame: Data frame with group names, absolute counts, percentages and their confidence intervals. Examples: >>> ci_proportion_binomial([62, 55]) """assertisinstance(counts, (pd.Series, list, tuple))ifnotisinstance(counts, pd.Series): counts = pd.Series(counts)return pd.concat( [ (counts).rename(n_label), (counts /sum(counts)).rename(percent_label) *100, pd.DataFrame( [ sms.proportion_confint( count_i, sum(counts), method=method, **kwargs )for count_i in counts ], index=counts.index, columns=[f"ci_lower", "ci_upper"], )*100, ], axis=1, )def test_chi_square_gof( f_obs: list[int], f_exp: Union[str, list[float]] ="all equal", output: str="long",) ->str:"""Chi-squared (χ²) goodness-of-fit (gof) test Args: f_obs (list[int]): Observed frequencies f_exp str, list[int]: List of expected frequencies or "all equal" if all frequencies are equal to the mean of observed frequencies. Defaults to "all equal". output (str, optional): Output format (available options: "short", "long"). Defaults to "long". Returns: str: formatted test results including p value. """ k =len(f_obs) n =sum(f_obs) exp = n / k dof = k -1if f_exp =="all equal": f_exp = [exp for _ inrange(k)] stat, p = sps.chisquare(f_obs=f_obs, f_exp=f_exp)# May also be formatted this way:if output =="short": result =f"chi-square test, {my.format_p(p)}"else: result =f"chi-square test, χ²({dof}, n = {n}) = {round(stat, 2)}, {my.format_p(p)}"return result
Code
"""Various functions for machine learning related tasks."""import pandas as pdimport matplotlib.pyplot as pltimport matplotlib.ticker as mticker# Machine learningfrom sklearn.metrics import ( f1_score, accuracy_score, balanced_accuracy_score, roc_auc_score, cohen_kappa_score,)import functions.fun_utils as my # Custom module# Helpersdef get_metric_abbreviation_and_sign(scoring: str):"""Internal function to parse scoring string and return abbreviation and sign. """ sign =-1if scoring.startswith("neg") else1if scoring =="neg_root_mean_squared_error": metric ="RMSE"elif scoring =="balanced_accuracy": metric ="BAcc"elif scoring =="r2": metric ="R²"elif scoring =="f1": metric ="F1"else: metric = scoringreturn sign, metric# Feature selectiondef sfs_plot_results(sfs_object, sub_title="", ref_y=None):"""Plot results from SFS object Args.: sfs_object: object with SFS results. sub_title (str): second line of title. ref_y (float): Y coordinate of reference line. """ scoring = sfs_object.get_params()["scoring"] sign, metric = get_metric_abbreviation_and_sign(scoring)if sfs_object.forward: sfs_plot_title ="Forward Feature Selection"else: sfs_plot_title ="Backward Feature Elimination" fig, ax = plt.subplots(1, 2, sharey=True) xlab ="Number of predictors included"if ref_y isnotNone: ax[0].axhline(y=ref_y, color="darkred", linestyle="--", lw=0.5) ax[1].axhline(y=ref_y, color="darkred", linestyle="--", lw=0.5) avg_score = [ (int(i), sign * c["avg_score"]) for i, c in sfs_object.subsets_.items() ] averages = pd.DataFrame(avg_score, columns=["k_features", "avg_score"]) ( averages.plot.scatter( x="k_features", y="avg_score", xlabel=xlab, ylabel=metric, title=f"Average {metric}", ax=ax[0], ) ) cv_scores = {int(i): sign * c["cv_scores"] for i, c in sfs_object.subsets_.items() } ( pd.DataFrame(cv_scores).plot.box( xlabel=xlab, title=f"{metric} in CV splits", ax=ax[1], ) ) ax[0].xaxis.set_major_locator(mticker.MaxNLocator(integer=True)) ax[1].xaxis.set_major_locator(mticker.MaxNLocator(integer=True))ifnot sfs_object.forward: ax[1].invert_xaxis() main_title = (f"{sfs_plot_title} with {sfs_object.cv}-fold CV "+f"\n{sub_title}" ) fig.suptitle(main_title) plt.tight_layout() plt.show()# Print resultsifnot sfs_object.interrupted_:if sfs_object.is_parsimonious: note ="[Parsimonious]" k_selected =f"k = {len(sfs_object.k_feature_names_)}" score_at_k =f"avg. {metric} = {sign * sfs_object.k_score_:.3f}" note_2 ="Smallest number of predictors at best ± 1 SE score"else: note ="[Best]"if sign <0: best = averages.nsmallest(1, "avg_score")else: best = averages.nlargest(1, "avg_score") k_selected =f"k = {int(best.k_features.values)}" score_at_k =f"avg. {metric} = {float(best.avg_score.values):.3f}" note_2 ="Number of predictors at best score"print(f"{k_selected}, {score_at_k}{note}\n({note_2})")def sfs_list_features(sfs_result):"""List features by order when they were added. Current implementation correctly works with forward selection only. Args: sfs_result (SFS object) """ scoring = sfs_result.get_params()["scoring"] sign, metric = get_metric_abbreviation_and_sign(scoring) feature_dict = sfs_result.get_metric_dict() lst = [[*feature_dict[i]["feature_names"]] for i in feature_dict] feature = []if sfs_result.forward:for x, y inzip(lst[0::], lst[1::]): feature.append(*set(y).difference(x)) res = pd.DataFrame( {"added_feature": [*lst[0], *feature],"metric": metric,"score": [ sign * feature_dict[i]["avg_score"] for i in feature_dict ], } ).index_start_at(1)else:for x, y inzip(lst[0::], lst[1::]): feature.append(*set(x).difference(y)) res = pd.DataFrame( {"feature": [*feature, *lst[-1]],"metric": metric,"score": [ sign * feature_dict[i]["avg_score"] for i in feature_dict ], } ).index_start_at(1)[::-1]return ( res.assign(score_improvement=lambda x: sign * x.score.diff()) .assign( score_percentage_change=lambda x: sign * x.score.pct_change() *100 ) .rename_axis("step") )# Functions for regression/classificationdef get_classification_scores(model, X, y):"""Calculate scores of classification performance for a model. The following metrics are calculated: - accuracy - balanced accuracy - Cohen's kappa - F1 score - ROC AUC value Args: model: Scikit-learn classifier. X: predictor variables. y: target variable. Returns: dictionary with classification scores. """ y_pred = model.predict(X)return {"Accuracy": accuracy_score(y, y_pred),"BAcc": balanced_accuracy_score(y, y_pred),"Kappa": cohen_kappa_score(y, y_pred),"F1": f1_score(y, y_pred),"ROC AUC": roc_auc_score(y, y_pred), }def print_classification_scores( models, X, y, title="--- All data ---", color="green", precision=3):"""Print classification scores for a set of models. Args: models (dictionary): A dictionary of models. X: predictor variables. y: target variable. title (str, optional): Title to print. Defaults to "--- All data ---". color (str, optional): Text highlight color. Defaults to "green". precision (int, optional): Number of digits after the decimal point. """print(title)print("No information rate: ",round(pd.Series(y).value_counts(normalize=True).max(), precision), ) display( pd.DataFrame.from_dict( { name: get_classification_scores(model, X, y)for name, model in models.items() }, orient="index", ) .style.format(precision=precision) .apply(my.highlight_max, color=color) )
2 Data Exploration and Pre-Processing
2.1 Dataset Description
In this project, travel insurance prediction dataset (version 4) was used. Both dataset and its description are available on Kaggle. The dataset contains the following variables:
Age – age of the customer.
Employment Type – the sector in which the customer is employed.
GraduateOrNot – whether the customer is a college graduate or not.
AnnualIncome – the yearly income of the customer in Indian Rupees (rounded to the nearest 50 thousand Rupees).
FamilyMembers – the number of members in the customer’s family.
ChronicDisease – whether the customer suffers from any major disease or condition like diabetes, high blood pressure, or asthma, etc.
FrequentFlyer – derived data based on the customer’s history of booking air tickets on at least 4 different instances in the last 2 years (2017-2019).
EverTravelledAbroad – has the customer ever traveled to a foreign country (not necessarily using the company’s services).
TravelInsurance – did the customer buy the travel insurance package during the introductory offering held in the year 2019.
The initial overview of the whole dataset and inference on the target variable revealed that:
The dataset has 1987 rows and 9 columns (8 explanatory and 1 target variable).
No missing data is present.
The target variable is binary. The classes of target variable are imbalanced (ratio of frequencies is ~1.8) and the difference between the group sizes is significant (chi-square test, p < 0.001): more frequently clients reject the offer to buy travel insurance.
Column names do not conform to the snake case convention: some are in camel case some have spaces.
Some binary variables have values “Yes” and “No”, some – 1 and 0. This can be unified.
Find the details in the collapsible section below.
To check if target group clients equally frequently buy travel insurance (category code “1”) or reject this offer (category code “0”), the following inferential procedures were used:
chi-square goodness-of-fit test (testing a hypothesis that the group sizes are equal);
Wilson’s 95% confidence intervals of proportions.
Note: The population here is the whole target group of clients represented by the dataset.
print("Difference between the group sizes is significant \n("+ an.test_chi_square_gof(freq)+").")
Difference between the group sizes is significant
(chi-square test, χ²(1, n = 1987) = 161.8, p < 0.001).
Around 35.7% (95% CI: 35.1%-36.3%) of clients accept and 64.3% (95% CI: 64.9%-63.7%) reject the offer to buy travel insurance.
Ratio of frequencies of the two groups:
Code
round(freq[0] / freq[1], 2)
1.8
Code of the figure
# Plotax = my.plot_counts_with_labels(freq_df, rot=0, y_lim_max=1600)chi_sq_rez_1 = an.test_chi_square_gof(freq, output="short")my.ax_axis_comma_format("y")# Get the limits of the x-axis and y-axisxlim = ax.get_xlim()ylim = ax.get_ylim()# Set the position of the annotationx_pos = xlim[1] -0.05* (xlim[1] - xlim[0]) # 5% offset from the right edgey_pos = ylim[1] -0.05* (ylim[1] - ylim[0]) # 5% offset from the top edge# Add the annotation to the top right cornerax.annotate( chi_sq_rez_1.capitalize(), xy=(x_pos, y_pos), xycoords="data", ha="right", va="top",)plt.show()
2.4 Create Train, Validation and Test Sets
To prevent data leakage and to get more rigorous estimates of model performance, the dataset was split into train, validation and test sets in the ratio 70:15:15. The split was stratified by the target variable to take class imbalance into account.
the train set is used to gain more insights about the data, train and tune models,
the validation set is used to test the performance of the candidate models,
the test set is used to evaluate the final model.
Code
# The example to split data into 3 datasetsdata_train, data_vt = train_test_split( data_all, test_size=0.30, random_state=22, stratify=data_all.TravelInsurance)data_validation, data_test = train_test_split( data_vt, test_size=0.5, random_state=22, stratify=data_vt.TravelInsurance)
2.5 EDA on Train Set Before Further Pre-Processing
In addition to the initial overview, a deeper look into the train set was made. It suggested some pre-processing ideas, which are described in the next section.
An extensive EDA was performed using the skimpy, SweetViz and ydata-profiling packages. The results are summarized in the following paragraphs, and the details are presented in the collapsible sections and tables below.
Summary. Based on the sample of 1390 clients (the training set):
The age of the clients (target group) ranges from 25 to 35 years old;
The number of family members ranges from 2 to 9;
Annual income ranges from 300k to 1800k;
Annual income per family member ranges from 33k to 850k;
The majority of clients:
are college graduates;
work in the private sector or are self-employed;
have no chronic diseases;
have never been abroad;
are not frequent flyers.
Analysis of correlations/associations revealed that the status of buying travel insurance is associated with the status if a client ever traveled abroad or if he/she is a frequent and experienced traveler, his/her annual income, annual income per family member, employment type and even the number of family members are associated while graduation status and presence of chronic diseases are not (see Table 2.1). On the other hand, annual income is associated with such variables as employment type, the status of being a frequent/experienced traveler or income per family member (see section “Associations” under “SweetViz” report) so these variables might not contribute much to the prediction of travel insurance purchase in addition to annual income.
Note: Some variables (like age_group and age_grup_01) now have identical information just coded in a different way (one variable is better suited for EDA, the other one for modeling). The redundant variables will be removed later.
EDA: Data Profiling Report of Training Data (SweetViz)
Code
if do_eda: report = sv.analyze( [data_train_preprocessed, "Train"], target_feat="travel_insurance" ) report.show_notebook()
EDA: Data Profiling Report of Training Data (ydata-profiling)
Code
if do_eda: display( eda.ProfileReport( data_train_preprocessed, title="Data Profiling Report: Training Data", config_file="_config/ydata_profile_config--mini.yaml", ) )
Correlation Analysis To explore the relationship between the target variable and the other numeric variables, a point-biserial correlation analysis was conducted. The correlation coefficient (denoted as r in the table below) was calculated, along with the corresponding p-values. Additionally, p-values adjusted for multiple comparisons using the Holm-Šídák method were computed. The correlation analysis results are summarized in Table 2.1.
Code
target ="travel_insurance"non_target_numeric = ( data_train_preprocessed.select_dtypes("number").drop(columns=target).columns)cor_data = [ ( i,*sps.pointbiserialr( data_train_preprocessed[target], data_train_preprocessed[i] ), )for i in non_target_numeric]cor_data = pd.DataFrame(cor_data, columns=["variable_2", "r", "p"])cor_data.insert(0, "variable_1", target)( cor_data.sort_values("r", ascending=False) .assign( p_adj=lambda x: [my.format_p(i, add_p=False) for i in p_adjust(x.p)[1]] ) .assign(r=lambda x: x.r.round(2)) .assign(p=lambda x: x.p.apply(my.format_p, add_p=False)) .index_start_at(1) .style.format({"r": "{:.2f}"}) .bar(vmin=-1, vmax=1, cmap="BrBG", subset=["r"]))
Table 2.1. Point-biserial correlation between the target variable and the numeric/binary features**.
variable_1
variable_2
r
p
p_adj
1
travel_insurance
ever_travelled_abroad
0.44
<0.001
<0.001
2
travel_insurance
annual_income
0.41
<0.001
<0.001
3
travel_insurance
experienced_traveler
0.33
<0.001
<0.001
4
travel_insurance
income_per_family_member
0.27
<0.001
<0.001
5
travel_insurance
frequent_flyer
0.23
<0.001
<0.001
6
travel_insurance
employment_type_01
0.15
<0.001
<0.001
7
travel_insurance
age_group_01
0.12
<0.001
<0.001
8
travel_insurance
age
0.08
0.004
0.016
9
travel_insurance
family_members
0.07
0.010
0.028
10
travel_insurance
graduate_or_not
0.02
0.528
0.778
11
travel_insurance
chronic_diseases
0.01
0.626
0.778
2.8 Prepare Train and Validation Sets for Modeling
Initially, redundant variables such as “age_group” and “employment_type” will be eliminated from the dataset. Although it would be justifiable to remove uncorrelated variables like “graduate_or_not” and “chronic_disease” due to the small number of features, they will be retained for the time being.
Subsequently, the training and validation sets will be prepared for the modeling process. This involves applying appropriate transformations and separating the target variable from the remaining features.
Code
data_train_2 = data_train_preprocessed.drop( columns=["age_group", "employment_type"])[print(i, col) for i, col inenumerate(data_train_2.columns, start=1)];
There were several main stages of the modeling procedure:
Model pre-selection (narrowing down the list of candidate models);
Feature selection (selecting the most important features);
Parameter tuning (tuning the hyperparameters of the selected models);
Evaluation of the final model on the test set.
Balanced accuracy (BAcc) was used as the main metric for model evaluation.
3.1 Model Pre-Selection
In this section, a short list of parameters for 5 types of ML algorithms (logistic regression, k-nearest neighbors, support vector machine for classification (SVC), random forest and Naive Bayes) will be tuned and these 5 models will be compared to select the best candidates for the next step.
First, group-dependent pre-processing pipeline steps will be introduced (this section).
Then, a grid of models and e few hyperparameters will be defined (this section).
Finally, (in the two following subsections) modeling without and with under-sampling will be performed.
Summary. In modeling without under-sampling, the best model is RandomForest, in modeling with under-sampling, the best model is SVC. The kNN model’s performance was similar, so these three candidates were used in the following steps of model selection. The main results are presented in Table 3.2 and Table 3.4.
Details:
Code
# Add group-dependent pre-processing steps which will be performed# before each model re-fittingnumeric_features = ["age","annual_income","family_members","income_per_family_member",]numeric_transformer = Pipeline(steps=[("scaler", StandardScaler())])group_dependent_preprocessor = ColumnTransformer( transformers=[("numeric", numeric_transformer, numeric_features)], remainder="passthrough",)# Create the pipeline with the preprocessor and the classifierpipeline = Pipeline( steps=[ ("preprocessor", group_dependent_preprocessor), ("classifier", BaseEstimator()), # Replace with desired classifier ])
3.1.1 Model Pre-Seleciton (Original Group Size Proportions)
In this sub-section, the issue of class imbalance is not addressed. The original group size proportions are used.
Code
@my.cache_results(dir_cache +"01_1_best_models_orig.pickle")def do_best_models_orig():"""The following code is wrapped into a function for result catching. Select the best model for each classifier. """# Create a list to store the best models best_models_orig = {}# Iterate over the classifiers and perform hyperparameter tuning using# cross-validationfor name, classifier, param_grid in classifiers:# Create the pipeline with the preprocessor, and the classifier pipeline = Pipeline( steps=[ ("preprocessor", group_dependent_preprocessor), ("classifier", classifier), ] )# Perform hyperparameter tuning using cross-validation grid_search = GridSearchCV( pipeline, param_grid, cv=5, n_jobs=-1, verbose=1 ) grid_search.fit(X_train, y_train)# Get the best model best_models_orig[name] = grid_search.best_estimator_return best_models_origbest_models_orig = do_best_models_orig()
Output details
Time: 2m 52.2s
Fitting 5 folds for each of 7 candidates, totalling 35 fits
Fitting 5 folds for each of 8 candidates, totalling 40 fits
Fitting 5 folds for each of 30 candidates, totalling 150 fits
Fitting 5 folds for each of 4 candidates, totalling 20 fits
Fitting 5 folds for each of 1 candidates, totalling 5 fits
The performance of the classification. The best scores of each column are in orange (training set)/green (validation set):
Code
ml.print_classification_scores( best_models_orig, X_train, y_train,"--- Train (original group sizes) ---", color="orange",)
--- Train (original group sizes) ---
No information rate: 0.642
Table 3.1. Classification scores for the train set (original group sizes). The best values in each column are highlighted.
The classification performance was also evaluated using the random undersampling strategy that is used to balance the dataset.
Results: Using the undersampling strategy, SVC performance improved, in the case of logistic regression, some performance coefficients improved (e.g., F1 score), some decreased(e.g., Kappa), and some did not change. In all other cases, this strategy did not improve the performance of classifiers.
Find the details below.
Code
@my.cache_results(dir_cache +"01_2_best_models_undersampling.pickle")def do_best_models_undersampling():"""The following code is wrapped into a function for result catching. Select the best model for each classifier. Undersampling is applied. """# Create a list to store the best models best_models_undersampling = {}# Iterate over the classifiers and perform hyperparameter tuning# using cross-validationfor name, classifier, param_grid in classifiers:# Create the imblearn pipeline with the preprocessor, undersampler,# and the classifier imblearn_pipeline = ImbPipeline( steps=[ ("preprocessor", group_dependent_preprocessor), ("undersampler", RandomUnderSampler(random_state=1)), ("classifier", classifier), ] )# Perform hyperparameter tuning using cross-validation grid_search = GridSearchCV( imblearn_pipeline, param_grid, cv=5, n_jobs=-1, verbose=1 ) grid_search.fit(X_train, y_train)# Get the best model best_models_undersampling[name] = grid_search.best_estimator_return best_models_undersamplingbest_models_undersampling = do_best_models_undersampling()
Output details
Time: 21.8s
Fitting 5 folds for each of 7 candidates, totalling 35 fits
Fitting 5 folds for each of 8 candidates, totalling 40 fits
Fitting 5 folds for each of 30 candidates, totalling 150 fits
Fitting 5 folds for each of 4 candidates, totalling 20 fits
Fitting 5 folds for each of 1 candidates, totalling 5 fits
The performance of the classification. The best scores of each column are in orange (training set)/green (validation set):
Three best-performing models from the previous step are used for sequential feature selection, SFS (in particular, backward feature elimination with 5-fold cross-validation).
The results of SFS analysis are presented in Figure 3.1, Figure 3.2, and Figure 3.3 as well as Table 3.6, Table 3.7, and Table 3.8 and summarized in Table 3.5: despite the fact, that automatic feature selection option (the parsimonious method) suggested a different number of features for each model, the analysis of the mentioned plots revealed that the three most important predictors are essentially the same for all three models: “annual_income”, “family_members”, and “age”/“age_group_01”.
Table 3.5. Feature selection results by classifier type. In all cases, the 3 most important features are similar.
Classifier
Suggested¹ number of features
BAcc
Three most important features²
kNN
3
0.750
“annual_income”, “family_members”, “age”
SVC
7
0.763
“annual_income”, “family_members”, “age_group_01”
RF
3
0.769
“annual_income”, “family_members”, “age”
¹ Parsimonious method (the smallest number of features within 1 standard error from the best performance score) was used to select the number of features.
² Based on figures 3.1, 3.2, and 3.3.
Code
def SFS(classifier):"""Create a Sequential Feature Selector object for the given classifier.""" estimator = best_models_orig[classifier]["classifier"]print(estimator)return SequentialFeatureSelector( estimator, k_features="parsimonious", forward=False, floating=False, scoring="balanced_accuracy", verbose=1, cv=5, n_jobs=-1, )
3.2.1 SFS for kNN
Code
sfs_knn = SFS("kNN")
KNeighborsClassifier(n_neighbors=17)
Code
@my.cache_results(dir_cache +"02_1_sfs_knn_res_bacc.pickle")def do_sfs_knn():"""The following code is wrapped into a function for result catching. SFS for kNN. """return sfs_knn.fit(X_train, y_train)sfs_knn_res = do_sfs_knn()
Output details
Time: 9.6s
F1
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 8 out of 11 | elapsed: 6.8s remaining: 2.5s
[Parallel(n_jobs=-1)]: Done 11 out of 11 | elapsed: 6.9s finished
Features: 10/1[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 6 out of 10 | elapsed: 0.2s remaining: 0.1s
[Parallel(n_jobs=-1)]: Done 10 out of 10 | elapsed: 0.4s finished
Features: 9/1[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 4 out of 9 | elapsed: 0.2s remaining: 0.3s
[Parallel(n_jobs=-1)]: Done 9 out of 9 | elapsed: 0.3s finished
Features: 8/1[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 2 out of 8 | elapsed: 0.2s remaining: 0.7s
[Parallel(n_jobs=-1)]: Done 8 out of 8 | elapsed: 0.2s finished
Features: 7/1[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 7 out of 7 | elapsed: 0.2s finished
Features: 6/1[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 6 out of 6 | elapsed: 0.1s finished
Features: 5/1[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 2 out of 5 | elapsed: 0.1s remaining: 0.2s
[Parallel(n_jobs=-1)]: Done 5 out of 5 | elapsed: 0.1s finished
Features: 4/1[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 4 out of 4 | elapsed: 0.1s remaining: 0.0s
[Parallel(n_jobs=-1)]: Done 4 out of 4 | elapsed: 0.1s finished
Features: 3/1[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 3 out of 3 | elapsed: 0.1s finished
Features: 2/1[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 2 out of 2 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=-1)]: Done 2 out of 2 | elapsed: 0.0s finished
Features: 1/1
Time: 1.8s
BAcc
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 8 out of 11 | elapsed: 0.4s remaining: 0.1s
[Parallel(n_jobs=-1)]: Done 11 out of 11 | elapsed: 0.5s finished
Features: 10/1[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 6 out of 10 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=-1)]: Done 10 out of 10 | elapsed: 0.1s finished
Features: 9/1[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 4 out of 9 | elapsed: 0.0s remaining: 0.1s
[Parallel(n_jobs=-1)]: Done 9 out of 9 | elapsed: 0.1s finished
Features: 8/1[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 2 out of 8 | elapsed: 0.0s remaining: 0.3s
[Parallel(n_jobs=-1)]: Done 8 out of 8 | elapsed: 0.0s finished
Features: 7/1[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 7 out of 7 | elapsed: 0.0s finished
Features: 6/1[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 6 out of 6 | elapsed: 0.0s finished
Features: 5/1[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 2 out of 5 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=-1)]: Done 5 out of 5 | elapsed: 0.0s finished
Features: 4/1[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 4 out of 4 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=-1)]: Done 4 out of 4 | elapsed: 0.0s finished
Features: 3/1[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 3 out of 3 | elapsed: 0.0s finished
Features: 2/1[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 2 out of 2 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=-1)]: Done 2 out of 2 | elapsed: 0.0s finished
Features: 1/1
Table 3.6. Sequential Feature Selection using kNN classifier. In column feature, the row of interest and the rows above it indicate the combination of features included in the model, e.g., row two (step 10) shows that family_members and annual_income are included. Columns score_improvement and score_percentage_change show the difference between the current and the previous row. The highlighted rows indicate the investigator-selected features of the best model.
@my.cache_results(dir_cache +"02_2_sfs_svc_res_bacc.pickle")def do_sfs_svc():"""The following code is wrapped into a function for result catching. SFS for SVC. """return sfs_svc.fit(X_train, y_train)sfs_svc_res = do_sfs_svc()
Output details
Time: Time: 1m 2.3s
F1
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 8 out of 11 | elapsed: 8.1s remaining: 3.0s
[Parallel(n_jobs=-1)]: Done 11 out of 11 | elapsed: 11.5s finished
Features: 10/1[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 6 out of 10 | elapsed: 6.4s remaining: 4.2s
[Parallel(n_jobs=-1)]: Done 10 out of 10 | elapsed: 9.0s finished
Features: 9/1[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 4 out of 9 | elapsed: 5.3s remaining: 6.7s
[Parallel(n_jobs=-1)]: Done 9 out of 9 | elapsed: 7.6s finished
Features: 8/1[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 2 out of 8 | elapsed: 5.3s remaining: 16.2s
[Parallel(n_jobs=-1)]: Done 8 out of 8 | elapsed: 6.0s finished
Features: 7/1[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 7 out of 7 | elapsed: 5.1s finished
Features: 6/1[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 6 out of 6 | elapsed: 4.8s finished
Features: 5/1[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 2 out of 5 | elapsed: 3.1s remaining: 4.7s
[Parallel(n_jobs=-1)]: Done 5 out of 5 | elapsed: 4.0s finished
Features: 4/1[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 4 out of 4 | elapsed: 3.6s remaining: 0.0s
[Parallel(n_jobs=-1)]: Done 4 out of 4 | elapsed: 3.6s finished
Features: 3/1[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 3 out of 3 | elapsed: 3.8s finished
Features: 2/1[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 2 out of 2 | elapsed: 2.1s remaining: 0.0s
[Parallel(n_jobs=-1)]: Done 2 out of 2 | elapsed: 2.1s finished
Features: 1/1
Time: 58.4s
BAcc
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 8 out of 11 | elapsed: 10.2s remaining: 3.8s
[Parallel(n_jobs=-1)]: Done 11 out of 11 | elapsed: 11.6s finished
Features: 10/1[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 6 out of 10 | elapsed: 5.5s remaining: 3.6s
[Parallel(n_jobs=-1)]: Done 10 out of 10 | elapsed: 8.0s finished
Features: 9/1[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 4 out of 9 | elapsed: 5.1s remaining: 6.4s
[Parallel(n_jobs=-1)]: Done 9 out of 9 | elapsed: 7.1s finished
Features: 8/1[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 2 out of 8 | elapsed: 4.8s remaining: 14.7s
[Parallel(n_jobs=-1)]: Done 8 out of 8 | elapsed: 5.2s finished
Features: 7/1[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 7 out of 7 | elapsed: 4.7s finished
Features: 6/1[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 6 out of 6 | elapsed: 4.3s finished
Features: 5/1[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 2 out of 5 | elapsed: 3.1s remaining: 4.7s
[Parallel(n_jobs=-1)]: Done 5 out of 5 | elapsed: 3.8s finished
Features: 4/1[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 4 out of 4 | elapsed: 3.4s remaining: 0.0s
[Parallel(n_jobs=-1)]: Done 4 out of 4 | elapsed: 3.4s finished
Features: 3/1[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 3 out of 3 | elapsed: 3.6s finished
Features: 2/1[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 2 out of 2 | elapsed: 1.9s remaining: 0.0s
[Parallel(n_jobs=-1)]: Done 2 out of 2 | elapsed: 1.9s finished
Features: 1/1
Code
ml.sfs_plot_results(sfs_svc_res, "SVM for classification (SVC)")
k = 7, avg. BAcc = 0.763 [Parsimonious]
(Smallest number of predictors at best ± 1 SE score)
Table 3.7. Sequential Feature Selection using SVC classifier. For details, refer to the description of Table 3.6. The highlighted rows indicate the investigator-selected features of the best model.
feature
metric
score
score_improvement
score_percentage_change
step
11
annual_income
BAcc
0.709
nan
nan
10
family_members
BAcc
0.731
0.022
3.170
9
age_group_01
BAcc
0.757
0.025
3.441
8
frequent_flyer
BAcc
0.753
-0.003
-0.428
7
graduate_or_not
BAcc
0.760
0.007
0.905
6
ever_travelled_abroad
BAcc
0.759
-0.001
-0.089
5
chronic_diseases
BAcc
0.763
0.004
0.513
4
experienced_traveler
BAcc
0.763
-0.000
-0.058
3
employment_type_01
BAcc
0.758
-0.005
-0.694
2
age
BAcc
0.738
-0.019
-2.572
1
income_per_family_member
BAcc
0.715
-0.023
-3.120
3.2.3 SFS for RF
Code
sfs_rf = SFS("Random Forest")
RandomForestClassifier(random_state=1)
Code
@my.cache_results(dir_cache +"02_3_sfs_rf_res_bacc.pickle")def do_sfs_rf():"""The following code is wrapped into a function for result catching. SFS for Random Forest. """return sfs_rf.fit(X_train, y_train)sfs_rf_res = do_sfs_rf()
Output details
Time: 54.8s
F1
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 8 out of 11 | elapsed: 5.6s remaining: 2.1s
[Parallel(n_jobs=-1)]: Done 11 out of 11 | elapsed: 8.3s finished
Features: 10/1[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 6 out of 10 | elapsed: 5.3s remaining: 3.5s
[Parallel(n_jobs=-1)]: Done 10 out of 10 | elapsed: 7.8s finished
Features: 9/1[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 4 out of 9 | elapsed: 5.0s remaining: 6.3s
[Parallel(n_jobs=-1)]: Done 9 out of 9 | elapsed: 7.3s finished
Features: 8/1[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 2 out of 8 | elapsed: 5.2s remaining: 15.7s
[Parallel(n_jobs=-1)]: Done 8 out of 8 | elapsed: 5.6s finished
Features: 7/1[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 7 out of 7 | elapsed: 4.7s finished
Features: 6/1[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 6 out of 6 | elapsed: 4.2s finished
Features: 5/1[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 2 out of 5 | elapsed: 3.5s remaining: 5.3s
[Parallel(n_jobs=-1)]: Done 5 out of 5 | elapsed: 3.7s finished
Features: 4/1[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 4 out of 4 | elapsed: 3.9s remaining: 0.0s
[Parallel(n_jobs=-1)]: Done 4 out of 4 | elapsed: 3.9s finished
Features: 3/1[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 3 out of 3 | elapsed: 3.2s finished
Features: 2/1[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 2 out of 2 | elapsed: 2.5s remaining: 0.0s
[Parallel(n_jobs=-1)]: Done 2 out of 2 | elapsed: 2.5s finished
Features: 1/1
Time: 29.3s
BAcc
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 8 out of 11 | elapsed: 2.8s remaining: 1.0s
[Parallel(n_jobs=-1)]: Done 11 out of 11 | elapsed: 4.1s finished
Features: 10/1[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 6 out of 10 | elapsed: 2.9s remaining: 1.9s
[Parallel(n_jobs=-1)]: Done 10 out of 10 | elapsed: 4.1s finished
Features: 9/1[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 4 out of 9 | elapsed: 2.5s remaining: 3.2s
[Parallel(n_jobs=-1)]: Done 9 out of 9 | elapsed: 4.0s finished
Features: 8/1[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 2 out of 8 | elapsed: 2.5s remaining: 7.7s
[Parallel(n_jobs=-1)]: Done 8 out of 8 | elapsed: 3.0s finished
Features: 7/1[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 7 out of 7 | elapsed: 2.6s finished
Features: 6/1[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 6 out of 6 | elapsed: 2.4s finished
Features: 5/1[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 2 out of 5 | elapsed: 2.1s remaining: 3.2s
[Parallel(n_jobs=-1)]: Done 5 out of 5 | elapsed: 2.2s finished
Features: 4/1[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 4 out of 4 | elapsed: 2.1s remaining: 0.0s
[Parallel(n_jobs=-1)]: Done 4 out of 4 | elapsed: 2.1s finished
Features: 3/1[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 3 out of 3 | elapsed: 1.3s finished
Features: 2/1[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 2 out of 2 | elapsed: 0.9s remaining: 0.0s
[Parallel(n_jobs=-1)]: Done 2 out of 2 | elapsed: 0.9s finished
Features: 1/1
Table 3.8. Sequential Feature Selection using Random Forest classifier. For details, refer to the description of Table 3.6. The highlighted rows indicate the investigator-selected features of the best model.
# Tune RF model@my.cache_results(dir_cache +"03_3_tune_rf_randomizedsearch.pickle")def tune_rf():# Features required_features = ["age", "annual_income", "family_members"] numeric_features = required_features# Group-dependent transformations numeric_transformer = Pipeline(steps=[("scaler", StandardScaler())]) group_dependent_preprocessor = ColumnTransformer( transformers=[("numeric", numeric_transformer, numeric_features)], remainder="passthrough", )# Pipeline pipeline = Pipeline( steps=[ ("preprocessor", group_dependent_preprocessor), ("classifier", RandomForestClassifier(random_state=1, oob_score=False), ), ] ) param_grid_rf = {# Number of trees in random forest"classifier__n_estimators": [50, *np.arange(100, 3000, step=100)],# Number of features to consider at every split"classifier__max_features": [1, 2, 3],# Maximum number of levels in tree"classifier__max_depth": list(np.arange(5, 100, step=5)) + [None],# Minimum number of samples required to split a node"classifier__min_samples_split": np.arange(2, 11, step=1),# Minimum number of samples required at each leaf node"classifier__min_samples_leaf": np.arange(1, 11, step=1),# Method of selecting samples for training each tree"classifier__bootstrap": [True, False], }# Perform hyperparameter tuning using cross-validation search = RandomizedSearchCV( pipeline, param_grid_rf, n_iter=200, cv=5, scoring="balanced_accuracy", n_jobs=-1, verbose=3, ) search.fit(X_train[required_features], y_train)return searchtuning_results_rf = tune_rf()
Output details
Fitting 5 folds for each of 200 candidates, totalling 1000 fits
Time: 14m 26.1s
Code
print("\n--- Best RF model: ---\n",f"\nBAcc (train CV) score: {tuning_results_rf.best_score_:.3f}","\n\nParameters: ",)pprint(tuning_results_rf.best_params_)print("\nFeatures:")pprint([*tuning_results_rf.best_estimator_.feature_names_in_])
Table 3.9. Hyperparameter tuning results for the training set The best values in each column are highlighted. Note: The BAcc scores here and above differ as scores in this table are not cross-validation scores.
Table 3.10. Hyperparameter tuning results for the validation set The best values in each column are highlighted.
Accuracy
BAcc
Kappa
F1
ROC AUC
kNN
0.785
0.742
0.509
0.663
0.742
SVC
0.782
0.746
0.509
0.670
0.746
RF
0.822
0.767
0.579
0.697
0.767
3.4 Final Model Evaluation
3.4.1 Prepare Data for Final Model Evaluation
First, merge training and test data to form a dataset that will be used for creating the model.
Second, separate the predictors and the target variable.
Code
target ="TravelInsurance"# Train + validation set for training modeldata_train_final = pd.concat([data_train, data_validation])X_train_final = data_train_final.drop(target, axis=1)y_train_final = data_train_final[target].to_numpy()# Test set for testing modelX_test = data_test.drop(target, axis=1)y_test = data_test[target].to_numpy()
3.4.2 Create Final Pipeline
Now, the final pre-processing and prediction pipeline will be created and fitted:
First, define the final pipeline.
Second, fit the pipeline to the training (training + validation) data.
The analysis of the final model performance revealed that:
The balanced accuracy of the final model is 76.7% (see Table 3.11).
The model’s specificity (96.4%) is larger than sensitivity/recall (57.0%; see Figure 3.4 and Table 3.12) which means that it is easier to identify customers that will not buy insurance than those who will do.
The positive predictive value (precision) is 89.7% and the negative predictive value is 80.1% (see Table 3.12). This indicates that predictions are quite accurate.
The most important predictor is annual income (see Figure 3.5).
The model trained on the whole dataset (training + validation + test) will be saved to a file. It will be used in the production environment to predict whether a customer will buy insurance or not.
Currently, some Python packages (e.g., SweetViz) are not updated on PyPi and Conda repositories so versions of these packages with custom-made changes are used. So installation instructions are not fully reproducible.
In the recursive feature selection part, the under-sampling strategy was not used, which might have led to inferior results for the SVC model. Yet, as all models indicated essentially the same variables as the most important ones, no big difference in results is expected.
In some parts, code could have been generalized into functions to avoid code repetition. Yet, as the code was not repeated more than 2-3 times, it was not done.
It might be questionable if the “preselect model”-“select features”-“tune model” strategy is the best one.
The final RF model used 1000 trees. It could have been explicitly explored if a smaller number of trees would have been sufficient to achieve the same performance. It would be an advantage in terms of computational time and model size.
More feature engineering could have been done. For example, it might be beneficial to categorize the annual income variable and test its suitability for classification.
For models, which allow that, different weights could have been tried.
Instead of some custom-made transformers, Scikit-learn’s transformers could have been used (e.g., OneHotEncoder).