Red Wine Quality Prediction

Data Analysis Project

Author

Vilmantas Gėgžna

Published

2023-02-07

Updated

2023-07-30

Project logo. Generated with Leonardo.Ai.

Data analysis tools: Python, R, Looker Studio Helper tools: Quarto, RStudio, Git Skills:

data pre-processing
exploratory data analysis (EDA):
- descriptive statistics
- data visualization
- exploratory PCA (principal component analysis)
inferential statistics:
- hypothesis testing
- confidence intervals
statistical modelling:
- linear regression
- logistic regression
literate programming
statistical programming
dashboarding

Technical requirements:

Use a combination of R and Python programming languages in a single document.
Create a dashboard in Looker Studio.

Abbreviations

AIC – Akaike information criterion.
CI – 95% confidence interval.
CV – cross-validation.
Dim – dimension (principal component, the same as PC).
EDA – exploratory data analysis.
n – either sample or group size.
p – p-value.
p_adj – p-value (adjusted).
PCA – principal component analysis.
PC – principal component.
r – correlation coefficient.
R² – r squared, coefficient of determination.
RNG – (pseudo)random number generator.
RMSE – root mean squared error.
SFS – sequential feature selection.
VIF – variation inflation factor.

1 Introduction

Wine quality assessment and certification are important for wine making and selling processes. Certification helps to prevent counterfeiting and in this way protects people’s health as well as assures quality for the wine market. To identify the most influential factors is crucial for effective quality evaluation while to classify wines into quality groups is useful for setting prices (Cortez et al. 2009). Vinho verde is wine from the Minho region, which is located in the northwest part of Portugal. In this project, data of red vinho verde wine specimens are investigated. The dataset was originally provided by (Cortez et al. 2009). This project mainly focuses on:

prediction quality (to investigate how accurately (1) wine quality and (2) alcohol content in wine can be predicted) and
explanation (to identify, which are the most important factors for the prediction tasks).

1.1 The Dataset

Various aspects of the dataset are described in various sources. Pieces of description from article of (Cortez et al. 2009) as well as from websites (“Red Wine Quality — Kaggle.com”; “Wine Quality Datasets — www3.dsi.uminho.pt”) were copied, merged and presented in this sub-section. Variables based on physicochemical tests:

fixed acidity ( $g_{tartaric acid} / l$ ): most acids involved with wine or fixed or nonvolatile (do not evaporate readily).
volatile acidity ( $g_{acetic acid} / l$ ): the amount of acetic acid in wine, which at too high levels can lead to an unpleasant, vinegar taste.
citric acid ( $g / l$ ): found in small quantities, citric acid can add ‘freshness’ and flavor to wines.
residual sugar ( $g / l$ ): the amount of sugar remaining after fermentation stops, it’s rare to find wines with less than 1 gram/liter and wines with greater than 45 grams/liter are considered sweet.
chlorides ( $g_{sodium chloride} / l$ ): the amount of salt in the wine.
free sulfur dioxide ( $m g / l$ ): the free form of SO₂ exists in equilibrium between molecular SO₂ (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine.
total sulfur dioxide ( $m g / l$ ): amount of free and bound forms of SO₂; in low concentrations, SO₂ is mostly undetectable in wine, but at free SO₂ concentrations over 50 ppm, SO₂ becomes evident in the nose and taste of wine.
density ( $g / c m^{3}$ ): the density of water is close to that of water depending on the percent alcohol and sugar content.
pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale.
sulphates: a wine additive which can contribute to sulfur dioxide gas (SO₂) levels, which acts as an antimicrobial and antioxidant.
alcohol (% vol.): the percent alcohol content of the wine.

Variable based on sensory data:

quality (score between 0 and 10).

2 Methods

2.1 R and Python in a Single Analysis

RStudio is an IDE for R and Python. In Quarto (.qmd) and R Markdown (.rmd) documents, RStudio supports an interface that allows using R and Python in a single document (e.g., we can access Python objects in R code cells by using py$<python_object_name> accessor, and access R objects in Python code cells by using r.<r_object_name> accessor). So additional technical tasks for this project were to:

Try RStudio capabilities for Python programming (documentation, data analysis, plotting, etc.).
Try RStudio interface between R and Python: use these 2 tools in the same file.

Main principles how it was leveraged between R and Python in this project:

Python is used for all steps which are directly related to putting model into production.
Some exploratory analysis steps and other tasks that are easier to perform in R, are done in R.

2.2 Population and Sample

The population of interest is red vinho verde wines. It is assumed that the data is a simple random sample or its equivalent that allows using methods of inferential statistics.

2.3 Training and Test Sets

As the prediction step will be included in the analysis, the whole sample is divided into model training and test sets using 80:20 train/test hold-out strategy with stratification by wine quality scores.

Training set was used for:

exploratory analysis;
statistical inference;
model creation and selection.

Test set was used only to evaluate the performance of the final models.

2.4 General: Inferential Statistics

For statistical inference, significance level $α = 0.05$ and 95% confidence level are used.

2.5 Differences Between Groups

For statistical inference about several groups (categories), the following strategy is used:

first, 95% confidence intervals (CI) for each group are calculated,
second, an omnibus test is performed.

In the analysis of counts (sizes of wine quality groups),
the following methods are used:
- confidence intervals: Goodman’s simultaneous confidence intervals of proportions;
- omnibus: Pearson’s chi-square (χ²) goodness-of-fit (GOF) test,
  - hypotheses:
    $H_{0} :$ All proportions are equal: $π_{1} = π_{2} = . . . = π_{i} = . . . = π_{k}$
    $H_{1} :$ At least two proportions differ: $π_{i} \neq π_{j}$ for at least single pair of i and j.
Here $π$ (pi) is a proportion (relative frequency, percentage) of values in certain group,
$i$ and $j$ are group indices ( $i$ = 1, 2, …, $k$ ; $j$ = 1, 2, …, $k$ ; $i \neq j$ ),
$k$ – total number of groups.

2.6 Correlation Between Variables

For relationship between numeric variables, parametric Pearson’s correlation analysis is used.

Hypotheses:
$H_{0} :$ $ρ = 0$ (variables do not correlate),
$H_{1} :$ $ρ \neq 0$ (variables correlate).

Here $ρ$ (rho) is population Pearson’s correlation coefficient.

In statistical significance testing, as hypotheses are tested multiple times, to prevent inflated type 1 error rate, Holm correction is applied.

2.7 Classification Task

To predict wine quality, quality scores were merged into 3 groups (“low”, “medium”, “high”) and this variable was modeled by using multinomial logistic regression. For feature selection, statistical inference based strategy is applied. The main measure to evaluate overall classification performance is Cohen’s kappa (κ).

2.8 Regression Task

To predict alcohol content, linear regression is used. For feature selection sequential feature selection (SFS) with 5-fold cross validation was used. The main metric for model performance was R². For the final evaluation, RMSE was included too.

2.9 Feature Engineering

In some cases data were right-skewed. So this type of variables were investigated on the original scale (not transformed) and after the logarithmic transformation.

3 Preparation and Inspection

3.1 Setup

Both R and Python will be used in this project. R code is marked with blue and Python code with yellow stripes.

Code: R setup

# R setup ================================================
# R packages ---------------------------------------------
library(tidyverse)
library(reticulate)
library(factoextra)
library(DescTools)
library(patchwork)

# R options ----------------------------------------------
options(scipen = 5)
DescToolsOptions(stamp = NULL, digits = 3)

# Default ggplot2 theme
theme_set(theme_bw())

# Other options
knitr::opts_chunk$set(fig.height = 3.5, fig.width = 6, fig.align = "center")

# RNG state
set.seed(100)

Code: Python setup

# Python setup =========================================

# Packages and modules ---------------------------------

# Utilities
import warnings

# Data wrangling, math
import numpy as np
import pandas as pd
import janitor
from numpy import log

# Statistical analysis and modeling
import patsy
import scipy.stats as sps
import statsmodels.api as sm
import statsmodels.stats.api as sms
import statsmodels.formula.api as smf
import pingouin as pg
import scikit_posthocs as sp
from statsmodels.stats.outliers_influence import variance_inflation_factor
from scipy.stats import ttest_1samp

# Machine learning
from sklearn.preprocessing import StandardScaler

Code: Python setup

from sklearn.metrics import mean_squared_error as mse, r2_score

Code: Python setup

from sklearn.metrics import cohen_kappa_score, classification_report
from sklearn.metrics import confusion_matrix, balanced_accuracy_score
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from mlxtend.feature_selection import SequentialFeatureSelector

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns


# Python Settings -----------------------------------------
# Default plot options
plt.rc("figure", titleweight="bold")
plt.rc("axes", labelweight="bold", titleweight="bold")
plt.rc("font", weight="normal", size=10)
plt.rc("figure", figsize=(7, 3))

# Pandas options
pd.set_option("display.max_rows", 1000)
pd.set_option("display.max_columns", 20)
pd.set_option("display.max_colwidth", 40)  # Possible option: None
pd.set_option("display.float_format", lambda x: f"{x:.3f}")
pd.set_option("styler.format.thousands", ",")

# RNG state
np.random.seed(100)

Code: Ad-hoc Python functions

# Ad-hoc Python functions
def format_p0(p):
    """Format p values at 3 decimal places.

    Args:
        p (float): p value (number between 0 and 1).
    """
    if p < 0.001:
        return "<0.001"
    elif p > 0.999:
        return ">0.999"
    else:
        return f"{p:.3f}"


# Inferential statistics
def ci_proportion_multinomial(
    counts,
    method: str = "goodman",
    n_label: str = "n",
    percent_label: str = "percent",
) -> pd.DataFrame:
    """Calculate  simultaneous confidence intervals for multinomial proportion.

    More information in documentation of statsmodels'
    multinomial_proportions_confint.

    Args:
        x (int): ps.Series, list or tuple with count data.
        method (str, optional): Method. Defaults to "goodman".
       n_label (str, optional): Name for column for counts.
       percent_label (str, optional): Name for column for percentage values.

    Returns:
        pd.DataFrame: _description_

    Examples:
    >>> ci_proportion_multinomial([62, 33, 55])
    """
    assert type(counts) in [pd.Series, list, tuple]
    if type(counts) is not pd.Series:
        counts = pd.Series(counts)

    return pd.concat(
        [
            (counts).rename(n_label),
            (counts / sum(counts)).rename(percent_label) * 100,
            pd.DataFrame(
                sms.multinomial_proportions_confint(counts, method=method),
                index=counts.index,
                columns=["ci_lower", "ci_upper"],
            )
            * 100,
        ],
        axis=1,
    )


def test_chi_square_gof(f_obs, f_exp="all equal") -> str:
    """Chi squared (χ²) goodness-of-fit (gof) test

    Args:
        f_obs (list[int]): Observed frequencies
        f_exp str, list[int]: List of expected frequencies or "all equal" if
              all frequencies are equal to the mean of observed frequencies.
              Defaults to "all equal".

    Returns:
        str: test p value.
    """
    k = len(f_obs)
    n = sum(f_obs)
    exp = n / k
    dof = k - 1
    if f_exp == "all equal":
        f_exp = [exp for _ in range(k)]
    stat, p = sps.chisquare(f_obs=f_obs, f_exp=f_exp)
    # May also be formatted this way:
    return (
        "Chi square test, "
        f"χ²({dof}, n = {n}) = {round(stat, 2)}, p{format_p0(p)}"
    )


def pairwise_correlation(data, method="pearson"):
    """Correlation analysis (including p value calculation)

    Args:
      data (pandas.DataFrame): dataset with numeric variables
      method (str): correlation method, e.g., "pearson" or "spearman".

    Return:
      pandas.DataFrame with the resilts of correlation analysis.

    """
    return (
        pg.pairwise_corr(data, method=method, padjust="holm")
        .sort_values(by="r", key=abs, ascending=False)
        .transform_column("p-corr", format_p0, "p_adj")
        .select_columns(["X", "Y", "r", "CI95%", "p_adj"])
    )


def to_3_quality_groups(score, numeric=False):
    """
    Wine quality score will be recoded as 3 quality groups:

    - score ≤ 4 = low,
    - score [5-6] = medium,
    - score ≥ 7 = high.

    args:
      score (int, float): numeric value from 0 to 10
      numeric (bool):
         If True, this output dedicated for multinomial logistic regression:
         low = 0, medium = 2 (reference group), high = 1.
    """
    assert 0 <= score <= 10
    score = int(score)

    if numeric:
        if score <= 4:
            return 0
        elif 5 <= score <= 6:
            # Used as a reference group
            return 2
        elif 7 <= score:
            return 1
    else:
        if score <= 4:
            return "low"
        elif 5 <= score <= 6:
            return "medium"
        elif 7 <= score:
            return "high"


# Functions for regression/classification
def do_sfs_lin_reg(formula, data, forward=True):
    """Do sequential feature selection for Linear Regression

    Args.:
      formula (str): R style formula
      data (pandas.DataFrame)
      forward (bool): True – forward selection
                      False – backward selection
    """
    lin_regression = LinearRegression(fit_intercept=True)

    sfs = SequentialFeatureSelector(
        lin_regression,
        k_features="parsimonious",
        forward=forward,
        floating=True,
        scoring="r2",
        verbose=0,
        cv=5,
    )

    y, X = patsy.dmatrices(formula + "+ 0", data, return_type="dataframe")

    return sfs.fit(X, y)


def get_sfs_performance_lin_reg(sfs_object, n_features):
    """Return performance measures with certain number of features.

    Args.:
        sfs_object: result of function do_sfs_lin_reg()
        n_features (int): number of features.
    """
    md = round(
        np.median(sfs_object.get_metric_dict()[n_features]["cv_scores"]), 3
    )
    return {
        "n_features": n_features,
        "mean R²": round(
            sfs_object.get_metric_dict()[n_features]["avg_score"], 3
        ),
        "median R²": md,
        "sd R²": round(sfs_object.get_metric_dict()[n_features]["std_dev"], 3),
    }


def show_sfs_results_lin_reg(sfs_object, sub_title=""):
    """Show results of do_sfs_lin_reg()

    Args.:
      sfs_object: result of function do_sfs_lin_reg()
      sub_title (str): second line of title.
    """
    if sfs_object.forward:
        sfs_type = "Forward"
    else:
        sfs_type = "Backward"

    plt.clf()
    fig, ax = plt.subplots(1, 2)

    avg_score = [(i, c["avg_score"]) for i, c in sfs_object.subsets_.items()]
    (
        pd.DataFrame(
            avg_score, columns=["n_features", "avg_score"]
        ).plot.scatter(
            x="n_features",
            y="avg_score",
            xlabel="Number of features included",
            ylabel="Average R²",
            ylim=(0.1, 0.8),
            ax=ax[0],
        )
    )

    cv_scores = {i: c["cv_scores"] for i, c in sfs_object.subsets_.items()}
    (
        pd.DataFrame(cv_scores).plot.box(
            xlabel="Number of features included",
            ylabel="R²",
            ylim=(0.1, 0.8),
            ax=ax[1],
        )
    )

    if not sfs_object.forward:
        ax[1].invert_xaxis()

    main_title = (
        f"{sfs_type} Feature Selection with {sfs_object.cv}-fold CV "
        + f"\n{sub_title}"
    )

    fig.suptitle(main_title)
    plt.tight_layout()
    plt.show()

    # Print results
    n_selected = f"n = {len(sfs_object.k_feature_names_)}"
    r_sq_selected = f"avg. R² = {sfs_object.k_score_:.3f}"

    print(main_title)
    print("\nSelected variables:")
    print(f"[ {n_selected}, {r_sq_selected} ]\n")
    for i, name in enumerate(sfs_object.k_feature_names_, start=1):
        print(f"{i}. {name}")


def get_regression_performance(y_true, y_pred, name=""):
    """Evaluate regression model performance

    Calculate R², RMSE, and SD of predicted variable

    Args.:
      y_true, y_pred: true and predicted numeric values.
      name (str): the name of investigated set.
    """
    return (
        pd.DataFrame(
            {
                "set": name,
                "n": len(y_true),
                "SD": [float(np.std(y_true))],
                "RMSE": [float(np.sqrt(mse(y_true, y_pred)))],
                "R²": [r2_score(y_true, y_pred)],
            }
        )
        .eval("RMSE_SD_ratio = RMSE/SD")
        .eval("SD_RMSE_ratio = SD/RMSE")
    )


def predict_class_mnlogit(model, data):
    """Predict classes from multinomial logit (mnlogit) model.

    Args.:
        model: trained mnlogit model.
        data: pandas.DataFrame

    Return: numeric class codes.
    """
    return np.asarray(model.predict(data)).argmax(1)


def print_classification_report(true_class, predicted_class):
    """Print summary of classification performance

    Args.:
        true_class, predicted_class: data sequences of the same length:
                                     with class names/indicators.
    """
    kappa = cohen_kappa_score(true_class, predicted_class)
    print(f"kappa = {kappa:.3f}")
    print(classification_report(true_class, predicted_class))
    print("Confusion matrix (rows - true, columns - predicted):")
    print(confusion_matrix(true_class, predicted_class))


def do_mnlogit(formula, data):
    """Perform multivatiate logistic regression and print the results.

    formula (str): model formula
    data (pandas.DataFrame): dataset
    """

    response_var = formula.split("~")[0].strip()

    # Do multinomial logistic regression
    mnlogit_model = smf.mnlogit(formula, data).fit()

    # Evaluate performance
    print(mnlogit_model.summary2())

    # Classification report
    gr_pred = predict_class_mnlogit(mnlogit_model, data)
    gr_true = data[response_var]
    print_classification_report(gr_true, gr_pred)

    # Output
    return mnlogit_model

Code: Python functions by other authors

# Class for linear regression diagnostics with R-style diagnostic plots

# NOTE: This class for linear regression assumption checking should be in
# a separate file, but RStudio failed to load custom Python modules.

# Source:
# https://www.statsmodels.org/dev/examples/notebooks/generated/linear_regression_diagnostics_plots.html

# Base code
import numpy as np
import seaborn as sns
import statsmodels
from statsmodels.tools.tools import maybe_unwrap_results
from statsmodels.graphics.gofplots import ProbPlot
from statsmodels.stats.outliers_influence import variance_inflation_factor
import matplotlib.pyplot as plt
from typing import Type

style_talk = "seaborn-talk"  # refer to plt.style.available


class Linear_Reg_Diagnostic:
    """
    Diagnostic plots to identify potential problems in a linear regression fit.
    Mainly,
        a. non-linearity of data
        b. Correlation of error terms
        c. non-constant variance
        d. outliers
        e. high-leverage points
        f. collinearity

    Author:
        Prajwal Kafle (p33ajkafle@gmail.com, where 3 = r)
        Does not come with any sort of warranty.
        Please test the code one your end before using.
    """

    def __init__(
        self,
        results: Type[
            statsmodels.regression.linear_model.RegressionResultsWrapper
        ],
    ) -> None:
        """
        For a linear regression model, generates following diagnostic plots:

        a. residual
        b. qq
        c. scale location and
        d. leverage

        and a table

        e. vif

        Args:
            results (Type[statsmodels.regression.linear_model.RegressionResultsWrapper]):
            must be instance of statsmodels.regression.linear_model object

        Raises:
            TypeError: if instance does not belong to above object

        Example:
        >>> import numpy as np
        >>> import pandas as pd
        >>> import statsmodels.formula.api as smf
        >>> x = np.linspace(-np.pi, np.pi, 100)
        >>> y = 3*x + 8 + np.random.normal(0,1, 100)
        >>> df = pd.DataFrame({'x':x, 'y':y})
        >>> res = smf.ols(formula= "y ~ x", data=df).fit()
        >>> cls = Linear_Reg_Diagnostic(res)
        >>> cls(plot_context="seaborn-paper")

        In case you do not need all plots you can also independently
        make an individual plot/table
        in following ways

        >>> cls = Linear_Reg_Diagnostic(res)
        >>> cls.residual_plot()
        >>> cls.qq_plot()
        >>> cls.scale_location_plot()
        >>> cls.leverage_plot()
        >>> cls.vif_table()
        """

        if (
            isinstance(
                results,
                statsmodels.regression.linear_model.RegressionResultsWrapper,
            )
            is False
        ):
            raise TypeError(
                "result must be instance of "
                "statsmodels.regression.linear_model.RegressionResultsWrapper "
                "object"
            )

        self.results = maybe_unwrap_results(results)

        self.y_true = self.results.model.endog
        self.y_predict = self.results.fittedvalues
        self.xvar = self.results.model.exog
        self.xvar_names = self.results.model.exog_names

        self.residual = np.array(self.results.resid)
        influence = self.results.get_influence()
        self.residual_norm = influence.resid_studentized_internal
        self.leverage = influence.hat_matrix_diag
        self.cooks_distance = influence.cooks_distance[0]
        self.nparams = len(self.results.params)

    def __call__(self, plot_context="seaborn-paper"):
        # print(plt.style.available)
        with plt.style.context(plot_context):
            fig, ax = plt.subplots(nrows=2, ncols=2, figsize=(10, 10))
            self.residual_plot(ax=ax[0, 0])
            self.qq_plot(ax=ax[0, 1])
            self.scale_location_plot(ax=ax[1, 0])
            self.leverage_plot(ax=ax[1, 1])
            plt.show()

        # self.vif_table()
        return fig, ax

    def residual_plot(self, ax=None):
        """
        Residual vs Fitted Plot

        Graphical tool to identify non-linearity.
        (Roughly) Horizontal red line is an indicator that the residual has
        a linear pattern
        """
        if ax is None:
            fig, ax = plt.subplots()

        sns.residplot(
            x=self.y_predict,
            y=self.residual,
            lowess=True,
            scatter_kws={"alpha": 0.5},
            line_kws={"color": "red", "lw": 1, "alpha": 0.8},
            ax=ax,
        )

        # annotations
        residual_abs = np.abs(self.residual)
        abs_resid = np.flip(np.sort(residual_abs))
        abs_resid_top_3 = abs_resid[:3]
        for i, _ in enumerate(abs_resid_top_3):
            ax.annotate(i, xy=(self.y_predict[i], self.residual[i]), color="C3")

        ax.set_title("Residuals vs Fitted", fontweight="bold")
        ax.set_xlabel("Fitted values")
        ax.set_ylabel("Residuals")
        return ax

    def qq_plot(self, ax=None):
        """
        Standarized Residual vs Theoretical Quantile plot

        Used to visually check if residuals are normally distributed.
        Points spread along the diagonal line will suggest so.
        """
        if ax is None:
            fig, ax = plt.subplots()

        QQ = ProbPlot(self.residual_norm)
        QQ.qqplot(line="45", alpha=0.5, lw=1, ax=ax)

        # annotations
        abs_norm_resid = np.flip(np.argsort(np.abs(self.residual_norm)), 0)
        abs_norm_resid_top_3 = abs_norm_resid[:3]
        for r, i in enumerate(abs_norm_resid_top_3):
            ax.annotate(
                i,
                xy=(
                    np.flip(QQ.theoretical_quantiles, 0)[r],
                    self.residual_norm[i],
                ),
                ha="right",
                color="C3",
            )

        ax.set_title("Normal Q-Q", fontweight="bold")
        ax.set_xlabel("Theoretical Quantiles")
        ax.set_ylabel("Standardized Residuals")
        return ax

    def scale_location_plot(self, ax=None):
        """
        Sqrt(Standarized Residual) vs Fitted values plot

        Used to check homoscedasticity of the residuals.
        Horizontal line will suggest so.
        """
        if ax is None:
            fig, ax = plt.subplots()

        residual_norm_abs_sqrt = np.sqrt(np.abs(self.residual_norm))

        ax.scatter(self.y_predict, residual_norm_abs_sqrt, alpha=0.5)
        sns.regplot(
            x=self.y_predict,
            y=residual_norm_abs_sqrt,
            scatter=False,
            ci=False,
            lowess=True,
            line_kws={"color": "red", "lw": 1, "alpha": 0.8},
            ax=ax,
        )

        # annotations
        abs_sq_norm_resid = np.flip(np.argsort(residual_norm_abs_sqrt), 0)
        abs_sq_norm_resid_top_3 = abs_sq_norm_resid[:3]
        for i in abs_sq_norm_resid_top_3:
            ax.annotate(
                i, xy=(self.y_predict[i], residual_norm_abs_sqrt[i]), color="C3"
            )
        ax.set_title("Scale-Location", fontweight="bold")
        ax.set_xlabel("Fitted values")
        ax.set_ylabel(r"$\sqrt{|\mathrm{Standardized\ Residuals}|}$")
        return ax

    def leverage_plot(self, ax=None):
        """
        Residual vs Leverage plot

        Points falling outside Cook's distance curves are considered
        observation that can sway the fit aka are influential.
        Good to have none outside the curves.
        """
        if ax is None:
            fig, ax = plt.subplots()

        ax.scatter(self.leverage, self.residual_norm, alpha=0.5)

        sns.regplot(
            x=self.leverage,
            y=self.residual_norm,
            scatter=False,
            ci=False,
            lowess=True,
            line_kws={"color": "red", "lw": 1, "alpha": 0.8},
            ax=ax,
        )

        # annotations
        leverage_top_3 = np.flip(np.argsort(self.cooks_distance), 0)[:3]
        for i in leverage_top_3:
            ax.annotate(
                i, xy=(self.leverage[i], self.residual_norm[i]), color="C3"
            )

        xtemp, ytemp = self.__cooks_dist_line(0.5)  # 0.5 line
        ax.plot(
            xtemp, ytemp, label="Cook's distance", lw=1, ls="--", color="red"
        )
        xtemp, ytemp = self.__cooks_dist_line(1)  # 1 line
        ax.plot(xtemp, ytemp, lw=1, ls="--", color="red")

        ax.set_xlim(0, max(self.leverage) + 0.01)
        ax.set_title("Residuals vs Leverage", fontweight="bold")
        ax.set_xlabel("Leverage")
        ax.set_ylabel("Standardized Residuals")
        ax.legend(loc="upper right")
        return ax

    def vif_table(self):
        """
        VIF table

        VIF, the variance inflation factor, is a measure of multicollinearity.
        VIF > 5 for a variable indicates that it is highly collinear with the
        other input variables.
        """
        vif_df = pd.DataFrame()
        vif_df["Features"] = self.xvar_names
        vif_df["VIF Factor"] = [
            variance_inflation_factor(self.xvar, i)
            for i in range(self.xvar.shape[1])
        ]

        return vif_df.sort_values("VIF Factor").round(2)

    def __cooks_dist_line(self, factor):
        """
        Helper function for plotting Cook's distance curves
        """
        p = self.nparams
        formula = lambda x: np.sqrt((factor * p * (1 - x)) / x)
        x = np.linspace(0.001, max(self.leverage), 50)
        y = formula(x)
        return x, y

3.2 Import Data

Data was provided as a text file.

Preview of data file

A few lines from the text file with data:

R code

read_lines("data/winequality-red.csv", n_max = 6) |> writeLines()

"fixed acidity";"volatile acidity";"citric acid";"residual sugar";"chlorides";"free sulfur dioxide";"total sulfur dioxide";"density";"pH";"sulphates";"alcohol";"quality"
7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5
7.8;0.88;0;2.6;0.098;25;67;0.9968;3.2;0.68;9.8;5
7.8;0.76;0.04;2.3;0.092;15;54;0.997;3.26;0.65;9.8;5
11.2;0.28;0.56;1.9;0.075;17;60;0.998;3.16;0.58;9.8;6
7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5

Code

wine_raw = pd.read_csv("data/winequality-red.csv", delimiter=";")

The dataset contains 1599 rows and 12 columns. All variables are numeric. No missing values were found.

Details: inspect wine_raw dataset

The properties of the imported dataset:

Code

wine_raw.shape

(1599, 12)

R code

head(py$wine_raw)

ABCDEFGHIJ0123456789

	fixed acidity <dbl>	volatile acidity <dbl>	citric acid <dbl>
1	7.4	0.70	0.00
2	7.8	0.88	0.00
3	7.8	0.76	0.04
4	11.2	0.28	0.56
5	7.4	0.70	0.00
6	7.4	0.66	0.00

R code

glimpse(py$wine_raw)

Rows: 1,599
Columns: 12
$ `fixed acidity`        <dbl> 7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7…
$ `volatile acidity`     <dbl> 0.700, 0.880, 0.760, 0.280, 0.700, 0.660, 0.600…
$ `citric acid`          <dbl> 0.00, 0.00, 0.04, 0.56, 0.00, 0.00, 0.06, 0.00,…
$ `residual sugar`       <dbl> 1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.…
$ chlorides              <dbl> 0.076, 0.098, 0.092, 0.075, 0.076, 0.075, 0.069…
$ `free sulfur dioxide`  <dbl> 11, 25, 15, 17, 11, 13, 15, 15, 9, 17, 15, 17, …
$ `total sulfur dioxide` <dbl> 34, 67, 54, 60, 34, 40, 59, 21, 18, 102, 65, 10…
$ density                <dbl> 0.9978, 0.9968, 0.9970, 0.9980, 0.9978, 0.9978,…
$ pH                     <dbl> 3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39,…
$ sulphates              <dbl> 0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47,…
$ alcohol                <dbl> 9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 1…
$ quality                <dbl> 5, 5, 5, 6, 5, 5, 5, 7, 7, 5, 5, 5, 5, 5, 5, 5,…

Number of missing values:

Code

wine_raw.isna().sum()

fixed acidity           0
volatile acidity        0
citric acid             0
residual sugar          0
chlorides               0
free sulfur dioxide     0
total sulfur dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
quality                 0
dtype: int64

Code

wine_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         1599 non-null   float64
 1   volatile acidity      1599 non-null   float64
 2   citric acid           1599 non-null   float64
 3   residual sugar        1599 non-null   float64
 4   chlorides             1599 non-null   float64
 5   free sulfur dioxide   1599 non-null   float64
 6   total sulfur dioxide  1599 non-null   float64
 7   density               1599 non-null   float64
 8   pH                    1599 non-null   float64
 9   sulphates             1599 non-null   float64
 10  alcohol               1599 non-null   float64
 11  quality               1599 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 150.0 KB

3.3 Recoding and Renaming Variables

Variable names will be changed into lower snake case (e.g., to make them easier to be used in statistical model formulas). Wine quality score will be merged into quality groups as follows:

≤ 4 – low,
[5-6] – medium,
≥ 7 – high.

Code

wine = wine_raw.clean_names()

wine["quality_gr"] = pd.Categorical(
    wine["quality"].apply(to_3_quality_groups),
    categories=["low", "medium", "high"],
    ordered=True,
)

del wine_raw

How do quality scores and quality groups match?

R code

py$wine |>
  with(table(quality_gr, quality)) |>
  addmargins() |>
  pander::pander()

**Table 3.1.** Match between wine quality scores (columns) and quality groups (rows). Values indicate counts. “Sum” shows column/row sums.
	3	4	5	6	7	8	Sum
low	10	53	0	0	0	0	63
medium	0	0	681	638	0	0	1319
high	0	0	0	0	199	18	217
Sum	10	53	681	638	199	18	1599

Variable recoding was performed without errors as numbers add up correctly.

Details: inspect wine dataset

Column names:

R code

colnames(py$wine)

 [1] "fixed_acidity"        "volatile_acidity"     "citric_acid"         
 [4] "residual_sugar"       "chlorides"            "free_sulfur_dioxide" 
 [7] "total_sulfur_dioxide" "density"              "ph"                  
[10] "sulphates"            "alcohol"              "quality"             
[13] "quality_gr"

Dataset:

R code

head(py$wine)

ABCDEFGHIJ0123456789

	fixed_acidity <dbl>	volatile_acidity <dbl>	citric_acid <dbl>
1	7.4	0.70	0.00
2	7.8	0.88	0.00
3	7.8	0.76	0.04
4	11.2	0.28	0.56
5	7.4	0.70	0.00
6	7.4	0.66	0.00

R code

glimpse(py$wine)

Rows: 1,599
Columns: 13
$ fixed_acidity        <dbl> 7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5…
$ volatile_acidity     <dbl> 0.700, 0.880, 0.760, 0.280, 0.700, 0.660, 0.600, …
$ citric_acid          <dbl> 0.00, 0.00, 0.04, 0.56, 0.00, 0.00, 0.06, 0.00, 0…
$ residual_sugar       <dbl> 1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1,…
$ chlorides            <dbl> 0.076, 0.098, 0.092, 0.075, 0.076, 0.075, 0.069, …
$ free_sulfur_dioxide  <dbl> 11, 25, 15, 17, 11, 13, 15, 15, 9, 17, 15, 17, 16…
$ total_sulfur_dioxide <dbl> 34, 67, 54, 60, 34, 40, 59, 21, 18, 102, 65, 102,…
$ density              <dbl> 0.9978, 0.9968, 0.9970, 0.9980, 0.9978, 0.9978, 0…
$ ph                   <dbl> 3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39, 3…
$ sulphates            <dbl> 0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0…
$ alcohol              <dbl> 9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.…
$ quality              <dbl> 5, 5, 5, 6, 5, 5, 5, 7, 7, 5, 5, 5, 5, 5, 5, 5, 7…
$ quality_gr           <ord> medium, medium, medium, medium, medium, medium, m…

3.4 Split into Training/Test Sets

In this project, for more accurate models’ prediction power, a 80:20 train/test hold-out split with stratification by wine quality scores will be used.

Code

wine_train, wine_test = train_test_split(
    wine, test_size=0.2, random_state=50, stratify=wine["quality"]
)

del wine

The sizes of training and test sets after splitting are 1279:320 wine samples. If needed, find more details below.

Details: inspect data after splitting to train/test sets

The dimensions of training set:

Code

wine_train.shape

(1279, 13)

The dimensions of test set:

Code

wine_test.shape

(320, 13)

Is the split really stratified? Yes as sample size ratios (column ratio) in all strata are around 0.80:

R code

full_join(
  py$wine_train |> count(quality, name = "n_train"),
  py$wine_test |> count(quality, name = "n_test"),
  by = "quality"
) |>
  mutate(ratio = round(n_train / (n_train + n_test), digits = 2))

ABCDEFGHIJ0123456789

quality <dbl>	n_train <int>	n_test <int>	ratio <dbl>
3	8	2	0.80
4	42	11	0.79
5	545	136	0.80
6	510	128	0.80
7	159	40	0.80
8	15	3	0.83

Dataset (train only):

R code

head(py$wine_train)

ABCDEFGHIJ0123456789

	fixed_acidity <dbl>	volatile_acidity <dbl>	citric_acid <dbl>
1525	6.7	0.480	0.08
92	8.6	0.490	0.29
556	10.9	0.530	0.49
1012	7.7	1.005	0.15
385	7.4	0.630	0.07
466	10.3	0.500	0.42

R code

glimpse(py$wine_train)

Rows: 1,279
Columns: 13
$ fixed_acidity        <dbl> 6.7, 8.6, 10.9, 7.7, 7.4, 10.3, 10.0, 9.0, 9.0, 8…
$ volatile_acidity     <dbl> 0.480, 0.490, 0.530, 1.005, 0.630, 0.500, 0.490, …
$ citric_acid          <dbl> 0.08, 0.29, 0.49, 0.15, 0.07, 0.42, 0.20, 0.25, 0…
$ residual_sugar       <dbl> 2.1, 2.0, 4.6, 2.1, 2.4, 2.0, 11.0, 2.0, 2.8, 2.0…
$ chlorides            <dbl> 0.064, 0.110, 0.118, 0.102, 0.090, 0.069, 0.071, …
$ free_sulfur_dioxide  <dbl> 18, 19, 10, 11, 11, 21, 13, 8, 28, 11, 9, 40, 32,…
$ total_sulfur_dioxide <dbl> 34, 133, 17, 32, 37, 51, 50, 21, 104, 27, 22, 83,…
$ density              <dbl> 0.99552, 0.99720, 1.00020, 0.99604, 0.99790, 0.99…
$ ph                   <dbl> 3.33, 2.93, 3.07, 3.23, 3.43, 3.16, 3.16, 3.27, 3…
$ sulphates            <dbl> 0.64, 1.98, 0.56, 0.48, 0.76, 0.72, 0.69, 0.72, 0…
$ alcohol              <dbl> 9.70000, 9.80000, 11.70000, 10.00000, 9.70000, 11…
$ quality              <dbl> 5, 5, 6, 5, 6, 6, 6, 5, 5, 6, 5, 6, 5, 5, 6, 4, 6…
$ quality_gr           <ord> medium, medium, medium, medium, medium, medium, m…

To avoid additional biases, test set will not be investigated until the final prediction models are created. Now let’s focus on training set only.

Use training set only

From here, further exploratory and inferential analysis as well as statistical modelling will use training set only.

3.5 Further Inspection of Training Set

Graphical and numeric inspection of each variable (find the details below) revealed that some variables are either moderately (skewness > 0.5) or highly (skewness > 1) right-skewed. As it is planned to use linear models in this project, skewed data may lead to non-linearities. This issue will be accounted for in the next subsection.

Details: plots and summaries of each variable on original scale

R code

DescTools::Desc(
  py$wine_train |> mutate(quality = ordered(quality)),
  verbose = 3
)

------------------------------------------------------------------------------ 
Describe mutate(py$wine_train, quality = ordered(quality)) (data.frame):

data frame: 1279 obs. of  13 variables
        1279 complete cases (100.0%)

  Nr  ColName               Class            NAs  Levels                  
  1   fixed_acidity         numeric          .                            
  2   volatile_acidity      numeric          .                            
  3   citric_acid           numeric          .                            
  4   residual_sugar        numeric          .                            
  5   chlorides             numeric          .                            
  6   free_sulfur_dioxide   numeric          .                            
  7   total_sulfur_dioxide  numeric          .                            
  8   density               numeric          .                            
  9   ph                    numeric          .                            
  10  sulphates             numeric          .                            
  11  alcohol               numeric          .                            
  12  quality               ordered, factor  .    (6): 1-3, 2-4, 3-5, 4-6,
                                                  5-7, ...                
  13  quality_gr            ordered, factor  .    (3): 1-low, 2-medium,   
                                                  3-high                  


------------------------------------------------------------------------------ 
1 - fixed_acidity (numeric)

  length       n    NAs  unique    0s   mean  meanCI'
   1'279   1'279      0      94     0   8.30    8.21
          100.0%   0.0%          0.0%           8.40
                                                    
     .05     .10    .25  median   .75    .90     .95
    6.10    6.50   7.10    7.90  9.20  10.60   11.71
                                                    
   range      sd  vcoef     mad   IQR   skew    kurt
   11.30    1.75   0.21    1.48  2.10   1.02    1.26
                                                    
lowest : 4.6, 4.9, 5.0 (4), 5.1 (3), 5.2 (6)
highest: 14.3, 15.0 (2), 15.5, 15.6 (2), 15.9

' 95%-CI (classic)

------------------------------------------------------------------------------ 
2 - volatile_acidity (numeric)

  length       n    NAs  unique    0s  mean  meanCI'
   1'279   1'279      0     141     0  0.53    0.52
          100.0%   0.0%          0.0%          0.54
                                                   
     .05     .10    .25  median   .75   .90     .95
    0.27    0.31   0.40    0.52  0.64  0.76    0.85
                                                   
   range      sd  vcoef     mad   IQR  skew    kurt
    1.46    0.18   0.34    0.18  0.24  0.71    1.32
                                                   
lowest : 0.12 (3), 0.16 (2), 0.18 (8), 0.19, 0.2 (2)
highest: 1.18, 1.18, 1.24, 1.33 (2), 1.58

' 95%-CI (classic)

------------------------------------------------------------------------------ 
3 - citric_acid (numeric)

  length       n    NAs  unique    0s  mean  meanCI'
   1'279   1'279      0      78   110  0.27    0.26
          100.0%   0.0%          8.6%          0.28
                                                   
     .05     .10    .25  median   .75   .90     .95
    0.00    0.01   0.09    0.25  0.42  0.53    0.60
                                                   
   range      sd  vcoef     mad   IQR  skew    kurt
    0.79    0.19   0.73    0.25  0.33  0.32   -0.88
                                                   
lowest : 0.0 (110), 0.01 (27), 0.02 (40), 0.03 (21), 0.04 (25)
highest: 0.74 (4), 0.75, 0.76 (2), 0.78, 0.79

heap(?): remarkable frequency (8.6%) for the mode(s) (= 0)

' 95%-CI (classic)

------------------------------------------------------------------------------ 
4 - residual_sugar (numeric)

  length       n    NAs  unique    0s  mean  meanCI'
   1'279   1'279      0      85     0  2.55    2.47
          100.0%   0.0%          0.0%          2.63
                                                   
     .05     .10    .25  median   .75   .90     .95
    1.50    1.70   1.90    2.20  2.60  3.61    5.20
                                                   
   range      sd  vcoef     mad   IQR  skew    kurt
   14.50    1.44   0.57    0.44  0.70  4.42   26.74
                                                   
lowest : 0.9, 1.2 (7), 1.3 (5), 1.4 (33), 1.5 (22)
highest: 12.9, 13.4, 13.8 (2), 13.9, 15.4 (2)

heap(?): remarkable frequency (9.9%) for the mode(s) (= 2)

' 95%-CI (classic)

------------------------------------------------------------------------------ 
5 - chlorides (numeric)

  length       n     NAs  unique      0s    mean   meanCI'
   1'279   1'279       0     141       0  0.0867   0.0843
          100.0%    0.0%            0.0%           0.0891
                                                         
     .05     .10     .25  median     .75     .90      .95
  0.0540  0.0600  0.0700  0.0790  0.0900  0.1062   0.1242
                                                         
   range      sd   vcoef     mad     IQR    skew     kurt
  0.4520  0.0437  0.5045  0.0148  0.0200  5.0275  31.2563
                                                         
lowest : 0.012 (2), 0.034, 0.038 (2), 0.039 (4), 0.041 (4)
highest: 0.413, 0.414 (2), 0.415 (2), 0.422, 0.464

' 95%-CI (classic)

------------------------------------------------------------------------------ 
6 - free_sulfur_dioxide (numeric)

  length       n    NAs  unique     0s   mean  meanCI'
   1'279   1'279      0      58      0  15.70   15.12
          100.0%   0.0%           0.0%          16.29
                                                     
     .05     .10    .25  median    .75    .90     .95
    4.00    5.00   7.00   13.00  21.00  30.00   35.00
                                                     
   range      sd  vcoef     mad    IQR   skew    kurt
   71.00   10.60   0.68   10.38  14.00   1.31    2.24
                                                     
lowest : 1.0 (2), 3.0 (44), 4.0 (34), 5.0 (86), 5.5
highest: 54.0, 55.0 (2), 66.0, 68.0 (2), 72.0

heap(?): remarkable frequency (9.1%) for the mode(s) (= 6)

' 95%-CI (classic)

------------------------------------------------------------------------------ 
7 - total_sulfur_dioxide (numeric)

  length       n    NAs  unique     0s   mean  meanCI'
   1'279   1'279      0     140      0  45.67   43.85
          100.0%   0.0%           0.0%          47.49
                                                     
     .05     .10    .25  median    .75    .90     .95
   11.00   14.00  21.00   37.00  61.00  94.00  112.00
                                                     
   range      sd  vcoef     mad    IQR   skew    kurt
  283.00   33.16   0.73   26.69  40.00   1.61    4.46
                                                     
lowest : 6.0 (3), 7.0 (4), 8.0 (10), 9.0 (13), 10.0 (24)
highest: 155.0, 160.0, 165.0, 278.0, 289.0

' 95%-CI (classic)

------------------------------------------------------------------------------ 
8 - density (numeric)

    length         n       NAs    unique        0s      mean    meanCI'
     1'279     1'279         0       400         0  0.996736  0.996633
              100.0%      0.0%                0.0%            0.996839
                                                                      
       .05       .10       .25    median       .75       .90       .95
  0.993598  0.994480  0.995600  0.996720  0.997825  0.999108  1.000000
                                                                      
     range        sd     vcoef       mad       IQR      skew      kurt
  0.013620  0.001883  0.001889  0.001661  0.002225  0.127046  0.867452
                                                                      
lowest : 0.99007, 0.99064 (2), 0.99084, 0.9912, 0.9915
highest: 1.0026, 1.00289, 1.00315 (2), 1.0032, 1.00369 (2)

' 95%-CI (classic)

------------------------------------------------------------------------------ 
9 - ph (numeric)

  length       n    NAs  unique    0s  mean  meanCI'
   1'279   1'279      0      86     0  3.31    3.30
          100.0%   0.0%          0.0%          3.32
                                                   
     .05     .10    .25  median   .75   .90     .95
    3.07    3.13   3.21    3.31  3.40  3.51    3.57
                                                   
   range      sd  vcoef     mad   IQR  skew    kurt
    1.15    0.15   0.05    0.15  0.19  0.25    0.88
                                                   
lowest : 2.86, 2.87, 2.88 (2), 2.89 (2), 2.9
highest: 3.72 (3), 3.75, 3.78 (2), 3.9 (2), 4.01 (2)

' 95%-CI (classic)

------------------------------------------------------------------------------ 
10 - sulphates (numeric)

  length       n    NAs  unique     0s   mean  meanCI'
   1'279   1'279      0      88      0  0.653   0.644
          100.0%   0.0%           0.0%          0.662
                                                     
     .05     .10    .25  median    .75    .90     .95
   0.470   0.490  0.550   0.620  0.730  0.840   0.921
                                                     
   range      sd  vcoef     mad    IQR   skew    kurt
   1.650   0.163  0.250   0.119  0.180  2.263  10.489
                                                     
lowest : 0.33, 0.37, 0.39 (5), 0.4 (4), 0.42 (4)
highest: 1.59, 1.61, 1.62, 1.95, 1.98

' 95%-CI (classic)

------------------------------------------------------------------------------ 
11 - alcohol (numeric)

  length       n    NAs  unique     0s   mean  meanCI'
   1'279   1'279      0      61      0  10.42   10.36
          100.0%   0.0%           0.0%          10.48
                                                     
     .05     .10    .25  median    .75    .90     .95
    9.20    9.30   9.50   10.20  11.10  12.00   12.50
                                                     
   range      sd  vcoef     mad    IQR   skew    kurt
    6.50    1.06   0.10    1.04   1.60   0.87    0.23
                                                     
lowest : 8.4 (2), 8.5, 8.7, 8.8 (2), 9.0 (22)
highest: 13.4 (2), 13.5, 13.6, 14.0 (6), 14.9

heap(?): remarkable frequency (8.8%) for the mode(s) (= 9.5)

' 95%-CI (classic)

------------------------------------------------------------------------------ 
12 - quality (ordered, factor)

  length      n    NAs unique levels  dupes
   1'279  1'279      0      6      6      y
         100.0%   0.0%                     

   level  freq   perc  cumfreq  cumperc
1      3     8   0.6%        8     0.6%
2      4    42   3.3%       50     3.9%
3      5   545  42.6%      595    46.5%
4      6   510  39.9%    1'105    86.4%
5      7   159  12.4%    1'264    98.8%
6      8    15   1.2%    1'279   100.0%

------------------------------------------------------------------------------ 
13 - quality_gr (ordered, factor)

  length      n    NAs unique levels  dupes
   1'279  1'279      0      3      3      y
         100.0%   0.0%                     

    level   freq   perc  cumfreq  cumperc
1     low     50   3.9%       50     3.9%
2  medium  1'055  82.5%    1'105    86.4%
3    high    174  13.6%    1'279   100.0%

3.6 Log-Transformation of Right-Skewed Variables

The amount of skewness in variables was investigated before (columns skewness_original in the table below) and after (skewness_after_log) log-transformation, which was applied for variables with skewness above 1.

Code

skewness = (
    wine_train.skew(numeric_only=True)
    .sort_values(ascending=False)
    .rename("skewness_original")
)

skewed_vars = skewness.where(skewness > 1).dropna().index.tolist()

skewness_after = (
    wine_train.transform_columns(skewed_vars, np.log10)
    .skew(numeric_only=True)
    .rename("skewness_after_log")
)

skew_diff = (skewness_after - skewness).rename("difference")

pd.concat([skewness, skewness_after, skew_diff], axis=1).reset_index(
    names="variable"
)

                variable  skewness_original  skewness_after_log  difference
0              chlorides              5.039               1.530      -3.509
1         residual_sugar              4.435               1.803      -2.632
2              sulphates              2.268               0.852      -1.416
3   total_sulfur_dioxide              1.616              -0.033      -1.649
4    free_sulfur_dioxide              1.313              -0.172      -1.485
5          fixed_acidity              1.025               0.426      -0.599
6                alcohol              0.870               0.870       0.000
7       volatile_acidity              0.710               0.710       0.000
8            citric_acid              0.323               0.323       0.000
9                     ph              0.254               0.254       0.000
10               quality              0.226               0.226       0.000
11               density              0.127               0.127       0.000

Guidelines to evaluate degree of skewness (coefficient of asymmetry)

I use these guidelines to interpret skewness coefficient.

Interpret the direction of skewness:
- $< 0$ : left-skewed data;
- $0$ : ideally symmetric data;
- $0 <$ : right-skewed data.

Interpret the amount of skewness (rule of thumb):
- $0$ : ideally symmetric;
- $- 0.5$ – $+ 0.5$ : almost symmetric;
- $- 1$ – $- 0.5$ and $+ 0.5$ – $+ 1$ : moderate asymmetry;
- $< - 1$ and $+ 1 <$ : high asymmetry.

In all cases, amount of skewness was reduced.

Apply the transformation to the dataset:

Code

# Create a copy of data frame that contains log-transformed variables
new_names_map = {i: "log_" + str(i) for i in skewed_vars}

wine_train_log = wine_train.transform_columns(skewed_vars, np.log10).rename(
    new_names_map, axis=1
)

Details: inspect data after log-transformation

Column names:

R code

colnames(py$wine_train_log)

 [1] "log_fixed_acidity"        "volatile_acidity"        
 [3] "citric_acid"              "log_residual_sugar"      
 [5] "log_chlorides"            "log_free_sulfur_dioxide" 
 [7] "log_total_sulfur_dioxide" "density"                 
 [9] "ph"                       "log_sulphates"           
[11] "alcohol"                  "quality"                 
[13] "quality_gr"

Dataset:

R code

head(py$wine_train_log)

ABCDEFGHIJ0123456789

	log_fixed_acidity <dbl>	volatile_acidity <dbl>	citric_acid <dbl>
1525	0.8260748	0.480	0.08
92	0.9344985	0.490	0.29
556	1.0374265	0.530	0.49
1012	0.8864907	1.005	0.15
385	0.8692317	0.630	0.07
466	1.0128372	0.500	0.42

R code

glimpse(py$wine_train_log)

Rows: 1,279
Columns: 13
$ log_fixed_acidity        <dbl> 0.8260748, 0.9344985, 1.0374265, 0.8864907, 0…
$ volatile_acidity         <dbl> 0.480, 0.490, 0.530, 1.005, 0.630, 0.500, 0.4…
$ citric_acid              <dbl> 0.08, 0.29, 0.49, 0.15, 0.07, 0.42, 0.20, 0.2…
$ log_residual_sugar       <dbl> 0.3222193, 0.3010300, 0.6627578, 0.3222193, 0…
$ log_chlorides            <dbl> -1.1938200, -0.9586073, -0.9281180, -0.991399…
$ log_free_sulfur_dioxide  <dbl> 1.2552725, 1.2787536, 1.0000000, 1.0413927, 1…
$ log_total_sulfur_dioxide <dbl> 1.531479, 2.123852, 1.230449, 1.505150, 1.568…
$ density                  <dbl> 0.99552, 0.99720, 1.00020, 0.99604, 0.99790, …
$ ph                       <dbl> 3.33, 2.93, 3.07, 3.23, 3.43, 3.16, 3.16, 3.2…
$ log_sulphates            <dbl> -0.19382003, 0.29666519, -0.25181197, -0.3187…
$ alcohol                  <dbl> 9.70000, 9.80000, 11.70000, 10.00000, 9.70000…
$ quality                  <dbl> 5, 5, 6, 5, 6, 6, 6, 5, 5, 6, 5, 6, 5, 5, 6, …
$ quality_gr               <ord> medium, medium, medium, medium, medium, mediu…

Details: plots and summaries of variables after log-transformation

R code

DescTools::Desc(
  py$wine_train_log |> mutate(quality = ordered(quality)),
  verbose = 3, digits = 3
)

------------------------------------------------------------------------------ 
Describe mutate(py$wine_train_log, quality = ordered(quality)) (data.frame):

data frame: 1279 obs. of  13 variables
        1279 complete cases (100.0%)

  Nr  ColName                   Class            NAs  Levels               
  1   log_fixed_acidity         numeric          .                         
  2   volatile_acidity          numeric          .                         
  3   citric_acid               numeric          .                         
  4   log_residual_sugar        numeric          .                         
  5   log_chlorides             numeric          .                         
  6   log_free_sulfur_dioxide   numeric          .                         
  7   log_total_sulfur_dioxide  numeric          .                         
  8   density                   numeric          .                         
  9   ph                        numeric          .                         
  10  log_sulphates             numeric          .                         
  11  alcohol                   numeric          .                         
  12  quality                   ordered, factor  .    (6): 1-3, 2-4, 3-5,  
                                                      4-6, 5-7, ...        
  13  quality_gr                ordered, factor  .    (3): 1-low, 2-medium,
                                                      3-high               


------------------------------------------------------------------------------ 
1 - log_fixed_acidity (numeric)

  length       n    NAs  unique     0s   mean  meanCI'
   1'279   1'279      0      94      0  0.910   0.906
          100.0%   0.0%           0.0%          0.915
                                                     
     .05     .10    .25  median    .75    .90     .95
   0.785   0.813  0.851   0.898  0.964  1.025   1.069
                                                     
   range      sd  vcoef     mad    IQR   skew    kurt
   0.539   0.087  0.095   0.078  0.113  0.425   0.131
                                                     
lowest : 0.663, 0.690, 0.699 (4), 0.708 (3), 0.716 (6)
highest: 1.155, 1.176 (2), 1.190, 1.193 (2), 1.201

' 95%-CI (classic)

------------------------------------------------------------------------------ 
2 - volatile_acidity (numeric)

  length       n    NAs  unique     0s   mean  meanCI'
   1'279   1'279      0     141      0  0.532   0.522
          100.0%   0.0%           0.0%          0.542
                                                     
     .05     .10    .25  median    .75    .90     .95
   0.270   0.310  0.400   0.520  0.640  0.760   0.850
                                                     
   range      sd  vcoef     mad    IQR   skew    kurt
   1.460   0.183  0.344   0.178  0.240  0.709   1.321
                                                     
lowest : 0.120 (3), 0.160 (2), 0.180 (8), 0.190, 0.200 (2)
highest: 1.180, 1.185, 1.240, 1.330 (2), 1.580

' 95%-CI (classic)

------------------------------------------------------------------------------ 
3 - citric_acid (numeric)

  length       n    NAs  unique     0s   mean  meanCI'
   1'279   1'279      0      78    110  0.268   0.257
          100.0%   0.0%           8.6%          0.278
                                                     
     .05     .10    .25  median    .75    .90     .95
   0.000   0.010  0.090   0.250  0.420  0.530   0.600
                                                     
   range      sd  vcoef     mad    IQR   skew    kurt
   0.790   0.195  0.728   0.252  0.330  0.322  -0.882
                                                     
lowest : 0.000 (110), 0.010 (27), 0.020 (40), 0.030 (21), 0.040 (25)
highest: 0.740 (4), 0.750, 0.760 (2), 0.780, 0.790

heap(?): remarkable frequency (8.6%) for the mode(s) (= 0)

' 95%-CI (classic)

------------------------------------------------------------------------------ 
4 - log_residual_sugar (numeric)

  length       n    NAs  unique     0s   mean  meanCI'
   1'279   1'279      0      85      0  0.369   0.361
          100.0%   0.0%           0.0%          0.378
                                                     
     .05     .10    .25  median    .75    .90     .95
   0.176   0.230  0.279   0.342  0.415  0.558   0.716
                                                     
   range      sd  vcoef     mad    IQR   skew    kurt
   1.233   0.158  0.428   0.094  0.136  1.799   4.740
                                                     
lowest : -0.046, 0.079 (7), 0.114 (5), 0.146 (33), 0.176 (22)
highest: 1.111, 1.127, 1.140 (2), 1.143, 1.188 (2)

heap(?): remarkable frequency (9.9%) for the mode(s) (= 0.301029995663981)

' 95%-CI (classic)

------------------------------------------------------------------------------ 
5 - log_chlorides (numeric)

  length       n     NAs  unique      0s    mean  meanCI'
   1'279   1'279       0     141       0  -1.090  -1.098
          100.0%    0.0%            0.0%          -1.083
                                                        
     .05     .10     .25  median     .75     .90     .95
  -1.268  -1.222  -1.155  -1.102  -1.046  -0.974  -0.906
                                                        
   range      sd   vcoef     mad     IQR    skew    kurt
   1.587   0.141  -0.129   0.078   0.109   1.527   8.730
                                                        
lowest : -1.921 (2), -1.469, -1.420 (2), -1.409 (4), -1.387 (4)
highest: -0.384, -0.383 (2), -0.382 (2), -0.375, -0.333

' 95%-CI (classic)

------------------------------------------------------------------------------ 
6 - log_free_sulfur_dioxide (numeric)

  length       n    NAs  unique     0s    mean  meanCI'
   1'279   1'279      0      58      2   1.098   1.082
          100.0%   0.0%           0.2%           1.115
                                                      
     .05     .10    .25  median    .75     .90     .95
   0.602   0.699  0.845   1.114  1.322   1.477   1.544
                                                      
   range      sd  vcoef     mad    IQR    skew    kurt
   1.857   0.301  0.274   0.339  0.477  -0.172  -0.566
                                                      
lowest : 0.000 (2), 0.477 (44), 0.602 (34), 0.699 (86), 0.740
highest: 1.732, 1.740 (2), 1.820, 1.833 (2), 1.857

heap(?): remarkable frequency (9.1%) for the mode(s) (= 0.778151250383644)

' 95%-CI (classic)

------------------------------------------------------------------------------ 
7 - log_total_sulfur_dioxide (numeric)

  length       n    NAs  unique     0s    mean  meanCI'
   1'279   1'279      0     140      0   1.553   1.536
          100.0%   0.0%           0.0%           1.570
                                                      
     .05     .10    .25  median    .75     .90     .95
   1.041   1.146  1.322   1.568  1.785   1.973   2.049
                                                      
   range      sd  vcoef     mad    IQR    skew    kurt
   1.683   0.310  0.199   0.335  0.463  -0.033  -0.693
                                                      
lowest : 0.778 (3), 0.845 (4), 0.903 (10), 0.954 (13), 1.000 (24)
highest: 2.190, 2.204, 2.217, 2.444, 2.461

' 95%-CI (classic)

------------------------------------------------------------------------------ 
8 - density (numeric)

  length       n    NAs  unique     0s   mean  meanCI'
   1'279   1'279      0     400      0  0.997   0.997
          100.0%   0.0%           0.0%          0.997
                                                     
     .05     .10    .25  median    .75    .90     .95
   0.994   0.994  0.996   0.997  0.998  0.999   1.000
                                                     
   range      sd  vcoef     mad    IQR   skew    kurt
   0.014   0.002  0.002   0.002  0.002  0.127   0.867
                                                     
lowest : 0.990, 0.991 (2), 0.991, 0.991, 0.992
highest: 1.003, 1.003, 1.003 (2), 1.003, 1.004 (2)

' 95%-CI (classic)

------------------------------------------------------------------------------ 
9 - ph (numeric)

  length       n    NAs  unique     0s   mean  meanCI'
   1'279   1'279      0      86      0  3.312   3.304
          100.0%   0.0%           0.0%          3.321
                                                     
     .05     .10    .25  median    .75    .90     .95
   3.070   3.130  3.210   3.310  3.400  3.510   3.570
                                                     
   range      sd  vcoef     mad    IQR   skew    kurt
   1.150   0.153  0.046   0.148  0.190  0.254   0.881
                                                     
lowest : 2.860, 2.870, 2.880 (2), 2.890 (2), 2.900
highest: 3.720 (3), 3.750, 3.780 (2), 3.900 (2), 4.010 (2)

' 95%-CI (classic)

------------------------------------------------------------------------------ 
10 - log_sulphates (numeric)

  length       n     NAs  unique      0s    mean  meanCI'
   1'279   1'279       0      88       1  -0.196  -0.201
          100.0%    0.0%            0.1%          -0.191
                                                        
     .05     .10     .25  median     .75     .90     .95
  -0.328  -0.310  -0.260  -0.208  -0.137  -0.076  -0.036
                                                        
   range      sd   vcoef     mad     IQR    skew    kurt
   0.778   0.096  -0.487   0.089   0.123   0.850   1.854
                                                        
lowest : -0.481, -0.432, -0.409 (5), -0.398 (4), -0.377 (4)
highest: 0.201, 0.207, 0.210, 0.290, 0.297

' 95%-CI (classic)

------------------------------------------------------------------------------ 
11 - alcohol (numeric)

  length       n    NAs  unique      0s    mean  meanCI'
   1'279   1'279      0      61       0  10.423  10.365
          100.0%   0.0%            0.0%          10.481
                                                       
     .05     .10    .25  median     .75     .90     .95
   9.200   9.300  9.500  10.200  11.100  12.000  12.500
                                                       
   range      sd  vcoef     mad     IQR    skew    kurt
   6.500   1.060  0.102   1.038   1.600   0.868   0.228
                                                       
lowest : 8.400 (2), 8.500, 8.700, 8.800 (2), 9.000 (22)
highest: 13.400 (2), 13.500, 13.600, 14.000 (6), 14.900

heap(?): remarkable frequency (8.8%) for the mode(s) (= 9.5)

' 95%-CI (classic)

------------------------------------------------------------------------------ 
12 - quality (ordered, factor)

  length      n    NAs unique levels  dupes
   1'279  1'279      0      6      6      y
         100.0%   0.0%                     

   level  freq     perc  cumfreq   cumperc
1      3     8   0.625%        8    0.625%
2      4    42   3.284%       50    3.909%
3      5   545  42.611%      595   46.521%
4      6   510  39.875%    1'105   86.396%
5      7   159  12.432%    1'264   98.827%
6      8    15   1.173%    1'279  100.000%

------------------------------------------------------------------------------ 
13 - quality_gr (ordered, factor)

  length      n    NAs unique levels  dupes
   1'279  1'279      0      3      3      y
         100.0%   0.0%                     

    level   freq     perc  cumfreq   cumperc
1     low     50   3.909%       50    3.909%
2  medium  1'055  82.486%    1'105   86.396%
3    high    174  13.604%    1'279  100.000%

4 Analysis

4.1 Correlation: All Pairs

In this section we will look into the relationships between each pair of numeric variables. We will start from graphical analysis (scatter plot matrices) and later will proceed to linear correlation analysis.

Scatter plot matrices of non-transformed and log-transformed variables are presented in Fig. 4.1 and Fig. 4.2. In the plot we see, that some variables are related (e.g., density and fixed acidity). It can also be noticed, that after log-transformation the relationship between some variables (e.g., pH and fixed acidity) look more linear. On the other hand, the plots are small and there are many points, so further investigation is needed to understand data better.

Code

fig_sns_1 = sns.pairplot(
    wine_train,
    kind="hist",
    height=1,
    corner=True,
)

for ax in fig_sns_1.axes.flatten():
    if ax:
        # rotate x axis labels
        ax.set_xlabel(ax.get_xlabel(), rotation=90)
        # rotate y axis labels
        ax.set_ylabel(ax.get_ylabel(), rotation=0)
        # set y labels alignment
        ax.yaxis.get_label().set_horizontalalignment("right")

fig_sns_1.tight_layout()
plt.show()

**Fig. 4.1.** Relationships between pairs of variables on the original scale.

Code

fig_sns_2 = sns.pairplot(
    wine_train_log,
    kind="hist",
    height=1,
    corner=True,
)

for ax in fig_sns_2.axes.flatten():
    if ax:
        # rotate x axis labels
        ax.set_xlabel(ax.get_xlabel(), rotation=90)
        # rotate y axis labels
        ax.set_ylabel(ax.get_ylabel(), rotation=0)
        # set y labels alignment
        ax.yaxis.get_label().set_horizontalalignment("right")

fig_sns_2.tight_layout()
plt.show()

**Fig. 4.2.** Relationships between pairs of variables after log-transformation of right-skewed variables.

Next, the influence of logarithmic transformation on change in linear correlation coefficient is analyzed and summarized in the table below. The negative values or r_abs_difference (difference in absolute values of correlation coefficient) means that correlation decreased, positive value shows increase in correlation strength.

Code

# Calculate correlation
corr_df_orig = pairwise_correlation(wine_train)
corr_df_log = pairwise_correlation(wine_train_log)

# Investigate the changes in coefficients due to transformation
(
    pd.concat(
        [
            corr_df_orig.r.rename("r_orig_var"),
            corr_df_log.r.rename("r_log_var"),
        ],
        axis=1,
    )
    .eval("r_abs_difference = abs(r_log_var) - abs(r_orig_var)")
    .select_columns("r_abs_difference")
    .agg(["min", "mean", "std", "max"])
)

      r_abs_difference
min             -0.114
mean             0.014
std              0.041
max              0.146

The average shift in correlation coefficient was only slight, as in some coefficient started showing stronger, in other cases weaker correlation. The cases with largest positive and negative change (indicated by the the min and max values of r_abs_difference) are analyzed in the details sections. To sum up, it seams that in both cases log-transformation the true correlation was reflected better by the coefficients after the transformation.

Details: case study of chlorides vs. sulphates

After log-transformation, linear correlation strength between chlorides and sulphates decreased most evidently (change in absolute values is -0.11). It seems that the linear correlation coefficient size between these two variables is driven by 17 outlying points (where chlorides > 0.3, red dashed line in the plot below). Logarithmic transformation reduces this issue and true correlation strength is reflected better in this case. On the other hand, rank correlation coefficient suggests that there is no correlation or it is very weak.

R code

ggplot(py$wine_train, aes(x = chlorides, y = sulphates)) +
  geom_point(alpha = 0.1, pch = 19) +
  geom_vline(xintercept = 0.3, color = "darkred", lty = 2, alpha = 0.5) +
  ggtitle("Variables on Original Scale") +
  ggplot(py$wine_train, aes(x = log(chlorides), y = log(sulphates))) +
  geom_point(alpha = 0.1, pch = 19) +
  ggtitle("Variables on Log Scale")

R code

py$wine_train |> count(chlorides > 0.3)

ABCDEFGHIJ0123456789

chlorides > 0.3 <lgl>	n <int>
FALSE	1262
TRUE	17

R code

c(
  "r_original_variables" = with(
    py$wine_train, cor(chlorides, sulphates, method = "pearson")
  ),
  "r_log_variables" = with(
    py$wine_train,
    cor(log(chlorides), log(sulphates), method = "pearson")
  ),
  "r_without_17_outliers" = with(
    py$wine_train |> filter(chlorides < 0.3),
    cor(chlorides, sulphates, method = "pearson")
  ),
  "r_spearman_all_data" = with(
    py$wine_train, cor(chlorides, sulphates, method = "spearman")
  )
) |> round(2)

 r_original_variables       r_log_variables r_without_17_outliers 
                 0.31                  0.20                  0.05 
  r_spearman_all_data 
                -0.01

Details: case study of chlorides vs. density

After log-transformation, linear correlation size between chlorides and density increased most evidently (change in absolute values is +0.15). It seems that the linear correlation coefficient size again was influenced by a group of 17 outlying points (where chlorides > 0.3, red dashed line in the plot below) and logarithmic transformation reduced this issue and correlation of log-transformed variables and of non-transformed variables without the 17 points is almost of the same size (r = 0.36 vs. r = 0.34).

R code

ggplot(py$wine_train, aes(x = chlorides, y = density)) +
  geom_point(alpha = 0.1, pch = 19) +
  geom_vline(xintercept = 0.3, color = "darkred", lty = 2, alpha = 0.5) +
  ggtitle("Variables on Original Scale") +
  ggplot(py$wine_train, aes(x = log(chlorides), y = density)) +
  geom_point(alpha = 0.1, pch = 19) +
  ggtitle("Log-Transformed Chloride")

R code

py$wine_train |> count(chlorides > 0.3)

ABCDEFGHIJ0123456789

chlorides > 0.3 <lgl>	n <int>
FALSE	1262
TRUE	17

R code

c(
  "r_original_variables" = with(
    py$wine_train, cor(chlorides, density, method = "pearson")
  ),
  "r_log_variables" = with(
    py$wine_train,
    cor(log(chlorides), density, method = "pearson")
  ),
  "r_without_17_outliers" = with(
    py$wine_train |> filter(chlorides < 0.3),
    cor(chlorides, density, method = "pearson")
  ),
  "r_spearman_all_data" = with(
    py$wine_train, cor(chlorides, density, method = "spearman")
  )
) |> round(2)

 r_original_variables       r_log_variables r_without_17_outliers 
                 0.21                  0.36                  0.34 
  r_spearman_all_data 
                 0.41

Fig. 4.3 contains a graphical representation of all pair-wise correlation coefficients. The “details” sections below contain numeric results while the two tables below the section list correlations with quality sore and with alcohol contents.

R code

py$wine_train_log |>
  select_if(is.numeric) |>
  ggstatsplot::ggcorrmat() +
  scale_y_discrete(limits = rev)

**Fig. 4.3.** The results of correlation analysis after right-skewed variables were log-transformed. Numbers in cells indicate Pearson correlation coefficient. Cross-out cells show statistically insignificant results. Significance level is 0.05, p values with Holm correction were used.

Details: correlation table (variables on original scale)

Important

It can be suspected that linear correlation coefficients may not reflect the true strength of relationship between skewed variables correctly.

R code

py$corr_df_orig |>
  rowwise() |>
  mutate(
    `CI95%_lower` = `CI95%`[[1]],
    `CI95%_upper` = `CI95%`[[2]],
    r = round(r, digits = 2)
  ) |>
  select(-`CI95%`) |>
  ungroup() |>
  rownames_to_column("index")

The results of correlation analysis as correlation matrix. All variables are on the original scale.

ABCDEFGHIJ0123456789

index <chr>	X <chr>	Y <chr>
1	fixed_acidity	ph
2	fixed_acidity	citric_acid
3	fixed_acidity	density
4	free_sulfur_dioxide	total_sulfur_dioxide
5	volatile_acidity	citric_acid
6	citric_acid	ph
7	density	alcohol
8	alcohol	quality
9	volatile_acidity	quality
10	citric_acid	density

Details: correlation table (after log-transformation)

R code

py$corr_df_log |>
  rowwise() |>
  mutate(
    `CI95%_lower` = `CI95%`[[1]],
    `CI95%_upper` = `CI95%`[[2]],
    r = round(r, digits = 2)
  ) |>
  select(-`CI95%`) |>
  ungroup() |>
  rownames_to_column("index")

The results of correlation analysis after right-skewed variables were log-transformed.

ABCDEFGHIJ0123456789

index <chr>	X <chr>
1	log_free_sulfur_dioxide
2	log_fixed_acidity
3	log_fixed_acidity
4	log_fixed_acidity
5	volatile_acidity
6	citric_acid
7	density
8	alcohol
9	log_residual_sugar
10	volatile_acidity

Results. The strongest correlation is detected between log_free_sulfur_dioxide vs. log_total_sulfur_dioxide (r =, 0.79 [0.77, 0.81], p_adj < 0.001) and log_fixed_acidity and ph (r = -0.71, 95% CI [-0.74, -0.69], p_adj < 0.001). In statistical modeling these variables will be used as explanatory variables. This means that in linear modeling, a multicollinearity problem may appear. So we have to pay attention to this while interpreting regression coefficients results.

4.2 Exploratory PCA

4.2.1 Model Diagnostics

Principal component analysis (PCA) will be used to explore relationships between target (wine quality score, quality group as well as alcohol content) and other variables.

R code

pca_model <- prcomp(py$wine_train_log |> select_if(is.numeric), scale. = TRUE)

R code

fviz_eig(pca_model, addlabels = TRUE) + ylim(NA, 29)

Details: table with eigenvalues and percentage of explained variance

R code

get_eigenvalue(pca_model) |> round(1)

ABCDEFGHIJ0123456789

	eigenvalue <dbl>	variance.percent <dbl>
Dim.1	3.2	26.8
Dim.2	2.4	19.8
Dim.3	1.8	14.7
Dim.4	1.1	8.9
Dim.5	1.0	8.2
Dim.6	0.6	5.4
Dim.7	0.6	5.0
Dim.8	0.5	4.2
Dim.9	0.4	3.2
Dim.10	0.2	2.0

To explain more than 80 % of variance, 5 components are required.
“Elbow” method suggests 3-4 components.
Eigenvalue above 1 rule suggest 4-5 components.

These numbers might be important while doing feature selection. For visual inspection, it will be used the first 3 principal components that account for 26.8 %, 20.1 %, and 14.6 % respectively.

4.2.2 PCA Individuals Map

In PCA “individuals map”, no very clear clusters are visible. Only it can be noticed that in the space of PC1 and PC2 (denoted as Dim1 and Dim2), roughly half area with points is more dense, and the other half is less dense.

R code

fviz_pca_ind(pca_model, label = FALSE, alpha = 0.2)

R code

fviz_pca_ind(pca_model, axes = c(3, 2), label = FALSE, alpha = 0.2)

After adding colors to the points, 3 partly overlapping clusters are visible. The direction of cluster centroids match the order of quality groups from lowest to highest. It seems that all 3 first PCs have influence on cluster separability.

R code

legend_txt <- "Quality"

fviz_pca_ind(pca_model,
  label = FALSE, alpha = 0.3,
  habillage = py$wine_train_log$quality_gr, addEllipses = TRUE,
  palette = "Dark2"
) +
  labs(color = legend_txt, fill = legend_txt, shape = legend_txt)

R code

fviz_pca_ind(pca_model,
  axes = c(3, 2), label = FALSE, alpha = 0.3,
  habillage = py$wine_train_log$quality_gr, addEllipses = TRUE,
  palette = "Dark2"
) +
  labs(color = legend_txt, fill = legend_txt, shape = legend_txt)

4.2.3 PCA Correlation Circle Plot

The purpose of correlation circle plot is to investigate relationship between variables. The angle between arrows shows the strength of relationship in the analyzed PCA space, the length of arrow shows how well variable is represented in the PCA space.

The variables of interest are quality score and fraction of alcohol content. These variables will be highlighted.

R code

highlight_vars <- recode_factor(
  names(py$wine_train_log |> select_if(is.numeric)),
  "quality" = "quality",
  "alcohol" = "alcohol",
  .default =  "other"
)

R code

fviz_pca_var(pca_model,
  axes = c(1, 2), repel = TRUE, alpha = 0.7,
  col.var = highlight_vars, palette = c("black", "darkred", "skyblue3")
) +
  lims(x = c(-1.1, 1.1), y = c(-1.1, 1.1)) +
  theme(legend.position = "none")

R code

fviz_pca_var(pca_model,
  axes = c(3, 2), repel = TRUE, alpha = 0.7,
  col.var = highlight_vars, palette = c("black", "darkred", "skyblue3")
) +
  lims(x = c(-1.1, 1.1), y = c(-1.1, 1.1)) +
  theme(legend.position = "none")

It seems that alcohol and quality score are related positively (arrows point to almost the same directions) while alcohol and density are negatively correlated (arrows point to opposite directions) in PC2 and PC3 space.

4.2.4 PCA Biplot

The purpose of PCA biplots is to investigate the relationship between patterns in points (e.g., clusters) and variables.

R code

fviz_pca_biplot(pca_model,
  axes = c(1, 2), repel = TRUE, alpha.ind = 0.2, alpha.var = 0.85,
  col.var = "black",
  label = "var",
  addEllipses = TRUE,
  habillage = py$wine_train_log$quality_gr,
  palette = "Dark2"
) +
  labs(color = legend_txt, fill = legend_txt, shape = legend_txt)

R code

fviz_pca_biplot(pca_model,
  axes = c(3, 2), repel = TRUE, alpha.ind = 0.2, alpha.var = 0.85,
  col.var = "black",
  label = "var",
  addEllipses = TRUE,
  habillage = py$wine_train_log$quality_gr,
  palette = "Dark2"
) +
  labs(color = legend_txt, fill = legend_txt, shape = legend_txt)

We can assume, that volatile acidity may be related to differences between medium and low quality wines while (in PC1-PC2 space the direction of volatile acidity arrow almost coincides with the direction between the centroids of medium and low quality groups). And differences between medium and high may be related to alcohol contents (see biplot of PC2-PC3 space).

4.3 Modelling Wine Quality

The purpose of this section is to create a logistic regression model that predicts wine quality.

Quick pair-wise summary of how other variables relate to the quality score and quality group is presented in the “details” sections below. And summary table of correlation between quality score and other variables is shown below the “details” sections (the correlation analysis was carried out in section “Section 4.1”).

Details: plots and summaries of variables by wine quality scores

Code

names(py$wine_train_log)

 [1] "log_fixed_acidity"        "volatile_acidity"        
 [3] "citric_acid"              "log_residual_sugar"      
 [5] "log_chlorides"            "log_free_sulfur_dioxide" 
 [7] "log_total_sulfur_dioxide" "density"                 
 [9] "ph"                       "log_sulphates"           
[11] "alcohol"                  "quality"                 
[13] "quality_gr"

R code

py$wine_train_log %>%
  mutate(quality = ordered(quality)) %>%
  DescTools::Desc(. ~ quality, data = ., verbose = 3)

------------------------------------------------------------------------------ 
log_fixed_acidity ~ quality (.)

Summary: 
n pairs: 1'279, valid: 1'279 (100.0%), missings: 0 (0.0%), groups: 6

                                                            
              3        4        5        6        7        8
mean      0.906    0.866    0.906    0.910    0.938    0.917
median    0.875    0.860    0.892    0.898    0.949    0.898
sd        0.073    0.083    0.077    0.090    0.099    0.107
IQR       0.083    0.098    0.092    0.123    0.135    0.130
n             8       42      545      510      159       15
np       0.625%   3.284%  42.611%  39.875%  12.432%   1.173%
NAs           0        0        0        0        0        0
0s            0        0        0        0        0        0
min       0.833    0.663    0.716    0.699    0.690    0.699
max       1.017    1.097    1.201    1.155    1.193    1.100
Q1        0.860    0.816    0.857    0.845    0.869    0.863
Q3        0.944    0.914    0.949    0.968    1.004    0.993
mad       0.049    0.071    0.070    0.087    0.100    0.091
skew      0.651    0.198    0.690    0.401   -0.100   -0.309
kurt     -1.457    0.611    0.845   -0.317   -0.062   -0.628

Kruskal-Wallis rank sum test:
  Kruskal-Wallis chi-squared = 30.325, df = 5, p-value = 0.00001273

Summary of variables by wine quality scores.

------------------------------------------------------------------------------ 
volatile_acidity ~ quality (.)

Summary: 
n pairs: 1'279, valid: 1'279 (100.0%), missings: 0 (0.0%), groups: 6

                                                            
              3        4        5        6        7        8
mean      0.938    0.734    0.579    0.501    0.402    0.417
median    0.927    0.708    0.580    0.500    0.360    0.360
sd        0.350    0.215    0.167    0.164    0.147    0.143
IQR       0.298    0.324    0.210    0.230    0.185    0.080
n             8       42      545      510      159       15
np       0.625%   3.284%  42.611%  39.875%  12.432%   1.173%
NAs           0        0        0        0        0        0
0s            0        0        0        0        0        0
min       0.440    0.330    0.180    0.160    0.120    0.300
max       1.580    1.130    1.330    1.040    0.915    0.850
Q1        0.764    0.580    0.460    0.380    0.300    0.340
Q3        1.061    0.904    0.670    0.610    0.485    0.420
mad       0.274    0.259    0.163    0.170    0.119    0.059
skew      0.348    0.101    0.694    0.382    0.934    1.853
kurt     -0.938   -1.057    1.603   -0.104    0.765    2.821

Kruskal-Wallis rank sum test:
  Kruskal-Wallis chi-squared = 204.18, df = 5, p-value < 2.2e-16

------------------------------------------------------------------------------ 
citric_acid ~ quality (.)

Summary: 
n pairs: 1'279, valid: 1'279 (100.0%), missings: 0 (0.0%), groups: 6

                                                            
              3        4        5        6        7        8
mean      0.129    0.130    0.243    0.268    0.381    0.387
median    0.035    0.065    0.220    0.250    0.400    0.390
sd        0.203    0.151    0.179    0.197    0.198    0.179
IQR       0.143    0.193    0.260    0.340    0.200    0.210
n             8       42      545      510      159       15
np       0.625%   3.284%  42.611%  39.875%  12.432%   1.173%
NAs           0        0        0        0        0        0
0s            3        8       46       47        6        0
min       0.000    0.000    0.000    0.000    0.000    0.050
max       0.490    0.540    0.790    0.780    0.760    0.720
Q1        0.000    0.022    0.100    0.090    0.310    0.305
Q3        0.143    0.215    0.360    0.430    0.510    0.515
mad       0.052    0.096    0.193    0.252    0.163    0.163
skew      0.940    1.250    0.526    0.251   -0.357   -0.218
kurt     -1.151    0.469   -0.521   -1.014   -0.548   -0.672

Kruskal-Wallis rank sum test:
  Kruskal-Wallis chi-squared = 89.261, df = 5, p-value < 2.2e-16

------------------------------------------------------------------------------ 
log_residual_sugar ~ quality (.)

Summary: 
n pairs: 1'279, valid: 1'279 (100.0%), missings: 0 (0.0%), groups: 6

                                                            
              3        4        5        6        7        8
mean      0.392    0.393    0.367    0.361    0.395    0.366
median    0.322    0.322    0.342    0.342    0.342    0.342
sd        0.229    0.199    0.152    0.154    0.172    0.159
IQR       0.270    0.167    0.136    0.119    0.168    0.160
n             8       42      545      510      159       15
np       0.625%   3.284%  42.611%  39.875%  12.432%   1.173%
NAs           0        0        0        0        0        0
0s            0        0        0        0        0        0
min       0.079    0.114    0.079   -0.046    0.079    0.146
max       0.756    1.111    1.140    1.188    0.919    0.806
Q1        0.286    0.301    0.279    0.279    0.279    0.255
Q3        0.556    0.468    0.415    0.398    0.447    0.415
mad       0.263    0.082    0.108    0.094    0.129    0.129
skew      0.231    1.437    1.730    2.195    1.152    1.247
kurt     -1.505    2.237    4.073    7.826    0.841    1.373

Kruskal-Wallis rank sum test:
  Kruskal-Wallis chi-squared = 4.3531, df = 5, p-value = 0.4998

------------------------------------------------------------------------------ 
log_chlorides ~ quality (.)

Summary: 
n pairs: 1'279, valid: 1'279 (100.0%), missings: 0 (0.0%), groups: 6

                                                            
              3        4        5        6        7        8
mean     -0.921   -1.111   -1.068   -1.097   -1.139   -1.172
median   -0.938   -1.100   -1.092   -1.108   -1.131   -1.155
sd        0.216    0.113    0.134    0.140    0.150    0.065
IQR       0.275    0.122    0.104    0.107    0.147    0.067
n             8       42      545      510      159       15
np       0.625%   3.284%  42.611%  39.875%  12.432%   1.173%
NAs           0        0        0        0        0        0
0s            0        0        0        0        0        0
min      -1.215   -1.347   -1.409   -1.469   -1.921   -1.347
max      -0.573   -0.764   -0.333   -0.382   -0.446   -1.097
Q1       -1.078   -1.174   -1.131   -1.167   -1.208   -1.201
Q3       -0.804   -1.052   -1.027   -1.060   -1.060   -1.134
mad       0.212    0.101    0.075    0.079    0.104    0.048
skew      0.235    0.416    2.404    1.901   -0.948   -1.227
kurt     -1.511    0.600   10.046    6.959   10.440    0.920

Kruskal-Wallis rank sum test:
  Kruskal-Wallis chi-squared = 66.782, df = 5, p-value = 4.782e-13

------------------------------------------------------------------------------ 
log_free_sulfur_dioxide ~ quality (.)

Summary: 
n pairs: 1'279, valid: 1'279 (100.0%), missings: 0 (0.0%), groups: 6

                                                            
              3        4        5        6        7        8
mean      0.924    0.920    1.129    1.107    1.035    0.942
median    0.739    0.874    1.146    1.146    1.000    0.845
sd        0.371    0.272    0.298    0.294    0.309    0.288
IQR       0.529    0.406    0.477    0.477    0.465    0.438
n             8       42      545      510      159       15
np       0.625%   3.284%  42.611%  39.875%  12.432%   1.173%
NAs           0        0        0        0        0        0
0s            0        0        0        2        0        0
min       0.477    0.477    0.477    0.000    0.477    0.477
max       1.531    1.613    1.833    1.857    1.732    1.531
Q1        0.699    0.699    0.903    0.845    0.778    0.739
Q3        1.228    1.105    1.380    1.322    1.243    1.176
mad       0.223    0.282    0.347    0.347    0.329    0.217
skew      0.427    0.318   -0.189   -0.327    0.145    0.387
kurt     -1.606   -0.584   -0.731   -0.169   -0.650   -0.987

Kruskal-Wallis rank sum test:
  Kruskal-Wallis chi-squared = 35.271, df = 5, p-value = 0.000001329

------------------------------------------------------------------------------ 
log_total_sulfur_dioxide ~ quality (.)

Summary: 
n pairs: 1'279, valid: 1'279 (100.0%), missings: 0 (0.0%), groups: 6

                                                            
              3        4        5        6        7        8
mean      1.291    1.381    1.640    1.523    1.428    1.370
median    1.175    1.362    1.653    1.544    1.415    1.230
sd        0.284    0.321    0.321    0.271    0.301    0.255
IQR       0.397    0.546    0.531    0.382    0.403    0.311
n             8       42      545      510      159       15
np       0.625%   3.284%  42.611%  39.875%  12.432%   1.173%
NAs           0        0        0        0        0        0
0s            0        0        0        0        0        0
min       0.954    0.845    0.778    0.778    0.845    1.079
max       1.690    2.053    2.190    2.217    2.461    1.944
Q1        1.120    1.114    1.398    1.342    1.230    1.204
Q3        1.517    1.660    1.929    1.724    1.633    1.515
mad       0.263    0.367    0.402    0.279    0.313    0.222
skew      0.370    0.322   -0.252   -0.182    0.490    0.721
kurt     -1.716   -1.030   -0.869   -0.490    0.395   -0.683

Kruskal-Wallis rank sum test:
  Kruskal-Wallis chi-squared = 95.311, df = 5, p-value < 2.2e-16

------------------------------------------------------------------------------ 
density ~ quality (.)

Summary: 
n pairs: 1'279, valid: 1'279 (100.0%), missings: 0 (0.0%), groups: 6

                                                            
              3        4        5        6        7        8
mean      0.997    0.996    0.997    0.997    0.996    0.995
median    0.998    0.996    0.997    0.997    0.996    0.995
sd        0.002    0.001    0.002    0.002    0.002    0.002
IQR       0.002    0.002    0.002    0.002    0.003    0.003
n             8       42      545      510      159       15
np       0.625%   3.284%  42.611%  39.875%  12.432%   1.173%
NAs           0        0        0        0        0        0
0s            0        0        0        0        0        0
min       0.995    0.993    0.993    0.990    0.991    0.992
max       0.999    1.001    1.003    1.004    1.003    0.999
Q1        0.996    0.996    0.996    0.995    0.995    0.994
Q3        0.998    0.997    0.998    0.998    0.997    0.997
mad       0.002    0.001    0.001    0.002    0.002    0.003
skew     -0.315    0.555    0.425    0.136    0.335   -0.194
kurt     -1.652    0.950    1.169    0.678    0.338   -1.281

Kruskal-Wallis rank sum test:
  Kruskal-Wallis chi-squared = 49.245, df = 5, p-value = 0.000000001977

------------------------------------------------------------------------------ 
ph ~ quality (.)

Summary: 
n pairs: 1'279, valid: 1'279 (100.0%), missings: 0 (0.0%), groups: 6

                                                            
              3        4        5        6        7        8
mean      3.397    3.414    3.303    3.320    3.291    3.274
median    3.390    3.380    3.300    3.320    3.290    3.230
sd        0.142    0.157    0.147    0.155    0.152    0.197
IQR       0.167    0.195    0.190    0.190    0.180    0.175
n             8       42      545      510      159       15
np       0.625%   3.284%  42.611%  39.875%  12.432%   1.173%
NAs           0        0        0        0        0        0
0s            0        0        0        0        0        0
min       3.160    3.050    2.880    2.860    2.920    2.880
max       3.630    3.900    3.720    4.010    3.780    3.720
Q1        3.317    3.312    3.200    3.220    3.200    3.175
Q3        3.485    3.507    3.390    3.410    3.380    3.350
mad       0.126    0.119    0.148    0.148    0.133    0.119
skew     -0.014    0.663    0.114    0.296    0.329    0.448
kurt     -1.063    1.031    0.050    1.528    0.655    0.187

Kruskal-Wallis rank sum test:
  Kruskal-Wallis chi-squared = 28.884, df = 5, p-value = 0.00002443

------------------------------------------------------------------------------ 
log_sulphates ~ quality (.)

Summary: 
n pairs: 1'279, valid: 1'279 (100.0%), missings: 0 (0.0%), groups: 6

                                                            
              3        4        5        6        7        8
mean     -0.260   -0.265   -0.223   -0.182   -0.136   -0.113
median   -0.276   -0.252   -0.237   -0.194   -0.125   -0.131
sd        0.096    0.076    0.095    0.087    0.083    0.062
IQR       0.052    0.080    0.104    0.112    0.109    0.072
n             8       42      545      510      159       15
np       0.625%   3.284%  42.611%  39.875%  12.432%   1.173%
NAs           0        0        0        0        0        0
0s            0        0        0        1        0        0
min      -0.398   -0.481   -0.432   -0.398   -0.409   -0.187
max      -0.066    0.033    0.297    0.290    0.134    0.041
Q1       -0.297   -0.317   -0.284   -0.237   -0.187   -0.158
Q3       -0.245   -0.237   -0.180   -0.125   -0.078   -0.086
mad       0.038    0.051    0.070    0.076    0.073    0.066
skew      0.665    0.780    1.477    0.944   -0.255    0.896
kurt     -0.343    4.793    4.020    2.414    1.024    0.172

Kruskal-Wallis rank sum test:
  Kruskal-Wallis chi-squared = 207.01, df = 5, p-value < 2.2e-16

------------------------------------------------------------------------------ 
alcohol ~ quality (.)

Summary: 
n pairs: 1'279, valid: 1'279 (100.0%), missings: 0 (0.0%), groups: 6

                                                            
              3        4        5        6        7        8
mean     10.075   10.420    9.912   10.624   11.408   11.900
median   10.050   10.350    9.700   10.500   11.500   11.700
sd        0.845    0.968    0.747    1.057    0.972    1.180
IQR       0.975    1.575    0.800    1.575    1.350    1.600
n             8       42      545      510      159       15
np       0.625%   3.284%  42.611%  39.875%  12.432%   1.173%
NAs           0        0        0        0        0        0
0s            0        0        0        0        0        0
min       8.400    9.000    8.500    8.400    9.200    9.800
max      11.000   13.100   14.900   14.000   14.000   14.000
Q1        9.775    9.600    9.400    9.800   10.750   11.150
Q3       10.750   11.175   10.200   11.375   12.100   12.750
mad       0.741    1.112    0.445    1.186    1.038    1.334
skew     -0.659    0.373    1.911    0.583   -0.019   -0.168
kurt     -0.730   -0.508    5.899   -0.123   -0.586   -1.005

Kruskal-Wallis rank sum test:
  Kruskal-Wallis chi-squared = 309.38, df = 5, p-value < 2.2e-16

------------------------------------------------------------------------------ 
quality_gr ~ quality (.)

Summary: 
n: 1'279, rows: 3, columns: 6

Pearson's Chi-squared test:
  X-squared = 2558, df = 10, p-value < 2.2e-16
Pearson's Chi-squared test (cont. adj):
  X-squared = 2558, df = 10, p-value < 2.2e-16
Log likelihood ratio (G-test) test of independence:
  G = 1424.6, X-squared df = 10, p-value < 2.2e-16
Mantel-Haenszel Chi-squared:
  X-squared = 843.76, df = 1, p-value < 2.2e-16
[31m
Warning message:
  Exp. counts < 5: Chi-squared approx. may be incorrect!!

[39m
                       estimate  lwr.ci  upr.ci'
Contingency Coeff.       0.8165       -       -
Cramer V                 1.0000  0.9595  1.0000
Kendall Tau-b            0.6827  0.6534  0.7120
Goodman Kruskal Gamma    1.0000  1.0000  1.0000
Stuart Tau-c             0.4493  0.4049  0.4938
Somers D C|R             1.0000  1.0000  1.0000
Somers D R|C             0.4661  0.4078  0.5244
Pearson Correlation      0.8125  0.7930  0.8304
Spearman Correlation     0.7133  0.6853  0.7392
Lambda C|R               0.2738  0.2416  0.3061
Lambda R|C               1.0000  1.0000  1.0000
Lambda sym               0.4436  0.4054  0.4819
Uncertainty Coeff. C|R   0.4698  0.4462  0.4935
Uncertainty Coeff. R|C   1.0000  0.9999  1.0000
Uncertainty Coeff. sym   0.6393  0.6174  0.6612
Mutual Information       0.8035       -       -

                                                                     
             quality      3      4      5      6      7      8    Sum
quality_gr                                                           
                                                                     
low          freq         8     42      0      0      0      0     50
             perc      0.6%   3.3%   0.0%   0.0%   0.0%   0.0%   3.9%
             p.row    16.0%  84.0%   0.0%   0.0%   0.0%   0.0%      .
             p.col   100.0% 100.0%   0.0%   0.0%   0.0%   0.0%      .
                                                                     
medium       freq         0      0    545    510      0      0  1'055
             perc      0.0%   0.0%  42.6%  39.9%   0.0%   0.0%  82.5%
             p.row     0.0%   0.0%  51.7%  48.3%   0.0%   0.0%      .
             p.col     0.0%   0.0% 100.0% 100.0%   0.0%   0.0%      .
                                                                     
high         freq         0      0      0      0    159     15    174
             perc      0.0%   0.0%   0.0%   0.0%  12.4%   1.2%  13.6%
             p.row     0.0%   0.0%   0.0%   0.0%  91.4%   8.6%      .
             p.col     0.0%   0.0%   0.0%   0.0% 100.0% 100.0%      .
                                                                     
Sum          freq         8     42    545    510    159     15  1'279
             perc      0.6%   3.3%  42.6%  39.9%  12.4%   1.2% 100.0%
             p.row        .      .      .      .      .      .      .
             p.col        .      .      .      .      .      .      .
                                                                     

----------
' 95% conf. level

Details: plots and summaries of variables by wine quality group

R code

py$wine_train_log %>%
  DescTools::Desc(. ~ quality_gr, data = ., verbose = 3)

------------------------------------------------------------------------------ 
log_fixed_acidity ~ quality_gr (.)

Summary: 
n pairs: 1'279, valid: 1'279 (100.0%), missings: 0 (0.0%), groups: 3

                                 
            low   medium     high
mean      0.873    0.908    0.937
median    0.866    0.892    0.944
sd        0.082    0.084    0.100
IQR       0.085    0.105    0.135
n            50    1'055      174
np       3.909%  82.486%  13.604%
NAs           0        0        0
0s            0        0        0
min       0.663    0.699    0.690
max       1.097    1.201    1.193
Q1        0.833    0.851    0.869
Q3        0.918    0.957    1.004
mad       0.070    0.078    0.103
skew      0.194    0.534   -0.128
kurt      0.533    0.185   -0.061

Kruskal-Wallis rank sum test:
  Kruskal-Wallis chi-squared = 28.538, df = 2, p-value = 0.0000006354

Summary of variables by wine quality group.

------------------------------------------------------------------------------ 
volatile_acidity ~ quality_gr (.)

Summary: 
n pairs: 1'279, valid: 1'279 (100.0%), missings: 0 (0.0%), groups: 3

                                 
            low   medium     high
mean      0.766    0.542    0.404
median    0.760    0.540    0.360
sd        0.248    0.170    0.146
IQR       0.348    0.225    0.177
n            50    1'055      174
np       3.909%  82.486%  13.604%
NAs           0        0        0
0s            0        0        0
min       0.330    0.160    0.120
max       1.580    1.330    0.915
Q1        0.582    0.420    0.302
Q3        0.930    0.645    0.480
mad       0.263    0.163    0.104
skew      0.595    0.514    1.012
kurt      0.502    0.864    0.992

Kruskal-Wallis rank sum test:
  Kruskal-Wallis chi-squared = 151.12, df = 2, p-value < 2.2e-16

------------------------------------------------------------------------------ 
citric_acid ~ quality_gr (.)

Summary: 
n pairs: 1'279, valid: 1'279 (100.0%), missings: 0 (0.0%), groups: 3

                                 
            low   medium     high
mean      0.129    0.255    0.381
median    0.055    0.240    0.400
sd        0.158    0.188    0.196
IQR       0.195    0.310    0.208
n            50    1'055      174
np       3.909%  82.486%  13.604%
NAs           0        0        0
0s           11       93        6
min       0.000    0.000    0.000
max       0.540    0.790    0.760
Q1        0.020    0.090    0.302
Q3        0.215    0.400    0.510
mad       0.082    0.237    0.156
skew      1.234    0.392   -0.353
kurt      0.234   -0.797   -0.522

Kruskal-Wallis rank sum test:
  Kruskal-Wallis chi-squared = 85.537, df = 2, p-value < 2.2e-16

------------------------------------------------------------------------------ 
log_residual_sugar ~ quality_gr (.)

Summary: 
n pairs: 1'279, valid: 1'279 (100.0%), missings: 0 (0.0%), groups: 3

                                 
            low   medium     high
mean      0.393    0.364    0.393
median    0.322    0.342    0.342
sd        0.202    0.153    0.170
IQR       0.220    0.136    0.164
n            50    1'055      174
np       3.909%  82.486%  13.604%
NAs           0        0        0
0s            0        0        0
min       0.079   -0.046    0.079
max       1.111    1.188    0.919
Q1        0.301    0.279    0.279
Q3        0.521    0.415    0.443
mad       0.099    0.094    0.129
skew      1.216    1.961    1.172
kurt      1.583    5.936    0.923

Kruskal-Wallis rank sum test:
  Kruskal-Wallis chi-squared = 3.7912, df = 2, p-value = 0.1502

------------------------------------------------------------------------------ 
log_chlorides ~ quality_gr (.)

Summary: 
n pairs: 1'279, valid: 1'279 (100.0%), missings: 0 (0.0%), groups: 3

                                 
            low   medium     high
mean     -1.081   -1.082   -1.142
median   -1.089   -1.097   -1.137
sd        0.150    0.138    0.145
IQR       0.151    0.108    0.137
n            50    1'055      174
np       3.909%  82.486%  13.604%
NAs           0        0        0
0s            0        0        0
min      -1.347   -1.469   -1.921
max      -0.573   -0.333   -0.446
Q1       -1.172   -1.149   -1.208
Q3       -1.021   -1.041   -1.071
mad       0.109    0.077    0.105
skew      1.105    2.101   -0.926
kurt      1.729    8.331   11.085

Kruskal-Wallis rank sum test:
  Kruskal-Wallis chi-squared = 30.575, df = 2, p-value = 0.0000002295

------------------------------------------------------------------------------ 
log_free_sulfur_dioxide ~ quality_gr (.)

Summary: 
n pairs: 1'279, valid: 1'279 (100.0%), missings: 0 (0.0%), groups: 3

                                 
            low   medium     high
mean      0.921    1.118    1.027
median    0.845    1.146    1.000
sd        0.285    0.296    0.308
IQR       0.439    0.459    0.452
n            50    1'055      174
np       3.909%  82.486%  13.604%
NAs           0        0        0
0s            0        2        0
min       0.477    0.000    0.477
max       1.613    1.857    1.732
Q1        0.699    0.903    0.778
Q3        1.138    1.362    1.230
mad       0.319    0.347    0.329
skew      0.384   -0.253    0.173
kurt     -0.690   -0.450   -0.658

Kruskal-Wallis rank sum test:
  Kruskal-Wallis chi-squared = 32.821, df = 2, p-value = 0.00000007464

------------------------------------------------------------------------------ 
log_total_sulfur_dioxide ~ quality_gr (.)

Summary: 
n pairs: 1'279, valid: 1'279 (100.0%), missings: 0 (0.0%), groups: 3

                                 
            low   medium     high
mean      1.366    1.583    1.423
median    1.331    1.591    1.398
sd        0.315    0.303    0.297
IQR       0.546    0.451    0.427
n            50    1'055      174
np       3.909%  82.486%  13.604%
NAs           0        0        0
0s            0        0        0
min       0.845    0.778    0.845
max       2.053    2.217    2.461
Q1        1.114    1.362    1.204
Q3        1.660    1.813    1.631
mad       0.322    0.339    0.329
skew      0.364   -0.124    0.519
kurt     -1.004   -0.701    0.391

Kruskal-Wallis rank sum test:
  Kruskal-Wallis chi-squared = 58.932, df = 2, p-value = 1.596e-13

------------------------------------------------------------------------------ 
density ~ quality_gr (.)

Summary: 
n pairs: 1'279, valid: 1'279 (100.0%), missings: 0 (0.0%), groups: 3

                                 
            low   medium     high
mean      0.996    0.997    0.996
median    0.996    0.997    0.996
sd        0.002    0.002    0.002
IQR       0.002    0.002    0.003
n            50    1'055      174
np       3.909%  82.486%  13.604%
NAs           0        0        0
0s            0        0        0
min       0.993    0.990    0.991
max       1.001    1.004    1.003
Q1        0.996    0.996    0.995
Q3        0.997    0.998    0.997
mad       0.001    0.002    0.002
skew      0.414    0.149    0.298
kurt      0.142    1.021    0.301

Kruskal-Wallis rank sum test:
  Kruskal-Wallis chi-squared = 25.761, df = 2, p-value = 0.000002547

------------------------------------------------------------------------------ 
ph ~ quality_gr (.)

Summary: 
n pairs: 1'279, valid: 1'279 (100.0%), missings: 0 (0.0%), groups: 3

                                 
            low   medium     high
mean      3.411    3.311    3.289
median    3.380    3.310    3.275
sd        0.153    0.151    0.156
IQR       0.188    0.190    0.178
n            50    1'055      174
np       3.909%  82.486%  13.604%
NAs           0        0        0
0s            0        0        0
min       3.050    2.860    2.880
max       3.900    4.010    3.780
Q1        3.312    3.210    3.200
Q3        3.500    3.400    3.378
mad       0.119    0.148    0.126
skew      0.606    0.219    0.344
kurt      0.986    0.887    0.718

Kruskal-Wallis rank sum test:
  Kruskal-Wallis chi-squared = 23.989, df = 2, p-value = 0.000006178

------------------------------------------------------------------------------ 
log_sulphates ~ quality_gr (.)

Summary: 
n pairs: 1'279, valid: 1'279 (100.0%), missings: 0 (0.0%), groups: 3

                                 
            low   medium     high
mean     -0.264   -0.203   -0.134
median   -0.256   -0.222   -0.128
sd        0.078    0.094    0.081
IQR       0.073    0.113    0.105
n            50    1'055      174
np       3.909%  82.486%  13.604%
NAs           0        0        0
0s            0        1        0
min      -0.481   -0.432   -0.409
max       0.033    0.297    0.134
Q1       -0.310   -0.268   -0.185
Q3       -0.237   -0.155   -0.081
mad       0.058    0.080    0.077
skew      0.803    1.115   -0.244
kurt      3.735    2.769    1.126

Kruskal-Wallis rank sum test:
  Kruskal-Wallis chi-squared = 134.41, df = 2, p-value < 2.2e-16

------------------------------------------------------------------------------ 
alcohol ~ quality_gr (.)

Summary: 
n pairs: 1'279, valid: 1'279 (100.0%), missings: 0 (0.0%), groups: 3

                                 
            low   medium     high
mean     10.365   10.256   11.451
median   10.250   10.000   11.500
sd        0.950    0.977    0.998
IQR       1.375    1.350    1.400
n            50    1'055      174
np       3.909%  82.486%  13.604%
NAs           0        0        0
0s            0        0        0
min       8.400    8.400    9.200
max      13.100   14.900   14.000
Q1        9.625    9.500   10.800
Q3       11.000   10.850   12.200
mad       1.038    0.890    1.038
skew      0.300    1.121    0.010
kurt     -0.267    1.117   -0.564

Kruskal-Wallis rank sum test:
  Kruskal-Wallis chi-squared = 173.51, df = 2, p-value < 2.2e-16

------------------------------------------------------------------------------ 
quality ~ quality_gr (.)

Summary: 
n pairs: 1'279, valid: 1'279 (100.0%), missings: 0 (0.0%), groups: 3

                                 
            low   medium     high
mean      3.840    5.483    7.086
median    4.000    5.000    7.000
sd        0.370    0.500    0.281
IQR       0.000    1.000    0.000
n            50    1'055      174
np       3.909%  82.486%  13.604%
NAs           0        0        0
0s            0        0        0
min       3.000    5.000    7.000
max       4.000    6.000    8.000
Q1        4.000    5.000    7.000
Q3        4.000    6.000    7.000
mad       0.000    0.000    0.000
skew     -1.799    0.066    2.923
kurt      1.265   -1.997    6.583

Kruskal-Wallis rank sum test:
  Kruskal-Wallis chi-squared = 650.26, df = 2, p-value < 2.2e-16

Correlation strength between wine quality score and the remaining variables:

Code

corr_df_log.query("X=='quality' | Y == 'quality'")

                           X        Y      r           CI95%   p_adj
65                   alcohol  quality  0.446     [0.4, 0.49]  <0.001
20          volatile_acidity  quality -0.405  [-0.45, -0.36]  <0.001
64             log_sulphates  quality  0.345     [0.3, 0.39]  <0.001
29               citric_acid  quality  0.247     [0.19, 0.3]  <0.001
44             log_chlorides  quality -0.169  [-0.22, -0.12]  <0.001
55  log_total_sulfur_dioxide  quality -0.162  [-0.22, -0.11]  <0.001
59                   density  quality -0.152   [-0.21, -0.1]  <0.001
10         log_fixed_acidity  quality  0.123    [0.07, 0.18]  <0.001
62                        ph  quality -0.063  [-0.12, -0.01]   0.324
50   log_free_sulfur_dioxide  quality -0.041    [-0.1, 0.01]  >0.999
37        log_residual_sugar  quality  0.016   [-0.04, 0.07]  >0.999

Results:

Correlation analysis of quality scores revealed that the strongest correlation is between alcohol and quality (r = 0.45 95% CI [0.40, 0.49], p_adj < 0.001) and volatile_acidity and quality (r = -0.41 95% CI [-0.45, -0.36], p_adj < 0.001).
This means that correlation strength is moderate so no single variable is enough for extremely accurate predictions.

4.3.1 Wine Quality Distribution

Different wine quality scores come at different rates:

Code

quality_score_counts = wine_train_log.quality.value_counts().sort_index()
print(test_chi_square_gof(quality_score_counts))

Chi square test, χ²(5, n = 1279) = 1462.79, p<0.001

Average quality wines are the most common (error bars show 95 percent confidence intervals):

Code

quality_score_distribution = ci_proportion_multinomial(
    quality_score_counts
).reset_index(names="quality_score")

quality_score_distribution

   quality_score    n  percent  ci_lower  ci_upper
0              3    8    0.625     0.254     1.531
1              4   42    3.284     2.201     4.872
2              5  545   42.611    39.013    46.290
3              6  510   39.875    36.327    43.532
4              7  159   12.432    10.199    15.071
5              8   15    1.173     0.602     2.272

Code

ggplot(
  py$quality_score_distribution,
  aes(x = factor(quality_score), y = percent)
) +
  geom_col(fill = "skyblue3") +
  geom_errorbar(aes(ymin = ci_lower, ymax = ci_upper), width = 0.3, lwd = 1) +
  labs(
    x = "Wine quality score",
    y = "Percentage",
    title = "The Distribution of Wines by Quality Scores"
  )

Code

quality_group_counts = wine_train_log.quality_gr.value_counts().sort_index()
print(test_chi_square_gof(quality_group_counts))

Chi square test, χ²(2, n = 1279) = 1408.57, p<0.001

Code

quality_group_distribution = ci_proportion_multinomial(
    quality_group_counts
).reset_index(names="quality_group")

quality_group_distribution

  quality_group     n  percent  ci_lower  ci_upper
0           low    50    3.909     2.804     5.426
1        medium  1055   82.486    79.799    84.884
2          high   174   13.604    11.471    16.062

Code

ggplot(
  py$quality_group_distribution,
  aes(x = quality_group, y = percent)
) +
  geom_col(fill = "skyblue3") +
  geom_errorbar(aes(ymin = ci_lower, ymax = ci_upper), width = 0.3, lwd = 1) +
  labs(
    x = "Wine quality group",
    y = "Percentage",
    title = "The Distribution of Wines by Quality Groups"
  )

4.3.2 Logistic Regression Model

Code

wine_train_log["quality_gr_num"] = wine_train_log["quality"].apply(
    lambda x: to_3_quality_groups(x, numeric=True)
)

# quality_gr_num=2 is a reference group
pd.crosstab(wine_train_log["quality_gr_num"], wine_train_log["quality"])

quality         3   4    5    6    7   8
quality_gr_num                          
0               8  42    0    0    0   0
1               0   0    0    0  159  15
2               0   0  545  510    0   0

Here:

0: low quality;
2: medium quality (reference group);
1: high quality.

Results of statistical significance based feature selection.
Step	Selected model	Model / Additionally removed variable	Kappa	AIC	Pseudo R²
(NULL)	no	NULL	0.000	1428.6	0.000
(FULL)	no	FULL	0.327	1045.6	0.300
(1)	no	(FULL) - citric_acid	0.339	1042.9	0.299
(2)	no	(1) - density	0.334	1043.4	0.296
(3)	no	(2) - log_fixed_acidity	0.329	1040.1	0.295
(4)	yes	(3) - log_total_sulfur_dioxide	0.334	1053.2	0.283
(5)	no	(4) - log_chlorides	0.308	1059.3	0.276

Find the details below.

Multinomial regression (null model)

Code

formula_mnl_null = "quality_gr_num ~ 1"

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    _ = do_mnlogit(formula_mnl_null, wine_train_log)

Optimization terminated successfully.
         Current function value: 0.556927
         Iterations 6
                        Results: MNLogit
================================================================
Model:              MNLogit          Method:           MLE      
Dependent Variable: quality_gr_num   Pseudo R-squared: 0.000    
Date:               2023-07-30 16:02 AIC:              1428.6187
No. Observations:   1279             BIC:              1438.9263
Df Model:           0                Log-Likelihood:   -712.31  
Df Residuals:       1277             LL-Null:          -712.31  
Converged:          1.0000           LLR p-value:      nan      
No. Iterations:     6.0000           Scale:            1.0000   
--------------------------------------------------------------
quality_gr_num = 0 Coef.  Std.Err.   t    P>|t|  [0.025 0.975]
--------------------------------------------------------------
       Intercept   1.2470   0.1605 7.7717 0.0000 0.9325 1.5615
---------------------------------------------------------------
quality_gr_num = 1 Coef.  Std.Err.    t    P>|t|  [0.025 0.975]
---------------------------------------------------------------
        Intercept  3.0493   0.1447 21.0682 0.0000 2.7656 3.3329
================================================================

kappa = 0.000
              precision    recall  f1-score   support

           0       0.00      0.00      0.00        50
           1       0.00      0.00      0.00       174
           2       0.82      1.00      0.90      1055

    accuracy                           0.82      1279
   macro avg       0.27      0.33      0.30      1279
weighted avg       0.68      0.82      0.75      1279

Confusion matrix (rows - true, columns - predicted):
[[   0    0   50]
 [   0    0  174]
 [   0    0 1055]]

Multinomial regression (full model)

Code

formula_mnl_full = "quality_gr_num ~ " + " + ".join(
    [
        "log_total_sulfur_dioxide",
        "log_chlorides",
        "log_sulphates",
        "log_residual_sugar",
        "log_free_sulfur_dioxide",
        "volatile_acidity",
        "log_fixed_acidity",
        "density",
        "citric_acid",
        "ph",
        "alcohol",
    ]
)

_ = do_mnlogit(formula_mnl_full, wine_train_log)

Optimization terminated successfully.
         Current function value: 0.390006
         Iterations 12
                              Results: MNLogit
=============================================================================
Model:                  MNLogit              Method:               MLE       
Dependent Variable:     quality_gr_num       Pseudo R-squared:     0.300     
Date:                   2023-07-30 16:02     AIC:                  1045.6357 
No. Observations:       1279                 BIC:                  1169.3277 
Df Model:               22                   Log-Likelihood:       -498.82   
Df Residuals:           1255                 LL-Null:              -712.31   
Converged:              1.0000               LLR p-value:          1.0879e-76
No. Iterations:         12.0000              Scale:                1.0000    
-----------------------------------------------------------------------------
   quality_gr_num = 0      Coef.   Std.Err.    t    P>|t|    [0.025   0.975] 
-----------------------------------------------------------------------------
               Intercept  119.8409 280.2599  0.4276 0.6689 -429.4584 669.1402
log_total_sulfur_dioxide   -0.8022   1.1834 -0.6778 0.4979   -3.1216   1.5173
           log_chlorides   -2.9954   1.5809 -1.8948 0.0581   -6.0939   0.1031
           log_sulphates   13.1604   2.8123  4.6796 0.0000    7.6484  18.6723
      log_residual_sugar   -0.0340   1.6013 -0.0212 0.9831   -3.1725   3.1046
 log_free_sulfur_dioxide    2.0438   1.1666  1.7520 0.0798   -0.2426   4.3302
        volatile_acidity   -6.3168   1.2915 -4.8911 0.0000   -8.8480  -3.7855
       log_fixed_acidity    3.8494   6.8680  0.5605 0.5752   -9.6116  17.3103
                 density -118.3428 288.3712 -0.4104 0.6815 -683.5399 446.8542
             citric_acid    0.1209   1.6703  0.0724 0.9423   -3.1528   3.3946
                      ph   -3.5982   2.5049 -1.4365 0.1509   -8.5078   1.3114
                 alcohol    0.9063   0.3405  2.6619 0.0078    0.2390   1.5736
-----------------------------------------------------------------------------
   quality_gr_num = 1      Coef.   Std.Err.    t    P>|t|    [0.025   0.975] 
-----------------------------------------------------------------------------
               Intercept -143.9094 253.5512 -0.5676 0.5703 -640.8606 353.0417
log_total_sulfur_dioxide    1.2885   1.0042  1.2831 0.1995   -0.6798   3.2567
           log_chlorides   -0.4318   1.3705 -0.3151 0.7527   -3.1178   2.2543
           log_sulphates    4.9014   2.5869  1.8947 0.0581   -0.1688   9.9716
      log_residual_sugar   -2.7214   1.4126 -1.9265 0.0540   -5.4899   0.0472
 log_free_sulfur_dioxide    1.2659   1.0113  1.2518 0.2107   -0.7162   3.2479
        volatile_acidity   -4.1921   0.9725 -4.3105 0.0000   -6.0983  -2.2860
       log_fixed_acidity   -1.5063   6.3136 -0.2386 0.8114  -13.8807  10.8681
                 density  161.1611 260.9309  0.6176 0.5368 -350.2541 672.5763
             citric_acid   -0.7918   1.4244 -0.5559 0.5783   -3.5835   2.0000
                      ph   -4.3257   2.2476 -1.9246 0.0543   -8.7309   0.0795
                 alcohol    0.3417   0.3092  1.1050 0.2692   -0.2643   0.9476
=============================================================================

kappa = 0.327
              precision    recall  f1-score   support

           0       0.58      0.14      0.23        50
           1       0.60      0.33      0.43       174
           2       0.86      0.96      0.91      1055

    accuracy                           0.84      1279
   macro avg       0.68      0.48      0.52      1279
weighted avg       0.82      0.84      0.82      1279

Confusion matrix (rows - true, columns - predicted):
[[   7    1   42]
 [   0   58  116]
 [   5   38 1012]]

Multinomial regression (1)

Code

# Excluded:
# - citric_acid
formula_mnl_1 = "quality_gr_num ~ " + " + ".join(
    [
        "log_total_sulfur_dioxide",
        "log_chlorides",
        "log_sulphates",
        "log_residual_sugar",
        "log_free_sulfur_dioxide",
        "volatile_acidity",
        "log_fixed_acidity",
        "density",
        "ph",
        "alcohol",
    ]
)

_ = do_mnlogit(formula_mnl_1, wine_train_log)

Optimization terminated successfully.
         Current function value: 0.390481
         Iterations 12
                              Results: MNLogit
=============================================================================
Model:                  MNLogit              Method:               MLE       
Dependent Variable:     quality_gr_num       Pseudo R-squared:     0.299     
Date:                   2023-07-30 16:02     AIC:                  1042.8517 
No. Observations:       1279                 BIC:                  1156.2360 
Df Model:               20                   Log-Likelihood:       -499.43   
Df Residuals:           1257                 LL-Null:              -712.31   
Converged:              1.0000               LLR p-value:          9.0793e-78
No. Iterations:         12.0000              Scale:                1.0000    
-----------------------------------------------------------------------------
    quality_gr_num = 0     Coef.   Std.Err.    t    P>|t|    [0.025   0.975] 
-----------------------------------------------------------------------------
               Intercept   96.9830 279.8112  0.3466 0.7289 -451.4369 645.4030
log_total_sulfur_dioxide   -0.8883   1.1358 -0.7820 0.4342   -3.1144   1.3379
           log_chlorides   -2.9051   1.5873 -1.8301 0.0672   -6.0162   0.2061
           log_sulphates   13.0336   2.7989  4.6566 0.0000    7.5478  18.5194
      log_residual_sugar   -0.0304   1.6022 -0.0190 0.9849   -3.1707   3.1098
 log_free_sulfur_dioxide    2.0718   1.1431  1.8124 0.0699   -0.1687   4.3123
        volatile_acidity   -6.5900   1.1552 -5.7047 0.0000   -8.8541  -4.3259
       log_fixed_acidity    3.4837   6.6004  0.5278 0.5976   -9.4529  16.4203
                 density  -94.3171 287.8715 -0.3276 0.7432 -658.5348 469.9007
                      ph   -3.7780   2.5068 -1.5071 0.1318   -8.6912   1.1352
                 alcohol    0.9255   0.3379  2.7393 0.0062    0.2633   1.5877
-----------------------------------------------------------------------------
   quality_gr_num = 1      Coef.   Std.Err.    t    P>|t|    [0.025   0.975] 
-----------------------------------------------------------------------------
               Intercept -154.0818 253.0695 -0.6089 0.5426 -650.0889 341.9253
log_total_sulfur_dioxide    1.1203   0.9506  1.1785 0.2386   -0.7429   2.9835
           log_chlorides   -0.4739   1.3808 -0.3432 0.7314   -3.1802   2.2323
           log_sulphates    4.7820   2.5728  1.8587 0.0631   -0.2607   9.8247
      log_residual_sugar   -2.7823   1.4142 -1.9674 0.0491   -5.5541  -0.0105
 log_free_sulfur_dioxide    1.3634   0.9892  1.3784 0.1681   -0.5753   3.3021
        volatile_acidity   -3.9864   0.9089 -4.3862 0.0000   -5.7678  -2.2051
       log_fixed_acidity   -2.5589   6.0456 -0.4233 0.6721  -14.4080   9.2903
                 density  172.2310 260.3805  0.6615 0.5083 -338.1055 682.5674
                      ph   -4.3085   2.2549 -1.9107 0.0560   -8.7280   0.1111
                 alcohol    0.3257   0.3079  1.0575 0.2903   -0.2779   0.9292
=============================================================================

kappa = 0.339
              precision    recall  f1-score   support

           0       0.58      0.14      0.23        50
           1       0.61      0.34      0.44       174
           2       0.87      0.96      0.91      1055

    accuracy                           0.84      1279
   macro avg       0.69      0.48      0.53      1279
weighted avg       0.82      0.84      0.82      1279

Confusion matrix (rows - true, columns - predicted):
[[   7    1   42]
 [   0   60  114]
 [   5   37 1013]]

Multinomial regression (2)

Code

# Excluded:
# - citric_acid
# - density
formula_mnl_2 = "quality_gr_num ~ " + " + ".join(
    [
        "log_total_sulfur_dioxide",
        "log_chlorides",
        "log_sulphates",
        "log_residual_sugar",
        "log_free_sulfur_dioxide",
        "volatile_acidity",
        "log_fixed_acidity",
        "ph",
        "alcohol",
    ]
)

_ = do_mnlogit(formula_mnl_2, wine_train_log)

Optimization terminated successfully.
         Current function value: 0.392263
         Iterations 8
                            Results: MNLogit
=========================================================================
Model:                  MNLogit            Method:             MLE       
Dependent Variable:     quality_gr_num     Pseudo R-squared:   0.296     
Date:                   2023-07-30 16:02   AIC:                1043.4087 
No. Observations:       1279               BIC:                1146.4854 
Df Model:               18                 Log-Likelihood:     -501.70   
Df Residuals:           1259               LL-Null:            -712.31   
Converged:              1.0000             LLR p-value:        3.4228e-78
No. Iterations:         8.0000             Scale:              1.0000    
-------------------------------------------------------------------------
   quality_gr_num = 0     Coef.  Std.Err.    t    P>|t|   [0.025   0.975]
-------------------------------------------------------------------------
               Intercept  5.2484   9.4399  0.5560 0.5782 -13.2535 23.7502
log_total_sulfur_dioxide -1.1858   1.1280 -1.0513 0.2931  -3.3966  1.0250
           log_chlorides -3.1511   1.5809 -1.9932 0.0462  -6.2497 -0.0526
           log_sulphates 12.9201   2.6849  4.8121 0.0000   7.6578 18.1825
      log_residual_sugar -0.3780   1.1593 -0.3261 0.7444  -2.6502  1.8942
 log_free_sulfur_dioxide  2.2795   1.1356  2.0072 0.0447   0.0537  4.5053
        volatile_acidity -7.0259   1.1412 -6.1566 0.0000  -9.2626 -4.7892
       log_fixed_acidity  1.9618   3.7933  0.5172 0.6050  -5.4728  9.3965
                      ph -4.1767   1.9943 -2.0944 0.0362  -8.0854 -0.2680
                 alcohol  0.9960   0.2180  4.5682 0.0000   0.5687  1.4233
-------------------------------------------------------------------------
    quality_gr_num = 1     Coef.  Std.Err.    t    P>|t|   [0.025  0.975]
-------------------------------------------------------------------------
               Intercept  13.1628   8.3845  1.5699 0.1164 -3.2705 29.5960
log_total_sulfur_dioxide   1.1572   0.9501  1.2180 0.2232 -0.7050  3.0194
           log_chlorides  -0.3448   1.3728 -0.2512 0.8017 -3.0354  2.3458
           log_sulphates   5.3032   2.4627  2.1534 0.0313  0.4763 10.1301
      log_residual_sugar  -2.1039   0.9858 -2.1341 0.0328 -4.0362 -0.1717
 log_free_sulfur_dioxide   1.3164   0.9863  1.3347 0.1820 -0.6167  3.2496
        volatile_acidity  -4.0310   0.9055 -4.4516 0.0000 -5.8058 -2.2562
       log_fixed_acidity   0.7247   3.4907  0.2076 0.8355 -6.1169  7.5663
                      ph  -3.3871   1.7697 -1.9139 0.0556 -6.8558  0.0815
                 alcohol   0.1733   0.1972  0.8789 0.3795 -0.2131  0.5597
=========================================================================

kappa = 0.334
              precision    recall  f1-score   support

           0       0.58      0.14      0.23        50
           1       0.61      0.34      0.44       174
           2       0.87      0.96      0.91      1055

    accuracy                           0.84      1279
   macro avg       0.69      0.48      0.52      1279
weighted avg       0.82      0.84      0.82      1279

Confusion matrix (rows - true, columns - predicted):
[[   7    1   42]
 [   0   59  115]
 [   5   37 1013]]

Multinomial regression (3)

Code

# Excluded:
# - citric_acid
# - density
# - log_fixed_acidity
formula_mnl_3 = "quality_gr_num ~ " + " + ".join(
    [
        "log_total_sulfur_dioxide",
        "log_chlorides",
        "log_sulphates",
        "log_residual_sugar",
        "log_free_sulfur_dioxide",
        "volatile_acidity",
        "ph",
        "alcohol",
    ]
)

_ = do_mnlogit(formula_mnl_3, wine_train_log)

Optimization terminated successfully.
         Current function value: 0.392523
         Iterations 8
                            Results: MNLogit
========================================================================
Model:                 MNLogit            Method:             MLE       
Dependent Variable:    quality_gr_num     Pseudo R-squared:   0.295     
Date:                  2023-07-30 16:02   AIC:                1040.0733 
No. Observations:      1279               BIC:                1132.8423 
Df Model:              16                 Log-Likelihood:     -502.04   
Df Residuals:          1261               LL-Null:            -712.31   
Converged:             1.0000             LLR p-value:        1.7841e-79
No. Iterations:        8.0000             Scale:              1.0000    
------------------------------------------------------------------------
   quality_gr_num = 0     Coef.  Std.Err.    t    P>|t|   [0.025  0.975]
------------------------------------------------------------------------
               Intercept  9.8512   4.6487  2.1191 0.0341  0.7399 18.9625
log_total_sulfur_dioxide -1.2134   1.1257 -1.0779 0.2811 -3.4197  0.9930
           log_chlorides -3.0846   1.5759 -1.9573 0.0503 -6.1734  0.0041
           log_sulphates 13.1280   2.6369  4.9785 0.0000  7.9597 18.2964
      log_residual_sugar -0.2949   1.1536 -0.2557 0.7982 -2.5560  1.9661
 log_free_sulfur_dioxide  2.2388   1.1280  1.9848 0.0472  0.0280  4.4495
        volatile_acidity -7.0784   1.1341 -6.2412 0.0000 -9.3012 -4.8555
                      ph -4.9478   1.4362 -3.4452 0.0006 -7.7626 -2.1330
                 alcohol  0.9895   0.2177  4.5454 0.0000  0.5628  1.4162
------------------------------------------------------------------------
   quality_gr_num = 1     Coef.  Std.Err.    t    P>|t|   [0.025  0.975]
------------------------------------------------------------------------
               Intercept 14.6710   4.1401  3.5436 0.0004  6.5565 22.7856
log_total_sulfur_dioxide  1.1759   0.9511  1.2364 0.2163 -0.6882  3.0401
           log_chlorides -0.3522   1.3767 -0.2558 0.7981 -3.0506  2.3461
           log_sulphates  5.4129   2.4166  2.2399 0.0251  0.6765 10.1493
      log_residual_sugar -2.0923   0.9879 -2.1179 0.0342 -4.0285 -0.1560
 log_free_sulfur_dioxide  1.2830   0.9781  1.3118 0.1896 -0.6339  3.2000
        volatile_acidity -4.0041   0.9001 -4.4487 0.0000 -5.7682 -2.2400
                      ph -3.6413   1.2705 -2.8661 0.0042 -6.1313 -1.1512
                 alcohol  0.1723   0.1971  0.8743 0.3819 -0.2140  0.5586
========================================================================

kappa = 0.329
              precision    recall  f1-score   support

           0       0.58      0.14      0.23        50
           1       0.60      0.33      0.43       174
           2       0.87      0.96      0.91      1055

    accuracy                           0.84      1279
   macro avg       0.68      0.48      0.52      1279
weighted avg       0.82      0.84      0.82      1279

Confusion matrix (rows - true, columns - predicted):
[[   7    1   42]
 [   0   58  116]
 [   5   37 1013]]

Multinomial regression (4)

Code

# Excluded:
# - citric_acid
# - density
# - log_fixed_acidity
# - log_total_sulfur_dioxide
formula_mnl_4 = "quality_gr_num ~ " + " + ".join(
    [
        "log_chlorides",
        "log_sulphates",
        "log_residual_sugar",
        "log_free_sulfur_dioxide",
        "volatile_acidity",
        "ph",
        "alcohol",
    ]
)

classification_model_selected = do_mnlogit(formula_mnl_4, wine_train_log)

Optimization terminated successfully.
         Current function value: 0.399213
         Iterations 8
                           Results: MNLogit
=======================================================================
Model:                MNLogit            Method:             MLE       
Dependent Variable:   quality_gr_num     Pseudo R-squared:   0.283     
Date:                 2023-07-30 16:02   AIC:                1053.1867 
No. Observations:     1279               BIC:                1135.6481 
Df Model:             14                 Log-Likelihood:     -510.59   
Df Residuals:         1263               LL-Null:            -712.31   
Converged:            1.0000             LLR p-value:        2.3990e-77
No. Iterations:       8.0000             Scale:              1.0000    
-----------------------------------------------------------------------
   quality_gr_num = 0    Coef.  Std.Err.    t    P>|t|   [0.025  0.975]
-----------------------------------------------------------------------
              Intercept  9.7895   4.3226  2.2647 0.0235  1.3173 18.2617
          log_chlorides -2.7688   1.5637 -1.7706 0.0766 -5.8337  0.2961
          log_sulphates 12.7081   2.5966  4.8942 0.0000  7.6189 17.7973
     log_residual_sugar -0.8818   1.1331 -0.7782 0.4364 -3.1026  1.3390
log_free_sulfur_dioxide  1.3545   0.6432  2.1059 0.0352  0.0938  2.6152
       volatile_acidity -7.3250   1.1180 -6.5521 0.0000 -9.5162 -5.1338
                     ph -5.0049   1.4014 -3.5714 0.0004 -7.7515 -2.2583
                alcohol  0.9959   0.2127  4.6814 0.0000  0.5790  1.4129
-----------------------------------------------------------------------
   quality_gr_num = 1    Coef.  Std.Err.    t    P>|t|   [0.025  0.975]
-----------------------------------------------------------------------
              Intercept 16.4577   3.8374  4.2888 0.0000  8.9366 23.9789
          log_chlorides -0.3891   1.3754 -0.2829 0.7773 -3.0849  2.3067
          log_sulphates  5.7134   2.3979  2.3827 0.0172  1.0137 10.4131
     log_residual_sugar -1.9889   0.9859 -2.0174 0.0437 -3.9211 -0.0566
log_free_sulfur_dioxide  2.2719   0.5707  3.9811 0.0001  1.1534  3.3904
       volatile_acidity -3.9980   0.8887 -4.4986 0.0000 -5.7398 -2.2561
                     ph -3.8146   1.2447 -3.0646 0.0022 -6.2542 -1.3749
                alcohol  0.1256   0.1925  0.6528 0.5139 -0.2516  0.5029
=======================================================================

kappa = 0.334
              precision    recall  f1-score   support

           0       0.78      0.14      0.24        50
           1       0.61      0.33      0.43       174
           2       0.87      0.96      0.91      1055

    accuracy                           0.85      1279
   macro avg       0.75      0.48      0.53      1279
weighted avg       0.83      0.85      0.82      1279

Confusion matrix (rows - true, columns - predicted):
[[   7    2   41]
 [   0   57  117]
 [   2   35 1018]]

Multinomial regression (5)

Code

# Excluded:
# - citric_acid
# - density
# - log_fixed_acidity
# - log_total_sulfur_dioxide
# - log_chlorides
formula_mnl_5 = "quality_gr_num ~ " + " + ".join(
    [
        "log_sulphates",
        "log_residual_sugar",
        "log_free_sulfur_dioxide",
        "volatile_acidity",
        "ph",
        "alcohol",
    ]
)

_ = do_mnlogit(formula_mnl_5, wine_train_log)

Optimization terminated successfully.
         Current function value: 0.403161
         Iterations 8
                           Results: MNLogit
=======================================================================
Model:                MNLogit            Method:             MLE       
Dependent Variable:   quality_gr_num     Pseudo R-squared:   0.276     
Date:                 2023-07-30 16:02   AIC:                1059.2852 
No. Observations:     1279               BIC:                1131.4389 
Df Model:             12                 Log-Likelihood:     -515.64   
Df Residuals:         1265               LL-Null:            -712.31   
Converged:            1.0000             LLR p-value:        9.7571e-77
No. Iterations:       8.0000             Scale:              1.0000    
-----------------------------------------------------------------------
   quality_gr_num = 0    Coef.  Std.Err.    t    P>|t|   [0.025  0.975]
-----------------------------------------------------------------------
              Intercept 10.6176   4.2785  2.4816 0.0131  2.2319 19.0033
          log_sulphates 11.8845   2.5466  4.6669 0.0000  6.8934 16.8757
     log_residual_sugar -1.2226   1.1130 -1.0985 0.2720 -3.4039  0.9588
log_free_sulfur_dioxide  1.4828   0.6371  2.3275 0.0199  0.2342  2.7315
       volatile_acidity -7.7299   1.1044 -6.9989 0.0000 -9.8946 -5.5652
                     ph -4.5934   1.3800 -3.3286 0.0009 -7.2981 -1.8887
                alcohol  1.0818   0.2073  5.2173 0.0000  0.6754  1.4882
-----------------------------------------------------------------------
   quality_gr_num = 1    Coef.  Std.Err.    t    P>|t|   [0.025  0.975]
-----------------------------------------------------------------------
              Intercept 16.6041   3.8108  4.3571 0.0000  9.1351 24.0731
          log_sulphates  5.6195   2.3597  2.3814 0.0172  0.9945 10.2445
     log_residual_sugar -2.0352   0.9648 -2.1095 0.0349 -3.9262 -0.1442
log_free_sulfur_dioxide  2.2960   0.5667  4.0516 0.0001  1.1853  3.4066
       volatile_acidity -4.0292   0.8773 -4.5929 0.0000 -5.7486 -2.3097
                     ph -3.7697   1.2280 -3.0699 0.0021 -6.1764 -1.3629
                alcohol  0.1371   0.1875  0.7313 0.4646 -0.2304  0.5047
=======================================================================

kappa = 0.308
              precision    recall  f1-score   support

           0       0.78      0.14      0.24        50
           1       0.57      0.32      0.41       174
           2       0.86      0.96      0.91      1055

    accuracy                           0.84      1279
   macro avg       0.74      0.47      0.52      1279
weighted avg       0.82      0.84      0.81      1279

Confusion matrix (rows - true, columns - predicted):
[[   7    0   43]
 [   0   55  119]
 [   2   42 1011]]

4.4 Modelling Alcohol Content

The purpose of this section is to create a linear regression model that predicts the fraction of alcohol content in wine.

Quick pair-wise summary of how other variables relate to the fraction of alcohol content in wine is presented in the details section below. And summary table of correlation between alcohol and other variables is shown below. The correlation analysis was carried out in section “Section 4.1”.

Details: plots and summaries of variables by alcohol content

Let’s investigate relationship between alcohol content and other variables.

R code

py$wine_train_log %>%
  mutate(quality = ordered(quality)) %>%
  DescTools::Desc(. ~ alcohol, data = ., verbose = 3)

------------------------------------------------------------------------------ 
log_fixed_acidity ~ alcohol (.)

Summary: 
n pairs: 1'279, valid: 1'279 (100.0%), missings: 0 (0.0%)


Pearson corr. : -0.104
Spearman corr.: -0.077
Kendall corr. : -0.056

------------------------------------------------------------------------------ 
volatile_acidity ~ alcohol (.)

Summary: 
n pairs: 1'279, valid: 1'279 (100.0%), missings: 0 (0.0%)


Pearson corr. : -0.192
Spearman corr.: -0.214
Kendall corr. : -0.144

------------------------------------------------------------------------------ 
citric_acid ~ alcohol (.)

Summary: 
n pairs: 1'279, valid: 1'279 (100.0%), missings: 0 (0.0%)


Pearson corr. : 0.101
Spearman corr.: 0.083
Kendall corr. : 0.055

------------------------------------------------------------------------------ 
log_residual_sugar ~ alcohol (.)

Summary: 
n pairs: 1'279, valid: 1'279 (100.0%), missings: 0 (0.0%)


Pearson corr. : 0.069
Spearman corr.: 0.109
Kendall corr. : 0.076

------------------------------------------------------------------------------ 
log_chlorides ~ alcohol (.)

Summary: 
n pairs: 1'279, valid: 1'279 (100.0%), missings: 0 (0.0%)


Pearson corr. : -0.315
Spearman corr.: -0.306
Kendall corr. : -0.213

------------------------------------------------------------------------------ 
log_free_sulfur_dioxide ~ alcohol (.)

Summary: 
n pairs: 1'279, valid: 1'279 (100.0%), missings: 0 (0.0%)


Pearson corr. : -0.084
Spearman corr.: -0.080
Kendall corr. : -0.055

------------------------------------------------------------------------------ 
log_total_sulfur_dioxide ~ alcohol (.)

Summary: 
n pairs: 1'279, valid: 1'279 (100.0%), missings: 0 (0.0%)


Pearson corr. : -0.231
Spearman corr.: -0.253
Kendall corr. : -0.175

------------------------------------------------------------------------------ 
density ~ alcohol (.)

Summary: 
n pairs: 1'279, valid: 1'279 (100.0%), missings: 0 (0.0%)


Pearson corr. : -0.493
Spearman corr.: -0.461
Kendall corr. : -0.329

------------------------------------------------------------------------------ 
ph ~ alcohol (.)

Summary: 
n pairs: 1'279, valid: 1'279 (100.0%), missings: 0 (0.0%)


Pearson corr. : 0.227
Spearman corr.: 0.204
Kendall corr. : 0.144

------------------------------------------------------------------------------ 
log_sulphates ~ alcohol (.)

Summary: 
n pairs: 1'279, valid: 1'279 (100.0%), missings: 0 (0.0%)


Pearson corr. : 0.150
Spearman corr.: 0.212
Kendall corr. : 0.148

------------------------------------------------------------------------------ 
quality ~ alcohol (.)

Summary: 
n pairs: 1'279, valid: 1'279 (100.0%), missings: 0 (0.0%), groups: 6

                                                            
              3        4        5        6        7        8
mean     10.075   10.420    9.912   10.624   11.408   11.900
median   10.050   10.350    9.700   10.500   11.500   11.700
sd        0.845    0.968    0.747    1.057    0.972    1.180
IQR       0.975    1.575    0.800    1.575    1.350    1.600
n             8       42      545      510      159       15
np       0.625%   3.284%  42.611%  39.875%  12.432%   1.173%
NAs           0        0        0        0        0        0
0s            0        0        0        0        0        0
min       8.400    9.000    8.500    8.400    9.200    9.800
max      11.000   13.100   14.900   14.000   14.000   14.000
Q1        9.775    9.600    9.400    9.800   10.750   11.150
Q3       10.750   11.175   10.200   11.375   12.100   12.750
mad       0.741    1.112    0.445    1.186    1.038    1.334
skew     -0.659    0.373    1.911    0.583   -0.019   -0.168
kurt     -0.730   -0.508    5.899   -0.123   -0.586   -1.005

Kruskal-Wallis rank sum test:
  Kruskal-Wallis chi-squared = 309.38, df = 5, p-value < 2.2e-16





Proportions of quality in the quantiles of alcohol:
   
         Q1      Q2      Q3      Q4
  3    0.3%    1.2%    1.0%    0.0%
  4    2.1%    3.9%    3.7%    3.6%
  5   68.2%   54.0%   31.7%   11.9%
  6   28.2%   34.7%   49.3%   49.3%
  7    1.2%    5.6%   13.7%   31.5%
  8    0.0%    0.6%    0.7%    3.6%

------------------------------------------------------------------------------ 
quality_gr ~ alcohol (.)

Summary: 
n pairs: 1'279, valid: 1'279 (100.0%), missings: 0 (0.0%), groups: 3

                                 
            low   medium     high
mean     10.365   10.256   11.451
median   10.250   10.000   11.500
sd        0.950    0.977    0.998
IQR       1.375    1.350    1.400
n            50    1'055      174
np       3.909%  82.486%  13.604%
NAs           0        0        0
0s            0        0        0
min       8.400    8.400    9.200
max      13.100   14.900   14.000
Q1        9.625    9.500   10.800
Q3       11.000   10.850   12.200
mad       1.038    0.890    1.038
skew      0.300    1.121    0.010
kurt     -0.267    1.117   -0.564

Kruskal-Wallis rank sum test:
  Kruskal-Wallis chi-squared = 173.51, df = 2, p-value < 2.2e-16





Proportions of quality_gr in the quantiles of alcohol:
        
              Q1      Q2      Q3      Q4
  low       2.4%    5.0%    4.7%    3.6%
  medium   96.5%   88.7%   81.0%   61.3%
  high      1.2%    6.2%   14.3%   35.1%

------------------------------------------------------------------------------ 
quality_gr_num ~ alcohol (.)

Summary: 
n pairs: 1'279, valid: 1'279 (100.0%), missings: 0 (0.0%)


Pearson corr. : -0.257
Spearman corr.: -0.320
Kendall corr. : -0.261

Correlation strength between alcohol concentration and the remaining variables:

Code

corr_df_log.query("X=='alcohol' | Y == 'alcohol'")

                           X        Y      r           CI95%   p_adj
58                   density  alcohol -0.493  [-0.53, -0.45]  <0.001
65                   alcohol  quality  0.446     [0.4, 0.49]  <0.001
43             log_chlorides  alcohol -0.315  [-0.36, -0.26]  <0.001
54  log_total_sulfur_dioxide  alcohol -0.231  [-0.28, -0.18]  <0.001
61                        ph  alcohol  0.227    [0.17, 0.28]  <0.001
19          volatile_acidity  alcohol -0.192  [-0.24, -0.14]  <0.001
63             log_sulphates  alcohol  0.150      [0.1, 0.2]  <0.001
9          log_fixed_acidity  alcohol -0.104  [-0.16, -0.05]   0.004
28               citric_acid  alcohol  0.101    [0.05, 0.15]   0.006
49   log_free_sulfur_dioxide  alcohol -0.084  [-0.14, -0.03]   0.047
36        log_residual_sugar  alcohol  0.069    [0.01, 0.12]   0.182

Results:

Correlation analysis of alcohol content revealed that the strongest correlation is between density and alcohol (r = -0.49, 95% CI [-0.53, -0.45], p_adj < 0.001) as well as between alcohol and quality (r = 0.45 95% CI [0.40, 0.49], p_adj < 0.001).
This means that correlation strength is moderate so no single variable is enough for extremely accurate predictions.

4.4.1 Sequential Feature Selection: All Features

Let’s do sequential feature selection and investigate how many features might be enough.

Code

# For data without transformation
formula_full = "alcohol ~ " + " + ".join(
    [
        "total_sulfur_dioxide",
        "chlorides",
        "sulphates",
        "residual_sugar",
        "free_sulfur_dioxide",
        "volatile_acidity",
        "fixed_acidity",
        "density",
        "citric_acid",
        "ph",
        "quality",
    ]
)

Code

# For log-transformed data
formula_log_full = "alcohol ~ " + " + ".join(
    [
        "log_total_sulfur_dioxide",
        "log_chlorides",
        "log_sulphates",
        "log_residual_sugar",
        "log_free_sulfur_dioxide",
        "volatile_acidity",
        "log_fixed_acidity",
        "density",
        "citric_acid",
        "ph",
        "quality",
    ]
)

SFS results

Code

np.random.seed(251)
sfs_res1 = do_sfs_lin_reg(formula_full, wine_train)
show_sfs_results_lin_reg(sfs_res1, "(All Variables on Original Scale)")

Forward Feature Selection with 5-fold CV 
(All Variables on Original Scale)

Selected variables:
[ n = 9, avg. R² = 0.672 ]

1. sulphates
2. residual_sugar
3. free_sulfur_dioxide
4. volatile_acidity
5. fixed_acidity
6. density
7. citric_acid
8. ph
9. quality

SFS results

Code

np.random.seed(251)
sfs_res2 = do_sfs_lin_reg(formula_full, wine_train, False)
show_sfs_results_lin_reg(sfs_res2, "(All Variables on Original Scale)")

Backward Feature Selection with 5-fold CV 
(All Variables on Original Scale)

Selected variables:
[ n = 7, avg. R² = 0.667 ]

1. sulphates
2. residual_sugar
3. free_sulfur_dioxide
4. fixed_acidity
5. density
6. ph
7. quality

SFS results

Code

np.random.seed(251)
sfs_res3 = do_sfs_lin_reg(formula_log_full, wine_train_log)
show_sfs_results_lin_reg(sfs_res3, "(Log-Transformed Variables Included)")

Forward Feature Selection with 5-fold CV 
(Log-Transformed Variables Included)

Selected variables:
[ n = 7, avg. R² = 0.689 ]

1. log_sulphates
2. log_residual_sugar
3. log_free_sulfur_dioxide
4. log_fixed_acidity
5. density
6. ph
7. quality

SFS results

Code

sfs_res4 = do_sfs_lin_reg(formula_log_full, wine_train_log, False)
show_sfs_results_lin_reg(sfs_res4, "(Log-Transformed Variables Included)")

Backward Feature Selection with 5-fold CV 
(Log-Transformed Variables Included)

Selected variables:
[ n = 6, avg. R² = 0.686 ]

1. log_sulphates
2. log_residual_sugar
3. log_fixed_acidity
4. density
5. ph
6. quality

Despite the fact that algorithm suggests more, but 5-6 variables would be enough as additional features give just a slight improvement. Log-transformed features give a slightly better performance than the features on the original scale.

Code

n_features = 6

List selected features and regression performance

Code

print("All features on original scale")

All features on original scale

Code

get_sfs_performance_lin_reg(sfs_res2, n_features)

{'n_features': 6, 'mean R²': 0.664, 'median R²': 0.661, 'sd R²': 0.014}

Code

sfs_res2.get_metric_dict()[n_features]["feature_names"]

('sulphates', 'residual_sugar', 'fixed_acidity', 'density', 'ph', 'quality')

Code

print("With log-transformed features")

With log-transformed features

Code

get_sfs_performance_lin_reg(sfs_res4, n_features)

{'n_features': 6, 'mean R²': 0.686, 'median R²': 0.669, 'sd R²': 0.03}

Code

formula_selected_6 = sfs_res4.get_metric_dict()[n_features]["feature_names"]
formula_selected_6

('log_sulphates', 'log_residual_sugar', 'log_fixed_acidity', 'density', 'ph', 'quality')

Code

formula_selected_6 = "alcohol ~ " + " + ".join(formula_selected_6)

4.4.2 Sequential Feature Selection: No Quality

Quality score seems to be the most expensive feature and might be not available at prediction time. So let’s remove it from options to select features.

Code

# Formulas that exclude "quality"
formula_q = formula_full.replace(" + quality", "")
formula_log_q = formula_log_full.replace(" + quality", "")

SFS results

Code

np.random.seed(251)
sfs_res1q = do_sfs_lin_reg(formula_q, wine_train)
show_sfs_results_lin_reg(sfs_res1q, "(All Variables on Original Scale)")

Forward Feature Selection with 5-fold CV 
(All Variables on Original Scale)

Selected variables:
[ n = 7, avg. R² = 0.655 ]

1. total_sulfur_dioxide
2. sulphates
3. residual_sugar
4. fixed_acidity
5. density
6. citric_acid
7. ph

SFS results

Code

np.random.seed(251)
sfs_res2q = do_sfs_lin_reg(formula_q, wine_train, False)
show_sfs_results_lin_reg(sfs_res2q, "(All Variables on Original Scale)")

Backward Feature Selection with 5-fold CV 
(All Variables on Original Scale)

Selected variables:
[ n = 5, avg. R² = 0.647 ]

1. sulphates
2. residual_sugar
3. fixed_acidity
4. density
5. ph

SFS results

Code

np.random.seed(251)
sfs_res3q = do_sfs_lin_reg(formula_log_q, wine_train_log)
show_sfs_results_lin_reg(sfs_res3q, "(Log-Transformed Variables Included)")

Forward Feature Selection with 5-fold CV 
(Log-Transformed Variables Included)

Selected variables:
[ n = 6, avg. R² = 0.677 ]

1. log_total_sulfur_dioxide
2. log_sulphates
3. log_residual_sugar
4. log_fixed_acidity
5. density
6. ph

SFS results

Code

sfs_res4q = do_sfs_lin_reg(formula_log_q, wine_train_log, False)
show_sfs_results_lin_reg(sfs_res4q, "(Log-Transformed Variables Included)")

Backward Feature Selection with 5-fold CV 
(Log-Transformed Variables Included)

Selected variables:
[ n = 5, avg. R² = 0.674 ]

1. log_sulphates
2. log_residual_sugar
3. log_fixed_acidity
4. density
5. ph

Despite the fact that the algorithm suggests using more features, it seems that 5 features would be enough as additional features give just a slight improvement. Again, log-transformed features give a slightly better performance than the features on the original scale.

Code

n_features_q = 5

List selected features and regression performance

Code

print("All features on original scale")

All features on original scale

Code

get_sfs_performance_lin_reg(sfs_res2q, n_features_q)

{'n_features': 5, 'mean R²': 0.647, 'median R²': 0.641, 'sd R²': 0.021}

Code

sfs_res2q.get_metric_dict()[n_features_q]["feature_names"]

('sulphates', 'residual_sugar', 'fixed_acidity', 'density', 'ph')

Code

print("With log-transformed features")

With log-transformed features

Code

get_sfs_performance_lin_reg(sfs_res4q, n_features_q)

{'n_features': 5, 'mean R²': 0.674, 'median R²': 0.663, 'sd R²': 0.033}

Code

formula_selected_5 = sfs_res4q.get_metric_dict()[n_features_q]["feature_names"]
formula_selected_5

('log_sulphates', 'log_residual_sugar', 'log_fixed_acidity', 'density', 'ph')

Code

formula_selected_5 = "alcohol ~ " + " + ".join(formula_selected_5)

4.4.3 Compare Null, Full and Selected Models

Comparison of null, full and 2 selected models to predict alcohol contents.
Model	R²	AIC
Null	0.000	3779
Full	0.703	2250
6 predictors	0.692	2284
5 predictors	0.679	2336

Alcohol prediction: null model

Code

formula_null = "alcohol ~ 1"

model_null = smf.ols(formula_null, wine_train_log)
regression_model_null = model_null.fit()
print(f"R² = {regression_model_null.rsquared:.3f}")

R² = -0.000

Code

print(regression_model_null.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                alcohol   R-squared:                      -0.000
Model:                            OLS   Adj. R-squared:                 -0.000
Method:                 Least Squares   F-statistic:                       nan
Date:                Sun, 30 Jul 2023   Prob (F-statistic):                nan
Time:                        16:03:04   Log-Likelihood:                -1888.5
No. Observations:                1279   AIC:                             3779.
Df Residuals:                    1278   BIC:                             3784.
Df Model:                           0                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     10.4228      0.030    351.749      0.000      10.365      10.481
==============================================================================
Omnibus:                      126.420   Durbin-Watson:                   1.978
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              163.802
Skew:                           0.869   Prob(JB):                     2.70e-36
Kurtosis:                       3.233   Cond. No.                         1.00
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

D:\Dokumentai\Data Science\TC\Projects-and-TC-Materials\my\ds-projects\red-wine-quality\renv\conda-env\Lib\site-packages\scipy\stats\_stats_py.py:1069: DeprecationWarning: Conversion of an array with ndim > 0 to a scalar is deprecated, and will error in future. Ensure you extract a single element from your array before performing this operation. (Deprecated NumPy 1.25.)
  else dtype(mean))
D:\Dokumentai\Data Science\TC\Projects-and-TC-Materials\my\ds-projects\red-wine-quality\renv\conda-env\Lib\site-packages\scipy\stats\_stats_py.py:1069: DeprecationWarning: Conversion of an array with ndim > 0 to a scalar is deprecated, and will error in future. Ensure you extract a single element from your array before performing this operation. (Deprecated NumPy 1.25.)
  else dtype(mean))
D:\Dokumentai\Data Science\TC\Projects-and-TC-Materials\my\ds-projects\red-wine-quality\renv\conda-env\Lib\site-packages\scipy\stats\_stats_py.py:1069: DeprecationWarning: Conversion of an array with ndim > 0 to a scalar is deprecated, and will error in future. Ensure you extract a single element from your array before performing this operation. (Deprecated NumPy 1.25.)
  else dtype(mean))
D:\Dokumentai\Data Science\TC\Projects-and-TC-Materials\my\ds-projects\red-wine-quality\renv\conda-env\Lib\site-packages\scipy\stats\_stats_py.py:1069: DeprecationWarning: Conversion of an array with ndim > 0 to a scalar is deprecated, and will error in future. Ensure you extract a single element from your array before performing this operation. (Deprecated NumPy 1.25.)
  else dtype(mean))
D:\Dokumentai\Data Science\TC\Projects-and-TC-Materials\my\ds-projects\red-wine-quality\renv\conda-env\Lib\site-packages\scipy\stats\_stats_py.py:1069: DeprecationWarning: Conversion of an array with ndim > 0 to a scalar is deprecated, and will error in future. Ensure you extract a single element from your array before performing this operation. (Deprecated NumPy 1.25.)
  else dtype(mean))
D:\Dokumentai\Data Science\TC\Projects-and-TC-Materials\my\ds-projects\red-wine-quality\renv\conda-env\Lib\site-packages\scipy\stats\_stats_py.py:1069: DeprecationWarning: Conversion of an array with ndim > 0 to a scalar is deprecated, and will error in future. Ensure you extract a single element from your array before performing this operation. (Deprecated NumPy 1.25.)
  else dtype(mean))
D:\Dokumentai\Data Science\TC\Projects-and-TC-Materials\my\ds-projects\red-wine-quality\renv\conda-env\Lib\site-packages\scipy\stats\_stats_py.py:1069: DeprecationWarning: Conversion of an array with ndim > 0 to a scalar is deprecated, and will error in future. Ensure you extract a single element from your array before performing this operation. (Deprecated NumPy 1.25.)
  else dtype(mean))
D:\Dokumentai\Data Science\TC\Projects-and-TC-Materials\my\ds-projects\red-wine-quality\renv\conda-env\Lib\site-packages\scipy\stats\_stats_py.py:1069: DeprecationWarning: Conversion of an array with ndim > 0 to a scalar is deprecated, and will error in future. Ensure you extract a single element from your array before performing this operation. (Deprecated NumPy 1.25.)
  else dtype(mean))

Alcohol prediction: full model

Code

model_full = smf.ols(formula_log_full, wine_train_log)
regression_model_full = model_full.fit()
print(f"R² = {regression_model_full.rsquared:.3f}")

R² = 0.703

Code

print(regression_model_full.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                alcohol   R-squared:                       0.703
Model:                            OLS   Adj. R-squared:                  0.700
Method:                 Least Squares   F-statistic:                     272.1
Date:                Sun, 30 Jul 2023   Prob (F-statistic):          4.94e-324
Time:                        16:03:05   Log-Likelihood:                -1112.9
No. Observations:                1279   AIC:                             2250.
Df Residuals:                    1267   BIC:                             2312.
Df Model:                          11                                         
Covariance Type:            nonrobust                                         
============================================================================================
                               coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------------
Intercept                  582.5403     15.608     37.324      0.000     551.921     613.160
log_total_sulfur_dioxide    -0.1779      0.096     -1.859      0.063      -0.365       0.010
log_chlorides               -0.3171      0.136     -2.332      0.020      -0.584      -0.050
log_sulphates                1.8950      0.199      9.518      0.000       1.504       2.286
log_residual_sugar           2.8405      0.121     23.554      0.000       2.604       3.077
log_free_sulfur_dioxide     -0.0633      0.094     -0.672      0.502      -0.248       0.122
volatile_acidity             0.4518      0.120      3.768      0.000       0.217       0.687
log_fixed_acidity            9.7684      0.456     21.422      0.000       8.874      10.663
density                   -597.2859     16.092    -37.117      0.000    -628.855    -565.716
citric_acid                  0.7387      0.141      5.222      0.000       0.461       1.016
ph                           3.6857      0.171     21.577      0.000       3.351       4.021
quality                      0.1779      0.025      7.207      0.000       0.129       0.226
==============================================================================
Omnibus:                      160.987   Durbin-Watson:                   1.956
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              281.782
Skew:                           0.816   Prob(JB):                     6.48e-62
Kurtosis:                       4.621   Cond. No.                     9.89e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 9.89e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

D:\Dokumentai\Data Science\TC\Projects-and-TC-Materials\my\ds-projects\red-wine-quality\renv\conda-env\Lib\site-packages\scipy\stats\_stats_py.py:1069: DeprecationWarning: Conversion of an array with ndim > 0 to a scalar is deprecated, and will error in future. Ensure you extract a single element from your array before performing this operation. (Deprecated NumPy 1.25.)
  else dtype(mean))
D:\Dokumentai\Data Science\TC\Projects-and-TC-Materials\my\ds-projects\red-wine-quality\renv\conda-env\Lib\site-packages\scipy\stats\_stats_py.py:1069: DeprecationWarning: Conversion of an array with ndim > 0 to a scalar is deprecated, and will error in future. Ensure you extract a single element from your array before performing this operation. (Deprecated NumPy 1.25.)
  else dtype(mean))
D:\Dokumentai\Data Science\TC\Projects-and-TC-Materials\my\ds-projects\red-wine-quality\renv\conda-env\Lib\site-packages\scipy\stats\_stats_py.py:1069: DeprecationWarning: Conversion of an array with ndim > 0 to a scalar is deprecated, and will error in future. Ensure you extract a single element from your array before performing this operation. (Deprecated NumPy 1.25.)
  else dtype(mean))
D:\Dokumentai\Data Science\TC\Projects-and-TC-Materials\my\ds-projects\red-wine-quality\renv\conda-env\Lib\site-packages\scipy\stats\_stats_py.py:1069: DeprecationWarning: Conversion of an array with ndim > 0 to a scalar is deprecated, and will error in future. Ensure you extract a single element from your array before performing this operation. (Deprecated NumPy 1.25.)
  else dtype(mean))
D:\Dokumentai\Data Science\TC\Projects-and-TC-Materials\my\ds-projects\red-wine-quality\renv\conda-env\Lib\site-packages\scipy\stats\_stats_py.py:1069: DeprecationWarning: Conversion of an array with ndim > 0 to a scalar is deprecated, and will error in future. Ensure you extract a single element from your array before performing this operation. (Deprecated NumPy 1.25.)
  else dtype(mean))
D:\Dokumentai\Data Science\TC\Projects-and-TC-Materials\my\ds-projects\red-wine-quality\renv\conda-env\Lib\site-packages\scipy\stats\_stats_py.py:1069: DeprecationWarning: Conversion of an array with ndim > 0 to a scalar is deprecated, and will error in future. Ensure you extract a single element from your array before performing this operation. (Deprecated NumPy 1.25.)
  else dtype(mean))
D:\Dokumentai\Data Science\TC\Projects-and-TC-Materials\my\ds-projects\red-wine-quality\renv\conda-env\Lib\site-packages\scipy\stats\_stats_py.py:1069: DeprecationWarning: Conversion of an array with ndim > 0 to a scalar is deprecated, and will error in future. Ensure you extract a single element from your array before performing this operation. (Deprecated NumPy 1.25.)
  else dtype(mean))
D:\Dokumentai\Data Science\TC\Projects-and-TC-Materials\my\ds-projects\red-wine-quality\renv\conda-env\Lib\site-packages\scipy\stats\_stats_py.py:1069: DeprecationWarning: Conversion of an array with ndim > 0 to a scalar is deprecated, and will error in future. Ensure you extract a single element from your array before performing this operation. (Deprecated NumPy 1.25.)
  else dtype(mean))

Alcohol prediction: 6-predictor model

Model with 6 selected variables (including quality score):

Code

model_6 = smf.ols(formula_selected_6, wine_train_log)
regression_model_6 = model_6.fit()
print(f"R² = {regression_model_6.rsquared:.3f}")

R² = 0.692

Code

print(regression_model_6.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                alcohol   R-squared:                       0.692
Model:                            OLS   Adj. R-squared:                  0.691
Method:                 Least Squares   F-statistic:                     476.9
Date:                Sun, 30 Jul 2023   Prob (F-statistic):          2.99e-321
Time:                        16:03:06   Log-Likelihood:                -1134.9
No. Observations:                1279   AIC:                             2284.
Df Residuals:                    1272   BIC:                             2320.
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
======================================================================================
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
Intercept            605.6237     14.560     41.595      0.000     577.060     634.188
log_sulphates          1.8327      0.190      9.623      0.000       1.459       2.206
log_residual_sugar     2.9160      0.120     24.327      0.000       2.681       3.151
log_fixed_acidity     11.2386      0.383     29.357      0.000      10.488      11.990
density             -621.9934     14.939    -41.637      0.000    -651.300    -592.686
ph                     3.8674      0.160     24.103      0.000       3.553       4.182
quality                0.1782      0.024      7.456      0.000       0.131       0.225
==============================================================================
Omnibus:                      166.331   Durbin-Watson:                   1.945
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              294.610
Skew:                           0.834   Prob(JB):                     1.06e-64
Kurtosis:                       4.658   Cond. No.                     8.60e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 8.6e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

D:\Dokumentai\Data Science\TC\Projects-and-TC-Materials\my\ds-projects\red-wine-quality\renv\conda-env\Lib\site-packages\scipy\stats\_stats_py.py:1069: DeprecationWarning: Conversion of an array with ndim > 0 to a scalar is deprecated, and will error in future. Ensure you extract a single element from your array before performing this operation. (Deprecated NumPy 1.25.)
  else dtype(mean))
D:\Dokumentai\Data Science\TC\Projects-and-TC-Materials\my\ds-projects\red-wine-quality\renv\conda-env\Lib\site-packages\scipy\stats\_stats_py.py:1069: DeprecationWarning: Conversion of an array with ndim > 0 to a scalar is deprecated, and will error in future. Ensure you extract a single element from your array before performing this operation. (Deprecated NumPy 1.25.)
  else dtype(mean))
D:\Dokumentai\Data Science\TC\Projects-and-TC-Materials\my\ds-projects\red-wine-quality\renv\conda-env\Lib\site-packages\scipy\stats\_stats_py.py:1069: DeprecationWarning: Conversion of an array with ndim > 0 to a scalar is deprecated, and will error in future. Ensure you extract a single element from your array before performing this operation. (Deprecated NumPy 1.25.)
  else dtype(mean))
D:\Dokumentai\Data Science\TC\Projects-and-TC-Materials\my\ds-projects\red-wine-quality\renv\conda-env\Lib\site-packages\scipy\stats\_stats_py.py:1069: DeprecationWarning: Conversion of an array with ndim > 0 to a scalar is deprecated, and will error in future. Ensure you extract a single element from your array before performing this operation. (Deprecated NumPy 1.25.)
  else dtype(mean))
D:\Dokumentai\Data Science\TC\Projects-and-TC-Materials\my\ds-projects\red-wine-quality\renv\conda-env\Lib\site-packages\scipy\stats\_stats_py.py:1069: DeprecationWarning: Conversion of an array with ndim > 0 to a scalar is deprecated, and will error in future. Ensure you extract a single element from your array before performing this operation. (Deprecated NumPy 1.25.)
  else dtype(mean))
D:\Dokumentai\Data Science\TC\Projects-and-TC-Materials\my\ds-projects\red-wine-quality\renv\conda-env\Lib\site-packages\scipy\stats\_stats_py.py:1069: DeprecationWarning: Conversion of an array with ndim > 0 to a scalar is deprecated, and will error in future. Ensure you extract a single element from your array before performing this operation. (Deprecated NumPy 1.25.)
  else dtype(mean))
D:\Dokumentai\Data Science\TC\Projects-and-TC-Materials\my\ds-projects\red-wine-quality\renv\conda-env\Lib\site-packages\scipy\stats\_stats_py.py:1069: DeprecationWarning: Conversion of an array with ndim > 0 to a scalar is deprecated, and will error in future. Ensure you extract a single element from your array before performing this operation. (Deprecated NumPy 1.25.)
  else dtype(mean))
D:\Dokumentai\Data Science\TC\Projects-and-TC-Materials\my\ds-projects\red-wine-quality\renv\conda-env\Lib\site-packages\scipy\stats\_stats_py.py:1069: DeprecationWarning: Conversion of an array with ndim > 0 to a scalar is deprecated, and will error in future. Ensure you extract a single element from your array before performing this operation. (Deprecated NumPy 1.25.)
  else dtype(mean))

Alcohol prediction: null model

Model with 5 selected variables (without quality score):

Code

model_5 = smf.ols(formula_selected_5, wine_train_log)
regression_model_5 = model_5.fit()
print(f"R² = {regression_model_5.rsquared:.3f}")

R² = 0.679

Code

print(regression_model_5.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                alcohol   R-squared:                       0.679
Model:                            OLS   Adj. R-squared:                  0.678
Method:                 Least Squares   F-statistic:                     538.1
Date:                Sun, 30 Jul 2023   Prob (F-statistic):          7.72e-311
Time:                        16:03:07   Log-Likelihood:                -1162.2
No. Observations:                1279   AIC:                             2336.
Df Residuals:                    1273   BIC:                             2367.
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
======================================================================================
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
Intercept            651.0897     13.502     48.221      0.000     624.601     677.579
log_sulphates          2.3627      0.180     13.094      0.000       2.009       2.717
log_residual_sugar     3.0885      0.120     25.714      0.000       2.853       3.324
log_fixed_acidity     12.1631      0.370     32.885      0.000      11.438      12.889
density             -667.9833     13.894    -48.075      0.000    -695.242    -640.725
ph                     4.0415      0.162     24.930      0.000       3.723       4.360
==============================================================================
Omnibus:                      162.823   Durbin-Watson:                   1.958
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              288.272
Skew:                           0.819   Prob(JB):                     2.53e-63
Kurtosis:                       4.651   Cond. No.                     4.30e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.3e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

D:\Dokumentai\Data Science\TC\Projects-and-TC-Materials\my\ds-projects\red-wine-quality\renv\conda-env\Lib\site-packages\scipy\stats\_stats_py.py:1069: DeprecationWarning: Conversion of an array with ndim > 0 to a scalar is deprecated, and will error in future. Ensure you extract a single element from your array before performing this operation. (Deprecated NumPy 1.25.)
  else dtype(mean))
D:\Dokumentai\Data Science\TC\Projects-and-TC-Materials\my\ds-projects\red-wine-quality\renv\conda-env\Lib\site-packages\scipy\stats\_stats_py.py:1069: DeprecationWarning: Conversion of an array with ndim > 0 to a scalar is deprecated, and will error in future. Ensure you extract a single element from your array before performing this operation. (Deprecated NumPy 1.25.)
  else dtype(mean))
D:\Dokumentai\Data Science\TC\Projects-and-TC-Materials\my\ds-projects\red-wine-quality\renv\conda-env\Lib\site-packages\scipy\stats\_stats_py.py:1069: DeprecationWarning: Conversion of an array with ndim > 0 to a scalar is deprecated, and will error in future. Ensure you extract a single element from your array before performing this operation. (Deprecated NumPy 1.25.)
  else dtype(mean))
D:\Dokumentai\Data Science\TC\Projects-and-TC-Materials\my\ds-projects\red-wine-quality\renv\conda-env\Lib\site-packages\scipy\stats\_stats_py.py:1069: DeprecationWarning: Conversion of an array with ndim > 0 to a scalar is deprecated, and will error in future. Ensure you extract a single element from your array before performing this operation. (Deprecated NumPy 1.25.)
  else dtype(mean))
D:\Dokumentai\Data Science\TC\Projects-and-TC-Materials\my\ds-projects\red-wine-quality\renv\conda-env\Lib\site-packages\scipy\stats\_stats_py.py:1069: DeprecationWarning: Conversion of an array with ndim > 0 to a scalar is deprecated, and will error in future. Ensure you extract a single element from your array before performing this operation. (Deprecated NumPy 1.25.)
  else dtype(mean))
D:\Dokumentai\Data Science\TC\Projects-and-TC-Materials\my\ds-projects\red-wine-quality\renv\conda-env\Lib\site-packages\scipy\stats\_stats_py.py:1069: DeprecationWarning: Conversion of an array with ndim > 0 to a scalar is deprecated, and will error in future. Ensure you extract a single element from your array before performing this operation. (Deprecated NumPy 1.25.)
  else dtype(mean))
D:\Dokumentai\Data Science\TC\Projects-and-TC-Materials\my\ds-projects\red-wine-quality\renv\conda-env\Lib\site-packages\scipy\stats\_stats_py.py:1069: DeprecationWarning: Conversion of an array with ndim > 0 to a scalar is deprecated, and will error in future. Ensure you extract a single element from your array before performing this operation. (Deprecated NumPy 1.25.)
  else dtype(mean))
D:\Dokumentai\Data Science\TC\Projects-and-TC-Materials\my\ds-projects\red-wine-quality\renv\conda-env\Lib\site-packages\scipy\stats\_stats_py.py:1069: DeprecationWarning: Conversion of an array with ndim > 0 to a scalar is deprecated, and will error in future. Ensure you extract a single element from your array before performing this operation. (Deprecated NumPy 1.25.)
  else dtype(mean))

4.4.4 Model Diagnostics and Feature Importance

Create mean center and standard deviation scaled data for feature importance analysis.

Code

wine_train_log_num = wine_train_log.select_dtypes(include="number")

scaler_log = StandardScaler().fit(wine_train_log_num)
wine_train_log_scaled = scaler_log.transform(wine_train_log_num)
wine_train_log_scaled = pd.DataFrame(
    wine_train_log_scaled, columns=wine_train_log_num.columns
)

Feature importance in the model with 6 selected variables:

Code

model_6_scaled = smf.ols(formula_selected_6, wine_train_log_scaled)
regression_model_6_scaled = model_6_scaled.fit()
diagnostics_6_scaled = Linear_Reg_Diagnostic(regression_model_6_scaled)
(
    pd.concat(
        [
            diagnostics_6_scaled.vif_table().set_index("Features"),
            regression_model_6_scaled.params.rename(
                "standardized_coefficients"
            ),
        ],
        axis=1,
    ).sort_values("standardized_coefficients", key=abs, ascending=False)
)

                    VIF Factor  standardized_coefficients
density                  2.910                     -1.105
log_fixed_acidity        4.050                      0.919
ph                       2.220                      0.559
log_residual_sugar       1.320                      0.435
log_sulphates            1.220                      0.165
quality                  1.370                      0.136
Intercept                1.000                     -0.000

Quality is the least important feature in this set while density is the most important. VIF value for log_fixed_acidity is 4.05. Some authors claim that VIF > 4 indicate multicollinearity. Other sources suggest VIF > 5 to indicate multicollinearity.

Feature importance in the model with 5 selected variables (without quality score):

Code

model_5_scaled = smf.ols(formula_selected_5, wine_train_log_scaled)
regression_model_5_scaled = model_5_scaled.fit()
diagnostics_5_scaled = Linear_Reg_Diagnostic(regression_model_5_scaled)
(
    pd.concat(
        [
            diagnostics_5_scaled.vif_table().set_index("Features"),
            regression_model_5_scaled.params.rename(
                "standardized_coefficients"
            ),
        ],
        axis=1,
    ).sort_values("standardized_coefficients", key=abs, ascending=False)
)

                    VIF Factor  standardized_coefficients
density                  2.420                     -1.187
log_fixed_acidity        3.630                      0.995
ph                       2.180                      0.584
log_residual_sugar       1.270                      0.461
log_sulphates            1.050                      0.213
Intercept                1.000                     -0.000

Again, density is the most important feature. And no multicollinearity here as all VIF < 4.

Important.

In models with 5 and 6 features, R² is respectively 0.679 and 0.692. The difference is small and the 6-th variable quality least important. Looking from the practical point of view, quality score is the most subjective and hardest to get feature as a trained expert is required and the result may differ from expert to expert. So the model with 5 variables is selected as the final model.

Now let’s do some model diagnostics by analyzing model residuals.

Code

diagnostics_5 = Linear_Reg_Diagnostic(regression_model_5)

Distribution of residuals:

Code

plt.clf()
ax = regression_model_5.resid.plot.density()
ax.set_xlabel("Residuals")
plt.axvline(x=0, color="darkred")

<matplotlib.lines.Line2D object at 0x000001F41DBA41D0>

Code

plt.show()

It seems that residuals are slightly skewed and their mean shifted to the side of negative values but the shift is insignificant (see results below).

Do some formal assumption checking:

Is mean of residuals equal to 0?

Code

_, p_ttest = ttest_1samp(regression_model_5.resid, 0)
print(f"t test that mean = 0, p: {format_p0(p_ttest)}")

t test that mean = 0, p: >0.999

Are residuals normally distributed?

Code

pg.normality(regression_model_5.resid)

      W  pval  normal
0 0.964 0.000   False

Are residuals homoscedastic?

Code

_, p_bp, _, _ = sms.het_breuschpagan(
    regression_model_5.resid, regression_model_5.model.exog
)
print(f"Breusch-Pagan test for heteroscedasticity, p: {format_p0(p_bp)}")

Breusch-Pagan test for heteroscedasticity, p: <0.001

are there any outliers?

Code

(regression_model_5.get_influence().cooks_distance[0] > 1).any()

False

Let’s look to the plots of residuals.

Code

diagnostics_5()

The analysis of residuals revealed no outliers and the assumption that mean of residuals equals to 0 is met. But it seems that normality and homoscedasticity assumptions are at least slightly violated.

(Slightly) violated normality and homoscedasticity assumptions may have influence on formal inference about linear models, e.g., may lead to too large confidence intervals. To cope with heteroscedasticity, more robust heteroskedasticity-consistent (HC) estimators may be used. On the other hand, these violations should not have much influence on prediction accuracy.

4.5 Test Final Models

4.5.1 Preprocess Test Set

Pre-processing of test data is presented there to clarify, which are the most important pre-processing steps.

Code

wine_test = wine_test.clean_names()

skewed_vars = [
    "chlorides",
    "residual_sugar",
    "sulphates",
    "total_sulfur_dioxide",
    "free_sulfur_dioxide",
    "fixed_acidity",
]

new_names_map = {i: "log_" + str(i) for i in skewed_vars}

wine_test_log = wine_test.transform_columns(skewed_vars, np.log10).rename(
    new_names_map, axis=1
)

wine_test_log["quality_gr_num"] = wine_test_log["quality"].apply(
    lambda x: to_3_quality_groups(x, numeric=True)
)

4.5.2 Alcohol Content Prediction

Linear Regression Model to predict alcohol content.

In test set, RMSE is slightly higher, which is expected. R² score is the same (R² = 0.68). This result show that model generalizes quite well. We can interpret, that on average, true value will differ by 0.617 % vol. from the true value of alcohol content.

Code

y_train_true = wine_train_log.alcohol
y_train_pred = regression_model_5.predict(wine_train_log)

y_test_true = wine_test_log.alcohol
y_test_pred = regression_model_5.predict(wine_test_log)

pd.concat(
    [
        get_regression_performance(y_train_true, y_train_pred, "training"),
        get_regression_performance(y_test_true, y_test_pred, "test"),
    ]
)

        set     n    SD  RMSE    R²  RMSE_SD_ratio  SD_RMSE_ratio
0  training  1279 1.059 0.600 0.679          0.567          1.764
0      test   320 1.089 0.617 0.679          0.566          1.766

4.5.3 Wine Quality Class Prediction

The performance of Multinomial Logistic Regression model that predicts wine quality in training set is kappa = 0.334, in test set kappa = 0.333. The difference is negligible, but performance is poor. Prediction value (precision) in test set for high quality wines is 0.71, for medium quality group is 0.86, but none of low quality wines was predicted correctly. This shows that the selected model is not suitable for this class identification. The reasons might be class imbalance and here techniques like regularization or other type of model (e.g., tree-based model) might help to achieve better performance. A more advanced feature engineering might also be helpful (e.g., merging “low” and “medium” classes).

Training set:

Code

true_train = wine_train_log["quality_gr_num"]
pred_train = predict_class_mnlogit(
    classification_model_selected, wine_train_log
)
print_classification_report(true_train, pred_train)

kappa = 0.334
              precision    recall  f1-score   support

           0       0.78      0.14      0.24        50
           1       0.61      0.33      0.43       174
           2       0.87      0.96      0.91      1055

    accuracy                           0.85      1279
   macro avg       0.75      0.48      0.53      1279
weighted avg       0.83      0.85      0.82      1279

Confusion matrix (rows - true, columns - predicted):
[[   7    2   41]
 [   0   57  117]
 [   2   35 1018]]

Test set:

Code

true_test = wine_test_log["quality_gr_num"]
pred_test = predict_class_mnlogit(classification_model_selected, wine_test_log)

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    print_classification_report(true_test, pred_test)

kappa = 0.333
              precision    recall  f1-score   support

           0       0.00      0.00      0.00        13
           1       0.71      0.35      0.47        43
           2       0.86      0.98      0.92       264

    accuracy                           0.85       320
   macro avg       0.53      0.44      0.46       320
weighted avg       0.81      0.85      0.82       320

Confusion matrix (rows - true, columns - predicted):
[[  0   0  13]
 [  0  15  28]
 [  0   6 258]]

5 Conclusions

Alcohol contents prediction:

Linear regression model to predict alcohol content was created.
The performance of the model is R² = 0.68, RMSE = 0.617.
Predictor variables in descending order by importance: density, fixed acidity, pH, residual sugar, and sulphates.
Some linear model assumptions were violated so further statistical inference (e.g., calculation of confidence intervals) was not performed.
After logarithmic transformation of right-skewed variables, model showed better performance.

Wine quality class prediction:

Multinomial logistic regression model to identify low, medium and high quality wines was created.
Predictor variables are chlorides, sulphates, residual sugar, free sulfur dioxide, volatile acidity, pH, and alcohol.
Overall performance of the model is not good: kappa (test set) = 0.333.
Predictive value (precision) for high quality wines is 0.71, for medium quality group is 0.86. Unfortunately, none of low quality wines was predicted correctly. This suggest that there is a room for model improvement (e.g., cope with class imbalance or use another type wine quality grouping) which can be the scope of the future studies.

6 Looker Studio Dashboard

A part of this project was to create Looker Studio dashboard. A dashboard to explore some properties of red wine were introduced. You can find a print screen of the dashboard and a link to it below.

Link to the dashboard.

7 Limitations and Suggestions on Improvement

Class-imbalance in classification task should be addressed better (by using regularization, non-linear models, etc.).
More feature engineering can be introduced into analysis (e.g., merging “low” and “medium” wine quality groups).
In PCA, table or plot with feature contribution score could be added.
For .qmd files, there is no Python code linter and formatter yet (or I’m not aware of one). So there might be style issues left. On the other hand, Quarto enforces leaving no critical code issues as otherwise HTML output is not created.
Results in different sections should be presented in a more consistent manner.
More detailed written explanations could be added to some sections (e.g., to the section on PCA).
If thinking about production pipelines, it would be easier to maintain code if it would be written in a single language (either R or Python). Bilingual code requires an expert, who knows both languages.

References

Cortez, Paulo, António Cerdeira, Fernando Almeida, Telmo Matos, and José Reis. 2009. “Modeling Wine Preferences by Data Mining from Physicochemical Properties.” Decision Support Systems 47 (4): 547–53. https://doi.org/10.1016/j.dss.2009.05.016.

“Red Wine Quality — Kaggle.com.” https://www.kaggle.com/datasets/uciml/red-wine-quality-cortez-et-al-2009/versions/2.

“Wine Quality Datasets — www3.dsi.uminho.pt.” http://www3.dsi.uminho.pt/pcortez/wine/.