Red Wine Quality Prediction
Data Analysis Project
Data analysis tools: Python, R, Looker Studio
Helper tools: Quarto, RStudio, Git
Skills:
- data pre-processing
- exploratory data analysis (EDA):
- descriptive statistics
- data visualization
- exploratory PCA (principal component analysis)
- inferential statistics:
- hypothesis testing
- confidence intervals
- statistical modelling:
- linear regression
- logistic regression
- literate programming
- statistical programming
- dashboarding
Technical requirements:
- Use a combination of R and Python programming languages in a single document.
- Create a dashboard in Looker Studio.
Abbreviations
- AIC – Akaike information criterion.
- CI – 95% confidence interval.
- CV – cross-validation.
- Dim – dimension (principal component, the same as PC).
- EDA – exploratory data analysis.
- n – either sample or group size.
- p – p-value.
- p_adj – p-value (adjusted).
- PCA – principal component analysis.
- PC – principal component.
- r – correlation coefficient.
- R² – r squared, coefficient of determination.
- RNG – (pseudo)random number generator.
- RMSE – root mean squared error.
- SFS – sequential feature selection.
- VIF – variance inflation factor.
1 Introduction
Wine quality assessment and certification are important for wine production and sales. Certification helps to prevent counterfeiting and in this way protects people's health as well as assures quality for the wine market. Identifying the most influential factors is crucial for effective quality evaluation, while classifying wines into quality groups is useful for setting prices (Cortez et al. 2009).
Vinho verde is wine from the Minho region, which is located in the northwest of Portugal. In this project, data on red vinho verde wine specimens are investigated. The dataset was originally provided by Cortez et al. (2009). This project mainly focuses on:
- prediction (to investigate how accurately (1) wine quality and (2) alcohol content in wine can be predicted) and
- explanation (to identify which factors are the most important for the prediction tasks).
1.1 The Dataset
Various aspects of the dataset are described in several sources. The description below was compiled from the article by Cortez et al. (2009) as well as from the websites ("Red Wine Quality — Kaggle.com"; "Wine Quality Datasets — www3.dsi.uminho.pt").
Variables based on physicochemical tests:
- fixed acidity (\(g_{\textrm{ tartaric acid}}/l\)): most acids involved with wine are fixed or nonvolatile (they do not evaporate readily).
- volatile acidity (\(g_{\textrm{ acetic acid}}/l\)): the amount of acetic acid in wine, which at too high levels can lead to an unpleasant, vinegar taste.
- citric acid (\(g/l\)): found in small quantities, citric acid can add ‘freshness’ and flavor to wines.
- residual sugar (\(g/l\)): the amount of sugar remaining after fermentation stops; it is rare to find wines with less than 1 g/l, and wines with more than 45 g/l are considered sweet.
- chlorides (\(g_{\textrm{ sodium chloride}}/l\)): the amount of salt in the wine.
- free sulfur dioxide (\(mg/l\)): the free form of SO₂ exists in equilibrium between molecular SO₂ (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine.
- total sulfur dioxide (\(mg/l\)): amount of free and bound forms of SO₂; in low concentrations, SO₂ is mostly undetectable in wine, but at free SO₂ concentrations over 50 ppm, SO₂ becomes evident in the nose and taste of wine.
- density (\(g/cm^3\)): the density of wine is close to that of water, depending on the percent alcohol and sugar content.
- pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale.
- sulphates: a wine additive which can contribute to sulfur dioxide gas (SO₂) levels, which acts as an antimicrobial and antioxidant.
- alcohol (% vol.): the percent alcohol content of the wine.
Variable based on sensory data:
- quality (score between 0 and 10).
2 Methods
2.1 R and Python in a Single Analysis
RStudio is an IDE for R and Python. In Quarto (.qmd) and R Markdown (.rmd) documents, RStudio supports an interface that allows using R and Python in a single document (e.g., we can access Python objects in R code cells via the py$<python_object_name> accessor and R objects in Python code cells via the r.<r_object_name> accessor). So additional technical tasks for this project were to:
- Try RStudio's capabilities for Python programming (documentation, data analysis, plotting, etc.).
- Try RStudio's interface between R and Python: use these 2 tools in the same file (a minimal illustration follows below).
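For illustration, a minimal hypothetical sketch of this interface. It runs only inside an RStudio/Quarto session with reticulate, and wine_r is a toy object assumed to be created in an R cell above (wine_r <- data.frame(alcohol = c(9.4, 9.8), quality = c(5, 6))):
Code
# Python cell: reticulate exposes R objects through the `r` accessor
wine_py = r.wine_r                # R data.frame arrives as a pandas DataFrame
print(wine_py["alcohol"].mean())  # use it like any other DataFrame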
The main principles for dividing the work between R and Python in this project:
- Python is used for all steps directly related to putting the model into production.
- Some exploratory analysis steps and other tasks that are easier to perform in R are done in R.
2.2 Population and Sample
The population of interest is red vinho verde wines. It is assumed that the data is a simple random sample or its equivalent that allows using methods of inferential statistics.
2.3 Training and Test Sets
As a prediction step will be included in the analysis, the whole sample is divided into model training and test sets using an 80:20 train/test hold-out strategy with stratification by wine quality scores.
The training set was used for:
- exploratory analysis;
- statistical inference;
- model creation and selection.
The test set was used only to evaluate the performance of the final models.
2.4 General: Inferential Statistics
For statistical inference, significance level \(\alpha=0.05\) and 95% confidence level are used.
2.5 Differences Between Groups
For statistical inference about several groups (categories), the following strategy is used:
- first, 95% confidence intervals (CI) for each group are calculated;
- second, an omnibus test is performed.
In the analysis of counts (sizes of wine quality groups), the following methods are used (a minimal sketch follows this list):
- confidence intervals: Goodman's simultaneous confidence intervals for proportions;
- omnibus test: Pearson's chi-squared (χ²) goodness-of-fit (GOF) test with hypotheses:
\(H_0:\) all proportions are equal: \(~ \pi_1 = \pi_2 = ... = \pi_i = ... = \pi_k\),
\(H_1:\) at least two proportions differ: \(\pi_i \ne \pi_j\) for at least one pair of \(i\) and \(j\),
where \(i\) and \(j\) are group indices (\(i\) = 1, 2, …, \(k\); \(j\) = 1, 2, …, \(k\); \(i \ne j\)) and \(k\) is the total number of groups.
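A minimal sketch of this two-step strategy in Python; the counts are the low/medium/high group sizes of the full sample reported in Section 3.3:
Code
import scipy.stats as sps
import statsmodels.stats.api as sms

counts = [63, 1319, 217]  # wine quality group sizes (low, medium, high)

# Step 1: Goodman's simultaneous confidence intervals for proportions
print(sms.multinomial_proportions_confint(counts, method="goodman"))

# Step 2: omnibus Pearson's χ² GOF test against equal proportions
print(sps.chisquare(f_obs=counts))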
2.6 Correlation Between Variables
For relationships between numeric variables, parametric Pearson's correlation analysis is used.
Hypotheses:
\(H_0:\) \(\rho = 0\) (variables do not correlate),
\(H_1:\) \(\rho \ne 0\) (variables correlate).
Here \(\rho\) (rho) is the population Pearson correlation coefficient.
As hypotheses are tested multiple times in statistical significance testing, the Holm correction is applied to prevent an inflated type I error rate (a small sketch is shown below).
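A small sketch of how the Holm correction works, with illustrative p values (not the project's results):
Code
from statsmodels.stats.multitest import multipletests

pvals = [0.001, 0.020, 0.040, 0.300]
reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="holm")
print(p_adj)   # Holm-adjusted p values: 0.004, 0.06, 0.08, 0.3
print(reject)  # only the first hypothesis is rejected at α = 0.05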
2.7 Classification Task
To predict wine quality, quality scores were merged into 3 groups ("low", "medium", "high"), and this variable was modeled using multinomial logistic regression. For feature selection, a strategy based on statistical inference is applied. The main measure to evaluate overall classification performance is Cohen's kappa (κ), illustrated below.
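A toy illustration of Cohen's kappa on hypothetical quality-group labels (not the project's data):
Code
from sklearn.metrics import cohen_kappa_score

y_true = ["medium", "high", "medium", "low", "medium"]
y_pred = ["medium", "medium", "medium", "low", "high"]
print(f"kappa = {cohen_kappa_score(y_true, y_pred):.3f}")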
2.8 Regression Task
To predict alcohol content, linear regression is used. For feature selection, sequential feature selection (SFS) with 5-fold cross-validation was used. The main metric for model performance was R²; for the final evaluation, RMSE was included too. Both metrics are illustrated below.
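A toy illustration of both regression metrics on hypothetical alcohol values (not the project's data):
Code
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([9.4, 9.8, 10.0, 11.2])
y_pred = np.array([9.5, 9.6, 10.3, 11.0])

rmse = np.sqrt(mean_squared_error(y_true, y_pred))
print(f"RMSE = {rmse:.3f}, R² = {r2_score(y_true, y_pred):.3f}")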
2.9 Feature Engineering
In some cases, the data were right-skewed. Such variables were investigated both on the original scale (not transformed) and after a logarithmic transformation (see the sketch below).
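A small sketch showing how a log-transformation reduces the skewness of a synthetic right-skewed variable (the lognormal sample here is illustrative only):
Code
import numpy as np
import pandas as pd

rng = np.random.default_rng(100)
x = pd.Series(rng.lognormal(mean=0, sigma=1, size=1000))
print(f"skewness: original = {x.skew():.2f}, after log10 = {np.log10(x).skew():.2f}")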
3 Preparation and Inspection
3.1 Setup
Both R and Python will be used in this project. R code is marked with blue and Python code with yellow stripes.
Code: R setup
# R setup ================================================
# R packages ---------------------------------------------
library(tidyverse)
library(reticulate)
library(factoextra)
library(DescTools)
library(patchwork)
# R options ----------------------------------------------
options(scipen = 5)
DescToolsOptions(stamp = NULL, digits = 3)
# Default ggplot2 theme
theme_set(theme_bw())
# Other options
knitr::opts_chunk$set(fig.height = 3.5, fig.width = 6, fig.align = "center")
# RNG state
set.seed(100)
Code: Python setup
# Python setup =========================================
# Packages and modules ---------------------------------
# Utilities
import warnings
# Data wrangling, math
import numpy as np
import pandas as pd
import janitor
from numpy import log
# Statistical analysis and modeling
import patsy
import scipy.stats as sps
import statsmodels.api as sm
import statsmodels.stats.api as sms
import statsmodels.formula.api as smf
import pingouin as pg
import scikit_posthocs as sp
from statsmodels.stats.outliers_influence import variance_inflation_factor
from scipy.stats import ttest_1samp
# Machine learning
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error as mse, r2_score
from sklearn.metrics import cohen_kappa_score, classification_report
from sklearn.metrics import confusion_matrix, balanced_accuracy_score
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from mlxtend.feature_selection import SequentialFeatureSelector
# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Python Settings -----------------------------------------
# Default plot options
plt.rc("figure", titleweight="bold")
plt.rc("axes", labelweight="bold", titleweight="bold")
plt.rc("font", weight="normal", size=10)
plt.rc("figure", figsize=(7, 3))

# Pandas options
pd.set_option("display.max_rows", 1000)
pd.set_option("display.max_columns", 20)
pd.set_option("display.max_colwidth", 40)  # Possible option: None
pd.set_option("display.float_format", lambda x: f"{x:.3f}")
pd.set_option("styler.format.thousands", ",")

# RNG state
np.random.seed(100)
Code: Ad-hoc Python functions
# Ad-hoc Python functions
def format_p0(p):
    """Format p values at 3 decimal places.

    Args:
        p (float): p value (number between 0 and 1).
    """
    if p < 0.001:
        return "<0.001"
    elif p > 0.999:
        return ">0.999"
    else:
        return f"{p:.3f}"
# Inferential statistics
def ci_proportion_multinomial(
    counts,
    method: str = "goodman",
    n_label: str = "n",
    percent_label: str = "percent",
) -> pd.DataFrame:
    """Calculate simultaneous confidence intervals for multinomial proportions.

    More information in the documentation of statsmodels'
    multinomial_proportions_confint.

    Args:
        counts: pd.Series, list or tuple with count data.
        method (str, optional): Method. Defaults to "goodman".
        n_label (str, optional): Name of the column for counts.
        percent_label (str, optional): Name of the column for percentages.

    Returns:
        pd.DataFrame: counts, percentages, and confidence limits (in %).

    Examples:
        >>> ci_proportion_multinomial([62, 33, 55])
    """
    assert type(counts) in [pd.Series, list, tuple]
    if type(counts) is not pd.Series:
        counts = pd.Series(counts)

    return pd.concat(
        [
            counts.rename(n_label),
            (counts / sum(counts)).rename(percent_label) * 100,
            pd.DataFrame(
                sms.multinomial_proportions_confint(counts, method=method),
                index=counts.index,
                columns=["ci_lower", "ci_upper"],
            )
            * 100,
        ],
        axis=1,
    )
def test_chi_square_gof(f_obs, f_exp="all equal") -> str:
    """Chi-squared (χ²) goodness-of-fit (GOF) test.

    Args:
        f_obs (list[int]): Observed frequencies.
        f_exp (str, list[int]): List of expected frequencies, or "all equal"
            if all expected frequencies are equal to the mean of the observed
            frequencies. Defaults to "all equal".

    Returns:
        str: formatted test results including the p value.
    """
    k = len(f_obs)
    n = sum(f_obs)
    exp = n / k
    dof = k - 1
    if f_exp == "all equal":
        f_exp = [exp for _ in range(k)]
    stat, p = sps.chisquare(f_obs=f_obs, f_exp=f_exp)
    # May also be formatted this way:
    return (
        "Chi square test, "
        f"χ²({dof}, n = {n}) = {round(stat, 2)}, p {format_p0(p)}"
    )
def pairwise_correlation(data, method="pearson"):
    """Correlation analysis (including p value calculation).

    Args:
        data (pandas.DataFrame): dataset with numeric variables.
        method (str): correlation method, e.g., "pearson" or "spearman".

    Returns:
        pandas.DataFrame with the results of correlation analysis.
    """
    return (
        pg.pairwise_corr(data, method=method, padjust="holm")
        .sort_values(by="r", key=abs, ascending=False)
        .transform_column("p-corr", format_p0, "p_adj")
        .select_columns(["X", "Y", "r", "CI95%", "p_adj"])
    )
def to_3_quality_groups(score, numeric=False):
    """Recode a wine quality score into one of 3 quality groups:

    - score ≤ 4: low,
    - score 5-6: medium,
    - score ≥ 7: high.

    Args:
        score (int, float): numeric value from 0 to 10.
        numeric (bool): if True, the output is dedicated to multinomial
            logistic regression: low = 0, medium = 2 (reference group),
            high = 1.
    """
    assert 0 <= score <= 10
    score = int(score)

    if numeric:
        if score <= 4:
            return 0
        elif 5 <= score <= 6:
            # Used as a reference group
            return 2
        elif 7 <= score:
            return 1
    else:
        if score <= 4:
            return "low"
        elif 5 <= score <= 6:
            return "medium"
        elif 7 <= score:
            return "high"
# Functions for regression/classification
def do_sfs_lin_reg(formula, data, forward=True):
    """Do sequential feature selection (SFS) for linear regression.

    Args:
        formula (str): R-style formula.
        data (pandas.DataFrame): dataset.
        forward (bool): True – forward selection,
            False – backward selection.
    """
    lin_regression = LinearRegression(fit_intercept=True)

    sfs = SequentialFeatureSelector(
        lin_regression,
        k_features="parsimonious",
        forward=forward,
        floating=True,
        scoring="r2",
        verbose=0,
        cv=5,
    )

    y, X = patsy.dmatrices(formula + "+ 0", data, return_type="dataframe")

    return sfs.fit(X, y)
def get_sfs_performance_lin_reg(sfs_object, n_features):
    """Return performance measures for a certain number of features.

    Args:
        sfs_object: result of function do_sfs_lin_reg().
        n_features (int): number of features.
    """
    md = round(
        np.median(sfs_object.get_metric_dict()[n_features]["cv_scores"]), 3
    )
    return {
        "n_features": n_features,
        "mean R²": round(
            sfs_object.get_metric_dict()[n_features]["avg_score"], 3
        ),
        "median R²": md,
        "sd R²": round(sfs_object.get_metric_dict()[n_features]["std_dev"], 3),
    }
def show_sfs_results_lin_reg(sfs_object, sub_title=""):
    """Show results of do_sfs_lin_reg().

    Args:
        sfs_object: result of function do_sfs_lin_reg().
        sub_title (str): second line of title.
    """
    if sfs_object.forward:
        sfs_type = "Forward"
    else:
        sfs_type = "Backward"

    plt.clf()
    fig, ax = plt.subplots(1, 2)

    # Average cross-validated R² for each number of selected features
    avg_score = [(i, c["avg_score"]) for i, c in sfs_object.subsets_.items()]

    (
        pd.DataFrame(
            avg_score, columns=["n_features", "avg_score"]
        ).plot.scatter(
            x="n_features",
            y="avg_score",
            xlabel="Number of features included",
            ylabel="Average R²",
            ylim=(0.1, 0.8),
            ax=ax[0],
        )
    )

    # Distribution of CV scores for each number of selected features
    cv_scores = {i: c["cv_scores"] for i, c in sfs_object.subsets_.items()}

    (
        pd.DataFrame(cv_scores).plot.box(
            xlabel="Number of features included",
            ylabel="R²",
            ylim=(0.1, 0.8),
            ax=ax[1],
        )
    )

    if not sfs_object.forward:
        ax[1].invert_xaxis()

    main_title = (
        f"{sfs_type} Feature Selection with {sfs_object.cv}-fold CV "
        + f"\n{sub_title}"
    )
    fig.suptitle(main_title)
    plt.tight_layout()
    plt.show()

    # Print results
    n_selected = f"n = {len(sfs_object.k_feature_names_)}"
    r_sq_selected = f"avg. R² = {sfs_object.k_score_:.3f}"

    print(main_title)
    print("\nSelected variables:")
    print(f"[ {n_selected}, {r_sq_selected} ]\n")
    for i, name in enumerate(sfs_object.k_feature_names_, start=1):
        print(f"{i}. {name}")
def get_regression_performance(y_true, y_pred, name=""):
    """Evaluate regression model performance.

    Calculate R², RMSE, and SD of the predicted variable.

    Args:
        y_true, y_pred: true and predicted numeric values.
        name (str): the name of the investigated set.
    """
    return (
        pd.DataFrame(
            {
                "set": name,
                "n": len(y_true),
                "SD": [float(np.std(y_true))],
                "RMSE": [float(np.sqrt(mse(y_true, y_pred)))],
                "R²": [r2_score(y_true, y_pred)],
            }
        )
        .eval("RMSE_SD_ratio = RMSE/SD")
        .eval("SD_RMSE_ratio = SD/RMSE")
    )
def predict_class_mnlogit(model, data):
    """Predict classes from a multinomial logit (mnlogit) model.

    Args:
        model: trained mnlogit model.
        data (pandas.DataFrame): dataset to predict on.

    Returns:
        Numeric class codes (the index of the most probable class).
    """
    return np.asarray(model.predict(data)).argmax(1)
def print_classification_report(true_class, predicted_class):
    """Print a summary of classification performance.

    Args:
        true_class, predicted_class: data sequences of the same length
            with class names/indicators.
    """
    kappa = cohen_kappa_score(true_class, predicted_class)
    print(f"kappa = {kappa:.3f}")
    print(classification_report(true_class, predicted_class))
    print("Confusion matrix (rows - true, columns - predicted):")
    print(confusion_matrix(true_class, predicted_class))
def do_mnlogit(formula, data):
    """Perform multinomial logistic regression and print the results.

    Args:
        formula (str): model formula.
        data (pandas.DataFrame): dataset.
    """
    response_var = formula.split("~")[0].strip()

    # Do multinomial logistic regression
    mnlogit_model = smf.mnlogit(formula, data).fit()

    # Evaluate performance
    print(mnlogit_model.summary2())

    # Classification report
    gr_pred = predict_class_mnlogit(mnlogit_model, data)
    gr_true = data[response_var]

    print_classification_report(gr_true, gr_pred)

    # Output
    return mnlogit_model
Code: Python functions by other authors
# Class for linear regression diagnostics with R-style diagnostic plots
# NOTE: This class for linear regression assumption checking should be in
# a separate file, but RStudio failed to load custom Python modules.
# Source:
# https://www.statsmodels.org/dev/examples/notebooks/generated/linear_regression_diagnostics_plots.html
# Base code
import numpy as np
import seaborn as sns
import statsmodels
from statsmodels.tools.tools import maybe_unwrap_results
from statsmodels.graphics.gofplots import ProbPlot
from statsmodels.stats.outliers_influence import variance_inflation_factor
import matplotlib.pyplot as plt
from typing import Type
= "seaborn-talk" # refer to plt.style.available
style_talk
class Linear_Reg_Diagnostic:
    """
    Diagnostic plots to identify potential problems in a linear regression fit.
    Mainly,
        a. non-linearity of data
        b. correlation of error terms
        c. non-constant variance
        d. outliers
        e. high-leverage points
        f. collinearity

    Author:
        Prajwal Kafle (p33ajkafle@gmail.com, where 3 = r)
        Does not come with any sort of warranty.
        Please test the code on your end before using.
    """
    def __init__(
        self,
        results: Type[
            statsmodels.regression.linear_model.RegressionResultsWrapper
        ],
    ) -> None:
        """
        For a linear regression model, generates the following diagnostic
        plots:
            a. residual
            b. qq
            c. scale location and
            d. leverage
        and a table
            e. vif

        Args:
            results: must be an instance of
                statsmodels.regression.linear_model.RegressionResultsWrapper

        Raises:
            TypeError: if the instance does not belong to the above object

        Example:
            >>> import numpy as np
            >>> import pandas as pd
            >>> import statsmodels.formula.api as smf
            >>> x = np.linspace(-np.pi, np.pi, 100)
            >>> y = 3*x + 8 + np.random.normal(0, 1, 100)
            >>> df = pd.DataFrame({'x': x, 'y': y})
            >>> res = smf.ols(formula="y ~ x", data=df).fit()
            >>> cls = Linear_Reg_Diagnostic(res)
            >>> cls(plot_context="seaborn-paper")

            In case you do not need all plots, you can also independently
            make an individual plot/table in the following ways:

            >>> cls = Linear_Reg_Diagnostic(res)
            >>> cls.residual_plot()
            >>> cls.qq_plot()
            >>> cls.scale_location_plot()
            >>> cls.leverage_plot()
            >>> cls.vif_table()
        """
        if (
            isinstance(
                results,
                statsmodels.regression.linear_model.RegressionResultsWrapper,
            )
            is False
        ):
            raise TypeError(
                "result must be instance of "
                "statsmodels.regression.linear_model.RegressionResultsWrapper "
                "object"
            )

        self.results = maybe_unwrap_results(results)
        self.y_true = self.results.model.endog
        self.y_predict = self.results.fittedvalues
        self.xvar = self.results.model.exog
        self.xvar_names = self.results.model.exog_names
        self.residual = np.array(self.results.resid)
        influence = self.results.get_influence()
        self.residual_norm = influence.resid_studentized_internal
        self.leverage = influence.hat_matrix_diag
        self.cooks_distance = influence.cooks_distance[0]
        self.nparams = len(self.results.params)
    def __call__(self, plot_context="seaborn-paper"):
        # print(plt.style.available)
        with plt.style.context(plot_context):
            fig, ax = plt.subplots(nrows=2, ncols=2, figsize=(10, 10))
            self.residual_plot(ax=ax[0, 0])
            self.qq_plot(ax=ax[0, 1])
            self.scale_location_plot(ax=ax[1, 0])
            self.leverage_plot(ax=ax[1, 1])
            plt.show()
        # self.vif_table()
        return fig, ax
    def residual_plot(self, ax=None):
        """
        Residual vs Fitted Plot

        Graphical tool to identify non-linearity.
        A (roughly) horizontal red line is an indicator that the residuals
        have a linear pattern.
        """
        if ax is None:
            fig, ax = plt.subplots()

        sns.residplot(
            x=self.y_predict,
            y=self.residual,
            lowess=True,
            scatter_kws={"alpha": 0.5},
            line_kws={"color": "red", "lw": 1, "alpha": 0.8},
            ax=ax,
        )

        # Annotate the 3 observations with the largest absolute residuals
        residual_abs = np.abs(self.residual)
        abs_resid_top_3 = np.flip(np.argsort(residual_abs), 0)[:3]
        for i in abs_resid_top_3:
            ax.annotate(i, xy=(self.y_predict[i], self.residual[i]), color="C3")

        ax.set_title("Residuals vs Fitted", fontweight="bold")
        ax.set_xlabel("Fitted values")
        ax.set_ylabel("Residuals")
        return ax
    def qq_plot(self, ax=None):
        """
        Standardized Residual vs Theoretical Quantile plot

        Used to visually check if residuals are normally distributed.
        Points spread along the diagonal line will suggest so.
        """
        if ax is None:
            fig, ax = plt.subplots()

        QQ = ProbPlot(self.residual_norm)
        QQ.qqplot(line="45", alpha=0.5, lw=1, ax=ax)

        # Annotate the 3 observations with the largest absolute
        # standardized residuals
        abs_norm_resid = np.flip(np.argsort(np.abs(self.residual_norm)), 0)
        abs_norm_resid_top_3 = abs_norm_resid[:3]
        for r, i in enumerate(abs_norm_resid_top_3):
            ax.annotate(
                i,
                xy=(
                    np.flip(QQ.theoretical_quantiles, 0)[r],
                    self.residual_norm[i],
                ),
                ha="right",
                color="C3",
            )

        ax.set_title("Normal Q-Q", fontweight="bold")
        ax.set_xlabel("Theoretical Quantiles")
        ax.set_ylabel("Standardized Residuals")
        return ax
    def scale_location_plot(self, ax=None):
        """
        Sqrt(Standardized Residual) vs Fitted values plot

        Used to check homoscedasticity of the residuals.
        A horizontal red line will suggest so.
        """
        if ax is None:
            fig, ax = plt.subplots()

        residual_norm_abs_sqrt = np.sqrt(np.abs(self.residual_norm))

        ax.scatter(self.y_predict, residual_norm_abs_sqrt, alpha=0.5)
        sns.regplot(
            x=self.y_predict,
            y=residual_norm_abs_sqrt,
            scatter=False,
            ci=False,
            lowess=True,
            line_kws={"color": "red", "lw": 1, "alpha": 0.8},
            ax=ax,
        )

        # Annotate the 3 observations with the largest values
        abs_sq_norm_resid = np.flip(np.argsort(residual_norm_abs_sqrt), 0)
        abs_sq_norm_resid_top_3 = abs_sq_norm_resid[:3]
        for i in abs_sq_norm_resid_top_3:
            ax.annotate(
                i, xy=(self.y_predict[i], residual_norm_abs_sqrt[i]), color="C3"
            )

        ax.set_title("Scale-Location", fontweight="bold")
        ax.set_xlabel("Fitted values")
        ax.set_ylabel(r"$\sqrt{|\mathrm{Standardized\ Residuals}|}$")
        return ax
    def leverage_plot(self, ax=None):
        """
        Residual vs Leverage plot

        Points falling outside Cook's distance curves are considered
        observations that can sway the fit, aka influential points.
        Good to have none outside the curves.
        """
        if ax is None:
            fig, ax = plt.subplots()

        ax.scatter(self.leverage, self.residual_norm, alpha=0.5)
        sns.regplot(
            x=self.leverage,
            y=self.residual_norm,
            scatter=False,
            ci=False,
            lowess=True,
            line_kws={"color": "red", "lw": 1, "alpha": 0.8},
            ax=ax,
        )

        # Annotate the 3 observations with the largest Cook's distances
        leverage_top_3 = np.flip(np.argsort(self.cooks_distance), 0)[:3]
        for i in leverage_top_3:
            ax.annotate(
                i, xy=(self.leverage[i], self.residual_norm[i]), color="C3"
            )

        xtemp, ytemp = self.__cooks_dist_line(0.5)  # 0.5 line
        ax.plot(
            xtemp, ytemp, label="Cook's distance", lw=1, ls="--", color="red"
        )
        xtemp, ytemp = self.__cooks_dist_line(1)  # 1 line
        ax.plot(xtemp, ytemp, lw=1, ls="--", color="red")

        ax.set_xlim(0, max(self.leverage) + 0.01)
        ax.set_title("Residuals vs Leverage", fontweight="bold")
        ax.set_xlabel("Leverage")
        ax.set_ylabel("Standardized Residuals")
        ax.legend(loc="upper right")
        return ax
    def vif_table(self):
        """
        VIF table

        VIF, the variance inflation factor, is a measure of multicollinearity.
        VIF > 5 for a variable indicates that it is highly collinear with the
        other input variables.
        """
        vif_df = pd.DataFrame()
        vif_df["Features"] = self.xvar_names
        vif_df["VIF Factor"] = [
            variance_inflation_factor(self.xvar, i)
            for i in range(self.xvar.shape[1])
        ]

        return vif_df.sort_values("VIF Factor").round(2)
    def __cooks_dist_line(self, factor):
        """
        Helper function for plotting Cook's distance curves.
        """
        p = self.nparams
        formula = lambda x: np.sqrt((factor * p * (1 - x)) / x)
        x = np.linspace(0.001, max(self.leverage), 50)
        y = formula(x)
        return x, y
3.2 Import Data
The data were provided as a text file.
Preview of data file
A few lines from the text file with data:
R code
read_lines("data/winequality-red.csv", n_max = 6) |> writeLines()
"fixed acidity";"volatile acidity";"citric acid";"residual sugar";"chlorides";"free sulfur dioxide";"total sulfur dioxide";"density";"pH";"sulphates";"alcohol";"quality"
7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5
7.8;0.88;0;2.6;0.098;25;67;0.9968;3.2;0.68;9.8;5
7.8;0.76;0.04;2.3;0.092;15;54;0.997;3.26;0.65;9.8;5
11.2;0.28;0.56;1.9;0.075;17;60;0.998;3.16;0.58;9.8;6
7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5
Code
= pd.read_csv("data/winequality-red.csv", delimiter=";") wine_raw
The dataset contains 1599 rows and 12 columns. All variables are numeric. No missing values were found.
Details: inspect wine_raw dataset
The properties of the imported dataset:
Code
wine_raw.shape
(1599, 12)
R code
head(py$wine_raw)
R code
glimpse(py$wine_raw)
Rows: 1,599
Columns: 12
$ `fixed acidity` <dbl> 7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7…
$ `volatile acidity` <dbl> 0.700, 0.880, 0.760, 0.280, 0.700, 0.660, 0.600…
$ `citric acid` <dbl> 0.00, 0.00, 0.04, 0.56, 0.00, 0.00, 0.06, 0.00,…
$ `residual sugar` <dbl> 1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.…
$ chlorides <dbl> 0.076, 0.098, 0.092, 0.075, 0.076, 0.075, 0.069…
$ `free sulfur dioxide` <dbl> 11, 25, 15, 17, 11, 13, 15, 15, 9, 17, 15, 17, …
$ `total sulfur dioxide` <dbl> 34, 67, 54, 60, 34, 40, 59, 21, 18, 102, 65, 10…
$ density <dbl> 0.9978, 0.9968, 0.9970, 0.9980, 0.9978, 0.9978,…
$ pH <dbl> 3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39,…
$ sulphates <dbl> 0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47,…
$ alcohol <dbl> 9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 1…
$ quality <dbl> 5, 5, 5, 6, 5, 5, 5, 7, 7, 5, 5, 5, 5, 5, 5, 5,…
Number of missing values:
Code
wine_raw.isna().sum()
fixed acidity 0
volatile acidity 0
citric acid 0
residual sugar 0
chlorides 0
free sulfur dioxide 0
total sulfur dioxide 0
density 0
pH 0
sulphates 0
alcohol 0
quality 0
dtype: int64
Code
wine_raw.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 fixed acidity 1599 non-null float64
1 volatile acidity 1599 non-null float64
2 citric acid 1599 non-null float64
3 residual sugar 1599 non-null float64
4 chlorides 1599 non-null float64
5 free sulfur dioxide 1599 non-null float64
6 total sulfur dioxide 1599 non-null float64
7 density 1599 non-null float64
8 pH 1599 non-null float64
9 sulphates 1599 non-null float64
10 alcohol 1599 non-null float64
11 quality 1599 non-null int64
dtypes: float64(11), int64(1)
memory usage: 150.0 KB
3.3 Recoding and Renaming Variables
Variable names will be changed to lower snake case (e.g., to make them easier to use in statistical model formulas). Wine quality scores will be merged into quality groups as follows:
- ≤ 4 – low,
- [5-6] – medium,
- ≥ 7 – high.
Code
wine = wine_raw.clean_names()

wine["quality_gr"] = pd.Categorical(
    wine["quality"].apply(to_3_quality_groups),
    categories=["low", "medium", "high"],
    ordered=True,
)

del wine_raw
How do quality scores and quality groups match?
R code
py$wine |>
  with(table(quality_gr, quality)) |>
  addmargins() |>
  pander::pander()
|        | 3  | 4  | 5   | 6   | 7   | 8  | Sum  |
|--------|----|----|-----|-----|-----|----|------|
| low    | 10 | 53 | 0   | 0   | 0   | 0  | 63   |
| medium | 0  | 0  | 681 | 638 | 0   | 0  | 1319 |
| high   | 0  | 0  | 0   | 0   | 199 | 18 | 217  |
| Sum    | 10 | 53 | 681 | 638 | 199 | 18 | 1599 |
Variable recoding was performed without errors as numbers add up correctly.
Details: inspect wine dataset
Column names:
R code
colnames(py$wine)
[1] "fixed_acidity" "volatile_acidity" "citric_acid"
[4] "residual_sugar" "chlorides" "free_sulfur_dioxide"
[7] "total_sulfur_dioxide" "density" "ph"
[10] "sulphates" "alcohol" "quality"
[13] "quality_gr"
Dataset:
R code
head(py$wine)
R code
glimpse(py$wine)
Rows: 1,599
Columns: 13
$ fixed_acidity <dbl> 7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5…
$ volatile_acidity <dbl> 0.700, 0.880, 0.760, 0.280, 0.700, 0.660, 0.600, …
$ citric_acid <dbl> 0.00, 0.00, 0.04, 0.56, 0.00, 0.00, 0.06, 0.00, 0…
$ residual_sugar <dbl> 1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1,…
$ chlorides <dbl> 0.076, 0.098, 0.092, 0.075, 0.076, 0.075, 0.069, …
$ free_sulfur_dioxide <dbl> 11, 25, 15, 17, 11, 13, 15, 15, 9, 17, 15, 17, 16…
$ total_sulfur_dioxide <dbl> 34, 67, 54, 60, 34, 40, 59, 21, 18, 102, 65, 102,…
$ density <dbl> 0.9978, 0.9968, 0.9970, 0.9980, 0.9978, 0.9978, 0…
$ ph <dbl> 3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39, 3…
$ sulphates <dbl> 0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0…
$ alcohol <dbl> 9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.…
$ quality <dbl> 5, 5, 5, 6, 5, 5, 5, 7, 7, 5, 5, 5, 5, 5, 5, 5, 7…
$ quality_gr <ord> medium, medium, medium, medium, medium, medium, m…
3.4 Split into Training/Test Sets
In this project, to get a more reliable estimate of the models' predictive power, an 80:20 train/test hold-out split with stratification by wine quality scores will be used.
Code
wine_train, wine_test = train_test_split(
    wine, test_size=0.2, random_state=50, stratify=wine["quality"]
)

del wine
The sizes of training and test sets after splitting are 1279:320 wine samples. If needed, find more details below.
Details: inspect data after splitting to train/test sets
The dimensions of training set:
Code
wine_train.shape
(1279, 13)
The dimensions of test set:
Code
wine_test.shape
(320, 13)
Is the split really stratified? Yes, as the sample size ratios (column ratio) in all strata are around 0.80:
R code
full_join(
  py$wine_train |> count(quality, name = "n_train"),
  py$wine_test |> count(quality, name = "n_test"),
  by = "quality"
) |>
  mutate(ratio = round(n_train / (n_train + n_test), digits = 2))
Dataset (train only):
R code
head(py$wine_train)
R code
glimpse(py$wine_train)
Rows: 1,279
Columns: 13
$ fixed_acidity <dbl> 6.7, 8.6, 10.9, 7.7, 7.4, 10.3, 10.0, 9.0, 9.0, 8…
$ volatile_acidity <dbl> 0.480, 0.490, 0.530, 1.005, 0.630, 0.500, 0.490, …
$ citric_acid <dbl> 0.08, 0.29, 0.49, 0.15, 0.07, 0.42, 0.20, 0.25, 0…
$ residual_sugar <dbl> 2.1, 2.0, 4.6, 2.1, 2.4, 2.0, 11.0, 2.0, 2.8, 2.0…
$ chlorides <dbl> 0.064, 0.110, 0.118, 0.102, 0.090, 0.069, 0.071, …
$ free_sulfur_dioxide <dbl> 18, 19, 10, 11, 11, 21, 13, 8, 28, 11, 9, 40, 32,…
$ total_sulfur_dioxide <dbl> 34, 133, 17, 32, 37, 51, 50, 21, 104, 27, 22, 83,…
$ density <dbl> 0.99552, 0.99720, 1.00020, 0.99604, 0.99790, 0.99…
$ ph <dbl> 3.33, 2.93, 3.07, 3.23, 3.43, 3.16, 3.16, 3.27, 3…
$ sulphates <dbl> 0.64, 1.98, 0.56, 0.48, 0.76, 0.72, 0.69, 0.72, 0…
$ alcohol <dbl> 9.70000, 9.80000, 11.70000, 10.00000, 9.70000, 11…
$ quality <dbl> 5, 5, 6, 5, 6, 6, 6, 5, 5, 6, 5, 6, 5, 5, 6, 4, 6…
$ quality_gr <ord> medium, medium, medium, medium, medium, medium, m…
To avoid additional biases, the test set will not be investigated until the final prediction models are created; for now, let's focus on the training set only.
From here on, further exploratory and inferential analysis as well as statistical modelling will use the training set only.
3.5 Further Inspection of Training Set
Graphical and numeric inspection of each variable (find the details below) revealed that some variables are either moderately (skewness > 0.5) or highly (skewness > 1) right-skewed. As it is planned to use linear models in this project, skewed data may lead to non-linearities. This issue will be accounted for in the next subsection.
Details: plots and summaries of each variable on original scale
R code
DescTools::Desc(
  py$wine_train |> mutate(quality = ordered(quality)),
  verbose = 3
)
------------------------------------------------------------------------------
Describe mutate(py$wine_train, quality = ordered(quality)) (data.frame):
data frame: 1279 obs. of 13 variables
1279 complete cases (100.0%)
Nr ColName Class NAs Levels
1 fixed_acidity numeric .
2 volatile_acidity numeric .
3 citric_acid numeric .
4 residual_sugar numeric .
5 chlorides numeric .
6 free_sulfur_dioxide numeric .
7 total_sulfur_dioxide numeric .
8 density numeric .
9 ph numeric .
10 sulphates numeric .
11 alcohol numeric .
12 quality ordered, factor . (6): 1-3, 2-4, 3-5, 4-6,
5-7, ...
13 quality_gr ordered, factor . (3): 1-low, 2-medium,
3-high
------------------------------------------------------------------------------
1 - fixed_acidity (numeric)
length n NAs unique 0s mean meanCI'
1'279 1'279 0 94 0 8.30 8.21
100.0% 0.0% 0.0% 8.40
.05 .10 .25 median .75 .90 .95
6.10 6.50 7.10 7.90 9.20 10.60 11.71
range sd vcoef mad IQR skew kurt
11.30 1.75 0.21 1.48 2.10 1.02 1.26
lowest : 4.6, 4.9, 5.0 (4), 5.1 (3), 5.2 (6)
highest: 14.3, 15.0 (2), 15.5, 15.6 (2), 15.9
' 95%-CI (classic)
------------------------------------------------------------------------------
2 - volatile_acidity (numeric)
length n NAs unique 0s mean meanCI'
1'279 1'279 0 141 0 0.53 0.52
100.0% 0.0% 0.0% 0.54
.05 .10 .25 median .75 .90 .95
0.27 0.31 0.40 0.52 0.64 0.76 0.85
range sd vcoef mad IQR skew kurt
1.46 0.18 0.34 0.18 0.24 0.71 1.32
lowest : 0.12 (3), 0.16 (2), 0.18 (8), 0.19, 0.2 (2)
highest: 1.18, 1.18, 1.24, 1.33 (2), 1.58
' 95%-CI (classic)
------------------------------------------------------------------------------
3 - citric_acid (numeric)
length n NAs unique 0s mean meanCI'
1'279 1'279 0 78 110 0.27 0.26
100.0% 0.0% 8.6% 0.28
.05 .10 .25 median .75 .90 .95
0.00 0.01 0.09 0.25 0.42 0.53 0.60
range sd vcoef mad IQR skew kurt
0.79 0.19 0.73 0.25 0.33 0.32 -0.88
lowest : 0.0 (110), 0.01 (27), 0.02 (40), 0.03 (21), 0.04 (25)
highest: 0.74 (4), 0.75, 0.76 (2), 0.78, 0.79
heap(?): remarkable frequency (8.6%) for the mode(s) (= 0)
' 95%-CI (classic)
------------------------------------------------------------------------------
4 - residual_sugar (numeric)
length n NAs unique 0s mean meanCI'
1'279 1'279 0 85 0 2.55 2.47
100.0% 0.0% 0.0% 2.63
.05 .10 .25 median .75 .90 .95
1.50 1.70 1.90 2.20 2.60 3.61 5.20
range sd vcoef mad IQR skew kurt
14.50 1.44 0.57 0.44 0.70 4.42 26.74
lowest : 0.9, 1.2 (7), 1.3 (5), 1.4 (33), 1.5 (22)
highest: 12.9, 13.4, 13.8 (2), 13.9, 15.4 (2)
heap(?): remarkable frequency (9.9%) for the mode(s) (= 2)
' 95%-CI (classic)
------------------------------------------------------------------------------
5 - chlorides (numeric)
length n NAs unique 0s mean meanCI'
1'279 1'279 0 141 0 0.0867 0.0843
100.0% 0.0% 0.0% 0.0891
.05 .10 .25 median .75 .90 .95
0.0540 0.0600 0.0700 0.0790 0.0900 0.1062 0.1242
range sd vcoef mad IQR skew kurt
0.4520 0.0437 0.5045 0.0148 0.0200 5.0275 31.2563
lowest : 0.012 (2), 0.034, 0.038 (2), 0.039 (4), 0.041 (4)
highest: 0.413, 0.414 (2), 0.415 (2), 0.422, 0.464
' 95%-CI (classic)
------------------------------------------------------------------------------
6 - free_sulfur_dioxide (numeric)
length n NAs unique 0s mean meanCI'
1'279 1'279 0 58 0 15.70 15.12
100.0% 0.0% 0.0% 16.29
.05 .10 .25 median .75 .90 .95
4.00 5.00 7.00 13.00 21.00 30.00 35.00
range sd vcoef mad IQR skew kurt
71.00 10.60 0.68 10.38 14.00 1.31 2.24
lowest : 1.0 (2), 3.0 (44), 4.0 (34), 5.0 (86), 5.5
highest: 54.0, 55.0 (2), 66.0, 68.0 (2), 72.0
heap(?): remarkable frequency (9.1%) for the mode(s) (= 6)
' 95%-CI (classic)
------------------------------------------------------------------------------
7 - total_sulfur_dioxide (numeric)
length n NAs unique 0s mean meanCI'
1'279 1'279 0 140 0 45.67 43.85
100.0% 0.0% 0.0% 47.49
.05 .10 .25 median .75 .90 .95
11.00 14.00 21.00 37.00 61.00 94.00 112.00
range sd vcoef mad IQR skew kurt
283.00 33.16 0.73 26.69 40.00 1.61 4.46
lowest : 6.0 (3), 7.0 (4), 8.0 (10), 9.0 (13), 10.0 (24)
highest: 155.0, 160.0, 165.0, 278.0, 289.0
' 95%-CI (classic)
------------------------------------------------------------------------------
8 - density (numeric)
length n NAs unique 0s mean meanCI'
1'279 1'279 0 400 0 0.996736 0.996633
100.0% 0.0% 0.0% 0.996839
.05 .10 .25 median .75 .90 .95
0.993598 0.994480 0.995600 0.996720 0.997825 0.999108 1.000000
range sd vcoef mad IQR skew kurt
0.013620 0.001883 0.001889 0.001661 0.002225 0.127046 0.867452
lowest : 0.99007, 0.99064 (2), 0.99084, 0.9912, 0.9915
highest: 1.0026, 1.00289, 1.00315 (2), 1.0032, 1.00369 (2)
' 95%-CI (classic)
------------------------------------------------------------------------------
9 - ph (numeric)
length n NAs unique 0s mean meanCI'
1'279 1'279 0 86 0 3.31 3.30
100.0% 0.0% 0.0% 3.32
.05 .10 .25 median .75 .90 .95
3.07 3.13 3.21 3.31 3.40 3.51 3.57
range sd vcoef mad IQR skew kurt
1.15 0.15 0.05 0.15 0.19 0.25 0.88
lowest : 2.86, 2.87, 2.88 (2), 2.89 (2), 2.9
highest: 3.72 (3), 3.75, 3.78 (2), 3.9 (2), 4.01 (2)
' 95%-CI (classic)
------------------------------------------------------------------------------
10 - sulphates (numeric)
length n NAs unique 0s mean meanCI'
1'279 1'279 0 88 0 0.653 0.644
100.0% 0.0% 0.0% 0.662
.05 .10 .25 median .75 .90 .95
0.470 0.490 0.550 0.620 0.730 0.840 0.921
range sd vcoef mad IQR skew kurt
1.650 0.163 0.250 0.119 0.180 2.263 10.489
lowest : 0.33, 0.37, 0.39 (5), 0.4 (4), 0.42 (4)
highest: 1.59, 1.61, 1.62, 1.95, 1.98
' 95%-CI (classic)
------------------------------------------------------------------------------
11 - alcohol (numeric)
length n NAs unique 0s mean meanCI'
1'279 1'279 0 61 0 10.42 10.36
100.0% 0.0% 0.0% 10.48
.05 .10 .25 median .75 .90 .95
9.20 9.30 9.50 10.20 11.10 12.00 12.50
range sd vcoef mad IQR skew kurt
6.50 1.06 0.10 1.04 1.60 0.87 0.23
lowest : 8.4 (2), 8.5, 8.7, 8.8 (2), 9.0 (22)
highest: 13.4 (2), 13.5, 13.6, 14.0 (6), 14.9
heap(?): remarkable frequency (8.8%) for the mode(s) (= 9.5)
' 95%-CI (classic)
------------------------------------------------------------------------------
12 - quality (ordered, factor)
length n NAs unique levels dupes
1'279 1'279 0 6 6 y
100.0% 0.0%
level freq perc cumfreq cumperc
1 3 8 0.6% 8 0.6%
2 4 42 3.3% 50 3.9%
3 5 545 42.6% 595 46.5%
4 6 510 39.9% 1'105 86.4%
5 7 159 12.4% 1'264 98.8%
6 8 15 1.2% 1'279 100.0%
------------------------------------------------------------------------------
13 - quality_gr (ordered, factor)
length n NAs unique levels dupes
1'279 1'279 0 3 3 y
100.0% 0.0%
level freq perc cumfreq cumperc
1 low 50 3.9% 50 3.9%
2 medium 1'055 82.5% 1'105 86.4%
3 high 174 13.6% 1'279 100.0%
3.6 Log-Transformation of Right-Skewed Variables
The amount of skewness in the variables was investigated before (column skewness_original in the table below) and after (column skewness_after_log) log-transformation, which was applied to variables with skewness above 1.
Code
skewness = (
    wine_train.skew(numeric_only=True)
    .sort_values(ascending=False)
    .rename("skewness_original")
)

skewed_vars = skewness.where(skewness > 1).dropna().index.tolist()

skewness_after = (
    wine_train.transform_columns(skewed_vars, np.log10)
    .skew(numeric_only=True)
    .rename("skewness_after_log")
)

skew_diff = (skewness_after - skewness).rename("difference")

pd.concat([skewness, skewness_after, skew_diff], axis=1).reset_index(
    names="variable"
)
variable skewness_original skewness_after_log difference
0 chlorides 5.039 1.530 -3.509
1 residual_sugar 4.435 1.803 -2.632
2 sulphates 2.268 0.852 -1.416
3 total_sulfur_dioxide 1.616 -0.033 -1.649
4 free_sulfur_dioxide 1.313 -0.172 -1.485
5 fixed_acidity 1.025 0.426 -0.599
6 alcohol 0.870 0.870 0.000
7 volatile_acidity 0.710 0.710 0.000
8 citric_acid 0.323 0.323 0.000
9 ph 0.254 0.254 0.000
10 quality 0.226 0.226 0.000
11 density 0.127 0.127 0.000
Guidelines to evaluate degree of skewness (coefficient of asymmetry)
I use these guidelines to interpret the skewness coefficient (a helper implementing them is sketched after this list).
- Interpret the direction of skewness:
- \(< 0\): left-skewed data;
- \(0\): ideally symmetric data;
- \(0 <\): right-skewed data.
- Interpret the amount of skewness (rule of thumb):
- \(0\): ideally symmetric;
- \(-0.5\) – \(+0.5\): almost symmetric;
- \(-1\) – \(-0.5\) and \(+0.5\) – \(+1\): moderate asymmetry;
- \(< -1\) and \(+1 < ~\): high asymmetry.
In all transformed cases, the amount of skewness was reduced.
Apply the transformation to the dataset:
Code
# Create a copy of the data frame that contains log-transformed variables
new_names_map = {i: "log_" + str(i) for i in skewed_vars}

wine_train_log = wine_train.transform_columns(skewed_vars, np.log10).rename(
    new_names_map, axis=1
)
Details: inspect data after log-transformation
Column names:
R code
colnames(py$wine_train_log)
[1] "log_fixed_acidity" "volatile_acidity"
[3] "citric_acid" "log_residual_sugar"
[5] "log_chlorides" "log_free_sulfur_dioxide"
[7] "log_total_sulfur_dioxide" "density"
[9] "ph" "log_sulphates"
[11] "alcohol" "quality"
[13] "quality_gr"
Dataset:
R code
head(py$wine_train_log)
R code
glimpse(py$wine_train_log)
Rows: 1,279
Columns: 13
$ log_fixed_acidity <dbl> 0.8260748, 0.9344985, 1.0374265, 0.8864907, 0…
$ volatile_acidity <dbl> 0.480, 0.490, 0.530, 1.005, 0.630, 0.500, 0.4…
$ citric_acid <dbl> 0.08, 0.29, 0.49, 0.15, 0.07, 0.42, 0.20, 0.2…
$ log_residual_sugar <dbl> 0.3222193, 0.3010300, 0.6627578, 0.3222193, 0…
$ log_chlorides <dbl> -1.1938200, -0.9586073, -0.9281180, -0.991399…
$ log_free_sulfur_dioxide <dbl> 1.2552725, 1.2787536, 1.0000000, 1.0413927, 1…
$ log_total_sulfur_dioxide <dbl> 1.531479, 2.123852, 1.230449, 1.505150, 1.568…
$ density <dbl> 0.99552, 0.99720, 1.00020, 0.99604, 0.99790, …
$ ph <dbl> 3.33, 2.93, 3.07, 3.23, 3.43, 3.16, 3.16, 3.2…
$ log_sulphates <dbl> -0.19382003, 0.29666519, -0.25181197, -0.3187…
$ alcohol <dbl> 9.70000, 9.80000, 11.70000, 10.00000, 9.70000…
$ quality <dbl> 5, 5, 6, 5, 6, 6, 6, 5, 5, 6, 5, 6, 5, 5, 6, …
$ quality_gr <ord> medium, medium, medium, medium, medium, mediu…
Details: plots and summaries of variables after log-transformation
R code
DescTools::Desc(
  py$wine_train_log |> mutate(quality = ordered(quality)),
  verbose = 3, digits = 3
)
------------------------------------------------------------------------------
Describe mutate(py$wine_train_log, quality = ordered(quality)) (data.frame):
data frame: 1279 obs. of 13 variables
1279 complete cases (100.0%)
Nr ColName Class NAs Levels
1 log_fixed_acidity numeric .
2 volatile_acidity numeric .
3 citric_acid numeric .
4 log_residual_sugar numeric .
5 log_chlorides numeric .
6 log_free_sulfur_dioxide numeric .
7 log_total_sulfur_dioxide numeric .
8 density numeric .
9 ph numeric .
10 log_sulphates numeric .
11 alcohol numeric .
12 quality ordered, factor . (6): 1-3, 2-4, 3-5,
4-6, 5-7, ...
13 quality_gr ordered, factor . (3): 1-low, 2-medium,
3-high
------------------------------------------------------------------------------
1 - log_fixed_acidity (numeric)
length n NAs unique 0s mean meanCI'
1'279 1'279 0 94 0 0.910 0.906
100.0% 0.0% 0.0% 0.915
.05 .10 .25 median .75 .90 .95
0.785 0.813 0.851 0.898 0.964 1.025 1.069
range sd vcoef mad IQR skew kurt
0.539 0.087 0.095 0.078 0.113 0.425 0.131
lowest : 0.663, 0.690, 0.699 (4), 0.708 (3), 0.716 (6)
highest: 1.155, 1.176 (2), 1.190, 1.193 (2), 1.201
' 95%-CI (classic)
------------------------------------------------------------------------------
2 - volatile_acidity (numeric)
length n NAs unique 0s mean meanCI'
1'279 1'279 0 141 0 0.532 0.522
100.0% 0.0% 0.0% 0.542
.05 .10 .25 median .75 .90 .95
0.270 0.310 0.400 0.520 0.640 0.760 0.850
range sd vcoef mad IQR skew kurt
1.460 0.183 0.344 0.178 0.240 0.709 1.321
lowest : 0.120 (3), 0.160 (2), 0.180 (8), 0.190, 0.200 (2)
highest: 1.180, 1.185, 1.240, 1.330 (2), 1.580
' 95%-CI (classic)
------------------------------------------------------------------------------
3 - citric_acid (numeric)
length n NAs unique 0s mean meanCI'
1'279 1'279 0 78 110 0.268 0.257
100.0% 0.0% 8.6% 0.278
.05 .10 .25 median .75 .90 .95
0.000 0.010 0.090 0.250 0.420 0.530 0.600
range sd vcoef mad IQR skew kurt
0.790 0.195 0.728 0.252 0.330 0.322 -0.882
lowest : 0.000 (110), 0.010 (27), 0.020 (40), 0.030 (21), 0.040 (25)
highest: 0.740 (4), 0.750, 0.760 (2), 0.780, 0.790
heap(?): remarkable frequency (8.6%) for the mode(s) (= 0)
' 95%-CI (classic)
------------------------------------------------------------------------------
4 - log_residual_sugar (numeric)
length n NAs unique 0s mean meanCI'
1'279 1'279 0 85 0 0.369 0.361
100.0% 0.0% 0.0% 0.378
.05 .10 .25 median .75 .90 .95
0.176 0.230 0.279 0.342 0.415 0.558 0.716
range sd vcoef mad IQR skew kurt
1.233 0.158 0.428 0.094 0.136 1.799 4.740
lowest : -0.046, 0.079 (7), 0.114 (5), 0.146 (33), 0.176 (22)
highest: 1.111, 1.127, 1.140 (2), 1.143, 1.188 (2)
heap(?): remarkable frequency (9.9%) for the mode(s) (= 0.301029995663981)
' 95%-CI (classic)
------------------------------------------------------------------------------
5 - log_chlorides (numeric)
length n NAs unique 0s mean meanCI'
1'279 1'279 0 141 0 -1.090 -1.098
100.0% 0.0% 0.0% -1.083
.05 .10 .25 median .75 .90 .95
-1.268 -1.222 -1.155 -1.102 -1.046 -0.974 -0.906
range sd vcoef mad IQR skew kurt
1.587 0.141 -0.129 0.078 0.109 1.527 8.730
lowest : -1.921 (2), -1.469, -1.420 (2), -1.409 (4), -1.387 (4)
highest: -0.384, -0.383 (2), -0.382 (2), -0.375, -0.333
' 95%-CI (classic)
------------------------------------------------------------------------------
6 - log_free_sulfur_dioxide (numeric)
length n NAs unique 0s mean meanCI'
1'279 1'279 0 58 2 1.098 1.082
100.0% 0.0% 0.2% 1.115
.05 .10 .25 median .75 .90 .95
0.602 0.699 0.845 1.114 1.322 1.477 1.544
range sd vcoef mad IQR skew kurt
1.857 0.301 0.274 0.339 0.477 -0.172 -0.566
lowest : 0.000 (2), 0.477 (44), 0.602 (34), 0.699 (86), 0.740
highest: 1.732, 1.740 (2), 1.820, 1.833 (2), 1.857
heap(?): remarkable frequency (9.1%) for the mode(s) (= 0.778151250383644)
' 95%-CI (classic)
------------------------------------------------------------------------------
7 - log_total_sulfur_dioxide (numeric)
length n NAs unique 0s mean meanCI'
1'279 1'279 0 140 0 1.553 1.536
100.0% 0.0% 0.0% 1.570
.05 .10 .25 median .75 .90 .95
1.041 1.146 1.322 1.568 1.785 1.973 2.049
range sd vcoef mad IQR skew kurt
1.683 0.310 0.199 0.335 0.463 -0.033 -0.693
lowest : 0.778 (3), 0.845 (4), 0.903 (10), 0.954 (13), 1.000 (24)
highest: 2.190, 2.204, 2.217, 2.444, 2.461
' 95%-CI (classic)
------------------------------------------------------------------------------
8 - density (numeric)
length n NAs unique 0s mean meanCI'
1'279 1'279 0 400 0 0.997 0.997
100.0% 0.0% 0.0% 0.997
.05 .10 .25 median .75 .90 .95
0.994 0.994 0.996 0.997 0.998 0.999 1.000
range sd vcoef mad IQR skew kurt
0.014 0.002 0.002 0.002 0.002 0.127 0.867
lowest : 0.990, 0.991 (2), 0.991, 0.991, 0.992
highest: 1.003, 1.003, 1.003 (2), 1.003, 1.004 (2)
' 95%-CI (classic)
------------------------------------------------------------------------------
9 - ph (numeric)
length n NAs unique 0s mean meanCI'
1'279 1'279 0 86 0 3.312 3.304
100.0% 0.0% 0.0% 3.321
.05 .10 .25 median .75 .90 .95
3.070 3.130 3.210 3.310 3.400 3.510 3.570
range sd vcoef mad IQR skew kurt
1.150 0.153 0.046 0.148 0.190 0.254 0.881
lowest : 2.860, 2.870, 2.880 (2), 2.890 (2), 2.900
highest: 3.720 (3), 3.750, 3.780 (2), 3.900 (2), 4.010 (2)
' 95%-CI (classic)
------------------------------------------------------------------------------
10 - log_sulphates (numeric)
length n NAs unique 0s mean meanCI'
1'279 1'279 0 88 1 -0.196 -0.201
100.0% 0.0% 0.1% -0.191
.05 .10 .25 median .75 .90 .95
-0.328 -0.310 -0.260 -0.208 -0.137 -0.076 -0.036
range sd vcoef mad IQR skew kurt
0.778 0.096 -0.487 0.089 0.123 0.850 1.854
lowest : -0.481, -0.432, -0.409 (5), -0.398 (4), -0.377 (4)
highest: 0.201, 0.207, 0.210, 0.290, 0.297
' 95%-CI (classic)
------------------------------------------------------------------------------
11 - alcohol (numeric)
length n NAs unique 0s mean meanCI'
1'279 1'279 0 61 0 10.423 10.365
100.0% 0.0% 0.0% 10.481
.05 .10 .25 median .75 .90 .95
9.200 9.300 9.500 10.200 11.100 12.000 12.500
range sd vcoef mad IQR skew kurt
6.500 1.060 0.102 1.038 1.600 0.868 0.228
lowest : 8.400 (2), 8.500, 8.700, 8.800 (2), 9.000 (22)
highest: 13.400 (2), 13.500, 13.600, 14.000 (6), 14.900
heap(?): remarkable frequency (8.8%) for the mode(s) (= 9.5)
' 95%-CI (classic)
------------------------------------------------------------------------------
12 - quality (ordered, factor)
length n NAs unique levels dupes
1'279 1'279 0 6 6 y
100.0% 0.0%
level freq perc cumfreq cumperc
1 3 8 0.625% 8 0.625%
2 4 42 3.284% 50 3.909%
3 5 545 42.611% 595 46.521%
4 6 510 39.875% 1'105 86.396%
5 7 159 12.432% 1'264 98.827%
6 8 15 1.173% 1'279 100.000%
------------------------------------------------------------------------------
13 - quality_gr (ordered, factor)
length n NAs unique levels dupes
1'279 1'279 0 3 3 y
100.0% 0.0%
level freq perc cumfreq cumperc
1 low 50 3.909% 50 3.909%
2 medium 1'055 82.486% 1'105 86.396%
3 high 174 13.604% 1'279 100.000%
4 Analysis
4.1 Correlation: All Pairs
In this section we will look into the relationships between each pair of numeric variables. We will start with graphical analysis (scatter plot matrices) and later proceed to linear correlation analysis.
Scatter plot matrices of non-transformed and log-transformed variables are presented in Fig. 4.1 and Fig. 4.2. In the plots we see that some variables are related (e.g., density and fixed acidity). It can also be noticed that after log-transformation the relationships between some variables (e.g., pH and fixed acidity) look more linear. On the other hand, the plots are small and there are many points, so further investigation is needed to understand the data better.
Code
fig_sns_1 = sns.pairplot(
    wine_train,
    kind="hist",
    height=1,
    corner=True,
)

for ax in fig_sns_1.axes.flatten():
    if ax:
        # Rotate x axis labels
        ax.set_xlabel(ax.get_xlabel(), rotation=90)
        # Rotate y axis labels
        ax.set_ylabel(ax.get_ylabel(), rotation=0)
        # Set y labels alignment
        ax.yaxis.get_label().set_horizontalalignment("right")

fig_sns_1.tight_layout()
plt.show()
Code
fig_sns_2 = sns.pairplot(
    wine_train_log,
    kind="hist",
    height=1,
    corner=True,
)

for ax in fig_sns_2.axes.flatten():
    if ax:
        # Rotate x axis labels
        ax.set_xlabel(ax.get_xlabel(), rotation=90)
        # Rotate y axis labels
        ax.set_ylabel(ax.get_ylabel(), rotation=0)
        # Set y labels alignment
        ax.yaxis.get_label().set_horizontalalignment("right")

fig_sns_2.tight_layout()
plt.show()
Next, the influence of the logarithmic transformation on the linear correlation coefficients is analyzed and summarized in the table below. A negative value of r_abs_difference (the difference in absolute values of the correlation coefficient) means that the correlation decreased; a positive value shows an increase in correlation strength.
Code
# Calculate correlation
corr_df_orig = pairwise_correlation(wine_train)
corr_df_log = pairwise_correlation(wine_train_log)

# Investigate the changes in coefficients due to transformation
(
    pd.concat(
        [
            corr_df_orig.r.rename("r_orig_var"),
            corr_df_log.r.rename("r_log_var"),
        ],
        axis=1,
    )
    .eval("r_abs_difference = abs(r_log_var) - abs(r_orig_var)")
    .select_columns("r_abs_difference")
    .agg(["min", "mean", "std", "max"])
)
r_abs_difference
min -0.114
mean 0.014
std 0.041
max 0.146
The average shift in the correlation coefficients was only slight: some coefficients started showing stronger and others weaker correlation. The cases with the largest positive and negative changes (indicated by the min and max values of r_abs_difference) are analyzed in the details sections. To sum up, it seems that in both cases the true correlation was reflected better by the coefficients after the log-transformation.
Details: case study of chlorides vs. sulphates
After log-transformation, the linear correlation strength between chlorides and sulphates decreased most evidently (the change in absolute value is -0.11). It seems that the size of the linear correlation coefficient between these two variables is driven by 17 outlying points (where chlorides > 0.3; red dashed line in the plot below). The logarithmic transformation reduces this issue, and the true correlation strength is reflected better in this case. On the other hand, the rank correlation coefficient suggests that there is no correlation or that it is very weak.
R code
ggplot(py$wine_train, aes(x = chlorides, y = sulphates)) +
geom_point(alpha = 0.1, pch = 19) +
geom_vline(xintercept = 0.3, color = "darkred", lty = 2, alpha = 0.5) +
ggtitle("Variables on Original Scale") +
ggplot(py$wine_train, aes(x = log(chlorides), y = log(sulphates))) +
geom_point(alpha = 0.1, pch = 19) +
ggtitle("Variables on Log Scale")
R code
py$wine_train |> count(chlorides > 0.3)
R code
c(
  "r_original_variables" = with(
    py$wine_train, cor(chlorides, sulphates, method = "pearson")
  ),
  "r_log_variables" = with(
    py$wine_train,
    cor(log(chlorides), log(sulphates), method = "pearson")
  ),
  "r_without_17_outliers" = with(
    py$wine_train |> filter(chlorides < 0.3),
    cor(chlorides, sulphates, method = "pearson")
  ),
  "r_spearman_all_data" = with(
    py$wine_train, cor(chlorides, sulphates, method = "spearman")
  )
) |> round(2)
r_original_variables r_log_variables r_without_17_outliers
0.31 0.20 0.05
r_spearman_all_data
-0.01
Details: case study of chlorides vs. density
After log-transformation, the linear correlation between chlorides and density increased most evidently (the change in absolute value is +0.15). It seems that the size of the linear correlation coefficient again was influenced by the group of 17 outlying points (where chlorides > 0.3; red dashed line in the plot below), and the logarithmic transformation reduced this issue: the correlation of the log-transformed variables is almost the same as that of the non-transformed variables without the 17 points (r = 0.36 vs. r = 0.34).
R code
ggplot(py$wine_train, aes(x = chlorides, y = density)) +
geom_point(alpha = 0.1, pch = 19) +
geom_vline(xintercept = 0.3, color = "darkred", lty = 2, alpha = 0.5) +
ggtitle("Variables on Original Scale") +
ggplot(py$wine_train, aes(x = log(chlorides), y = density)) +
geom_point(alpha = 0.1, pch = 19) +
ggtitle("Log-Transformed Chloride")
R code
py$wine_train |> count(chlorides > 0.3)
R code
c(
  "r_original_variables" = with(
    py$wine_train, cor(chlorides, density, method = "pearson")
  ),
  "r_log_variables" = with(
    py$wine_train,
    cor(log(chlorides), density, method = "pearson")
  ),
  "r_without_17_outliers" = with(
    py$wine_train |> filter(chlorides < 0.3),
    cor(chlorides, density, method = "pearson")
  ),
  "r_spearman_all_data" = with(
    py$wine_train, cor(chlorides, density, method = "spearman")
  )
) |> round(2)
r_original_variables r_log_variables r_without_17_outliers
0.21 0.36 0.34
r_spearman_all_data
0.41
Fig. 4.3 contains a graphical representation of all pairwise correlation coefficients. The "details" sections below contain the numeric results, while the two tables below this section list correlations with quality score and with alcohol content.
R code
py$wine_train_log |>
  select_if(is.numeric) |>
  ggstatsplot::ggcorrmat() +
  scale_y_discrete(limits = rev)
Details: correlation table (variables on original scale)
It can be suspected that linear correlation coefficients may not reflect the true strength of relationship between skewed variables correctly.
R code
py$corr_df_orig |>
  rowwise() |>
  mutate(
    `CI95%_lower` = `CI95%`[[1]],
    `CI95%_upper` = `CI95%`[[2]],
    r = round(r, digits = 2)
  ) |>
  select(-`CI95%`) |>
  ungroup() |>
  rownames_to_column("index")
Details: correlation table (after log-transformation)
R code
py$corr_df_log |>
  rowwise() |>
  mutate(
    `CI95%_lower` = `CI95%`[[1]],
    `CI95%_upper` = `CI95%`[[2]],
    r = round(r, digits = 2)
  ) |>
  select(-`CI95%`) |>
  ungroup() |>
  rownames_to_column("index")
Results. The strongest correlations are detected between log_free_sulfur_dioxide and log_total_sulfur_dioxide (r = 0.79, 95% CI [0.77, 0.81], p_adj < 0.001) and between log_fixed_acidity and ph (r = -0.71, 95% CI [-0.74, -0.69], p_adj < 0.001). In statistical modeling these variables will be used as explanatory variables, which means that a multicollinearity problem may appear in linear modeling. So we have to pay attention to this while interpreting the regression coefficients (a VIF check is sketched below).
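For illustration, a sketch of such a multicollinearity check via VIF (see the Abbreviations list), assuming the wine_train_log data frame from Section 3.6 and the imports from the setup section:
Code
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

X = (
    wine_train_log.select_dtypes("number")
    .drop(columns="quality")
    .assign(const=1.0)  # intercept column, needed for meaningful VIFs
)
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif.drop("const").round(2))  # VIF > 5 suggests strong collinearity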
4.2 Exploratory PCA
4.2.1 Model Diagnostics
Principal component analysis (PCA) will be used to explore relationships between target (wine quality score, quality group as well as alcohol content) and other variables.
R code
<- prcomp(py$wine_train_log |> select_if(is.numeric), scale. = TRUE) pca_model
R code
fviz_eig(pca_model, addlabels = TRUE) + ylim(NA, 29)
Details: table with eigenvalues and percentage of explained variance
R code
get_eigenvalue(pca_model) |> round(1)
- To explain more than 80 % of variance, 5 components are required.
- “Elbow” method suggests 3-4 components.
- The “eigenvalue above 1” rule suggests 4-5 components.
These numbers might be important for feature selection. For visual inspection, the first 3 principal components, which account for 26.8 %, 20.1 %, and 14.6 % of the variance respectively, will be used.
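The same eigenvalue diagnostics can be cross-checked in Python; a minimal sketch, assuming scikit-learn is available and using the numeric columns of wine_train_log:
Code
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize, as in prcomp(..., scale. = TRUE)
X = StandardScaler().fit_transform(wine_train_log.select_dtypes("number"))
pca = PCA().fit(X)

explained = pca.explained_variance_ratio_ * 100
print(np.round(explained, 1))               # % of variance per component
print(np.round(np.cumsum(explained), 1))    # cumulative %
print(np.sum(pca.explained_variance_ > 1))  # "eigenvalue > 1" rule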
4.2.2 PCA Individuals Map
In the PCA “individuals map”, no very clear clusters are visible. It can only be noticed that in the space of PC1 and PC2 (denoted as Dim1 and Dim2), roughly one half of the point cloud is denser than the other.
R code
fviz_pca_ind(pca_model, label = FALSE, alpha = 0.2)
R code
fviz_pca_ind(pca_model, axes = c(3, 2), label = FALSE, alpha = 0.2)
After adding colors to the points, 3 partly overlapping clusters become visible. The direction between the cluster centroids matches the order of quality groups from lowest to highest. It seems that all first 3 PCs influence cluster separability.
R code
<- "Quality"
legend_txt
fviz_pca_ind(pca_model,
label = FALSE, alpha = 0.3,
habillage = py$wine_train_log$quality_gr, addEllipses = TRUE,
palette = "Dark2"
+
) labs(color = legend_txt, fill = legend_txt, shape = legend_txt)
R code
fviz_pca_ind(pca_model,
  axes = c(3, 2), label = FALSE, alpha = 0.3,
  habillage = py$wine_train_log$quality_gr, addEllipses = TRUE,
  palette = "Dark2"
) +
  labs(color = legend_txt, fill = legend_txt, shape = legend_txt)
4.2.3 PCA Correlation Circle Plot
The purpose of a correlation circle plot is to investigate relationships between variables. The angle between arrows shows the strength of a relationship in the analyzed PCA space, and the length of an arrow shows how well a variable is represented in that space.
The variables of interest are quality score and fraction of alcohol content. These variables will be highlighted.
R code
highlight_vars <- recode_factor(
  names(py$wine_train_log |> select_if(is.numeric)),
  "quality" = "quality",
  "alcohol" = "alcohol",
  .default = "other"
)
R code
fviz_pca_var(pca_model,
  axes = c(1, 2), repel = TRUE, alpha = 0.7,
  col.var = highlight_vars, palette = c("black", "darkred", "skyblue3")
) +
  lims(x = c(-1.1, 1.1), y = c(-1.1, 1.1)) +
  theme(legend.position = "none")
R code
fviz_pca_var(pca_model,
  axes = c(3, 2), repel = TRUE, alpha = 0.7,
  col.var = highlight_vars, palette = c("black", "darkred", "skyblue3")
) +
  lims(x = c(-1.1, 1.1), y = c(-1.1, 1.1)) +
  theme(legend.position = "none")
It seems that alcohol and quality score are related positively (their arrows point in almost the same direction), while alcohol and density are correlated negatively (their arrows point in opposite directions) in the PC2-PC3 space.
4.2.4 PCA Biplot
The purpose of PCA biplots is to investigate the relationship between patterns in points (e.g., clusters) and variables.
R code
fviz_pca_biplot(pca_model,
  axes = c(1, 2), repel = TRUE, alpha.ind = 0.2, alpha.var = 0.85,
  col.var = "black",
  label = "var",
  addEllipses = TRUE,
  habillage = py$wine_train_log$quality_gr,
  palette = "Dark2"
) +
  labs(color = legend_txt, fill = legend_txt, shape = legend_txt)
R code
fviz_pca_biplot(pca_model,
  axes = c(3, 2), repel = TRUE, alpha.ind = 0.2, alpha.var = 0.85,
  col.var = "black",
  label = "var",
  addEllipses = TRUE,
  habillage = py$wine_train_log$quality_gr,
  palette = "Dark2"
) +
  labs(color = legend_txt, fill = legend_txt, shape = legend_txt)
We can assume that volatile acidity may be related to the differences between medium and low quality wines: in the PC1-PC2 space, the direction of the volatile acidity arrow almost coincides with the direction between the centroids of the medium and low quality groups. The differences between medium and high quality may be related to alcohol content (see the biplot of the PC2-PC3 space).
4.3 Modelling Wine Quality
The purpose of this section is to create a logistic regression model that predicts wine quality.
A quick pair-wise summary of how the other variables relate to the quality score and quality group is presented in the “details” sections below. A summary table of correlations between the quality score and the other variables is shown after the “details” sections (the correlation analysis was carried out in Section 4.1).
Details: plots and summaries of variables by wine quality scores
Code
names(py$wine_train_log)
[1] "log_fixed_acidity" "volatile_acidity"
[3] "citric_acid" "log_residual_sugar"
[5] "log_chlorides" "log_free_sulfur_dioxide"
[7] "log_total_sulfur_dioxide" "density"
[9] "ph" "log_sulphates"
[11] "alcohol" "quality"
[13] "quality_gr"
R code
py$wine_train_log %>%
  mutate(quality = ordered(quality)) %>%
  DescTools::Desc(. ~ quality, data = ., verbose = 3)
------------------------------------------------------------------------------
log_fixed_acidity ~ quality (.)
Summary:
n pairs: 1'279, valid: 1'279 (100.0%), missings: 0 (0.0%), groups: 6
3 4 5 6 7 8
mean 0.906 0.866 0.906 0.910 0.938 0.917
median 0.875 0.860 0.892 0.898 0.949 0.898
sd 0.073 0.083 0.077 0.090 0.099 0.107
IQR 0.083 0.098 0.092 0.123 0.135 0.130
n 8 42 545 510 159 15
np 0.625% 3.284% 42.611% 39.875% 12.432% 1.173%
NAs 0 0 0 0 0 0
0s 0 0 0 0 0 0
min 0.833 0.663 0.716 0.699 0.690 0.699
max 1.017 1.097 1.201 1.155 1.193 1.100
Q1 0.860 0.816 0.857 0.845 0.869 0.863
Q3 0.944 0.914 0.949 0.968 1.004 0.993
mad 0.049 0.071 0.070 0.087 0.100 0.091
skew 0.651 0.198 0.690 0.401 -0.100 -0.309
kurt -1.457 0.611 0.845 -0.317 -0.062 -0.628
Kruskal-Wallis rank sum test:
Kruskal-Wallis chi-squared = 30.325, df = 5, p-value = 0.00001273
------------------------------------------------------------------------------
volatile_acidity ~ quality (.)
Summary:
n pairs: 1'279, valid: 1'279 (100.0%), missings: 0 (0.0%), groups: 6
3 4 5 6 7 8
mean 0.938 0.734 0.579 0.501 0.402 0.417
median 0.927 0.708 0.580 0.500 0.360 0.360
sd 0.350 0.215 0.167 0.164 0.147 0.143
IQR 0.298 0.324 0.210 0.230 0.185 0.080
n 8 42 545 510 159 15
np 0.625% 3.284% 42.611% 39.875% 12.432% 1.173%
NAs 0 0 0 0 0 0
0s 0 0 0 0 0 0
min 0.440 0.330 0.180 0.160 0.120 0.300
max 1.580 1.130 1.330 1.040 0.915 0.850
Q1 0.764 0.580 0.460 0.380 0.300 0.340
Q3 1.061 0.904 0.670 0.610 0.485 0.420
mad 0.274 0.259 0.163 0.170 0.119 0.059
skew 0.348 0.101 0.694 0.382 0.934 1.853
kurt -0.938 -1.057 1.603 -0.104 0.765 2.821
Kruskal-Wallis rank sum test:
Kruskal-Wallis chi-squared = 204.18, df = 5, p-value < 2.2e-16
------------------------------------------------------------------------------
citric_acid ~ quality (.)
Summary:
n pairs: 1'279, valid: 1'279 (100.0%), missings: 0 (0.0%), groups: 6
3 4 5 6 7 8
mean 0.129 0.130 0.243 0.268 0.381 0.387
median 0.035 0.065 0.220 0.250 0.400 0.390
sd 0.203 0.151 0.179 0.197 0.198 0.179
IQR 0.143 0.193 0.260 0.340 0.200 0.210
n 8 42 545 510 159 15
np 0.625% 3.284% 42.611% 39.875% 12.432% 1.173%
NAs 0 0 0 0 0 0
0s 3 8 46 47 6 0
min 0.000 0.000 0.000 0.000 0.000 0.050
max 0.490 0.540 0.790 0.780 0.760 0.720
Q1 0.000 0.022 0.100 0.090 0.310 0.305
Q3 0.143 0.215 0.360 0.430 0.510 0.515
mad 0.052 0.096 0.193 0.252 0.163 0.163
skew 0.940 1.250 0.526 0.251 -0.357 -0.218
kurt -1.151 0.469 -0.521 -1.014 -0.548 -0.672
Kruskal-Wallis rank sum test:
Kruskal-Wallis chi-squared = 89.261, df = 5, p-value < 2.2e-16
------------------------------------------------------------------------------
log_residual_sugar ~ quality (.)
Summary:
n pairs: 1'279, valid: 1'279 (100.0%), missings: 0 (0.0%), groups: 6
3 4 5 6 7 8
mean 0.392 0.393 0.367 0.361 0.395 0.366
median 0.322 0.322 0.342 0.342 0.342 0.342
sd 0.229 0.199 0.152 0.154 0.172 0.159
IQR 0.270 0.167 0.136 0.119 0.168 0.160
n 8 42 545 510 159 15
np 0.625% 3.284% 42.611% 39.875% 12.432% 1.173%
NAs 0 0 0 0 0 0
0s 0 0 0 0 0 0
min 0.079 0.114 0.079 -0.046 0.079 0.146
max 0.756 1.111 1.140 1.188 0.919 0.806
Q1 0.286 0.301 0.279 0.279 0.279 0.255
Q3 0.556 0.468 0.415 0.398 0.447 0.415
mad 0.263 0.082 0.108 0.094 0.129 0.129
skew 0.231 1.437 1.730 2.195 1.152 1.247
kurt -1.505 2.237 4.073 7.826 0.841 1.373
Kruskal-Wallis rank sum test:
Kruskal-Wallis chi-squared = 4.3531, df = 5, p-value = 0.4998
------------------------------------------------------------------------------
log_chlorides ~ quality (.)
Summary:
n pairs: 1'279, valid: 1'279 (100.0%), missings: 0 (0.0%), groups: 6
3 4 5 6 7 8
mean -0.921 -1.111 -1.068 -1.097 -1.139 -1.172
median -0.938 -1.100 -1.092 -1.108 -1.131 -1.155
sd 0.216 0.113 0.134 0.140 0.150 0.065
IQR 0.275 0.122 0.104 0.107 0.147 0.067
n 8 42 545 510 159 15
np 0.625% 3.284% 42.611% 39.875% 12.432% 1.173%
NAs 0 0 0 0 0 0
0s 0 0 0 0 0 0
min -1.215 -1.347 -1.409 -1.469 -1.921 -1.347
max -0.573 -0.764 -0.333 -0.382 -0.446 -1.097
Q1 -1.078 -1.174 -1.131 -1.167 -1.208 -1.201
Q3 -0.804 -1.052 -1.027 -1.060 -1.060 -1.134
mad 0.212 0.101 0.075 0.079 0.104 0.048
skew 0.235 0.416 2.404 1.901 -0.948 -1.227
kurt -1.511 0.600 10.046 6.959 10.440 0.920
Kruskal-Wallis rank sum test:
Kruskal-Wallis chi-squared = 66.782, df = 5, p-value = 4.782e-13
------------------------------------------------------------------------------
log_free_sulfur_dioxide ~ quality (.)
Summary:
n pairs: 1'279, valid: 1'279 (100.0%), missings: 0 (0.0%), groups: 6
3 4 5 6 7 8
mean 0.924 0.920 1.129 1.107 1.035 0.942
median 0.739 0.874 1.146 1.146 1.000 0.845
sd 0.371 0.272 0.298 0.294 0.309 0.288
IQR 0.529 0.406 0.477 0.477 0.465 0.438
n 8 42 545 510 159 15
np 0.625% 3.284% 42.611% 39.875% 12.432% 1.173%
NAs 0 0 0 0 0 0
0s 0 0 0 2 0 0
min 0.477 0.477 0.477 0.000 0.477 0.477
max 1.531 1.613 1.833 1.857 1.732 1.531
Q1 0.699 0.699 0.903 0.845 0.778 0.739
Q3 1.228 1.105 1.380 1.322 1.243 1.176
mad 0.223 0.282 0.347 0.347 0.329 0.217
skew 0.427 0.318 -0.189 -0.327 0.145 0.387
kurt -1.606 -0.584 -0.731 -0.169 -0.650 -0.987
Kruskal-Wallis rank sum test:
Kruskal-Wallis chi-squared = 35.271, df = 5, p-value = 0.000001329
------------------------------------------------------------------------------
log_total_sulfur_dioxide ~ quality (.)
Summary:
n pairs: 1'279, valid: 1'279 (100.0%), missings: 0 (0.0%), groups: 6
3 4 5 6 7 8
mean 1.291 1.381 1.640 1.523 1.428 1.370
median 1.175 1.362 1.653 1.544 1.415 1.230
sd 0.284 0.321 0.321 0.271 0.301 0.255
IQR 0.397 0.546 0.531 0.382 0.403 0.311
n 8 42 545 510 159 15
np 0.625% 3.284% 42.611% 39.875% 12.432% 1.173%
NAs 0 0 0 0 0 0
0s 0 0 0 0 0 0
min 0.954 0.845 0.778 0.778 0.845 1.079
max 1.690 2.053 2.190 2.217 2.461 1.944
Q1 1.120 1.114 1.398 1.342 1.230 1.204
Q3 1.517 1.660 1.929 1.724 1.633 1.515
mad 0.263 0.367 0.402 0.279 0.313 0.222
skew 0.370 0.322 -0.252 -0.182 0.490 0.721
kurt -1.716 -1.030 -0.869 -0.490 0.395 -0.683
Kruskal-Wallis rank sum test:
Kruskal-Wallis chi-squared = 95.311, df = 5, p-value < 2.2e-16
------------------------------------------------------------------------------
density ~ quality (.)
Summary:
n pairs: 1'279, valid: 1'279 (100.0%), missings: 0 (0.0%), groups: 6
3 4 5 6 7 8
mean 0.997 0.996 0.997 0.997 0.996 0.995
median 0.998 0.996 0.997 0.997 0.996 0.995
sd 0.002 0.001 0.002 0.002 0.002 0.002
IQR 0.002 0.002 0.002 0.002 0.003 0.003
n 8 42 545 510 159 15
np 0.625% 3.284% 42.611% 39.875% 12.432% 1.173%
NAs 0 0 0 0 0 0
0s 0 0 0 0 0 0
min 0.995 0.993 0.993 0.990 0.991 0.992
max 0.999 1.001 1.003 1.004 1.003 0.999
Q1 0.996 0.996 0.996 0.995 0.995 0.994
Q3 0.998 0.997 0.998 0.998 0.997 0.997
mad 0.002 0.001 0.001 0.002 0.002 0.003
skew -0.315 0.555 0.425 0.136 0.335 -0.194
kurt -1.652 0.950 1.169 0.678 0.338 -1.281
Kruskal-Wallis rank sum test:
Kruskal-Wallis chi-squared = 49.245, df = 5, p-value = 0.000000001977
------------------------------------------------------------------------------
ph ~ quality (.)
Summary:
n pairs: 1'279, valid: 1'279 (100.0%), missings: 0 (0.0%), groups: 6
3 4 5 6 7 8
mean 3.397 3.414 3.303 3.320 3.291 3.274
median 3.390 3.380 3.300 3.320 3.290 3.230
sd 0.142 0.157 0.147 0.155 0.152 0.197
IQR 0.167 0.195 0.190 0.190 0.180 0.175
n 8 42 545 510 159 15
np 0.625% 3.284% 42.611% 39.875% 12.432% 1.173%
NAs 0 0 0 0 0 0
0s 0 0 0 0 0 0
min 3.160 3.050 2.880 2.860 2.920 2.880
max 3.630 3.900 3.720 4.010 3.780 3.720
Q1 3.317 3.312 3.200 3.220 3.200 3.175
Q3 3.485 3.507 3.390 3.410 3.380 3.350
mad 0.126 0.119 0.148 0.148 0.133 0.119
skew -0.014 0.663 0.114 0.296 0.329 0.448
kurt -1.063 1.031 0.050 1.528 0.655 0.187
Kruskal-Wallis rank sum test:
Kruskal-Wallis chi-squared = 28.884, df = 5, p-value = 0.00002443
------------------------------------------------------------------------------
log_sulphates ~ quality (.)
Summary:
n pairs: 1'279, valid: 1'279 (100.0%), missings: 0 (0.0%), groups: 6
3 4 5 6 7 8
mean -0.260 -0.265 -0.223 -0.182 -0.136 -0.113
median -0.276 -0.252 -0.237 -0.194 -0.125 -0.131
sd 0.096 0.076 0.095 0.087 0.083 0.062
IQR 0.052 0.080 0.104 0.112 0.109 0.072
n 8 42 545 510 159 15
np 0.625% 3.284% 42.611% 39.875% 12.432% 1.173%
NAs 0 0 0 0 0 0
0s 0 0 0 1 0 0
min -0.398 -0.481 -0.432 -0.398 -0.409 -0.187
max -0.066 0.033 0.297 0.290 0.134 0.041
Q1 -0.297 -0.317 -0.284 -0.237 -0.187 -0.158
Q3 -0.245 -0.237 -0.180 -0.125 -0.078 -0.086
mad 0.038 0.051 0.070 0.076 0.073 0.066
skew 0.665 0.780 1.477 0.944 -0.255 0.896
kurt -0.343 4.793 4.020 2.414 1.024 0.172
Kruskal-Wallis rank sum test:
Kruskal-Wallis chi-squared = 207.01, df = 5, p-value < 2.2e-16
------------------------------------------------------------------------------
alcohol ~ quality (.)
Summary:
n pairs: 1'279, valid: 1'279 (100.0%), missings: 0 (0.0%), groups: 6
3 4 5 6 7 8
mean 10.075 10.420 9.912 10.624 11.408 11.900
median 10.050 10.350 9.700 10.500 11.500 11.700
sd 0.845 0.968 0.747 1.057 0.972 1.180
IQR 0.975 1.575 0.800 1.575 1.350 1.600
n 8 42 545 510 159 15
np 0.625% 3.284% 42.611% 39.875% 12.432% 1.173%
NAs 0 0 0 0 0 0
0s 0 0 0 0 0 0
min 8.400 9.000 8.500 8.400 9.200 9.800
max 11.000 13.100 14.900 14.000 14.000 14.000
Q1 9.775 9.600 9.400 9.800 10.750 11.150
Q3 10.750 11.175 10.200 11.375 12.100 12.750
mad 0.741 1.112 0.445 1.186 1.038 1.334
skew -0.659 0.373 1.911 0.583 -0.019 -0.168
kurt -0.730 -0.508 5.899 -0.123 -0.586 -1.005
Kruskal-Wallis rank sum test:
Kruskal-Wallis chi-squared = 309.38, df = 5, p-value < 2.2e-16
------------------------------------------------------------------------------
quality_gr ~ quality (.)
Summary:
n: 1'279, rows: 3, columns: 6
Pearson's Chi-squared test:
X-squared = 2558, df = 10, p-value < 2.2e-16
Pearson's Chi-squared test (cont. adj):
X-squared = 2558, df = 10, p-value < 2.2e-16
Log likelihood ratio (G-test) test of independence:
G = 1424.6, X-squared df = 10, p-value < 2.2e-16
Mantel-Haenszel Chi-squared:
X-squared = 843.76, df = 1, p-value < 2.2e-16
Warning message:
Exp. counts < 5: Chi-squared approx. may be incorrect!!
estimate lwr.ci upr.ci'
Contingency Coeff. 0.8165 - -
Cramer V 1.0000 0.9595 1.0000
Kendall Tau-b 0.6827 0.6534 0.7120
Goodman Kruskal Gamma 1.0000 1.0000 1.0000
Stuart Tau-c 0.4493 0.4049 0.4938
Somers D C|R 1.0000 1.0000 1.0000
Somers D R|C 0.4661 0.4078 0.5244
Pearson Correlation 0.8125 0.7930 0.8304
Spearman Correlation 0.7133 0.6853 0.7392
Lambda C|R 0.2738 0.2416 0.3061
Lambda R|C 1.0000 1.0000 1.0000
Lambda sym 0.4436 0.4054 0.4819
Uncertainty Coeff. C|R 0.4698 0.4462 0.4935
Uncertainty Coeff. R|C 1.0000 0.9999 1.0000
Uncertainty Coeff. sym 0.6393 0.6174 0.6612
Mutual Information 0.8035 - -
quality 3 4 5 6 7 8 Sum
quality_gr
low freq 8 42 0 0 0 0 50
perc 0.6% 3.3% 0.0% 0.0% 0.0% 0.0% 3.9%
p.row 16.0% 84.0% 0.0% 0.0% 0.0% 0.0% .
p.col 100.0% 100.0% 0.0% 0.0% 0.0% 0.0% .
medium freq 0 0 545 510 0 0 1'055
perc 0.0% 0.0% 42.6% 39.9% 0.0% 0.0% 82.5%
p.row 0.0% 0.0% 51.7% 48.3% 0.0% 0.0% .
p.col 0.0% 0.0% 100.0% 100.0% 0.0% 0.0% .
high freq 0 0 0 0 159 15 174
perc 0.0% 0.0% 0.0% 0.0% 12.4% 1.2% 13.6%
p.row 0.0% 0.0% 0.0% 0.0% 91.4% 8.6% .
p.col 0.0% 0.0% 0.0% 0.0% 100.0% 100.0% .
Sum freq 8 42 545 510 159 15 1'279
perc 0.6% 3.3% 42.6% 39.9% 12.4% 1.2% 100.0%
p.row . . . . . . .
p.col . . . . . . .
----------
' 95% conf. level
Details: plots and summaries of variables by wine quality group
R code
py$wine_train_log %>%
  DescTools::Desc(. ~ quality_gr, data = ., verbose = 3)
------------------------------------------------------------------------------
log_fixed_acidity ~ quality_gr (.)
Summary:
n pairs: 1'279, valid: 1'279 (100.0%), missings: 0 (0.0%), groups: 3
low medium high
mean 0.873 0.908 0.937
median 0.866 0.892 0.944
sd 0.082 0.084 0.100
IQR 0.085 0.105 0.135
n 50 1'055 174
np 3.909% 82.486% 13.604%
NAs 0 0 0
0s 0 0 0
min 0.663 0.699 0.690
max 1.097 1.201 1.193
Q1 0.833 0.851 0.869
Q3 0.918 0.957 1.004
mad 0.070 0.078 0.103
skew 0.194 0.534 -0.128
kurt 0.533 0.185 -0.061
Kruskal-Wallis rank sum test:
Kruskal-Wallis chi-squared = 28.538, df = 2, p-value = 0.0000006354
------------------------------------------------------------------------------
volatile_acidity ~ quality_gr (.)
Summary:
n pairs: 1'279, valid: 1'279 (100.0%), missings: 0 (0.0%), groups: 3
low medium high
mean 0.766 0.542 0.404
median 0.760 0.540 0.360
sd 0.248 0.170 0.146
IQR 0.348 0.225 0.177
n 50 1'055 174
np 3.909% 82.486% 13.604%
NAs 0 0 0
0s 0 0 0
min 0.330 0.160 0.120
max 1.580 1.330 0.915
Q1 0.582 0.420 0.302
Q3 0.930 0.645 0.480
mad 0.263 0.163 0.104
skew 0.595 0.514 1.012
kurt 0.502 0.864 0.992
Kruskal-Wallis rank sum test:
Kruskal-Wallis chi-squared = 151.12, df = 2, p-value < 2.2e-16
------------------------------------------------------------------------------
citric_acid ~ quality_gr (.)
Summary:
n pairs: 1'279, valid: 1'279 (100.0%), missings: 0 (0.0%), groups: 3
low medium high
mean 0.129 0.255 0.381
median 0.055 0.240 0.400
sd 0.158 0.188 0.196
IQR 0.195 0.310 0.208
n 50 1'055 174
np 3.909% 82.486% 13.604%
NAs 0 0 0
0s 11 93 6
min 0.000 0.000 0.000
max 0.540 0.790 0.760
Q1 0.020 0.090 0.302
Q3 0.215 0.400 0.510
mad 0.082 0.237 0.156
skew 1.234 0.392 -0.353
kurt 0.234 -0.797 -0.522
Kruskal-Wallis rank sum test:
Kruskal-Wallis chi-squared = 85.537, df = 2, p-value < 2.2e-16
------------------------------------------------------------------------------
log_residual_sugar ~ quality_gr (.)
Summary:
n pairs: 1'279, valid: 1'279 (100.0%), missings: 0 (0.0%), groups: 3
low medium high
mean 0.393 0.364 0.393
median 0.322 0.342 0.342
sd 0.202 0.153 0.170
IQR 0.220 0.136 0.164
n 50 1'055 174
np 3.909% 82.486% 13.604%
NAs 0 0 0
0s 0 0 0
min 0.079 -0.046 0.079
max 1.111 1.188 0.919
Q1 0.301 0.279 0.279
Q3 0.521 0.415 0.443
mad 0.099 0.094 0.129
skew 1.216 1.961 1.172
kurt 1.583 5.936 0.923
Kruskal-Wallis rank sum test:
Kruskal-Wallis chi-squared = 3.7912, df = 2, p-value = 0.1502
------------------------------------------------------------------------------
log_chlorides ~ quality_gr (.)
Summary:
n pairs: 1'279, valid: 1'279 (100.0%), missings: 0 (0.0%), groups: 3
low medium high
mean -1.081 -1.082 -1.142
median -1.089 -1.097 -1.137
sd 0.150 0.138 0.145
IQR 0.151 0.108 0.137
n 50 1'055 174
np 3.909% 82.486% 13.604%
NAs 0 0 0
0s 0 0 0
min -1.347 -1.469 -1.921
max -0.573 -0.333 -0.446
Q1 -1.172 -1.149 -1.208
Q3 -1.021 -1.041 -1.071
mad 0.109 0.077 0.105
skew 1.105 2.101 -0.926
kurt 1.729 8.331 11.085
Kruskal-Wallis rank sum test:
Kruskal-Wallis chi-squared = 30.575, df = 2, p-value = 0.0000002295
------------------------------------------------------------------------------
log_free_sulfur_dioxide ~ quality_gr (.)
Summary:
n pairs: 1'279, valid: 1'279 (100.0%), missings: 0 (0.0%), groups: 3
low medium high
mean 0.921 1.118 1.027
median 0.845 1.146 1.000
sd 0.285 0.296 0.308
IQR 0.439 0.459 0.452
n 50 1'055 174
np 3.909% 82.486% 13.604%
NAs 0 0 0
0s 0 2 0
min 0.477 0.000 0.477
max 1.613 1.857 1.732
Q1 0.699 0.903 0.778
Q3 1.138 1.362 1.230
mad 0.319 0.347 0.329
skew 0.384 -0.253 0.173
kurt -0.690 -0.450 -0.658
Kruskal-Wallis rank sum test:
Kruskal-Wallis chi-squared = 32.821, df = 2, p-value = 0.00000007464
------------------------------------------------------------------------------
log_total_sulfur_dioxide ~ quality_gr (.)
Summary:
n pairs: 1'279, valid: 1'279 (100.0%), missings: 0 (0.0%), groups: 3
low medium high
mean 1.366 1.583 1.423
median 1.331 1.591 1.398
sd 0.315 0.303 0.297
IQR 0.546 0.451 0.427
n 50 1'055 174
np 3.909% 82.486% 13.604%
NAs 0 0 0
0s 0 0 0
min 0.845 0.778 0.845
max 2.053 2.217 2.461
Q1 1.114 1.362 1.204
Q3 1.660 1.813 1.631
mad 0.322 0.339 0.329
skew 0.364 -0.124 0.519
kurt -1.004 -0.701 0.391
Kruskal-Wallis rank sum test:
Kruskal-Wallis chi-squared = 58.932, df = 2, p-value = 1.596e-13
------------------------------------------------------------------------------
density ~ quality_gr (.)
Summary:
n pairs: 1'279, valid: 1'279 (100.0%), missings: 0 (0.0%), groups: 3
low medium high
mean 0.996 0.997 0.996
median 0.996 0.997 0.996
sd 0.002 0.002 0.002
IQR 0.002 0.002 0.003
n 50 1'055 174
np 3.909% 82.486% 13.604%
NAs 0 0 0
0s 0 0 0
min 0.993 0.990 0.991
max 1.001 1.004 1.003
Q1 0.996 0.996 0.995
Q3 0.997 0.998 0.997
mad 0.001 0.002 0.002
skew 0.414 0.149 0.298
kurt 0.142 1.021 0.301
Kruskal-Wallis rank sum test:
Kruskal-Wallis chi-squared = 25.761, df = 2, p-value = 0.000002547
------------------------------------------------------------------------------
ph ~ quality_gr (.)
Summary:
n pairs: 1'279, valid: 1'279 (100.0%), missings: 0 (0.0%), groups: 3
low medium high
mean 3.411 3.311 3.289
median 3.380 3.310 3.275
sd 0.153 0.151 0.156
IQR 0.188 0.190 0.178
n 50 1'055 174
np 3.909% 82.486% 13.604%
NAs 0 0 0
0s 0 0 0
min 3.050 2.860 2.880
max 3.900 4.010 3.780
Q1 3.312 3.210 3.200
Q3 3.500 3.400 3.378
mad 0.119 0.148 0.126
skew 0.606 0.219 0.344
kurt 0.986 0.887 0.718
Kruskal-Wallis rank sum test:
Kruskal-Wallis chi-squared = 23.989, df = 2, p-value = 0.000006178
------------------------------------------------------------------------------
log_sulphates ~ quality_gr (.)
Summary:
n pairs: 1'279, valid: 1'279 (100.0%), missings: 0 (0.0%), groups: 3
low medium high
mean -0.264 -0.203 -0.134
median -0.256 -0.222 -0.128
sd 0.078 0.094 0.081
IQR 0.073 0.113 0.105
n 50 1'055 174
np 3.909% 82.486% 13.604%
NAs 0 0 0
0s 0 1 0
min -0.481 -0.432 -0.409
max 0.033 0.297 0.134
Q1 -0.310 -0.268 -0.185
Q3 -0.237 -0.155 -0.081
mad 0.058 0.080 0.077
skew 0.803 1.115 -0.244
kurt 3.735 2.769 1.126
Kruskal-Wallis rank sum test:
Kruskal-Wallis chi-squared = 134.41, df = 2, p-value < 2.2e-16
------------------------------------------------------------------------------
alcohol ~ quality_gr (.)
Summary:
n pairs: 1'279, valid: 1'279 (100.0%), missings: 0 (0.0%), groups: 3
low medium high
mean 10.365 10.256 11.451
median 10.250 10.000 11.500
sd 0.950 0.977 0.998
IQR 1.375 1.350 1.400
n 50 1'055 174
np 3.909% 82.486% 13.604%
NAs 0 0 0
0s 0 0 0
min 8.400 8.400 9.200
max 13.100 14.900 14.000
Q1 9.625 9.500 10.800
Q3 11.000 10.850 12.200
mad 1.038 0.890 1.038
skew 0.300 1.121 0.010
kurt -0.267 1.117 -0.564
Kruskal-Wallis rank sum test:
Kruskal-Wallis chi-squared = 173.51, df = 2, p-value < 2.2e-16
------------------------------------------------------------------------------
quality ~ quality_gr (.)
Summary:
n pairs: 1'279, valid: 1'279 (100.0%), missings: 0 (0.0%), groups: 3
low medium high
mean 3.840 5.483 7.086
median 4.000 5.000 7.000
sd 0.370 0.500 0.281
IQR 0.000 1.000 0.000
n 50 1'055 174
np 3.909% 82.486% 13.604%
NAs 0 0 0
0s 0 0 0
min 3.000 5.000 7.000
max 4.000 6.000 8.000
Q1 4.000 5.000 7.000
Q3 4.000 6.000 7.000
mad 0.000 0.000 0.000
skew -1.799 0.066 2.923
kurt 1.265 -1.997 6.583
Kruskal-Wallis rank sum test:
Kruskal-Wallis chi-squared = 650.26, df = 2, p-value < 2.2e-16
Correlation strength between wine quality score and the remaining variables:
Code
"X=='quality' | Y == 'quality'") corr_df_log.query(
X Y r CI95% p_adj
65 alcohol quality 0.446 [0.4, 0.49] <0.001
20 volatile_acidity quality -0.405 [-0.45, -0.36] <0.001
64 log_sulphates quality 0.345 [0.3, 0.39] <0.001
29 citric_acid quality 0.247 [0.19, 0.3] <0.001
44 log_chlorides quality -0.169 [-0.22, -0.12] <0.001
55 log_total_sulfur_dioxide quality -0.162 [-0.22, -0.11] <0.001
59 density quality -0.152 [-0.21, -0.1] <0.001
10 log_fixed_acidity quality 0.123 [0.07, 0.18] <0.001
62 ph quality -0.063 [-0.12, -0.01] 0.324
50 log_free_sulfur_dioxide quality -0.041 [-0.1, 0.01] >0.999
37 log_residual_sugar quality 0.016 [-0.04, 0.07] >0.999
Results:
- Correlation analysis of quality scores revealed that the strongest correlations are between alcohol and quality (r = 0.45, 95% CI [0.40, 0.49], p_adj < 0.001) and between volatile_acidity and quality (r = -0.41, 95% CI [-0.45, -0.36], p_adj < 0.001).
- These correlations are only moderate in strength, so no single variable is enough for extremely accurate predictions.
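The corr_df_log table queried above (columns X, Y, r, CI95%, p_adj) is created earlier in the project; its layout matches the output of a pairwise correlation routine such as pingouin.pairwise_corr. A sketch under that assumption (the column names and the p-value adjustment method of the original may differ):
Code
import pingouin as pg

# A hypothetical reconstruction, assuming pingouin was used
corr_df_log = pg.pairwise_corr(
    wine_train_log.select_dtypes("number"),
    method="pearson",
    padjust="holm",  # assumption: the actual adjustment method is not shown here
)
# pingouin names the adjusted p-value column "p-corr" rather than "p_adj"
corr_df_log.query("X == 'quality' | Y == 'quality'")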
4.3.1 Wine Quality Distribution
Different wine quality scores come at different rates:
Code
quality_score_counts = wine_train_log.quality.value_counts().sort_index()
print(test_chi_square_gof(quality_score_counts))
Chi square test, χ²(5, n = 1279) = 1462.79, p<0.001
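test_chi_square_gof() is a project helper defined earlier; the reported statistic is consistent with a goodness-of-fit test against equal expected counts. A hypothetical sketch of such a helper:
Code
import scipy.stats as sps

def test_chi_square_gof(counts):
    # Goodness-of-fit test against uniform (equal) expected counts
    chisq, p = sps.chisquare(counts)
    p_txt = "p<0.001" if p < 0.001 else f"p={p:.3f}"
    n = int(sum(counts))
    return f"Chi square test, χ²({len(counts) - 1}, n = {n}) = {chisq:.2f}, {p_txt}"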
Average quality wines are the most common (error bars show 95 percent confidence intervals):
Code
quality_score_distribution = ci_proportion_multinomial(
    quality_score_counts
).reset_index(names="quality_score")

quality_score_distribution
quality_score n percent ci_lower ci_upper
0 3 8 0.625 0.254 1.531
1 4 42 3.284 2.201 4.872
2 5 545 42.611 39.013 46.290
3 6 510 39.875 36.327 43.532
4 7 159 12.432 10.199 15.071
5 8 15 1.173 0.602 2.272
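ci_proportion_multinomial() is also a project helper. A hypothetical sketch built on statsmodels, which provides simultaneous confidence intervals for multinomial proportions (the original may use a different method):
Code
import pandas as pd
from statsmodels.stats.proportion import multinomial_proportions_confint

def ci_proportion_multinomial(counts):
    # Simultaneous 95% CIs (Goodman method, an assumption), in percent
    ci = multinomial_proportions_confint(counts, method="goodman")
    return pd.DataFrame(
        {
            "n": counts.to_numpy(),
            "percent": counts.to_numpy() / counts.sum() * 100,
            "ci_lower": ci[:, 0] * 100,
            "ci_upper": ci[:, 1] * 100,
        },
        index=counts.index,
    )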
Code
ggplot(
  py$quality_score_distribution,
  aes(x = factor(quality_score), y = percent)
) +
  geom_col(fill = "skyblue3") +
  geom_errorbar(aes(ymin = ci_lower, ymax = ci_upper), width = 0.3, lwd = 1) +
  labs(
    x = "Wine quality score",
    y = "Percentage",
    title = "The Distribution of Wines by Quality Scores"
  )
Code
quality_group_counts = wine_train_log.quality_gr.value_counts().sort_index()
print(test_chi_square_gof(quality_group_counts))
Chi square test, χ²(2, n = 1279) = 1408.57, p<0.001
Code
quality_group_distribution = ci_proportion_multinomial(
    quality_group_counts
).reset_index(names="quality_group")

quality_group_distribution
quality_group n percent ci_lower ci_upper
0 low 50 3.909 2.804 5.426
1 medium 1055 82.486 79.799 84.884
2 high 174 13.604 11.471 16.062
Code
ggplot(
  py$quality_group_distribution,
  aes(x = quality_group, y = percent)
) +
  geom_col(fill = "skyblue3") +
  geom_errorbar(aes(ymin = ci_lower, ymax = ci_upper), width = 0.3, lwd = 1) +
  labs(
    x = "Wine quality group",
    y = "Percentage",
    title = "The Distribution of Wines by Quality Groups"
  )
4.3.2 Logistic Regression Model
Code
"quality_gr_num"] = wine_train_log["quality"].apply(
wine_train_log[lambda x: to_3_quality_groups(x, numeric=True)
)
# quality_gr_num=2 is a reference group
"quality_gr_num"], wine_train_log["quality"]) pd.crosstab(wine_train_log[
quality 3 4 5 6 7 8
quality_gr_num
0 8 42 0 0 0 0
1 0 0 0 0 159 15
2 0 0 545 510 0 0
Here:
- 0: low quality;
- 2: medium quality (reference group);
- 1: high quality.
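For reference, a hypothetical sketch of a grouping helper consistent with the crosstab above (the actual to_3_quality_groups() is defined earlier in the project):
Code
def to_3_quality_groups(x, numeric=False):
    # Scores 3-4 -> low (0), 5-6 -> medium (2, the reference), 7-8 -> high (1);
    # the codes follow the crosstab shown above
    if x <= 4:
        group, code = "low", 0
    elif x <= 6:
        group, code = "medium", 2
    else:
        group, code = "high", 1
    return code if numeric else group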
Step | Selected model | Model / Additionally removed variable | Kappa | AIC | Pseudo R² |
---|---|---|---|---|---|
(NULL) | no | NULL | 0.000 | 1428.6 | 0.000 |
(FULL) | no | FULL | 0.327 | 1045.6 | 0.300 |
(1) | no | (FULL) - citric_acid | 0.339 | 1042.9 | 0.299 |
(2) | no | (1) - density | 0.334 | 1043.4 | 0.296 |
(3) | no | (2) - log_fixed_acidity | 0.329 | 1040.1 | 0.295 |
(4) | yes | (3) - log_total_sulfur_dioxide | 0.334 | 1053.2 | 0.283 |
(5) | no | (4) - log_chlorides | 0.308 | 1059.3 | 0.276 |
Find the details below.
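The do_mnlogit() helper used in the steps below is defined earlier in the project. A hypothetical sketch of what it does, based on the outputs shown (a statsmodels MNLogit summary, Cohen's kappa, a classification report, and a confusion matrix):
Code
import statsmodels.formula.api as smf
from sklearn.metrics import (
    classification_report, cohen_kappa_score, confusion_matrix
)

def do_mnlogit(formula, data):
    # Fit a multinomial logit and report fit and classification quality
    model = smf.mnlogit(formula, data).fit()
    print(model.summary2())
    y_true = data["quality_gr_num"]
    y_pred = model.predict(data).idxmax(axis=1)  # most probable class
    print(f"kappa = {cohen_kappa_score(y_true, y_pred):.3f}")
    print(classification_report(y_true, y_pred))
    print("Confusion matrix (rows - true, columns - predicted):")
    print(confusion_matrix(y_true, y_pred))
    return model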
Multinomial regression (null model)
Code
= "quality_gr_num ~ 1"
formula_mnl_null
with warnings.catch_warnings():
"ignore")
warnings.simplefilter(= do_mnlogit(formula_mnl_null, wine_train_log) _
Optimization terminated successfully.
Current function value: 0.556927
Iterations 6
Results: MNLogit
================================================================
Model: MNLogit Method: MLE
Dependent Variable: quality_gr_num Pseudo R-squared: 0.000
Date: 2023-07-30 16:02 AIC: 1428.6187
No. Observations: 1279 BIC: 1438.9263
Df Model: 0 Log-Likelihood: -712.31
Df Residuals: 1277 LL-Null: -712.31
Converged: 1.0000 LLR p-value: nan
No. Iterations: 6.0000 Scale: 1.0000
--------------------------------------------------------------
quality_gr_num = 0 Coef. Std.Err. t P>|t| [0.025 0.975]
--------------------------------------------------------------
Intercept 1.2470 0.1605 7.7717 0.0000 0.9325 1.5615
---------------------------------------------------------------
quality_gr_num = 1 Coef. Std.Err. t P>|t| [0.025 0.975]
---------------------------------------------------------------
Intercept 3.0493 0.1447 21.0682 0.0000 2.7656 3.3329
================================================================
kappa = 0.000
precision recall f1-score support
0 0.00 0.00 0.00 50
1 0.00 0.00 0.00 174
2 0.82 1.00 0.90 1055
accuracy 0.82 1279
macro avg 0.27 0.33 0.30 1279
weighted avg 0.68 0.82 0.75 1279
Confusion matrix (rows - true, columns - predicted):
[[ 0 0 50]
[ 0 0 174]
[ 0 0 1055]]
Multinomial regression (full model)
Code
= "quality_gr_num ~ " + " + ".join(
formula_mnl_full
["log_total_sulfur_dioxide",
"log_chlorides",
"log_sulphates",
"log_residual_sugar",
"log_free_sulfur_dioxide",
"volatile_acidity",
"log_fixed_acidity",
"density",
"citric_acid",
"ph",
"alcohol",
]
)
= do_mnlogit(formula_mnl_full, wine_train_log) _
Optimization terminated successfully.
Current function value: 0.390006
Iterations 12
Results: MNLogit
=============================================================================
Model: MNLogit Method: MLE
Dependent Variable: quality_gr_num Pseudo R-squared: 0.300
Date: 2023-07-30 16:02 AIC: 1045.6357
No. Observations: 1279 BIC: 1169.3277
Df Model: 22 Log-Likelihood: -498.82
Df Residuals: 1255 LL-Null: -712.31
Converged: 1.0000 LLR p-value: 1.0879e-76
No. Iterations: 12.0000 Scale: 1.0000
-----------------------------------------------------------------------------
quality_gr_num = 0 Coef. Std.Err. t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------
Intercept 119.8409 280.2599 0.4276 0.6689 -429.4584 669.1402
log_total_sulfur_dioxide -0.8022 1.1834 -0.6778 0.4979 -3.1216 1.5173
log_chlorides -2.9954 1.5809 -1.8948 0.0581 -6.0939 0.1031
log_sulphates 13.1604 2.8123 4.6796 0.0000 7.6484 18.6723
log_residual_sugar -0.0340 1.6013 -0.0212 0.9831 -3.1725 3.1046
log_free_sulfur_dioxide 2.0438 1.1666 1.7520 0.0798 -0.2426 4.3302
volatile_acidity -6.3168 1.2915 -4.8911 0.0000 -8.8480 -3.7855
log_fixed_acidity 3.8494 6.8680 0.5605 0.5752 -9.6116 17.3103
density -118.3428 288.3712 -0.4104 0.6815 -683.5399 446.8542
citric_acid 0.1209 1.6703 0.0724 0.9423 -3.1528 3.3946
ph -3.5982 2.5049 -1.4365 0.1509 -8.5078 1.3114
alcohol 0.9063 0.3405 2.6619 0.0078 0.2390 1.5736
-----------------------------------------------------------------------------
quality_gr_num = 1 Coef. Std.Err. t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------
Intercept -143.9094 253.5512 -0.5676 0.5703 -640.8606 353.0417
log_total_sulfur_dioxide 1.2885 1.0042 1.2831 0.1995 -0.6798 3.2567
log_chlorides -0.4318 1.3705 -0.3151 0.7527 -3.1178 2.2543
log_sulphates 4.9014 2.5869 1.8947 0.0581 -0.1688 9.9716
log_residual_sugar -2.7214 1.4126 -1.9265 0.0540 -5.4899 0.0472
log_free_sulfur_dioxide 1.2659 1.0113 1.2518 0.2107 -0.7162 3.2479
volatile_acidity -4.1921 0.9725 -4.3105 0.0000 -6.0983 -2.2860
log_fixed_acidity -1.5063 6.3136 -0.2386 0.8114 -13.8807 10.8681
density 161.1611 260.9309 0.6176 0.5368 -350.2541 672.5763
citric_acid -0.7918 1.4244 -0.5559 0.5783 -3.5835 2.0000
ph -4.3257 2.2476 -1.9246 0.0543 -8.7309 0.0795
alcohol 0.3417 0.3092 1.1050 0.2692 -0.2643 0.9476
=============================================================================
kappa = 0.327
precision recall f1-score support
0 0.58 0.14 0.23 50
1 0.60 0.33 0.43 174
2 0.86 0.96 0.91 1055
accuracy 0.84 1279
macro avg 0.68 0.48 0.52 1279
weighted avg 0.82 0.84 0.82 1279
Confusion matrix (rows - true, columns - predicted):
[[ 7 1 42]
[ 0 58 116]
[ 5 38 1012]]
Multinomial regression (1)
Code
# Excluded:
# - citric_acid
= "quality_gr_num ~ " + " + ".join(
formula_mnl_1
["log_total_sulfur_dioxide",
"log_chlorides",
"log_sulphates",
"log_residual_sugar",
"log_free_sulfur_dioxide",
"volatile_acidity",
"log_fixed_acidity",
"density",
"ph",
"alcohol",
]
)
= do_mnlogit(formula_mnl_1, wine_train_log) _
Optimization terminated successfully.
Current function value: 0.390481
Iterations 12
Results: MNLogit
=============================================================================
Model: MNLogit Method: MLE
Dependent Variable: quality_gr_num Pseudo R-squared: 0.299
Date: 2023-07-30 16:02 AIC: 1042.8517
No. Observations: 1279 BIC: 1156.2360
Df Model: 20 Log-Likelihood: -499.43
Df Residuals: 1257 LL-Null: -712.31
Converged: 1.0000 LLR p-value: 9.0793e-78
No. Iterations: 12.0000 Scale: 1.0000
-----------------------------------------------------------------------------
quality_gr_num = 0 Coef. Std.Err. t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------
Intercept 96.9830 279.8112 0.3466 0.7289 -451.4369 645.4030
log_total_sulfur_dioxide -0.8883 1.1358 -0.7820 0.4342 -3.1144 1.3379
log_chlorides -2.9051 1.5873 -1.8301 0.0672 -6.0162 0.2061
log_sulphates 13.0336 2.7989 4.6566 0.0000 7.5478 18.5194
log_residual_sugar -0.0304 1.6022 -0.0190 0.9849 -3.1707 3.1098
log_free_sulfur_dioxide 2.0718 1.1431 1.8124 0.0699 -0.1687 4.3123
volatile_acidity -6.5900 1.1552 -5.7047 0.0000 -8.8541 -4.3259
log_fixed_acidity 3.4837 6.6004 0.5278 0.5976 -9.4529 16.4203
density -94.3171 287.8715 -0.3276 0.7432 -658.5348 469.9007
ph -3.7780 2.5068 -1.5071 0.1318 -8.6912 1.1352
alcohol 0.9255 0.3379 2.7393 0.0062 0.2633 1.5877
-----------------------------------------------------------------------------
quality_gr_num = 1 Coef. Std.Err. t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------
Intercept -154.0818 253.0695 -0.6089 0.5426 -650.0889 341.9253
log_total_sulfur_dioxide 1.1203 0.9506 1.1785 0.2386 -0.7429 2.9835
log_chlorides -0.4739 1.3808 -0.3432 0.7314 -3.1802 2.2323
log_sulphates 4.7820 2.5728 1.8587 0.0631 -0.2607 9.8247
log_residual_sugar -2.7823 1.4142 -1.9674 0.0491 -5.5541 -0.0105
log_free_sulfur_dioxide 1.3634 0.9892 1.3784 0.1681 -0.5753 3.3021
volatile_acidity -3.9864 0.9089 -4.3862 0.0000 -5.7678 -2.2051
log_fixed_acidity -2.5589 6.0456 -0.4233 0.6721 -14.4080 9.2903
density 172.2310 260.3805 0.6615 0.5083 -338.1055 682.5674
ph -4.3085 2.2549 -1.9107 0.0560 -8.7280 0.1111
alcohol 0.3257 0.3079 1.0575 0.2903 -0.2779 0.9292
=============================================================================
kappa = 0.339
precision recall f1-score support
0 0.58 0.14 0.23 50
1 0.61 0.34 0.44 174
2 0.87 0.96 0.91 1055
accuracy 0.84 1279
macro avg 0.69 0.48 0.53 1279
weighted avg 0.82 0.84 0.82 1279
Confusion matrix (rows - true, columns - predicted):
[[ 7 1 42]
[ 0 60 114]
[ 5 37 1013]]
Multinomial regression (2)
Code
# Excluded:
# - citric_acid
# - density
= "quality_gr_num ~ " + " + ".join(
formula_mnl_2
["log_total_sulfur_dioxide",
"log_chlorides",
"log_sulphates",
"log_residual_sugar",
"log_free_sulfur_dioxide",
"volatile_acidity",
"log_fixed_acidity",
"ph",
"alcohol",
]
)
= do_mnlogit(formula_mnl_2, wine_train_log) _
Optimization terminated successfully.
Current function value: 0.392263
Iterations 8
Results: MNLogit
=========================================================================
Model: MNLogit Method: MLE
Dependent Variable: quality_gr_num Pseudo R-squared: 0.296
Date: 2023-07-30 16:02 AIC: 1043.4087
No. Observations: 1279 BIC: 1146.4854
Df Model: 18 Log-Likelihood: -501.70
Df Residuals: 1259 LL-Null: -712.31
Converged: 1.0000 LLR p-value: 3.4228e-78
No. Iterations: 8.0000 Scale: 1.0000
-------------------------------------------------------------------------
quality_gr_num = 0 Coef. Std.Err. t P>|t| [0.025 0.975]
-------------------------------------------------------------------------
Intercept 5.2484 9.4399 0.5560 0.5782 -13.2535 23.7502
log_total_sulfur_dioxide -1.1858 1.1280 -1.0513 0.2931 -3.3966 1.0250
log_chlorides -3.1511 1.5809 -1.9932 0.0462 -6.2497 -0.0526
log_sulphates 12.9201 2.6849 4.8121 0.0000 7.6578 18.1825
log_residual_sugar -0.3780 1.1593 -0.3261 0.7444 -2.6502 1.8942
log_free_sulfur_dioxide 2.2795 1.1356 2.0072 0.0447 0.0537 4.5053
volatile_acidity -7.0259 1.1412 -6.1566 0.0000 -9.2626 -4.7892
log_fixed_acidity 1.9618 3.7933 0.5172 0.6050 -5.4728 9.3965
ph -4.1767 1.9943 -2.0944 0.0362 -8.0854 -0.2680
alcohol 0.9960 0.2180 4.5682 0.0000 0.5687 1.4233
-------------------------------------------------------------------------
quality_gr_num = 1 Coef. Std.Err. t P>|t| [0.025 0.975]
-------------------------------------------------------------------------
Intercept 13.1628 8.3845 1.5699 0.1164 -3.2705 29.5960
log_total_sulfur_dioxide 1.1572 0.9501 1.2180 0.2232 -0.7050 3.0194
log_chlorides -0.3448 1.3728 -0.2512 0.8017 -3.0354 2.3458
log_sulphates 5.3032 2.4627 2.1534 0.0313 0.4763 10.1301
log_residual_sugar -2.1039 0.9858 -2.1341 0.0328 -4.0362 -0.1717
log_free_sulfur_dioxide 1.3164 0.9863 1.3347 0.1820 -0.6167 3.2496
volatile_acidity -4.0310 0.9055 -4.4516 0.0000 -5.8058 -2.2562
log_fixed_acidity 0.7247 3.4907 0.2076 0.8355 -6.1169 7.5663
ph -3.3871 1.7697 -1.9139 0.0556 -6.8558 0.0815
alcohol 0.1733 0.1972 0.8789 0.3795 -0.2131 0.5597
=========================================================================
kappa = 0.334
precision recall f1-score support
0 0.58 0.14 0.23 50
1 0.61 0.34 0.44 174
2 0.87 0.96 0.91 1055
accuracy 0.84 1279
macro avg 0.69 0.48 0.52 1279
weighted avg 0.82 0.84 0.82 1279
Confusion matrix (rows - true, columns - predicted):
[[ 7 1 42]
[ 0 59 115]
[ 5 37 1013]]
Multinomial regression (3)
Code
# Excluded:
# - citric_acid
# - density
# - log_fixed_acidity
= "quality_gr_num ~ " + " + ".join(
formula_mnl_3
["log_total_sulfur_dioxide",
"log_chlorides",
"log_sulphates",
"log_residual_sugar",
"log_free_sulfur_dioxide",
"volatile_acidity",
"ph",
"alcohol",
]
)
= do_mnlogit(formula_mnl_3, wine_train_log) _
Optimization terminated successfully.
Current function value: 0.392523
Iterations 8
Results: MNLogit
========================================================================
Model: MNLogit Method: MLE
Dependent Variable: quality_gr_num Pseudo R-squared: 0.295
Date: 2023-07-30 16:02 AIC: 1040.0733
No. Observations: 1279 BIC: 1132.8423
Df Model: 16 Log-Likelihood: -502.04
Df Residuals: 1261 LL-Null: -712.31
Converged: 1.0000 LLR p-value: 1.7841e-79
No. Iterations: 8.0000 Scale: 1.0000
------------------------------------------------------------------------
quality_gr_num = 0 Coef. Std.Err. t P>|t| [0.025 0.975]
------------------------------------------------------------------------
Intercept 9.8512 4.6487 2.1191 0.0341 0.7399 18.9625
log_total_sulfur_dioxide -1.2134 1.1257 -1.0779 0.2811 -3.4197 0.9930
log_chlorides -3.0846 1.5759 -1.9573 0.0503 -6.1734 0.0041
log_sulphates 13.1280 2.6369 4.9785 0.0000 7.9597 18.2964
log_residual_sugar -0.2949 1.1536 -0.2557 0.7982 -2.5560 1.9661
log_free_sulfur_dioxide 2.2388 1.1280 1.9848 0.0472 0.0280 4.4495
volatile_acidity -7.0784 1.1341 -6.2412 0.0000 -9.3012 -4.8555
ph -4.9478 1.4362 -3.4452 0.0006 -7.7626 -2.1330
alcohol 0.9895 0.2177 4.5454 0.0000 0.5628 1.4162
------------------------------------------------------------------------
quality_gr_num = 1 Coef. Std.Err. t P>|t| [0.025 0.975]
------------------------------------------------------------------------
Intercept 14.6710 4.1401 3.5436 0.0004 6.5565 22.7856
log_total_sulfur_dioxide 1.1759 0.9511 1.2364 0.2163 -0.6882 3.0401
log_chlorides -0.3522 1.3767 -0.2558 0.7981 -3.0506 2.3461
log_sulphates 5.4129 2.4166 2.2399 0.0251 0.6765 10.1493
log_residual_sugar -2.0923 0.9879 -2.1179 0.0342 -4.0285 -0.1560
log_free_sulfur_dioxide 1.2830 0.9781 1.3118 0.1896 -0.6339 3.2000
volatile_acidity -4.0041 0.9001 -4.4487 0.0000 -5.7682 -2.2400
ph -3.6413 1.2705 -2.8661 0.0042 -6.1313 -1.1512
alcohol 0.1723 0.1971 0.8743 0.3819 -0.2140 0.5586
========================================================================
kappa = 0.329
precision recall f1-score support
0 0.58 0.14 0.23 50
1 0.60 0.33 0.43 174
2 0.87 0.96 0.91 1055
accuracy 0.84 1279
macro avg 0.68 0.48 0.52 1279
weighted avg 0.82 0.84 0.82 1279
Confusion matrix (rows - true, columns - predicted):
[[ 7 1 42]
[ 0 58 116]
[ 5 37 1013]]
Multinomial regression (4)
Code
# Excluded:
# - citric_acid
# - density
# - log_fixed_acidity
# - log_total_sulfur_dioxide
= "quality_gr_num ~ " + " + ".join(
formula_mnl_4
["log_chlorides",
"log_sulphates",
"log_residual_sugar",
"log_free_sulfur_dioxide",
"volatile_acidity",
"ph",
"alcohol",
]
)
= do_mnlogit(formula_mnl_4, wine_train_log) classification_model_selected
Optimization terminated successfully.
Current function value: 0.399213
Iterations 8
Results: MNLogit
=======================================================================
Model: MNLogit Method: MLE
Dependent Variable: quality_gr_num Pseudo R-squared: 0.283
Date: 2023-07-30 16:02 AIC: 1053.1867
No. Observations: 1279 BIC: 1135.6481
Df Model: 14 Log-Likelihood: -510.59
Df Residuals: 1263 LL-Null: -712.31
Converged: 1.0000 LLR p-value: 2.3990e-77
No. Iterations: 8.0000 Scale: 1.0000
-----------------------------------------------------------------------
quality_gr_num = 0 Coef. Std.Err. t P>|t| [0.025 0.975]
-----------------------------------------------------------------------
Intercept 9.7895 4.3226 2.2647 0.0235 1.3173 18.2617
log_chlorides -2.7688 1.5637 -1.7706 0.0766 -5.8337 0.2961
log_sulphates 12.7081 2.5966 4.8942 0.0000 7.6189 17.7973
log_residual_sugar -0.8818 1.1331 -0.7782 0.4364 -3.1026 1.3390
log_free_sulfur_dioxide 1.3545 0.6432 2.1059 0.0352 0.0938 2.6152
volatile_acidity -7.3250 1.1180 -6.5521 0.0000 -9.5162 -5.1338
ph -5.0049 1.4014 -3.5714 0.0004 -7.7515 -2.2583
alcohol 0.9959 0.2127 4.6814 0.0000 0.5790 1.4129
-----------------------------------------------------------------------
quality_gr_num = 1 Coef. Std.Err. t P>|t| [0.025 0.975]
-----------------------------------------------------------------------
Intercept 16.4577 3.8374 4.2888 0.0000 8.9366 23.9789
log_chlorides -0.3891 1.3754 -0.2829 0.7773 -3.0849 2.3067
log_sulphates 5.7134 2.3979 2.3827 0.0172 1.0137 10.4131
log_residual_sugar -1.9889 0.9859 -2.0174 0.0437 -3.9211 -0.0566
log_free_sulfur_dioxide 2.2719 0.5707 3.9811 0.0001 1.1534 3.3904
volatile_acidity -3.9980 0.8887 -4.4986 0.0000 -5.7398 -2.2561
ph -3.8146 1.2447 -3.0646 0.0022 -6.2542 -1.3749
alcohol 0.1256 0.1925 0.6528 0.5139 -0.2516 0.5029
=======================================================================
kappa = 0.334
precision recall f1-score support
0 0.78 0.14 0.24 50
1 0.61 0.33 0.43 174
2 0.87 0.96 0.91 1055
accuracy 0.85 1279
macro avg 0.75 0.48 0.53 1279
weighted avg 0.83 0.85 0.82 1279
Confusion matrix (rows - true, columns - predicted):
[[ 7 2 41]
[ 0 57 117]
[ 2 35 1018]]
Multinomial regression (5)
Code
# Excluded:
# - citric_acid
# - density
# - log_fixed_acidity
# - log_total_sulfur_dioxide
# - log_chlorides
= "quality_gr_num ~ " + " + ".join(
formula_mnl_5
["log_sulphates",
"log_residual_sugar",
"log_free_sulfur_dioxide",
"volatile_acidity",
"ph",
"alcohol",
]
)
= do_mnlogit(formula_mnl_5, wine_train_log) _
Optimization terminated successfully.
Current function value: 0.403161
Iterations 8
Results: MNLogit
=======================================================================
Model: MNLogit Method: MLE
Dependent Variable: quality_gr_num Pseudo R-squared: 0.276
Date: 2023-07-30 16:02 AIC: 1059.2852
No. Observations: 1279 BIC: 1131.4389
Df Model: 12 Log-Likelihood: -515.64
Df Residuals: 1265 LL-Null: -712.31
Converged: 1.0000 LLR p-value: 9.7571e-77
No. Iterations: 8.0000 Scale: 1.0000
-----------------------------------------------------------------------
quality_gr_num = 0 Coef. Std.Err. t P>|t| [0.025 0.975]
-----------------------------------------------------------------------
Intercept 10.6176 4.2785 2.4816 0.0131 2.2319 19.0033
log_sulphates 11.8845 2.5466 4.6669 0.0000 6.8934 16.8757
log_residual_sugar -1.2226 1.1130 -1.0985 0.2720 -3.4039 0.9588
log_free_sulfur_dioxide 1.4828 0.6371 2.3275 0.0199 0.2342 2.7315
volatile_acidity -7.7299 1.1044 -6.9989 0.0000 -9.8946 -5.5652
ph -4.5934 1.3800 -3.3286 0.0009 -7.2981 -1.8887
alcohol 1.0818 0.2073 5.2173 0.0000 0.6754 1.4882
-----------------------------------------------------------------------
quality_gr_num = 1 Coef. Std.Err. t P>|t| [0.025 0.975]
-----------------------------------------------------------------------
Intercept 16.6041 3.8108 4.3571 0.0000 9.1351 24.0731
log_sulphates 5.6195 2.3597 2.3814 0.0172 0.9945 10.2445
log_residual_sugar -2.0352 0.9648 -2.1095 0.0349 -3.9262 -0.1442
log_free_sulfur_dioxide 2.2960 0.5667 4.0516 0.0001 1.1853 3.4066
volatile_acidity -4.0292 0.8773 -4.5929 0.0000 -5.7486 -2.3097
ph -3.7697 1.2280 -3.0699 0.0021 -6.1764 -1.3629
alcohol 0.1371 0.1875 0.7313 0.4646 -0.2304 0.5047
=======================================================================
kappa = 0.308
precision recall f1-score support
0 0.78 0.14 0.24 50
1 0.57 0.32 0.41 174
2 0.86 0.96 0.91 1055
accuracy 0.84 1279
macro avg 0.74 0.47 0.52 1279
weighted avg 0.82 0.84 0.81 1279
Confusion matrix (rows - true, columns - predicted):
[[ 7 0 43]
[ 0 55 119]
[ 2 42 1011]]
4.4 Modelling Alcohol Content
The purpose of this section is to create a linear regression model that predicts the fraction of alcohol content in wine.
A quick pair-wise summary of how the other variables relate to the fraction of alcohol content in wine is presented in the details section below. A summary table of correlations between alcohol and the other variables is shown after it. The correlation analysis was carried out in Section 4.1.
Details: plots and summaries of variables by alcohol content
Let’s investigate the relationship between alcohol content and the other variables.
R code
py$wine_train_log %>%
  mutate(quality = ordered(quality)) %>%
  DescTools::Desc(. ~ alcohol, data = ., verbose = 3)
------------------------------------------------------------------------------
log_fixed_acidity ~ alcohol (.)
Summary:
n pairs: 1'279, valid: 1'279 (100.0%), missings: 0 (0.0%)
Pearson corr. : -0.104
Spearman corr.: -0.077
Kendall corr. : -0.056
------------------------------------------------------------------------------
volatile_acidity ~ alcohol (.)
Summary:
n pairs: 1'279, valid: 1'279 (100.0%), missings: 0 (0.0%)
Pearson corr. : -0.192
Spearman corr.: -0.214
Kendall corr. : -0.144
------------------------------------------------------------------------------
citric_acid ~ alcohol (.)
Summary:
n pairs: 1'279, valid: 1'279 (100.0%), missings: 0 (0.0%)
Pearson corr. : 0.101
Spearman corr.: 0.083
Kendall corr. : 0.055
------------------------------------------------------------------------------
log_residual_sugar ~ alcohol (.)
Summary:
n pairs: 1'279, valid: 1'279 (100.0%), missings: 0 (0.0%)
Pearson corr. : 0.069
Spearman corr.: 0.109
Kendall corr. : 0.076
------------------------------------------------------------------------------
log_chlorides ~ alcohol (.)
Summary:
n pairs: 1'279, valid: 1'279 (100.0%), missings: 0 (0.0%)
Pearson corr. : -0.315
Spearman corr.: -0.306
Kendall corr. : -0.213
------------------------------------------------------------------------------
log_free_sulfur_dioxide ~ alcohol (.)
Summary:
n pairs: 1'279, valid: 1'279 (100.0%), missings: 0 (0.0%)
Pearson corr. : -0.084
Spearman corr.: -0.080
Kendall corr. : -0.055
------------------------------------------------------------------------------
log_total_sulfur_dioxide ~ alcohol (.)
Summary:
n pairs: 1'279, valid: 1'279 (100.0%), missings: 0 (0.0%)
Pearson corr. : -0.231
Spearman corr.: -0.253
Kendall corr. : -0.175
------------------------------------------------------------------------------
density ~ alcohol (.)
Summary:
n pairs: 1'279, valid: 1'279 (100.0%), missings: 0 (0.0%)
Pearson corr. : -0.493
Spearman corr.: -0.461
Kendall corr. : -0.329
------------------------------------------------------------------------------
ph ~ alcohol (.)
Summary:
n pairs: 1'279, valid: 1'279 (100.0%), missings: 0 (0.0%)
Pearson corr. : 0.227
Spearman corr.: 0.204
Kendall corr. : 0.144
------------------------------------------------------------------------------
log_sulphates ~ alcohol (.)
Summary:
n pairs: 1'279, valid: 1'279 (100.0%), missings: 0 (0.0%)
Pearson corr. : 0.150
Spearman corr.: 0.212
Kendall corr. : 0.148
------------------------------------------------------------------------------
quality ~ alcohol (.)
Summary:
n pairs: 1'279, valid: 1'279 (100.0%), missings: 0 (0.0%), groups: 6
3 4 5 6 7 8
mean 10.075 10.420 9.912 10.624 11.408 11.900
median 10.050 10.350 9.700 10.500 11.500 11.700
sd 0.845 0.968 0.747 1.057 0.972 1.180
IQR 0.975 1.575 0.800 1.575 1.350 1.600
n 8 42 545 510 159 15
np 0.625% 3.284% 42.611% 39.875% 12.432% 1.173%
NAs 0 0 0 0 0 0
0s 0 0 0 0 0 0
min 8.400 9.000 8.500 8.400 9.200 9.800
max 11.000 13.100 14.900 14.000 14.000 14.000
Q1 9.775 9.600 9.400 9.800 10.750 11.150
Q3 10.750 11.175 10.200 11.375 12.100 12.750
mad 0.741 1.112 0.445 1.186 1.038 1.334
skew -0.659 0.373 1.911 0.583 -0.019 -0.168
kurt -0.730 -0.508 5.899 -0.123 -0.586 -1.005
Kruskal-Wallis rank sum test:
Kruskal-Wallis chi-squared = 309.38, df = 5, p-value < 2.2e-16
Proportions of quality in the quantiles of alcohol:
Q1 Q2 Q3 Q4
3 0.3% 1.2% 1.0% 0.0%
4 2.1% 3.9% 3.7% 3.6%
5 68.2% 54.0% 31.7% 11.9%
6 28.2% 34.7% 49.3% 49.3%
7 1.2% 5.6% 13.7% 31.5%
8 0.0% 0.6% 0.7% 3.6%
------------------------------------------------------------------------------
quality_gr ~ alcohol (.)
Summary:
n pairs: 1'279, valid: 1'279 (100.0%), missings: 0 (0.0%), groups: 3
low medium high
mean 10.365 10.256 11.451
median 10.250 10.000 11.500
sd 0.950 0.977 0.998
IQR 1.375 1.350 1.400
n 50 1'055 174
np 3.909% 82.486% 13.604%
NAs 0 0 0
0s 0 0 0
min 8.400 8.400 9.200
max 13.100 14.900 14.000
Q1 9.625 9.500 10.800
Q3 11.000 10.850 12.200
mad 1.038 0.890 1.038
skew 0.300 1.121 0.010
kurt -0.267 1.117 -0.564
Kruskal-Wallis rank sum test:
Kruskal-Wallis chi-squared = 173.51, df = 2, p-value < 2.2e-16
Proportions of quality_gr in the quantiles of alcohol:
Q1 Q2 Q3 Q4
low 2.4% 5.0% 4.7% 3.6%
medium 96.5% 88.7% 81.0% 61.3%
high 1.2% 6.2% 14.3% 35.1%
------------------------------------------------------------------------------
quality_gr_num ~ alcohol (.)
Summary:
n pairs: 1'279, valid: 1'279 (100.0%), missings: 0 (0.0%)
Pearson corr. : -0.257
Spearman corr.: -0.320
Kendall corr. : -0.261
Correlation strength between alcohol concentration and the remaining variables:
Code
"X=='alcohol' | Y == 'alcohol'") corr_df_log.query(
X Y r CI95% p_adj
58 density alcohol -0.493 [-0.53, -0.45] <0.001
65 alcohol quality 0.446 [0.4, 0.49] <0.001
43 log_chlorides alcohol -0.315 [-0.36, -0.26] <0.001
54 log_total_sulfur_dioxide alcohol -0.231 [-0.28, -0.18] <0.001
61 ph alcohol 0.227 [0.17, 0.28] <0.001
19 volatile_acidity alcohol -0.192 [-0.24, -0.14] <0.001
63 log_sulphates alcohol 0.150 [0.1, 0.2] <0.001
9 log_fixed_acidity alcohol -0.104 [-0.16, -0.05] 0.004
28 citric_acid alcohol 0.101 [0.05, 0.15] 0.006
49 log_free_sulfur_dioxide alcohol -0.084 [-0.14, -0.03] 0.047
36 log_residual_sugar alcohol 0.069 [0.01, 0.12] 0.182
Results:
- Correlation analysis of alcohol content revealed that the strongest correlations are between density and alcohol (r = -0.49, 95% CI [-0.53, -0.45], p_adj < 0.001) and between alcohol and quality (r = 0.45, 95% CI [0.40, 0.49], p_adj < 0.001).
- These correlations are only moderate in strength, so no single variable is enough for extremely accurate predictions.
4.4.1 Sequential Feature Selection: All Features
Let’s do sequential feature selection and investigate how many features might be enough.
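do_sfs_lin_reg() and show_sfs_results_lin_reg() are project helpers defined earlier. Since the code below calls get_metric_dict(), the selector is evidently built on mlxtend; a hypothetical sketch of the fitting part (the k_features setting is an assumption):
Code
import patsy
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.linear_model import LinearRegression

def do_sfs_lin_reg(formula, data, forward=True):
    # Build the design matrix from the formula, drop patsy's intercept column
    y, X = patsy.dmatrices(formula, data, return_type="dataframe")
    X = X.drop(columns="Intercept")
    sfs = SFS(
        LinearRegression(),
        k_features="parsimonious",  # assumption; the original setting is not shown
        forward=forward,
        scoring="r2",
        cv=5,
    )
    return sfs.fit(X, y.to_numpy().ravel())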
Code
# For data without transformation
= "alcohol ~ " + " + ".join(
formula_full
["total_sulfur_dioxide",
"chlorides",
"sulphates",
"residual_sugar",
"free_sulfur_dioxide",
"volatile_acidity",
"fixed_acidity",
"density",
"citric_acid",
"ph",
"quality",
] )
Code
# For log-transformed data
= "alcohol ~ " + " + ".join(
formula_log_full
["log_total_sulfur_dioxide",
"log_chlorides",
"log_sulphates",
"log_residual_sugar",
"log_free_sulfur_dioxide",
"volatile_acidity",
"log_fixed_acidity",
"density",
"citric_acid",
"ph",
"quality",
] )
SFS results
Code
np.random.seed(251)
sfs_res1 = do_sfs_lin_reg(formula_full, wine_train)
show_sfs_results_lin_reg(sfs_res1, "(All Variables on Original Scale)")
Forward Feature Selection with 5-fold CV
(All Variables on Original Scale)
Selected variables:
[ n = 9, avg. R² = 0.672 ]
1. sulphates
2. residual_sugar
3. free_sulfur_dioxide
4. volatile_acidity
5. fixed_acidity
6. density
7. citric_acid
8. ph
9. quality
SFS results
Code
np.random.seed(251)
sfs_res2 = do_sfs_lin_reg(formula_full, wine_train, False)
show_sfs_results_lin_reg(sfs_res2, "(All Variables on Original Scale)")
Backward Feature Selection with 5-fold CV
(All Variables on Original Scale)
Selected variables:
[ n = 7, avg. R² = 0.667 ]
1. sulphates
2. residual_sugar
3. free_sulfur_dioxide
4. fixed_acidity
5. density
6. ph
7. quality
SFS results
Code
np.random.seed(251)
sfs_res3 = do_sfs_lin_reg(formula_log_full, wine_train_log)
show_sfs_results_lin_reg(sfs_res3, "(Log-Transformed Variables Included)")
Forward Feature Selection with 5-fold CV
(Log-Transformed Variables Included)
Selected variables:
[ n = 7, avg. R² = 0.689 ]
1. log_sulphates
2. log_residual_sugar
3. log_free_sulfur_dioxide
4. log_fixed_acidity
5. density
6. ph
7. quality
SFS results
Code
sfs_res4 = do_sfs_lin_reg(formula_log_full, wine_train_log, False)
show_sfs_results_lin_reg(sfs_res4, "(Log-Transformed Variables Included)")
Backward Feature Selection with 5-fold CV
(Log-Transformed Variables Included)
Selected variables:
[ n = 6, avg. R² = 0.686 ]
1. log_sulphates
2. log_residual_sugar
3. log_fixed_acidity
4. density
5. ph
6. quality
Although the algorithm suggests more, 5-6 variables would be enough, as additional features give only a slight improvement. Log-transformed features give slightly better performance than features on the original scale.
Code
n_features = 6
List selected features and regression performance
Code
print("All features on original scale")
All features on original scale
Code
get_sfs_performance_lin_reg(sfs_res2, n_features)
{'n_features': 6, 'mean R²': 0.664, 'median R²': 0.661, 'sd R²': 0.014}
Code
"feature_names"] sfs_res2.get_metric_dict()[n_features][
('sulphates', 'residual_sugar', 'fixed_acidity', 'density', 'ph', 'quality')
Code
print("With log-transformed features")
With log-transformed features
Code
get_sfs_performance_lin_reg(sfs_res4, n_features)
{'n_features': 6, 'mean R²': 0.686, 'median R²': 0.669, 'sd R²': 0.03}
Code
formula_selected_6 = sfs_res4.get_metric_dict()[n_features]["feature_names"]
formula_selected_6
('log_sulphates', 'log_residual_sugar', 'log_fixed_acidity', 'density', 'ph', 'quality')
Code
= "alcohol ~ " + " + ".join(formula_selected_6) formula_selected_6
4.4.2 Sequential Feature Selection: No Quality
Quality score seems to be the most expensive feature and might not be available at prediction time. So let’s remove it from the pool of candidate features.
Code
# Formulas that exclude "quality"
formula_q = formula_full.replace(" + quality", "")
formula_log_q = formula_log_full.replace(" + quality", "")
SFS results
Code
np.random.seed(251)
sfs_res1q = do_sfs_lin_reg(formula_q, wine_train)
show_sfs_results_lin_reg(sfs_res1q, "(All Variables on Original Scale)")
Forward Feature Selection with 5-fold CV
(All Variables on Original Scale)
Selected variables:
[ n = 7, avg. R² = 0.655 ]
1. total_sulfur_dioxide
2. sulphates
3. residual_sugar
4. fixed_acidity
5. density
6. citric_acid
7. ph
SFS results
Code
np.random.seed(251)
sfs_res2q = do_sfs_lin_reg(formula_q, wine_train, False)
show_sfs_results_lin_reg(sfs_res2q, "(All Variables on Original Scale)")
Backward Feature Selection with 5-fold CV
(All Variables on Original Scale)
Selected variables:
[ n = 5, avg. R² = 0.647 ]
1. sulphates
2. residual_sugar
3. fixed_acidity
4. density
5. ph
SFS results
Code
np.random.seed(251)
sfs_res3q = do_sfs_lin_reg(formula_log_q, wine_train_log)
show_sfs_results_lin_reg(sfs_res3q, "(Log-Transformed Variables Included)")
Forward Feature Selection with 5-fold CV
(Log-Transformed Variables Included)
Selected variables:
[ n = 6, avg. R² = 0.677 ]
1. log_total_sulfur_dioxide
2. log_sulphates
3. log_residual_sugar
4. log_fixed_acidity
5. density
6. ph
SFS results
Code
sfs_res4q = do_sfs_lin_reg(formula_log_q, wine_train_log, False)
show_sfs_results_lin_reg(sfs_res4q, "(Log-Transformed Variables Included)")
Backward Feature Selection with 5-fold CV
(Log-Transformed Variables Included)
Selected variables:
[ n = 5, avg. R² = 0.674 ]
1. log_sulphates
2. log_residual_sugar
3. log_fixed_acidity
4. density
5. ph
Although the algorithm suggests using more features, 5 features would be enough, as additional features give only a slight improvement. Again, log-transformed features give slightly better performance than features on the original scale.
Code
n_features_q = 5
List selected features and regression performance
Code
print("All features on original scale")
All features on original scale
Code
get_sfs_performance_lin_reg(sfs_res2q, n_features_q)
{'n_features': 5, 'mean R²': 0.647, 'median R²': 0.641, 'sd R²': 0.021}
Code
"feature_names"] sfs_res2q.get_metric_dict()[n_features_q][
('sulphates', 'residual_sugar', 'fixed_acidity', 'density', 'ph')
Code
print("With log-transformed features")
With log-transformed features
Code
get_sfs_performance_lin_reg(sfs_res4q, n_features_q)
{'n_features': 5, 'mean R²': 0.674, 'median R²': 0.663, 'sd R²': 0.033}
Code
formula_selected_5 = sfs_res4q.get_metric_dict()[n_features_q]["feature_names"]
formula_selected_5
('log_sulphates', 'log_residual_sugar', 'log_fixed_acidity', 'density', 'ph')
Code
= "alcohol ~ " + " + ".join(formula_selected_5) formula_selected_5
4.4.3 Compare Null, Full and Selected Models
| Model        | R²    | AIC  |
|--------------|-------|------|
| Null         | 0.000 | 3779 |
| Full         | 0.703 | 2250 |
| 6 predictors | 0.692 | 2284 |
| 5 predictors | 0.679 | 2336 |
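For reference, the R² and AIC values in the table can be read directly off the fitted statsmodels results. A minimal sketch, assuming the four models are fitted as in the tabs below:
Code
# Assemble the R²/AIC comparison from fitted statsmodels results
# (sketch; relies on the models created in the following tabs).
import pandas as pd

models = {
    "Null": regression_model_null,
    "Full": regression_model_full,
    "6 predictors": regression_model_6,
    "5 predictors": regression_model_5,
}
pd.DataFrame({
    "Model": list(models),
    "R²": [round(m.rsquared, 3) for m in models.values()],
    "AIC": [round(m.aic) for m in models.values()],
})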
Alcohol prediction: null model
Code
= "alcohol ~ 1"
formula_null
= smf.ols(formula_null, wine_train_log)
model_null = model_null.fit()
regression_model_null print(f"R² = {regression_model_null.rsquared:.3f}")
R² = -0.000
Code
print(regression_model_null.summary())
OLS Regression Results
==============================================================================
Dep. Variable: alcohol R-squared: -0.000
Model: OLS Adj. R-squared: -0.000
Method: Least Squares F-statistic: nan
Date: Sun, 30 Jul 2023 Prob (F-statistic): nan
Time: 16:03:04 Log-Likelihood: -1888.5
No. Observations: 1279 AIC: 3779.
Df Residuals: 1278 BIC: 3784.
Df Model: 0
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 10.4228 0.030 351.749 0.000 10.365 10.481
==============================================================================
Omnibus: 126.420 Durbin-Watson: 1.978
Prob(Omnibus): 0.000 Jarque-Bera (JB): 163.802
Skew: 0.869 Prob(JB): 2.70e-36
Kurtosis: 3.233 Cond. No. 1.00
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Alcohol prediction: full model
Code
model_full = smf.ols(formula_log_full, wine_train_log)
regression_model_full = model_full.fit()
print(f"R² = {regression_model_full.rsquared:.3f}")
R² = 0.703
Code
print(regression_model_full.summary())
OLS Regression Results
==============================================================================
Dep. Variable: alcohol R-squared: 0.703
Model: OLS Adj. R-squared: 0.700
Method: Least Squares F-statistic: 272.1
Date: Sun, 30 Jul 2023 Prob (F-statistic): 4.94e-324
Time: 16:03:05 Log-Likelihood: -1112.9
No. Observations: 1279 AIC: 2250.
Df Residuals: 1267 BIC: 2312.
Df Model: 11
Covariance Type: nonrobust
============================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------------------
Intercept 582.5403 15.608 37.324 0.000 551.921 613.160
log_total_sulfur_dioxide -0.1779 0.096 -1.859 0.063 -0.365 0.010
log_chlorides -0.3171 0.136 -2.332 0.020 -0.584 -0.050
log_sulphates 1.8950 0.199 9.518 0.000 1.504 2.286
log_residual_sugar 2.8405 0.121 23.554 0.000 2.604 3.077
log_free_sulfur_dioxide -0.0633 0.094 -0.672 0.502 -0.248 0.122
volatile_acidity 0.4518 0.120 3.768 0.000 0.217 0.687
log_fixed_acidity 9.7684 0.456 21.422 0.000 8.874 10.663
density -597.2859 16.092 -37.117 0.000 -628.855 -565.716
citric_acid 0.7387 0.141 5.222 0.000 0.461 1.016
ph 3.6857 0.171 21.577 0.000 3.351 4.021
quality 0.1779 0.025 7.207 0.000 0.129 0.226
==============================================================================
Omnibus: 160.987 Durbin-Watson: 1.956
Prob(Omnibus): 0.000 Jarque-Bera (JB): 281.782
Skew: 0.816 Prob(JB): 6.48e-62
Kurtosis: 4.621 Cond. No. 9.89e+03
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 9.89e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
Alcohol prediction: 6-predictor model
Model with 6 selected variables (including quality score):
Code
model_6 = smf.ols(formula_selected_6, wine_train_log)
regression_model_6 = model_6.fit()
print(f"R² = {regression_model_6.rsquared:.3f}")
R² = 0.692
Code
print(regression_model_6.summary())
OLS Regression Results
==============================================================================
Dep. Variable: alcohol R-squared: 0.692
Model: OLS Adj. R-squared: 0.691
Method: Least Squares F-statistic: 476.9
Date: Sun, 30 Jul 2023 Prob (F-statistic): 2.99e-321
Time: 16:03:06 Log-Likelihood: -1134.9
No. Observations: 1279 AIC: 2284.
Df Residuals: 1272 BIC: 2320.
Df Model: 6
Covariance Type: nonrobust
======================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------------
Intercept 605.6237 14.560 41.595 0.000 577.060 634.188
log_sulphates 1.8327 0.190 9.623 0.000 1.459 2.206
log_residual_sugar 2.9160 0.120 24.327 0.000 2.681 3.151
log_fixed_acidity 11.2386 0.383 29.357 0.000 10.488 11.990
density -621.9934 14.939 -41.637 0.000 -651.300 -592.686
ph 3.8674 0.160 24.103 0.000 3.553 4.182
quality 0.1782 0.024 7.456 0.000 0.131 0.225
==============================================================================
Omnibus: 166.331 Durbin-Watson: 1.945
Prob(Omnibus): 0.000 Jarque-Bera (JB): 294.610
Skew: 0.834 Prob(JB): 1.06e-64
Kurtosis: 4.658 Cond. No. 8.60e+03
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 8.6e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
Alcohol prediction: 5-predictor model
Model with 5 selected variables (without quality score):
Code
model_5 = smf.ols(formula_selected_5, wine_train_log)
regression_model_5 = model_5.fit()
print(f"R² = {regression_model_5.rsquared:.3f}")
R² = 0.679
Code
print(regression_model_5.summary())
OLS Regression Results
==============================================================================
Dep. Variable: alcohol R-squared: 0.679
Model: OLS Adj. R-squared: 0.678
Method: Least Squares F-statistic: 538.1
Date: Sun, 30 Jul 2023 Prob (F-statistic): 7.72e-311
Time: 16:03:07 Log-Likelihood: -1162.2
No. Observations: 1279 AIC: 2336.
Df Residuals: 1273 BIC: 2367.
Df Model: 5
Covariance Type: nonrobust
======================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------------
Intercept 651.0897 13.502 48.221 0.000 624.601 677.579
log_sulphates 2.3627 0.180 13.094 0.000 2.009 2.717
log_residual_sugar 3.0885 0.120 25.714 0.000 2.853 3.324
log_fixed_acidity 12.1631 0.370 32.885 0.000 11.438 12.889
density -667.9833 13.894 -48.075 0.000 -695.242 -640.725
ph 4.0415 0.162 24.930 0.000 3.723 4.360
==============================================================================
Omnibus: 162.823 Durbin-Watson: 1.958
Prob(Omnibus): 0.000 Jarque-Bera (JB): 288.272
Skew: 0.819 Prob(JB): 2.53e-63
Kurtosis: 4.651 Cond. No. 4.30e+03
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.3e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
4.4.4 Model Diagnostics and Feature Importance
Create mean-centered and standard-deviation-scaled data for feature importance analysis.
Code
wine_train_log_num = wine_train_log.select_dtypes(include="number")

scaler_log = StandardScaler().fit(wine_train_log_num)
wine_train_log_scaled = scaler_log.transform(wine_train_log_num)
wine_train_log_scaled = pd.DataFrame(
    wine_train_log_scaled, columns=wine_train_log_num.columns
)
Feature importance in the model with 6 selected variables:
Code
model_6_scaled = smf.ols(formula_selected_6, wine_train_log_scaled)
regression_model_6_scaled = model_6_scaled.fit()
diagnostics_6_scaled = Linear_Reg_Diagnostic(regression_model_6_scaled)

(
    pd.concat(
        [
            diagnostics_6_scaled.vif_table().set_index("Features"),
            regression_model_6_scaled.params.rename(
                "standardized_coefficients"
            ),
        ],
        axis=1,
    ).sort_values("standardized_coefficients", key=abs, ascending=False)
)
VIF Factor standardized_coefficients
density 2.910 -1.105
log_fixed_acidity 4.050 0.919
ph 2.220 0.559
log_residual_sugar 1.320 0.435
log_sulphates 1.220 0.165
quality 1.370 0.136
Intercept 1.000 -0.000
Quality is the least important feature in this set, while density is the most important. The VIF value for log_fixed_acidity is 4.05; some authors claim that VIF > 4 indicates multicollinearity, while other sources suggest the threshold VIF > 5.
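For reference, per-feature VIF values like those in the table above can be computed with statsmodels directly. This is a minimal sketch; the project's Linear_Reg_Diagnostic.vif_table() presumably does something similar internally.
Code
# Sketch: VIF per predictor from a fitted OLS model's design matrix
# (an assumption about what vif_table() computes internally).
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table_sketch(fitted_ols):
    exog = fitted_ols.model.exog            # includes the intercept column
    names = fitted_ols.model.exog_names
    return pd.DataFrame({
        "Features": names,
        "VIF Factor": [
            round(variance_inflation_factor(exog, i), 2)
            for i in range(exog.shape[1])
        ],
    })

# vif_table_sketch(regression_model_6_scaled)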
Feature importance in the model with 5 selected variables (without quality score):
Code
model_5_scaled = smf.ols(formula_selected_5, wine_train_log_scaled)
regression_model_5_scaled = model_5_scaled.fit()
diagnostics_5_scaled = Linear_Reg_Diagnostic(regression_model_5_scaled)

(
    pd.concat(
        [
            diagnostics_5_scaled.vif_table().set_index("Features"),
            regression_model_5_scaled.params.rename(
                "standardized_coefficients"
            ),
        ],
        axis=1,
    ).sort_values("standardized_coefficients", key=abs, ascending=False)
)
VIF Factor standardized_coefficients
density 2.420 -1.187
log_fixed_acidity 3.630 0.995
ph 2.180 0.584
log_residual_sugar 1.270 0.461
log_sulphates 1.050 0.213
Intercept 1.000 -0.000
Again, density is the most important feature, and there is no multicollinearity here, as all VIF < 4.
Important.
In the models with 5 and 6 features, R² is 0.679 and 0.692, respectively. The difference is small, and the 6th variable (quality) is the least important. From a practical point of view, quality score is the most subjective and hardest-to-obtain feature, as a trained expert is required and the result may differ from expert to expert. So the model with 5 variables is selected as the final model.
Now let’s do some model diagnostics by analyzing model residuals.
Code
diagnostics_5 = Linear_Reg_Diagnostic(regression_model_5)
Distribution of residuals:
Code
plt.clf()
ax = regression_model_5.resid.plot.density()
ax.set_xlabel("Residuals")
plt.axvline(x=0, color="darkred")
<matplotlib.lines.Line2D object at 0x000001F41DBA41D0>
Code
plt.show()
It seems that the residuals are slightly skewed and their peak is shifted towards negative values, but the shift is insignificant (see the results below).
Do some formal assumption checking:
- Is the mean of residuals equal to 0?
Code
_, p_ttest = ttest_1samp(regression_model_5.resid, 0)
print(f"t test that mean = 0, p: {format_p0(p_ttest)}")
t test that mean = 0, p: >0.999
- Are residuals normally distributed?
Code
pg.normality(regression_model_5.resid)
W pval normal
0 0.964 0.000 False
- Are residuals homoscedastic?
Code
_, p_bp, _, _ = sms.het_breuschpagan(
    regression_model_5.resid, regression_model_5.model.exog
)
print(f"Breusch-Pagan test for heteroscedasticity, p: {format_p0(p_bp)}")
Breusch-Pagan test for heteroscedasticity, p: <0.001
- Are there any outliers?
Code
(regression_model_5.get_influence().cooks_distance[0] > 1).any()
False
Let’s look at the plots of residuals.
Code
diagnostics_5()
The analysis of residuals revealed no outliers, and the assumption that the mean of residuals equals 0 is met. But it seems that the normality and homoscedasticity assumptions are at least slightly violated.
(Slightly) violated normality and homoscedasticity assumptions may affect formal inference about linear models, e.g., they may lead to inaccurate standard errors and confidence intervals. To cope with heteroscedasticity, more robust heteroskedasticity-consistent (HC) estimators may be used (a minimal sketch is shown below). On the other hand, these violations should not have much influence on prediction accuracy.
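As a sketch only (not part of the project pipeline): statsmodels can refit the final model with heteroskedasticity-consistent (HC3) standard errors via the cov_type argument.
Code
# Refit the 5-predictor model with HC3 (heteroskedasticity-consistent)
# standard errors; coefficients stay identical, only inference changes.
import statsmodels.formula.api as smf

regression_model_5_hc = smf.ols(formula_selected_5, wine_train_log).fit(
    cov_type="HC3"
)
print(regression_model_5_hc.summary())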
4.5 Test Final Models
4.5.1 Preprocess Test Set
Pre-processing of the test data is presented here to clarify which pre-processing steps are the most important.
Code
wine_test = wine_test.clean_names()

skewed_vars = [
    "chlorides",
    "residual_sugar",
    "sulphates",
    "total_sulfur_dioxide",
    "free_sulfur_dioxide",
    "fixed_acidity",
]
new_names_map = {i: "log_" + str(i) for i in skewed_vars}

wine_test_log = wine_test.transform_columns(skewed_vars, np.log10).rename(
    new_names_map, axis=1
)

wine_test_log["quality_gr_num"] = wine_test_log["quality"].apply(
    lambda x: to_3_quality_groups(x, numeric=True)
)
4.5.2 Alcohol Content Prediction
Linear Regression Model to predict alcohol content.
In the test set, RMSE is slightly higher, which is expected. The R² score is the same (R² = 0.68). This result shows that the model generalizes quite well. We can interpret that, on average, the predicted value will differ by 0.617 % vol. from the true value of alcohol content.
Code
y_train_true = wine_train_log.alcohol
y_train_pred = regression_model_5.predict(wine_train_log)

y_test_true = wine_test_log.alcohol
y_test_pred = regression_model_5.predict(wine_test_log)

pd.concat(
    [
        get_regression_performance(y_train_true, y_train_pred, "training"),
        get_regression_performance(y_test_true, y_test_pred, "test"),
    ]
)
set n SD RMSE R² RMSE_SD_ratio SD_RMSE_ratio
0 training 1279 1.059 0.600 0.679 0.567 1.764
0 test 320 1.089 0.617 0.679 0.566 1.766
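The helper get_regression_performance() is project-specific; the core columns of the table above can be reproduced with scikit-learn metrics. A sketch under that assumption:
Code
# Sketch: compute the key columns of the performance table directly
# (assumes get_regression_performance() reports these quantities).
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error, r2_score

def regression_performance_sketch(y_true, y_pred, label):
    rmse = float(np.sqrt(mean_squared_error(y_true, y_pred)))
    sd = np.std(y_true, ddof=1)               # SD of the true values
    return pd.DataFrame({
        "set": [label],
        "n": [len(y_true)],
        "SD": [round(sd, 3)],
        "RMSE": [round(rmse, 3)],
        "R²": [round(r2_score(y_true, y_pred), 3)],
        "RMSE_SD_ratio": [round(rmse / sd, 3)],
    })

# regression_performance_sketch(y_test_true, y_test_pred, "test")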
4.5.3 Wine Quality Class Prediction
The performance of the Multinomial Logistic Regression model that predicts wine quality is kappa = 0.334 in the training set and kappa = 0.333 in the test set. The difference is negligible, but the performance is poor. Predictive value (precision) in the test set is 0.71 for high quality wines and 0.86 for the medium quality group, but none of the low quality wines was predicted correctly. This shows that the selected model is not suitable for identifying this class. The reason might be class imbalance; techniques like regularization or another type of model (e.g., a tree-based model) might help to achieve better performance. More advanced feature engineering might also be helpful (e.g., merging the “low” and “medium” classes). A minimal class-weighting sketch is shown after the test-set results below.
Training set:
Code
= wine_train_log["quality_gr_num"]
true_train = predict_class_mnlogit(
pred_train
classification_model_selected, wine_train_log
) print_classification_report(true_train, pred_train)
kappa = 0.334
precision recall f1-score support
0 0.78 0.14 0.24 50
1 0.61 0.33 0.43 174
2 0.87 0.96 0.91 1055
accuracy 0.85 1279
macro avg 0.75 0.48 0.53 1279
weighted avg 0.83 0.85 0.82 1279
Confusion matrix (rows - true, columns - predicted):
[[ 7 2 41]
[ 0 57 117]
[ 2 35 1018]]
Test set:
Code
= wine_test_log["quality_gr_num"]
true_test = predict_class_mnlogit(classification_model_selected, wine_test_log)
pred_test
with warnings.catch_warnings():
"ignore")
warnings.simplefilter( print_classification_report(true_test, pred_test)
kappa = 0.333
precision recall f1-score support
0 0.00 0.00 0.00 13
1 0.71 0.35 0.47 43
2 0.86 0.98 0.92 264
accuracy 0.85 320
macro avg 0.53 0.44 0.46 320
weighted avg 0.81 0.85 0.82 320
Confusion matrix (rows - true, columns - predicted):
[[ 0 0 13]
[ 0 15 28]
[ 0 6 258]]
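As mentioned above, class weighting is one way to cope with the imbalance. A minimal sketch with scikit-learn’s logistic regression; the feature list is an assumption based on the conclusions below, and the project itself uses a statsmodels multinomial model.
Code
# Sketch: balanced class weights up-weight the rare "low" quality class.
# Feature names are assumptions (log-transformed where right-skewed).
from sklearn.linear_model import LogisticRegression

features = [
    "log_chlorides", "log_sulphates", "log_residual_sugar",
    "log_free_sulfur_dioxide", "volatile_acidity", "ph", "alcohol",
]

clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(wine_train_log[features], wine_train_log["quality_gr_num"])
pred_test_balanced = clf.predict(wine_test_log[features])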
5 Conclusions
Alcohol content prediction:
- Linear regression model to predict alcohol content was created.
- The performance of the model is R² = 0.68, RMSE = 0.617.
- Predictor variables in descending order by importance: density, fixed acidity, pH, residual sugar, and sulphates.
- Some linear model assumptions were violated so further statistical inference (e.g., calculation of confidence intervals) was not performed.
- After logarithmic transformation of the right-skewed variables, the model showed better performance.
Wine quality class prediction:
- Multinomial logistic regression model to identify low, medium and high quality wines was created.
- Predictor variables are chlorides, sulphates, residual sugar, free sulfur dioxide, volatile acidity, pH, and alcohol.
- Overall performance of the model is not good: kappa (test set) = 0.333.
- Predictive value (precision) is 0.71 for high quality wines and 0.86 for the medium quality group. Unfortunately, none of the low quality wines was predicted correctly. This suggests that there is room for model improvement (e.g., coping with class imbalance or using another type of wine quality grouping), which can be the scope of future studies.
6 Looker Studio Dashboard
A part of this project was to create a Looker Studio dashboard. A dashboard to explore some properties of red wine was created. You can find a screenshot of the dashboard and a link to it below.
Link to the dashboard.
7 Limitations and Suggestions on Improvement
- Class imbalance in the classification task should be addressed better (e.g., by using regularization, non-linear models, etc.).
- More feature engineering can be introduced into analysis (e.g., merging “low” and “medium” wine quality groups).
- In PCA, table or plot with feature contribution score could be added.
- For .qmd files, there is no Python code linter and formatter yet (or I’m not aware of one), so there might be style issues left. On the other hand, Quarto enforces leaving no critical code issues, as otherwise the HTML output is not created.
- Results in different sections should be presented in a more consistent manner.
- More detailed written explanations could be added to some sections (e.g., to the section on PCA).
- For production pipelines, the code would be easier to maintain if it were written in a single language (either R or Python); bilingual code requires an expert who knows both languages.