The Insights on European Football

Data Analysis Project

Author

Vilmantas Gėgžna

Published

2023-03-23

Updated

2023-07-30

The Insights on European Football project logo. Generated with Leonardo.Ai.

Data analysis tools: Python, SQL, Looker Studio
Helper tools: VS Code, Quarto, Git
Skills:

data pre-processing
exploratory data analysis (EDA):
- descriptive statistics
- data visualization
inferential statistics:
- hypothesis testing
- confidence intervals
predictive modeling:
- classification task
- regression task
object-oriented programming (OOP)
statistical programming
literate programming
dashboarding

Abbreviations

Acc – accuracy.
BAcc – balanced accuracy.
BAcc_01 – balanced accuracy where 0 is the worst and 1 is the best result.
CI – 95% confidence interval.
CLD – compact letter display.
CV – cross-validation.
EDA – exploratory data analysis.
FIFA – International Federation of Association Football.
k – number of variables/features.
ML – machine learning.
n – either sample or group size.
NA, NAs – missing value(s).
p – p-value.
p_adj – p-value (adjusted).
PC, PCs – principal component(s).
PCA – principal component analysis.
r – Pearson’s correlation coefficient.
R² – coefficient of determination, r squared.
RMSE – root mean squared error.
RNG – (pseudo)random number generator.
SD – standard deviation.
SE – standard error.
SFS – sequential feature selection.
UK – United Kingdom.

1 Introduction

European football (also known as soccer) is one of the most popular sports in Europe. It is also a large market, with revenues of €27.6 billion in the 2020/21 season (source). The money comes from, e.g., ticket sales, broadcasting rights, betting, and advertising.

In this project, European football data from the 2008/2009 to 2015/2016 seasons were analyzed to gain a better, data-based understanding of the game. The subsections of the “Analysis” section analyze nine main questions and provide insights. Each main subsection begins with the most important findings; the remaining parts of that subsection provide the details (plots, tables, etc.) behind those findings.

Tip

Note that some code, analyses, results, and other details are hidden in collapsible sections (collapsed by default). These are:

either parts with many results that would clutter the report,
or less important or supplementary parts, e.g., those that support claims made in the text.

1.1 Setup

Code: The main Python setup

# Automatically reload certain modules
%reload_ext autoreload
%autoreload 1

# Plotting
%matplotlib inline

# Packages and modules -------------------------------
import os
import re
import warnings

# Working with SQL database
import sqlite3

# EDA
import ydata_profiling as eda
from skimpy import skim
import missingno as msno

# Data wrangling, maths
import numpy as np
import pandas as pd
import janitor  # imports additional Pandas methods

# Statistical analysis
import scipy.stats as sps

# Machine learning
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.colors import LinearSegmentedColormap

# Maps
import geopandas as gpd
from shapely.geometry import Polygon

# Enable ability to run R in Python
os.environ["R_HOME"] = "C:/PROGRA~1/R/R-4.2.3"

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    import rpy2

    %load_ext rpy2.ipython

# Custom functions
import functions.fun_utils as my
import functions.pandas_methods
import functions.fun_analysis as an
import functions.fun_ml as ml

%aimport functions.fun_utils
%aimport functions.pandas_methods
%aimport functions.fun_analysis
%aimport functions.fun_ml

# Settings --------------------------------------------
# Default plot options
plt.rc("figure", titleweight="bold")
plt.rc("axes", labelweight="bold", titleweight="bold")
plt.rc("font", weight="normal", size=10)
plt.rc("figure", figsize=(7, 3))

# Pandas options
pd.set_option("display.max_rows", 1000)
pd.set_option("display.max_columns", 300)
pd.set_option("display.max_colwidth", 50)  # Possible option: None
pd.set_option("display.float_format", lambda x: f"{x:.2f}")
pd.set_option("styler.format.thousands", ",")

# colors
green, blue, orange, red = "tab:green", "tab:blue", "tab:orange", "tab:red"

# Analysis parameters
do_eda = True

Contents of file "functions/fun_utils.py" (various functions imported via the main setup as `my`)

"""Various functions for data pre-processing, analysis and plotting."""

# OS module
import os

# Enable ability to run R code in Python
os.environ["R_HOME"] = "C:/PROGRA~1/R/R-4.2.3"
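import rpy2.robjects as r_obj
import rpy2.robjects.packages  # makes `r_obj.packages` available
import rpy2.robjects.pandas2ri  # makes `r_obj.pandas2ri` available
from rpy2.robjects.conversion import localconverter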

# Other Python libraries and modules
import pathlib
import pickle
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.stats.api as sms
import scipy.stats as sps
from scipy.stats import median_abs_deviation
from typing import Union
from IPython.display import display, HTML
from matplotlib.ticker import MaxNLocator

# Utilities ==================================================================

# For Pandas objects
def index_has_names(obj):
    """Check if index of an object has names.

    Args:
        obj: Object that has `.index` attribute

    Returns:
        bool: True if index has names, False otherwise.
    """
    return None not in list(obj.index.names)


# Display in Jupyter notebook
def display_collapsible(x, summary: str = "", sep=" ", is_open: bool = False):
    """Display data frame or other object surrounded by `<details>` tags

    (I.e., display in collapsible way)

    Args:
        x (pd.DataFrame, str, list): Object to display
        summary (str, optional): Collapsed section name. Defaults to "".
        sep (str, optional): Symbol used to join strings (when x is a list).
             Defaults to " ".
        is_open (bool, optional): Should the section be open by default
            Defaults to False.
    """
    if is_open:
        is_open = " open"
    else:
        is_open = ""

    if hasattr(x, "to_html") and callable(x.to_html):
        html_str = x.to_html()
    elif isinstance(x, str):
        html_str = x
    else:
        html_str = sep.join([str(i) for i in x])

    display(
        HTML(
            f"<details{is_open}><summary>{summary}</summary>"
            + html_str
            + "</details>"
        )
    )


def cached_results(file, fun, **kwargs):
    """If file exists, load the results from it; otherwise calculate
       them, save them to the file, and return the calculated result.

    Args:
        file (str): File name.
        fun (function): function.
        **kwargs: arguments passed to `fun`.

    Returns:
        The result of `fun()`
    """
    if pathlib.Path(file).is_file():
        with open(file, "rb") as f:
            results = pickle.load(f)

    else:
        results = fun(**kwargs)
        with open(file, "wb") as f:
            pickle.dump(results, f)

    return results


# Helper functions to work with R in Python -----------------------------------
def r_to_python(obj: str):
    """Import object from R environment to Python

    Import object from R environment created in ipynb cells via `rpy2` package.

    Args:
        obj (str): Object name in R global environment.

    Returns:
        Analogous Python object (NOTE: tested with data frames only).
    """
    return r_obj.pandas2ri.rpy2py(r_obj.globalenv[obj])


# Format values ------------------------------------------------------------
def format_p(p):
    """Format p values at 3 decimal places.

    Args:
        p (float): p value (number between 0 and 1).
    """
    if p < 0.001:
        return "p < 0.001"
    elif p > 0.999:
        return "p > 0.999"
    else:
        return f"p = {p:.3f}"


def format_percent(x: pd.Series):
    """Round percentages to 1 decimal place and format as strings

    Values between 0 and 0.05 are printed as <0.1%
    Values between 99.95 and 100 are printed as >99.9%

    Args:
        x (pd.Series): A series of percentage values ranging from 0 to 100.

    Returns:
        pd.Series[str]: Pandas series of formatted values.
        Values equal to 0 are formatted as "0%", values between
        0 and 0.05 are formatted as "<0.1%", values between 99.95 and 100
        are formatted as ">99.9%", and values equal to 100 are formatted
        as "100%".

    Author: Vilmantas Gėgžna
    """
    return pd.Series(
        [
            "0%"
            if i == 0
            else "<0.1%"
            if i < 0.05
            else ">99.9%"
            if 99.95 <= i < 100
            else f"{i:.1f}%"
            for i in x
        ],
        index=x.index,
    )


# Analysis =================================================================

# Exploratory analysis
def count_unique(data: pd.DataFrame):
    """Get number and percentage of unique values

    Args:
        data (pd.DataFrame): Data frame to analyze.

    Return: data frame with columns `n_unique` (int) and `percent_unique` (str)
    """
    n_unique = data.nunique()
    return pd.concat(
        [
            n_unique.rename("n_unique"),
            format_percent((n_unique / data.shape[0]).multiply(100)).rename(
                "percent_unique"
            ),
        ],
        axis=1,
    )


# Descriptive statistics ----------------------------------------------------
def calc_summaries(x, ndigits=None):
    """Calculate some common summary statistics.

    Args:
        x (pandas.Series): Numeric variable to summarize.
        ndigits (int, None, optional): Number of decimal digits to round to.
                Defaults to None.
    Return:
       pandas.DataFrame with summary statistics.
    """

    def mad(x):
        return median_abs_deviation(x)

    def range(x):
        return x.max() - x.min()

    res = x.agg(
        ["count", "min", "max", range, "mean", "median", "std", mad, "skew"]
    )

    if ndigits is not None:
        summary = pd.DataFrame(round(res, ndigits=ndigits)).T
    else:
        summary = pd.DataFrame(res).T
    # Present count data as integer:
    summary = summary.assign(count=lambda d: d["count"].astype(int))

    return summary


# Plot counts ---------------------------------------------------------------
def plot_counts_with_labels(
    counts,
    title="",
    x=None,
    y="n",
    x_lab=None,
    y_lab="Count",
    label="percent",
    label_rotation=0,
    title_fontsize=13,
    legend=False,
    ec="black",
    y_lim_max=None,
    ax=None,
    **kwargs,
):
    """Plot count data as bar plots with labels.

    Args:
        counts (pandas.DataFrame): Data frame with counts data.
        title (str, optional): Figure title. Defaults to "".
        x (str, optional): Column name from `counts` to plot on x axis.
                Defaults to None: first column.
        y (str, optional): Column name from `counts` to plot on y axis.
                Defaults to "n".
        x_lab (str, optional): X axis label.
              Defaults to value of `x` with capitalized first letter.
        y_lab (str, optional): Y axis label. Defaults to "Count".
        label (str, None, optional): Column name from `counts` for value labels.
                Defaults to "percent".
                If None, label is not added.
        label_rotation (int, optional): Angle of label rotation. Defaults to 0.
        title_fontsize (int, optional): Font size of the title. Defaults to 13.
        legend (bool, optional): Should legend be shown? Defaults to False.
        ec (str, optional): Edge color. Defaults to "black".
        y_lim_max (float, optional): Upper limit for Y axis.
                Defaults to None: do not change.
        ax (matplotlib.axes.Axes, optional): Axes object. Defaults to None.
        **kwargs: further arguments to pandas.DataFrame.plot.bar()

    Returns:
        matplotlib.axes.Axes: Axes object of the generated plot.

    Author: Vilmantas Gėgžna
    """
    if x is None:
        x = counts.columns[0]

    if x_lab is None:
        x_lab = x.capitalize()

    if y_lim_max is None:
        y_lim_max = counts[y].max() * 1.15

    ax = counts.plot.bar(x=x, y=y, legend=legend, ax=ax, ec=ec, **kwargs)
    ax.set_title(title, fontsize=title_fontsize)
    ax.set_xlabel(x_lab)
    ax.set_ylabel(y_lab)
    if label is not None:
        ax_add_value_labels_ab(
            ax, labels=counts[label], rotation=label_rotation
        )
    ax.set_ylim(0, y_lim_max)

    return ax


def ax_xaxis_integer_ticks(min_n_ticks: int, rot: int = 0):
    """Ensure that x axis ticks have integer values

    Args:
        min_n_ticks (int): Minimal number of ticks to use.
        rot (int, optional): Rotation angle of x axis tick labels.
        Defaults to 0.
    """
    ax = plt.gca()
    ax.xaxis.set_major_locator(
        MaxNLocator(min_n_ticks=min_n_ticks, integer=True)
    )
    plt.xticks(rotation=rot)


def ax_axis_comma_format(axis: str = "xy", ax=None):
    """Write values of axis ticks with comma as thousands separator

    Args:
        axis (str, optional): which axis should be formatted:
           "x" X axis, "y" Y axis or "xy" (default) both axes.
        ax (matplotlib.axes.Axes, None, optional): Axes of the plot.
            Defaults to None: current axes.
    """

    if ax is None:
        ax = plt.gca()

    fmt = "{x:,.0f}"
    formatter = plt.matplotlib.ticker.StrMethodFormatter(fmt)
    if "x" in axis:
        ax.xaxis.set_major_formatter(formatter)

    if "y" in axis:
        ax.yaxis.set_major_formatter(formatter)


def ax_add_value_labels_ab(
    ax, labels=None, spacing=2, size=9, weight="bold", **kwargs
):
    """Add value labels above/below each bar in a bar chart.

    Arguments:
        ax (matplotlib.Axes): Plot (axes) to annotate.
        labels (sequence of str, None): Values to be used as labels.
            Defaults to None: bar heights formatted with one decimal place.
        spacing (int): Number of points between bar and label.
        size (int): font size.
        weight (str): font weight.
        **kwargs: further arguments to axis.annotate.

    Source:
        This function is based on https://stackoverflow.com/a/48372659/4783029
    """

    # If no labels are provided: use bar heights with one decimal place
    if labels is None:
        labels = [f"{rect.get_height():.1f}" for rect in ax.patches]

    # For each bar: Place a label
    for rect, label in zip(ax.patches, labels):
        # Get X and Y placement of label from rect.
        y_value = rect.get_height()
        x_value = rect.get_x() + rect.get_width() / 2

        space = spacing

        # Vertical alignment for positive values
        va = "bottom"

        # If the value of a bar is negative: Place label below the bar
        if y_value < 0:
            # Invert space to place label below
            space *= -1
            # Vertical alignment
            va = "top"

        # Create annotation
        ax.annotate(
            label,
            (x_value, y_value),
            xytext=(0, space),
            textcoords="offset points",
            ha="center",
            va=va,
            fontsize=size,
            fontweight=weight,
            **kwargs,
        )


# Inferential statistics -----------------------------------------------------
def ci_proportion_multinomial(
    counts,
    method: str = "goodman",
    n_label: str = "n",
    percent_label: str = "percent",
) -> pd.DataFrame:
    """Calculate simultaneous confidence intervals for multinomial proportions.

    More information in documentation of statsmodels'
    multinomial_proportions_confint.

    Args:
        counts (pd.Series, list, tuple): Count data.
        method (str, optional): Method. Defaults to "goodman".
        n_label (str, optional): Name for column for counts.
        percent_label (str, optional): Name for column for percentage values.

    Returns:
        pd.DataFrame: Counts, percentages, and lower and upper bounds of
        the confidence intervals (in percent) for each group.

    Examples:
    >>> ci_proportion_multinomial([62, 33, 55])
    """
    assert type(counts) in [pd.Series, list, tuple]
    if type(counts) is not pd.Series:
        counts = pd.Series(counts)

    return pd.concat(
        [
            (counts).rename(n_label),
            (counts / sum(counts)).rename(percent_label) * 100,
            pd.DataFrame(
                sms.multinomial_proportions_confint(counts, method=method),
                index=counts.index,
                columns=["ci_lower", "ci_upper"],
            )
            * 100,
        ],
        axis=1,
    )


def test_chi_square_gof(
    f_obs: list[int], f_exp: Union[str, list[float]] = "all equal"
) -> str:
    """Chi squared (χ²) goodness-of-fit (gof) test

    Args:
        f_obs (list[int]): Observed frequencies
        f_exp (str, list[float]): List of expected frequencies or "all equal" if
              all frequencies are equal to the mean of observed frequencies.
              Defaults to "all equal".

    Returns:
        str: formatted test results including p value.
    """
    k = len(f_obs)
    n = sum(f_obs)
    exp = n / k
    dof = k - 1
    # Compare to the string only when a string is given
    # (avoids element-wise comparison for array-like inputs)
    if isinstance(f_exp, str) and f_exp == "all equal":
        f_exp = [exp for _ in range(k)]
    stat, p = sps.chisquare(f_obs=f_obs, f_exp=f_exp)
    # Format the results:
    return (
        f"Chi square test, χ²({dof}, n = {n}) = {round(stat, 2)}, {format_p(p)}"
    )


def pairwise_chisq_gof_test(x: pd.Series):
    """Post-hoc Pairwise chi-squared Test

    Interface to R function `rstatix::pairwise_chisq_gof_test()`.

    Args:
        x (pandas.Series): data with group counts

    Returns:
        pandas.DataFrame: DataFrame with results of pairwise comparisons
        (including adjusted p values).
    """
    # Loading R package
    rstatix = r_obj.packages.importr("rstatix")
    dplyr = r_obj.packages.importr("dplyr")

    # Converting Pandas obj to R obj
    with localconverter(r_obj.default_converter + r_obj.pandas2ri.converter):
        x_in_r = r_obj.conversion.py2rpy(x)

    # Invoking the R function and getting the result
    df_result_r = rstatix.pairwise_chisq_gof_test(x_in_r)
    df_result_r = dplyr.relocate(df_result_r, "group1", "group2")

    # Converting the result to a Pandas dataframe
    return r_obj.pandas2ri.rpy2py(df_result_r)


def convert_pairwise_p_to_cld(
    data,
    group1: str = "group1",
    group2: str = "group2",
    p_name: str = "p.adj",
    output_gr_var: str = "group",
):
    """Convert p values from pairwise comparisons to CLD

    CLD - compact letter display: shared letter shows that difference
    is not significant. Interface to R function `convert_pairwise_p_to_cld()`.

    Args:
        data (pandas.DataFrame): Data frame with at least 3 columns:
              the first 2 columns contain names of both groups, one more
              column should contain p values.
        group1 (str, optional): Name of the first column with group names.
               Defaults to "group1".
        group2 (str, optional): Name of the second column with group names.
               Defaults to "group2".
        p_name (str, optional): Name of column with p values.
               Defaults to "p.adj".
        output_gr_var (str, optional): Name of column in output dataset
               with group names. Defaults to "group".

    Returns:
        pandas.DataFrame: DataFrame with CLD results.
    """
    # Loading R function from file
    r_obj.r["source"]("functions/functions.R")
    convert_pairwise_p_to_cld = r_obj.globalenv["convert_pairwise_p_to_cld"]

    # Converting Pandas data frame to R data frame
    with localconverter(r_obj.default_converter + r_obj.pandas2ri.converter):
        df_in_r = r_obj.conversion.py2rpy(data)

    # Invoking the R function and getting the result
    df_result_r = convert_pairwise_p_to_cld(
        df_in_r,
        group1=group1,
        group2=group2,
        p_name=p_name,
        output_gr_var=output_gr_var,
    )

    # Converting the result back to a Pandas dataframe
    return r_obj.pandas2ri.rpy2py(df_result_r)
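
The caching pattern behind `cached_results()` in the listing above can be illustrated with a minimal, standard-library-only sketch. The names `cached_call` and `slow_square` are hypothetical stand-ins introduced only for this illustration and are not part of the project code. The sketch shows that the function is evaluated once and that later calls read the pickled result from the file:

```python
import pathlib
import pickle
import tempfile


def cached_call(file, fun, **kwargs):
    """Load a pickled result from `file` if it exists;
    otherwise compute it with `fun(**kwargs)`, save it, and return it.

    (Hypothetical helper that mirrors the logic of `cached_results()`.)
    """
    path = pathlib.Path(file)
    if path.is_file():
        with open(path, "rb") as f:
            return pickle.load(f)

    result = fun(**kwargs)
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result


calls = []  # Records every real (non-cached) computation


def slow_square(x):
    """Hypothetical stand-in for an expensive computation."""
    calls.append(x)
    return x * x


with tempfile.TemporaryDirectory() as tmp:
    cache_file = pathlib.Path(tmp) / "square.pkl"
    first = cached_call(cache_file, slow_square, x=12)  # computed and saved
    second = cached_call(cache_file, slow_square, x=12)  # loaded from file
    # The cache is keyed by file name only, so new arguments are ignored:
    third = cached_call(cache_file, slow_square, x=5)  # still loaded from file
```

Because the file name is the only cache key, a cached file has to be deleted (or a different file name used) whenever the function or its arguments change.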

Contents of file "functions/fun_analysis.py" (classes imported via the main setup as `an`)

"""Classes to perform statistical analysis and output the results."""

import pandas as pd
import numpy as np
import pingouin as pg
import statsmodels.stats.api as sms
import scikit_posthocs as sp
import matplotlib.pyplot as plt

import functions.fun_utils as my  # Custom module
import functions.pandas_methods  # Custom module; imports method .to_df()


# Analyze count data ---------------------------------------------------------
class AnalyzeCounts:
    """The class to analyze count data.

    - Performs omnibus chi-squared and post-hoc pair-wise chi-squared test.
    - Compactly presents results of post-hoc test as compact letter display, CLD
      NOTE: for CLD calculations, R is required.
      (A shared CLD letter shows no significant difference between groups).
    - Calculates percentages and their confidence intervals by using Goodman's
      method.
    - Creates summary of grouped values (group counts and percentages).
    - Plots results as bar plots with percentage labels.
    """

    def __init__(self, counts, by=None, counts_of=None):
        """
        Object initialization function.

        Args:
            counts (pandas.Series[int]): Count data to analyze.
            by (str, optional): Grouping variable name. Used to create labels.
                      If None, defaults to "Group"
            counts_of (str, optional): The thing that was counted.
                    This name is used for labels in plots and tables.
                    Defaults to `counts.name`.
        """
        assert isinstance(counts, pd.Series)

        # Set defaults
        if by is None:
            by = "Group"

        if counts_of is None:
            counts_of = counts.name

        # Set attributes: user inputs or defaults
        self.counts = counts
        self.counts_of = counts_of
        self.by = by

        # Set attributes: created/calculated
        self.n_label = f"n_{counts_of}"  # Create label for counts

        # Set attributes: results to be calculated
        self.results_are_calculated = False
        self.omnibus = None
        self.n_ci_and_cld = None
        self.descriptive_stats = None

    def fit(self):
        """Perform count data analysis: calculate the results."""

        # Alias attributes
        counts = self.counts
        by = self.by
        n_label = self.n_label

        # Omnibus test: perform and save the results
        self.omnibus = my.test_chi_square_gof(counts)

        # Post-hoc (pairwise chi-square): perform
        posthoc_p = my.pairwise_chisq_gof_test(counts)
        posthoc_cld = my.convert_pairwise_p_to_cld(posthoc_p, output_gr_var=by)

        # Confidence interval: calculate
        ci = (
            my.ci_proportion_multinomial(
                counts, method="goodman", n_label=n_label
            )
            .rename_axis(by)
            .reset_index()
        )

        # Make sure datasets are mergeable
        ci[by] = ci[by].astype(str)
        posthoc_cld[by] = posthoc_cld[by].astype(str)

        # Merge results
        n_ci_and_cld = pd.merge(ci, posthoc_cld, on=by)

        # Format percentages and counts
        vars = ["percent", "ci_lower", "ci_upper"]
        n_ci_and_cld[vars] = n_ci_and_cld[vars].apply(my.format_percent)

        # Save results
        self.n_ci_and_cld = n_ci_and_cld

        # Descriptive statistics: calculate
        to_format = ["min", "max", "range", "mean", "median", "std", "mad"]

        def format_0f(x):
            return [f"{i:,.0f}" for i in x]

        summary_count = my.calc_summaries(ci[n_label])
        summary_count[to_format] = summary_count[to_format].apply(format_0f)

        summary_perc = my.calc_summaries(ci["percent"])
        summary_perc[to_format] = summary_perc[to_format].apply(
            my.format_percent
        )
        # Save results
        self.descriptive_stats = pd.concat([summary_count, summary_perc])

        # Initialization status
        self.results_are_calculated = True

        # Output
        return self

    def print(
        self,
        omnibus: bool = True,
        posthoc: bool = True,
        descriptives: bool = True,
    ):
        """Print numeric results.

        Args:
            omnibus (bool, optional): Flag to print omnibus test results.
                                      Defaults to True.
            posthoc (bool, optional): Flag to print post-hoc test results.
                                      Defaults to True.
            descriptives (bool, optional): Flag to print descriptive statistics.
                                      Defaults to True.

        Raises:
            Exception: if calculations with `.fit()` method were not performed.
        """
        if not self.results_are_calculated:
            raise Exception("No results. Run `.fit()` first.")

        # Omnibus test
        if omnibus:
            print("Omnibus (chi-squared) test results:")
            print(self.omnibus, "\n")

        # Post-hoc and CI
        if posthoc:
            print(
                f"Counts of {self.counts_of} with 95% CI "
                "and post-hoc (pairwise chi-squared) test results:"
            )
            print(self.n_ci_and_cld, "\n")

        # Descriptive statistics: display
        if descriptives:
            print(f"Descriptive statistics of group ({self.by}) counts:")
            print(self.descriptive_stats, "\n")

    def display(
        self,
        omnibus: bool = True,
        posthoc: bool = True,
        descriptives: bool = True,
    ):
        """Display numeric results in Jupyter Notebooks.

        Args:
            omnibus (bool, optional): Flag to print omnibus test results.
                                      Defaults to True.
            posthoc (bool, optional): Flag to print post-hoc test results.
                                      Defaults to True.
            descriptives (bool, optional): Flag to print descriptive statistics.
                                      Defaults to True.

        Raises:
            Exception: if calculations with `.fit()` method were
            not performed.
        """
        if not self.results_are_calculated:
            raise Exception("No results. Run `.fit()` first.")

        # Omnibus test
        if omnibus:
            my.display_collapsible(
                self.omnibus, "Omnibus (chi-squared) test results"
            )

        # Post-hoc and CI
        if posthoc:
            my.display_collapsible(
                self.n_ci_and_cld.style.format({self.n_label: "{:,.0f}"}),
                f"Counts of {self.counts_of} with 95% CI and post-hoc "
                "(pairwise chi-squared) test results",
            )

        # Descriptive statistics: display
        if descriptives:
            my.display_collapsible(
                self.descriptive_stats,
                f"Descriptive statistics of group ({self.by}) counts",
            )

    def plot(self, xlabel=None, ylabel=None, **kwargs):
        """Plot analysis results.

        Args:
            xlabel (str, None, optional): X axis label.
                    Defaults to None: autogenerated label.
            ylabel (str, None, optional): Y axis label.
                    Defaults to None: autogenerated label.
            **kwargs: further arguments passed to `my.plot_counts_with_labels()`

        Raises:
            Exception: if calculations with `.fit()` method were
            not performed.

        Returns:
            matplotlib.axes object
        """
        if not self.results_are_calculated:
            raise Exception("No results. Run `.fit()` first.")

        # Plot
        if xlabel is None:
            xlabel = self.by.capitalize()

        if ylabel is None:
            ylabel = f"Number of {self.counts_of}"

        ax = my.plot_counts_with_labels(
            self.n_ci_and_cld,
            x_lab=xlabel,
            y_lab=ylabel,
            y=self.n_label,
            **kwargs,
        )

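        # Format the axes that were actually plotted on
        # (they may differ from the current axes when `ax` is passed by the user)
        my.ax_axis_comma_format("y", ax=ax)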

        return ax


# Analyze numeric groups ------------------------------------------------------
class AnalyzeNumericGroups:
    """Class to analyze numeric/continuous data by groups.

    - Calculates mean ratings per group and their confidence intervals using
        t distribution.
    - Performs omnibus (Kruskal-Wallis) and post-hoc (Conover-Iman) tests.
    - Compactly presents results of post-hoc test as compact letter display, CLD
      NOTE: for CLD calculations, R is required.
      (A shared CLD letter shows no significant difference between groups).
    - Creates summary (descriptive statistics) of group means.
    - Plots results as points with 95% confidence interval error bars.
    """

    def __init__(self, data, y: str, by: str):
        """Initialize the class.

        Args:
            y (str): Name of numeric/continuous (dependent) variable.
            by (str): Name of grouping (independent) variable.
            data (pandas.DataFrame): data frame with variables indicated in
                `y` and `by`.
        """
        assert isinstance(data, pd.DataFrame)

        # Set attributes: user inputs
        self.data = data
        self.y = y
        self.by = by

        # Set attributes: results to be calculated
        self.results_are_calculated = False
        self.omnibus = None
        self.ci_and_cld = None
        self.descriptive_stats = None

    def fit(self):
        """Perform analysis of numeric data by groups: calculate the results."""

        # Aliases:
        data = self.data
        y = self.y
        by = self.by

        # Omnibus test: Kruskal-Wallis test
        omnibus = pg.kruskal(data=data, dv=y, between=by)
        # Use positional access: the result is not indexed by integers
        omnibus["p-unc"] = my.format_p(omnibus["p-unc"].iloc[0])

        self.omnibus = omnibus

        # Confidence intervals
        ci_raw = data.groupby(by)[y].apply(
            lambda x: [np.mean(x), *sms.DescrStatsW(x).tconfint_mean()]
        )
        ci = pd.DataFrame(
            list(ci_raw),
            index=ci_raw.index,
            columns=["mean", "ci_lower", "ci_upper"],
        ).reset_index()

        # Post-hoc test: Conover-Iman test
        posthoc_p_matrix = sp.posthoc_conover(
            data, val_col=y, group_col=by, p_adjust="holm"
        )
        posthoc_p_df = posthoc_p_matrix.stack().to_df(
            "p.adj", ["group1", "group2"]
        )
        posthoc_cld = my.convert_pairwise_p_to_cld(
            posthoc_p_df, output_gr_var=by
        )

        # Make sure datasets are mergeable
        ci[by] = ci[by].astype(str)
        posthoc_cld[by] = posthoc_cld[by].astype(str)

        self.ci_and_cld = pd.merge(posthoc_cld, ci, on=by)

        # Descriptive statistics of means
        self.descriptive_stats = my.calc_summaries(ci["mean"])

        # Results are present
        self.results_are_calculated = True

        # Output:
        return self

    def print(
        self,
        omnibus: bool = True,
        posthoc: bool = True,
        descriptives: bool = True,
    ):
        """Print numeric results.

        Args:
            omnibus (bool, optional): Flag to print omnibus test results.
                                      Defaults to True.
            posthoc (bool, optional): Flag to print post-hoc test results.
                                      Defaults to True.
            descriptives (bool, optional): Flag to print descriptive statistics.
                                      Defaults to True.

        Raises:
            Exception: if calculations with `.fit()` method were
            not performed.
        """
        if not self.results_are_calculated:
            raise Exception("No results. Run `.fit()` first.")

        # Omnibus test
        if omnibus:
            print("Omnibus (Kruskal-Wallis) test results:")
            print(self.omnibus, "\n")

        # Post-hoc and CI
        if posthoc:
            print(
                "Post-hoc (Conover-Iman) test results as CLD and "
                "Confidence intervals (CI):",
            )
            print(self.ci_and_cld, "\n")

        # Descriptive statistics
        if descriptives:
            print(f"Descriptive statistics of group ({self.by}) means:")
            print(self.descriptive_stats, "\n")

    def display(
        self,
        omnibus: bool = True,
        posthoc: bool = True,
        descriptives: bool = True,
    ):
        """Display numeric results in Jupyter Notebooks.

        Args:
            omnibus (bool, optional): Flag to print omnibus test results.
                                      Defaults to True.
            posthoc (bool, optional): Flag to print post-hoc test results.
                                      Defaults to True.
            descriptives (bool, optional): Flag to print descriptive statistics.
                                      Defaults to True.

        Raises:
            Exception: if calculations with `.fit()` method were
            not performed.
        """
        if not self.results_are_calculated:
            raise Exception("No results. Run `.fit()` first.")

        # Omnibus test
        if omnibus:
            my.display_collapsible(
                self.omnibus, "Omnibus (Kruskal-Wallis) test results"
            )

        # Post-hoc and CI
        if posthoc:
            my.display_collapsible(
                self.ci_and_cld,
                "Post-hoc (Conover-Iman) test results as CLD and "
                "Confidence intervals (CI)",
            )

        # Descriptive statistics of means
        if descriptives:
            my.display_collapsible(
                self.descriptive_stats,
                f"Descriptive statistics of group ({self.by}) means",
            )

    def plot(self, title=None, xlabel=None, ylabel=None, **kwargs):
        """Plot the results.

        Args:
            title (str, None, optional): The title of the plot.
                    Defaults to None.
            xlabel (str, None, optional): X axis label.
                    Defaults to None: capitalized value of `by`.
            ylabel (str, None, optional): Y axis label.
                    Defaults to None: capitalized value of `y`.
            **kwargs: Currently not used.

        Returns:
            Tuple with matplotlib figure and axis objects (fig, ax).

        Raises:
            Exception: if calculations with `.fit()` method were
            not performed.
        """
        if not self.results_are_calculated:
            raise Exception("No results. Run `.fit()` first.")

        # Aliases:
        ci = self.ci_and_cld
        by = self.by
        y = self.y

        # Create figure and axes
        fig, ax = plt.subplots()

        # Construct plot
        x = ci.iloc[:, 0]

        ax.errorbar(
            x=x,
            y=ci["mean"],
            yerr=[ci["mean"] - ci["ci_lower"], ci["ci_upper"] - ci["mean"]],
            mfc="red",
            ms=2,
            mew=1,
            fmt="ko",
            zorder=3,
        )

        if xlabel is None:
            xlabel = by.capitalize()

        if ylabel is None:
            ylabel = y.capitalize()

        ax.set_xlabel(xlabel)
        ax.set_ylabel(ylabel)
        ax.set_ylim([0, None])
        ax.set_title(title)

        # Output
        return (fig, ax)

Contents of file "fun_ml.py" (functions for machine learning imported via the main setup as `ml`)

from typing import Union

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker

# Machine learning
from sklearn.metrics import mean_squared_error as mse, r2_score
from sklearn.metrics import f1_score, accuracy_score, balanced_accuracy_score
from sklearn.metrics import cohen_kappa_score
from sklearn.metrics import classification_report, confusion_matrix
from mlxtend.feature_selection import SequentialFeatureSelector
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


# Helpers
def as_formula(
    target: str = None,
    include: Union[list[str], pd.DataFrame] = None,
    exclude: list[str] = None,
    add: str = "",
):
    """
    Generates the R style formula for statsmodels (patsy) given
    the dataframe, dependent variable and optional excluded columns
    as strings.

    Args:
        target (str): name of target variable.
        include (pandas.DataFrame or list[str]):
            dataframe of column names to include.
        exclude (list[str], optional):
            columns to exclude.
        add (str): string to add to formula, e.g., "+ 0"

    Return:
        String with R style formula for `patsy` (e.g., "target ~ x1 + x2").

    See also: https://stackoverflow.com/a/44866142/4783029
    """
    if isinstance(include, pd.DataFrame):
        include = list(include.columns.values)

    if target in include:
        include.remove(target)

    if exclude is not None:
        for col in exclude:
            include.remove(col)

    return target + " ~ " + " + ".join(include) + add


def get_columns_by_purpose(data, target: str):
    """Split data frame to 3 data frames: for target, numeric, and remaining
    variables.

    Examples:
    >>> # Split
    >>> d_target, d_num, d_other = get_columns_by_purpose(data, "class")

    >>> # Merge back
    >>> pd.concat([d_target, d_num, d_other], axis=1)
    """
    d_num = data.drop(columns=target).select_dtypes("number")
    d_other = data.drop(columns=[target, *d_num.columns.values])

    return data[target].to_frame(), d_num, d_other


# Functions for feature selection
def sfs(estimator, est_type, k_features="parsimonious", forward=True):
    """Create SFS instance for classification or regression

    Args.:
        estimator: scikit-learn compatible estimator (learner).
        est_type (str): "classification" or "regression"
        other arguments: see mlxtend.feature_selection.SequentialFeatureSelector()
    """

    if est_type == "regression":
        scoring = "neg_root_mean_squared_error"
    elif est_type == "classification":
        scoring = "balanced_accuracy"
    else:
        raise Exception(f"Unrecognized learner/estimator type: {est_type}")

    return SequentialFeatureSelector(
        estimator,
        k_features=k_features,  # "parsimonious",
        forward=forward,
        floating=False,
        scoring=scoring,
        verbose=1,
        cv=5,
        n_jobs=-1,
    )


def sfs_get_score(sfs_object, k_features):
    """Return performance score achieved with certain number of features.

    Args.:
        sfs_object: result of function do_sfs_lin_reg()
        k_features (int): number of features.
    """
    md = round(
        np.median(sfs_object.get_metric_dict()[k_features]["cv_scores"]), 3
    )
    return {
        "k_features": k_features,
        "mean_score": round(
            sfs_object.get_metric_dict()[k_features]["avg_score"], 3
        ),
        "median_score": md,
        "sd_score": round(
            sfs_object.get_metric_dict()[k_features]["std_dev"], 3
        ),
    }


def sfs_plot_results(sfs_object, sub_title="", ref_y=None):
    """Plot results from SFS object

    Args.:
      sfs_object: object with SFS results.
      sub_title (str): second line of title.
      ref_y (float): Y coordinate of reference line.
    """

    scoring = sfs_object.get_params()["scoring"]

    if scoring == "neg_root_mean_squared_error":
        metric = "RMSE"
        sign = -1
    elif scoring == "balanced_accuracy":
        metric = "BAcc"
        sign = 1
    else:
        raise Exception(f"Unsupported scoring metric: {scoring}")

    if sfs_object.forward:
        sfs_type = "Forward"
    else:
        sfs_type = "Backward"

    fig, ax = plt.subplots(1, 2, sharey=True)

    xlab = "Number of predictors included"

    if ref_y is not None:
        ax[0].axhline(y=ref_y, color="darkred", linestyle="--", lw=0.5)
        ax[1].axhline(y=ref_y, color="darkred", linestyle="--", lw=0.5)

    avg_score = [
        (int(i), sign * c["avg_score"]) for i, c in sfs_object.subsets_.items()
    ]

    averages = pd.DataFrame(avg_score, columns=["k_features", "avg_score"])

    (
        averages.plot.scatter(
            x="k_features",
            y="avg_score",
            xlabel=xlab,
            ylabel=metric,
            title=f"Average {metric}",
            ax=ax[0],
        )
    )

    cv_scores = {
        int(i): sign * c["cv_scores"] for i, c in sfs_object.subsets_.items()
    }
    (
        pd.DataFrame(cv_scores).plot.box(
            xlabel=xlab,
            title=f"{metric} in CV splits",
            ax=ax[1],
        )
    )

    ax[0].xaxis.set_major_locator(mticker.MaxNLocator(integer=True))
    ax[1].xaxis.set_major_locator(mticker.MaxNLocator(integer=True))

    if not sfs_object.forward:
        ax[1].invert_xaxis()

    main_title = (
        f"{sfs_type} Feature Selection with {sfs_object.cv}-fold CV "
        + f"\n{sub_title}"
    )

    fig.suptitle(main_title)
    plt.tight_layout()
    plt.show()

    # Print results
    if not sfs_object.interrupted_:
        if sfs_object.is_parsimonious:
            note = "[Parsimonious]"
            k_selected = f"k = {len(sfs_object.k_feature_names_)}"
            score_at_k = f"avg. {metric} = {sign * sfs_object.k_score_:.3f}"
            note_2 = "Smallest number of predictors at best ± 1 SE score"
        else:
            note = "[Best]"
            if sign < 0:
                best = averages.nsmallest(1, "avg_score")
            else:
                best = averages.nlargest(1, "avg_score")
            # Extract scalars explicitly (avoids array-to-scalar conversion)
            k_selected = f"k = {int(best.k_features.iloc[0])}"
            score_at_k = f"avg. {metric} = {float(best.avg_score.iloc[0]):.3f}"
            note_2 = "Number of predictors at best score"

        print(f"{k_selected}, {score_at_k} {note}\n({note_2})")


def sfs_list_features(sfs_result):
    """List features by order when they were added.
    Current implementation correctly works with forward selection only.

    Args:
        sfs_result (SFS object)
    """

    def rename_metric(x):
        return x.replace("score", metric)

    scoring = sfs_result.get_params()["scoring"]

    if scoring == "neg_root_mean_squared_error":
        metric = "RMSE"
        sign = -1
    elif scoring == "balanced_accuracy":
        metric = "BAcc"
        sign = 1
    else:
        raise Exception(f"Unsupported scoring metric: {scoring}")

    feature_dict = sfs_result.get_metric_dict()
    lst = [[*feature_dict[i]["feature_names"]] for i in feature_dict]
    feature = []
    for x, y in zip(lst[0::], lst[1::]):
        feature.append(*set(y).difference(x))

    return (
        pd.DataFrame(
            {
                "added_feature": [*lst[0], *feature],
                "score": [
                    sign * feature_dict[i]["avg_score"] for i in feature_dict
                ],
            }
        )
        .assign(score_improvement=lambda x: sign * x.score.diff())
        .assign(
            score_percentage_change=lambda x: sign * x.score.pct_change() * 100
        )
        .index_start_at(1)
        .rename_axis("step")
        .rename(columns=rename_metric)
    )


# Functions for regression/classification
def get_regression_performance(y_true, y_pred, name=""):
    """Evaluate regression model performance

    Calculate R², RMSE, and SD of predicted variable

    Args.:
      y_true, y_pred: true and predicted numeric values.
      name (str): the name of investigated set.
    """
    return (
        pd.DataFrame(
            {
                "set": name,
                "n": len(y_true),
                "SD": [float(np.std(y_true))],
                "RMSE": [float(np.sqrt(mse(y_true, y_pred)))],
                "R²": [r2_score(y_true, y_pred)],
            }
        )
        .eval("RMSE_SD_ratio = RMSE/SD")
        .eval("SD_RMSE_ratio = SD/RMSE")
    )


def get_classification_performance(true_class, predicted_class, name=""):
    """Evaluate classification model performance

    Calculate accuracy (Acc),
    Balanced accuracy (BAcc),
    Balanced accuracy adjusted to be between 0 and 1 (BAcc_01),
    F1 macro average (F1_macro),
    F1 weighted average (F1_weighted),
    Cohen's Kappa.

    Args.:
      true_class, predicted_class: true and predicted class labels.
      name (str): the name of investigated set.
    """
    acc = accuracy_score(true_class, predicted_class)
    bacc = balanced_accuracy_score(true_class, predicted_class)
    bacc01 = balanced_accuracy_score(true_class, predicted_class, adjusted=True)
    f1_macro = f1_score(true_class, predicted_class, average="macro")
    f1_weighted = f1_score(true_class, predicted_class, average="weighted")
    kappa = cohen_kappa_score(true_class, predicted_class)

    return pd.DataFrame(
        {
            "set": name,
            "n": len(true_class),
            "Accuracy": [acc],
            "BAcc": [bacc],
            "BAcc_01": [bacc01],
            "f1_macro": [f1_macro],
            "f1_weighted": [f1_weighted],
            "Kappa": [kappa],
        }
    )


def print_classification_report(true_class, predicted_class, name=""):
    """Print summary of classification performance

    Args.:
        true_class, predicted_class: data sequences of the same length:
                                     with class names/indicators.
        name (str): the name of investigated set of data.
    """
    print(
        get_classification_performance(true_class, predicted_class, name=name)
    )
    print("")
    print(classification_report(true_class, predicted_class, zero_division=0))
    print("")
    print("Confusion matrix (rows - true, columns - predicted):")
    print(confusion_matrix(true_class, predicted_class))


# For Random Forests
def get_rf_importances(obj):
    """Get random forest feature importance

    Args:
        obj (fitted instance of RandomForestRegressor()):
            Random Forest.

    Returns:
        pandas.DataFrame: dataframe with feature names and their importance.
    """
    return pd.DataFrame(
        {
            "features": obj.feature_names_in_,
            "importance": obj.feature_importances_,
        }
    ).sort_values("importance", ascending=False)


def plot_importances(data, n=20):
    """Plot 2 plots with feature importance: "overview" and zoomed plot.

    Args:
        data (pandas.DataFrame):
            dataframe with columns `features` and `importance`
    """
    fig, ax = plt.subplots(2, 1, height_ratios=(1, 3))

    data.plot.bar(
        x="features",
        y="importance",
        ylabel="",
        xlabel="",
        legend=False,
        ax=ax[0],
    )

    ax[0].xaxis.set_ticklabels([])

    data.head(n).plot.bar(
        x="features", y="importance", ylabel="", xlabel="Features", ax=ax[1]
    )

    fig.suptitle("Feature Importance: All and Several Top Variables")

    return fig, ax


# PCA
def pca_screeplot(data, n_components=30):
    """Plot PCA screeplot

    Args:
        data (pandas.Dataframe): Numeric data
        n_components (int, optional):
            Max number of principal components to extract.
            Defaults to 30.

    Returns:
        3 objects: plot (fig and ax) and pca object.
    """
    scale = StandardScaler()
    pca = PCA(n_components=n_components)

    scaled_data = scale.fit_transform(data)
    pca.fit(scaled_data)

    pct_explained = pca.explained_variance_ratio_ * 100

    fig, ax = plt.subplots(figsize=(8, 4))
    ax.plot(pct_explained, "-o", color="tab:green")

    ax.set_xlabel("Number of components")
    ax.set_ylabel("% of explained variance")

    return fig, ax, pca


def do_pca(data, target: str, n_components: int = 10, scale=None, pca=None):
    """Do PCA on numeric non-target variables

    Args:
        data (pandas.Dataframe): data
        target (str): Target variable name
        n_components (int, optional):
            Number of PCA components to extract.
            Defaults to 10.
            n_components is ignored if `pca` is not None.
        scale (instance of sklearn.preprocessing.StandardScaler or None):
            Fitted object to scale data.
        pca (instance of sklearn.decomposition.PCA or None):
            Fitted PCA object.

    Returns:
        tuple with 6 elements:
          - 4 data frames: d_target, d_num, d_other, d_pca
          - fitted instance of sklearn.preprocessing.StandardScaler.
          - fitted instance of sklearn.decomposition.PCA.
    """
    d_target, d_num, d_other = get_columns_by_purpose(data, target)

    if scale is None:
        scale = StandardScaler()
        sc_data = scale.fit_transform(d_num)
    else:
        sc_data = scale.transform(d_num)

    if pca is None:
        pca = PCA(n_components=n_components)
        pc_num = pca.fit_transform(sc_data)
    else:
        pc_num = pca.transform(sc_data)
        n_components = pc_num.shape[1]

    # Convert to DataFrame and name columns (pc_1, pc_2, etc.)
    d_pca = pd.DataFrame(
        pc_num,
        index=d_num.index,
        columns=[f"pc_{i}" for i in np.arange(1, n_components + 1)],
    )

    return (d_target, d_num, d_other, d_pca, scale, pca)

Contents of file "pandas_methods.py" (methods for Pandas objects imported via the main setup)

"""New methods for Pandas Series and DataFrames"""

# Setup -----------------------------------------------------------------
import warnings
import pandas as pd
import pandas_flavor as pf
from typing import Union
import janitor  # imports additional Pandas methods

import functions.fun_utils as my  # Custom module

# Series methods --------------------------------------------------------
@pf.register_series_method
def to_df(
    self: pd.Series,
    values_name: str = None,
    key_name: Union[str, list[str], tuple[str, ...]] = None,
) -> pd.DataFrame:
    """Convert Series to DataFrame with desired or default column names.

    Similar to `pandas.Series.to_frame()`, but the main purpose of this method
    is to be used with the result of `.value_counts()`. So appropriate default
    column names are pre-defined. And index is always reset.

    Args:
        self (pandas.Series):
            The object the method is applied to.
        values_name (str):
            Name for series values (applied before conversion to DataFrame).
            Defaults to `self.name` if both the index and the series
            have names, or to "count" otherwise.
        key_name (str or sequence of str):
            New name for the columns, that are created from Series index
            that was present before the conversion to DataFrame.
            Defaults to `self.index.names`, if index has names,
            to `self.name` if index has no names but series has name,
            or to "value" otherwise.

    Return:
        pandas.DataFrame

    Examples:
    >>> import pandas as pd
    >>> df = pd.Series({'right': 138409, 'left': 44733}).rename("foot")

    >>> df.to_df()

    >>> # Compared to .to_frame()
    >>> df.to_frame()
    """

    k_name = None
    v_name = None

    # Check if defaults can be set based on non-missing attribute values
    if my.index_has_names(self):
        k_name = self.index.names
        if self.name is not None:
            v_name = self.name
    else:
        k_name = self.name

    # Set user-defined values or defaults
    if key_name is not None:
        k_name = key_name
    elif k_name is None:
        k_name = "value"  # Default

    if values_name is not None:
        v_name = values_name
    elif v_name is None:
        v_name = "count"  # Default

    # Output
    return self.rename_axis(k_name).rename(v_name).reset_index()


@pf.register_series_method
def to_category(self, categories=None, ordered=False):
    """Convert variable to categorical one.

    NOTE: method with the same name but for DataFrame also exists.

    Args:
        self (pandas.Series):
            The object the method is applied to.
        categories (list of values, optional):
            Categories listed here will become the first categories.
            The remaining ones (not in this list) will follow.
            Defaults to None: use default order.
        ordered (bool, optional):
            Whether or not this categorical is treated as ordered categorical.
            Defaults to False.

    Return:
        pandas.Series
    """
    self = self.astype("category")
    all_cats = self.cat.categories.values
    if categories is not None:
        # new order
        all_cats = [
            *categories,
            *sorted(list(set(all_cats).difference(categories))),
        ]
    return self.cat.reorder_categories(all_cats, ordered=ordered)


# DataFrame methods --------------------------------------------------------
@pf.register_dataframe_method
def relocate(self, col, before=0):
    """Change position of a column.
    Do transformations in-place and return a data frame.

    Args:
        self (pd.DataFrame):
            The object the method is applied to.
        col (str):
            The name of column to relocate.
        before (int|str):
            The name or index of the column before which `col` will be inserted.

    Return:
        pandas.DataFrame

    Examples:
        >>> import pandas as pd
        >>> data = pd.DataFrame({"a": 1, "b":2, "c":3})
        >>> data.relocate("c")
        >>> data
        >>> data.relocate("b", before="a")
        >>> data
    """
    columns = self.columns
    assert col in columns

    if before is None:
        position = 0

    if isinstance(before, int) or isinstance(before, float):
        position = int(before)

    if isinstance(before, str):
        assert before in columns
        position = columns.get_loc(before)
        col_position = columns.get_loc(col)
        if col_position <= position:
            position -= 1

    col_to_relocate = self.pop(col)
    self.insert(loc=position, column=col, value=col_to_relocate)

    return self


@pf.register_dataframe_method
def index_start_at(self, start=1):
    """Create a new sequential index that starts at indicated number.

    Args.:
        self (pd.DataFrame):
            The object the method is applied to.
        start (int):
            The start of an index

    Return:
        pandas.DataFrame
    """
    i = self.index
    self.index = range(start, len(i) + start)
    return self


with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    # There is a deprecated method with the same name in `pyjanitor` package
    # So warning is suppressed
    @pf.register_dataframe_method
    def to_datetime(self, columns, **kwargs):
        """Convert indicated columns to datetime.

        Args.:
            self (pd.DataFrame):
                The object the method is applied to.
            columns (str or list[str]):
                Column names to convert to datetime.
            **kwargs:
                Named arguments to be passed to pandas.to_datetime()

        Return:
            pandas.DataFrame
        """
        if isinstance(columns, str):
            columns = [columns]

        self[columns] = self[columns].apply(pd.to_datetime, **kwargs)
        return self


@pf.register_dataframe_method
def to_category(self, columns, categories=None, ordered=False):
    """Convert indicated columns to categorical variables.

    NOTE: method with the same name but for Series also exists.

    Args.:
        self (pd.DataFrame):
            The object the method is applied to.
        columns (str or list[str]):
            Column names to convert to category.
        categories (list of values, optional):
            Categories listed here will become the first categories.
            The remaining ones (not in this list) will follow.
            Defaults to None: use default order.
        ordered (bool, optional):
            Whether or not this categorical is treated as ordered categorical.
            Defaults to False.

    Return:
        pandas.DataFrame
    """
    if isinstance(columns, str):
        columns = [columns]

    # res = self.loc[:, columns].copy(deep=True)

    # self.loc[:, columns] = res.apply(lambda x: x.to_category(
    #     categories=categories, ordered=ordered
    # ))

    self.transform_columns(
        columns,
        lambda x: x.to_category(categories=categories, ordered=ordered),
        elementwise=False,
    )

    return self


@pf.register_dataframe_method
def make_dummies(
    self, exclude=None, drop_first=True, prefix_sep="__", **kwargs
):
    """Convert categorical variables in data frame to dummies and remove
       target variable.

    Args:
        exclude (str or None, optional):
            Name of target variable (exclude from the feature list).
            Defaults to None.
        drop_first (bool, optional):
            See pandas.get_dummies(). Defaults to True.
        prefix_sep (str, optional):
            See pandas.get_dummies(). Defaults to "__".

    Returns:
        If `exclude` is not present: pandas.DataFrame (the X)
        If `exclude` is present: pandas.DataFrame and pandas.Series
                              (the X and y)
    """
    if exclude is not None:
        y = self[exclude]
        self = self.drop(columns=exclude)

    df_with_dummies = pd.get_dummies(
        self, drop_first=drop_first, prefix_sep=prefix_sep, **kwargs
    )

    if exclude is not None:
        return (df_with_dummies, y)

    return df_with_dummies

Contents of file "functions.R" (R functions used via "functions.py")

# R functions for data analysis

#' Convert Post-Hoc Test Results to CLD
#'
#' Convert p values from pairwise comparisons to CLD.
#'
#' CLD - compact letter display.
#' This function is a wrapper around [multcompView::multcompLetters()].
#'
#' @note
#' No hyphens are allowed in group names
#' (values of columns `group1` and `group2`).
#'
#' @param .data (data frame with at least 3 columns)
#'        The result of pairwise comparison test usually from \pkg{rstatix}
#'        package.
#' @param group1,group2 Name of the columns in `.data`, which contain the names
#'        of first and second group. Defaults to "group1" and "group2".
#' @param p_name Name of the column, which contains p values.
#'        Defaults to `p.adj`.
#' @param output_gr_var Name of the column with group names in the output.
#'        Defaults to "group".
#' @param alpha Significance level. Defaults to 0.05.
#'
#' @return Data frame with compared group names and CLD representation of
#'         test results. Contains columns with group names and CLD results
#'         (`cld` and `spaced_cld`).

convert_pairwise_p_to_cld <- function(.data,
                                      group1 = "group1",
                                      group2 = "group2",
                                      p_name = "p.adj",
                                      output_gr_var = "group",
                                      alpha = 0.05) {

  # Checking input
  col_names <- c(group1, group2, p_name)
  missing_col <- !col_names %in% colnames(.data)

  if (any(missing_col)) {
    stop(
      "Check your input as these columns are not present in data: ",
      paste(col_names[missing_col], collapse = ", ")
    )
  }

  # Analysis
  pair_names <- stringr::str_glue("{.data[[group1]]}-{.data[[group2]]}")

  # Prepare input data
  cld_obj <- purrr::set_names(.data[[p_name]], pair_names) |>
    # Get CLD
    multcompView::multcompLetters(threshold = alpha)

    # If no differences are detected, then "$monospacedLetters" is not created,
    # then "$Letters" is used instead.
    if (is.null(cld_obj$monospacedLetters)) {
      cld_obj$monospacedLetters <- cld_obj$Letters
    }
    # Format the results
    cld_obj |>
      with(
        dplyr::full_join(
          Letters |>
            tibble::enframe(output_gr_var, "cld"),
          monospacedLetters |>
            tibble::enframe(output_gr_var, "spaced_cld") |>
            dplyr::mutate(
              spaced_cld = stringr::str_replace_all(spaced_cld, " ", "_")
            ),
          by = output_gr_var
        )
      )
}

2 Methods

This section briefly introduces the main aspects of inferential statistics and predictive modeling used in this project.

2.1 Statistical Inference

For differences in proportions, the χ² (chi-squared) test was performed, with pairwise χ² tests as post-hoc. Goodman’s method was used to calculate confidence intervals of multinomial proportions.
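To make the interval calculation more concrete, here is a minimal sketch of Goodman's simultaneous confidence intervals for multinomial proportions, written with the standard library only. The function name `goodman_ci` and the example counts are hypothetical and are not taken from the project's code; the project may rely on a library implementation instead.

```python
from math import sqrt
from statistics import NormalDist


def goodman_ci(counts, alpha=0.05):
    """Simultaneous CIs for multinomial proportions (Goodman's method)."""
    n = sum(counts)
    k = len(counts)
    # Chi-squared quantile (1 df) at level 1 - alpha/k, obtained as
    # the square of the standard normal quantile at 1 - alpha/(2k).
    a = NormalDist().inv_cdf(1 - alpha / (2 * k)) ** 2
    intervals = []
    for n_i in counts:
        half_width = sqrt(a * (a + 4 * n_i * (n - n_i) / n))
        lower = (a + 2 * n_i - half_width) / (2 * (n + a))
        upper = (a + 2 * n_i + half_width) / (2 * (n + a))
        intervals.append((lower, upper))
    return intervals


# Hypothetical counts for three categories (illustration only)
counts = [460, 250, 290]
for n_i, (lower, upper) in zip(counts, goodman_ci(counts)):
    print(f"p = {n_i / sum(counts):.3f}, CI: [{lower:.3f}, {upper:.3f}]")
```

Dividing the significance level by the number of categories is what makes the intervals simultaneous rather than per-category.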

For differences between groups of numeric variables, the Kruskal-Wallis test was performed, followed by the Conover-Iman test as post-hoc. Confidence intervals of means were calculated using a t-distribution-based method.
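As an illustration of what the omnibus test measures, the sketch below computes the Kruskal-Wallis H statistic from pooled ranks, including the usual tie correction. It is a simplified, standard-library-only example with a hypothetical function name (`kruskal_wallis_h`); it does not compute the p-value and assumes the data contain at least two distinct values. The project's own analysis uses library routines rather than this code.

```python
from collections import Counter


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic (tie-corrected) for a list of samples."""
    # Pool all observations, remembering the group each one came from
    pooled = sorted(
        (value, g) for g, values in enumerate(groups) for value in values
    )
    n_total = len(pooled)

    # Sum of ranks per group; tied values receive their average rank
    rank_sums = [0.0] * len(groups)
    i = 0
    while i < n_total:
        j = i
        while j < n_total and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of ranks i+1, ..., j
        for _, g in pooled[i:j]:
            rank_sums[g] += avg_rank
        i = j

    h = 12 / (n_total * (n_total + 1)) * sum(
        r**2 / len(values) for r, values in zip(rank_sums, groups)
    ) - 3 * (n_total + 1)

    # Tie correction factor
    tie_sizes = Counter(value for value, _ in pooled).values()
    correction = 1 - sum(t**3 - t for t in tie_sizes) / (
        n_total**3 - n_total
    )
    return h / correction


# Three fully separated toy groups
print(kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))
```

Larger values of H indicate that the rank distributions of the groups differ more strongly; under the null hypothesis, H approximately follows a χ² distribution with (number of groups − 1) degrees of freedom.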

In this project, the confidence level is 95% and the significance level is 0.05.

2.2 Predictive Modelling

For predictive modeling, training (data from all seasons except the last one) and test (data from the last season only) sets were used. The training set was used for model selection and the test set for performance evaluation of the selected models.
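The season-based split described above can be sketched as follows. This is a minimal illustration on plain Python records; the function name `split_by_last_season`, the column name `season`, and the toy records are assumptions, and the sketch relies on season labels such as "2015/2016" sorting chronologically as strings.

```python
def split_by_last_season(records, season_key="season"):
    """Hold out the most recent season as the test set."""
    last_season = max(r[season_key] for r in records)
    train = [r for r in records if r[season_key] != last_season]
    test = [r for r in records if r[season_key] == last_season]
    return train, test


# Hypothetical records (illustration only)
matches = [
    {"season": "2013/2014", "home_goals": 2},
    {"season": "2014/2015", "home_goals": 0},
    {"season": "2015/2016", "home_goals": 1},
    {"season": "2015/2016", "home_goals": 3},
]
train, test = split_by_last_season(matches)
print(len(train), "training records,", len(test), "test records")
```

Splitting by time rather than at random keeps information from the evaluated season out of model selection.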

For the regression task, linear regression and random forests (RF) were used. For the classification task, logistic regression and RF were used. Forward sequential feature selection (SFS) with 5-fold cross-validation (CV) was used to find an optimal combination of variables. The optimized metric was RMSE (root mean squared error) in regression and BAcc (balanced accuracy), which takes class imbalance into account, in classification.
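To show what the two optimized metrics measure, here is a small standard-library sketch. The helper names `rmse` and `balanced_accuracy` and the toy labels are illustrative assumptions; the project itself computes these metrics with scikit-learn.

```python
from math import sqrt


def rmse(y_true, y_pred):
    """Root mean squared error."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sqrt(sum(squared_errors) / len(squared_errors))


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall."""
    recalls = []
    for cls in set(y_true):
        pred_for_cls = [p for t, p in zip(y_true, y_pred) if t == cls]
        recalls.append(sum(p == cls for p in pred_for_cls) / len(pred_for_cls))
    return sum(recalls) / len(recalls)


# Imbalanced toy example: a model that always predicts the majority class
y_true = ["home"] * 8 + ["away"] * 2
y_pred = ["home"] * 10
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print("Acc:", accuracy, "BAcc:", balanced_accuracy(y_true, y_pred))
```

In this example, plain accuracy is 0.8 although the minority class is never detected, while balanced accuracy is 0.5, the value expected from an uninformative two-class model.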

Because some calculations are time-consuming, some analyses were restricted to keep the computation time reasonable: the total number of available features, the maximum number of features allowed into the model, or both were limited. These limits were chosen based on either the RF feature importance analysis or the results of previous calculations (the number of possibly valuable features and the time needed to perform a certain amount of calculations).

Models with greater performance were desirable, but less complex models with almost the same level of performance as the best one were preferred.
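One common way to formalize this preference is the "one standard error" rule: pick the smallest number of features whose mean CV score lies within one standard error of the best mean score. The sketch below is a generic illustration under that assumption; the function name `pick_parsimonious` and the scores are hypothetical, and the exact rule applied by the feature selection library used in the project may differ in detail.

```python
from math import sqrt
from statistics import mean, stdev


def pick_parsimonious(cv_scores_by_k):
    """Smallest k with mean CV score within 1 SE of the best mean score.

    Assumes that higher scores are better (e.g., balanced accuracy).
    """
    means = {k: mean(scores) for k, scores in cv_scores_by_k.items()}
    best_k = max(means, key=means.get)
    best_scores = cv_scores_by_k[best_k]
    se_best = stdev(best_scores) / sqrt(len(best_scores))
    threshold = means[best_k] - se_best
    return min(k for k, m in means.items() if m >= threshold)


# Hypothetical 5-fold CV scores for models with 1 to 4 features
cv_scores = {
    1: [0.50, 0.52, 0.51, 0.49, 0.50],
    2: [0.58, 0.60, 0.59, 0.57, 0.61],
    3: [0.59, 0.61, 0.60, 0.58, 0.62],
    4: [0.59, 0.62, 0.60, 0.58, 0.62],
}
print("Selected number of features:", pick_parsimonious(cv_scores))
```

Here the model with four features has the best mean score, but the three-feature model is within one standard error of it and is therefore selected.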

3 Initial Exploration

In this section, the database is presented. Data summaries as well as database tables are explored to better understand the data itself and what pre-processing steps are needed.

3.1 Database

The “Ultimate 25k+ Matches Football Database – European” (v2) was downloaded from Kaggle. The database consists of 7 tables. The entity relationship diagram (ERD) is shown below (Fig. 3.1). Note that some columns from the `Match` table are not shown in the ERD.

Code: Create connection to SQL database

db = sqlite3.connect("data/database.sqlite")

Code

query = """--sql
SELECT name 
FROM sqlite_master 
WHERE type = 'table' AND name != 'sqlite_sequence';
"""
cursor = db.cursor()
cursor.execute(query)

print("Data tables in the database: ")
for i, tbl in enumerate(cursor.fetchall(), start=1):
    print("  ", i, ". ", *tbl, sep="")

Data tables in the database: 
  1. Player_Attributes
  2. Player
  3. Match
  4. League
  5. Country
  6. Team
  7. Team_Attributes

**Fig. 3.1.** ERD of European Football Matches Database created with dbSchema. Some columns in `Match` table are hidden. Notation: `#` – numeric variable, `t` – text variable, `↗` – reference to other table, foreign key, `↙` – reference from other table.

3.2 Tables `Country` and `League`

Tables `Country` and `League` have 11 distinct records each. As Scotland and England are regions of the United Kingdom (UK), there are only 10 countries.

Code

# Working with SQL database
import sqlite3

query = """--sql
SELECT
    (SELECT COUNT(DISTINCT name) FROM Country) n_regions,
    (SELECT COUNT(DISTINCT name) FROM League) n_leagues;
"""
pd.read_sql_query(query, db).style.hide(axis="index")

**Table 3.1.** Inspection: number of unique items in `country` and `league` tables.
n_regions	n_leagues
11	11

Code

pd.read_sql_query("SELECT * FROM Country", db).index_start_at(1).style

**Table 3.2.** Inspection: table `country`.
	id	name
1	1	Belgium
2	1,729	England
3	4,769	France
4	7,809	Germany
5	10,257	Italy
6	13,274	Netherlands
7	15,722	Poland
8	17,642	Portugal
9	19,694	Scotland
10	21,518	Spain
11	24,558	Switzerland

Code

pd.read_sql_query("SELECT * FROM League", db).index_start_at(1).style

**Table 3.3.** Inspection: table `league`.
	id	country_id	name
1	1	1	Belgium Jupiler League
2	1,729	1,729	England Premier League
3	4,769	4,769	France Ligue 1
4	7,809	7,809	Germany 1. Bundesliga
5	10,257	10,257	Italy Serie A
6	13,274	13,274	Netherlands Eredivisie
7	15,722	15,722	Poland Ekstraklasa
8	17,642	17,642	Portugal Liga ZON Sagres
9	19,694	19,694	Scotland Premier League
10	21,518	21,518	Spain LIGA BBVA
11	24,558	24,558	Switzerland Super League

League and country/region ID codes coincide, so these variables contain redundant information.

Details: Country/Region and league IDs are the same.

Code

query = """--sql
SELECT 
    id league_id, 
    country_id region_id, 
    IIF(id==country_id, 'yes', 'no') id_are_equal
FROM League;
"""
pd.read_sql_query(query, db).index_start_at(1).style

	league_id	region_id	id_are_equal
1	1	1	yes
2	1,729	1,729	yes
3	4,769	4,769	yes
4	7,809	7,809	yes
5	10,257	10,257	yes
6	13,274	13,274	yes
7	15,722	15,722	yes
8	17,642	17,642	yes
9	19,694	19,694	yes
10	21,518	21,518	yes
11	24,558	24,558	yes

3.3 Table `Match`

Table match includes information on 25,979 matches played from 2008-07-18 to 2016-05-25 (seasons from 2008/2009 to 2015/2016), approximately 3,200–3,400 matches per season (except the season 2013/2014 with 3,032 matches, where some data is likely to be missing). More details on the match dataset are given in Tables 3.4–3.5.

Code

query = """--sql
SELECT 
    (SELECT COUNT(1) FROM Match) n_records,
    (SELECT COUNT(DISTINCT country_id) FROM Match) n_regions,
    (SELECT COUNT(DISTINCT league_id)  FROM Match) n_leagues,
    (SELECT COUNT(DISTINCT season)     FROM Match) n_seasons,
    (SELECT COUNT(DISTINCT team) FROM (
        SELECT home_team_api_id team FROM Match UNION
        SELECT away_team_api_id team FROM Match
    )) n_teams,
    (SELECT COUNT(DISTINCT player) FROM (
        SELECT home_player_1  player FROM Match UNION
        SELECT home_player_2  player FROM Match UNION
        SELECT home_player_3  player FROM Match UNION
        SELECT home_player_4  player FROM Match UNION
        SELECT home_player_5  player FROM Match UNION
        SELECT home_player_6  player FROM Match UNION
        SELECT home_player_7  player FROM Match UNION
        SELECT home_player_8  player FROM Match UNION
        SELECT home_player_9  player FROM Match UNION
        SELECT home_player_10 player FROM Match UNION
        SELECT home_player_11 player FROM Match UNION
        SELECT away_player_1  player FROM Match UNION
        SELECT away_player_2  player FROM Match UNION
        SELECT away_player_3  player FROM Match UNION
        SELECT away_player_4  player FROM Match UNION
        SELECT away_player_5  player FROM Match UNION
        SELECT away_player_6  player FROM Match UNION
        SELECT away_player_7  player FROM Match UNION
        SELECT away_player_8  player FROM Match UNION
        SELECT away_player_9  player FROM Match UNION
        SELECT away_player_10 player FROM Match UNION
        SELECT away_player_11 player FROM Match
    )) n_players,
    (SELECT COUNT(DISTINCT match_api_id) FROM Match) n_matches;
"""
n_matches = pd.read_sql_query(query, db)
n_matches.style.hide(axis="index")

**Table 3.4.** Inspection: number of unique items in `match` table.
n_records	n_regions	n_leagues	n_seasons	n_teams	n_players	n_matches
25,979	11	11	8	299	11,060	25,979

Code

query = """--sql
SELECT season, COUNT(season) n_matches FROM Match GROUP BY season;
"""
pd.read_sql_query(query, db).index_start_at(1).style

**Table 3.5.** Number of matches per season in `match` table.
	season	n_matches
1	2008/2009	3,326
2	2009/2010	3,230
3	2010/2011	3,260
4	2011/2012	3,220
5	2012/2013	3,260
6	2013/2014	3,032
7	2014/2015	3,325
8	2015/2016	3,326
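To see where the lower count for season 2013/2014 comes from, the matches can additionally be grouped by league. This is a minimal sketch, not part of the original analysis: the helper `matches_per_season_and_league` is hypothetical, and the in-memory table holds a handful of invented rows only so that the example runs on its own.

```python
import sqlite3


def matches_per_season_and_league(con):
    """Count matches for each (season, league_id) pair in table `Match`."""
    query = """
    SELECT season, league_id, COUNT(*) AS n_matches
    FROM Match
    GROUP BY season, league_id
    ORDER BY league_id, season;
    """
    return con.execute(query).fetchall()


# Toy stand-in for the real `Match` table (invented rows, illustration only).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Match (season TEXT, league_id INTEGER)")
toy_rows = (
    [("2012/2013", 1)] * 3
    + [("2013/2014", 1)] * 1      # a league with fewer matches in this season
    + [("2012/2013", 1729)] * 2
    + [("2013/2014", 1729)] * 2
)
con.executemany("INSERT INTO Match VALUES (?, ?)", toy_rows)

counts = matches_per_season_and_league(con)
for season, league_id, n in counts:
    print(season, league_id, n)
```

With the real connection object (`db` in this report), a league whose 2013/2014 count drops sharply compared to its other seasons would be the likely source of the missing records.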

Code: Import match

match = pd.read_sql_query("SELECT * FROM Match", db)
# Fix datetime data type
match = match.to_datetime("date")
# Print
match.head(2)

**Table 3.6.** Inspection: a few rows of table `match`.
	id	country_id	league_id	season	stage	date	match_api_id	home_team_api_id	away_team_api_id	home_team_goal	away_team_goal	home_player_X1	home_player_X2	home_player_X3	home_player_X4	home_player_X5	home_player_X6	home_player_X7	home_player_X8	home_player_X9	home_player_X10	home_player_X11	away_player_X1	away_player_X2	away_player_X3	away_player_X4	away_player_X5	away_player_X6	away_player_X7	away_player_X8	away_player_X9	away_player_X10	away_player_X11	home_player_Y1	home_player_Y2	home_player_Y3	home_player_Y4	home_player_Y5	home_player_Y6	home_player_Y7	home_player_Y8	home_player_Y9	home_player_Y10	home_player_Y11	away_player_Y1	away_player_Y2	away_player_Y3	away_player_Y4	away_player_Y5	away_player_Y6	away_player_Y7	away_player_Y8	away_player_Y9	away_player_Y10	away_player_Y11	home_player_1	home_player_2	home_player_3	home_player_4	home_player_5	home_player_6	home_player_7	home_player_8	home_player_9	home_player_10	home_player_11	away_player_1	away_player_2	away_player_3	away_player_4	away_player_5	away_player_6	away_player_7	away_player_8	away_player_9	away_player_10	away_player_11	goal	shoton	shotoff	foulcommit	card	cross	corner	possession	B365H	B365D	B365A	BWH	BWD	BWA	IWH	IWD	IWA	LBH	LBD	LBA	PSH	PSD	PSA	WHH	WHD	WHA	SJH	SJD	SJA	VCH	VCD	VCA	GBH	GBD	GBA	BSH	BSD	BSA
0	1	1	1	2008/2009	1	2008-08-17	492473	9987	9993	1	1	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	None	None	None	None	None	None	None	None	1.73	3.40	5.00	1.75	3.35	4.20	1.85	3.20	3.50	1.80	3.30	3.75	NaN	NaN	NaN	1.70	3.30	4.33	1.90	3.30	4.00	1.65	3.40	4.50	1.78	3.25	4.00	1.73	3.40	4.20
1	2	1	1	2008/2009	1	2008-08-16	492474	10000	9994	0	0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	None	None	None	None	None	None	None	None	1.95	3.20	3.60	1.80	3.30	3.95	1.90	3.20	3.50	1.90	3.20	3.50	NaN	NaN	NaN	1.83	3.30	3.60	1.95	3.30	3.80	2.00	3.25	3.25	1.85	3.25	3.75	1.91	3.25	3.60

The following variables contain uncleaned HTML/XML-like text values and many missing values (NAs in 45% of cases), so they will not be included in the further analysis:

goal
shoton
shotoff
foulcommit
card
cross
corner
possession

Variables with player coordinates (such as home_player_X1 through away_player_Y11) will be excluded too.
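The exclusion rules above could be expressed as a small column filter. The sketch below is an assumption about how this might be done, not the code used in the project; the names `XML_LIKE_COLUMNS`, `COORDINATE_PATTERN` and `columns_to_exclude` are hypothetical.

```python
import re

# Text columns with HTML/XML-like values, as listed above.
XML_LIKE_COLUMNS = {
    "goal", "shoton", "shotoff", "foulcommit",
    "card", "cross", "corner", "possession",
}

# Coordinate columns: home_player_X1 ... away_player_Y11.
# Player ID columns such as home_player_1 do not match (no X/Y before the number).
COORDINATE_PATTERN = re.compile(r"^(home|away)_player_[XY]\d+$")


def columns_to_exclude(columns):
    """Return the column names that fall under either exclusion rule."""
    return [
        col for col in columns
        if col in XML_LIKE_COLUMNS or COORDINATE_PATTERN.match(col)
    ]


example_columns = [
    "id", "goal", "home_player_X1", "home_player_1", "away_player_Y11", "B365H",
]
print(columns_to_exclude(example_columns))
# ['goal', 'home_player_X1', 'away_player_Y11']
```

With a pandas data frame, the resulting list could be passed to `match.drop(columns=...)`.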

The dataset contains columns with betting odds from various betting websites. In betting odds-related variable names (e.g., B365H), the first few symbols indicate the betting website and the last letter has the following meaning:

A – Away wins,
D – Draw,
H – Home wins.

These variables can be renamed to make them easier to understand. In addition, betting odds from the websites abbreviated as PS (57% NAs), SJ (34%), GB (45%), and BS (45%) have many missing values.
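One possible renaming scheme is sketched below. The website prefixes are taken from the column names of table match; the target naming style (for example `B365_home_wins`) and the helper `rename_odds_column` are assumptions chosen for illustration and are not necessarily the names used later in this project.

```python
import re

# Meaning of the last letter in betting odds column names.
OUTCOMES = {"H": "home_wins", "D": "draw", "A": "away_wins"}

# Betting website prefixes that appear in the `match` table.
PROVIDERS = ["B365", "BW", "IW", "LB", "PS", "WH", "SJ", "VC", "GB", "BS"]

ODDS_PATTERN = re.compile(r"^(" + "|".join(PROVIDERS) + r")([HDA])$")


def rename_odds_column(name):
    """Map e.g. 'B365H' to 'B365_home_wins'; leave other names unchanged."""
    found = ODDS_PATTERN.match(name)
    if found is None:
        return name
    provider, outcome = found.groups()
    return f"{provider}_{OUTCOMES[outcome]}"


print(rename_odds_column("B365H"))   # B365_home_wins
print(rename_odds_column("BWD"))     # BW_draw
print(rename_odds_column("season"))  # season
```

Because the function returns non-matching names unchanged, it could be passed directly to pandas as `match.rename(columns=rename_odds_column)`.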

Other highlights from the profiling report:

as expected, the distribution of matches shows yearly patterns (section on variable date in the data profiling report).
the correlation between various betting odds is high (section on correlation in the report). This could be investigated in more detail.

Details: Text columns to exclude

This is just a short illustration of the issue: see the column top, which shows the most frequent value, in the rows from goal downwards. See also the column of missing values in the overview of match table. More details can be explored in the data profiling report for match table.

Code

match.describe(include="O").T

	count	unique	top	freq
season	25979	8	2008/2009	3326
goal	14217	13225	<goal />	993
shoton	14217	8464	<shoton />	5754
shotoff	14217	8464	<shotoff />	5754
foulcommit	14217	8466	<foulcommit />	5752
card	14217	13777	<card />	441
cross	14217	8466	<cross />	5752
corner	14217	8465	<corner />	5753
possession	14217	8420	<possession />	5798
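The 45% share of missing values quoted above follows from the counts in this table: 14,217 non-missing values out of 25,979 records. A small arithmetic check (the helper name `na_share` is just for illustration):

```python
def na_share(n_rows, n_non_missing):
    """Percentage of missing values given total and non-missing counts."""
    return 100 * (n_rows - n_non_missing) / n_rows


# 25,979 records in `match`, 14,217 non-missing values in each text column.
print(round(na_share(25979, 14217), 1))  # 45.3
```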

EDA: Overview of match table

Code

skim(match)

╭──────────────────────────────────────────────── skimpy summary ─────────────────────────────────────────────────╮
│          Data Summary                Data Types                                                                 │
│ ┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┓ ┏━━━━━━━━━━━━━┳━━━━━━━┓                                                          │
│ ┃ dataframe         ┃ Values ┃ ┃ Column Type ┃ Count ┃                                                          │
│ ┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━┩ ┡━━━━━━━━━━━━━╇━━━━━━━┩                                                          │
│ │ Number of rows    │ 25979  │ │ float64     │ 96    │                                                          │
│ │ Number of columns │ 115    │ │ int32       │ 9     │                                                          │
│ └───────────────────┴────────┘ │ string      │ 9     │                                                          │
│                                │ datetime64  │ 1     │                                                          │
│                                └─────────────┴───────┘                                                          │
│                                                     number                                                      │
│ ┏━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┓  │
│ ┃ column_name        ┃ NA     ┃ NA %  ┃ mean     ┃ sd      ┃ p0      ┃ p25     ┃ p75      ┃ p100    ┃ hist   ┃  │
│ ┡━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━┩  │
│ │ id                 │      0 │     0 │    13000 │    7500 │       1 │    6500 │    19000 │   26000 │ ██████ │  │
│ │ country_id         │      0 │     0 │    12000 │    7600 │       1 │    4800 │    18000 │   25000 │ ▇█▄▆▆▇ │  │
│ │ league_id          │      0 │     0 │    12000 │    7600 │       1 │    4800 │    18000 │   25000 │ ▇█▄▆▆▇ │  │
│ │ stage              │      0 │     0 │       18 │      10 │       1 │       9 │       27 │      38 │ █▇▇▇▇▅ │  │
│ │ match_api_id       │      0 │     0 │  1200000 │  490000 │  480000 │  770000 │  1700000 │ 2200000 │ █▇▅▄▄▄ │  │
│ │ home_team_api_id   │      0 │     0 │    10000 │   14000 │    1600 │    8500 │     9900 │  270000 │   █    │  │
│ │ away_team_api_id   │      0 │     0 │    10000 │   14000 │    1600 │    8500 │     9900 │  270000 │   █    │  │
│ │ home_team_goal     │      0 │     0 │      1.5 │     1.3 │       0 │       1 │        2 │      10 │  █▅▁   │  │
│ │ away_team_goal     │      0 │     0 │      1.2 │     1.1 │       0 │       0 │        2 │       9 │  █▂▁   │  │
│ │ home_player_X1     │   1800 │     7 │        1 │   0.022 │       0 │       1 │        1 │       2 │     █  │  │
│ │ home_player_X2     │   1800 │     7 │      2.1 │    0.39 │       0 │       2 │        2 │       8 │   █▁   │  │
│ │ home_player_X3     │   1800 │   7.1 │      4.1 │    0.39 │       1 │       4 │        4 │       8 │    █   │  │
│ │ home_player_X4     │   1800 │   7.1 │        6 │    0.45 │       2 │       6 │        6 │       8 │     █▁ │  │
│ │ home_player_X5     │   1800 │   7.1 │      7.5 │     1.6 │       1 │       8 │        8 │       9 │ ▁    █ │  │
│ │ home_player_X6     │   1800 │   7.1 │      3.2 │     1.2 │       1 │       2 │        4 │       9 │  █▆▇▂  │  │
│ │ home_player_X7     │   1800 │   7.1 │      4.8 │     1.1 │       1 │       4 │        6 │       9 │   ▁▅█  │  │
│ │ home_player_X8     │   1800 │   7.1 │      5.3 │     1.7 │       1 │       3 │        7 │       9 │  ▅▁█▅▁ │  │
│ │ home_player_X9     │   1800 │   7.1 │      5.8 │       2 │       1 │       5 │        8 │       9 │  ▄▁█▁█ │  │
│ │ home_player_X10    │   1800 │   7.1 │      5.4 │     1.5 │       1 │       4 │        7 │       9 │   █▅▅▁ │  │
│ │ home_player_X11    │   1800 │   7.1 │      5.8 │    0.76 │       1 │       5 │        6 │       7 │     ▅█ │  │
│ │ away_player_X1     │   1800 │   7.1 │        1 │   0.033 │       1 │       1 │        1 │       6 │   █    │  │
│ │ away_player_X2     │   1800 │   7.1 │      2.1 │     0.4 │       1 │       2 │        2 │       8 │   █▁   │  │
│ │ away_player_X3     │   1800 │   7.1 │      4.1 │    0.39 │       2 │       4 │        4 │       9 │    █   │  │
│ │ away_player_X4     │   1800 │   7.1 │      6.1 │    0.45 │       1 │       6 │        6 │       8 │     █▁ │  │
│ │ away_player_X5     │   1800 │   7.1 │      7.5 │     1.6 │       1 │       8 │        8 │       9 │ ▁    █ │  │
│ │ away_player_X6     │   1800 │   7.1 │      3.2 │     1.3 │       1 │       2 │        4 │       9 │  █▆▇▂  │  │
│ │ away_player_X7     │   1800 │   7.1 │      4.7 │     1.1 │       1 │       4 │        6 │       9 │  ▁▁▅█  │  │
│ │ away_player_X8     │   1800 │   7.1 │      5.3 │     1.7 │       1 │       3 │        7 │       9 │  ▅▁█▅▁ │  │
│ │ away_player_X9     │   1800 │   7.1 │      5.8 │       2 │       1 │       5 │        8 │       9 │  ▄▁█▂█ │  │
│ │ away_player_X10    │   1800 │   7.1 │      5.5 │     1.5 │       1 │       4 │        7 │       9 │   █▆▆▂ │  │
│ │ away_player_X11    │   1800 │   7.1 │      5.8 │    0.76 │       3 │       5 │        6 │       8 │   █▇▄  │  │
│ │ home_player_Y1     │   1800 │     7 │        1 │   0.025 │       0 │       1 │        1 │       3 │    █   │  │
│ │ home_player_Y2     │   1800 │     7 │        3 │   0.064 │       0 │       3 │        3 │       3 │      █ │  │
│ │ home_player_Y3     │   1800 │   7.1 │        3 │   0.013 │       3 │       3 │        3 │       5 │   █    │  │
│ │ home_player_Y4     │   1800 │   7.1 │        3 │   0.029 │       3 │       3 │        3 │       5 │   █    │  │
│ │ home_player_Y5     │   1800 │   7.1 │      3.2 │    0.94 │       3 │       3 │        3 │       8 │   █    │  │
│ │ home_player_Y6     │   1800 │   7.1 │      6.5 │    0.74 │       3 │       6 │        7 │       9 │   ▁▅█  │  │
│ │ home_player_Y7     │   1800 │   7.1 │      6.7 │    0.59 │       3 │       6 │        7 │       9 │    ▄█  │  │
│ │ home_player_Y8     │   1800 │   7.1 │      7.2 │    0.59 │       3 │       7 │        8 │      10 │    █▄  │  │
│ │ home_player_Y9     │   1800 │   7.1 │        8 │     1.1 │       1 │       7 │        8 │      10 │     █▃ │  │
│ │ home_player_Y10    │   1800 │   7.1 │      9.2 │     1.1 │       3 │       8 │       10 │      11 │    ▅▁█ │  │
│ │ home_player_Y11    │   1800 │   7.1 │       10 │    0.51 │       1 │      10 │       11 │      11 │      █ │  │
│ │ away_player_Y1     │   1800 │   7.1 │        1 │   0.022 │       1 │       1 │        1 │       3 │   █    │  │
│ │ away_player_Y2     │   1800 │   7.1 │        3 │       0 │       3 │       3 │        3 │       3 │     █  │  │
│ │ away_player_Y3     │   1800 │   7.1 │        3 │   0.026 │       3 │       3 │        3 │       7 │   █    │  │
│ │ away_player_Y4     │   1800 │   7.1 │        3 │   0.029 │       3 │       3 │        3 │       7 │   █    │  │
│ │ away_player_Y5     │   1800 │   7.1 │      3.2 │    0.96 │       3 │       3 │        3 │       9 │ █   ▁  │  │
│ │ away_player_Y6     │   1800 │   7.1 │      6.5 │    0.76 │       3 │       6 │        7 │      10 │   ▁▅█  │  │
│ │ away_player_Y7     │   1800 │   7.1 │      6.7 │    0.59 │       3 │       6 │        7 │      10 │    ▄█  │  │
│ │ away_player_Y8     │   1800 │   7.1 │      7.2 │    0.58 │       3 │       7 │        8 │      10 │    █▄  │  │
│ │ away_player_Y9     │   1800 │   7.1 │        8 │     1.1 │       5 │       7 │        8 │      11 │   █▆▁▄ │  │
│ │ away_player_Y10    │   1800 │   7.1 │      9.2 │     1.1 │       6 │       8 │       10 │      11 │  ▁▄▁█  │  │
│ │ away_player_Y11    │   1800 │   7.1 │       10 │     0.5 │       7 │      10 │       11 │      11 │     █▇ │  │
│ │ home_player_1      │   1200 │   4.7 │    77000 │   88000 │    3000 │   31000 │    97000 │  700000 │   █▁   │  │
│ │ home_player_2      │   1300 │   5.1 │   110000 │  110000 │    2800 │   33000 │   160000 │  750000 │  █▂▁   │  │
│ │ home_player_3      │   1300 │   4.9 │    92000 │  100000 │    2800 │   31000 │   130000 │  710000 │  █▂▁   │  │
│ │ home_player_4      │   1300 │   5.1 │    95000 │  100000 │    2800 │   31000 │   150000 │  720000 │  █▂▁   │  │
│ │ home_player_5      │   1300 │   5.1 │   110000 │  110000 │    2800 │   34000 │   160000 │  730000 │  █▂▁   │  │
│ │ home_player_6      │   1300 │   5.1 │   100000 │  110000 │    2600 │   31000 │   150000 │  750000 │  █▂▁   │  │
│ │ home_player_7      │   1200 │   4.7 │    97000 │  110000 │    2600 │   31000 │   140000 │  690000 │  █▂▁   │  │
│ │ home_player_8      │   1300 │     5 │   110000 │  110000 │    2600 │   33000 │   160000 │  690000 │  █▂▁   │  │
│ │ home_player_9      │   1300 │   4.9 │   110000 │  120000 │    2600 │   33000 │   160000 │  730000 │  █▃▁   │  │
│ │ home_player_10     │   1400 │   5.5 │   110000 │  110000 │    2600 │   32000 │   160000 │  740000 │  █▂▁   │  │
│ │ home_player_11     │   1600 │     6 │   100000 │  110000 │    2800 │   33000 │   160000 │  730000 │  █▂▁   │  │
│ │ away_player_1      │   1200 │   4.7 │    77000 │   87000 │    2800 │   31000 │    97000 │  700000 │   █▁   │  │
│ │ away_player_2      │   1300 │   4.9 │   110000 │  110000 │    2800 │   33000 │   160000 │  750000 │  █▃▁   │  │
│ │ away_player_3      │   1300 │     5 │    91000 │  100000 │    2800 │   30000 │   120000 │  710000 │  █▂▁   │  │
│ │ away_player_4      │   1300 │   5.1 │    95000 │  100000 │    2800 │   31000 │   150000 │  730000 │  █▂▁   │  │
│ │ away_player_5      │   1300 │   5.1 │   110000 │  110000 │    2800 │   33000 │   160000 │  750000 │  █▂▁   │  │
│ │ away_player_6      │   1300 │   5.1 │   100000 │  110000 │    2600 │   31000 │   150000 │  720000 │  █▂▁   │  │
│ │ away_player_7      │   1200 │   4.8 │    98000 │  110000 │    2600 │   31000 │   140000 │  750000 │   █▂   │  │
│ │ away_player_8      │   1300 │   5.2 │   110000 │  120000 │    2600 │   33000 │   160000 │  720000 │  █▂▁   │  │
│ │ away_player_9      │   1300 │   5.1 │   110000 │  120000 │    2600 │   33000 │   160000 │  720000 │  █▃▁   │  │
│ │ away_player_10     │   1400 │   5.5 │   110000 │  110000 │    2800 │   33000 │   160000 │  720000 │  █▂▁   │  │
│ │ away_player_11     │   1600 │     6 │   100000 │  110000 │    2800 │   33000 │   160000 │  730000 │  █▂▁   │  │
│ │ B365H              │   3400 │    13 │      2.6 │     1.8 │       1 │     1.7 │      2.8 │      26 │   █    │  │
│ │ B365D              │   3400 │    13 │      3.8 │     1.1 │     1.4 │     3.3 │        4 │      17 │   █▂   │  │
│ │ B365A              │   3400 │    13 │      4.7 │     3.7 │     1.1 │     2.5 │      5.2 │      51 │   █▁   │  │
│ │ BWH                │   3400 │    13 │      2.6 │     1.6 │       1 │     1.6 │      2.8 │      34 │   █    │  │
│ │ BWD                │   3400 │    13 │      3.7 │       1 │     1.6 │     3.2 │      3.8 │      20 │   █▁   │  │
│ │ BWA                │   3400 │    13 │      4.4 │     3.3 │     1.1 │     2.5 │        5 │      51 │   █▁   │  │
│ │ IWH                │   3500 │    13 │      2.5 │     1.4 │       1 │     1.6 │      2.6 │      20 │   █▁   │  │
│ │ IWD                │   3500 │    13 │      3.6 │     0.8 │     1.5 │     3.2 │      3.7 │      11 │  ▁█▁   │  │
│ │ IWA                │   3500 │    13 │      4.2 │     2.9 │     1.1 │     2.5 │      4.6 │      25 │   █▂   │  │
│ │ LBH                │   3400 │    13 │      2.5 │     1.6 │       1 │     1.7 │      2.7 │      26 │   █    │  │
│ │ LBD                │   3400 │    13 │      3.7 │       1 │     1.4 │     3.2 │      3.8 │      19 │   █▁   │  │
│ │ LBA                │   3400 │    13 │      4.4 │     3.4 │     1.1 │     2.5 │        5 │      51 │   █▁   │  │
│ │ PSH                │  15000 │    57 │      2.8 │     2.2 │       1 │     1.7 │        3 │      36 │   █    │  │
│ │ PSD                │  15000 │    57 │      4.1 │     1.5 │     2.2 │     3.4 │      4.2 │      29 │   █    │  │
│ │ PSA                │  15000 │    57 │        5 │     4.5 │     1.1 │     2.6 │      5.4 │      48 │   █▁   │  │
│ │ WHH                │   3400 │    13 │      2.6 │     1.7 │       1 │     1.7 │      2.8 │      26 │   █    │  │
│ │ WHD                │   3400 │    13 │      3.7 │    0.96 │       1 │     3.2 │      3.8 │      17 │   █▃   │  │
│ │ WHA                │   3400 │    13 │      4.5 │     3.6 │     1.1 │     2.5 │        5 │      51 │   █▁   │  │
│ │ SJH                │   8900 │    34 │      2.6 │     1.7 │       1 │     1.7 │      2.8 │      23 │   █▁   │  │
│ │ SJD                │   8900 │    34 │      3.8 │       1 │     1.4 │     3.2 │      3.8 │      15 │   █▃   │  │
│ │ SJA                │   8900 │    34 │      4.6 │     3.6 │     1.1 │     2.5 │      5.2 │      41 │   █▁   │  │
│ │ VCH                │   3400 │    13 │      2.7 │     1.9 │       1 │     1.7 │      2.8 │      36 │   █    │  │
│ │ VCD                │   3400 │    13 │      3.9 │     1.2 │     1.6 │     3.3 │        4 │      26 │   █▁   │  │
│ │ VCA                │   3400 │    13 │      4.8 │     4.3 │     1.1 │     2.5 │      5.4 │      67 │   █    │  │
│ │ GBH                │  12000 │    45 │      2.5 │     1.5 │     1.1 │     1.7 │      2.6 │      21 │   █▁   │  │
│ │ GBD                │  12000 │    45 │      3.6 │    0.87 │     1.4 │     3.2 │      3.8 │      11 │  ▁█▁   │  │
│ │ GBA                │  12000 │    45 │      4.4 │       3 │     1.1 │     2.5 │        5 │      34 │   █▁   │  │
│ │ BSH                │  12000 │    45 │      2.5 │     1.5 │       1 │     1.7 │      2.6 │      17 │   █▁   │  │
│ │ BSD                │  12000 │    45 │      3.7 │    0.87 │     1.3 │     3.2 │      3.8 │      13 │  ▅█▁   │  │
│ │ BSA                │  12000 │    45 │      4.4 │     3.2 │     1.1 │     2.5 │        5 │      34 │   █▁   │  │
│ └────────────────────┴────────┴───────┴──────────┴─────────┴─────────┴─────────┴──────────┴─────────┴────────┘  │
│                                                    datetime                                                     │
│ ┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓  │
│ ┃ column_name            ┃ NA     ┃ NA %      ┃ first               ┃ last                ┃ frequency        ┃  │
│ ┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩  │
│ │ date                   │      0 │         0 │     2008-07-18      │     2016-05-25      │ None             │  │
│ └────────────────────────┴────────┴───────────┴─────────────────────┴─────────────────────┴──────────────────┘  │
│                                                     string                                                      │
│ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┓  │
│ ┃ column_name              ┃ NA           ┃ NA %       ┃ words per row              ┃ total words            ┃  │
│ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━┩  │
│ │ season                   │            0 │          0 │                          1 │                  26000 │  │
│ │ goal                     │        12000 │         45 │                          1 │                  26000 │  │
│ │ shoton                   │        12000 │         45 │                          1 │                  26000 │  │
│ │ shotoff                  │        12000 │         45 │                          1 │                  26000 │  │
│ │ foulcommit               │        12000 │         45 │                          1 │                  26000 │  │
│ │ card                     │        12000 │         45 │                          1 │                  26000 │  │
│ │ cross                    │        12000 │         45 │                          1 │                  26000 │  │
│ │ corner                   │        12000 │         45 │                          1 │                  26000 │  │
│ │ possession               │        12000 │         45 │                          1 │                  26000 │  │
│ └──────────────────────────┴──────────────┴────────────┴────────────────────────────┴────────────────────────┘  │
╰────────────────────────────────────────────────────── End ──────────────────────────────────────────────────────╯

EDA: Data Profiling Report of match

Code

if do_eda:
    eda.ProfileReport(
        match,
        title="Data Profiling Report: match",
        config_file="_config/ydata_profile_config--mini.yaml",
    )

3.4 Table `Player`

The database includes information on 11,060 European football players. No missing values or other obvious discrepancies were found in this dataset.

Code

query = """--sql
SELECT COUNT(*) n_records, COUNT(DISTINCT player_api_id) n_players FROM Player;
"""
n_players = pd.read_sql_query(query, db)

# Print
n_players.style.hide(axis="index")

**Table 3.7.** Inspection: number of unique items in `player` table.
n_records	n_players
11,060	11,060

Code: Import player

player = pd.read_sql_query("SELECT * FROM Player;", db)
# Fix datetime data type
player = player.to_datetime("birthday")
# Print
player.head(2)

**Table 3.8.** Inspection: a few rows of table `player`.
	id	player_api_id	player_name	player_fifa_api_id	birthday	height	weight
0	1	505942	Aaron Appindangoye	218353	1992-02-29	182.88	187
1	2	155782	Aaron Cresswell	189615	1989-12-15	170.18	146

EDA: Overview of player table

Code

skim(player)

╭──────────────────────────────────────────────── skimpy summary ─────────────────────────────────────────────────╮
│          Data Summary                Data Types                                                                 │
│ ┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┓ ┏━━━━━━━━━━━━━┳━━━━━━━┓                                                          │
│ ┃ dataframe         ┃ Values ┃ ┃ Column Type ┃ Count ┃                                                          │
│ ┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━┩ ┡━━━━━━━━━━━━━╇━━━━━━━┩                                                          │
│ │ Number of rows    │ 11060  │ │ int32       │ 4     │                                                          │
│ │ Number of columns │ 7      │ │ string      │ 1     │                                                          │
│ └───────────────────┴────────┘ │ datetime64  │ 1     │                                                          │
│                                │ float64     │ 1     │                                                          │
│                                └─────────────┴───────┘                                                          │
│                                                     number                                                      │
│ ┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━┳━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┓  │
│ ┃ column_name            ┃ NA  ┃ NA %  ┃ mean     ┃ sd       ┃ p0    ┃ p25     ┃ p75     ┃ p100    ┃ hist    ┃  │
│ ┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━╇━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━┩  │
│ │ id                     │   0 │     0 │     5500 │     3200 │     1 │    2800 │    8300 │   11000 │ ██████  │  │
│ │ player_api_id          │   0 │     0 │   160000 │   160000 │  2600 │   36000 │  210000 │  750000 │  █▃▁▁▁  │  │
│ │ player_fifa_api_id     │   0 │     0 │   170000 │    59000 │     2 │  150000 │  200000 │  230000 │ ▂▁▁▂█▇  │  │
│ │ height                 │   0 │     0 │      180 │      6.4 │   160 │     180 │     190 │     210 │   ▂▆█▁  │  │
│ │ weight                 │   0 │     0 │      170 │       15 │   120 │     160 │     180 │     240 │   ▃█▃   │  │
│ └────────────────────────┴─────┴───────┴──────────┴──────────┴───────┴─────────┴─────────┴─────────┴─────────┘  │
│                                                    datetime                                                     │
│ ┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓  │
│ ┃ column_name            ┃ NA     ┃ NA %      ┃ first               ┃ last                ┃ frequency        ┃  │
│ ┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩  │
│ │ birthday               │      0 │         0 │     1967-01-23      │     1999-04-24      │ None             │  │
│ └────────────────────────┴────────┴───────────┴─────────────────────┴─────────────────────┴──────────────────┘  │
│                                                     string                                                      │
│ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┓  │
│ ┃ column_name               ┃ NA      ┃ NA %       ┃ words per row                ┃ total words              ┃  │
│ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━┩  │
│ │ player_name               │       0 │          0 │                            2 │                    22000 │  │
│ └───────────────────────────┴─────────┴────────────┴──────────────────────────────┴──────────────────────────┘  │
╰────────────────────────────────────────────────────── End ──────────────────────────────────────────────────────╯

EDA: Data Profiling Report of player

Code

if do_eda:
    eda.ProfileReport(
        player,
        title="Data Profiling Report: player",
        config_file="_config/ydata_profile_config--mini.yaml",
    )

3.5 Table `Player_Attributes`

Table Player_Attributes contains 183,978 records describing various attributes of 11,060 players. Variables in this dataset have from 0.45% to 1.46% of missing values.

Some numeric variables are bimodal, which might indicate that they have different importance for different player roles:

ball_control
interceptions
marking
standing_tackle
sliding_tackle

Goalkeeper-related variables also have a distinct distribution: a few players have high scores (most probably goalkeepers) and many have low scores (most probably players in the remaining roles):

gk_diving
gk_handling
gk_kicking
gk_positioning
gk_reflexes

There are 3 categorical variables, including 2 related to work rates. FIFA defines work rate categories as either “low”, “medium” or “high” [1]. In the dataset, these columns contain more values than that, and the additional values can be treated as errors in most cases, especially when the values make no sense. What is more, when comparing the attacking and defensive work rate columns, some errors in one column indicate what kind of errors will be present in the other column. Some of those errors are characteristic only of data dated before 2012, which suggests that they might be data scraping errors or reflect missing information on the scraped webpages.
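A simple way to treat the unexpected work rate values as errors is to keep only the three FIFA categories and turn everything else into a missing value. The sketch below assumes this rule; the helper `clean_work_rate` is hypothetical, and the invalid inputs in the example are placeholders rather than values observed in the dataset.

```python
# The three work rate categories defined by FIFA.
VALID_WORK_RATES = {"low", "medium", "high"}


def clean_work_rate(value):
    """Return a normalized valid work rate, or None for any other value."""
    if isinstance(value, str):
        normalized = value.strip().lower()
        if normalized in VALID_WORK_RATES:
            return normalized
    return None


# Placeholder inputs (not taken from the dataset).
raw_values = ["medium", "High ", "not_a_rate", None, 5]
print([clean_work_rate(v) for v in raw_values])
# ['medium', 'high', None, None, None]
```

With pandas, the same function could be applied column-wise, e.g. `player_attributes["attacking_work_rate"].map(clean_work_rate)`.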

Code

query = """--sql
SELECT COUNT(1) n_records, COUNT(DISTINCT player_api_id) n_players
FROM Player_Attributes;
"""

n_player_attributes = pd.read_sql_query(query, db)
n_player_attributes.style.hide(axis="index")

**Table 3.9.** Inspection: number of unique items in `player_attributes` table.
n_records	n_players
183,978	11,060

Code: Import player_attributes

# Import
player_attributes = pd.read_sql_query("SELECT * FROM Player_Attributes", db)

# Fix datetime data type
player_attributes = player_attributes.to_datetime("date")

# Print
player_attributes.head(2)

**Table 3.10.** Inspection: a few rows of table `player_attributes`.
	id	player_fifa_api_id	player_api_id	date	overall_rating	potential	preferred_foot	attacking_work_rate	defensive_work_rate	crossing	finishing	heading_accuracy	short_passing	volleys	dribbling	curve	free_kick_accuracy	long_passing	ball_control	acceleration	sprint_speed	agility	reactions	balance	shot_power	jumping	stamina	strength	long_shots	aggression	interceptions	positioning	vision	penalties	marking	standing_tackle	sliding_tackle	gk_diving	gk_handling	gk_kicking	gk_positioning	gk_reflexes
0	1	218353	505942	2016-02-18	67.00	71.00	right	medium	medium	49.00	44.00	71.00	61.00	44.00	51.00	45.00	39.00	64.00	49.00	60.00	64.00	59.00	47.00	65.00	55.00	58.00	54.00	76.00	35.00	71.00	70.00	45.00	54.00	48.00	65.00	69.00	69.00	6.00	11.00	10.00	8.00	8.00
1	2	218353	505942	2015-11-19	67.00	71.00	right	medium	medium	49.00	44.00	71.00	61.00	44.00	51.00	45.00	39.00	64.00	49.00	60.00	64.00	59.00	47.00	65.00	55.00	58.00	54.00	76.00	35.00	71.00	70.00	45.00	54.00	48.00	65.00	69.00	69.00	6.00	11.00	10.00	8.00	8.00

EDA: Overview of player_attributes table

Code

skim(player_attributes)

╭──────────────────────────────────────────────── skimpy summary ─────────────────────────────────────────────────╮
│          Data Summary                Data Types                                                                 │
│ ┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┓ ┏━━━━━━━━━━━━━┳━━━━━━━┓                                                          │
│ ┃ dataframe         ┃ Values ┃ ┃ Column Type ┃ Count ┃                                                          │
│ ┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━┩ ┡━━━━━━━━━━━━━╇━━━━━━━┩                                                          │
│ │ Number of rows    │ 183978 │ │ float64     │ 35    │                                                          │
│ │ Number of columns │ 42     │ │ int32       │ 3     │                                                          │
│ └───────────────────┴────────┘ │ string      │ 3     │                                                          │
│                                │ datetime64  │ 1     │                                                          │
│                                └─────────────┴───────┘                                                          │
│                                                     number                                                      │
│ ┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┓  │
│ ┃ column_name            ┃ NA    ┃ NA %  ┃ mean     ┃ sd      ┃ p0    ┃ p25     ┃ p75     ┃ p100    ┃ hist   ┃  │
│ ┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━┩  │
│ │ id                     │     0 │     0 │    92000 │   53000 │     1 │   46000 │  140000 │  180000 │ ██████ │  │
│ │ player_fifa_api_id     │     0 │     0 │   170000 │   54000 │     2 │  160000 │  200000 │  230000 │ ▁▁▁▂█▅ │  │
│ │ player_api_id          │     0 │     0 │   140000 │  140000 │  2600 │   35000 │  190000 │  750000 │  █▃▁▁  │  │
│ │ overall_rating         │   840 │  0.45 │       69 │       7 │    33 │      64 │      73 │      94 │   ▃█▃  │  │
│ │ potential              │   840 │  0.45 │       73 │     6.6 │    39 │      69 │      78 │      97 │   ▃█▄  │  │
│ │ crossing               │   840 │  0.45 │       55 │      17 │     1 │      45 │      68 │      95 │ ▁▂▄██▁ │  │
│ │ finishing              │   840 │  0.45 │       50 │      19 │     1 │      34 │      65 │      97 │ ▁▅▆█▇▁ │  │
│ │ heading_accuracy       │   840 │  0.45 │       57 │      16 │     1 │      49 │      68 │      98 │ ▁▁▃█▆▁ │  │
│ │ short_passing          │   840 │  0.45 │       62 │      14 │     3 │      57 │      72 │      97 │  ▁▁▇█▁ │  │
│ │ volleys                │  2700 │   1.5 │       49 │      18 │     1 │      35 │      64 │      93 │ ▁▄▅█▆▁ │  │
│ │ dribbling              │   840 │  0.45 │       59 │      18 │     1 │      52 │      72 │      97 │ ▁▁▂▆█▁ │  │
│ │ curve                  │  2700 │   1.5 │       53 │      18 │     2 │      41 │      67 │      94 │ ▁▃▅█▇▁ │  │
│ │ free_kick_accuracy     │   840 │  0.45 │       49 │      18 │     1 │      36 │      63 │      97 │ ▁▄▇█▆▁ │  │
│ │ long_passing           │   840 │  0.45 │       57 │      14 │     3 │      49 │      67 │      97 │  ▂▃█▅  │  │
│ │ ball_control           │   840 │  0.45 │       63 │      15 │     5 │      58 │      73 │      97 │  ▁▁▆█▁ │  │
│ │ acceleration           │   840 │  0.45 │       68 │      13 │    10 │      61 │      77 │      97 │  ▁▂▅█▂ │  │
│ │ sprint_speed           │   840 │  0.45 │       68 │      13 │    12 │      62 │      77 │      97 │  ▁▂▆█▂ │  │
│ │ agility                │  2700 │   1.5 │       66 │      13 │    11 │      58 │      75 │      96 │  ▁▂▇█▂ │  │
│ │ reactions              │   840 │  0.45 │       66 │     9.2 │    17 │      61 │      72 │      96 │   ▂█▆  │  │
│ │ balance                │  2700 │   1.5 │       65 │      13 │    12 │      58 │      74 │      96 │  ▁▃▇█▂ │  │
│ │ shot_power             │   840 │  0.45 │       62 │      16 │     2 │      54 │      73 │      97 │  ▁▂▆█▁ │  │
│ │ jumping                │  2700 │   1.5 │       67 │      11 │    14 │      60 │      74 │      96 │   ▂██▁ │  │
│ │ stamina                │   840 │  0.45 │       67 │      13 │    10 │      61 │      76 │      96 │  ▁▁▆█▂ │  │
│ │ strength               │   840 │  0.45 │       67 │      12 │    10 │      60 │      76 │      96 │   ▁▆█▂ │  │
│ │ long_shots             │   840 │  0.45 │       53 │      18 │     1 │      41 │      67 │      96 │ ▁▃▄█▇▁ │  │
│ │ aggression             │   840 │  0.45 │       61 │      16 │     6 │      51 │      73 │      97 │  ▂▃▇█▂ │  │
│ │ interceptions          │   840 │  0.45 │       52 │      19 │     1 │      34 │      68 │      96 │  ▆▄▇█▁ │  │
│ │ positioning            │   840 │  0.45 │       56 │      18 │     2 │      45 │      69 │      96 │ ▁▃▃▇█▁ │  │
│ │ vision                 │  2700 │   1.5 │       58 │      15 │     1 │      49 │      69 │      97 │  ▁▃█▇▁ │  │
│ │ penalties              │   840 │  0.45 │       55 │      16 │     2 │      45 │      67 │      96 │  ▂▄█▆▁ │  │
│ │ marking                │   840 │  0.45 │       47 │      21 │     1 │      25 │      66 │      96 │ ▂█▄▇▇▁ │  │
│ │ standing_tackle        │   840 │  0.45 │       50 │      21 │     1 │      29 │      69 │      95 │ ▁▆▃▄█▁ │  │
│ │ sliding_tackle         │  2700 │   1.5 │       48 │      22 │     2 │      25 │      67 │      95 │ ▂▇▄▅█▁ │  │
│ │ gk_diving              │   840 │  0.45 │       15 │      17 │     1 │       7 │      13 │      94 │   █    │  │
│ │ gk_handling            │   840 │  0.45 │       16 │      16 │     1 │       8 │      15 │      93 │ █▁  ▁  │  │
│ │ gk_kicking             │   840 │  0.45 │       21 │      21 │     1 │       8 │      15 │      97 │ █  ▁▁  │  │
│ │ gk_positioning         │   840 │  0.45 │       16 │      16 │     1 │       8 │      15 │      96 │   █▁   │  │
│ │ gk_reflexes            │   840 │  0.45 │       16 │      17 │     1 │       8 │      15 │      96 │ █▁  ▁  │  │
│ └────────────────────────┴───────┴───────┴──────────┴─────────┴───────┴─────────┴─────────┴─────────┴────────┘  │
│                                                    datetime                                                     │
│ ┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓  │
│ ┃ column_name            ┃ NA     ┃ NA %      ┃ first               ┃ last                ┃ frequency        ┃  │
│ ┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩  │
│ │ date                   │      0 │         0 │     2007-02-22      │     2016-07-07      │ None             │  │
│ └────────────────────────┴────────┴───────────┴─────────────────────┴─────────────────────┴──────────────────┘  │
│                                                     string                                                      │
│ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓  │
│ ┃ column_name                        ┃ NA        ┃ NA %      ┃ words per row           ┃ total words         ┃  │
│ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩  │
│ │ preferred_foot                     │       840 │      0.45 │                       1 │              180000 │  │
│ │ attacking_work_rate                │      3200 │       1.8 │                       1 │              180000 │  │
│ │ defensive_work_rate                │       840 │      0.45 │                       1 │              180000 │  │
│ └────────────────────────────────────┴───────────┴───────────┴─────────────────────────┴─────────────────────┘  │
╰────────────────────────────────────────────────────── End ──────────────────────────────────────────────────────╯

EDA for categorical variables in player_attributes (1: existing values)

Code

player_attributes.preferred_foot.value_counts().to_df()

	preferred_foot	count
0	right	138409
1	left	44733

Code

player_attributes.attacking_work_rate.value_counts().to_df()

	attacking_work_rate	count
0	medium	125070
1	high	42823
2	low	8569
3	None	3639
4	norm	348
5	y	106
6	le	104
7	stoc	89

Code

player_attributes.defensive_work_rate.value_counts().to_df()

	defensive_work_rate	count
0	medium	130846
1	high	27041
2	low	18432
3	_0	2394
4	o	1550
5	1	441
6	ormal	348
7	2	342
8	3	258
9	5	234
10	7	217
11	0	197
12	6	197
13	9	152
14	4	116
15	es	106
16	ean	104
17	tocky	89
18	8	78

EDA for categorical variables in player_attributes (2: patterns)

Cells with zero values are in pastel red.

Code

# wr - work rate
wr_cats = ["low", "medium", "high"]

(
    pd.crosstab(
        player_attributes.defensive_work_rate.to_category(wr_cats),
        player_attributes.attacking_work_rate.to_category(wr_cats),
    )
    .style.background_gradient()
    .highlight_between(left=0, right=0, color="#FFBBBB")
)

attacking_work_rate	low	medium	high	None	le	norm	stoc	y
defensive_work_rate
low	695	12,003	5,727	7	0	0	0	0
medium	4,525	97,154	29,085	82	0	0	0	0
high	3,319	15,714	7,939	69	0	0	0	0
0	0	9	11	177	0	0	0	0
1	0	35	9	397	0	0	0	0
2	0	76	13	253	0	0	0	0
3	12	11	0	235	0	0	0	0
4	18	9	0	89	0	0	0	0
5	0	11	17	206	0	0	0	0
6	0	21	13	163	0	0	0	0
7	0	9	5	203	0	0	0	0
8	0	5	0	73	0	0	0	0
9	0	13	4	135	0	0	0	0
ean	0	0	0	0	104	0	0	0
es	0	0	0	0	0	0	0	106
o	0	0	0	1,550	0	0	0	0
ormal	0	0	0	0	0	348	0	0
tocky	0	0	0	0	0	0	89	0

Code

pd.crosstab(
    player_attributes.attacking_work_rate.to_category(wr_cats),
    player_attributes.date.dt.year.rename("year"),
).style.highlight_between(left=0, right=0, color="#FFBBBB")

year	2007	2008	2009	2010	2011	2012	2013	2014	2015	2016
attacking_work_rate
low	719	302	490	560	640	613	1,729	1,565	1,352	599
medium	10,445	3,687	5,851	7,329	8,717	9,384	27,524	22,285	20,955	8,893
high	2,432	929	1,524	1,854	2,182	2,497	9,175	8,606	9,162	4,462
None	735	338	436	498	270	131	402	336	349	144
le	35	17	21	23	8	0	0	0	0	0
norm	111	56	64	94	23	0	0	0	0	0
stoc	25	13	20	21	10	0	0	0	0	0
y	32	15	24	25	10	0	0	0	0	0

Code

pd.crosstab(
    player_attributes.defensive_work_rate.to_category(wr_cats),
    player_attributes.date.dt.year.rename("year"),
).style.highlight_between(left=0, right=0, color="#FFBBBB")

year	2007	2008	2009	2010	2011	2012	2013	2014	2015	2016
defensive_work_rate
low	1,368	567	890	1,050	1,170	1,236	3,986	3,347	3,325	1,493
medium	10,539	3,633	5,812	7,313	8,805	9,582	28,658	23,858	22,847	9,799
high	1,674	711	1,154	1,370	1,559	1,667	5,764	5,199	5,281	2,662
0	19	10	15	12	8	9	29	41	37	17
1	42	11	18	27	27	19	89	85	88	35
2	30	8	11	19	17	21	72	56	80	28
3	39	16	21	21	21	23	51	37	25	4
4	18	8	8	14	13	10	21	14	8	2
5	29	6	8	13	13	20	49	57	25	14
6	26	10	13	14	12	10	27	33	32	20
7	22	9	14	18	14	8	32	34	49	17
8	9	4	6	6	7	8	20	9	5	4
9	20	8	12	16	11	12	32	22	16	3
_0	868	439	560	419	108	0	0	0	0	0
ean	35	17	21	23	8	0	0	0	0	0
es	32	15	24	25	10	0	0	0	0	0
o	496	255	319	348	132	0	0	0	0	0
ormal	111	56	64	94	23	0	0	0	0	0
tocky	25	13	20	21	10	0	0	0	0	0
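The cross-tabulations above suggest a simple cleaning rule: keep `low`, `medium`, and `high`, and treat every other value as missing. Below is a minimal sketch of that rule using only the standard library and a few made-up rows; the helper name `clean_work_rate` is hypothetical and not part of the project's code.

```python
# Hypothetical helper: values outside the three FIFA categories become None.
VALID_WORK_RATES = ("low", "medium", "high")


def clean_work_rate(value):
    """Return the value if it is a valid work rate, otherwise None (missing)."""
    return value if value in VALID_WORK_RATES else None


# Made-up sample mixing valid values with error codes seen in the tables above
raw = ["medium", "high", "norm", "_0", "low", "None", "ormal", "7", None]
cleaned = [clean_work_rate(v) for v in raw]

n_missing = sum(v is None for v in cleaned)
print(cleaned)
print(f"{n_missing} of {len(raw)} values treated as missing")
```

In pandas, converting the column to a categorical dtype with these three categories has a similar effect, since values outside the categories become NaN.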

EDA: Data Profiling Report of player_attributes

Code

if do_eda:
    eda.ProfileReport(
        player_attributes,
        title="Data Profiling Report: player_attributes",
        config_file="_config/ydata_profile_config--default.yaml",
    )

3.6 Table `Team`

Table team contains records on 299 football teams.

Code

query = """--sql
SELECT COUNT(1) n_records, COUNT(DISTINCT team_api_id) n_teams FROM Team;
"""
n_teams = pd.read_sql_query(query, db)
n_teams.style.hide(axis="index")

**Table 3.11.** Inspection: number of unique items in `team` table.
n_records	n_teams
299	299

Code: Import team

team = pd.read_sql_query("SELECT * FROM Team ", db)
# Print
team.head(2).style.hide(axis="index").format(precision=1)

**Table 3.12.** Inspection: a few rows of table `team`.
id	team_api_id	team_fifa_api_id	team_long_name	team_short_name
1	9987	673.0	KRC Genk	GEN
2	9993	675.0	Beerschot AC	BAC

EDA: Overview of team table

Code

skim(team)

╭──────────────────────────────────────────────── skimpy summary ─────────────────────────────────────────────────╮
│          Data Summary                Data Types                                                                 │
│ ┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┓ ┏━━━━━━━━━━━━━┳━━━━━━━┓                                                          │
│ ┃ dataframe         ┃ Values ┃ ┃ Column Type ┃ Count ┃                                                          │
│ ┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━┩ ┡━━━━━━━━━━━━━╇━━━━━━━┩                                                          │
│ │ Number of rows    │ 299    │ │ int32       │ 2     │                                                          │
│ │ Number of columns │ 5      │ │ string      │ 2     │                                                          │
│ └───────────────────┴────────┘ │ float64     │ 1     │                                                          │
│                                └─────────────┴───────┘                                                          │
│                                                     number                                                      │
│ ┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┓  │
│ ┃ column_name            ┃ NA   ┃ NA %   ┃ mean    ┃ sd      ┃ p0     ┃ p25    ┃ p75     ┃ p100    ┃ hist    ┃  │
│ ┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━┩  │
│ │ id                     │    0 │      0 │   24000 │   15000 │      1 │   9600 │   36000 │   52000 │ ██▅▇▆▆  │  │
│ │ team_api_id            │    0 │      0 │   12000 │   26000 │   1600 │   8300 │    9900 │  270000 │    █    │  │
│ │ team_fifa_api_id       │   11 │    3.7 │   22000 │   42000 │      1 │    180 │    1900 │  110000 │ █    ▂  │  │
│ └────────────────────────┴──────┴────────┴─────────┴─────────┴────────┴────────┴─────────┴─────────┴─────────┘  │
│                                                     string                                                      │
│ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┓  │
│ ┃ column_name                     ┃ NA     ┃ NA %       ┃ words per row              ┃ total words           ┃  │
│ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━┩  │
│ │ team_long_name                  │      0 │          0 │                        2.1 │                   610 │  │
│ │ team_short_name                 │      0 │          0 │                        2.1 │                   610 │  │
│ └─────────────────────────────────┴────────┴────────────┴────────────────────────────┴───────────────────────┘  │
╰────────────────────────────────────────────────────── End ──────────────────────────────────────────────────────╯

EDA: Data Profiling Report of team

Code

if do_eda:
    eda.ProfileReport(
        team,
        title="Data Profiling Report: team",
        config_file="_config/ydata_profile_config--mini.yaml",
    )

3.7 Table `Team_Attributes`

Table team_attributes contains 1,458 records on 288 teams, which is 11 teams fewer than in table team. What is more, data is available only from 2010 onward.

Some variables, like buildUpPlayDribbling and buildUpPlayDribblingClass, come in both a numeric version (without the word Class in the name) and a categorical version (with the word Class). Graphical inspection shows that the numeric ranges of different classes do not overlap.

Categorical variables buildUpPlayPositioningClass, chanceCreationPositioningClass, and defenceDefenderLineClass do not have numeric equivalents.

Code

query = """--sql
SELECT COUNT(1) n_records, COUNT(DISTINCT team_api_id) n_teams
FROM Team_Attributes;
"""
pd.read_sql_query(query, db).style.hide(axis="index")

**Table 3.13.** Inspection: number of unique items in `team_attributes` table.
n_records	n_teams
1,458	288
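The teams that have no attribute records can be listed with an anti-join. The sketch below uses an in-memory SQLite database with made-up rows that mimic only the relevant columns of `Team` and `Team_Attributes`; all sample values are assumptions for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript(
    """
    CREATE TABLE Team (team_api_id INTEGER, team_long_name TEXT);
    CREATE TABLE Team_Attributes (team_api_id INTEGER, date TEXT);
    -- Made-up rows: team 3 has no attribute records
    INSERT INTO Team VALUES (1, 'Team A'), (2, 'Team B'), (3, 'Team C');
    INSERT INTO Team_Attributes VALUES
        (1, '2010-02-22'), (1, '2011-02-22'), (2, '2010-02-22');
    """
)

# Anti-join: keep only teams for which no matching attribute row exists
query = """
SELECT t.team_api_id, t.team_long_name
FROM Team t
LEFT JOIN Team_Attributes ta ON ta.team_api_id = t.team_api_id
WHERE ta.team_api_id IS NULL;
"""
teams_without_attributes = con.execute(query).fetchall()
print(teams_without_attributes)
con.close()
```

Run against the real database, the same query would be expected to return the teams missing from `Team_Attributes`.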

Code: Import team_attributes

# Import
team_attributes = pd.read_sql_query("SELECT * FROM Team_Attributes;", db)
# Pre-process
team_attributes = team_attributes.to_datetime("date")
# Print
team_attributes.head(2)

**Table 3.14.** Inspection: a few rows of table `team_attributes`.
	id	team_fifa_api_id	team_api_id	date	buildUpPlaySpeed	buildUpPlaySpeedClass	buildUpPlayDribbling	buildUpPlayDribblingClass	buildUpPlayPassing	buildUpPlayPassingClass	buildUpPlayPositioningClass	chanceCreationPassing	chanceCreationPassingClass	chanceCreationCrossing	chanceCreationCrossingClass	chanceCreationShooting	chanceCreationShootingClass	chanceCreationPositioningClass	defencePressure	defencePressureClass	defenceAggression	defenceAggressionClass	defenceTeamWidth	defenceTeamWidthClass	defenceDefenderLineClass
0	1	434	9930	2010-02-22	60	Balanced	NaN	Little	50	Mixed	Organised	60	Normal	65	Normal	55	Normal	Organised	50	Medium	55	Press	45	Normal	Cover
1	2	434	9930	2014-09-19	52	Balanced	48.00	Normal	56	Mixed	Organised	54	Normal	63	Normal	64	Normal	Organised	47	Medium	44	Press	54	Normal	Cover

EDA: Overview of team_attributes table

Code

skim(team_attributes)

╭──────────────────────────────────────────────── skimpy summary ─────────────────────────────────────────────────╮
│          Data Summary                Data Types                                                                 │
│ ┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┓ ┏━━━━━━━━━━━━━┳━━━━━━━┓                                                          │
│ ┃ dataframe         ┃ Values ┃ ┃ Column Type ┃ Count ┃                                                          │
│ ┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━┩ ┡━━━━━━━━━━━━━╇━━━━━━━┩                                                          │
│ │ Number of rows    │ 1458   │ │ string      │ 12    │                                                          │
│ │ Number of columns │ 25     │ │ int32       │ 11    │                                                          │
│ └───────────────────┴────────┘ │ datetime64  │ 1     │                                                          │
│                                │ float64     │ 1     │                                                          │
│                                └─────────────┴───────┘                                                          │
│                                                     number                                                      │
│ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┓  │
│ ┃ column_name               ┃ NA    ┃ NA %   ┃ mean    ┃ sd      ┃ p0    ┃ p25   ┃ p75   ┃ p100    ┃ hist    ┃  │
│ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━━━╇━━━━━━━━━┩  │
│ │ id                        │     0 │      0 │     730 │     420 │     1 │   370 │  1100 │    1500 │ ██████  │  │
│ │ team_fifa_api_id          │     0 │      0 │   18000 │   39000 │     1 │   110 │  1900 │  110000 │ █    ▁  │  │
│ │ team_api_id               │     0 │      0 │   10000 │   13000 │  1600 │  8500 │  9900 │  270000 │    █    │  │
│ │ buildUpPlaySpeed          │     0 │      0 │      52 │      12 │    20 │    45 │    62 │      80 │  ▄▆█▆▂  │  │
│ │ buildUpPlayDribbling      │   970 │     66 │      49 │     9.7 │    24 │    42 │    55 │      77 │ ▁▄██▂▁  │  │
│ │ buildUpPlayPassing        │     0 │      0 │      48 │      11 │    20 │    40 │    55 │      80 │  ▅▆█▃▁  │  │
│ │ chanceCreationPassin      │     0 │      0 │      52 │      10 │    21 │    46 │    59 │      80 │  ▁▃▇█▅  │  │
│ │ chanceCreationCrossi      │     0 │      0 │      54 │      11 │    20 │    47 │    62 │      80 │  ▂▄█▅▂  │  │
│ │ chanceCreationShooti      │     0 │      0 │      54 │      10 │    22 │    48 │    61 │      80 │  ▂▅█▅▁  │  │
│ │ defencePressure           │     0 │      0 │      46 │      10 │    23 │    39 │    51 │      72 │ ▂▅█▅▃▂  │  │
│ │ defenceAggression         │     0 │      0 │      49 │     9.7 │    24 │    44 │    55 │      72 │ ▁▂█▇▃▂  │  │
│ │ defenceTeamWidth          │     0 │      0 │      52 │     9.6 │    29 │    47 │    58 │      73 │ ▂▂▇█▄▂  │  │
│ └───────────────────────────┴───────┴────────┴─────────┴─────────┴───────┴───────┴───────┴─────────┴─────────┘  │
│                                                    datetime                                                     │
│ ┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓  │
│ ┃ column_name            ┃ NA     ┃ NA %      ┃ first               ┃ last                ┃ frequency        ┃  │
│ ┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩  │
│ │ date                   │      0 │         0 │     2010-02-22      │     2015-09-10      │ None             │  │
│ └────────────────────────┴────────┴───────────┴─────────────────────┴─────────────────────┴──────────────────┘  │
│                                                     string                                                      │
│ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓  │
│ ┃ column_name                           ┃ NA    ┃ NA %      ┃ words per row            ┃ total words         ┃  │
│ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩  │
│ │ buildUpPlaySpeedClas                  │     0 │         0 │                        1 │                1500 │  │
│ │ buildUpPlayDribbling                  │     0 │         0 │                        1 │                1500 │  │
│ │ buildUpPlayPassingCl                  │     0 │         0 │                        1 │                1500 │  │
│ │ buildUpPlayPositioni                  │     0 │         0 │                        1 │                1500 │  │
│ │ chanceCreationPassin                  │     0 │         0 │                        1 │                1500 │  │
│ │ chanceCreationCrossi                  │     0 │         0 │                        1 │                1500 │  │
│ │ chanceCreationShooti                  │     0 │         0 │                        1 │                1500 │  │
│ │ chanceCreationPositi                  │     0 │         0 │                        1 │                1500 │  │
│ │ defencePressureClass                  │     0 │         0 │                        1 │                1500 │  │
│ │ defenceAggressionCla                  │     0 │         0 │                        1 │                1500 │  │
│ │ defenceTeamWidthClas                  │     0 │         0 │                        1 │                1500 │  │
│ │ defenceDefenderLineC                  │     0 │         0 │                        1 │                1500 │  │
│ └───────────────────────────────────────┴───────┴───────────┴──────────────────────────┴─────────────────────┘  │
╰────────────────────────────────────────────────────── End ──────────────────────────────────────────────────────╯

EDA: Data Profiling Report of team_attributes

Code

if do_eda:
    eda.ProfileReport(
        team_attributes,
        title="Data Profiling Report: team_attributes",
        config_file="_config/ydata_profile_config--default.yaml",
    )

EDA: boxplots of variable pairs (numeric and categorical counterparts)

Boxplots indicate that the numeric values of different classes do not overlap.

Code

sns.boxplot(team_attributes, x="buildUpPlaySpeed", y="buildUpPlaySpeedClass");

Code

sns.boxplot(
    team_attributes, x="buildUpPlayDribbling", y="buildUpPlayDribblingClass"
);

Code

sns.boxplot(
    team_attributes, x="buildUpPlayPassing", y="buildUpPlayPassingClass"
);

Code

sns.boxplot(
    team_attributes, x="chanceCreationPassing", y="chanceCreationPassingClass"
);

Code

sns.boxplot(
    team_attributes, x="chanceCreationCrossing", y="chanceCreationCrossingClass"
);

Code

sns.boxplot(
    team_attributes, x="chanceCreationShooting", y="chanceCreationShootingClass"
);

Code

sns.boxplot(team_attributes, x="defencePressure", y="defencePressureClass");

Code

sns.boxplot(team_attributes, x="defenceAggression", y="defenceAggressionClass");

Code

sns.boxplot(team_attributes, x="defenceTeamWidth", y="defenceTeamWidthClass");
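The visual impression from the boxplots can also be checked numerically: if the classes do not overlap, then, after sorting the classes by their minimum, each class maximum lies below the next class minimum. Below is a minimal sketch with made-up values; the function names `class_ranges` and `ranges_overlap` and the sample data are hypothetical.

```python
from collections import defaultdict


def class_ranges(values, classes):
    """Return {class: (min, max)} of numeric values, ignoring missing values."""
    grouped = defaultdict(list)
    for value, cls in zip(values, classes):
        if value is not None:
            grouped[cls].append(value)
    return {cls: (min(v), max(v)) for cls, v in grouped.items()}


def ranges_overlap(ranges):
    """Return True if any two class ranges share at least one value."""
    ordered = sorted(ranges.values())  # sort by minimum, then by maximum
    return any(prev[1] >= nxt[0] for prev, nxt in zip(ordered, ordered[1:]))


# Made-up data imitating a numeric column and its categorical counterpart
speed = [25, 30, 40, 52, 60, 66, 70, 80, None]
speed_class = ["Slow", "Slow", "Balanced", "Balanced", "Balanced",
               "Balanced", "Fast", "Fast", "Balanced"]

ranges = class_ranges(speed, speed_class)
print(ranges)
print("Overlap:", ranges_overlap(ranges))
```

Comparing only neighbouring ranges is sufficient here because, once ranges are sorted by their minimum, any overlapping pair implies an overlapping neighbouring pair.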

3.8 Delete Tables

The tables in this section were imported for exploratory purposes only. In the next section they will be imported in the form that is needed to answer the main questions of this analysis.

Code

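# Delete the exploratory data frames.
# Note: deleting a list that holds them would remove only the list name,
# so the data frame names are deleted directly.
del match, player, player_attributes, team, team_attributes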

4 Data Import & Pre-Processing

In this section, data will be imported and pre-processed to create the following tables required for the main analyses:

To present analyzed countries and leagues:
- leagues
To compare goal scoring by league and season:
- goals_summary
To identify and analyze top teams:
- teams_top_bottom_goals
- teams_wins_per_season
To identify top players in 2015/2016 and which factors make them the best:
- players
To investigate whether home advantage exists:
- matches
To investigate the relationship between betting odds from different companies/websites:
- matches_betting_odds
For team score prediction in a match:
- team_train
- team_test
For match outcome (home wins, draw, away wins) prediction:
- match_train
- match_test

Some additional tables will be created ad-hoc in the analysis section.

4.1 Import

This section contains code that imports data to Python. Some pre-processing in SQL is also performed.

Before importing into Python:

Tables country and league were merged and the result was named leagues.
A new column country was created, in which England and Scotland are treated as one country, United Kingdom (UK).
Column region was created to indicate the regions of the UK.

Code: Import leagues (country + league)

query = """--sql
SELECT 
    l.id league_id,
    CASE 
        WHEN c.name IN ('England', 'Scotland') THEN 'United Kingdom'
        ELSE c.name
    END country,
    CASE  
        WHEN c.name IN ('England', 'Scotland') THEN c.name
        ELSE ''
    END region,
    l.name league
FROM Country c FULL JOIN League l ON ( l.country_id = c.id );
"""
leagues = pd.read_sql_query(query, db)

# Print
leagues.head(2)

**Table 4.1.** Inspection: a few rows of table `leagues`.
	league_id	country	region	league
0	1	Belgium		Belgium Jupiler League
1	1729	United Kingdom	England	England Premier League

Code

leagues.shape

(11, 4)

Before importing into Python, tables team and team_attributes were merged.

Code: Import teams (team + team_attributes)

# EXCLUDE [t.id, t.team_fifa_api_id, ta.id, ta.team_fifa_api_id, ta.team_api_id]
query = """--sql
SELECT
    t.team_api_id team_id,
    t.team_long_name team_name,
    t.team_short_name,
    
    ta.date team_info_date, 
    ta.buildUpPlayPositioningClass,
    ta.chanceCreationPositioningClass,
    ta.defenceDefenderLineClass,
    
    ta.buildUpPlaySpeed,
    ta.buildUpPlayDribbling,
    ta.buildUpPlayPassing,
    ta.chanceCreationPassing,
    ta.chanceCreationCrossing,
    ta.chanceCreationShooting,
    ta.defencePressure,
    ta.defenceAggression,
    ta.defenceTeamWidth,
    
    ta.buildUpPlaySpeedClass,
    ta.buildUpPlayDribblingClass,
    ta.buildUpPlayPassingClass,
    ta.chanceCreationPassingClass,
    ta.chanceCreationCrossingClass,
    ta.chanceCreationShootingClass,
    ta.defencePressureClass,
    ta.defenceAggressionClass,
    ta.defenceTeamWidthClass

FROM Team t FULL JOIN Team_Attributes ta 
ON ( ta.team_api_id = t.team_api_id );
"""
teams = pd.read_sql_query(query, db)

# Print
teams.head(2).style.hide(axis="index").format(precision=1)

**Table 4.2.** Inspection: a few rows of table `teams`.
team_id	team_name	team_short_name	team_info_date	buildUpPlayPositioningClass	chanceCreationPositioningClass	defenceDefenderLineClass	buildUpPlaySpeed	buildUpPlayDribbling	buildUpPlayPassing	chanceCreationPassing	chanceCreationCrossing	chanceCreationShooting	defencePressure	defenceAggression	defenceTeamWidth	buildUpPlaySpeedClass	buildUpPlayDribblingClass	buildUpPlayPassingClass	chanceCreationPassingClass	chanceCreationCrossingClass	chanceCreationShootingClass	defencePressureClass	defenceAggressionClass	defenceTeamWidthClass
9987	KRC Genk	GEN	2010-02-22 00:00:00	Organised	Organised	Cover	45.0	nan	45.0	50.0	35.0	60.0	70.0	65.0	70.0	Balanced	Little	Mixed	Normal	Normal	Normal	High	Press	Wide
9987	KRC Genk	GEN	2011-02-22 00:00:00	Organised	Organised	Offside Trap	66.0	nan	52.0	65.0	66.0	51.0	48.0	47.0	54.0	Balanced	Little	Mixed	Normal	Normal	Normal	Medium	Press	Normal

Code

teams.shape

(1469, 25)
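The row count is consistent with an outer join: 1,458 attribute records plus the 11 teams that have no attribute records give 1,469 rows, assuming every attribute record matches a team. The sketch below reproduces this arithmetic on made-up data in an in-memory SQLite database. It uses `LEFT JOIN` rather than `FULL JOIN` so that it also runs on SQLite versions older than 3.39; the two are equivalent here only because the toy data has no unmatched attribute rows.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript(
    """
    CREATE TABLE Team (team_api_id INTEGER, team_long_name TEXT);
    CREATE TABLE Team_Attributes (team_api_id INTEGER, date TEXT);
    -- Made-up rows: 3 teams, 4 attribute records, team 3 has none
    INSERT INTO Team VALUES (1, 'Team A'), (2, 'Team B'), (3, 'Team C');
    INSERT INTO Team_Attributes VALUES
        (1, '2010-02-22'), (1, '2011-02-22'),
        (2, '2010-02-22'), (2, '2011-02-22');
    """
)

n_attribute_rows = con.execute(
    "SELECT COUNT(*) FROM Team_Attributes"
).fetchone()[0]

# Every attribute row appears once; each unmatched team adds one extra row
n_joined_rows = con.execute(
    """
    SELECT COUNT(*)
    FROM Team t
    LEFT JOIN Team_Attributes ta ON ta.team_api_id = t.team_api_id
    """
).fetchone()[0]
n_unmatched_teams = n_joined_rows - n_attribute_rows

print(n_attribute_rows, n_unmatched_teams, n_joined_rows)
con.close()
```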

Before importing into Python, data about players were pre-processed:

Weight was converted to kilograms.
Birth year was extracted as a separate column.
Body mass index (BMI) was calculated.
Tables player and player_attributes were merged.
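The weight conversion and the BMI calculation listed above amount to the following arithmetic, shown here in plain Python. The constant 2.205 lb/kg and the formula match the SQL query used for the import; the function names and the sample inputs (187 lb, 182.88 cm) are illustrative assumptions.

```python
LB_PER_KG = 2.205  # approximate pounds per kilogram, as in the SQL query


def lb_to_kg(weight_lb):
    """Convert weight from pounds to kilograms."""
    return weight_lb / LB_PER_KG


def body_mass_index(weight_kg, height_cm):
    """BMI = weight (kg) divided by squared height (m)."""
    height_m = height_cm / 100
    return weight_kg / height_m**2


# Illustrative inputs (assumed, not read from the database)
weight_kg = lb_to_kg(187)
bmi = body_mass_index(weight_kg, 182.88)
print(f"weight: {weight_kg:.1f} kg, BMI: {bmi:.1f}")
```

Dividing height by 100 converts centimeters to meters, which is why the SQL expression uses `p.height/100` twice in the denominator.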

Code: Import players (player + player_attributes)

# EXCLUDE [p.id, p.player_fifa_api_id, pa.id, pa.player_fifa_api_id]
query = """--sql
SELECT 
    -- id info
    p.player_api_id player_id, 
    pa.date player_info_date,
    -- player
    p.player_name, 
    p.birthday,
    STRFTIME('%Y', p.birthday) birth_year,
    p.height,
    p.weight/2.205 weight_kg, 
    (p.weight/2.205) / ((p.height/100)*(p.height/100)) bmi,
    -- player attributes
    pa.overall_rating, pa.potential, 
    pa.preferred_foot,pa.attacking_work_rate, pa.defensive_work_rate, 
    pa.crossing, pa.finishing, pa.heading_accuracy, pa.short_passing, 
    pa.volleys, pa.dribbling, pa.curve, pa.free_kick_accuracy, 
    pa.long_passing, pa.ball_control, pa.acceleration, pa.sprint_speed, 
    pa.agility, pa.reactions, pa.balance, pa.shot_power, pa.jumping, 
    pa.stamina, pa.strength, pa.long_shots, pa.aggression, pa.interceptions, 
    pa.positioning, pa.vision, pa.penalties, pa.marking, pa.standing_tackle,
    pa.sliding_tackle, pa.gk_diving, pa.gk_handling, pa.gk_kicking, 
    pa.gk_positioning, pa.gk_reflexes
FROM Player p JOIN Player_Attributes pa 
ON ( pa.player_api_id = p.player_api_id );
"""
players = pd.read_sql_query(query, db)

# Print
players.head(2).style.hide(axis="index").format(precision=1)

**Table 4.3.** Inspection: a few rows of table `players`.
player_id	player_info_date	player_name	birthday	birth_year	height	weight_kg	bmi	overall_rating	potential	preferred_foot	attacking_work_rate	defensive_work_rate	crossing	finishing	heading_accuracy	short_passing	volleys	dribbling	curve	free_kick_accuracy	long_passing	ball_control	acceleration	sprint_speed	agility	reactions	balance	shot_power	jumping	stamina	strength	long_shots	aggression	interceptions	positioning	vision	penalties	marking	standing_tackle	sliding_tackle	gk_diving	gk_handling	gk_kicking	gk_positioning	gk_reflexes
505942	2016-02-18 00:00:00	Aaron Appindangoye	1992-02-29 00:00:00	1992	182.9	84.8	25.4	67.0	71.0	right	medium	medium	49.0	44.0	71.0	61.0	44.0	51.0	45.0	39.0	64.0	49.0	60.0	64.0	59.0	47.0	65.0	55.0	58.0	54.0	76.0	35.0	71.0	70.0	45.0	54.0	48.0	65.0	69.0	69.0	6.0	11.0	10.0	8.0	8.0
505942	2015-11-19 00:00:00	Aaron Appindangoye	1992-02-29 00:00:00	1992	182.9	84.8	25.4	67.0	71.0	right	medium	medium	49.0	44.0	71.0	61.0	44.0	51.0	45.0	39.0	64.0	49.0	60.0	64.0	59.0	47.0	65.0	55.0	58.0	54.0	76.0	35.0	71.0	70.0	45.0	54.0	48.0	65.0	69.0	69.0	6.0	11.0	10.0	8.0	8.0

Code

players.shape

(183978, 46)

From table match, only the columns of interest were imported. The resulting table was named matches.

Code: Import matches

query = """--sql
SELECT 
    -- match info
    m.id match_id, m.league_id, m.season, m.stage, m.date match_date,
    -- team info
    m.home_team_api_id home_team_id, m.away_team_api_id away_team_id,
    m.home_team_goal, m.away_team_goal,
    -- players
    m.home_player_1, m.home_player_2, m.home_player_3, m.home_player_4,
    m.home_player_5, m.home_player_6, m.home_player_7, m.home_player_8, 
    m.home_player_9, m.home_player_10, m.home_player_11, 
    m.away_player_1, m.away_player_2, m.away_player_3, m.away_player_4,
    m.away_player_5, m.away_player_6, m.away_player_7, m.away_player_8,
    m.away_player_9, m.away_player_10, m.away_player_11,
    -- betting odds
    m.B365H, m.B365D, m.B365A, m.BWH, m.BWD, m.BWA, m.IWH, m.IWD, m.IWA, 
    m.LBH, m.LBD, m.LBA, m.PSH, m.PSD, m.PSA, m.WHH, m.WHD, m.WHA, 
    m.SJH, m.SJD, m.SJA, m.VCH, m.VCD, m.VCA, m.GBH, m.GBD, m.GBA,
    m.BSH, m.BSD, m.BSA
FROM Match m;
"""
matches = pd.read_sql_query(query, db)

# Print
matches.head(2).style.hide(axis="index").format(precision=1)

**Table 4.4.** Inspection: a few rows of table `matches` (1).
match_id	league_id	season	stage	match_date	home_team_id	away_team_id	home_team_goal	away_team_goal	home_player_1	home_player_2	home_player_3	home_player_4	home_player_5	home_player_6	home_player_7	home_player_8	home_player_9	home_player_10	home_player_11	away_player_1	away_player_2	away_player_3	away_player_4	away_player_5	away_player_6	away_player_7	away_player_8	away_player_9	away_player_10	away_player_11	B365H	B365D	B365A	BWH	BWD	BWA	IWH	IWD	IWA	LBH	LBD	LBA	PSH	PSD	PSA	WHH	WHD	WHA	SJH	SJD	SJA	VCH	VCD	VCA	GBH	GBD	GBA	BSH	BSD	BSA
1	1	2008/2009	1	2008-08-17 00:00:00	9987	9993	1	1	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	1.7	3.4	5.0	1.8	3.4	4.2	1.9	3.2	3.5	1.8	3.3	3.8	nan	nan	nan	1.7	3.3	4.3	1.9	3.3	4.0	1.6	3.4	4.5	1.8	3.2	4.0	1.7	3.4	4.2
2	1	2008/2009	1	2008-08-16 00:00:00	10000	9994	0	0	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	1.9	3.2	3.6	1.8	3.3	4.0	1.9	3.2	3.5	1.9	3.2	3.5	nan	nan	nan	1.8	3.3	3.6	1.9	3.3	3.8	2.0	3.2	3.2	1.9	3.2	3.8	1.9	3.2	3.6

Code

matches.shape

(25979, 61)

4.2 Pre-Process in Python

This section contains code that pre-processes data in Python.

Pre-process teams table.

Code: Pre-process teams

teams = teams.to_datetime("team_info_date").sort_values(["team_info_date"])

Pre-process players table.

Code: Pre-process players

# For work rate (wr) variables' pre-processing
wr_categories = pd.CategoricalDtype(
    categories=["low", "medium", "high"], ordered=True
)

# Pre-process
players = (
    players.to_datetime(["birthday", "player_info_date"])
    .astype({"birth_year": int})
    .sort_values(["player_info_date"])
    .to_category("preferred_foot", ["left", "right"])
    .astype(
        {
            "defensive_work_rate": wr_categories,
            "attacking_work_rate": wr_categories,
        }
    )
)

players.head(2)

**Table 4.5.** Inspection: a few rows of table `players` (2).
	player_id	player_info_date	player_name	birthday	birth_year	height	weight_kg	bmi	overall_rating	potential	preferred_foot	attacking_work_rate	defensive_work_rate	crossing	finishing	heading_accuracy	short_passing	volleys	dribbling	curve	free_kick_accuracy	long_passing	ball_control	acceleration	sprint_speed	agility	reactions	balance	shot_power	jumping	stamina	strength	long_shots	aggression	interceptions	positioning	vision	penalties	marking	standing_tackle	sliding_tackle	gk_diving	gk_handling	gk_kicking	gk_positioning	gk_reflexes
183977	39902	2007-02-22	Zvjezdan Misimovic	1982-06-05	1982	180.34	79.82	24.54	80.00	81.00	right	medium	low	74.00	68.00	57.00	88.00	77.00	87.00	86.00	53.00	78.00	91.00	58.00	64.00	77.00	66.00	73.00	72.00	58.00	67.00	59.00	78.00	63.00	63.00	68.00	88.00	53.00	38.00	32.00	30.00	9.00	9.00	78.00	7.00	15.00
79627	38343	2007-02-22	Jef Delen	1976-06-29	1976	175.26	63.04	20.52	67.00	69.00	left	medium	medium	63.00	62.00	59.00	63.00	38.00	68.00	53.00	65.00	51.00	62.00	61.00	68.00	65.00	64.00	61.00	66.00	69.00	83.00	54.00	61.00	58.00	62.00	60.00	63.00	65.00	44.00	64.00	64.00	7.00	15.00	51.00	8.00	6.00

Merge the leagues dataset into matches and pre-process the result (this dataset contains one row per match):

Code: Pre-process matches

# Prepare for pre-processing ----------------------------------------------
# Recode match goal difference (goal_diff > 0, if home wins) into words.
def who_wins(goal_diff):
    """Recode outcome of match to text values"""
    if goal_diff < 0:
        return "Away Wins"
    elif goal_diff == 0:
        return "Draw"
    else:
        return "Home Wins"


# Objects to create categorical variables
# fmt: off
season_categories=[
    "2008/2009", "2009/2010", "2010/2011", "2011/2012", 
    "2012/2013", "2013/2014", "2014/2015", "2015/2016",
]

# Objects to rename betting odds
## Old names 
betting_odds_names_old = [
    "B365H", "B365D", "B365A", 
    "BWH", "BWD", "BWA", "IWH", "IWD", "IWA", "LBH", "LBD", "LBA", 
    "PSH", "PSD", "PSA", "WHH", "WHD", "WHA", "SJH", "SJD", "SJA",
    "VCH", "VCD", "VCA", "GBH", "GBD", "GBA", "BSH", "BSD", "BSA",
]

## New names 
betting_odds_names_new = [
    f"{i[:-1]}_home_wins"      if (i.endswith("H"))
    else f"{i[:-1]}_draw"      if (i.endswith("D"))
    else f"{i[:-1]}_away_wins" if (i.endswith("A"))
    else "error"
    for i in betting_odds_names_old
]
# fmt: on

## Names map
odds_names_map = dict(zip(betting_odds_names_old, betting_odds_names_new))

# Pre-process `matches` dataset -------------------------------------------
matches = (
    # Merge matches and leagues
    pd.merge(matches, leagues, on="league_id")
    # Drop columns
    .drop(columns="league_id")
    # Rename columns
    .rename(columns=odds_names_map)
    # Fix data types
    .to_datetime("match_date")
    .to_category("season", season_categories, ordered=True)
    .to_category("league")
    # Create new variables
    .assign(
        goal_sum=lambda x: x.home_team_goal + x.away_team_goal,
        # goal_diff > 0, if home wins:
        goal_diff=lambda x: x.home_team_goal - x.away_team_goal,
        goal_diff_sign=lambda x: np.sign(x.goal_diff),
        match_winner=lambda x: x.goal_diff.apply(who_wins).to_category(),
        # Ratio ha: "home wins / away wins"
        B365_ratio_ha=lambda x: x.B365_home_wins / x.B365_away_wins,
        BW_ratio_ha=lambda x: x.BW_home_wins / x.BW_away_wins,
        PS_ratio_ha=lambda x: x.PS_home_wins / x.PS_away_wins,
        VC_ratio_ha=lambda x: x.VC_home_wins / x.VC_away_wins,
        IW_ratio_ha=lambda x: x.IW_home_wins / x.IW_away_wins,
        WH_ratio_ha=lambda x: x.WH_home_wins / x.WH_away_wins,
        GB_ratio_ha=lambda x: x.GB_home_wins / x.GB_away_wins,
        LB_ratio_ha=lambda x: x.LB_home_wins / x.LB_away_wins,
        SJ_ratio_ha=lambda x: x.SJ_home_wins / x.SJ_away_wins,
        BS_ratio_ha=lambda x: x.BS_home_wins / x.BS_away_wins,
        # Log-ratios of ha
        B365_log_ratio_ha=lambda x: np.log(x.B365_home_wins / x.B365_away_wins),
        BW_log_ratio_ha=lambda x: np.log(x.BW_home_wins / x.BW_away_wins),
        PS_log_ratio_ha=lambda x: np.log(x.PS_home_wins / x.PS_away_wins),
        VC_log_ratio_ha=lambda x: np.log(x.VC_home_wins / x.VC_away_wins),
        IW_log_ratio_ha=lambda x: np.log(x.IW_home_wins / x.IW_away_wins),
        WH_log_ratio_ha=lambda x: np.log(x.WH_home_wins / x.WH_away_wins),
        GB_log_ratio_ha=lambda x: np.log(x.GB_home_wins / x.GB_away_wins),
        LB_log_ratio_ha=lambda x: np.log(x.LB_home_wins / x.LB_away_wins),
        SJ_log_ratio_ha=lambda x: np.log(x.SJ_home_wins / x.SJ_away_wins),
        BS_log_ratio_ha=lambda x: np.log(x.BS_home_wins / x.BS_away_wins),
    )
    # Change position of columns
    .relocate("league", before="season")
    .relocate("region", before="league")
    .relocate("country", before="region")
    .relocate("goal_sum", before="home_player_1")
    .relocate("goal_diff", before="home_player_1")
    .relocate("goal_diff_sign", before="home_player_1")
    .relocate("match_winner", before="home_player_1")
    # Sort rows by date
    .sort_values("match_date")
)

matches.head(2)

**Table 4.6.** Inspection: a few rows of table `matches` (2).
	match_id	country	region	league	season	stage	match_date	home_team_id	away_team_id	home_team_goal	away_team_goal	goal_sum	goal_diff	goal_diff_sign	match_winner	home_player_1	home_player_2	home_player_3	home_player_4	home_player_5	home_player_6	home_player_7	home_player_8	home_player_9	home_player_10	home_player_11	away_player_1	away_player_2	away_player_3	away_player_4	away_player_5	away_player_6	away_player_7	away_player_8	away_player_9	away_player_10	away_player_11	B365_home_wins	B365_draw	B365_away_wins	BW_home_wins	BW_draw	BW_away_wins	IW_home_wins	IW_draw	IW_away_wins	LB_home_wins	LB_draw	LB_away_wins	PS_home_wins	PS_draw	PS_away_wins	WH_home_wins	WH_draw	WH_away_wins	SJ_home_wins	SJ_draw	SJ_away_wins	VC_home_wins	VC_draw	VC_away_wins	GB_home_wins	GB_draw	GB_away_wins	BS_home_wins	BS_draw	BS_away_wins	B365_ratio_ha	BW_ratio_ha	PS_ratio_ha	VC_ratio_ha	IW_ratio_ha	WH_ratio_ha	GB_ratio_ha	LB_ratio_ha	SJ_ratio_ha	BS_ratio_ha	B365_log_ratio_ha	BW_log_ratio_ha	PS_log_ratio_ha	VC_log_ratio_ha	IW_log_ratio_ha	WH_log_ratio_ha	GB_log_ratio_ha	LB_log_ratio_ha	SJ_log_ratio_ha	BS_log_ratio_ha
24558	24559	Switzerland		Switzerland Super League	2008/2009	1	2008-07-18	10192	9931	1	2	3	-1	-1	Away Wins	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
24559	24560	Switzerland		Switzerland Super League	2008/2009	1	2008-07-19	9930	10179	3	1	4	2	1	Home Wins	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
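
The renaming and recoding rules used above can be illustrated on a toy table with plain pandas. This is only a minimal sketch: the helper `rename_odds_column` and all values in `toy` are assumptions made for illustration and are not part of the project code.

```python
import numpy as np
import pandas as pd


def rename_odds_column(name: str) -> str:
    """Map e.g. 'B365H' -> 'B365_home_wins' (hypothetical helper)."""
    suffixes = {"H": "_home_wins", "D": "_draw", "A": "_away_wins"}
    return name[:-1] + suffixes[name[-1]]


# Toy data: two matches with made-up decimal odds
toy = pd.DataFrame(
    {
        "home_team_goal": [1, 3],
        "away_team_goal": [2, 3],
        "B365H": [2.0, 1.5],
        "B365D": [3.2, 3.5],
        "B365A": [4.0, 6.0],
    }
)

odds_cols = ["B365H", "B365D", "B365A"]
toy = toy.rename(columns={c: rename_odds_column(c) for c in odds_cols})

toy = toy.assign(
    # goal_diff > 0 if the home team wins
    goal_diff=lambda x: x.home_team_goal - x.away_team_goal,
    # Vectorized equivalent of recoding the sign of goal_diff into words
    match_winner=lambda x: np.select(
        [x.goal_diff < 0, x.goal_diff == 0],
        ["Away Wins", "Draw"],
        default="Home Wins",
    ),
    # Log of the home/away odds ratio
    B365_log_ratio_ha=lambda x: np.log(x.B365_home_wins / x.B365_away_wins),
)
print(toy)
```

One property of the log-ratio is that swapping the home and away odds only flips its sign, so the scale is symmetric around zero, whereas the plain ratio is not symmetric around one.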

Code: Create goals_summary

# Goals summary dataset for each league and season
goals_summary = (
    matches.groupby(["league", "season"])
    .goal_sum.agg(["count", "sum"])
    .rename(columns={"count": "n_matches_total", "sum": "n_goals_total"})
    .eval("n_goals_per_match = n_goals_total/n_matches_total")
)

# Rank leagues by average number of goals per match per season
leagues_by_goals = [
    *goals_summary.groupby("league")
    .n_goals_per_match.mean()
    .sort_values(ascending=False)
    .index
]

# Sort leagues by "resultativeness": update goals summary table
goals_summary = (
    goals_summary.reset_index()
    .assign(league=lambda x: x.league.to_category(leagues_by_goals))
    .set_index(["league", "season"])
    .sort_index()
)

# Reorder categories (leagues) in `matches`
matches = matches.to_category("league", leagues_by_goals)

Code: Create matches_betting_odds table

# Dataset for betting odds analysis

# Variable names
cols_to_include_for_odds = [
    "match_date",
    "stage",
    # Goal statistics/match outcomes
    "home_team_goal",
    "away_team_goal",
    "goal_sum",
    "goal_diff",
    "goal_diff_sign",
    "match_winner",
    # Betting odds
    "B365_home_wins",
    "BW_home_wins",
    "IW_home_wins",
    "LB_home_wins",
    "PS_home_wins",
    "WH_home_wins",
    "SJ_home_wins",
    "VC_home_wins",
    "GB_home_wins",
    "BS_home_wins",
    "B365_draw",
    "BW_draw",
    "IW_draw",
    "LB_draw",
    "PS_draw",
    "WH_draw",
    "SJ_draw",
    "VC_draw",
    "GB_draw",
    "BS_draw",
    "B365_away_wins",
    "BW_away_wins",
    "IW_away_wins",
    "LB_away_wins",
    "PS_away_wins",
    "WH_away_wins",
    "SJ_away_wins",
    "VC_away_wins",
    "GB_away_wins",
    "BS_away_wins",
    # Derivative/Calculated variables;
    # "ha" means Home/Away betting odds ratio
    "B365_ratio_ha",
    "BW_ratio_ha",
    "PS_ratio_ha",
    "VC_ratio_ha",
    "IW_ratio_ha",
    "WH_ratio_ha",
    "GB_ratio_ha",
    "LB_ratio_ha",
    "SJ_ratio_ha",
    "BS_ratio_ha",
    "B365_log_ratio_ha",
    "BW_log_ratio_ha",
    "PS_log_ratio_ha",
    "VC_log_ratio_ha",
    "IW_log_ratio_ha",
    "WH_log_ratio_ha",
    "GB_log_ratio_ha",
    "LB_log_ratio_ha",
    "SJ_log_ratio_ha",
    "BS_log_ratio_ha",
]

matches_betting_odds = matches.filter(cols_to_include_for_odds)

From matches, let’s create a dataset matches_long_team with one row per team in each match (i.e., two rows per match):

Code: Create matches_long_team

# Add column `won_or_lost`, which indicates match status for the team:
def team_won_or_lost(df):
    """Return whether the team won, lost, or drew the match."""
    if df.match_winner == "Draw":
        return "draw"
    elif (df.team_type == "home") and (df.match_winner == "Home Wins"):
        return "won"
    elif (df.team_type == "away") and (df.match_winner == "Away Wins"):
        return "won"
    else:
        return "lost"


def negate_for_away_team(df):
    """Negate goal difference for away team.
    Negative goal difference here means that the team lost.
    """
    if df.team_type == "away":
        return -df.goal_diff
    else:
        return df.goal_diff


matches_long_team = (
    matches.pivot_longer(
        column_names=re.compile("^(home|away)_(.+)"),
        names_pattern="^(home|away)_(.+)",
        names_to=("team_type", ".value"),
        sort_by_appearance=True,
    )
    .to_category("team_type")
    .rename(columns={"team_goal": "team_goals"})
    .assign(
        team_outcome=lambda x: x.apply(team_won_or_lost, axis=1),
        team_goal_diff=lambda x: x.apply(negate_for_away_team, axis=1),
        team_goal_diff_sign=lambda x: np.sign(x.team_goal_diff),
    )
    .relocate("team_id", before="B365_home_wins")
    .relocate("team_type", before="B365_home_wins")
    .relocate("team_goals", before="B365_home_wins")
    .relocate("team_goal_diff", before="B365_home_wins")
    .relocate("team_goal_diff_sign", before="B365_home_wins")
    .relocate("team_outcome", before="B365_home_wins")
)

# Check output
print(
    "Expected ratio is 2, got: ", matches_long_team.shape[0] / matches.shape[0]
)
matches_long_team.head(2)

Expected ratio is 2, got:  2.0

**Table 4.7.** Inspection: a few rows of table `matches_long_team` (1).
	match_id	country	region	league	season	stage	match_date	goal_sum	goal_diff	goal_diff_sign	match_winner	team_id	team_type	team_goals	team_goal_diff	team_goal_diff_sign	team_outcome	B365_home_wins	B365_draw	B365_away_wins	BW_home_wins	BW_draw	BW_away_wins	IW_home_wins	IW_draw	IW_away_wins	LB_home_wins	LB_draw	LB_away_wins	PS_home_wins	PS_draw	PS_away_wins	WH_home_wins	WH_draw	WH_away_wins	SJ_home_wins	SJ_draw	SJ_away_wins	VC_home_wins	VC_draw	VC_away_wins	GB_home_wins	GB_draw	GB_away_wins	BS_home_wins	BS_draw	BS_away_wins	B365_ratio_ha	BW_ratio_ha	PS_ratio_ha	VC_ratio_ha	IW_ratio_ha	WH_ratio_ha	GB_ratio_ha	LB_ratio_ha	SJ_ratio_ha	BS_ratio_ha	B365_log_ratio_ha	BW_log_ratio_ha	PS_log_ratio_ha	VC_log_ratio_ha	IW_log_ratio_ha	WH_log_ratio_ha	GB_log_ratio_ha	LB_log_ratio_ha	SJ_log_ratio_ha	BS_log_ratio_ha	player_1	player_2	player_3	player_4	player_5	player_6	player_7	player_8	player_9	player_10	player_11
0	24559	Switzerland		Switzerland Super League	2008/2009	1	2008-07-18	3	-1	-1	Away Wins	10192	home	1	-1	-1	lost	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	24559	Switzerland		Switzerland Super League	2008/2009	1	2008-07-18	3	-1	-1	Away Wins	9931	away	2	1	1	won	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
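
The `pivot_longer` method used above is not a built-in pandas method (it is presumably provided by an extension package). As a rough sketch of the same wide-to-long idea with plain pandas only, the home and away halves of a toy table can be stacked as shown below. The function `to_long` and the toy data are hypothetical and only illustrate the principle.

```python
import pandas as pd

# Toy data: one row per match
wide = pd.DataFrame(
    {
        "match_id": [1, 2],
        "home_team_id": [10, 30],
        "away_team_id": [20, 10],
        "home_team_goal": [1, 0],
        "away_team_goal": [2, 0],
    }
)


def to_long(df: pd.DataFrame) -> pd.DataFrame:
    """Reshape to one row per team in each match (hypothetical helper)."""
    parts = []
    for side in ("home", "away"):
        cols = [c for c in df.columns if c.startswith(f"{side}_")]
        part = (
            df[["match_id", *cols]]
            # Strip the "home_"/"away_" prefix so both halves share names
            .rename(columns=lambda c: c.removeprefix(f"{side}_"))
            .assign(team_type=side)
        )
        parts.append(part)
    return (
        pd.concat(parts)
        # Keep rows of the same match together, home team first
        .sort_values(
            ["match_id", "team_type"], ascending=[True, False], kind="stable"
        )
        .reset_index(drop=True)
    )


long = to_long(wide)
print(long)
```

As in the check above, the long table is expected to have exactly twice as many rows as the wide one.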

To each match, merge the last known information about the team that was available at the time of that particular match.

Note that only teams with records in team_attributes are matched (the merge requires non-null values in team_info_date). For teams without such records, team names are not merged either, even though the names are present in the teams table. If this is an issue, team and team_attributes should be merged to matches_long_team separately.

Code: Merge matches_long_team and teams

matches_long_team = pd.merge_asof(
    left=matches_long_team,
    right=teams.dropna(subset=["team_info_date"]),
    left_on="match_date",
    right_on="team_info_date",
    by="team_id",
).relocate("team_info_date", before="goal_sum")

matches_long_team.tail(2)

**Table 4.8.** Inspection: a few rows of table `matches_long_team` (2).
	match_id	country	region	league	season	stage	match_date	team_info_date	goal_sum	goal_diff	goal_diff_sign	match_winner	team_id	team_type	team_goals	team_goal_diff	team_goal_diff_sign	team_outcome	B365_home_wins	B365_draw	B365_away_wins	BW_home_wins	BW_draw	BW_away_wins	IW_home_wins	IW_draw	IW_away_wins	LB_home_wins	LB_draw	LB_away_wins	PS_home_wins	PS_draw	PS_away_wins	WH_home_wins	WH_draw	WH_away_wins	SJ_home_wins	SJ_draw	SJ_away_wins	VC_home_wins	VC_draw	VC_away_wins	GB_home_wins	GB_draw	GB_away_wins	BS_home_wins	BS_draw	BS_away_wins	B365_ratio_ha	BW_ratio_ha	PS_ratio_ha	VC_ratio_ha	IW_ratio_ha	WH_ratio_ha	GB_ratio_ha	LB_ratio_ha	SJ_ratio_ha	BS_ratio_ha	B365_log_ratio_ha	BW_log_ratio_ha	PS_log_ratio_ha	VC_log_ratio_ha	IW_log_ratio_ha	WH_log_ratio_ha	GB_log_ratio_ha	LB_log_ratio_ha	SJ_log_ratio_ha	BS_log_ratio_ha	player_1	player_2	player_3	player_4	player_5	player_6	player_7	player_8	player_9	player_10	player_11	team_name	team_short_name	buildUpPlayPositioningClass	chanceCreationPositioningClass	defenceDefenderLineClass	buildUpPlaySpeed	buildUpPlayDribbling	buildUpPlayPassing	chanceCreationPassing	chanceCreationCrossing	chanceCreationShooting	defencePressure	defenceAggression	defenceTeamWidth	buildUpPlaySpeedClass	buildUpPlayDribblingClass	buildUpPlayPassingClass	chanceCreationPassingClass	chanceCreationCrossingClass	chanceCreationShootingClass	defencePressureClass	defenceAggressionClass	defenceTeamWidthClass
51956	25949	Switzerland		Switzerland Super League	2015/2016	36	2016-05-25	2015-09-10	4	2	1	Home Wins	10243	home	3	2	1	won	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	7621.00	197757.00	115700.00	113235.00	121080.00	41116.00	632356.00	465399.00	462608.00	198082.00	3517.00	FC Zürich	ZUR	Organised	Organised	Cover	62.00	49.00	46.00	47.00	50.00	54.00	47.00	43.00	56.00	Balanced	Normal	Mixed	Normal	Normal	Normal	Medium	Press	Normal
51957	25949	Switzerland		Switzerland Super League	2015/2016	36	2016-05-25	2015-09-10	4	2	1	Home Wins	9824	away	1	-2	-1	lost	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	154261.00	294256.00	257845.00	41621.00	214344.00	114011.00	56868.00	488297.00	113227.00	531309.00	493418.00	FC Vaduz	VAD	Organised	Organised	Cover	53.00	32.00	56.00	38.00	53.00	46.00	42.00	33.00	58.00	Balanced	Little	Mixed	Normal	Normal	Normal	Medium	Contain	Normal
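
To illustrate the behavior described in the note above, here is a minimal sketch with toy data (all table names ending in `_toy` and all values are made up). It shows that `pd.merge_asof` leaves attributes empty for matches played before the first attribute record and for teams without any records, and that team names can still be attached with a separate ordinary left merge.

```python
import pandas as pd

matches_toy = pd.DataFrame(
    {
        "match_date": pd.to_datetime(
            ["2009-08-01", "2010-08-01", "2010-09-01"]
        ),
        "team_id": [1, 1, 2],
    }
)
# Team 2 has no attribute records at all
attrs_toy = pd.DataFrame(
    {
        "team_info_date": pd.to_datetime(["2010-02-22", "2011-02-22"]),
        "team_id": [1, 1],
        "buildUpPlaySpeed": [60, 55],
    }
)
names_toy = pd.DataFrame(
    {"team_id": [1, 2], "team_name": ["Team A", "Team B"]}
)

# Step 1: the last attribute record on or before each match date
merged = pd.merge_asof(
    left=matches_toy.sort_values("match_date"),
    right=attrs_toy.sort_values("team_info_date"),
    left_on="match_date",
    right_on="team_info_date",
    by="team_id",
)

# Step 2: names are merged separately, so no team loses its name
merged = merged.merge(names_toy, on="team_id", how="left")
print(merged)
```

In this sketch, only the second match gets attribute values, while all three rows keep a team name.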

Let’s calculate goal scoring statistics of each team per season for team analysis.

Code: Create teams_goals_per_team

teams_goals_per_team = (
    matches_long_team.groupby(["team_name", "season", "league"])
    .team_goals.agg(["count", "sum"])
    .rename({"count": "n_matches", "sum": "n_goals"}, axis=1)
    .reset_index()
    # Exclude teams that did not play in that season
    .query("n_matches > 0")
    # Goals per match
    .eval("n_goals_per_match = n_goals/n_matches")
)

teams_goals_per_team.head(n=2)

**Table 4.9.** Inspection: a few rows of table `teams_goals_per_team`.
	team_name	season	league	n_matches	n_goals	n_goals_per_match
0	1. FC Kaiserslautern	2010/2011	Germany 1. Bundesliga	34	48	1.41
1	1. FC Kaiserslautern	2011/2012	Germany 1. Bundesliga	34	24	0.71

Let’s find several best and worst performing teams in all leagues per season.

Code: Create teams_top_bottom_goals

# Select Top 5 and Bottom 5 teams (by **goals per match**)
# in each season (all leagues)
def select_5(data, column: str, best: bool = True):
    """Select best/worst teams

    Args:
        data (pandas.DataFrame)
        column (str): column name to perform computations on.
        best (bool): Should the best (if True) or worst (if False) be found?

    If several teams share the same result as the 5th, then more than 5
    teams are returned.
    """
    if best:
        return data.nlargest(5, column, keep="all")
    else:
        return data.nsmallest(5, column, keep="all")


def select_5_per_season_by_goals(best: bool):
    """Select best/worst teams in each season"""
    return (
        teams_goals_per_team.groupby("season", as_index=False)
        .apply(select_5, "n_goals_per_match", best=best)
        .sort_values(["season", "n_goals_per_match"], ascending=[True, False])
        .reset_index(drop=True)
    )


teams_top_bottom_goals = pd.concat(
    [
        select_5_per_season_by_goals(best=True).assign(which="Top 5"),
        select_5_per_season_by_goals(best=False).assign(which="Bottom 5"),
    ]
).index_start_at(1)

# Preview
pd.concat([teams_top_bottom_goals.head(n=2), teams_top_bottom_goals.tail(n=2)])

**Table 4.10.** Inspection: a few rows of table `teams_top_bottom_goals`.
	team_name	season	league	n_matches	n_goals	n_goals_per_match	which
1	Ajax	2009/2010	Netherlands Eredivisie	10	37	3.70	Top 5
2	Chelsea	2009/2010	England Premier League	11	40	3.64	Top 5
72	Aston Villa	2015/2016	England Premier League	38	27	0.71	Bottom 5
73	Boavista FC	2015/2016	Portugal Liga ZON Sagres	31	21	0.68	Bottom 5
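
The tie handling mentioned in the docstring of `select_5` (more than 5 teams may be returned) can be checked on a small example. This is a sketch with made-up team names and values.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "team_name": list("ABCDEFG"),
        "n_goals_per_match": [3.0, 2.5, 2.5, 2.0, 1.5, 1.5, 1.0],
    }
)

# Teams E and F tie for the 5th place, so keep="all" returns 6 rows
top = toy.nlargest(5, "n_goals_per_match", keep="all")

# The same rule applies at the bottom: F and E tie for the 2nd smallest value
bottom = toy.nsmallest(2, "n_goals_per_match", keep="all")

print(len(top), len(bottom))
```

With the default `keep="first"`, exactly 5 (or 2) rows would be returned and one of the tied teams would be dropped depending on row order.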

Let’s find the top teams by the percentage of matches that they won.

Code: Create teams_wins_per_season

teams_wins_per_season = (
    matches_long_team.groupby(["season", "league", "team_name"])
    .team_outcome.value_counts(normalize=True)
    .apply(lambda x: x * 100)
    .unstack("team_outcome")
    .relocate("lost", before="draw")
    .sort_values("won", ascending=False)
    .rename_axis(columns=None)
    .reset_index()
    .fillna(0)  # NaN = 0%
    .groupby(["season"], as_index=False)
    .apply(select_5, "won", best=True)
    .rename(columns=str.capitalize)
    .rename(columns={"Team_name": "Team"})
    .set_index(["Season", "League", "Team"])
    .sort_values(["Season", "Won"], ascending=[True, False])
)

teams_wins_per_season.head(2)

**Table 4.11.** Inspection: a few rows of table `teams_wins_per_season`. Columns `Lost`, `Draw`, `Won` indicate percentage of games per season with the indicated outcome.
			Lost	Draw	Won
Season	League	Team
2009/2010	Belgium Jupiler League	RSC Anderlecht	0.00	0.00	100.00
2009/2010	Netherlands Eredivisie	Ajax	0.00	0.00	100.00
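
The core of the computation above (percentages of outcomes per team) can be sketched on toy data as follows. The data are made up, and the sketch omits the grouping by season and league for brevity.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "team_name": ["A", "A", "A", "A", "B", "B"],
        "team_outcome": ["won", "won", "draw", "lost", "won", "won"],
    }
)

pct = (
    toy.groupby("team_name")
    # Share of each outcome within a team
    .team_outcome.value_counts(normalize=True)
    .mul(100)
    # One column per outcome
    .unstack("team_outcome")
    # Outcomes that never occurred for a team mean 0%
    .fillna(0)
)
print(pct)
```

Team B illustrates why `fillna(0)` is needed: it has no draws or losses, so those cells would otherwise be missing. It also shows that a team with only a few recorded matches can easily reach 100% of wins.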

From matches_long_team, let’s create dataset matches_long_player with one row per player in each match:

Code: Create matches_long_player

matches_long_player = matches_long_team.pivot_longer(
    column_names=re.compile("player_.+"),
    names_pattern="player_(.+)",
    names_to="player_no",
    values_to="player_id",
    sort_by_appearance=True,
)

print(
    "Expected ratio is 11, got: ",
    matches_long_player.shape[0] / matches_long_team.shape[0],
)
matches_long_player.head(2)

Expected ratio is 11, got:  11.0

**Table 4.12.** Inspection: a few rows of table `matches_long_player` (1).
	match_id	country	region	league	season	stage	match_date	team_info_date	goal_sum	goal_diff	goal_diff_sign	match_winner	team_id	team_type	team_goals	team_goal_diff	team_goal_diff_sign	team_outcome	B365_home_wins	B365_draw	B365_away_wins	BW_home_wins	BW_draw	BW_away_wins	IW_home_wins	IW_draw	IW_away_wins	LB_home_wins	LB_draw	LB_away_wins	PS_home_wins	PS_draw	PS_away_wins	WH_home_wins	WH_draw	WH_away_wins	SJ_home_wins	SJ_draw	SJ_away_wins	VC_home_wins	VC_draw	VC_away_wins	GB_home_wins	GB_draw	GB_away_wins	BS_home_wins	BS_draw	BS_away_wins	B365_ratio_ha	BW_ratio_ha	PS_ratio_ha	VC_ratio_ha	IW_ratio_ha	WH_ratio_ha	GB_ratio_ha	LB_ratio_ha	SJ_ratio_ha	BS_ratio_ha	B365_log_ratio_ha	BW_log_ratio_ha	PS_log_ratio_ha	VC_log_ratio_ha	IW_log_ratio_ha	WH_log_ratio_ha	GB_log_ratio_ha	LB_log_ratio_ha	SJ_log_ratio_ha	BS_log_ratio_ha	team_name	team_short_name	buildUpPlayPositioningClass	chanceCreationPositioningClass	defenceDefenderLineClass	buildUpPlaySpeed	buildUpPlayDribbling	buildUpPlayPassing	chanceCreationPassing	chanceCreationCrossing	chanceCreationShooting	defencePressure	defenceAggression	defenceTeamWidth	buildUpPlaySpeedClass	buildUpPlayDribblingClass	buildUpPlayPassingClass	chanceCreationPassingClass	chanceCreationCrossingClass	chanceCreationShootingClass	defencePressureClass	defenceAggressionClass	defenceTeamWidthClass	player_no	player_id
0	24559	Switzerland		Switzerland Super League	2008/2009	1	2008-07-18	NaT	3	-1	-1	Away Wins	10192	home	1	-1	-1	lost	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1	NaN
1	24559	Switzerland		Switzerland Super League	2008/2009	1	2008-07-18	NaT	3	-1	-1	Away Wins	10192	home	1	-1	-1	lost	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2	NaN

To each match, merge the last known information about each player at the time of the match.

Code: Merge matches_long_player and players

matches_long_player = (
    pd.merge_asof(
        left=(
            matches_long_player.dropna(subset=["player_id"]).astype(
                {"player_id": np.int64}
            )
        ),
        right=players,
        left_on="match_date",
        right_on="player_info_date",
        by="player_id",
    )
    .relocate("player_info_date", before="goal_sum")
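    .assign(
        # Age in years; 365.2425 is the mean length of a Gregorian year in
        # days. This avoids the "Y" timedelta unit, which recent pandas
        # versions may reject as ambiguous.
        player_age=lambda x: (
            (x.match_date - x.birthday).dt.days / 365.2425
        )
    )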
)

matches_long_player.tail(2)

**Table 4.13.** Inspection: a few rows of table `matches_long_player` (2).
	match_id	country	region	league	season	stage	match_date	team_info_date	player_info_date	goal_sum	goal_diff	goal_diff_sign	match_winner	team_id	team_type	team_goals	team_goal_diff	team_goal_diff_sign	team_outcome	B365_home_wins	B365_draw	B365_away_wins	BW_home_wins	BW_draw	BW_away_wins	IW_home_wins	IW_draw	IW_away_wins	LB_home_wins	LB_draw	LB_away_wins	PS_home_wins	PS_draw	PS_away_wins	WH_home_wins	WH_draw	WH_away_wins	SJ_home_wins	SJ_draw	SJ_away_wins	VC_home_wins	VC_draw	VC_away_wins	GB_home_wins	GB_draw	GB_away_wins	BS_home_wins	BS_draw	BS_away_wins	B365_ratio_ha	BW_ratio_ha	PS_ratio_ha	VC_ratio_ha	IW_ratio_ha	WH_ratio_ha	GB_ratio_ha	LB_ratio_ha	SJ_ratio_ha	BS_ratio_ha	B365_log_ratio_ha	BW_log_ratio_ha	PS_log_ratio_ha	VC_log_ratio_ha	IW_log_ratio_ha	WH_log_ratio_ha	GB_log_ratio_ha	LB_log_ratio_ha	SJ_log_ratio_ha	BS_log_ratio_ha	team_name	team_short_name	buildUpPlayPositioningClass	chanceCreationPositioningClass	defenceDefenderLineClass	buildUpPlaySpeed	buildUpPlayDribbling	buildUpPlayPassing	chanceCreationPassing	chanceCreationCrossing	chanceCreationShooting	defencePressure	defenceAggression	defenceTeamWidth	buildUpPlaySpeedClass	buildUpPlayDribblingClass	buildUpPlayPassingClass	chanceCreationPassingClass	chanceCreationCrossingClass	chanceCreationShootingClass	defencePressureClass	defenceAggressionClass	defenceTeamWidthClass	player_no	player_id	player_name	birthday	birth_year	height	weight_kg	bmi	overall_rating	potential	preferred_foot	attacking_work_rate	defensive_work_rate	crossing	finishing	heading_accuracy	short_passing	volleys	dribbling	curve	free_kick_accuracy	long_passing	ball_control	acceleration	sprint_speed	agility	reactions	balance	shot_power	jumping	stamina	strength	long_shots	aggression	interceptions	positioning	vision	penalties	marking	standing_tackle	sliding_tackle	gk_diving	gk_handling	gk_kicking	gk_positioning	gk_reflexes	player_age
542279	25949	Switzerland		Switzerland Super League	2015/2016	36	2016-05-25	2015-09-10	2016-04-21	4	2	1	Home Wins	9824	away	1	-2	-1	lost	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	FC Vaduz	VAD	Organised	Organised	Cover	53.00	32.00	56.00	38.00	53.00	46.00	42.00	33.00	58.00	Balanced	Little	Mixed	Normal	Normal	Normal	Medium	Contain	Normal	10	531309	Robin Kamber	1996-02-15	1996	187.96	82.99	23.49	52.00	65.00	right	high	low	49.00	47.00	49.00	63.00	47.00	50.00	53.00	48.00	66.00	56.00	65.00	60.00	56.00	48.00	55.00	48.00	31.00	45.00	71.00	40.00	43.00	31.00	46.00	60.00	50.00	43.00	45.00	42.00	12.00	11.00	8.00	9.00	8.00	20.27
542280	25949	Switzerland		Switzerland Super League	2015/2016	36	2016-05-25	2015-09-10	2016-03-03	4	2	1	Home Wins	9824	away	1	-2	-1	lost	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	FC Vaduz	VAD	Organised	Organised	Cover	53.00	32.00	56.00	38.00	53.00	46.00	42.00	33.00	58.00	Balanced	Little	Mixed	Normal	Normal	Normal	Medium	Contain	Normal	11	493418	Albion Avdijaj	1994-01-12	1994	190.50	79.82	21.99	54.00	61.00	right	medium	medium	29.00	60.00	72.00	40.00	47.00	47.00	30.00	23.00	23.00	49.00	48.00	53.00	62.00	47.00	49.00	49.00	67.00	60.00	66.00	42.00	50.00	24.00	56.00	48.00	56.00	27.00	20.00	21.00	7.00	6.00	15.00	10.00	8.00	22.37

Code

matches_long_player.shape

(542281, 139)

Let’s aggregate player information (numeric variables) to get one row per team-match combination.

Code: Create table team_player_summary

# Prepare dataset for aggregation
include = [
    "height",
    "weight_kg",
    "bmi",
    "overall_rating",
    "potential",
    "crossing",
    "finishing",
    "heading_accuracy",
    "short_passing",
    "volleys",
    "dribbling",
    "curve",
    "free_kick_accuracy",
    "long_passing",
    "ball_control",
    "acceleration",
    "sprint_speed",
    "agility",
    "reactions",
    "balance",
    "shot_power",
    "jumping",
    "stamina",
    "strength",
    "long_shots",
    "aggression",
    "interceptions",
    "positioning",
    "vision",
    "penalties",
    "marking",
    "standing_tackle",
    "sliding_tackle",
    "gk_diving",
    "gk_handling",
    "gk_kicking",
    "gk_positioning",
    "gk_reflexes",
    "player_age",
]

for_agg = matches_long_player.groupby(["match_id", "team_id"])[include]

# Evaluate which cases include all 11 players
n_player_ok = for_agg.count().min(axis=1).to_frame("players_summarized")
percent_ok = round(n_player_ok.eval("players_summarized==11").mean(), 3) * 100
print(
    f"In {percent_ok:.1f}% cases, summaries include all 11 players. \n"
    "Only these cases will be analyzed next."
)

# Calculate summary statistics for each selected attribute.
# Include only those cases where 11 players are aggregated.
team_player_summary = for_agg.agg(["min", "mean", "std", "max"])
team_player_summary.columns = (
    team_player_summary.columns.to_flat_index().str.join("__")
)
team_player_summary = (
    n_player_ok.join(team_player_summary)
    .query("players_summarized==11")
    .drop(columns="players_summarized")
)
team_player_summary.head(2)

In 82.9% cases, summaries include all 11 players. 
Only these cases will be analyzed next.

**Table 4.14.** Inspection: a few rows of table `team_player_summary`.
		height__min	height__mean	height__std	height__max	weight_kg__min	weight_kg__mean	weight_kg__std	weight_kg__max	bmi__min	bmi__mean	bmi__std	bmi__max	overall_rating__min	overall_rating__mean	overall_rating__std	overall_rating__max	potential__min	potential__mean	potential__std	potential__max	crossing__min	crossing__mean	crossing__std	crossing__max	finishing__min	finishing__mean	finishing__std	finishing__max	heading_accuracy__min	heading_accuracy__mean	heading_accuracy__std	heading_accuracy__max	short_passing__min	short_passing__mean	short_passing__std	short_passing__max	volleys__min	volleys__mean	volleys__std	volleys__max	dribbling__min	dribbling__mean	dribbling__std	dribbling__max	curve__min	curve__mean	curve__std	curve__max	free_kick_accuracy__min	free_kick_accuracy__mean	free_kick_accuracy__std	free_kick_accuracy__max	long_passing__min	long_passing__mean	long_passing__std	long_passing__max	ball_control__min	ball_control__mean	ball_control__std	ball_control__max	acceleration__min	acceleration__mean	acceleration__std	acceleration__max	sprint_speed__min	sprint_speed__mean	sprint_speed__std	sprint_speed__max	agility__min	agility__mean	agility__std	agility__max	reactions__min	reactions__mean	reactions__std	reactions__max	balance__min	balance__mean	balance__std	balance__max	shot_power__min	shot_power__mean	shot_power__std	shot_power__max	jumping__min	jumping__mean	jumping__std	jumping__max	stamina__min	stamina__mean	stamina__std	stamina__max	strength__min	strength__mean	strength__std	strength__max	long_shots__min	long_shots__mean	long_shots__std	long_shots__max	aggression__min	aggression__mean	aggression__std	aggression__max	interceptions__min	interceptions__mean	interceptions__std	interceptions__max	positioning__min	positioning__mean	positioning__std	positioning__max	vision__min	vision__mean	vision__std	vision__max	penalties__min	penalties__mean	penalties__std	penalties__max	marking__min	marking__mean	marking__std	marking__max	standing_tackle__min	standing_tackle__mean	standing_tackle__std	standing_tackle__max	sliding_tackle__min	sliding_tackle__mean	sliding_tackle__std	sliding_tackle__max	gk_diving__min	gk_diving__mean	gk_diving__std	gk_diving__max	gk_handling__min	gk_handling__mean	gk_handling__std	gk_handling__max	gk_kicking__min	gk_kicking__mean	gk_kicking__std	gk_kicking__max	gk_positioning__min	gk_positioning__mean	gk_positioning__std	gk_positioning__max	gk_reflexes__min	gk_reflexes__mean	gk_reflexes__std	gk_reflexes__max	player_age__min	player_age__mean	player_age__std	player_age__max
match_id	team_id
145	8635	167.64	183.34	7.08	193.04	60.77	78.87	9.00	93.88	21.62	23.39	1.32	25.59	57.00	69.45	4.80	75.00	69.00	74.36	2.84	78.00	29.00	57.82	14.86	78.00	23.00	49.27	18.31	71.00	25.00	61.00	16.77	83.00	51.00	65.73	8.91	78.00	9.00	47.82	21.34	69.00	23.00	55.45	18.62	85.00	11.00	48.27	19.80	77.00	23.00	51.18	19.39	79.00	48.00	62.82	10.22	79.00	51.00	64.45	10.52	82.00	48.00	66.00	9.83	78.00	58.00	68.91	6.49	77.00	48.00	64.73	10.53	82.00	57.00	67.64	6.77	82.00	47.00	68.64	13.04	91.00	25.00	62.00	16.96	85.00	61.00	67.82	4.75	77.00	55.00	73.45	9.17	85.00	42.00	68.73	16.60	91.00	23.00	53.27	17.77	74.00	32.00	67.45	16.62	93.00	31.00	62.09	15.47	82.00	13.00	59.64	20.44	83.00	49.00	67.64	10.49	84.00	42.00	62.64	13.46	83.00	24.00	55.36	19.07	74.00	22.00	57.09	20.20	78.00	12.00	56.55	22.62	74.00	1.00	14.64	18.03	67.00	20.00	25.82	13.71	67.00	48.00	62.82	10.22	79.00	20.00	25.64	13.11	65.00	20.00	26.09	14.61	70.00	18.90	25.71	3.56	31.00
146	9987	170.18	181.26	6.84	193.04	60.77	73.92	8.41	89.80	20.34	22.42	1.16	24.10	54.00	64.09	6.11	72.00	62.00	71.27	5.39	83.00	22.00	54.64	14.66	75.00	22.00	48.91	18.53	74.00	22.00	51.09	13.92	75.00	26.00	56.00	12.77	72.00	25.00	53.73	14.28	72.00	22.00	54.18	19.54	77.00	25.00	53.18	13.47	70.00	11.00	48.18	17.97	72.00	42.00	56.00	9.27	67.00	22.00	58.45	17.30	76.00	56.00	68.18	6.55	77.00	48.00	69.27	8.87	79.00	37.00	65.36	10.46	75.00	56.00	64.27	5.82	72.00	51.00	63.73	7.79	77.00	22.00	56.18	17.67	81.00	61.00	67.09	3.91	73.00	43.00	66.82	9.71	83.00	47.00	63.27	12.55	89.00	22.00	51.73	16.66	69.00	44.00	62.36	11.41	82.00	30.00	59.27	11.59	72.00	30.00	57.45	16.59	77.00	25.00	60.82	13.47	74.00	31.00	58.55	15.44	82.00	21.00	45.09	17.70	65.00	21.00	47.36	17.32	74.00	22.00	49.27	20.23	72.00	1.00	13.82	16.52	62.00	20.00	24.91	12.68	63.00	42.00	56.27	9.57	67.00	20.00	24.55	11.48	59.00	20.00	25.00	12.98	64.00	18.78	23.39	3.22	27.43

Let’s prepare datasets for predictive modelling. I followed several principles:

mainly numeric variables are included in the analysis. The exception is the variable team_type, which indicates whether the team plays at home or away.
some variables with many missing values (especially betting odds, which are highly inter-correlated) were excluded in order to keep more complete cases.

Prepare betting odds for team analysis: instead of home_wins and away_wins, which are less suitable for this analysis, the _win (victory) and _loose (loss) betting odds are used here. Ratios are calculated from the new variables accordingly. The names of the ratios end in _ratio_wl (betting odds to win / betting odds to lose).
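As a worked illustration of this re-labelling, the minimal sketch below takes the B365 odds from the Valencia CF vs Getafe CF rows shown in this section and computes the win/lose ratio and its logarithm for each team. The helper name `win_lose_odds` is hypothetical and not part of the project code.

```python
import math

# B365 decimal odds for match 22055 (Valencia CF at home vs Getafe CF)
home_wins, away_wins = 1.57, 5.50


def win_lose_odds(team_type, home_wins, away_wins):
    """Return (win, lose) odds from the point of view of one team.

    Hypothetical helper used only for this illustration.
    """
    if team_type == "home":
        return home_wins, away_wins
    return away_wins, home_wins


for team_type in ("home", "away"):
    win, lose = win_lose_odds(team_type, home_wins, away_wins)
    ratio_wl = win / lose
    log_ratio_wl = math.log(ratio_wl)
    print(f"{team_type}: ratio_wl={ratio_wl:.2f}, log_ratio_wl={log_ratio_wl:.2f}")

# home: ratio_wl=0.29, log_ratio_wl=-1.25
# away: ratio_wl=3.50, log_ratio_wl=1.25
```

The plain ratio is asymmetric (0.29 vs 3.50), while the log-ratio of the two teams in the same match differs only in sign, which is one reason a log-ratio can be a more convenient predictor.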

Code: Create table team_betting_odds

team_betting_odds_pre1 = matches_long_team.set_index(["match_id", "team_id"])

team_betting_odds = (
    pd.concat(
        [
            # Transformations for home team
            team_betting_odds_pre1.query("team_type == 'home'").assign(
                # Team wins
                B365_win=lambda x: x.B365_home_wins,
                BW_win=lambda x: x.BW_home_wins,
                VC_win=lambda x: x.VC_home_wins,
                IW_win=lambda x: x.IW_home_wins,
                WH_win=lambda x: x.WH_home_wins,
                LB_win=lambda x: x.LB_home_wins,
                # Team loses
                B365_loose=lambda x: x.B365_away_wins,
                BW_loose=lambda x: x.BW_away_wins,
                VC_loose=lambda x: x.VC_away_wins,
                IW_loose=lambda x: x.IW_away_wins,
                WH_loose=lambda x: x.WH_away_wins,
                LB_loose=lambda x: x.LB_away_wins,
            ),
            # Transformations for away team
            team_betting_odds_pre1.query("team_type == 'away'").assign(
                # Team wins
                B365_win=lambda x: x.B365_away_wins,
                BW_win=lambda x: x.BW_away_wins,
                VC_win=lambda x: x.VC_away_wins,
                IW_win=lambda x: x.IW_away_wins,
                WH_win=lambda x: x.WH_away_wins,
                LB_win=lambda x: x.LB_away_wins,
                # Team loses
                B365_loose=lambda x: x.B365_home_wins,
                BW_loose=lambda x: x.BW_home_wins,
                VC_loose=lambda x: x.VC_home_wins,
                IW_loose=lambda x: x.IW_home_wins,
                WH_loose=lambda x: x.WH_home_wins,
                LB_loose=lambda x: x.LB_home_wins,
            ),
        ]
    )
    .assign(
        # Ratio wl: "team wins / team loses"
        B365_ratio_wl=lambda x: x.B365_win / x.B365_loose,
        BW_ratio_wl=lambda x: x.BW_win / x.BW_loose,
        VC_ratio_wl=lambda x: x.VC_win / x.VC_loose,
        IW_ratio_wl=lambda x: x.IW_win / x.IW_loose,
        WH_ratio_wl=lambda x: x.WH_win / x.WH_loose,
        LB_ratio_wl=lambda x: x.LB_win / x.LB_loose,
        # Log-ratios of wl
        B365_log_ratio_wl=lambda x: np.log(x.B365_win / x.B365_loose),
        BW_log_ratio_wl=lambda x: np.log(x.BW_win / x.BW_loose),
        VC_log_ratio_wl=lambda x: np.log(x.VC_win / x.VC_loose),
        IW_log_ratio_wl=lambda x: np.log(x.IW_win / x.IW_loose),
        WH_log_ratio_wl=lambda x: np.log(x.WH_win / x.WH_loose),
        LB_log_ratio_wl=lambda x: np.log(x.LB_win / x.LB_loose),
    )
    # Keep just betting odds of interest
    .filter(regex="(?<!PS|GB|SJ|BS)_(win$|draw$|loose$|(log_)?ratio_wl$)")
    .dropna()
)

del [team_betting_odds_pre1]

# Inspect
team_betting_odds.tail(2)

**Table 4.15.** Inspection: a few rows of table `team_betting_odds`.
		B365_draw	BW_draw	IW_draw	LB_draw	WH_draw	VC_draw	B365_win	BW_win	VC_win	IW_win	WH_win	LB_win	B365_loose	BW_loose	VC_loose	IW_loose	WH_loose	LB_loose	B365_ratio_wl	BW_ratio_wl	VC_ratio_wl	IW_ratio_wl	WH_ratio_wl	LB_ratio_wl	B365_log_ratio_wl	BW_log_ratio_wl	VC_log_ratio_wl	IW_log_ratio_wl	WH_log_ratio_wl	LB_log_ratio_wl
match_id	team_id
24491	8305	3.80	3.80	4.00	3.80	3.80	4.00	1.70	1.70	1.73	1.60	1.60	1.70	5.00	4.60	5.00	4.80	5.00	4.50	0.34	0.37	0.35	0.33	0.32	0.38	-1.08	-1.00	-1.06	-1.10	-1.14	-0.97
4702	8678	4.20	4.00	3.70	3.80	4.00	4.10	5.25	5.00	5.20	4.50	4.75	5.25	1.67	1.67	1.67	1.70	1.67	1.65	3.14	2.99	3.11	2.65	2.84	3.18	1.15	1.10	1.14	0.97	1.05	1.16

Code

team_betting_odds.shape

(44864, 30)
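The `filter(regex=...)` step in the code above keeps only columns whose names end in one of the wanted suffixes and whose bookmaker prefix is not PS, GB, SJ or BS. The sketch below applies the same pattern with the standard `re` module to a short, hand-picked list of column names; the list itself is an assumption chosen for illustration.

```python
import re

# Same pattern as in the filter step: a negative lookbehind rejects
# the bookmaker prefixes PS, GB, SJ and BS directly before the underscore.
pattern = re.compile(r"(?<!PS|GB|SJ|BS)_(win$|draw$|loose$|(log_)?ratio_wl$)")

# Illustrative column names (a hand-picked subset, not the full table)
columns = [
    "B365_win",
    "B365_draw",
    "B365_loose",
    "B365_ratio_wl",
    "B365_log_ratio_wl",
    "B365_home_wins",  # rejected: ends in "_wins", not "_win"
    "B365_ratio_ha",   # rejected: "_ha" suffix is not listed
    "PS_draw",         # rejected by the lookbehind
    "GB_draw",         # rejected by the lookbehind
    "team_goals",      # rejected: no matching suffix
]

kept = [name for name in columns if pattern.search(name)]
print(kept)
```

Only the first five names are kept, which matches the kind of columns visible in Table 4.15.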

Code: Create table matches_long_team1

cols = matches_long_team.columns
# Remove player info and column with many NA values
condition_1 = ("player_", "buildUpPlayDribbling")
# Remove some categorical variables and betting odds info
condition_2 = ("Class", "_draw", "_wins", "_ha")
col_index = ~(cols.str.startswith(condition_1) | cols.str.endswith(condition_2))
# Remove rows with no team info
row_index = ~matches_long_team.team_info_date.isna()

matches_long_team1 = matches_long_team.loc[row_index, col_index]

# Join team info, player info and betting odds info
matches_long_team1 = (
    matches_long_team1.set_index(["match_id", "team_id"])
    .join(team_player_summary)
    .join(team_betting_odds)
)

# Remove intermediate results
del [cols, condition_1, condition_2, col_index, row_index]

# Inspect
matches_long_team1.head(2)

**Table 4.16.** Inspection: a few rows of table `matches_long_team1`.
		country	league	season	stage	match_date	team_info_date	goal_sum	goal_diff	goal_diff_sign	match_winner	team_type	team_goals	team_goal_diff	team_goal_diff_sign	team_outcome	team_name	team_short_name	buildUpPlaySpeed	buildUpPlayPassing	chanceCreationPassing	chanceCreationCrossing	chanceCreationShooting	defencePressure	defenceAggression	defenceTeamWidth	height__min	height__mean	height__std	height__max	weight_kg__min	weight_kg__mean	weight_kg__std	weight_kg__max	bmi__min	bmi__mean	bmi__std	bmi__max	overall_rating__min	overall_rating__mean	overall_rating__std	overall_rating__max	potential__min	potential__mean	potential__std	potential__max	crossing__min	crossing__mean	crossing__std	crossing__max	finishing__min	finishing__mean	finishing__std	finishing__max	heading_accuracy__min	heading_accuracy__mean	heading_accuracy__std	heading_accuracy__max	short_passing__min	short_passing__mean	short_passing__std	short_passing__max	volleys__min	volleys__mean	volleys__std	volleys__max	dribbling__min	dribbling__mean	dribbling__std	dribbling__max	curve__min	curve__mean	curve__std	curve__max	free_kick_accuracy__min	free_kick_accuracy__mean	free_kick_accuracy__std	free_kick_accuracy__max	long_passing__min	long_passing__mean	long_passing__std	long_passing__max	ball_control__min	ball_control__mean	ball_control__std	ball_control__max	acceleration__min	acceleration__mean	acceleration__std	acceleration__max	sprint_speed__min	sprint_speed__mean	sprint_speed__std	sprint_speed__max	agility__min	agility__mean	agility__std	agility__max	reactions__min	reactions__mean	reactions__std	reactions__max	balance__min	balance__mean	balance__std	balance__max	shot_power__min	shot_power__mean	shot_power__std	shot_power__max	jumping__min	jumping__mean	jumping__std	jumping__max	stamina__min	stamina__mean	stamina__std	stamina__max	strength__min	strength__mean	strength__std	strength__max	long_shots__min	long_shots__mean	long_shots__std	long_shots__max	aggression__min	aggression__mean	aggression__std	aggression__max	
interceptions__min	interceptions__mean	interceptions__std	interceptions__max	positioning__min	positioning__mean	positioning__std	positioning__max	vision__min	vision__mean	vision__std	vision__max	penalties__min	penalties__mean	penalties__std	penalties__max	marking__min	marking__mean	marking__std	marking__max	standing_tackle__min	standing_tackle__mean	standing_tackle__std	standing_tackle__max	sliding_tackle__min	sliding_tackle__mean	sliding_tackle__std	sliding_tackle__max	gk_diving__min	gk_diving__mean	gk_diving__std	gk_diving__max	gk_handling__min	gk_handling__mean	gk_handling__std	gk_handling__max	gk_kicking__min	gk_kicking__mean	gk_kicking__std	gk_kicking__max	gk_positioning__min	gk_positioning__mean	gk_positioning__std	gk_positioning__max	gk_reflexes__min	gk_reflexes__mean	gk_reflexes__std	gk_reflexes__max	player_age__min	player_age__mean	player_age__std	player_age__max	B365_draw	BW_draw	IW_draw	LB_draw	WH_draw	VC_draw	B365_win	BW_win	VC_win	IW_win	WH_win	LB_win	B365_loose	BW_loose	VC_loose	IW_loose	WH_loose	LB_loose	B365_ratio_wl	BW_ratio_wl	VC_ratio_wl	IW_ratio_wl	WH_ratio_wl	LB_ratio_wl	B365_log_ratio_wl	BW_log_ratio_wl	VC_log_ratio_wl	IW_log_ratio_wl	WH_log_ratio_wl	LB_log_ratio_wl
match_id	team_id
22055	10267	Spain	Spain LIGA BBVA	2009/2010	23	2010-02-22	2010-02-22	3	1	1	Home Wins	home	2	1	1	won	Valencia CF	VAL	30.00	30.00	55.00	60.00	70.00	55.00	60.00	60.00	170.18	178.95	4.87	185.42	67.12	74.99	4.62	82.09	22.37	23.40	0.65	24.67	77.00	80.55	3.64	88.00	80.00	84.82	3.66	91.00	21.00	62.91	24.16	89.00	21.00	55.09	26.52	94.00	21.00	65.00	17.37	82.00	21.00	73.00	18.74	90.00	9.00	57.18	23.11	87.00	21.00	68.18	22.03	89.00	21.00	64.64	20.46	88.00	11.00	58.55	22.02	86.00	55.00	67.18	8.39	86.00	32.00	75.18	16.22	91.00	35.00	73.18	15.08	88.00	35.00	72.73	14.96	87.00	45.00	68.82	14.90	87.00	59.00	77.00	8.93	93.00	56.00	74.45	8.15	82.00	21.00	69.64	20.05	91.00	58.00	69.09	9.20	85.00	46.00	74.36	10.71	85.00	58.00	73.18	9.52	85.00	21.00	63.27	24.01	88.00	52.00	72.36	10.68	89.00	60.00	76.09	8.93	89.00	11.00	74.82	22.26	93.00	58.00	75.64	10.76	90.00	66.00	74.27	6.10	86.00	21.00	55.36	29.53	86.00	21.00	54.82	28.98	85.00	8.00	56.18	27.17	81.00	7.00	16.00	20.37	77.00	20.00	26.82	15.05	72.00	55.00	67.18	8.39	86.00	20.00	28.27	19.86	88.00	20.00	27.09	15.95	75.00	21.65	28.60	4.54	38.48	4.00	3.80	3.70	3.60	3.50	3.75	1.57	1.50	1.53	1.55	1.53	1.50	5.50	6.50	6.50	6.00	6.00	5.50	0.29	0.23	0.24	0.26	0.26	0.27	-1.25	-1.47	-1.45	-1.35	-1.37	-1.30
22055	8305	Spain	Spain LIGA BBVA	2009/2010	23	2010-02-22	2010-02-22	3	1	1	Home Wins	away	1	-1	-1	lost	Getafe CF	GET	30.00	35.00	35.00	50.00	70.00	40.00	30.00	50.00	175.26	182.42	3.74	187.96	68.03	75.82	3.93	81.18	21.83	22.77	0.61	23.87	72.00	75.45	2.38	80.00	75.00	80.18	3.71	86.00	24.00	62.36	17.15	89.00	21.00	51.64	22.73	80.00	24.00	64.18	15.67	82.00	24.00	68.18	16.82	83.00	11.00	55.91	20.40	74.00	24.00	63.09	17.31	83.00	7.00	59.00	20.98	86.00	12.00	59.45	19.65	80.00	49.00	68.09	12.36	85.00	26.00	67.45	14.20	75.00	68.00	74.18	4.87	82.00	63.00	72.82	6.24	83.00	57.00	66.91	6.44	80.00	59.00	71.73	5.39	77.00	62.00	71.55	5.92	81.00	24.00	66.18	20.86	92.00	49.00	69.55	8.72	81.00	48.00	72.55	10.57	85.00	66.00	74.73	6.17	87.00	24.00	60.09	17.19	83.00	27.00	66.82	16.46	82.00	58.00	70.64	5.68	76.00	11.00	68.27	19.60	83.00	55.00	66.91	11.09	84.00	55.00	65.09	5.03	73.00	23.00	57.00	24.42	87.00	24.00	59.73	23.25	84.00	12.00	56.82	26.10	84.00	1.00	13.18	20.49	74.00	20.00	27.55	16.45	77.00	49.00	68.45	12.44	85.00	20.00	27.36	15.85	75.00	20.00	27.91	17.65	81.00	20.85	26.45	3.64	33.95	4.00	3.80	3.70	3.60	3.50	3.75	5.50	6.50	6.50	6.00	6.00	5.50	1.57	1.50	1.53	1.55	1.53	1.50	3.50	4.33	4.25	3.87	3.92	3.67	1.25	1.47	1.45	1.35	1.37	1.30

Code

matches_long_team1.shape

(39840, 212)

In the following code blocks:

Separate datasets were created for team-related and match-related analysis.
The data were split into training and test sets:
- The training set included all seasons except the last one.
- The test set included data from the last season only (2015/2016).
Incomplete cases were removed.
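The season-based split can be sketched on a toy table as follows. The column names mirror those used in this project, but the rows are made up for illustration, and boolean masks are used here instead of `query()`; this is a simplified sketch, not the project code.

```python
import pandas as pd

# Toy data: one row has a missing betting odds value
toy = pd.DataFrame(
    {
        "season": ["2013/2014", "2014/2015", "2014/2015", "2015/2016", "2015/2016"],
        "team_goals": [2, 0, 1, 3, 1],
        "B365_win": [1.57, 5.50, None, 2.10, 3.40],
    }
)

last_season = "2015/2016"

# Keep complete cases only, then split by season
complete = toy.dropna()
is_last = complete["season"] == last_season

toy_train = complete[~is_last].drop(columns="season")
toy_test = complete[is_last].drop(columns="season")

print(toy_train.shape, toy_test.shape)
```

Splitting by season rather than at random keeps all test matches later in time than the training matches, so the model is evaluated on data from a period it has not seen.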

Code: Tables and variables for team-related predictive modeling

# For predictive modelling (teams)
# Target:
team_target = "team_goals"

# Predictors by type:
df = matches_long_team1
team_vars_team = df.loc[:, "buildUpPlaySpeed":"defenceTeamWidth"].columns
team_vars_player = df.loc[:, "height__min":"player_age__max"].columns
team_vars_betting_odds = df.loc[:, "B365_win":"LB_log_ratio_wl"].columns

team_predictors = [
    "team_type",
    *team_vars_team,
    *team_vars_player,
    *team_vars_betting_odds,
]

# Whole dataset
team_model = df.filter(
    ["season", team_target, *team_predictors], axis=1
).dropna()

# Training/Test sets
team_train = team_model.query("season != '2015/2016'").drop(columns="season")
team_test = team_model.query("season == '2015/2016'").drop(columns="season")

# Remove intermediate results
del [df, team_model]

# Inspect
team_train.head(2)

**Table 4.17.** Inspection: a few rows of table `team_train`.
		team_goals	team_type	buildUpPlaySpeed	buildUpPlayPassing	chanceCreationPassing	chanceCreationCrossing	chanceCreationShooting	defencePressure	defenceAggression	defenceTeamWidth	height__min	height__mean	height__std	height__max	weight_kg__min	weight_kg__mean	weight_kg__std	weight_kg__max	bmi__min	bmi__mean	bmi__std	bmi__max	overall_rating__min	overall_rating__mean	overall_rating__std	overall_rating__max	potential__min	potential__mean	potential__std	potential__max	crossing__min	crossing__mean	crossing__std	crossing__max	finishing__min	finishing__mean	finishing__std	finishing__max	heading_accuracy__min	heading_accuracy__mean	heading_accuracy__std	heading_accuracy__max	short_passing__min	short_passing__mean	short_passing__std	short_passing__max	volleys__min	volleys__mean	volleys__std	volleys__max	dribbling__min	dribbling__mean	dribbling__std	dribbling__max	curve__min	curve__mean	curve__std	curve__max	free_kick_accuracy__min	free_kick_accuracy__mean	free_kick_accuracy__std	free_kick_accuracy__max	long_passing__min	long_passing__mean	long_passing__std	long_passing__max	ball_control__min	ball_control__mean	ball_control__std	ball_control__max	acceleration__min	acceleration__mean	acceleration__std	acceleration__max	sprint_speed__min	sprint_speed__mean	sprint_speed__std	sprint_speed__max	agility__min	agility__mean	agility__std	agility__max	reactions__min	reactions__mean	reactions__std	reactions__max	balance__min	balance__mean	balance__std	balance__max	shot_power__min	shot_power__mean	shot_power__std	shot_power__max	jumping__min	jumping__mean	jumping__std	jumping__max	stamina__min	stamina__mean	stamina__std	stamina__max	strength__min	strength__mean	strength__std	strength__max	long_shots__min	long_shots__mean	long_shots__std	long_shots__max	aggression__min	aggression__mean	aggression__std	aggression__max	interceptions__min	interceptions__mean	interceptions__std	interceptions__max	positioning__min	positioning__mean	positioning__std	positioning__max	vision__min	vision__mean	
vision__std	vision__max	penalties__min	penalties__mean	penalties__std	penalties__max	marking__min	marking__mean	marking__std	marking__max	standing_tackle__min	standing_tackle__mean	standing_tackle__std	standing_tackle__max	sliding_tackle__min	sliding_tackle__mean	sliding_tackle__std	sliding_tackle__max	gk_diving__min	gk_diving__mean	gk_diving__std	gk_diving__max	gk_handling__min	gk_handling__mean	gk_handling__std	gk_handling__max	gk_kicking__min	gk_kicking__mean	gk_kicking__std	gk_kicking__max	gk_positioning__min	gk_positioning__mean	gk_positioning__std	gk_positioning__max	gk_reflexes__min	gk_reflexes__mean	gk_reflexes__std	gk_reflexes__max	player_age__min	player_age__mean	player_age__std	player_age__max	B365_win	BW_win	VC_win	IW_win	WH_win	LB_win	B365_loose	BW_loose	VC_loose	IW_loose	WH_loose	LB_loose	B365_ratio_wl	BW_ratio_wl	VC_ratio_wl	IW_ratio_wl	WH_ratio_wl	LB_ratio_wl	B365_log_ratio_wl	BW_log_ratio_wl	VC_log_ratio_wl	IW_log_ratio_wl	WH_log_ratio_wl	LB_log_ratio_wl
match_id	team_id
22055	10267	2	home	30.00	30.00	55.00	60.00	70.00	55.00	60.00	60.00	170.18	178.95	4.87	185.42	67.12	74.99	4.62	82.09	22.37	23.40	0.65	24.67	77.00	80.55	3.64	88.00	80.00	84.82	3.66	91.00	21.00	62.91	24.16	89.00	21.00	55.09	26.52	94.00	21.00	65.00	17.37	82.00	21.00	73.00	18.74	90.00	9.00	57.18	23.11	87.00	21.00	68.18	22.03	89.00	21.00	64.64	20.46	88.00	11.00	58.55	22.02	86.00	55.00	67.18	8.39	86.00	32.00	75.18	16.22	91.00	35.00	73.18	15.08	88.00	35.00	72.73	14.96	87.00	45.00	68.82	14.90	87.00	59.00	77.00	8.93	93.00	56.00	74.45	8.15	82.00	21.00	69.64	20.05	91.00	58.00	69.09	9.20	85.00	46.00	74.36	10.71	85.00	58.00	73.18	9.52	85.00	21.00	63.27	24.01	88.00	52.00	72.36	10.68	89.00	60.00	76.09	8.93	89.00	11.00	74.82	22.26	93.00	58.00	75.64	10.76	90.00	66.00	74.27	6.10	86.00	21.00	55.36	29.53	86.00	21.00	54.82	28.98	85.00	8.00	56.18	27.17	81.00	7.00	16.00	20.37	77.00	20.00	26.82	15.05	72.00	55.00	67.18	8.39	86.00	20.00	28.27	19.86	88.00	20.00	27.09	15.95	75.00	21.65	28.60	4.54	38.48	1.57	1.50	1.53	1.55	1.53	1.50	5.50	6.50	6.50	6.00	6.00	5.50	0.29	0.23	0.24	0.26	0.26	0.27	-1.25	-1.47	-1.45	-1.35	-1.37	-1.30
22055	8305	1	away	30.00	35.00	35.00	50.00	70.00	40.00	30.00	50.00	175.26	182.42	3.74	187.96	68.03	75.82	3.93	81.18	21.83	22.77	0.61	23.87	72.00	75.45	2.38	80.00	75.00	80.18	3.71	86.00	24.00	62.36	17.15	89.00	21.00	51.64	22.73	80.00	24.00	64.18	15.67	82.00	24.00	68.18	16.82	83.00	11.00	55.91	20.40	74.00	24.00	63.09	17.31	83.00	7.00	59.00	20.98	86.00	12.00	59.45	19.65	80.00	49.00	68.09	12.36	85.00	26.00	67.45	14.20	75.00	68.00	74.18	4.87	82.00	63.00	72.82	6.24	83.00	57.00	66.91	6.44	80.00	59.00	71.73	5.39	77.00	62.00	71.55	5.92	81.00	24.00	66.18	20.86	92.00	49.00	69.55	8.72	81.00	48.00	72.55	10.57	85.00	66.00	74.73	6.17	87.00	24.00	60.09	17.19	83.00	27.00	66.82	16.46	82.00	58.00	70.64	5.68	76.00	11.00	68.27	19.60	83.00	55.00	66.91	11.09	84.00	55.00	65.09	5.03	73.00	23.00	57.00	24.42	87.00	24.00	59.73	23.25	84.00	12.00	56.82	26.10	84.00	1.00	13.18	20.49	74.00	20.00	27.55	16.45	77.00	49.00	68.45	12.44	85.00	20.00	27.36	15.85	75.00	20.00	27.91	17.65	81.00	20.85	26.45	3.64	33.95	5.50	6.50	6.50	6.00	6.00	5.50	1.57	1.50	1.53	1.55	1.53	1.50	3.50	4.33	4.25	3.87	3.92	3.67	1.25	1.47	1.45	1.35	1.37	1.30

Code

team_train.shape

(27200, 190)

In a similar way, data for match outcome analysis will be prepared.

Code: Create table matches_long_team2

cols = matches_long_team.columns
condition_1 = ("PS_", "GB_", "SJ_", "BS_", "player_", "buildUpPlayDribbling")

col_index = ~(cols.str.startswith(condition_1) | cols.str.endswith("Class"))
row_index = ~matches_long_team.team_info_date.isna()

matches_long_team2 = matches_long_team.loc[row_index, col_index]
matches_long_team2 = (
    matches_long_team2.set_index(["match_id", "team_id"])
    .join(team_player_summary)
    .relocate("team_short_name", before="goal_sum")
    .relocate("team_name", before="team_short_name")
)

# Remove intermediate results
del [cols, condition_1, col_index, row_index]

# Inspect
matches_long_team2.head(2)

**Table 4.18.** Inspection: a few rows of table `matches_long_team2`.
		country	league	season	stage	match_date	team_info_date	team_name	team_short_name	goal_sum	goal_diff	goal_diff_sign	match_winner	team_type	team_goals	team_goal_diff	team_goal_diff_sign	team_outcome	B365_home_wins	B365_draw	B365_away_wins	BW_home_wins	BW_draw	BW_away_wins	IW_home_wins	IW_draw	IW_away_wins	LB_home_wins	LB_draw	LB_away_wins	WH_home_wins	WH_draw	WH_away_wins	VC_home_wins	VC_draw	VC_away_wins	B365_ratio_ha	BW_ratio_ha	VC_ratio_ha	IW_ratio_ha	WH_ratio_ha	LB_ratio_ha	B365_log_ratio_ha	BW_log_ratio_ha	VC_log_ratio_ha	IW_log_ratio_ha	WH_log_ratio_ha	LB_log_ratio_ha	buildUpPlaySpeed	buildUpPlayPassing	chanceCreationPassing	chanceCreationCrossing	chanceCreationShooting	defencePressure	defenceAggression	defenceTeamWidth	height__min	height__mean	height__std	height__max	weight_kg__min	weight_kg__mean	weight_kg__std	weight_kg__max	bmi__min	bmi__mean	bmi__std	bmi__max	overall_rating__min	overall_rating__mean	overall_rating__std	overall_rating__max	potential__min	potential__mean	potential__std	potential__max	crossing__min	crossing__mean	crossing__std	crossing__max	finishing__min	finishing__mean	finishing__std	finishing__max	heading_accuracy__min	heading_accuracy__mean	heading_accuracy__std	heading_accuracy__max	short_passing__min	short_passing__mean	short_passing__std	short_passing__max	volleys__min	volleys__mean	volleys__std	volleys__max	dribbling__min	dribbling__mean	dribbling__std	dribbling__max	curve__min	curve__mean	curve__std	curve__max	free_kick_accuracy__min	free_kick_accuracy__mean	free_kick_accuracy__std	free_kick_accuracy__max	long_passing__min	long_passing__mean	long_passing__std	long_passing__max	ball_control__min	ball_control__mean	ball_control__std	ball_control__max	acceleration__min	acceleration__mean	acceleration__std	acceleration__max	sprint_speed__min	sprint_speed__mean	sprint_speed__std	sprint_speed__max	agility__min	agility__mean	agility__std	agility__max	reactions__min	reactions__mean	reactions__std	reactions__max	balance__min	balance__mean	
balance__std	balance__max	shot_power__min	shot_power__mean	shot_power__std	shot_power__max	jumping__min	jumping__mean	jumping__std	jumping__max	stamina__min	stamina__mean	stamina__std	stamina__max	strength__min	strength__mean	strength__std	strength__max	long_shots__min	long_shots__mean	long_shots__std	long_shots__max	aggression__min	aggression__mean	aggression__std	aggression__max	interceptions__min	interceptions__mean	interceptions__std	interceptions__max	positioning__min	positioning__mean	positioning__std	positioning__max	vision__min	vision__mean	vision__std	vision__max	penalties__min	penalties__mean	penalties__std	penalties__max	marking__min	marking__mean	marking__std	marking__max	standing_tackle__min	standing_tackle__mean	standing_tackle__std	standing_tackle__max	sliding_tackle__min	sliding_tackle__mean	sliding_tackle__std	sliding_tackle__max	gk_diving__min	gk_diving__mean	gk_diving__std	gk_diving__max	gk_handling__min	gk_handling__mean	gk_handling__std	gk_handling__max	gk_kicking__min	gk_kicking__mean	gk_kicking__std	gk_kicking__max	gk_positioning__min	gk_positioning__mean	gk_positioning__std	gk_positioning__max	gk_reflexes__min	gk_reflexes__mean	gk_reflexes__std	gk_reflexes__max	player_age__min	player_age__mean	player_age__std	player_age__max
match_id	team_id
22055	10267	Spain	Spain LIGA BBVA	2009/2010	23	2010-02-22	2010-02-22	Valencia CF	VAL	3	1	1	Home Wins	home	2	1	1	won	1.57	4.00	5.50	1.50	3.80	6.50	1.55	3.70	6.00	1.50	3.60	5.50	1.53	3.50	6.00	1.53	3.75	6.50	0.29	0.23	0.24	0.26	0.26	0.27	-1.25	-1.47	-1.45	-1.35	-1.37	-1.30	30.00	30.00	55.00	60.00	70.00	55.00	60.00	60.00	170.18	178.95	4.87	185.42	67.12	74.99	4.62	82.09	22.37	23.40	0.65	24.67	77.00	80.55	3.64	88.00	80.00	84.82	3.66	91.00	21.00	62.91	24.16	89.00	21.00	55.09	26.52	94.00	21.00	65.00	17.37	82.00	21.00	73.00	18.74	90.00	9.00	57.18	23.11	87.00	21.00	68.18	22.03	89.00	21.00	64.64	20.46	88.00	11.00	58.55	22.02	86.00	55.00	67.18	8.39	86.00	32.00	75.18	16.22	91.00	35.00	73.18	15.08	88.00	35.00	72.73	14.96	87.00	45.00	68.82	14.90	87.00	59.00	77.00	8.93	93.00	56.00	74.45	8.15	82.00	21.00	69.64	20.05	91.00	58.00	69.09	9.20	85.00	46.00	74.36	10.71	85.00	58.00	73.18	9.52	85.00	21.00	63.27	24.01	88.00	52.00	72.36	10.68	89.00	60.00	76.09	8.93	89.00	11.00	74.82	22.26	93.00	58.00	75.64	10.76	90.00	66.00	74.27	6.10	86.00	21.00	55.36	29.53	86.00	21.00	54.82	28.98	85.00	8.00	56.18	27.17	81.00	7.00	16.00	20.37	77.00	20.00	26.82	15.05	72.00	55.00	67.18	8.39	86.00	20.00	28.27	19.86	88.00	20.00	27.09	15.95	75.00	21.65	28.60	4.54	38.48
22055	8305	Spain	Spain LIGA BBVA	2009/2010	23	2010-02-22	2010-02-22	Getafe CF	GET	3	1	1	Home Wins	away	1	-1	-1	lost	1.57	4.00	5.50	1.50	3.80	6.50	1.55	3.70	6.00	1.50	3.60	5.50	1.53	3.50	6.00	1.53	3.75	6.50	0.29	0.23	0.24	0.26	0.26	0.27	-1.25	-1.47	-1.45	-1.35	-1.37	-1.30	30.00	35.00	35.00	50.00	70.00	40.00	30.00	50.00	175.26	182.42	3.74	187.96	68.03	75.82	3.93	81.18	21.83	22.77	0.61	23.87	72.00	75.45	2.38	80.00	75.00	80.18	3.71	86.00	24.00	62.36	17.15	89.00	21.00	51.64	22.73	80.00	24.00	64.18	15.67	82.00	24.00	68.18	16.82	83.00	11.00	55.91	20.40	74.00	24.00	63.09	17.31	83.00	7.00	59.00	20.98	86.00	12.00	59.45	19.65	80.00	49.00	68.09	12.36	85.00	26.00	67.45	14.20	75.00	68.00	74.18	4.87	82.00	63.00	72.82	6.24	83.00	57.00	66.91	6.44	80.00	59.00	71.73	5.39	77.00	62.00	71.55	5.92	81.00	24.00	66.18	20.86	92.00	49.00	69.55	8.72	81.00	48.00	72.55	10.57	85.00	66.00	74.73	6.17	87.00	24.00	60.09	17.19	83.00	27.00	66.82	16.46	82.00	58.00	70.64	5.68	76.00	11.00	68.27	19.60	83.00	55.00	66.91	11.09	84.00	55.00	65.09	5.03	73.00	23.00	57.00	24.42	87.00	24.00	59.73	23.25	84.00	12.00	56.82	26.10	84.00	1.00	13.18	20.49	74.00	20.00	27.55	16.45	77.00	49.00	68.45	12.44	85.00	20.00	27.36	15.85	75.00	20.00	27.91	17.65	81.00	20.85	26.45	3.64	33.95

Code

matches_long_team2.shape

(39840, 212)
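For match-level modelling, each match needs a single row with the home and away team features side by side. The sketch below illustrates such a long-to-wide reshape with plain pandas `DataFrame.pivot` rather than the `pivot_wider` method used in this project. The values for match 22055 come from Table 4.18; the rows for match 22056 and the label "Draw" are made up for illustration.

```python
import pandas as pd

# Long format: two rows per match (one per team)
long_df = pd.DataFrame(
    {
        "match_id": [22055, 22055, 22056, 22056],
        "team_type": ["home", "away", "home", "away"],
        "match_winner": ["Home Wins", "Home Wins", "Draw", "Draw"],
        "buildUpPlaySpeed": [30.0, 30.0, 45.0, 60.0],
        "defencePressure": [55.0, 40.0, 50.0, 35.0],
    }
)

# Wide format: one row per match, team-level variables spread by team type
wide_df = long_df.pivot(
    index=["match_id", "match_winner"],
    columns="team_type",
    values=["buildUpPlaySpeed", "defencePressure"],
)

# Flatten the two-level column index into names such as "buildUpPlaySpeed_home"
wide_df.columns = [f"{var}_{team_type}" for var, team_type in wide_df.columns]
wide_df = wide_df.reset_index().set_index("match_id")

print(wide_df)
```

Match-level variables (here `match_winner`) go into the index of the reshape so that they are kept once per match, while team-level variables receive the `_away` and `_home` suffixes.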

Code: Tables and variables for match-related predictive modeling

# For predictive modelling (match)
# Target:
match_target = "match_winner"

# Variable groups for transformation
vars_predictors = matches_long_team2.loc[
    :, "B365_home_wins":"player_age__max"
].columns
vars_betting_odds = matches_long_team2.loc[
    :, "B365_home_wins":"LB_log_ratio_ha"
].columns

# Whole dataset
match_model = (
    matches_long_team2.filter(
        ["season", "team_type", match_target, *vars_predictors], axis=1
    )
    .reset_index()
    .drop(columns="team_id")
    .pivot_wider(
        index=["match_id", match_target, "season", *vars_betting_odds],
        names_from="team_type",
    )
    .set_index("match_id")
    .dropna()
)

# Predictors by type
match_vars_team = match_model.loc[
    :, "buildUpPlaySpeed_away":"defenceTeamWidth_home"
].columns
match_vars_player = match_model.loc[
    :, "height__min_away":"player_age__max_home"
].columns
match_vars_betting_odds = match_model.loc[
    :, "B365_home_wins":"LB_log_ratio_ha"
].columns

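# Note: "team_type" is not listed here because it was used as `names_from`
# in the pivot above and is no longer a column of `match_model`
match_predictors = [
    *match_vars_team,
    *match_vars_player,
    *match_vars_betting_odds,
]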

# Training/Test sets
match_train = match_model.query("season != '2015/2016'").drop(columns="season")
match_test = match_model.query("season == '2015/2016'").drop(columns="season")

# Remove intermediate results
del [match_model, vars_predictors, vars_betting_odds]

# Inspect
match_train.head(2)

**Table 4.19.** Inspection: a few rows of table `match_train`.
	match_winner	B365_home_wins	B365_draw	B365_away_wins	BW_home_wins	BW_draw	BW_away_wins	IW_home_wins	IW_draw	IW_away_wins	LB_home_wins	LB_draw	LB_away_wins	WH_home_wins	WH_draw	WH_away_wins	VC_home_wins	VC_draw	VC_away_wins	B365_ratio_ha	BW_ratio_ha	VC_ratio_ha	IW_ratio_ha	WH_ratio_ha	LB_ratio_ha	B365_log_ratio_ha	BW_log_ratio_ha	VC_log_ratio_ha	IW_log_ratio_ha	WH_log_ratio_ha	LB_log_ratio_ha	buildUpPlaySpeed_away	buildUpPlaySpeed_home	buildUpPlayPassing_away	buildUpPlayPassing_home	chanceCreationPassing_away	chanceCreationPassing_home	chanceCreationCrossing_away	chanceCreationCrossing_home	chanceCreationShooting_away	chanceCreationShooting_home	defencePressure_away	defencePressure_home	defenceAggression_away	defenceAggression_home	defenceTeamWidth_away	defenceTeamWidth_home	height__min_away	height__min_home	height__mean_away	height__mean_home	height__std_away	height__std_home	height__max_away	height__max_home	weight_kg__min_away	weight_kg__min_home	weight_kg__mean_away	weight_kg__mean_home	weight_kg__std_away	weight_kg__std_home	weight_kg__max_away	weight_kg__max_home	bmi__min_away	bmi__min_home	bmi__mean_away	bmi__mean_home	bmi__std_away	bmi__std_home	bmi__max_away	bmi__max_home	overall_rating__min_away	overall_rating__min_home	overall_rating__mean_away	overall_rating__mean_home	overall_rating__std_away	overall_rating__std_home	overall_rating__max_away	overall_rating__max_home	potential__min_away	potential__min_home	potential__mean_away	potential__mean_home	potential__std_away	potential__std_home	potential__max_away	potential__max_home	crossing__min_away	crossing__min_home	crossing__mean_away	crossing__mean_home	crossing__std_away	crossing__std_home	crossing__max_away	crossing__max_home	finishing__min_away	finishing__min_home	finishing__mean_away	finishing__mean_home	finishing__std_away	finishing__std_home	finishing__max_away	finishing__max_home	heading_accuracy__min_away	heading_accuracy__min_home	heading_accuracy__mean_away	heading_accuracy__mean_home	
heading_accuracy__std_away	heading_accuracy__std_home	heading_accuracy__max_away	heading_accuracy__max_home	short_passing__min_away	short_passing__min_home	short_passing__mean_away	short_passing__mean_home	short_passing__std_away	short_passing__std_home	short_passing__max_away	short_passing__max_home	volleys__min_away	volleys__min_home	volleys__mean_away	volleys__mean_home	volleys__std_away	volleys__std_home	volleys__max_away	volleys__max_home	dribbling__min_away	dribbling__min_home	dribbling__mean_away	dribbling__mean_home	dribbling__std_away	dribbling__std_home	dribbling__max_away	dribbling__max_home	curve__min_away	curve__min_home	curve__mean_away	curve__mean_home	curve__std_away	curve__std_home	curve__max_away	curve__max_home	free_kick_accuracy__min_away	free_kick_accuracy__min_home	free_kick_accuracy__mean_away	free_kick_accuracy__mean_home	free_kick_accuracy__std_away	free_kick_accuracy__std_home	free_kick_accuracy__max_away	...	shot_power__mean_away	shot_power__mean_home	shot_power__std_away	shot_power__std_home	shot_power__max_away	shot_power__max_home	jumping__min_away	jumping__min_home	jumping__mean_away	jumping__mean_home	jumping__std_away	jumping__std_home	jumping__max_away	jumping__max_home	stamina__min_away	stamina__min_home	stamina__mean_away	stamina__mean_home	stamina__std_away	stamina__std_home	stamina__max_away	stamina__max_home	strength__min_away	strength__min_home	strength__mean_away	strength__mean_home	strength__std_away	strength__std_home	strength__max_away	strength__max_home	long_shots__min_away	long_shots__min_home	long_shots__mean_away	long_shots__mean_home	long_shots__std_away	long_shots__std_home	long_shots__max_away	long_shots__max_home	aggression__min_away	aggression__min_home	aggression__mean_away	aggression__mean_home	aggression__std_away	aggression__std_home	aggression__max_away	aggression__max_home	interceptions__min_away	interceptions__min_home	interceptions__mean_away	interceptions__mean_home	interceptions__std_away	
interceptions__std_home	interceptions__max_away	interceptions__max_home	positioning__min_away	positioning__min_home	positioning__mean_away	positioning__mean_home	positioning__std_away	positioning__std_home	positioning__max_away	positioning__max_home	vision__min_away	vision__min_home	vision__mean_away	vision__mean_home	vision__std_away	vision__std_home	vision__max_away	vision__max_home	penalties__min_away	penalties__min_home	penalties__mean_away	penalties__mean_home	penalties__std_away	penalties__std_home	penalties__max_away	penalties__max_home	marking__min_away	marking__min_home	marking__mean_away	marking__mean_home	marking__std_away	marking__std_home	marking__max_away	marking__max_home	standing_tackle__min_away	standing_tackle__min_home	standing_tackle__mean_away	standing_tackle__mean_home	standing_tackle__std_away	standing_tackle__std_home	standing_tackle__max_away	standing_tackle__max_home	sliding_tackle__min_away	sliding_tackle__min_home	sliding_tackle__mean_away	sliding_tackle__mean_home	sliding_tackle__std_away	sliding_tackle__std_home	sliding_tackle__max_away	sliding_tackle__max_home	gk_diving__min_away	gk_diving__min_home	gk_diving__mean_away	gk_diving__mean_home	gk_diving__std_away	gk_diving__std_home	gk_diving__max_away	gk_diving__max_home	gk_handling__min_away	gk_handling__min_home	gk_handling__mean_away	gk_handling__mean_home	gk_handling__std_away	gk_handling__std_home	gk_handling__max_away	gk_handling__max_home	gk_kicking__min_away	gk_kicking__min_home	gk_kicking__mean_away	gk_kicking__mean_home	gk_kicking__std_away	gk_kicking__std_home	gk_kicking__max_away	gk_kicking__max_home	gk_positioning__min_away	gk_positioning__min_home	gk_positioning__mean_away	gk_positioning__mean_home	gk_positioning__std_away	gk_positioning__std_home	gk_positioning__max_away	gk_positioning__max_home	gk_reflexes__min_away	gk_reflexes__min_home	gk_reflexes__mean_away	gk_reflexes__mean_home	gk_reflexes__std_away	gk_reflexes__std_home	gk_reflexes__max_away	gk_reflexes__max_home	
player_age__min_away	player_age__min_home	player_age__mean_away	player_age__mean_home	player_age__std_away	player_age__std_home	player_age__max_away	player_age__max_home
match_id
449	Home Wins	2.50	3.25	2.80	2.40	3.30	2.60	2.40	3.10	2.50	2.62	3.25	2.30	2.62	3.20	2.50	2.50	3.20	2.62	0.89	0.92	0.95	0.96	1.05	1.14	-0.11	-0.08	-0.05	-0.04	0.05	0.13	45.00	45.00	45.00	35.00	50.00	70.00	35.00	45.00	60.00	55.00	70.00	65.00	65.00	60.00	70.00	70.00	175.26	177.80	184.03	184.03	5.94	5.13	198.12	193.04	68.03	68.93	77.14	78.71	6.60	6.72	87.98	88.89	20.34	20.31	22.75	23.24	1.21	1.72	24.40	26.58	57.00	63.00	65.45	65.09	4.06	2.02	70.00	70.00	59.00	62.00	72.09	67.73	4.93	3.41	77.00	76.00	24.00	20.00	54.36	53.27	12.92	14.00	69.00	69.00	22.00	22.00	50.91	47.09	18.28	16.53	73.00	73.00	36.00	27.00	57.27	59.36	10.69	13.76	75.00	76.00	38.00	38.00	62.09	61.91	10.77	9.80	74.00	73.00	18.00	22.00	52.91	49.00	18.40	18.08	72.00	76.00	27.00	23.00	56.82	49.91	15.99	16.14	74.00	72.00	18.00	27.00	54.18	47.27	20.13	14.55	77.00	69.00	21.00	30.00	50.82	52.27	17.69	12.71	71.00	...	63.00	64.55	12.67	6.67	85.00	72.00	55.00	55.00	65.45	62.55	4.37	4.57	70.00	71.00	52.00	43.00	65.45	70.09	5.72	11.45	74.00	85.00	52.00	53.00	64.73	66.18	8.88	10.68	79.00	81.00	21.00	21.00	54.00	51.45	17.56	17.11	74.00	73.00	23.00	37.00	54.82	62.00	15.44	13.11	71.00	78.00	48.00	52.00	62.64	65.73	6.85	9.84	73.00	80.00	37.00	53.00	62.36	65.18	11.16	7.99	82.00	78.00	37.00	52.00	64.82	66.82	11.16	8.67	77.00	83.00	39.00	53.00	61.18	65.09	12.66	8.97	83.00	83.00	21.00	20.00	42.45	51.45	19.43	18.91	65.00	71.00	22.00	20.00	47.18	55.45	19.61	16.23	76.00	69.00	14.00	19.00	48.36	52.64	20.52	18.46	72.00	70.00	1.00	5.00	10.73	15.36	14.96	16.19	55.00	63.00	20.00	20.00	24.27	25.73	9.60	12.55	53.00	63.00	48.00	51.00	60.36	59.18	7.37	7.00	69.00	71.00	20.00	20.00	24.55	25.55	10.50	11.95	56.00	61.00	20.00	20.00	25.18	26.00	12.60	13.44	63.00	66.00	17.80	25.07	23.69	29.20	3.71	3.16	29.02	33.64
451	Draw	2.15	3.30	3.40	2.15	3.25	3.05	2.20	3.10	2.80	2.10	3.20	3.00	2.20	3.20	3.10	2.05	3.20	3.25	0.63	0.70	0.63	0.79	0.71	0.70	-0.46	-0.35	-0.46	-0.24	-0.34	-0.36	50.00	65.00	60.00	60.00	50.00	50.00	50.00	40.00	50.00	50.00	60.00	60.00	60.00	70.00	65.00	60.00	175.26	175.26	183.11	181.73	4.88	5.83	190.50	193.04	68.03	68.93	76.60	78.95	5.47	5.48	84.81	88.89	20.88	22.44	22.83	23.89	0.95	0.76	24.01	24.96	62.00	57.00	65.00	63.73	1.61	2.90	67.00	67.00	64.00	63.00	69.64	68.27	4.01	2.53	78.00	72.00	21.00	25.00	52.36	53.00	15.35	13.65	70.00	71.00	26.00	25.00	52.91	46.82	15.67	14.74	67.00	67.00	8.00	25.00	53.09	53.36	18.21	11.88	73.00	67.00	21.00	25.00	56.73	58.73	13.24	12.02	70.00	69.00	14.00	10.00	50.09	45.73	18.01	17.59	66.00	67.00	13.00	25.00	53.64	54.27	18.92	13.87	76.00	68.00	12.00	16.00	50.91	51.36	18.17	16.92	72.00	72.00	13.00	12.00	50.64	49.64	16.24	17.51	69.00	...	63.27	61.00	9.96	13.18	74.00	72.00	60.00	53.00	66.45	67.00	5.61	9.10	79.00	83.00	56.00	40.00	70.18	66.73	5.72	9.31	75.00	74.00	42.00	53.00	62.91	65.55	12.43	8.50	78.00	79.00	21.00	25.00	54.45	52.64	15.86	15.41	69.00	69.00	27.00	48.00	56.18	63.73	17.39	7.25	81.00	73.00	45.00	40.00	58.27	60.27	9.03	9.12	73.00	72.00	21.00	33.00	56.09	58.82	13.41	9.37	69.00	66.00	45.00	50.00	61.55	62.09	8.47	6.77	75.00	71.00	47.00	48.00	60.45	61.00	7.49	6.24	68.00	69.00	21.00	20.00	43.73	44.82	17.88	16.86	64.00	65.00	21.00	25.00	50.55	47.18	17.27	14.68	68.00	64.00	10.00	26.00	47.09	51.55	20.79	15.10	67.00	66.00	1.00	2.00	13.27	12.18	19.57	17.74	70.00	65.00	20.00	20.00	26.18	24.64	12.96	11.52	65.00	59.00	40.00	42.00	57.00	57.00	8.57	7.39	67.00	66.00	20.00	20.00	26.00	24.64	12.36	11.52	63.00	59.00	20.00	20.00	26.73	25.55	14.76	14.51	71.00	69.00	19.24	21.66	24.71	25.14	3.00	2.26	28.89	29.60

2 rows × 359 columns

Code

match_train.shape

(12634, 359)

Let’s clean up: remove unnecessary (intermediate) datasets.

Code

del [
    teams,
    teams_goals_per_team,
    team_player_summary,
    team_betting_odds,
    matches_long_team,
    matches_long_team1,
    matches_long_team2,
    matches_long_player,
]

5 Analysis

This is the main part, where the most important data-based insights are created.

At the top of each subsection, there will be one or several questions provided.
Next, the summary and the main findings of that subsection will follow.
Next, the details (plots, tables, etc.) will be provided.

Each subsection may contain ad-hoc data preprocessing and analysis code.

5.1 Included Countries and Leagues

Which leagues are in which countries?

There are 10 countries (Scotland and England are parts of the United Kingdom, UK) and 11 leagues in the database: 1 league per country, except for the UK, which has 2. See details in Figure 5.1 and Table 5.1.

Code

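# Rough bounding box of Europe given as (longitude, latitude) corners
europe_boundaries = Polygon([(-25, 35), (40, 35), (40, 75), (-25, 75)])
# NOTE: `gpd.datasets` was deprecated in GeoPandas 0.14 and removed in 1.0,
# so this line requires an older GeoPandas version.
map_world = gpd.read_file(gpd.datasets.get_path("naturalearth_lowres"))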
map_europe = (
    map_world.query("continent=='Europe'")
    .pipe(gpd.clip, europe_boundaries)
    .assign(included_countries=lambda x: x.name.isin(leagues.country))
)
cmap = LinearSegmentedColormap.from_list("", ["#DDD", "limegreen"])
ax = map_europe.plot(
    edgecolor="black", column="included_countries", cmap=cmap, figsize=(5, 5)
)
ax.set_axis_off();

**Fig. 5.1.** Map of European countries: the countries whose leagues are included in this analysis are shown in green.

Code

(
    leagues.drop(columns="league_id")
    .rename(columns=str.capitalize)
    .index_start_at(1)
    .style
)

**Table 5.1.** Football leagues in each country.
	Country	Region	League
1	Belgium		Belgium Jupiler League
2	United Kingdom	England	England Premier League
3	France		France Ligue 1
4	Germany		Germany 1. Bundesliga
5	Italy		Italy Serie A
6	Netherlands		Netherlands Eredivisie
7	Poland		Poland Ekstraklasa
8	Portugal		Portugal Liga ZON Sagres
9	United Kingdom	Scotland	Scotland Premier League
10	Spain		Spain LIGA BBVA
11	Switzerland		Switzerland Super League

5.2 Comparing Leagues and Seasons

Which leagues score the most and the fewest goals?

Are there any goal-scoring patterns between seasons?

Main points of this section:

Leagues differ in number of matches per season:
- Fewest games are played in Switzerland (180 matches per season);
- Most games are played in Italy, France, England, and Spain (380 matches per season).
As some matches are missing from the dataset and leagues differ in size, it is more appropriate to compare leagues by goals per match than by total goals scored.
Leagues differ by resultativeness (average number of goals per match):
- The highest-scoring league is in the Netherlands (3.08 goals per match) and it does not differ significantly from the leagues in Switzerland and Germany.
- The lowest-scoring league is in Poland (2.42 goals per match) and it does not differ significantly from the leagues in France, Portugal, Italy, and Scotland.
By comparing seasons, no significant differences were found.

Find the details in the following subsections.

5.2.1 Both (Leagues and Seasons)

First, each combination of league and season was analyzed (Table 5.2). The result revealed that, e.g., in Belgium Jupiler League 2013/2014, some games are clearly missing: the Wikipedia article indicates that 299 matches were played, while Table 5.2 shows only 12. Looking at the Wikipedia pages of some other seasons and leagues, it is clear that in some cases all games are included in the dataset, but in other cases some games are missing. So it is not correct to compare total matches and total goals per league. The average number of goals per match is a more appropriate measure.
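To illustrate why the per-match average is more robust to missing matches than the totals, here is a minimal sketch that uses three Belgium Jupiler League rows copied from Table 5.2. The helper name `goals_per_match` is made up for this example, and only the standard library is used (the report itself works with pandas).

```python
# Rows copied from Table 5.2 (Belgium Jupiler League): (season, matches, goals).
belgium = [
    ("2012/2013", 240, 703),
    ("2013/2014", 12, 30),  # most matches of this season are missing
    ("2014/2015", 240, 668),
]


def goals_per_match(n_goals, n_matches):
    """Average number of goals per match, rounded as in Table 5.2."""
    return round(n_goals / n_matches, 2)


per_match = {season: goals_per_match(goals, n) for season, n, goals in belgium}
totals = {season: goals for season, _, goals in belgium}

for season in per_match:
    print(
        f"{season}: total goals = {totals[season]:>3}, "
        f"goals per match = {per_match[season]:.2f}"
    )
```

The total for 2013/2014 is far lower than in the neighboring seasons, while the per-match value stays in a similar range. It is still based on only 12 matches, so it is less precise.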

Code

(
    goals_summary.style.format(
        {"n_goals_per_match": "{:.2f}", "n_goals_total": "{0:,.0f}"}
    ).bar(cmap="RdYlGn", height=80, width=50)
)

**Table 5.2.** Goal statistics for each league and season.
		n_matches_total	n_goals_total	n_goals_per_match
league	season
Netherlands Eredivisie	2008/2009	306	870	2.84
	2009/2010	306	892	2.92
	2010/2011	306	987	3.23
	2011/2012	306	997	3.26
	2012/2013	306	964	3.15
	2013/2014	306	978	3.20
	2014/2015	306	942	3.08
	2015/2016	306	912	2.98
Switzerland Super League	2008/2009	180	540	3.00
	2009/2010	180	599	3.33
	2010/2011	180	537	2.98
	2011/2012	162	425	2.62
	2012/2013	180	462	2.57
	2013/2014	180	520	2.89
	2014/2015	180	517	2.87
	2015/2016	180	566	3.14
Germany 1. Bundesliga	2008/2009	306	894	2.92
	2009/2010	306	866	2.83
	2010/2011	306	894	2.92
	2011/2012	306	875	2.86
	2012/2013	306	898	2.93
	2013/2014	306	967	3.16
	2014/2015	306	843	2.75
	2015/2016	306	866	2.83
Spain LIGA BBVA	2008/2009	380	1,101	2.90
	2009/2010	380	1,031	2.71
	2010/2011	380	1,042	2.74
	2011/2012	380	1,050	2.76
	2012/2013	380	1,091	2.87
	2013/2014	380	1,045	2.75
	2014/2015	380	1,009	2.66
	2015/2016	380	1,043	2.74
Belgium Jupiler League	2008/2009	306	855	2.79
	2009/2010	210	565	2.69
	2010/2011	240	635	2.65
	2011/2012	240	691	2.88
	2012/2013	240	703	2.93
	2013/2014	12	30	2.50
	2014/2015	240	668	2.78
	2015/2016	240	694	2.89
England Premier League	2008/2009	380	942	2.48
	2009/2010	380	1,053	2.77
	2010/2011	380	1,063	2.80
	2011/2012	380	1,066	2.81
	2012/2013	380	1,063	2.80
	2013/2014	380	1,052	2.77
	2014/2015	380	975	2.57
	2015/2016	380	1,026	2.70
Scotland Premier League	2008/2009	228	548	2.40
	2009/2010	228	585	2.57
	2010/2011	228	584	2.56
	2011/2012	228	601	2.64
	2012/2013	228	623	2.73
	2013/2014	228	626	2.75
	2014/2015	228	587	2.57
	2015/2016	228	650	2.85
Italy Serie A	2008/2009	380	988	2.60
	2009/2010	380	992	2.61
	2010/2011	380	955	2.51
	2011/2012	358	925	2.58
	2012/2013	380	1,003	2.64
	2013/2014	380	1,035	2.72
	2014/2015	379	1,018	2.69
	2015/2016	380	979	2.58
Portugal Liga ZON Sagres	2008/2009	240	552	2.30
	2009/2010	240	601	2.50
	2010/2011	240	584	2.43
	2011/2012	240	634	2.64
	2012/2013	240	667	2.78
	2013/2014	240	569	2.37
	2014/2015	306	763	2.49
	2015/2016	306	831	2.72
France Ligue 1	2008/2009	380	858	2.26
	2009/2010	380	916	2.41
	2010/2011	380	890	2.34
	2011/2012	380	956	2.52
	2012/2013	380	967	2.54
	2013/2014	380	933	2.46
	2014/2015	380	947	2.49
	2015/2016	380	960	2.53
Poland Ekstraklasa	2008/2009	240	524	2.18
	2009/2010	240	532	2.22
	2010/2011	240	578	2.41
	2011/2012	240	527	2.20
	2012/2013	240	598	2.49
	2013/2014	240	634	2.64
	2014/2015	240	628	2.62
	2015/2016	240	635	2.65

5.2.2 Leagues

This subsection concentrates more on leagues. It compares leagues by size, i.e., number of games per season (Figure 5.2) and by resultativeness, i.e., average number of goals per match (Figure 5.3). Numeric summaries are displayed in Table 5.3.

Code

ax = (
    goals_summary.reset_index()
    .assign(tmp=lambda x: x.groupby("league").n_matches_total.transform("mean"))
    .sort_values("tmp", ascending=True)
    .plot.scatter(
        x="league", y="n_matches_total", c=green, alpha=0.4, edgecolor="black"
    )
)
ax.tick_params(axis="x", rotation=90)
ax.set_xlabel("League")
ax.set_ylabel("Number of matches\nper season")
ax.set_ylim([0, 400]);

**Fig. 5.2.** Number of matches per season in each league. Each point represents one season, so darker (overlapping) points indicate that more seasons had this number of matches in that league. Paler points indicate that there is some variation in the number of games: possibly some missing data or natural changes in the leagues’ rules.

Code

ax = goals_summary.reset_index().plot.scatter(
    x="league", y="n_goals_per_match", c=blue, alpha=0.5, edgecolor="darkblue"
)

ax.tick_params(axis="x", rotation=90)
ax.set_xlabel("League")
ax.set_ylabel("Number of goals per match")
ax.set_ylim([1, 3.5]);

**Fig. 5.3.** Average performance (number of goals per match) in each league. Each point represents the mean of one season.

Code

res_goals_by_league = an.AnalyzeNumericGroups(
    goals_summary.reset_index(), y="n_goals_per_match", by="league"
).fit()

res_goals_by_league.display()

Omnibus (Kruskal-Wallis) test results

	Source	ddof1	H	p-unc
Kruskal	league	10	58.83	p < 0.001
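The omnibus test above comes from the project's own `an.AnalyzeNumericGroups` helper, whose internals are not shown here. As a rough illustration of what a Kruskal-Wallis test computes, the sketch below applies the textbook formula (with tie correction) to three leagues copied from Table 5.2. Because it uses a subset of leagues, it is not expected to reproduce H = 58.83. The function name `kruskal_wallis_h` is hypothetical; in practice `scipy.stats.kruskal` would be the usual choice.

```python
import math
from collections import Counter

# Goals per match in 8 seasons, copied from Table 5.2 for three leagues.
groups = {
    "Netherlands Eredivisie": [2.84, 2.92, 3.23, 3.26, 3.15, 3.20, 3.08, 2.98],
    "England Premier League": [2.48, 2.77, 2.80, 2.81, 2.80, 2.77, 2.57, 2.70],
    "Poland Ekstraklasa": [2.18, 2.22, 2.41, 2.20, 2.49, 2.64, 2.62, 2.65],
}


def kruskal_wallis_h(samples):
    """Return the tie-corrected Kruskal-Wallis H statistic."""
    value_counts = Counter(value for sample in samples for value in sample)
    n_total = sum(value_counts.values())
    # Average rank of each distinct value (ties share the mean of their ranks).
    rank_of = {}
    position = 0
    for value, count in sorted(value_counts.items()):
        rank_of[value] = position + (count + 1) / 2
        position += count
    rank_sums = [sum(rank_of[v] for v in sample) for sample in samples]
    h = 12 / (n_total * (n_total + 1)) * sum(
        r**2 / len(s) for r, s in zip(rank_sums, samples)
    ) - 3 * (n_total + 1)
    ties = sum(c**3 - c for c in value_counts.values())
    return h / (1 - ties / (n_total**3 - n_total))


h_stat = kruskal_wallis_h(list(groups.values()))
# With 3 groups, H is compared to a chi-squared distribution with 2 degrees
# of freedom, whose upper tail has the closed form exp(-H / 2).
p_value = math.exp(-h_stat / 2)
print(f"H = {h_stat:.2f}, p = {p_value:.5f}")
```

The closed-form p-value only holds for 2 degrees of freedom (3 groups). For the full comparison of 11 leagues, a chi-squared distribution with 10 degrees of freedom is needed.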

Post-hoc (Conover-Iman) test results as CLD and Confidence intervals (CI)

	league	cld	spaced_cld	mean	ci_lower	ci_upper
0	Netherlands Eredivisie	a	a____	3.08	2.95	3.21
1	Switzerland Super League	ab	ab___	2.93	2.72	3.14
2	Germany 1. Bundesliga	ab	ab___	2.90	2.80	3.00
3	Spain LIGA BBVA	bc	_bc__	2.77	2.70	2.83
4	Belgium Jupiler League	bc	_bc__	2.76	2.64	2.89
5	England Premier League	bcd	_bcd_	2.71	2.61	2.81
6	Scotland Premier League	cde	__cde	2.63	2.52	2.75
7	Italy Serie A	cde	__cde	2.62	2.56	2.67
8	Portugal Liga ZON Sagres	de	___de	2.53	2.39	2.67
9	France Ligue 1	e	____e	2.44	2.36	2.53
10	Poland Ekstraklasa	e	____e	2.42	2.25	2.60
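A compact letter display (CLD) is read as follows: groups that share at least one letter do not differ significantly, and groups with no letter in common do. The sketch below encodes that rule for the codes copied from the table above. The function name `not_significantly_different` is made up for this example.

```python
# Compact letter display (CLD) codes copied from the post-hoc table above.
cld = {
    "Netherlands Eredivisie": "a",
    "Switzerland Super League": "ab",
    "Germany 1. Bundesliga": "ab",
    "Spain LIGA BBVA": "bc",
    "Belgium Jupiler League": "bc",
    "England Premier League": "bcd",
    "Scotland Premier League": "cde",
    "Italy Serie A": "cde",
    "Portugal Liga ZON Sagres": "de",
    "France Ligue 1": "e",
    "Poland Ekstraklasa": "e",
}


def not_significantly_different(league, cld_codes):
    """List leagues that share at least one CLD letter with `league`."""
    letters = set(cld_codes[league])
    return [
        other
        for other, code in cld_codes.items()
        if other != league and letters & set(code)
    ]


print(not_significantly_different("Netherlands Eredivisie", cld))
print(not_significantly_different("Poland Ekstraklasa", cld))
```

For the Netherlands this returns the leagues of Switzerland and Germany, and for Poland those of Scotland, Italy, Portugal, and France, which agrees with the main points of this section.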

Descriptive statistics of group (league) means

	count	min	max	range	mean	median	std	mad	skew
mean	11	2.42	3.08	0.66	2.71	2.71	0.21	0.18	0.28

Code

(
    goals_summary.groupby("league")
    .agg(["mean", "std"])
    .style.format(precision=1)
    .format("{:.2f}", subset=["n_goals_per_match"])
    .highlight_max(color="#FFFF77", subset="n_goals_per_match")
    .highlight_min(color="#FFBBBB", subset="n_goals_per_match")
)

**Table 5.3.** Statistics of performance in each league and season: summaries for each league. Yellow cells indicate maximum and pale-red ones indicate minimum values in column.
	n_matches_total		n_goals_total		n_goals_per_match
	mean	std	mean	std	mean	std
league
Netherlands Eredivisie	306.0	0.0	942.8	46.9	3.08	0.15
Switzerland Super League	177.8	6.4	520.8	55.3	2.93	0.25
Germany 1. Bundesliga	306.0	0.0	887.9	37.0	2.90	0.12
Spain LIGA BBVA	380.0	0.0	1051.5	30.3	2.77	0.08
Belgium Jupiler League	216.0	86.7	605.1	246.3	2.76	0.15
England Premier League	380.0	0.0	1030.0	46.7	2.71	0.12
Scotland Premier League	228.0	0.0	600.5	31.8	2.63	0.14
Italy Serie A	377.1	7.7	986.9	34.8	2.62	0.07
Portugal Liga ZON Sagres	256.5	30.6	650.1	99.3	2.53	0.17
France Ligue 1	380.0	0.0	928.4	38.2	2.44	0.10
Poland Ekstraklasa	240.0	0.0	582.0	49.0	2.42	0.20

5.2.3 Seasons

This subsection concentrates more on seasons. It compares seasons by size, i.e., number of games per league (Figure 5.4) and by resultativeness, i.e., average number of goals per match (Figure 5.5). Numerical summaries are displayed in Table 5.4.

Code

ax = goals_summary.reset_index().plot.scatter(
    x="season", y="n_matches_total", alpha=0.4, c=green, edgecolor="black"
)
ax.tick_params(axis="x", rotation=90)
ax.set_xlabel("Season")
ax.set_ylabel("Number of matches\nper league")
ax.set_ylim([0, 400]);

**Fig. 5.4.** Number of matches per league in each season. Each point represents one league, so darker (overlapping) points indicate that more leagues had this number of matches in that season.

Code

ax = goals_summary.reset_index().plot.scatter(
    x="season", y="n_goals_per_match", c=blue, alpha=0.5, edgecolor="darkblue"
)
ax.tick_params(axis="x", rotation=90)
ax.set_xlabel("Season")
ax.set_ylabel("Number of goals\nper match")
ax.set_ylim([1, 3.5]);

**Fig. 5.5.** Average performance (number of goals per match) in each season. Each point represents the mean of one league.

Code

res_goals_by_season = an.AnalyzeNumericGroups(
    goals_summary.reset_index(), y="n_goals_per_match", by="season"
).fit()

res_goals_by_season.display()

Omnibus (Kruskal-Wallis) test results

	Source	ddof1	H	p-unc
Kruskal	season	7	3.49	p = 0.836

Post-hoc (Conover-Iman) test results as CLD and Confidence intervals (CI)

	season	cld	spaced_cld	mean	ci_lower	ci_upper
0	2008/2009	a	a	2.61	2.41	2.81
1	2009/2010	a	a	2.69	2.49	2.88
2	2010/2011	a	a	2.69	2.50	2.87
3	2011/2012	a	a	2.71	2.53	2.88
4	2012/2013	a	a	2.77	2.63	2.90
5	2013/2014	a	a	2.75	2.57	2.92
6	2014/2015	a	a	2.69	2.57	2.81
7	2015/2016	a	a	2.78	2.66	2.90

Descriptive statistics of group (season) means

	count	min	max	range	mean	median	std	mad	skew
mean	8	2.61	2.78	0.18	2.71	2.70	0.06	0.03	-0.48

Code

(
    goals_summary.groupby("season")
    .agg(["mean", "std"])
    .style.format(precision=1)
    .format("{:.2f}", subset=["n_goals_per_match"])
    .highlight_max(color="#FFFF77", subset="n_goals_per_match")
    .highlight_min(color="#FFBBBB", subset="n_goals_per_match")
)

**Table 5.4.** Statistics of performance in each league and season: summaries for each season. Yellow cells indicate maximum and pale red ones indicate minimum values in column.
	n_matches_total		n_goals_total		n_goals_per_match
	mean	std	mean	std	mean	std
season
2008/2009	302.4	72.4	788.4	208.2	2.61	0.30
2009/2010	293.6	77.5	784.7	207.7	2.69	0.29
2010/2011	296.4	74.8	795.4	210.3	2.69	0.28
2011/2012	292.7	75.6	795.2	226.2	2.71	0.26
2012/2013	296.4	74.8	821.7	216.3	2.77	0.20
2013/2014	275.6	113.5	762.6	319.9	2.75	0.26
2014/2015	302.3	72.3	808.8	184.0	2.69	0.18
2015/2016	302.4	72.4	832.9	170.1	2.78	0.18

5.3 Top Teams

Which teams show the best performance?

How do best and worst teams differ in resultativeness?

How many matches do the best teams win and lose?

The analysis of teams included data from 7 seasons.
To evaluate a team’s performance, it was decided to count in how many seasons the team appeared among the Top 5 scoring teams (in terms of goals per match in that season). Table 5.5 shows that 13 teams appear in that list, and Real Madrid CF (7 times in 7 seasons), FC Barcelona (6 times), and PSV (5 times) are the 3 leaders. The performance of the best and the worst teams differs by about 2 goals per match (Figure 5.6, Table 5.6).
Comparing teams that had the highest percentage of won matches per season, SL Benfica (5 times in 7 seasons), FC Barcelona (5 times), Real Madrid CF (4 times), and Celtic (4 times) were most frequently among the Top 5 (Table 5.7).
To get among the Top 5 winners, in some cases, it was sufficient to win as little as 73.7 % of matches, and no Top 5 team lost more than 15.8 % of matches (Table 5.8).

See the details below.
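The counting approach described above can be sketched as follows. The code that actually builds `teams_top_bottom_goals` is not shown in this section, so this is only an illustration: the records, team names, and the function `count_top_n_appearances` are hypothetical. Ties at the cut-off are dropped here, whereas Table 5.8 lists six teams for 2011/2012, which suggests that ties were kept in the original analysis.

```python
from collections import Counter, defaultdict

# Hypothetical toy records: (season, team, goals per match).
records = [
    ("S1", "Team A", 2.9), ("S1", "Team B", 2.4), ("S1", "Team C", 1.1),
    ("S2", "Team A", 3.1), ("S2", "Team C", 2.2), ("S2", "Team B", 1.3),
]


def count_top_n_appearances(rows, n):
    """Count in how many seasons each team is among the n best scorers."""
    by_season = defaultdict(list)
    for season, team, goals_per_match in rows:
        by_season[season].append((goals_per_match, team))
    appearances = Counter()
    for season_rows in by_season.values():
        # Sort by goals per match (descending) and keep the first n teams.
        top = sorted(season_rows, reverse=True)[:n]
        appearances.update(team for _, team in top)
    return appearances


# n=2 only because the toy data has 3 teams; the analysis above uses n=5.
top_counts = count_top_n_appearances(records, n=2)
print(top_counts.most_common())
```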

Code

print(f"Seasons in this analysis: {teams_top_bottom_goals.season.nunique()}")

Seasons in this analysis: 7

Code

(
    teams_top_bottom_goals.query("which == 'Top 5'")
    .team_name.value_counts()
    .to_df("Number of seasons (out of 7)", "Team")
    .index_start_at(1)
    .style
)

**Table 5.5.** Number of seasons a team was among Top 5 by number of **goals per match**.
	Team	Number of seasons (out of 7)
1	Real Madrid CF	7
2	FC Barcelona	6
3	PSV	5
4	FC Bayern Munich	4
5	SL Benfica	3
6	Ajax	2
7	FC Porto	2
8	Manchester City	2
9	Chelsea	1
10	Roda JC Kerkrade	1
11	Celtic	1
12	Liverpool	1
13	Paris Saint-Germain	1

Compare the performance of Top 5 and Bottom 5 teams:

Code

ax = sns.scatterplot(
    teams_top_bottom_goals,
    x="season",
    y="n_goals_per_match",
    hue="which",
)

ax.tick_params(axis="x", rotation=90)
ax.set_xlabel("Season")
ax.set_ylabel("Number of goals per match")
ax.set_ylim([0, 4.5])
ax.get_legend().set_title(None)

**Fig. 5.6.** Comparison of Top 5 and Bottom 5 teams by number of goals per match.

Code

teams_top_bottom_goals_summary = (
    teams_top_bottom_goals.groupby(["which"])
    .n_goals_per_match.agg(["min", "mean", "std", "max"])
    .sort_index(ascending=False)
    .reset_index()
)

teams_top_bottom_goals_summary.columns = pd.MultiIndex.from_tuples(
    [
        ("", "Group of teams"),
        ("Goals per match", "Min"),
        ("Goals per match", "Mean"),
        ("Goals per match", "SD"),
        ("Goals per match", "Max"),
    ]
)

teams_top_bottom_goals_summary.style.format(precision=2).hide(axis="index")

**Table 5.6.** Summaries of Top 5 and Bottom 5 teams by number of **goals per match** in all 7 seasons.
	Goals per match
Group of teams	Min	Mean	SD	Max
Top 5	2.32	2.77	0.31	3.70
Bottom 5	0.30	0.68	0.11	0.87

Code

(
    teams_wins_per_season.reset_index()
    .Team.value_counts()
    .to_df("Number of seasons (out of 7)", "Team")
    .index_start_at(1)
    .style
)

**Table 5.7.** Number of seasons a team was among Top 5 by percentage of **won matches**.
	Team	Number of seasons (out of 7)
1	SL Benfica	5
2	FC Barcelona	5
3	Real Madrid CF	4
4	Celtic	4
5	FC Porto	3
6	FC Bayern Munich	3
7	Manchester United	2
8	PSV	2
9	RSC Anderlecht	1
10	Ajax	1
11	Rangers	1
12	Manchester City	1
13	Juventus	1
14	Atlético Madrid	1
15	Sporting CP	1
16	Paris Saint-Germain	1

Code

variables = ["Lost", "Draw", "Won"]

(
    teams_wins_per_season.style.format("{:.1f} %", subset=variables)
    .highlight_max(subset=variables, color="#FFFF77")
    .highlight_min(subset=variables, color="#FFBBBB")
)

**Table 5.8.** Top 5 teams by percentage of **won matches** in each season. Highest values in each column are in yellow and lowest ones are in pale red.
			Lost	Draw	Won
Season	League	Team
2009/2010	Belgium Jupiler League	RSC Anderlecht	0.0 %	0.0 %	100.0 %
	Netherlands Eredivisie	Ajax	0.0 %	0.0 %	100.0 %
	Portugal Liga ZON Sagres	SL Benfica	10.0 %	0.0 %	90.0 %
	Spain LIGA BBVA	Real Madrid CF	6.7 %	6.7 %	86.7 %
	Spain LIGA BBVA	FC Barcelona	0.0 %	13.3 %	86.7 %
2010/2011	Portugal Liga ZON Sagres	FC Porto	0.0 %	10.0 %	90.0 %
	Spain LIGA BBVA	FC Barcelona	5.3 %	15.8 %	78.9 %
	Scotland Premier League	Rangers	13.2 %	7.9 %	78.9 %
	Spain LIGA BBVA	Real Madrid CF	10.5 %	13.2 %	76.3 %
	Scotland Premier League	Celtic	10.5 %	13.2 %	76.3 %
2011/2012	Spain LIGA BBVA	Real Madrid CF	5.3 %	10.5 %	84.2 %
	Scotland Premier League	Celtic	13.2 %	7.9 %	78.9 %
	Portugal Liga ZON Sagres	FC Porto	3.3 %	20.0 %	76.7 %
	England Premier League	Manchester City	13.2 %	13.2 %	73.7 %
	Spain LIGA BBVA	FC Barcelona	7.9 %	18.4 %	73.7 %
	England Premier League	Manchester United	13.2 %	13.2 %	73.7 %
2012/2013	Germany 1. Bundesliga	FC Bayern Munich	2.9 %	11.8 %	85.3 %
	Spain LIGA BBVA	FC Barcelona	5.3 %	10.5 %	84.2 %
	Portugal Liga ZON Sagres	SL Benfica	3.3 %	16.7 %	80.0 %
	Portugal Liga ZON Sagres	FC Porto	0.0 %	20.0 %	80.0 %
	England Premier League	Manchester United	13.2 %	13.2 %	73.7 %
2013/2014	Italy Serie A	Juventus	5.3 %	7.9 %	86.8 %
	Germany 1. Bundesliga	FC Bayern Munich	5.9 %	8.8 %	85.3 %
	Scotland Premier League	Celtic	2.6 %	15.8 %	81.6 %
	Portugal Liga ZON Sagres	SL Benfica	6.7 %	16.7 %	76.7 %
	Spain LIGA BBVA	Atlético Madrid	10.5 %	15.8 %	73.7 %
2014/2015	Netherlands Eredivisie	PSV	11.8 %	2.9 %	85.3 %
	Portugal Liga ZON Sagres	SL Benfica	8.8 %	11.8 %	79.4 %
	Spain LIGA BBVA	Real Madrid CF	15.8 %	5.3 %	78.9 %
	Spain LIGA BBVA	FC Barcelona	10.5 %	10.5 %	78.9 %
	Scotland Premier League	Celtic	10.5 %	13.2 %	76.3 %
2015/2016	Portugal Liga ZON Sagres	SL Benfica	11.8 %	2.9 %	85.3 %
	Germany 1. Bundesliga	FC Bayern Munich	5.9 %	11.8 %	82.4 %
	Portugal Liga ZON Sagres	Sporting CP	5.9 %	14.7 %	79.4 %
	France Ligue 1	Paris Saint-Germain	5.3 %	15.8 %	78.9 %
	Netherlands Eredivisie	PSV	5.9 %	17.6 %	76.5 %

5.4 Players in 2015/2016

This subsection deals with the analysis of football players in the most recent available season. It contains a link to the dashboard (a technical requirement of this project) too.

5.5 Analysis of Players

Which players are the best ones?

Which player attributes are related to being a good player?

How do various player attributes relate to each other?

This analysis included players from season 2015/2016 (player records dated 2015-07-01 or later; if several records are present for a player, the most recent one is used).
Among 7057 included players, the Top 5 players by overall rating were Lionel Messi, Cristiano Ronaldo, Neymar, Manuel Neuer, and Luis Suarez (Table 5.9). Player reactions (r=0.81) and potential (r=0.80) were the attributes most strongly correlated with overall rating (Table 5.11).
Correlation and hierarchical cluster analysis revealed that there are at least 2 groups of related player attributes. One of the major clusters seems to be associated with goalkeeping-related features, and larger values of physical properties like body mass (variable “weight_kg”) and height are positively related to better goalkeeping characteristics (Figure 5.7). The other cluster has several sub-clusters, which might also be related to different roles of players, but this idea should be investigated in more detail.

Find the details below in this sub-section.

Code

player_info_2015_2016 = players.query("player_info_date >= '2015-07-01'")
n_players_last_season = player_info_2015_2016.player_id.nunique()
print(f"Number of players included: {n_players_last_season}")

Number of players included: 7057

Code

players_2015_2016 = (
    player_info_2015_2016.assign(
        rank=lambda x: (
            x.groupby("player_id").player_info_date.rank(
                method="first", ascending=False
            )
        )
    )
    .query("rank == 1")
    .drop(columns=["rank"])
    .sort_values("overall_rating", ascending=False)
)

players_2015_2016.query("overall_rating >= 90")

**Table 5.9.** Top players in season 2015/2016: players with an overall rating of 90 or higher.
	player_id	player_info_date	player_name	birthday	birth_year	height	weight_kg	bmi	overall_rating	potential	preferred_foot	attacking_work_rate	defensive_work_rate	crossing	finishing	heading_accuracy	short_passing	volleys	dribbling	curve	free_kick_accuracy	long_passing	ball_control	acceleration	sprint_speed	agility	reactions	balance	shot_power	jumping	stamina	strength	long_shots	aggression	interceptions	positioning	vision	penalties	marking	standing_tackle	sliding_tackle	gk_diving	gk_handling	gk_kicking	gk_positioning	gk_reflexes
102482	30981	2015-12-17	Lionel Messi	1987-06-24	1987	170.18	72.11	24.90	94.00	94.00	left	medium	low	80.00	93.00	71.00	88.00	85.00	96.00	89.00	90.00	79.00	96.00	95.00	90.00	92.00	92.00	95.00	80.00	68.00	75.00	59.00	88.00	48.00	22.00	90.00	90.00	74.00	13.00	23.00	21.00	6.00	11.00	15.00	14.00	8.00
33330	30893	2015-10-16	Cristiano Ronaldo	1985-02-05	1985	185.42	79.82	23.22	93.00	93.00	right	high	low	82.00	95.00	86.00	81.00	87.00	93.00	88.00	77.00	72.00	91.00	91.00	93.00	90.00	92.00	62.00	94.00	94.00	90.00	79.00	93.00	62.00	29.00	93.00	81.00	85.00	22.00	31.00	23.00	7.00	11.00	15.00	14.00	11.00
131464	19533	2016-02-04	Neymar	1992-02-05	1992	175.26	68.03	22.15	90.00	94.00	right	high	medium	72.00	88.00	62.00	78.00	83.00	94.00	78.00	79.00	74.00	93.00	91.00	90.00	92.00	86.00	84.00	78.00	61.00	79.00	45.00	73.00	56.00	36.00	89.00	79.00	81.00	21.00	24.00	33.00	9.00	9.00	15.00	15.00	11.00
109033	27299	2016-04-21	Manuel Neuer	1986-03-27	1986	193.04	92.06	24.71	90.00	90.00	right	medium	medium	15.00	13.00	25.00	48.00	11.00	16.00	14.00	11.00	47.00	31.00	58.00	61.00	43.00	87.00	35.00	25.00	78.00	44.00	83.00	16.00	29.00	30.00	12.00	70.00	37.00	10.00	10.00	11.00	85.00	87.00	91.00	90.00	87.00
105983	40636	2015-10-16	Luis Suarez	1987-01-24	1987	182.88	84.81	25.36	90.00	90.00	right	high	medium	77.00	90.00	77.00	82.00	87.00	88.00	86.00	84.00	64.00	91.00	88.00	78.00	86.00	91.00	60.00	88.00	69.00	88.00	76.00	85.00	78.00	41.00	91.00	84.00	85.00	30.00	45.00	38.00	27.00	25.00	31.00	33.00	37.00

Code

players_2015_2016.query("overall_rating < 50")

**Table 5.10.** Players with overall rating below 50 in season 2015/2016.
	player_id	player_info_date	player_name	birthday	birth_year	height	weight_kg	bmi	overall_rating	potential	preferred_foot	attacking_work_rate	defensive_work_rate	crossing	finishing	heading_accuracy	short_passing	volleys	dribbling	curve	free_kick_accuracy	long_passing	ball_control	acceleration	sprint_speed	agility	reactions	balance	shot_power	jumping	stamina	strength	long_shots	aggression	interceptions	positioning	vision	penalties	marking	standing_tackle	sliding_tackle	gk_diving	gk_handling	gk_kicking	gk_positioning	gk_reflexes
98716	215085	2015-07-10	Kyrylo Petrov	1990-06-22	1990	182.88	76.19	22.78	49.00	55.00	right	medium	medium	30.00	22.00	63.00	40.00	28.00	28.00	30.00	30.00	26.00	51.00	60.00	55.00	43.00	51.00	59.00	37.00	72.00	58.00	66.00	26.00	60.00	54.00	27.00	29.00	37.00	55.00	62.00	64.00	11.00	11.00	14.00	9.00	13.00
76938	696435	2016-04-14	Jan Bamert	1998-03-09	1998	180.34	69.84	21.47	48.00	67.00	right	medium	medium	38.00	23.00	40.00	26.00	31.00	59.00	30.00	30.00	22.00	31.00	66.00	59.00	53.00	47.00	61.00	30.00	59.00	56.00	50.00	24.00	55.00	48.00	37.00	34.00	39.00	46.00	57.00	52.00	7.00	11.00	6.00	9.00	7.00
101928	674221	2016-02-04	Liam Grimshaw	1995-02-02	1995	177.80	74.83	23.67	48.00	60.00	right	medium	high	33.00	32.00	42.00	55.00	27.00	41.00	33.00	37.00	42.00	49.00	68.00	67.00	58.00	54.00	65.00	52.00	63.00	64.00	59.00	33.00	67.00	48.00	34.00	48.00	44.00	45.00	54.00	49.00	13.00	8.00	15.00	12.00	11.00
157494	659742	2016-05-12	Sandro Lauper	1996-10-25	1996	185.42	69.84	20.31	48.00	64.00	right	medium	medium	47.00	45.00	39.00	52.00	47.00	59.00	44.00	38.00	47.00	53.00	65.00	67.00	57.00	42.00	61.00	55.00	45.00	53.00	50.00	42.00	33.00	35.00	49.00	45.00	54.00	22.00	36.00	25.00	14.00	7.00	13.00	6.00	11.00
172	528212	2016-02-25	Aaron Lennox	1993-02-19	1993	190.50	82.09	22.62	48.00	56.00	right	medium	medium	12.00	15.00	16.00	23.00	14.00	15.00	14.00	18.00	18.00	22.00	15.00	26.00	31.00	45.00	24.00	26.00	38.00	18.00	44.00	12.00	21.00	19.00	14.00	15.00	41.00	15.00	15.00	12.00	53.00	41.00	39.00	51.00	53.00
151371	614951	2016-03-03	Robin Huser	1998-01-24	1998	180.34	69.84	21.47	47.00	63.00	right	medium	medium	34.00	27.00	44.00	53.00	35.00	47.00	44.00	35.00	47.00	41.00	65.00	66.00	67.00	52.00	75.00	57.00	64.00	48.00	47.00	29.00	51.00	47.00	39.00	37.00	56.00	42.00	48.00	50.00	13.00	6.00	15.00	11.00	16.00

Code

cor_data_players = [
    (i, round(players_2015_2016.overall_rating.corr(players_2015_2016[i]), 3))
    for i in (
        players_2015_2016.select_dtypes("number").drop(
            columns=["player_id", "overall_rating"]
        )
    )
]

(
    pd.DataFrame(cor_data_players, columns=["variable", "r"])
    .sort_values("r", ascending=False)
    .index_start_at(1)
    .style.format({"r": "{:.2f}"})
    .bar(vmin=-1, vmax=1, cmap="BrBG", subset=["r"])
)

**Table 5.11.** Correlation to overall rating.
	variable	r
1	reactions	0.81
2	potential	0.80
3	vision	0.41
4	short_passing	0.40
5	long_passing	0.39
6	ball_control	0.36
7	shot_power	0.33
8	long_shots	0.32
9	curve	0.32
10	volleys	0.30
11	free_kick_accuracy	0.29
12	crossing	0.29
13	dribbling	0.28
14	aggression	0.27
15	positioning	0.27
16	penalties	0.27
17	finishing	0.25
18	stamina	0.25
19	heading_accuracy	0.24
20	jumping	0.23
21	strength	0.22
22	interceptions	0.21
23	agility	0.20
24	sprint_speed	0.18
25	acceleration	0.17
26	standing_tackle	0.15
27	sliding_tackle	0.13
28	marking	0.13
29	bmi	0.09
30	balance	0.09
31	weight_kg	0.07
32	gk_handling	0.02
33	gk_reflexes	0.02
34	gk_diving	0.02
35	gk_positioning	0.02
36	gk_kicking	0.02
37	height	0.01
38	birth_year	-0.23
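For readers who want to see what the `r` column measures, below is a minimal sketch of Pearson’s correlation coefficient written out from its definition. It uses only the five players from Table 5.9, so the result is not expected to match r = 0.81, which is based on all 7057 players. The function name `pearson_r` is made up for this example; the report itself uses the pandas `Series.corr` method.

```python
import math

# Values copied from Table 5.9 (Messi, Ronaldo, Neymar, Neuer, Suarez).
overall_rating = [94, 93, 90, 90, 90]
reactions = [92, 92, 86, 87, 91]


def pearson_r(x, y):
    """Pearson's correlation: covariance scaled by both standard deviations."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


r_top5 = pearson_r(overall_rating, reactions)
print(f"r = {r_top5:.2f} (n = {len(overall_rating)} players only)")
```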

Note

Heatmaps and clustered heatmaps in this project are very big as they contain many variables. I tried a smaller plot size, but then every second variable name got hidden.

Code

sns.clustermap(
    players_2015_2016.corr(numeric_only=True),
    vmin=-1,
    vmax=1,
    annot=False,
    cmap="BrBG",
    method="centroid",
    figsize=(15, 15),
);

**Fig. 5.7.** Clustered heatmap of correlation coefficients between player attributes.

5.5.1 Dashboard

Some additional exploration of football players is available via this Looker Studio dashboard (preview in Figure 5.8). Only players with no missing data in their attributes are included in the dashboard.

**Fig. 5.8.** Print screen of the Looker Studio dashboard (link).

5.6 Home Advantage: Is It Real?

Is there such a thing as home advantage?

If yes, can we quantify it?

The analysis of 25,979 matches revealed that:

Home teams win 45.9% (CI 45.1%–46.6%) of matches, compared to 28.7% away wins and 25.4% draws. The differences between these proportions are statistically significant (χ² test, p < 0.001).
On average, home teams score 0.38 goals more than away teams. This shift toward home advantage is statistically significant (t-test, p < 0.001).
Leagues differ in the degree of home advantage (Kruskal-Wallis test, p < 0.001). E.g., in Spain LIGA BBVA the home advantage is as high as 0.50 goals, while in Scotland Premier League it is as low as 0.22 goals.
Between seasons, no significant differences in home advantage were found (Kruskal-Wallis test, p = 0.076).

Find the details below.

Code

# Count of Home wins, Draws and Away wins
counts = matches.match_winner.value_counts(sort=False).rename("matches")
res_counts = an.AnalyzeCounts(counts, "Match outcome").fit()
res_counts.display()

Omnibus (chi-squared) test results

Chi square test, χ²(2, n = 25979) = 1881.57, p < 0.001

Counts of matches with 95% CI and post-hoc (pairwise chi-squared) test results

	Match outcome	n_matches	percent	ci_lower	ci_upper	cld	spaced_cld
0	Away Wins	7,466	28.7%	28.1%	29.4%	a	a__
1	Draw	6,596	25.4%	24.7%	26.0%	b	_b_
2	Home Wins	11,917	45.9%	45.1%	46.6%	c	__c

Descriptive statistics of group (Match outcome) counts

	count	min	max	range	mean	median	std	mad	skew
n_matches	3	6,596	11,917	5,321	8,660	7,466	2,854	870	1.55
percent	3	25.4%	45.9%	20.5%	33.3%	28.7%	11.0%	3.3%	1.55
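The omnibus result above can be cross-checked directly from the counts in the table. This is a minimal sketch, assuming the omnibus test is a chi-squared goodness-of-fit test against equal proportions; how `an.AnalyzeCounts` computes it internally is not shown here, so `scipy.stats.chisquare` is used as a stand-in.

```python
from scipy import stats

# Observed counts from the table above: away wins, draws, home wins.
observed = [7466, 6596, 11917]

# Without explicit expected frequencies, chisquare() tests against equal proportions.
chi2, p = stats.chisquare(observed)
df = len(observed) - 1

print(f"chi2({df}, n = {sum(observed)}) = {chi2:.2f}, p = {p:.3g}")
```

With these counts the statistic comes out at about 1881.57 on 2 degrees of freedom, in line with the reported result. The confidence intervals and the pairwise post-hoc tests are not reproduced by this sketch.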

Code

res_counts.plot(rot=0, color=[blue, blue, green]);

**Fig. 5.9.** Distribution of match outcomes. The most common outcome is highlighted in green.

Code

mean_goal_diff = matches.goal_diff.mean()

ax = matches.goal_diff.plot.hist(
    edgecolor="black", label="_nolegend_", bins=np.arange(-6.5, 6.5)
)

ax.set_xlabel("Goal difference (home wins, if >0)")
ax.set_ylabel("Number of matches")

ax.axvline(
    x=mean_goal_diff,
    color="red",
    linestyle="--",
    label="Mean",
    zorder=1,
)

ax.axvline(
    x=0,
    color="gold",
    markeredgecolor="grey",
    linestyle="--",
    label="Zero (draw)",
    linewidth=1.5,
    zorder=2,
)

ax.legend(frameon=False, loc="upper right")

# Print results
(t, p) = sps.ttest_1samp(matches.goal_diff, 0)
print(
    f"On average, home teams score {mean_goal_diff:.2f} goals more than away "
    "teams. \nThis shift toward home advantage is statistically significant \n"
    f"(t-test, {my.format_p(p)})."
)

On average, home teams score 0.38 goals more than away teams. 
This shift toward home advantage is statistically significant 
(t-test, p < 0.001).

**Fig. 5.10.** Distribution of goal difference in each match. The number is negative when the away team wins, 0 when it is a draw, and positive when the home team wins.

Code

res_by_league = an.AnalyzeNumericGroups(matches, "goal_diff", by="league").fit()
res_by_league.display()

Omnibus (Kruskal-Wallis) test results

	Source	ddof1	H	p-unc
Kruskal	league	10	44.87	p < 0.001

Post-hoc (Conover-Iman) test results as CLD and Confidence intervals (CI)

	league	cld	spaced_cld	mean	ci_lower	ci_upper
0	Belgium Jupiler League	ab	ab_	0.42	0.33	0.50
1	England Premier League	ab	ab_	0.39	0.33	0.46
2	France Ligue 1	abc	abc	0.36	0.31	0.42
3	Germany 1. Bundesliga	abc	abc	0.35	0.28	0.43
4	Italy Serie A	ab	ab_	0.38	0.33	0.44
5	Netherlands Eredivisie	a	a__	0.48	0.40	0.56
6	Poland Ekstraklasa	abc	abc	0.36	0.29	0.44
7	Portugal Liga ZON Sagres	bc	_bc	0.28	0.21	0.36
8	Scotland Premier League	c	__c	0.22	0.14	0.31
9	Spain LIGA BBVA	a	a__	0.50	0.43	0.56
10	Switzerland Super League	abc	abc	0.40	0.30	0.49

Descriptive statistics of group (league) means

	count	min	max	range	mean	median	std	mad	skew
mean	11	0.22	0.50	0.27	0.38	0.38	0.08	0.03	-0.44
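For readers without access to the helper class, the omnibus step can be approximated with SciPy. The sketch below runs `scipy.stats.kruskal` on synthetic goal differences; it is a stand-in for `an.AnalyzeNumericGroups`, whose internals are not shown, and the league names and Poisson rates are made up for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic goal differences (home goals minus away goals) for three
# hypothetical leagues; these are NOT the project's data.
groups = {
    "league_a": rng.poisson(1.6, 300) - rng.poisson(1.1, 300),
    "league_b": rng.poisson(1.5, 300) - rng.poisson(1.2, 300),
    "league_c": rng.poisson(1.4, 300) - rng.poisson(1.2, 300),
}

# Kruskal-Wallis H test: do the groups share the same distribution?
h, p = stats.kruskal(*groups.values())
ddof = len(groups) - 1  # degrees of freedom = number of groups - 1

print(f"H({ddof}) = {h:.2f}, p = {p:.3f}")
```

A rank-based test is a reasonable choice here because goal differences are discrete and not normally distributed. The post-hoc comparisons and compact letter display are outside the scope of this sketch.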

Code

(_, ax) = res_by_league.plot(ylabel="Goal difference \n(home wins, if >0)")
ax.tick_params(axis="x", rotation=90)
ax.axhline(
    y=0,
    color="lightgray",
    linestyle="--",
    label="Draw",
    zorder=1,
)
ax.legend(frameon=False, loc="lower right")
ax.set_ylim([-0.15, 0.6]);

**Fig. 5.11.** Degree of home advantage in different leagues. Mean goal difference (home minus away) of a match with 95% confidence interval.

Code

res_by_season = an.AnalyzeNumericGroups(matches, "goal_diff", by="season").fit()

res_by_season.display()

Omnibus (Kruskal-Wallis) test results

	Source	ddof1	H	p-unc
Kruskal	season	7	12.85	p = 0.076

Post-hoc (Conover-Iman) test results as CLD and Confidence intervals (CI)

	season	cld	spaced_cld	mean	ci_lower	ci_upper
0	2008/2009	a	a	0.40	0.35	0.46
1	2009/2010	a	a	0.41	0.35	0.47
2	2010/2011	a	a	0.41	0.35	0.47
3	2011/2012	a	a	0.43	0.37	0.49
4	2012/2013	a	a	0.33	0.27	0.39
5	2013/2014	a	a	0.39	0.33	0.46
6	2014/2015	a	a	0.36	0.30	0.43
7	2015/2016	a	a	0.33	0.27	0.40

Descriptive statistics of group (season) means

	count	min	max	range	mean	median	std	mad	skew
mean	8	0.33	0.43	0.10	0.38	0.40	0.04	0.02	-0.62

Code

(_, ax) = res_by_season.plot(ylabel="Goal difference \n(home wins, if >0)")
ax.tick_params(axis="x", rotation=90)
ax.axhline(
    y=0,
    color="lightgray",
    linestyle="--",
    label="Draw",
    zorder=1,
)
ax.legend(frameon=False, loc="lower right")
ax.set_ylim([-0.15, 0.6]);

**Fig. 5.12.** Degree of home advantage in different seasons. Mean goal difference (home minus away) of a match with 95% confidence interval.

5.7 Relationship Between Betting Odds

What is the relationship between betting odds from different websites?

How strongly are betting odds related to match outcomes?

Betting odds from different websites, as well as the ratio and log-ratio of “home wins” to “away wins” betting odds, are investigated in this subsection. The analysis (Figure 5.13) shows that:

odds of the same type (e.g., “home wins”) from different websites are strongly correlated with each other (Fig. 5.13).
odds of “draw” are strongly related to odds of “away wins” and almost uncorrelated with odds of “home wins” (for B365, r = 0.82 and r = 0.02, respectively; Table 5.12).
the log-ratio of betting odds shows the strongest correlation with the match outcome: for B365 (bet365.com), the correlation between the log-ratio of betting odds and the goal difference (home goals minus away goals) is r = -0.46, as shown in Table 5.12.

See details below.

EDA: Overview of matches_betting_odds table

Code

skim(matches_betting_odds)

╭──────────────────────────────────────────────── skimpy summary ─────────────────────────────────────────────────╮
│          Data Summary                Data Types               Categories                                        │
│ ┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┓ ┏━━━━━━━━━━━━━┳━━━━━━━┓ ┏━━━━━━━━━━━━━━━━━━━━━━━┓                                │
│ ┃ dataframe         ┃ Values ┃ ┃ Column Type ┃ Count ┃ ┃ Categorical Variables ┃                                │
│ ┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━┩ ┡━━━━━━━━━━━━━╇━━━━━━━┩ ┡━━━━━━━━━━━━━━━━━━━━━━━┩                                │
│ │ Number of rows    │ 25979  │ │ float64     │ 50    │ │ match_winner          │                                │
│ │ Number of columns │ 57     │ │ int32       │ 6     │ └───────────────────────┘                                │
│ └───────────────────┴────────┘ │ category    │ 1     │                                                          │
│                                └─────────────┴───────┘                                                          │
│                                                     number                                                      │
│ ┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━┳━━━━━━━━━┓  │
│ ┃ column_name            ┃ NA      ┃ NA %   ┃ mean    ┃ sd     ┃ p0      ┃ p25    ┃ p75    ┃ p100  ┃ hist    ┃  │
│ ┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━╇━━━━━━━━━┩  │
│ │ stage                  │       0 │      0 │      18 │     10 │       1 │      9 │     27 │    38 │ █▇▇▇▇▅  │  │
│ │ home_team_goal         │       0 │      0 │     1.5 │    1.3 │       0 │      1 │      2 │    10 │   █▅▁   │  │
│ │ away_team_goal         │       0 │      0 │     1.2 │    1.1 │       0 │      0 │      2 │     9 │   █▂▁   │  │
│ │ goal_sum               │       0 │      0 │     2.7 │    1.7 │       0 │      2 │      4 │    12 │  ▄█▄▁   │  │
│ │ goal_diff              │       0 │      0 │    0.38 │    1.8 │      -9 │     -1 │      1 │    10 │   ▁█▇▁  │  │
│ │ goal_diff_sign         │       0 │      0 │    0.17 │   0.85 │      -1 │     -1 │      1 │     1 │ ▅  ▄ █  │  │
│ │ B365_home_wins         │    3400 │     13 │     2.6 │    1.8 │       1 │    1.7 │    2.8 │    26 │    █    │  │
│ │ BW_home_wins           │    3400 │     13 │     2.6 │    1.6 │       1 │    1.6 │    2.8 │    34 │    █    │  │
│ │ IW_home_wins           │    3500 │     13 │     2.5 │    1.4 │       1 │    1.6 │    2.6 │    20 │   █▁    │  │
│ │ LB_home_wins           │    3400 │     13 │     2.5 │    1.6 │       1 │    1.7 │    2.7 │    26 │    █    │  │
│ │ PS_home_wins           │   15000 │     57 │     2.8 │    2.2 │       1 │    1.7 │      3 │    36 │    █    │  │
│ │ WH_home_wins           │    3400 │     13 │     2.6 │    1.7 │       1 │    1.7 │    2.8 │    26 │    █    │  │
│ │ SJ_home_wins           │    8900 │     34 │     2.6 │    1.7 │       1 │    1.7 │    2.8 │    23 │   █▁    │  │
│ │ VC_home_wins           │    3400 │     13 │     2.7 │    1.9 │       1 │    1.7 │    2.8 │    36 │    █    │  │
│ │ GB_home_wins           │   12000 │     45 │     2.5 │    1.5 │     1.1 │    1.7 │    2.6 │    21 │   █▁    │  │
│ │ BS_home_wins           │   12000 │     45 │     2.5 │    1.5 │       1 │    1.7 │    2.6 │    17 │   █▁    │  │
│ │ B365_draw              │    3400 │     13 │     3.8 │    1.1 │     1.4 │    3.3 │      4 │    17 │   █▂    │  │
│ │ BW_draw                │    3400 │     13 │     3.7 │      1 │     1.6 │    3.2 │    3.8 │    20 │   █▁    │  │
│ │ IW_draw                │    3500 │     13 │     3.6 │    0.8 │     1.5 │    3.2 │    3.7 │    11 │   ▁█▁   │  │
│ │ LB_draw                │    3400 │     13 │     3.7 │      1 │     1.4 │    3.2 │    3.8 │    19 │   █▁    │  │
│ │ PS_draw                │   15000 │     57 │     4.1 │    1.5 │     2.2 │    3.4 │    4.2 │    29 │    █    │  │
│ │ WH_draw                │    3400 │     13 │     3.7 │   0.96 │       1 │    3.2 │    3.8 │    17 │   █▃    │  │
│ │ SJ_draw                │    8900 │     34 │     3.8 │      1 │     1.4 │    3.2 │    3.8 │    15 │   █▃    │  │
│ │ VC_draw                │    3400 │     13 │     3.9 │    1.2 │     1.6 │    3.3 │      4 │    26 │   █▁    │  │
│ │ GB_draw                │   12000 │     45 │     3.6 │   0.87 │     1.4 │    3.2 │    3.8 │    11 │   ▁█▁   │  │
│ │ BS_draw                │   12000 │     45 │     3.7 │   0.87 │     1.3 │    3.2 │    3.8 │    13 │   ▅█▁   │  │
│ │ B365_away_wins         │    3400 │     13 │     4.7 │    3.7 │     1.1 │    2.5 │    5.2 │    51 │   █▁    │  │
│ │ BW_away_wins           │    3400 │     13 │     4.4 │    3.3 │     1.1 │    2.5 │      5 │    51 │   █▁    │  │
│ │ IW_away_wins           │    3500 │     13 │     4.2 │    2.9 │     1.1 │    2.5 │    4.6 │    25 │   █▂    │  │
│ │ LB_away_wins           │    3400 │     13 │     4.4 │    3.4 │     1.1 │    2.5 │      5 │    51 │   █▁    │  │
│ │ PS_away_wins           │   15000 │     57 │       5 │    4.5 │     1.1 │    2.6 │    5.4 │    48 │   █▁    │  │
│ │ WH_away_wins           │    3400 │     13 │     4.5 │    3.6 │     1.1 │    2.5 │      5 │    51 │   █▁    │  │
│ │ SJ_away_wins           │    8900 │     34 │     4.6 │    3.6 │     1.1 │    2.5 │    5.2 │    41 │   █▁    │  │
│ │ VC_away_wins           │    3400 │     13 │     4.8 │    4.3 │     1.1 │    2.5 │    5.4 │    67 │    █    │  │
│ │ GB_away_wins           │   12000 │     45 │     4.4 │      3 │     1.1 │    2.5 │      5 │    34 │   █▁    │  │
│ │ BS_away_wins           │   12000 │     45 │     4.4 │    3.2 │     1.1 │    2.5 │      5 │    34 │   █▁    │  │
│ │ B365_ratio_ha          │    3400 │     13 │     1.1 │    1.5 │   0.021 │   0.32 │    1.1 │    24 │    █    │  │
│ │ BW_ratio_ha            │    3400 │     13 │     1.1 │    1.4 │   0.021 │   0.33 │    1.1 │    31 │    █    │  │
│ │ PS_ratio_ha            │   15000 │     57 │     1.2 │    1.9 │   0.022 │   0.32 │    1.2 │    33 │    █    │  │
│ │ VC_ratio_ha            │    3400 │     13 │     1.1 │    1.7 │   0.015 │   0.32 │    1.1 │    33 │    █    │  │
│ │ IW_ratio_ha            │    3500 │     13 │       1 │    1.3 │   0.042 │   0.36 │      1 │    18 │    █    │  │
│ │ WH_ratio_ha            │    3400 │     13 │     1.1 │    1.5 │   0.021 │   0.34 │    1.1 │    24 │    █    │  │
│ │ GB_ratio_ha            │   12000 │     45 │       1 │    1.3 │   0.031 │   0.33 │      1 │    19 │    █    │  │
│ │ LB_ratio_ha            │    3400 │     13 │       1 │    1.4 │   0.021 │   0.34 │    1.1 │    24 │    █    │  │
│ │ SJ_ratio_ha            │    8900 │     34 │       1 │    1.4 │   0.026 │   0.32 │    1.1 │    20 │    █    │  │
│ │ BS_ratio_ha            │   12000 │     45 │       1 │    1.3 │   0.031 │   0.33 │      1 │    15 │   █▁    │  │
│ │ B365_log_ratio_ha      │    3400 │     13 │    -0.5 │      1 │    -3.9 │   -1.1 │  0.097 │   3.2 │   ▃█▆▂  │  │
│ │ BW_log_ratio_ha        │    3400 │     13 │   -0.48 │      1 │    -3.9 │   -1.1 │   0.08 │   3.4 │   ▃█▅▁  │  │
│ │ PS_log_ratio_ha        │   15000 │     57 │   -0.49 │    1.1 │    -3.8 │   -1.1 │   0.15 │   3.5 │  ▁▃█▅▁  │  │
│ │ VC_log_ratio_ha        │    3400 │     13 │   -0.51 │    1.1 │    -4.2 │   -1.1 │    0.1 │   3.5 │   ▂█▆▂  │  │
│ │ IW_log_ratio_ha        │    3500 │     13 │   -0.46 │   0.95 │    -3.2 │     -1 │  0.039 │   2.9 │  ▁▃█▅▁  │  │
│ │ WH_log_ratio_ha        │    3400 │     13 │   -0.48 │      1 │    -3.9 │   -1.1 │  0.074 │   3.2 │   ▃█▇▂  │  │
│ │ GB_log_ratio_ha        │   12000 │     45 │    -0.5 │   0.97 │    -3.5 │   -1.1 │  0.039 │   2.9 │   ▃█▆▂  │  │
│ │ LB_log_ratio_ha        │    3400 │     13 │   -0.48 │      1 │    -3.9 │   -1.1 │  0.077 │   3.2 │   ▃█▇▂  │  │
│ │ SJ_log_ratio_ha        │    8900 │     34 │   -0.51 │      1 │    -3.7 │   -1.1 │  0.056 │     3 │   ▃█▆▂  │  │
│ │ BS_log_ratio_ha        │   12000 │     45 │    -0.5 │   0.98 │    -3.5 │   -1.1 │  0.039 │   2.7 │  ▁▃█▇▂  │  │
│ └────────────────────────┴─────────┴────────┴─────────┴────────┴─────────┴────────┴────────┴───────┴─────────┘  │
│                                                    category                                                     │
│ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓  │
│ ┃ column_name                        ┃ NA        ┃ NA %          ┃ ordered               ┃ unique            ┃  │
│ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩  │
│ │ match_winner                       │         0 │             0 │ False                 │                 3 │  │
│ └────────────────────────────────────┴───────────┴───────────────┴───────────────────────┴───────────────────┘  │
╰────────────────────────────────────────────────────── End ──────────────────────────────────────────────────────╯

EDA: Missing Value Plots of matches_betting_odds table

The missing value structure appears to be specific to each betting website: the betting odds variables of the same website cluster together by their pattern of missing values.

Code

msno.matrix(matches_betting_odds, figsize=(10, 5));

Code

msno.dendrogram(matches_betting_odds);
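The same pattern can be checked numerically. The sketch below is a minimal example on a made-up toy table (column names follow the naming used above, values are hypothetical): it computes the share of missing values per column and the correlation between missingness indicators, where a value of 1 means two columns are always missing together.

```python
import numpy as np
import pandas as pd

# Toy odds table: columns of the same (hypothetical) website share
# the same rows with missing values.
odds = pd.DataFrame(
    {
        "B365_home_wins": [2.1, 1.8, np.nan, 3.0, 2.5],
        "B365_draw": [3.3, 3.4, np.nan, 3.2, 3.1],
        "PS_home_wins": [np.nan, np.nan, 2.2, 3.1, np.nan],
        "PS_draw": [np.nan, np.nan, 3.5, 3.3, np.nan],
    }
)

# Share of missing values per column.
na_share = odds.isna().mean()

# Correlation between missingness indicators (True/False cast to 1/0).
na_corr = odds.isna().astype(int).corr()

print(na_share)
print(na_corr.round(2))
```

Columns that belong to the same website get a missingness correlation of 1 in this toy example, which is the kind of structure that makes them cluster together in a dendrogram.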

EDA: Data Profiling Report of matches_betting_odds

Code

matches_betting_odds.shape

(25979, 57)

Code

if do_eda:
    profile_match_odds = eda.ProfileReport(
        matches_betting_odds,
        title="Data Profiling Report: matches_betting_odds",
        config_file="_config/ydata_profile_config--mini.yaml",
    )

    profile_match_odds

Code

cor_betting = matches_betting_odds.select_dtypes("number").corr()

There are 5 types of variables related to betting odds: odds for home wins, away wins, and draws, as well as the ratio and log-ratio of home-win to away-win odds. Odds of the same type are highly correlated across websites (see Fig. 5.13).

Code

plt.figure(figsize=(15, 11))
sns.heatmap(cor_betting.round(2), vmin=-1, vmax=1, annot=False, cmap="BrBG");

**Fig. 5.13.** Heatmap of Pearson’s correlation coefficients between betting odds, goal statistics, and some other variables.

Code

def name_replace(x):
    return x.replace("B365_", "B365 ")


new_names = {"goal_diff": "Match goal difference (home–away)"}

(
    matches_betting_odds.filter(regex="B365_|diff$")
    .corr()
    .rename(columns=new_names, index=new_names)
    .rename(columns=name_replace, index=name_replace)
)

**Table 5.12.** Correlation between match goal difference and betting odds from the `B365` (bet365.com) website. `ratio_ha` is the ratio of betting odds for home and away teams.
	Match goal difference (home–away)	B365 home_wins	B365 draw	B365 away_wins	B365 ratio_ha	B365 log_ratio_ha
Match goal difference (home–away)	1.00	-0.38	0.24	0.40	-0.35	-0.46
B365 home_wins	-0.38	1.00	0.02	-0.47	0.99	0.82
B365 draw	0.24	0.02	1.00	0.82	0.09	-0.45
B365 away_wins	0.40	-0.47	0.82	1.00	-0.41	-0.83
B365 ratio_ha	-0.35	0.99	0.09	-0.41	1.00	0.77
B365 log_ratio_ha	-0.46	0.82	-0.45	-0.83	0.77	1.00
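To make the two derived variables concrete, here is a minimal sketch of how a ratio and a log-ratio of decimal odds could be computed. It assumes that `ratio_ha` is "home wins" odds divided by "away wins" odds and that the natural logarithm is used; the summary statistics shown earlier are consistent with this, but the exact definition is an assumption, and the function name and odds values are hypothetical.

```python
import math


def odds_ratio_features(home_odds: float, away_odds: float) -> tuple[float, float]:
    """Return (ratio, log-ratio) of home-win to away-win decimal odds."""
    ratio = home_odds / away_odds
    return ratio, math.log(ratio)


# Hypothetical odds: a clear home favourite and the mirrored situation.
ratio_fav, log_fav = odds_ratio_features(1.5, 6.0)
ratio_dog, log_dog = odds_ratio_features(6.0, 1.5)

print(f"home favourite: ratio = {ratio_fav:.2f}, log-ratio = {log_fav:.2f}")
print(f"away favourite: ratio = {ratio_dog:.2f}, log-ratio = {log_dog:.2f}")
```

The raw ratio is squeezed into the interval (0, 1) when the home team is the favourite and is unbounded otherwise, whereas the log-ratio is symmetric around zero. This may be one reason why the log-ratio relates more linearly to goal difference than the raw ratio does.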

Details: Clustered version of the heatmap

Clustered heatmap of Pearson’s correlation coefficients between betting odds, goal statistics, and some other variables.

Code

sns.clustermap(
    cor_betting.round(1),
    vmin=-1,
    vmax=1,
    annot=False,
    cmap="BrBG",
    method="centroid",
    figsize=(15, 15),
);

Code

del cor_betting

5.8 Team Score Prediction

Can we predict how many goals each team will score in each match?

In this section, the number of goals each team scores in a match is modeled.
As a reference, the standard deviation of goals was calculated: SD = 1.26 goals.
The initial idea was to select 4 final models according to the types of variables that become available at different times before the match (team-related features, player-related features, betting odds, and one model based on all types of variables), so:
- Three separate models for the 3 predictor types (team-related features, player-related features, and betting odds) were created.
- Models with all 3 feature types as well as PCA features were also among the candidates, but they did not improve cross-validation performance and were discarded (see Table 5.13).
Finally, only a single model was selected:
- Models with team-related features (train RMSE=1.24, R²=0.03) and player-related features (train RMSE=1.20, R²=0.09) performed poorly and explained very little variation in the target variable (R²<0.15), so they were also discarded (see Table 5.14).
- In two cases (a. the betting-odds-based model and b. the model where all variables were candidate predictors), the same RF model with a single variable B365_win (betting odds that the team wins) was selected. Its test performance is RMSE=1.16, R²=0.15.
Whether betting odds are a reliable predictor is debatable due to their nature (they are the output of another model, they change frequently, etc.). Yet, in this analysis, betting odds were the only type of predictor that yielded a model with a minimally reasonable amount of explained variance (R²≥0.15).
Conclusion: there is a lot of randomness in the game, so, based on the available data, it is hard to reliably predict in advance how many goals a team will score.

The summary of the results is presented in Tables 5.13 and 5.14.

The details are in the subsections below.

Code

target_sd = team_train[team_target].std()

print(
    "Standard deviation (SD) of target variable in training set: "
    f"{round(target_sd, 2)} goals"
)

Standard deviation (SD) of target variable in training set: 1.26 goals
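The SD of the target is a useful yardstick because a model that always predicts the mean has an RMSE roughly equal to the SD. The sketch below shows the approximate link between RMSE, SD, and R², assuming R² ≈ 1 − (RMSE / SD)²; the helper name `r2_from_rmse` is hypothetical, and the relation is only approximate when RMSE and SD come from different data splits (e.g., test RMSE versus training SD).

```python
def r2_from_rmse(rmse: float, target_sd: float) -> float:
    """Approximate R² as 1 - MSE / variance of the target."""
    return 1 - (rmse / target_sd) ** 2


target_sd = 1.26  # SD of goals in the training set (printed above)

# RMSE values quoted in the summary of this section.
for label, rmse in [("team", 1.24), ("player", 1.20), ("betting odds", 1.16)]:
    r2 = r2_from_rmse(rmse, target_sd)
    print(f"{label}: RMSE = {rmse:.2f}, R² ≈ {r2:.2f}")
```

Plugging in the quoted RMSE values gives approximately 0.03, 0.09, and 0.15, which agrees with the R² values reported in the summary. It also shows how little room there is between the best model (RMSE = 1.16) and the no-information baseline (SD = 1.26).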

5.8.1 Team-Related Features as Predictors

Linear Regression

Code

# Do SFS or take results from cache
def fun_sfs_res_team_team():
    np.random.seed(250)
    estimator = LinearRegression()
    subset = [team_target, "team_type", *team_vars_team]
    X, y = team_train[subset].make_dummies(exclude=team_target)
    return ml.sfs(estimator, "regression").fit(X, y)


file = "saved-output/sfs_res_team_team.pickle"
sfs_res_team_team = my.cached_results(file, fun_sfs_res_team_team)

Code

ml.sfs_plot_results(
    sfs_res_team_team,
    "Predictors: Team-Related Features (Linear Regression)",
    target_sd,
);

**Fig. 5.14.** SFS results. Red dashed reference line indicates SD of target variable.

k = 2, avg. RMSE = 1.241 [Parsimonious]
(Smallest number of predictors at best ± 1 SE score)

Details: Numeric values of RMSE

Code

(
    ml.sfs_list_features(sfs_res_team_team)
    .style.format(precision=3)
    .highlight_min(subset=["RMSE"])
)

	added_feature	RMSE	RMSE_improvement	RMSE_percentage_change
step
1	team_type__home	1.245	nan	nan
2	buildUpPlayPassing	1.241	0.004	0.306
3	defencePressure	1.240	0.001	0.060
4	chanceCreationShooting	1.240	0.000	0.015
5	buildUpPlaySpeed	1.240	0.000	0.001
6	defenceTeamWidth	1.240	0.000	0.001
7	chanceCreationPassing	1.240	-0.000	-0.004
8	chanceCreationCrossing	1.241	-0.001	-0.045
9	defenceAggression	1.242	-0.001	-0.111
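The "Parsimonious" choice reported above (smallest number of predictors at best ± 1 SE score) can be illustrated with a short sketch of the one-standard-error rule. The function name `parsimonious_k` is hypothetical, and the SE value is an assumption made for illustration, since the standard errors are not listed in the table; this is not necessarily how `ml.sfs` implements the rule.

```python
def parsimonious_k(scores: list[float], se: float) -> int:
    """Smallest k whose RMSE is within one SE of the best (lowest) RMSE.

    scores[i] is the cross-validated RMSE of the model with i + 1 features.
    """
    best = min(scores)
    return next(k for k, s in enumerate(scores, start=1) if s <= best + se)


# RMSE per step from the table above; the SE value is hypothetical.
rmse_by_step = [1.245, 1.241, 1.240, 1.240, 1.240, 1.240, 1.240, 1.241, 1.242]

k = parsimonious_k(rmse_by_step, se=0.002)
print(f"Parsimonious number of predictors: k = {k}")
```

With the assumed SE of 0.002, the rule picks k = 2, because the two-feature model (RMSE = 1.241) is already within one SE of the lowest RMSE (1.240). Setting `se=0.0` would instead return the first step that reaches the minimum.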

Random Forest

Code

# Do SFS or take results from cache
def fun_sfs_res_team_team_rf():
    np.random.seed(250)
    estimator = RandomForestRegressor()
    subset = [team_target, "team_type", *team_vars_team]
    X, y = team_train[subset].make_dummies(exclude=team_target)
    return ml.sfs(estimator, "regression").fit(X, y)


file = "saved-output/sfs_res_team_team_rf.pickle"
sfs_res_team_team_rf = my.cached_results(file, fun_sfs_res_team_team_rf)

Code

ml.sfs_plot_results(
    sfs_res_team_team_rf,
    "Predictors: Team-Related Features (Random Forest)",
    target_sd,
);

**Fig. 5.15.** SFS results. Red dashed reference line indicates SD of target variable.

k = 2, avg. RMSE = 1.242 [Parsimonious]
(Smallest number of predictors at best ± 1 SE score)

Details: Numeric values of RMSE

Code

(
    ml.sfs_list_features(sfs_res_team_team_rf)
    .style.format(precision=3)
    .highlight_min(subset=["RMSE"])
)

	added_feature	RMSE	RMSE_improvement	RMSE_percentage_change
step
1	team_type__home	1.245	nan	nan
2	buildUpPlayPassing	1.242	0.003	0.207
3	chanceCreationCrossing	1.278	-0.036	-2.906
4	buildUpPlaySpeed	1.287	-0.008	-0.644
5	defenceAggression	1.292	-0.005	-0.383
6	chanceCreationShooting	1.294	-0.003	-0.221
7	chanceCreationPassing	1.295	-0.001	-0.047
8	defenceTeamWidth	1.298	-0.003	-0.225
9	defencePressure	1.317	-0.019	-1.469

5.8.2 Player-Related Features as Predictors

Linear Regression

Code

# Do SFS or take results from cache
def fun_sfs_res_team_player():
    np.random.seed(250)
    estimator = LinearRegression()
    subset = [team_target, "team_type", *team_vars_player]
    X, y = team_train[subset].make_dummies(exclude=team_target)
    return ml.sfs(estimator, "regression", 30).fit(X, y)


file = "saved-output/sfs_res_team_player.pickle"
sfs_res_team_player = my.cached_results(file, fun_sfs_res_team_player)

Code

ml.sfs_plot_results(
    sfs_res_team_player,
    "Predictors: Player-Related Features (Linear Regression)",
    target_sd,
);

**Fig. 5.16.** SFS results. Red dashed reference line indicates SD of target variable.

k = 30, avg. RMSE = 1.198 [Best]
(Number of predictors at best score)

Details: Numeric values of RMSE

Code

(
    ml.sfs_list_features(sfs_res_team_player)
    .style.format(precision=3)
    .highlight_min(subset=["RMSE"])
)

	added_feature	RMSE	RMSE_improvement	RMSE_percentage_change
step
1	dribbling__mean	1.232	nan	nan
2	team_type__home	1.216	0.015	1.242
3	agility__mean	1.214	0.003	0.215
4	height__max	1.211	0.002	0.190
5	standing_tackle__mean	1.209	0.002	0.165
6	short_passing__mean	1.207	0.003	0.212
7	ball_control__min	1.206	0.001	0.077
8	gk_reflexes__max	1.205	0.001	0.068
9	overall_rating__max	1.204	0.001	0.082
10	player_age__mean	1.203	0.001	0.062
11	strength__mean	1.203	0.001	0.048
12	balance__max	1.202	0.001	0.046
13	aggression__mean	1.202	0.000	0.029
14	penalties__mean	1.202	0.000	0.026
15	penalties__std	1.201	0.000	0.031
16	crossing__max	1.201	0.000	0.021
17	sprint_speed__std	1.201	0.000	0.015
18	vision__mean	1.200	0.000	0.036
19	vision__std	1.200	0.001	0.066
20	weight_kg__mean	1.199	0.000	0.018
21	free_kick_accuracy__max	1.199	0.000	0.018
22	bmi__min	1.199	0.000	0.012
23	stamina__min	1.199	0.000	0.008
24	aggression__std	1.199	0.000	0.008
25	balance__std	1.199	0.000	0.007
26	jumping__mean	1.199	0.000	0.005
27	reactions__min	1.199	0.000	0.003
28	gk_kicking__std	1.199	0.000	0.003
29	gk_diving__min	1.198	0.000	0.005
30	penalties__max	1.198	0.000	0.004

Random Forests

Details: Feature importances

Code

def fun_rf_team_player():
    np.random.seed(250)
    rf = RandomForestRegressor(n_jobs=-1)
    subset = [team_target, "team_type", *team_vars_player]
    X, y = team_train[subset].make_dummies(exclude=team_target)
    return rf.fit(X, y)


file = "saved-output/rf_team_player.pickle"
rf_team_player = my.cached_results(file, fun_rf_team_player)

rf_team_player_importances = ml.get_rf_importances(rf_team_player)

ml.plot_importances(rf_team_player_importances, n=10);

Code

(
    rf_team_player_importances.nlargest(10, "importance")
    .style.format(precision=4)
    .bar()
)

	features	importance
41	dribbling__mean	0.0503
156	team_type__home	0.0257
153	player_age__mean	0.0232
155	player_age__max	0.0177
152	player_age__min	0.0175
57	ball_control__mean	0.0140
69	agility__mean	0.0118
18	potential__std	0.0108
2	height__std	0.0107
25	finishing__mean	0.0107

Code

# Do SFS or take results from cache
def fun_sfs_res_team_player_rf():
    np.random.seed(250)
    estimator = RandomForestRegressor()
    # fmt: off
    subset = [
        team_target, 'dribbling__mean', 'team_type__home', 'player_age__mean',
       'player_age__max', 'player_age__min', 'ball_control__mean',
       'agility__mean', 'potential__std', 'height__std',
       'finishing__mean'
    ]
    # fmt: on
    X, y = team_train[subset].make_dummies(exclude=team_target)
    return ml.sfs(estimator, "regression", 10).fit(X, y)


file = "saved-output/sfs_res_team_player_rf.pickle"
sfs_res_team_player_rf = my.cached_results(file, fun_sfs_res_team_player_rf)

Code

ml.sfs_plot_results(
    sfs_res_team_player_rf,
    "Predictors: Player-Related Features (Random Forests)",
    target_sd,
);

**Fig. 5.17.** SFS results. Red dashed reference line indicates SD of target variable.

k = 10, avg. RMSE = 1.224 [Best]
(Number of predictors at best score)

Details: Numeric values of RMSE

Code

(
    ml.sfs_list_features(sfs_res_team_player_rf)
    .head(20)
    .style.format(precision=3)
    .highlight_min(subset=["RMSE"])
)

	added_feature	RMSE	RMSE_improvement	RMSE_percentage_change
step
1	dribbling__mean	1.236	nan	nan
2	team_type__home	1.229	0.007	0.595
3	player_age__mean	1.325	-0.097	-7.865
4	player_age__max	1.265	0.060	4.537
5	potential__std	1.247	0.018	1.408
6	height__std	1.238	0.009	0.760
7	finishing__mean	1.232	0.006	0.499
8	ball_control__mean	1.229	0.002	0.171
9	player_age__min	1.226	0.003	0.271
10	agility__mean	1.224	0.002	0.151

5.8.3 Betting Odds as Predictors

Linear Regression

Code

# Do SFS or take results from cache
def fun_sfs_res_team_betting():
    np.random.seed(250)
    estimator = LinearRegression()
    subset = [team_target, "team_type", *team_vars_betting_odds]
    X, y = team_train[subset].make_dummies(exclude=team_target)
    return ml.sfs(estimator, "regression").fit(X, y)


file = "saved-output/sfs_res_team_betting.pickle"
sfs_res_team_betting = my.cached_results(file, fun_sfs_res_team_betting)

Code

ml.sfs_plot_results(
    sfs_res_team_betting,
    "Predictors: Betting Odds (Linear Regression)",
);

k = 2, avg. RMSE = 1.162 [Parsimonious]
(Smallest number of predictors at best ± 1 SE score)

Details: Numeric values of RMSE

Code

(
    ml.sfs_list_features(sfs_res_team_betting)
    .style.format(precision=3)
    .highlight_min(subset=["RMSE"])
)

	added_feature	RMSE	RMSE_improvement	RMSE_percentage_change
step
1	VC_log_ratio_wl	1.168	nan	nan
2	B365_loose	1.162	0.006	0.524
3	IW_ratio_wl	1.162	0.000	0.028
4	IW_win	1.161	0.000	0.031
5	VC_win	1.161	0.000	0.013
6	VC_ratio_wl	1.161	0.000	0.034
7	BW_ratio_wl	1.161	-0.000	-0.000
8	LB_ratio_wl	1.161	-0.000	-0.000
9	LB_log_ratio_wl	1.161	-0.000	-0.000
10	LB_loose	1.161	0.000	0.000
11	B365_log_ratio_wl	1.161	0.000	0.000
12	B365_ratio_wl	1.161	-0.000	-0.001
13	WH_ratio_wl	1.161	-0.000	-0.001
14	WH_win	1.161	0.000	0.002
15	IW_log_ratio_wl	1.161	-0.000	-0.002
16	WH_log_ratio_wl	1.161	-0.000	-0.003
17	BW_win	1.161	-0.000	-0.006
18	B365_win	1.161	-0.000	-0.004
19	WH_loose	1.161	-0.000	-0.006
20	VC_loose	1.161	-0.000	-0.003
21	IW_loose	1.161	-0.000	-0.003
22	BW_loose	1.161	-0.000	-0.003
23	BW_log_ratio_wl	1.161	-0.000	-0.003
24	LB_win	1.161	-0.000	-0.009
25	team_type__home	1.161	-0.000	-0.010

Random Forests

Details: Feature importances

Code

def fun_rf_team_betting():
    np.random.seed(250)
    rf = RandomForestRegressor(n_jobs=-1)
    subset = [team_target, "team_type", *team_vars_betting_odds]
    X, y = team_train[subset].make_dummies(exclude=team_target)
    return rf.fit(X, y)


file = "saved-output/rf_team_betting.pickle"
rf_team_betting = my.cached_results(file, fun_rf_team_betting)

rf_team_betting_importances = ml.get_rf_importances(rf_team_betting)

ml.plot_importances(rf_team_betting_importances, n=30);

Code

(rf_team_betting_importances.style.format(precision=4).bar())

	features	importance
2	VC_win	0.0928
0	B365_win	0.0646
4	WH_win	0.0600
1	BW_win	0.0539
19	BW_log_ratio_wl	0.0484
13	BW_ratio_wl	0.0482
14	VC_ratio_wl	0.0421
20	VC_log_ratio_wl	0.0415
17	LB_ratio_wl	0.0408
23	LB_log_ratio_wl	0.0400
16	WH_ratio_wl	0.0398
22	WH_log_ratio_wl	0.0393
7	BW_loose	0.0378
12	B365_ratio_wl	0.0358
18	B365_log_ratio_wl	0.0357
5	LB_win	0.0351
15	IW_ratio_wl	0.0344
21	IW_log_ratio_wl	0.0339
8	VC_loose	0.0289
9	IW_loose	0.0286
11	LB_loose	0.0283
3	IW_win	0.0273
10	WH_loose	0.0259
6	B365_loose	0.0233
24	team_type__home	0.0135

Code

# Do SFS or take results from cache
def fun_sfs_res_team_betting_rf():
    np.random.seed(250)
    estimator = RandomForestRegressor()
    subset = [team_target, "team_type", *team_vars_betting_odds]
    X, y = team_train[subset].make_dummies(exclude=team_target)
    return ml.sfs(estimator, "regression").fit(X, y)


file = "saved-output/sfs_res_team_betting_rf.pickle"
sfs_res_team_betting_rf = my.cached_results(file, fun_sfs_res_team_betting_rf)

Code

ml.sfs_plot_results(
    sfs_res_team_betting_rf,
    "Predictors: Betting Odds (Random Forests)",
    team_train[team_target].std(),
);

**Fig. 5.19.** SFS results. Red dashed reference line indicates SD of target variable.

k = 1, avg. RMSE = 1.162 [Parsimonious]
(Smallest number of predictors at best ± 1 SE score)

Details: Numeric values of RMSE

Code

(
    ml.sfs_list_features(sfs_res_team_betting_rf)
    .style.format(precision=3)
    .highlight_min(subset=["RMSE"])
)

	added_feature	RMSE	RMSE_improvement	RMSE_percentage_change
step
1	B365_win	1.162	nan	nan
2	team_type__home	1.165	-0.003	-0.245
3	WH_win	1.194	-0.029	-2.496
4	B365_ratio_wl	1.257	-0.063	-5.243
5	B365_loose	1.256	0.001	0.072
6	B365_log_ratio_wl	1.255	0.001	0.041
7	BW_log_ratio_wl	1.271	-0.016	-1.244
8	LB_ratio_wl	1.227	0.044	3.449
9	VC_log_ratio_wl	1.211	0.016	1.343
10	IW_log_ratio_wl	1.203	0.008	0.645
11	WH_loose	1.202	0.001	0.106
12	BW_ratio_wl	1.199	0.002	0.177
13	LB_log_ratio_wl	1.200	-0.001	-0.058
14	VC_loose	1.200	-0.000	-0.019
15	BW_loose	1.200	0.000	0.007
16	BW_win	1.200	0.001	0.056
17	LB_loose	1.199	0.001	0.066
18	WH_log_ratio_wl	1.199	-0.000	-0.041
19	LB_win	1.198	0.001	0.102
20	IW_ratio_wl	1.199	-0.001	-0.070
21	IW_win	1.199	0.000	0.014
22	VC_ratio_wl	1.199	-0.001	-0.055
23	WH_ratio_wl	1.199	0.001	0.059
24	IW_loose	1.200	-0.002	-0.142
25	VC_win	1.201	-0.001	-0.087
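
The values in the table above are consistent with `RMSE_improvement` being the decrease in RMSE compared with the previous step and `RMSE_percentage_change` being that decrease relative to the previous RMSE. The sketch below shows this arithmetic; it is an assumption about how the columns are derived (the `ml.sfs_list_features` source is not shown here), and results recomputed from the rounded RMSE values differ slightly from the printed ones.

```python
def improvement_columns(scores, lower_is_better=True):
    """Return (step, score, improvement, percentage_change) for each step."""
    rows = []
    previous = None
    for step, score in enumerate(scores, start=1):
        if previous is None:
            rows.append((step, score, None, None))  # nothing to compare with
        else:
            # Positive improvement means the model got better
            gain = previous - score if lower_is_better else score - previous
            rows.append((step, score, gain, 100 * gain / previous))
        previous = score
    return rows


# First four RMSE values of the table above (rounded to 3 decimals)
rows = improvement_columns([1.162, 1.165, 1.194, 1.257])
for step, score, gain, pct in rows:
    print(step, score, gain, pct)
```

Negative values therefore indicate steps at which adding a feature made the cross-validated RMSE worse.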

5.8.4 All Variables as Predictors

Linear Regression

Code

# Do SFS or take results from cache
def fun_sfs_res_team_all():
    np.random.seed(250)
    estimator = LinearRegression()
    X, y = team_train.make_dummies(exclude=team_target)
    return ml.sfs(estimator, "regression", 50).fit(X, y)


file = "saved-output/sfs_res_team_all.pickle"
sfs_res_team_all = my.cached_results(file, fun_sfs_res_team_all)
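
The idea behind sequential feature selection can be sketched in a few lines, assuming the forward variant (features are added one at a time). The function below is a simplified illustration and not the implementation behind `ml.sfs`: instead of cross-validating a model, it takes any `score` function that returns an error for a feature subset, and the feature names and effects are made up.

```python
def forward_selection(features, score, max_features=None):
    """Greedy forward selection: at each step add the feature with the lowest error.

    Returns a list of (added_feature, error_after_adding) tuples.
    """
    remaining = list(features)
    selected, history = [], []
    limit = len(remaining) if max_features is None else min(max_features, len(remaining))
    while len(selected) < limit:
        # Try every remaining feature together with the already selected ones
        best = min(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
        remaining.remove(best)
        history.append((best, score(selected)))
    return history


# Toy error function: each (hypothetical) feature shifts the error by a fixed amount
effect = {"odds": -0.30, "age": -0.10, "noise": 0.05}


def toy_error(subset):
    return 1.0 + sum(effect[f] for f in subset)


history = forward_selection(effect, toy_error)
best_step = min(range(len(history)), key=lambda i: history[i][1]) + 1
```

Note that the procedure keeps adding features even when the error increases (here, at the last step), which is why the SFS tables in this section contain steps with negative improvement; the final number of features is chosen afterwards from the whole path.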

Code

ml.sfs_plot_results(
    sfs_res_team_all, "Predictors: All Variables (Linear Regression)"
);

k = 50, avg. RMSE = 1.156 [Best]
(Number of predictors at best score)

Details: Numeric values of RMSE

Code

(
    ml.sfs_list_features(sfs_res_team_all)
    .head(20)
    .style.format(precision=3)
    .highlight_min(subset=["RMSE"])
)

	added_feature	RMSE	RMSE_improvement	RMSE_percentage_change
step
1	VC_log_ratio_wl	1.168	nan	nan
2	B365_loose	1.162	0.006	0.524
3	dribbling__mean	1.161	0.001	0.076
4	standing_tackle__mean	1.159	0.002	0.135
5	weight_kg__mean	1.159	0.001	0.060
6	gk_positioning__std	1.158	0.000	0.026
7	short_passing__mean	1.158	0.000	0.028
8	BW_ratio_wl	1.158	0.000	0.023
9	acceleration__max	1.158	0.000	0.019
10	ball_control__min	1.157	0.000	0.016
11	gk_diving__max	1.157	0.000	0.013
12	BW_win	1.157	0.000	0.011
13	curve__std	1.157	0.000	0.009
14	standing_tackle__max	1.157	0.000	0.007
15	marking__mean	1.157	0.000	0.010
16	volleys__std	1.157	0.000	0.009
17	height__std	1.157	0.000	0.008
18	LB_win	1.157	0.000	0.005
19	LB_log_ratio_wl	1.157	0.000	0.011
20	vision__min	1.156	0.000	0.008

Random Forests

Details: Feature importances

The plot below shows all importances in sorted order (upper subplot) and the 30 most important features (lower subplot).

Code

def fun_rf_team_all():
    np.random.seed(250)
    rf = RandomForestRegressor(n_jobs=-1)
    X, y = team_train.make_dummies(team_target)
    return rf.fit(X, y)


file = "saved-output/rf_team_all.pickle"
rf_team_all = my.cached_results(file, fun_rf_team_all)


rf_team_all_importances = ml.get_rf_importances(rf_team_all)

ml.plot_importances(rf_team_all_importances);

Code

(
    rf_team_all_importances.nlargest(20, "importance")
    .style.format(precision=4)
    .bar()
)

	features	importance
166	VC_win	0.0639
164	B365_win	0.0417
168	WH_win	0.0327
165	BW_win	0.0161
161	player_age__mean	0.0107
160	player_age__min	0.0105
163	player_age__max	0.0094
82	reactions__std	0.0091
26	potential__std	0.0088
17	bmi__mean	0.0088
22	overall_rating__std	0.0087
162	player_age__std	0.0085
94	jumping__std	0.0084
10	height__std	0.0082
110	aggression__std	0.0081
34	finishing__std	0.0081
14	weight_kg__std	0.0079
58	free_kick_accuracy__std	0.0079
93	jumping__mean	0.0078
106	long_shots__std	0.0078

Code

# Do SFS or take results from cache
def fun_sfs_res_team_all_rf():
    np.random.seed(250)
    estimator = RandomForestRegressor()
    # fmt: off
    subset = [
       team_target,
       "VC_win", "B365_win", "WH_win", "BW_win", "player_age__mean",
       "player_age__min", "player_age__max", "reactions__std",
       "potential__std", "bmi__mean", 
       "dribbling__mean", "team_type"
    ]
    # fmt: on
    X, y = team_train[subset].make_dummies(exclude=team_target)
    return ml.sfs(estimator, "regression").fit(X, y)


file = "saved-output/sfs_res_team_all_rf.pickle"
sfs_res_team_all_rf = my.cached_results(file, fun_sfs_res_team_all_rf)

Code

ml.sfs_plot_results(
    sfs_res_team_all_rf,
    "Predictors: All Variables (Random Forest)",
    target_sd,
);

**Fig. 5.21.** SFS results. Red dashed reference line indicates SD of target variable.

k = 1, avg. RMSE = 1.162 [Parsimonious]
(Smallest number of predictors at best ± 1 SE score)

Details: Numeric values of RMSE

Code

(
    ml.sfs_list_features(sfs_res_team_all_rf)
    .style.format(precision=3)
    .highlight_min(subset=["RMSE"])
)

	added_feature	RMSE	RMSE_improvement	RMSE_percentage_change
step
1	B365_win	1.162	nan	nan
2	team_type__home	1.165	-0.003	-0.264
3	WH_win	1.194	-0.029	-2.499
4	player_age__mean	1.257	-0.063	-5.246
5	bmi__mean	1.209	0.048	3.834
6	player_age__max	1.196	0.013	1.095
7	dribbling__mean	1.188	0.008	0.671
8	player_age__min	1.185	0.002	0.194
9	potential__std	1.184	0.001	0.124
10	VC_win	1.183	0.001	0.056
11	reactions__std	1.183	0.000	0.006
12	BW_win	1.183	0.000	0.005

5.8.5 PCA Features of All Variables as Predictors

An attempt was made to build a predictive model based on principal components instead of the original numeric variables. The PCA scree plot suggests that 4 or 6 components are reasonable choices, as an “elbow” is visible at these points. Six components explain 56% of the variance. To explain 80% of the variance, 27 components are needed.

Code

_, team_train_num, _ = ml.get_columns_by_purpose(team_train, team_target)
_, _, pca_obj = ml.pca_screeplot(team_train_num, 60);

Code

pcs_6 = pca_obj.explained_variance_ratio_.cumsum()[5] * 100
print(f"First 6 PCs explain {pcs_6:.1f} % of variance.")

First 6 PCs explain 56.0 % of variance.

Code

n_pcs_80 = np.argwhere(pca_obj.explained_variance_ratio_.cumsum() >= 0.80).min()
print(f"Number of PCs needed to explain at least 80% of variance: {n_pcs_80}")

Number of PCs needed to explain at least 80% of variance: 27
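
The relation between the cumulative explained variance and the number of components can be illustrated with a standard-library sketch. The variance ratios below are hypothetical and the function name is not part of the project code. Because positions in a sequence are zero-based, the number of components is the position of the first cumulative value that reaches the threshold plus one.

```python
from itertools import accumulate


def n_components_for(explained_variance_ratio, threshold):
    """Return the number of leading PCs needed to reach `threshold`, or None."""
    for position, cumulative in enumerate(accumulate(explained_variance_ratio)):
        if cumulative >= threshold:
            return position + 1  # convert zero-based position to a count
    return None


# Hypothetical explained variance ratios of 6 PCs (they sum to 1)
ratios = [0.40, 0.25, 0.20, 0.08, 0.05, 0.02]

n_50 = n_components_for(ratios, 0.50)  # cumulative sums: 0.40, 0.65 -> 2 PCs
n_80 = n_components_for(ratios, 0.80)  # cumulative sums: 0.40, 0.65, 0.85 -> 3 PCs
```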

Code

d_target, d_num, d_other, d_pca, team_scale, team_pca = ml.do_pca(
    team_train, team_target, n_components=50
)
team_train_with_pca = pd.concat([d_target, d_other, d_pca], axis=1)

Linear Regression

Include 6 PCs in SFS.

Code

# Do SFS or take results from cache
def fun_sfs_res_team_pca_2():
    np.random.seed(250)
    estimator = LinearRegression()
    # fmt: off
    subset = [
        team_target,
        "team_type",
        "pc_1", "pc_2", "pc_3", "pc_4", "pc_5", "pc_6", 
    ]
    # fmt: on
    X, y = team_train_with_pca[subset].make_dummies(exclude=team_target)
    return ml.sfs(estimator, "regression").fit(X, y)


file = "saved-output/sfs_res_team_pca_2.pickle"
sfs_res_team_pca_2 = my.cached_results(file, fun_sfs_res_team_pca_2)

Code

ml.sfs_plot_results(
    sfs_res_team_pca_2,
    "Predictors: PCs of All Variables (Linear Regression)",
)

k = 4, avg. RMSE = 1.176 [Parsimonious]
(Smallest number of predictors at best ± 1 SE score)

Details: Numeric values of RMSE

Code

(
    ml.sfs_list_features(sfs_res_team_pca_2)
    .head(20)
    .style.format(precision=3)
    .highlight_min(subset=["RMSE"])
)

	added_feature	RMSE	RMSE_improvement	RMSE_percentage_change
step
1	pc_2	1.218	nan	nan
2	pc_1	1.181	0.037	3.020
3	pc_6	1.178	0.004	0.324
4	team_type__home	1.176	0.001	0.124
5	pc_4	1.176	0.000	0.007
6	pc_3	1.175	0.001	0.077
7	pc_5	1.185	-0.010	-0.832

Include 27 PCs in SFS.

Code

# Do SFS or take results from cache
def fun_sfs_res_team_pca():
    np.random.seed(250)
    estimator = LinearRegression()
    # fmt: off
    subset = [
        team_target,
        "team_type",
        "pc_1", "pc_2", "pc_3", "pc_4", "pc_5",
        "pc_6", "pc_7", "pc_8", "pc_9", "pc_10",
        "pc_11", "pc_12", "pc_13", "pc_14", "pc_15",
        "pc_16", "pc_17", "pc_18", "pc_19", "pc_20",
        "pc_21", "pc_22", "pc_23", "pc_24", "pc_25",
        "pc_26", "pc_27",
    ]
    # fmt: on
    X, y = team_train_with_pca[subset].make_dummies(exclude=team_target)
    return ml.sfs(estimator, "regression").fit(X, y)


file = "saved-output/sfs_res_team_pca.pickle"
sfs_res_team_pca = my.cached_results(file, fun_sfs_res_team_pca)

Code

ml.sfs_plot_results(
    sfs_res_team_pca, "Predictors: PCs of All Variables (Linear Regression)"
)

k = 5, avg. RMSE = 1.160 [Parsimonious]
(Smallest number of predictors at best ± 1 SE score)

Details: Numeric values of RMSE

Code

(
    ml.sfs_list_features(sfs_res_team_pca)
    .style.format(precision=3)
    .highlight_min(subset=["RMSE"])
)

	added_feature	RMSE	RMSE_improvement	RMSE_percentage_change
step
1	pc_2	1.218	nan	nan
2	pc_1	1.181	0.037	3.020
3	pc_7	1.171	0.011	0.903
4	pc_8	1.164	0.006	0.553
5	pc_6	1.160	0.004	0.340
6	pc_4	1.160	0.001	0.054
7	pc_3	1.159	0.001	0.069
8	pc_14	1.159	0.000	0.012
9	pc_15	1.159	0.000	0.011
10	team_type__home	1.159	0.000	0.012
11	pc_20	1.158	0.000	0.005
12	pc_13	1.158	0.000	0.003
13	pc_16	1.158	0.000	0.003
14	pc_10	1.158	0.000	0.001
15	pc_22	1.158	0.000	0.000
16	pc_11	1.158	0.000	0.000
17	pc_26	1.158	0.000	0.002
18	pc_12	1.158	-0.000	-0.001
19	pc_27	1.158	-0.000	-0.001
20	pc_24	1.158	-0.000	-0.002
21	pc_21	1.158	-0.000	-0.002
22	pc_5	1.158	-0.000	-0.002
23	pc_17	1.158	0.000	0.006
24	pc_19	1.158	-0.000	-0.002
25	pc_25	1.158	-0.000	-0.006
26	pc_18	1.159	-0.000	-0.008
27	pc_23	1.159	-0.000	-0.007
28	pc_9	1.159	-0.000	-0.011

Random Forests

Details: Feature importances

Random forest feature importance of principal components and categorical variables.

Code

def fun_rf_team_pca():
    np.random.seed(250)
    rf = RandomForestRegressor(n_jobs=-1)
    # fmt: off
    subset = [
        team_target,
        "team_type",
        "pc_1", "pc_2", "pc_3", "pc_4", "pc_5",
        "pc_6", "pc_7", "pc_8", "pc_9", "pc_10",
        "pc_11", "pc_12", "pc_13", "pc_14", "pc_15",
        "pc_16", "pc_17", "pc_18", "pc_19", "pc_20",
        "pc_21", "pc_22", "pc_23", "pc_24", "pc_25",
        "pc_26", "pc_27",
    ]
    # fmt: on
    X, y = team_train_with_pca[subset].make_dummies(team_target)
    return rf.fit(X, y)


file = "saved-output/rf_team_pca.pickle"
rf_team_pca = my.cached_results(file, fun_rf_team_pca)

rf_team_pca_importances = ml.get_rf_importances(rf_team_pca)

ml.plot_importances(rf_team_pca_importances, n=10);

Code

rf_team_pca_importances.style.format(precision=4).bar()

	features	importance
1	pc_2	0.1069
0	pc_1	0.1031
6	pc_7	0.0420
7	pc_8	0.0360
14	pc_15	0.0326
13	pc_14	0.0326
24	pc_25	0.0326
25	pc_26	0.0320
10	pc_11	0.0320
16	pc_17	0.0319
23	pc_24	0.0319
15	pc_16	0.0317
9	pc_10	0.0316
5	pc_6	0.0315
11	pc_12	0.0310
22	pc_23	0.0308
20	pc_21	0.0306
21	pc_22	0.0305
8	pc_9	0.0304
18	pc_19	0.0299
19	pc_20	0.0298
26	pc_27	0.0298
17	pc_18	0.0298
12	pc_13	0.0294
3	pc_4	0.0292
2	pc_3	0.0290
4	pc_5	0.0279
27	team_type__home	0.0036

Include 6 PCs in SFS.

Code

# Do SFS or take results from cache
def fun_sfs_res_team_pca_2_rf():
    np.random.seed(250)
    estimator = RandomForestRegressor()
    # fmt: off
    subset = [
        team_target,
        "team_type",
        "pc_1", "pc_2", "pc_3", "pc_4", "pc_5", "pc_6", 
    ]
    # fmt: on
    X, y = team_train_with_pca[subset].make_dummies(exclude=team_target)
    return ml.sfs(estimator, "regression").fit(X, y)


file = "saved-output/sfs_res_team_pca_2_rf.pickle"
sfs_res_team_pca_2_rf = my.cached_results(file, fun_sfs_res_team_pca_2_rf)

Code

ml.sfs_plot_results(
    sfs_res_team_pca_2_rf,
    "Predictors: Selected PCs (Random Forest)",
    team_train[team_target].std(),
);

k = 6, avg. RMSE = 1.201 [Parsimonious]
(Smallest number of predictors at best ± 1 SE score)

Details: Numeric values of RMSE

Code

(
    ml.sfs_list_features(sfs_res_team_pca_2_rf)
    .style.format(precision=3)
    .highlight_min(subset=["RMSE"])
)

	added_feature	RMSE	RMSE_improvement	RMSE_percentage_change
step
1	team_type__home	1.245	nan	nan
2	pc_1	1.457	-0.213	-17.073
3	pc_2	1.259	0.198	13.604
4	pc_3	1.214	0.045	3.562
5	pc_6	1.206	0.008	0.648
6	pc_5	1.201	0.006	0.458
7	pc_4	1.204	-0.003	-0.225

Include 27 PCs in SFS.

Code

# Do SFS or take results from cache
def fun_sfs_res_team_pca_rf():
    np.random.seed(250)
    estimator = RandomForestRegressor()
    # fmt: off
    subset = [
        team_target,
        "team_type",
        "pc_1", "pc_2", "pc_3", "pc_4", "pc_5",
        "pc_6", "pc_7", "pc_8", "pc_9", "pc_10",
        "pc_11", "pc_12", "pc_13", "pc_14", "pc_15",
        "pc_16", "pc_17", "pc_18", "pc_19", "pc_20",
        "pc_21", "pc_22", "pc_23", "pc_24", "pc_25",
        "pc_26", "pc_27",
    ]
    # fmt: on
    X, y = team_train_with_pca[subset].make_dummies(exclude=team_target)
    return ml.sfs(estimator, "regression").fit(X, y)


file = "saved-output/sfs_res_team_pca_rf.pickle"
sfs_res_team_pca_rf = my.cached_results(file, fun_sfs_res_team_pca_rf)

Code

ml.sfs_plot_results(
    sfs_res_team_pca_rf,
    "Predictors: Selected PCs (Random Forest)",
    team_train[team_target].std(),
);

k = 14, avg. RMSE = 1.180 [Parsimonious]
(Smallest number of predictors at best ± 1 SE score)

Details: Numeric values of RMSE

Code

(
    ml.sfs_list_features(sfs_res_team_pca_rf)
    .style.format(precision=3)
    .highlight_min(subset=["RMSE"])
)

	added_feature	RMSE	RMSE_improvement	RMSE_percentage_change
step
1	team_type__home	1.245	nan	nan
2	pc_1	1.459	-0.214	-17.194
3	pc_2	1.259	0.200	13.676
4	pc_3	1.216	0.044	3.474
5	pc_7	1.199	0.017	1.369
6	pc_9	1.192	0.007	0.591
7	pc_10	1.188	0.004	0.350
8	pc_11	1.186	0.002	0.163
9	pc_6	1.183	0.003	0.248
10	pc_20	1.183	0.000	0.006
11	pc_12	1.181	0.002	0.141
12	pc_8	1.181	0.000	0.029
13	pc_17	1.180	0.001	0.055
14	pc_24	1.180	-0.000	-0.007
15	pc_15	1.179	0.001	0.098
16	pc_16	1.179	0.000	0.002
17	pc_26	1.178	0.001	0.089
18	pc_14	1.179	-0.001	-0.071
19	pc_18	1.178	0.001	0.061
20	pc_19	1.179	-0.001	-0.101
21	pc_23	1.178	0.001	0.082
22	pc_27	1.179	-0.001	-0.057
23	pc_22	1.180	-0.001	-0.125
24	pc_21	1.180	0.000	0.034
25	pc_25	1.180	-0.000	-0.025
26	pc_5	1.181	-0.000	-0.017
27	pc_13	1.181	-0.000	-0.004
28	pc_4	1.184	-0.003	-0.268

5.8.6 Final Models

This subsection summarizes the results from the subsections above and evaluates the performance on the whole training and test sets.

Based on the training CV RMSE:

- Among the 3 groups of predictors (team-related features, player-related features, and betting odds), betting odds show the best predictive ability and team-related features show the worst (see Table 5.13).
- Compared with the original variables, PCs of these variables did not improve the predictions.
- For further investigation, 3 models were selected.

**Table 5.13.** **Regression** model selection results: selected models for each feature type and algorithm.
Feature type	Method	Number of features selected	Training CV RMSE	Selected as final model	Note
Team-related	Linear regression	k = 2	1.241	No⁴
	Random forest	k = 2	1.242	No
Player-related	Linear regression	k = 10	1.203	No⁴	Included¹: all; Max. allowed²: 30; With k = 20, RMSE: 1.199
	Random forest	k = 10	1.224	No	Included¹: 10
Betting odds	Linear regression	k = 2	1.162	No
	Random forest	k = 1	1.162	Yes⁵
All variables	Linear regression	k = 6	1.158	No	Included¹: all; Max. allowed²: 50
	Random forest	k = 1	1.162	Yes⁵	Included¹: 12; The same model as in “Betting odds | Random forest”
6 PCs of all³ variables	Linear regression	k = 4	1.176	No	Included¹: 7
	Random forest	k = 6	1.201	No	Included¹: 7
27 PCs of all³ variables	Linear regression	k = 5	1.160	No	Included¹: 28
	Random forest	k = 13	1.180	No	Included¹: 28; With k = 17, RMSE: 1.178

¹ – Number of features included in SFS selection.
² – Maximum allowed number of features to be selected.
³ – PCs of all numeric variables.
⁴ – Model was a candidate to become a final model but rejected due to low performance.
⁵ – In both cases the same model was selected.

Two of the 3 candidate final models were discarded due to low explained variance (R² < 0.15; see Table 5.14).

Code

np.random.seed(250)

# -----------------------------------------------------------------------
# Model 1: linear regression on team-related features (training set only)

subset_1 = [team_target, "team_type", "buildUpPlayPassing"]
X_train_1, y_train_1 = team_train[subset_1].make_dummies(exclude=team_target)

model_team_team = LinearRegression()
model_team_team.fit(X_train_1, y_train_1)

y_pred_train_1 = model_team_team.predict(X_train_1)

# -----------------------------------------------------------------------
# Model 2: linear regression on player-related features (training set only)

subset_2 = [
    team_target,
    "dribbling__mean",
    "team_type",
    "agility__mean",
    "height__max",
    "standing_tackle__mean",
    "short_passing__mean",
    "ball_control__min",
    "gk_reflexes__max",
    "overall_rating__max",
    "player_age__mean",
]
X_train_2, y_train_2 = team_train[subset_2].make_dummies(exclude=team_target)

model_team_player = LinearRegression()
model_team_player.fit(X_train_2, y_train_2)

y_pred_train_2 = model_team_player.predict(X_train_2)

# -----------------------------------------------------------------------
# Model 3: random forest on a single betting-odds feature (training and test sets)

subset_3 = [team_target, "B365_win"]
X_train_3, y_train_3 = team_train[subset_3].make_dummies(exclude=team_target)
X_test_3, y_test_3 = team_test[subset_3].make_dummies(exclude=team_target)

model_team_all = RandomForestRegressor(n_jobs=-1)
model_team_all.fit(X_train_3, y_train_3)

y_pred_train_3 = model_team_all.predict(X_train_3)
y_pred_test_3 = model_team_all.predict(X_test_3)

# -----------------------------------------------------------------------
# Collect the performance of all models into one table

pd.concat(
    [
        ml.get_regression_performance(
            y_train_1, y_pred_train_1, "Train (team-related features)"
        ),
        ml.get_regression_performance(
            y_train_2, y_pred_train_2, "Train (player-related features)"
        ),
        ml.get_regression_performance(
            y_train_3, y_pred_train_3, "Train (all features/betting odds)"
        ),
        ml.get_regression_performance(
            y_test_3, y_pred_test_3, "Test (all features/betting odds)"
        ),
    ]
).index_start_at(1).style.format(precision=2)

**Table 5.14.** Final evaluation of selected models for team goal prediction.
	set	n	SD	RMSE	R²	RMSE_SD_ratio	SD_RMSE_ratio
1	Train (team-related features)	27200	1.26	1.24	0.03	0.98	1.02
2	Train (player-related features)	27200	1.26	1.20	0.09	0.95	1.05
3	Train (all features/betting odds)	27200	1.26	1.16	0.16	0.92	1.09
4	Test (all features/betting odds)	5455	1.26	1.16	0.15	0.92	1.08
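
The columns of Table 5.14 can be related to each other with a short sketch. The function below is a hypothetical stand-in for `ml.get_regression_performance` (whose source is not shown here); it assumes that SD is the sample standard deviation of the observed target and that R² is computed as 1 − SS_res / SS_tot. The data are toy values.

```python
import math
from statistics import mean, stdev


def regression_performance(y_true, y_pred):
    """Compute n, SD, RMSE, R² and the RMSE-to-SD ratios for a set of predictions."""
    n = len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    y_mean = mean(y_true)
    ss_tot = sum((t - y_mean) ** 2 for t in y_true)             # total sum of squares
    rmse = math.sqrt(ss_res / n)
    sd = stdev(y_true)  # sample SD (assumption)
    return {
        "n": n,
        "SD": sd,
        "RMSE": rmse,
        "R2": 1 - ss_res / ss_tot,
        "RMSE_SD_ratio": rmse / sd,
        "SD_RMSE_ratio": sd / rmse,
    }


# Toy example: goals scored (observed) and predicted
perf = regression_performance([1, 2, 3, 4], [1.5, 1.5, 3.5, 3.5])
print(perf)
```

An RMSE close to the SD of the target (ratio near 1) means that the model predicts only slightly better than always predicting the mean, which corresponds to an R² close to 0.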

5.9 Match Outcome Prediction

Can we predict which team will win the match?

In this section, the outcome of the match (home wins, draw, away wins) is modeled.
The initial idea was to select 4 models, one from each feature type group, as these features become available at different times before the match. However, the model based on team-related features showed low performance. The model based on all types of variables was rejected due to possible overfitting, in favor of a less complex model with 1 variable based on betting odds: these models share the same most important feature, and the inclusion of additional features only slightly improved model performance on the training set. So only 2 final models were selected.
The test performance of the final models:
- for the model based on player attributes, accuracy is 50% and balanced accuracy is 42%;
- for the model based on betting odds, accuracy is 52% and balanced accuracy is 45%.
These models can be used in different situations, depending on which types of variables are available.
Unfortunately, both models are unable to predict the outcome “draw” correctly. This might be related to the finding in section Relationship Between Betting Odds that betting odds of “draw” are correlated with the outcome “away wins”.
Conclusion: although there is a lot of randomness in the game, data-based decisions can improve predictions of the football match outcome. Still, these predictions are not perfect.
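
The difference between accuracy and balanced accuracy reported above can be illustrated with a small sketch. Balanced accuracy is the mean of per-class recalls, so a class that is never predicted correctly (here, “draw”) pulls it down more than it pulls down plain accuracy. The labels below are a toy example and not project data.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Share of all cases that were predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recalls (each class has equal weight)."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    recalls = [hits[label] / totals[label] for label in totals]
    return sum(recalls) / len(recalls)


# Toy data: the classifier never predicts "draw"
y_true = ["home"] * 5 + ["draw"] * 3 + ["away"] * 2
y_pred = ["home"] * 4 + ["away"] + ["home", "home", "away"] + ["away", "home"]

acc = accuracy(y_true, y_pred)            # (4 + 0 + 1) / 10 = 0.5
bacc = balanced_accuracy(y_true, y_pred)  # (0.8 + 0.0 + 0.5) / 3 ≈ 0.433
```

For 3 classes, a classifier that guesses at random or always predicts the same class has a balanced accuracy of about 0.33, which is a useful reference point for the BAcc values in the subsections below.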

The results of classification model selection are in Table 5.15. The performance of the selected models is presented in Table 5.16 and in the output below this table.

The details are in the subsections below.

5.9.1 Team-Related Features as Predictors

Logistic Regression

Code

# Do SFS or take results from cache
def fun_sfs_res_match_team():
    np.random.seed(250)
    estimator = LogisticRegression(
        solver="newton-cg", multi_class="multinomial"
    )
    subset = [match_target, *match_vars_team]
    X, y = match_train[subset].make_dummies(exclude=match_target)
    return ml.sfs(estimator, "classification").fit(X, y)


file = "saved-output/sfs_res_match_team.pickle"
sfs_res_match_team = my.cached_results(file, fun_sfs_res_match_team)

Code

ml.sfs_plot_results(
    sfs_res_match_team,
    "Predictors: Team-Related Features (Logistic Regression)",
);

k = 9, avg. BAcc = 0.347 [Parsimonious]
(Smallest number of predictors at best ± 1 SE score)

Details: Numeric values of BAcc

Code

(
    ml.sfs_list_features(sfs_res_match_team)
    .style.format(precision=3)
    .highlight_max(subset=["BAcc"])
)

	added_feature	BAcc	BAcc_improvement	BAcc_percentage_change
step
1	buildUpPlaySpeed_away	0.333	nan	nan
2	buildUpPlayPassing_home	0.333	0.000	0.027
3	buildUpPlayPassing_away	0.340	0.006	1.928
4	defencePressure_away	0.341	0.001	0.397
5	chanceCreationShooting_home	0.344	0.003	0.946
6	defenceAggression_away	0.345	0.001	0.206
7	defenceAggression_home	0.345	0.000	0.046
8	chanceCreationShooting_away	0.346	0.001	0.315
9	buildUpPlaySpeed_home	0.347	0.000	0.079
10	chanceCreationPassing_home	0.346	-0.000	-0.063
11	defenceTeamWidth_home	0.347	0.001	0.222
12	defencePressure_home	0.347	-0.000	-0.021
13	chanceCreationPassing_away	0.347	-0.000	-0.073
14	chanceCreationCrossing_home	0.346	-0.001	-0.215
15	chanceCreationCrossing_away	0.344	-0.002	-0.553
16	defenceTeamWidth_away	0.342	-0.002	-0.676

Random Forests

Code

# Do SFS or take results from cache
def fun_sfs_res_match_team_rf():
    np.random.seed(250)
    estimator = RandomForestClassifier()
    subset = [match_target, *match_vars_team]
    X, y = match_train[subset].make_dummies(exclude=match_target)
    return ml.sfs(estimator, "classification").fit(X, y)


file = "saved-output/sfs_res_match_team_rf.pickle"
sfs_res_match_team_rf = my.cached_results(file, fun_sfs_res_match_team_rf)

Code

ml.sfs_plot_results(
    sfs_res_match_team_rf, "Predictors: Team-Related Features (Random Forests)"
);

k = 3, avg. BAcc = 0.350 [Parsimonious]
(Smallest number of predictors at best ± 1 SE score)

Details: Numeric values of BAcc

Code

(
    ml.sfs_list_features(sfs_res_match_team_rf)
    .style.format(precision=3)
    .highlight_max(subset=["BAcc"])
)

	added_feature	BAcc	BAcc_improvement	BAcc_percentage_change
step
1	defencePressure_away	0.341	nan	nan
2	chanceCreationShooting_home	0.341	0.000	0.021
3	defenceTeamWidth_away	0.350	0.008	2.478
4	defenceAggression_home	0.343	-0.006	-1.794
5	buildUpPlaySpeed_away	0.339	-0.005	-1.384
6	chanceCreationShooting_away	0.337	-0.001	-0.338
7	buildUpPlayPassing_home	0.341	0.003	1.019
8	chanceCreationCrossing_home	0.340	-0.001	-0.287
9	chanceCreationPassing_away	0.338	-0.002	-0.692
10	defenceTeamWidth_home	0.340	0.003	0.811
11	defenceAggression_away	0.345	0.005	1.324
12	chanceCreationPassing_home	0.344	-0.001	-0.237
13	defencePressure_home	0.340	-0.004	-1.083
14	chanceCreationCrossing_away	0.344	0.004	1.231
15	buildUpPlayPassing_away	0.344	0.000	0.020
16	buildUpPlaySpeed_home	0.339	-0.005	-1.479

5.9.2 Player-Related Features as Predictors

Logistic Regression

Code

# Do SFS or take results from cache
def fun_sfs_res_match_player():
    np.random.seed(250)
    estimator = LogisticRegression(
        solver="newton-cg", multi_class="multinomial"
    )
    subset = [match_target, *match_vars_player]
    X, y = match_train[subset].make_dummies(exclude=match_target)
    return ml.sfs(estimator, "classification", 30).fit(X, y)


file = "saved-output/sfs_res_match_player.pickle"
sfs_res_match_player = my.cached_results(file, fun_sfs_res_match_player)

Code

ml.sfs_plot_results(
    sfs_res_match_player,
    "Predictors: Player-Related Features (Logistic Regression)",
);

k = 28, avg. BAcc = 0.454 [Best]
(Number of predictors at best score)

Details: Numeric values of BAcc

Code

(
    ml.sfs_list_features(sfs_res_match_player)
    .style.format(precision=3)
    .highlight_max(subset=["BAcc"])
)

	added_feature	BAcc	BAcc_improvement	BAcc_percentage_change
step
1	dribbling__mean_away	0.389	nan	nan
2	overall_rating__mean_home	0.432	0.043	11.032
3	overall_rating__mean_away	0.446	0.014	3.284
4	stamina__max_away	0.447	0.001	0.253
5	gk_positioning__std_home	0.448	0.001	0.126
6	long_shots__max_away	0.450	0.002	0.356
7	weight_kg__std_home	0.450	0.001	0.168
8	jumping__min_away	0.450	0.000	0.005
9	strength__std_away	0.451	0.000	0.028
10	gk_kicking__mean_home	0.451	0.000	0.010
11	gk_positioning__min_home	0.451	0.000	0.035
12	ball_control__max_home	0.451	0.000	0.008
13	sprint_speed__min_home	0.451	0.000	0.031
14	gk_handling__min_away	0.451	-0.000	-0.010
15	strength__max_away	0.451	-0.000	-0.044
16	aggression__mean_home	0.451	0.000	0.079
17	gk_kicking__min_home	0.451	-0.000	-0.081
18	crossing__mean_away	0.451	0.000	0.086
19	strength__mean_home	0.451	-0.000	-0.096
20	ball_control__std_away	0.451	0.001	0.147
21	interceptions__min_away	0.452	0.001	0.115
22	gk_positioning__mean_home	0.452	0.000	0.019
23	shot_power__max_away	0.452	0.000	0.053
24	vision__mean_away	0.453	0.000	0.088
25	vision__min_home	0.453	0.000	0.074
26	sprint_speed__std_away	0.453	0.000	0.061
27	positioning__std_home	0.453	0.000	0.019
28	sprint_speed__std_home	0.454	0.001	0.148
29	short_passing__max_home	0.453	-0.000	-0.084
30	gk_positioning__max_home	0.454	0.000	0.035

Random Forests

Details: Feature importances

Code

def fun_rf_match_player():
    np.random.seed(250)
    rf = RandomForestClassifier(n_jobs=-1)
    subset = [match_target, *match_vars_player]
    X, y = match_train[subset].make_dummies(exclude=match_target)
    return rf.fit(X, y)


file = "saved-output/rf_match_player.pickle"
rf_match_player = my.cached_results(file, fun_rf_match_player)

rf_match_player_importances = ml.get_rf_importances(rf_match_player)

ml.plot_importances(rf_match_player_importances, n=30);

Code

(
    rf_match_player_importances.nlargest(21, "importance")
    .style.format(precision=4)
    .bar()
)

	features	importance
147	reactions__mean_home	0.0072
115	ball_control__mean_home	0.0070
114	ball_control__mean_away	0.0066
106	long_passing__mean_away	0.0065
67	short_passing__mean_home	0.0065
27	overall_rating__mean_home	0.0063
82	dribbling__mean_away	0.0063
34	potential__mean_away	0.0062
83	dribbling__mean_home	0.0061
146	reactions__mean_away	0.0060
26	overall_rating__mean_away	0.0058
66	short_passing__mean_away	0.0057
107	long_passing__mean_home	0.0055
227	vision__mean_home	0.0052
35	potential__mean_home	0.0052
163	shot_power__mean_home	0.0051
195	long_shots__mean_home	0.0050
51	finishing__mean_home	0.0050
138	agility__mean_away	0.0048
91	curve__mean_home	0.0047
75	volleys__mean_home	0.0045

Code

# Do SFS or take results from cache
def fun_sfs_res_match_player_rf():
    np.random.seed(250)
    estimator = RandomForestClassifier()
    # fmt: off
    subset = [
        match_target, 'reactions__mean_home', 'ball_control__mean_home',
       'ball_control__mean_away', 'long_passing__mean_away',
       'short_passing__mean_home', 'overall_rating__mean_home',
       'dribbling__mean_away', 'potential__mean_away',
       'dribbling__mean_home', 'reactions__mean_away',
       'overall_rating__mean_away', 'short_passing__mean_away',
       'long_passing__mean_home', 'vision__mean_home',
       'potential__mean_home', 'shot_power__mean_home',
       'long_shots__mean_home', 'finishing__mean_home',
       'agility__mean_away', 'curve__mean_home', 'volleys__mean_home'
    ]
    # fmt: on
    X, y = match_train[subset].make_dummies(exclude=match_target)
    return ml.sfs(estimator, "classification", 10).fit(X, y)


file = "saved-output/sfs_res_match_player_rf.pickle"
sfs_res_match_player_rf = my.cached_results(file, fun_sfs_res_match_player_rf)

Code

ml.sfs_plot_results(
    sfs_res_match_player_rf,
    "Predictors: Player-Related Features (Random Forests)",
);

k = 10, avg. BAcc = 0.431 [Best]
(Number of predictors at best score)

Details: Numeric values of BAcc

Code

(
    ml.sfs_list_features(sfs_res_match_player_rf)
    .style.format(precision=3)
    .highlight_max(subset=["BAcc"])
)

	added_feature	BAcc	BAcc_improvement	BAcc_percentage_change
step
1	dribbling__mean_away	0.377	nan	nan
2	short_passing__mean_home	0.398	0.021	5.607
3	long_shots__mean_home	0.413	0.015	3.773
4	reactions__mean_home	0.417	0.003	0.814
5	long_passing__mean_away	0.422	0.005	1.307
6	overall_rating__mean_home	0.426	0.004	0.969
7	short_passing__mean_away	0.430	0.004	0.956
8	agility__mean_away	0.430	0.000	0.001
9	reactions__mean_away	0.430	-0.000	-0.112
10	shot_power__mean_home	0.431	0.001	0.267

5.9.3 Betting-Odds as Predictors

Logistic Regression

Code

# Do SFS or take results from cache
def fun_sfs_res_match_betting():
    np.random.seed(250)
    estimator = LogisticRegression(
        solver="newton-cg", multi_class="multinomial"
    )
    subset = [match_target, *match_vars_betting_odds]
    X, y = match_train[subset].make_dummies(exclude=match_target)
    return ml.sfs(estimator, "classification").fit(X, y)


file = "saved-output/sfs_res_match_betting.pickle"
sfs_res_match_betting = my.cached_results(file, fun_sfs_res_match_betting)

Code

ml.sfs_plot_results(
    sfs_res_match_betting, "Predictors: Betting Odds (Logistic Regression)"
);

k = 1, avg. BAcc = 0.454 [Parsimonious]
(Smallest number of predictors at best ± 1 SE score)

Details: Numeric values of BAcc

Code

(
    ml.sfs_list_features(sfs_res_match_betting)
    .style.format(precision=3)
    .highlight_max(subset=["BAcc"])
)

	added_feature	BAcc	BAcc_improvement	BAcc_percentage_change
step
1	BW_away_wins	0.454	nan	nan
2	VC_away_wins	0.454	-0.000	-0.022
3	LB_away_wins	0.455	0.000	0.086
4	WH_away_wins	0.455	0.000	0.073
5	B365_away_wins	0.455	-0.000	-0.008
6	IW_away_wins	0.455	-0.000	-0.060
7	VC_log_ratio_ha	0.446	-0.009	-1.893
8	BW_log_ratio_ha	0.447	0.001	0.254
9	IW_draw	0.447	0.000	0.077
10	WH_log_ratio_ha	0.448	0.000	0.046
11	B365_log_ratio_ha	0.448	-0.000	-0.000
12	VC_draw	0.447	-0.000	-0.038
13	B365_draw	0.448	0.000	0.018
14	IW_log_ratio_ha	0.448	0.000	0.003
15	LB_log_ratio_ha	0.447	-0.000	-0.018
16	WH_draw	0.447	-0.000	-0.108
17	BW_draw	0.447	-0.000	-0.074
18	WH_ratio_ha	0.447	0.000	0.009
19	LB_draw	0.447	0.000	0.069
20	VC_ratio_ha	0.447	-0.000	-0.038
21	IW_home_wins	0.447	0.000	0.026
22	WH_home_wins	0.447	-0.000	-0.069
23	VC_home_wins	0.446	-0.000	-0.043
24	BW_ratio_ha	0.446	-0.000	-0.044
25	LB_ratio_ha	0.446	0.000	0.000
26	B365_ratio_ha	0.446	-0.000	-0.003
27	B365_home_wins	0.446	-0.000	-0.010
28	IW_ratio_ha	0.446	-0.000	-0.111
29	BW_home_wins	0.445	-0.001	-0.144
30	LB_home_wins	0.445	0.000	0.023

Random Forests

Details: Feature importances

Code

def fun_rf_match_betting():
    np.random.seed(250)
    rf = RandomForestClassifier(n_jobs=-1)
    subset = [match_target, *match_vars_betting_odds]
    X, y = match_train[subset].make_dummies(exclude=match_target)
    return rf.fit(X, y)


file = "saved-output/rf_match_betting.pickle"
rf_match_betting = my.cached_results(file, fun_rf_match_betting)

rf_match_betting_importances = ml.get_rf_importances(rf_match_betting)

ml.plot_importances(rf_match_betting_importances, n=30);

Code

# Do SFS or take results from cache
def fun_sfs_res_match_betting_rf():
    np.random.seed(250)
    estimator = RandomForestClassifier()
    subset = [match_target, *match_vars_betting_odds]
    X, y = match_train[subset].make_dummies(exclude=match_target)
    return ml.sfs(estimator, "classification", 10).fit(X, y)


file = "saved-output/sfs_res_match_betting_rf.pickle"
sfs_res_match_betting_rf = my.cached_results(file, fun_sfs_res_match_betting_rf)

Code

ml.sfs_plot_results(
    sfs_res_match_betting_rf, "Predictors: Betting Odds (Random Forests)"
);

k = 10, avg. BAcc = 0.445 [Best]
(Number of predictors at best score)

Details: Numeric values of BAcc

Code

(
    ml.sfs_list_features(sfs_res_match_betting_rf)
    .head(20)
    .style.format(precision=3)
    .highlight_max(subset=["BAcc"])
)

	added_feature	BAcc	BAcc_improvement	BAcc_percentage_change
step
1	BW_home_wins	0.445	nan	nan
2	B365_home_wins	0.439	-0.006	-1.251
3	WH_away_wins	0.428	-0.012	-2.671
4	WH_draw	0.426	-0.001	-0.317
5	IW_home_wins	0.434	0.008	1.782
6	LB_ratio_ha	0.442	0.008	1.936
7	WH_home_wins	0.443	0.001	0.130
8	VC_home_wins	0.444	0.002	0.349
9	B365_log_ratio_ha	0.442	-0.003	-0.606
10	BW_log_ratio_ha	0.445	0.003	0.751
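
The two derived columns in the table above can be approximated from the BAcc column alone. The sketch below assumes that `BAcc_improvement` is the difference from the previous step and that `BAcc_percentage_change` is that difference relative to the previous step, in percent. It starts from the rounded BAcc values, so the last digits differ from the table, which was presumably computed from unrounded scores.

```python
import pandas as pd

# Rounded BAcc values copied from the table above (steps 1-10).
bacc = pd.Series(
    [0.445, 0.439, 0.428, 0.426, 0.434, 0.442, 0.443, 0.444, 0.442, 0.445],
    index=pd.RangeIndex(1, 11, name="step"),
    name="BAcc",
)

table = pd.DataFrame(
    {
        "BAcc": bacc,
        # Assumed definition: change versus the previous step
        "BAcc_improvement": bacc.diff(),
        # Assumed definition: relative change versus the previous step, in %
        "BAcc_percentage_change": bacc.pct_change() * 100,
    }
)
print(table.round(3))
```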

5.9.4 All Variables as Predictors

Logistic Regression

Code

# Do SFS or take results from cache
def fun_sfs_res_match_all():
    np.random.seed(250)
    estimator = LogisticRegression(
        solver="newton-cg", multi_class="multinomial"
    )
    X, y = match_train.make_dummies(exclude=match_target)
    return ml.sfs(estimator, "classification", 10).fit(X, y)


file = "saved-output/sfs_res_match_all.pickle"
sfs_res_match_all = my.cached_results(file, fun_sfs_res_match_all)

Code

ml.sfs_plot_results(
    sfs_res_match_all, "Predictors: All Features (Logistic Regression)"
);

[Plot: SFS results. Best score at k = 9 predictors, avg. BAcc = 0.459.]

Details: Numeric values of BAcc

Code

(
    ml.sfs_list_features(sfs_res_match_all)
    .style.format(precision=3)
    .highlight_max(subset=["BAcc"])
)

	added_feature	BAcc	BAcc_improvement	BAcc_percentage_change
step
1	BW_away_wins	0.454	nan	nan
2	chanceCreationCrossing_home	0.456	0.002	0.399
3	jumping__mean_home	0.457	0.001	0.272
4	defenceTeamWidth_home	0.458	0.001	0.115
5	gk_kicking__mean_away	0.458	0.000	0.065
6	gk_kicking__min_home	0.459	0.001	0.120
7	balance__max_home	0.458	-0.000	-0.048
8	buildUpPlaySpeed_home	0.459	0.000	0.080
9	finishing__std_home	0.459	0.000	0.040
10	agility__std_home	0.459	-0.000	-0.045
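
The `ml.sfs()` helper used above is the author's own function, and its source is not shown here. As a rough illustration of forward sequential feature selection scored by balanced accuracy, the sketch below uses scikit-learn's `SequentialFeatureSelector` on synthetic data. The data set, the number of folds, and all variable names are illustrative assumptions, not the settings used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Synthetic 3-class data standing in for the match data set.
X, y = make_classification(
    n_samples=300,
    n_features=8,
    n_informative=4,
    n_redundant=0,
    n_classes=3,
    random_state=250,
)

# Forward selection: start with no features and, at each step, add the
# feature that gives the best cross-validated balanced accuracy.
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=3,
    direction="forward",
    scoring="balanced_accuracy",
    cv=3,
)
selector.fit(X, y)

selected = selector.get_support(indices=True)
print("Selected feature indices:", selected)
```

Unlike the tables in this section, this selector returns only the final feature subset and does not report the score at each step.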

Random Forests

Details: Feature importances

Code

def fun_rf_match_all():
    np.random.seed(250)
    rf = RandomForestClassifier(n_jobs=-1)
    X, y = match_train.make_dummies(exclude=match_target)
    return rf.fit(X, y)


file = "saved-output/rf_match_all.pickle"
rf_match_all = my.cached_results(file, fun_rf_match_all)

rf_match_all_importances = ml.get_rf_importances(rf_match_all)

ml.plot_importances(rf_match_all_importances, n=50);

Code

(
    rf_match_all_importances.nlargest(25, "importance")
    .style.format(precision=4)
    .bar()
)

	features	importance
25	BW_log_ratio_ha	0.0096
19	BW_ratio_ha	0.0081
9	LB_home_wins	0.0071
18	B365_ratio_ha	0.0070
17	VC_away_wins	0.0069
29	LB_log_ratio_ha	0.0068
26	VC_log_ratio_ha	0.0067
22	WH_ratio_ha	0.0064
24	B365_log_ratio_ha	0.0062
20	VC_ratio_ha	0.0060
2	B365_away_wins	0.0057
15	VC_home_wins	0.0056
28	WH_log_ratio_ha	0.0056
8	IW_away_wins	0.0055
5	BW_away_wins	0.0050
12	WH_home_wins	0.0047
11	LB_away_wins	0.0047
23	LB_ratio_ha	0.0046
3	BW_home_wins	0.0045
0	B365_home_wins	0.0041
21	IW_ratio_ha	0.0040
355	player_age__std_home	0.0040
64	bmi__mean_away	0.0039
66	bmi__std_away	0.0038
354	player_age__std_away	0.0038
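
A table like the one above can be built from the `feature_importances_` attribute of a fitted random forest. The sketch below is an assumption about what a helper such as `ml.get_rf_importances()` might do. It uses synthetic data and hypothetical feature names.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic 3-class data; the feature names are placeholders.
X, y = make_classification(
    n_samples=300,
    n_features=6,
    n_informative=3,
    n_redundant=0,
    n_classes=3,
    random_state=250,
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

rf = RandomForestClassifier(n_estimators=100, random_state=250).fit(X, y)

# Impurity-based importances; they are normalized to sum to 1.
importances = pd.DataFrame(
    {"features": feature_names, "importance": rf.feature_importances_}
).sort_values("importance", ascending=False)
print(importances.round(4))
```

Because the importances sum to 1 over all features, individual values become small when several hundred features are included, as in the table above.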

Code

# Do SFS or take results from cache
def fun_sfs_res_match_all_rf():
    np.random.seed(250)
    estimator = RandomForestClassifier()
    # fmt: off
    subset = [
        match_target, 'BW_log_ratio_ha', 'BW_ratio_ha', 'LB_home_wins', 
        'B365_ratio_ha', 'VC_away_wins', 'LB_log_ratio_ha', 'VC_log_ratio_ha',
        'WH_ratio_ha', 'B365_log_ratio_ha', 'VC_ratio_ha',
        'B365_away_wins', 'VC_home_wins', 'WH_log_ratio_ha',
        'IW_away_wins', 'BW_away_wins', 'WH_home_wins', 'LB_away_wins',
        'LB_ratio_ha', 'BW_home_wins', 'B365_home_wins', 'IW_ratio_ha',
        'player_age__std_home', 'bmi__mean_away', 'bmi__std_away',
        'player_age__std_away', 'player_age__min_away',
        'agility__std_home', 'free_kick_accuracy__std_home',
        'potential__std_away', 'overall_rating__std_away',
        'overall_rating__std_home', 'agility__std_away',
        'acceleration__std_home', 'player_age__min_home', 'bmi__mean_home',
        'weight_kg__std_home', 'IW_home_wins', 'dribbling__mean_away',
        'long_shots__std_away', 'reactions__std_away',
        'long_shots__std_home', 'player_age__mean_away', 'bmi__std_home',
        'potential__std_home', 'heading_accuracy__std_away',
        'player_age__mean_home', 'strength__std_away',
        'weight_kg__std_away', 'shot_power__std_home',
        'long_passing__std_away'
    ]
    # fmt: on
    X, y = match_train[subset].make_dummies(exclude=match_target)
    return ml.sfs(estimator, "classification", 10).fit(X, y)


file = "saved-output/sfs_res_match_all_rf.pickle"
sfs_res_match_all_rf = my.cached_results(file, fun_sfs_res_match_all_rf)

Code

ml.sfs_plot_results(
    sfs_res_match_all_rf, "All Features as Predictors (Random Forests)"
);

[Plot: SFS results. Best score at k = 10 predictors, avg. BAcc = 0.452.]

Details: Numeric values of BAcc

Code

(
    ml.sfs_list_features(sfs_res_match_all_rf)
    .head(20)
    .style.format(precision=3)
    .highlight_max(subset=["BAcc"])
)

	added_feature	BAcc	BAcc_improvement	BAcc_percentage_change
step
1	BW_home_wins	0.445	nan	nan
2	B365_home_wins	0.439	-0.006	-1.245
3	long_shots__std_home	0.425	-0.014	-3.252
4	player_age__std_away	0.438	0.013	3.029
5	free_kick_accuracy__std_home	0.443	0.005	1.210
6	VC_home_wins	0.446	0.003	0.587
7	BW_ratio_ha	0.451	0.005	1.028
8	player_age__min_home	0.451	0.001	0.129
9	weight_kg__std_home	0.451	-0.001	-0.132
10	LB_away_wins	0.452	0.002	0.358

5.9.5 Final Models

This subsection summarizes the results from the subsections above and evaluates the selected models on the whole training and test sets.

**Table 5.15.** **Classification** model selection results: selected models for each feature type and algorithm.
Feature type	Method	Number of features selected	Training CV BAcc	Selected as final model	Notes
Team-related	Logistic regression	k = 9	0.347	No	
Team-related	Random forest	k = 3	0.350	No³	
Player-related	Logistic regression	k = 9	0.451	Yes	Included¹: all. Max. allowed²: 30. With k = 28, BAcc: 0.454.
Player-related	Random forest	k = 7	0.430	No	Included¹: 21. Max. allowed²: 10. With k = 10, BAcc: 0.431.
Betting odds	Logistic regression	k = 1	0.454	Yes	
Betting odds	Random forest	k = 1	0.445	No	Included¹: all. Max. allowed²: 10.
All variables	Logistic regression	k = 4	0.458	No⁴	Included¹: all. Max. allowed²: 10.
All variables	Random forest	k = 7	0.451	No	Included¹: 50. Max. allowed²: 10.

¹ – Number of features included in SFS selection.
² – Maximum allowed number of features to be selected.
³ – This model was a candidate to become the final model but it was rejected due to low performance.
⁴ – This model was a candidate to become the final model, but it shares its first variable with the betting-odds-based model, and the 3 additional variables increased the performance only slightly. The model was therefore rejected due to possible overfitting, in favor of the less complex model with 1 variable.
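
The reasoning in footnote ⁴ can be made concrete with simple arithmetic on the training CV BAcc values from Table 5.15. This is only an illustration of the size of the gain, not a formal test.

```python
# Training CV BAcc values taken from Table 5.15.
bacc_betting_k1 = 0.454  # betting odds, logistic regression, k = 1
bacc_all_k4 = 0.458  # all variables, logistic regression, k = 4

abs_gain = bacc_all_k4 - bacc_betting_k1
rel_gain_pct = 100 * abs_gain / bacc_betting_k1

print(f"Absolute gain: {abs_gain:.3f}")
print(f"Relative gain: {rel_gain_pct:.2f} %")
```

Three extra predictors raise BAcc by about 0.004, which is less than 1 % in relative terms.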

Code

np.random.seed(250)

# -----------------------------------------------------------------------
# Model 1: logistic regression on the selected player-related variables

subset_1 = [
    match_target,
    "dribbling__mean_away",
    "overall_rating__mean_home",
    "overall_rating__mean_away",
    "stamina__max_away",
    "gk_positioning__std_home",
    "long_shots__max_away",
    "weight_kg__std_home",
    "jumping__min_away",
    "strength__std_away",
]
X_train_1, y_train_1 = match_train[subset_1].make_dummies(exclude=match_target)
X_test_1, y_test_1 = match_test[subset_1].make_dummies(exclude=match_target)

model_match_player = LogisticRegression(
    solver="newton-cg", multi_class="multinomial"
)
model_match_player.fit(X_train_1, y_train_1)

y_pred_train_1 = model_match_player.predict(X_train_1)
y_pred_test_1 = model_match_player.predict(X_test_1)

# -----------------------------------------------------------------------
# Model 2: logistic regression on a single betting-odds variable

subset_2 = [match_target, "BW_away_wins"]
X_train_2, y_train_2 = match_train[subset_2].make_dummies(exclude=match_target)
X_test_2, y_test_2 = match_test[subset_2].make_dummies(exclude=match_target)

model_match_betting = LogisticRegression(
    solver="newton-cg", multi_class="multinomial"
)
model_match_betting.fit(X_train_2, y_train_2)

y_pred_train_2 = model_match_betting.predict(X_train_2)
y_pred_test_2 = model_match_betting.predict(X_test_2)

# -----------------------------------------------------------------------
# Performance of both models on the training and test sets

pd.concat(
    [
        ml.get_classification_performance(
            y_train_1, y_pred_train_1, "Train (player-related variables)"
        ),
        ml.get_classification_performance(
            y_test_1, y_pred_test_1, "Test (player-related variables)"
        ),
        ml.get_classification_performance(
            y_train_2, y_pred_train_2, "Train (betting odds based prediction)"
        ),
        ml.get_classification_performance(
            y_test_2, y_pred_test_2, "Test (betting odds based prediction)"
        ),
    ]
).index_start_at(1)

**Table 5.16.** Final evaluation of selected models for match outcome prediction.
	set	n	Accuracy	BAcc	BAcc_01	f1_macro	f1_weighted	Kappa
1	Train (player-related variables)	12634	0.53	0.45	0.17	0.39	0.45	0.21
2	Test (player-related variables)	2575	0.50	0.42	0.14	0.37	0.42	0.17
3	Train (betting odds based prediction)	12634	0.53	0.45	0.18	0.39	0.45	0.22
4	Test (betting odds based prediction)	2575	0.52	0.45	0.18	0.39	0.45	0.21

Code

print("Classification Report\nTest set (player-related variables)\n")
ml.print_classification_report(y_test_1, y_pred_test_1, "test")

Classification Report
Test set (player-related variables)

    set     n  Accuracy  BAcc  BAcc_01  f1_macro  f1_weighted  Kappa
0  test  2575      0.50  0.42     0.14      0.37         0.42   0.17

              precision    recall  f1-score   support

   Away Wins       0.49      0.45      0.47       790
        Draw       1.00      0.00      0.00       641
   Home Wins       0.51      0.83      0.63      1144

    accuracy                           0.50      2575
   macro avg       0.67      0.42      0.37      2575
weighted avg       0.63      0.50      0.42      2575


Confusion matrix (rows - true, columns - predicted):
[[352   0 438]
 [168   1 472]
 [199   0 945]]

Code

print("Classification Report\nTest set (betting odds based prediction)\n")
ml.print_classification_report(y_test_2, y_pred_test_2, "test")

Classification Report
Test set (betting odds based prediction)

    set     n  Accuracy  BAcc  BAcc_01  f1_macro  f1_weighted  Kappa
0  test  2575      0.52  0.45     0.18      0.39         0.45   0.21

              precision    recall  f1-score   support

   Away Wins       0.48      0.56      0.52       790
        Draw       0.00      0.00      0.00       641
   Home Wins       0.54      0.79      0.64      1144

    accuracy                           0.52      2575
   macro avg       0.34      0.45      0.39      2575
weighted avg       0.39      0.52      0.45      2575


Confusion matrix (rows - true, columns - predicted):
[[446   0 344]
 [231   0 410]
 [244   0 900]]
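
The summary metrics in Table 5.16 can be recomputed from the two test-set confusion matrices. The sketch below assumes that BAcc_01 is balanced accuracy rescaled so that chance level (1/3 for three classes) maps to 0 and a perfect result maps to 1. This interpretation is an assumption, but it reproduces the reported values. The function name `metrics_from_confusion` is hypothetical.

```python
import numpy as np


def metrics_from_confusion(cm):
    """Compute accuracy, BAcc, rescaled BAcc, and Cohen's kappa
    from a confusion matrix (rows: true, columns: predicted)."""
    cm = np.asarray(cm, dtype=float)
    n = cm.sum()
    recalls = np.diag(cm) / cm.sum(axis=1)  # per-class recall
    bacc = recalls.mean()
    chance = 1 / cm.shape[0]
    bacc_01 = (bacc - chance) / (1 - chance)  # assumed BAcc_01 definition
    acc = np.trace(cm) / n
    # Agreement expected by chance from the row and column totals
    expected = (cm.sum(axis=1) * cm.sum(axis=0)).sum() / n**2
    kappa = (acc - expected) / (1 - expected)
    return {"Accuracy": acc, "BAcc": bacc, "BAcc_01": bacc_01, "Kappa": kappa}


cm_player = [[352, 0, 438], [168, 1, 472], [199, 0, 945]]
cm_betting = [[446, 0, 344], [231, 0, 410], [244, 0, 900]]

for name, cm in [("player-related", cm_player), ("betting odds", cm_betting)]:
    metrics = metrics_from_confusion(cm)
    print(name, {k: round(float(v), 2) for k, v in metrics.items()})
```

Both models almost never predict a draw, so the recall of that class is close to 0. This pulls BAcc well below accuracy. The same rescaling is available in scikit-learn as `balanced_accuracy_score(..., adjusted=True)`.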

6 Summary

In this project, the European Football database, which includes data from seasons 2008/2009 to 2015/2016, was analyzed. Nine main questions were answered in the “Analysis” section, and the most important findings were summarized at the beginning of each main subsection. The game includes a lot of randomness, but in some situations a data-based approach can give additional valuable information about the European Football game.

6.1 Things to Improve

Some pre-processing steps were performed, but the data from those steps were not included in the final analysis. These pre-processing steps could be removed from the analysis.
Some pre-processing steps should be explained in more detail in written form.
I preferred .eval() over .assign() where possible and used .assign() elsewhere. Some users may find this coding style inconsistent.
Some tables have names that are technical (e.g., n_goals) rather than natural for humans (e.g., “Number of goals”).
Variable names in the last part (e.g., y_train_1, y_pred_train_1) could have been more human-friendly.
Hyperparameter tuning may improve random forest (RF) performance.
Other types of machine learning algorithms (e.g., SVM, XGBoost) may capture the trends better and lead to better performance. This should be tested.
Some parts of this database (e.g., tables with player data) could be investigated in more detail to get even more insights.
Some plots (e.g., heat maps or cluster maps) are very large so as not to lose variable names, but these plots may not fit on the screen. To fit them onto a screen, the user should make the browser window narrower but keep it as tall as it was before. On the other hand, some HTML output (the profiling report) can be studied effectively only on wide screens.
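
The point about tuning could be explored along the following lines. This is a minimal sketch on synthetic data, with an arbitrary small parameter grid chosen only for illustration. It was not run on the match data, and the grid values are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic 3-class data standing in for the match data set.
X, y = make_classification(
    n_samples=300,
    n_features=8,
    n_informative=4,
    n_redundant=0,
    n_classes=3,
    random_state=250,
)

# Illustrative grid: limit tree depth and leaf size to reduce overfitting.
param_grid = {
    "max_depth": [3, None],
    "min_samples_leaf": [1, 5],
}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=250),
    param_grid,
    scoring="balanced_accuracy",
    cv=3,
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```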
