The Insights on European Football

Data Analysis Project

Author

Vilmantas Gėgžna

Published

2023-03-23

Updated

2023-07-30

The Insights on European Football project logo. Generated with Leonardo.Ai.

Data analysis tools: Python, SQL, Looker Studio
Helper tools: VS Code, Quarto, Git

Abbreviations

  • Acc – accuracy.
  • BAcc – balanced accuracy.
  • BAcc_01 – balanced accuracy where 0 is the worst and 1 is the best result.
  • CI – 95% confidence interval.
  • CLD – compact letter display.
  • CV – cross-validation.
  • EDA – exploratory data analysis.
  • FIFA – International Federation of Association Football.
  • k – number of variables/features.
  • ML – machine learning.
  • n – either sample or group size.
  • NA, NAs – missing value(s).
  • p – p-value.
  • p_adj – p-value (adjusted).
  • PC, PCs – principal component(s).
  • PCA – principal component analysis.
  • r – Pearson’s correlation coefficient.
  • R² – coefficient of determination, r squared.
  • RMSE – root mean squared error.
  • RNG – (pseudo)random number generator.
  • SD – standard deviation.
  • SE – standard error.
  • SFS – sequential feature selection.
  • UK – United Kingdom.

1 Introduction

European football (also known as soccer) is one of the most popular sports in Europe. It is also a big market, with revenues of €27.6 billion in the 2020/21 season (source). The money is earned by, e.g., selling match tickets and broadcasting rights, betting, and advertising.

In this project, European football data from the 2008/2009 to 2015/2016 seasons was analyzed to get a better data-based understanding of the game. The “Analysis” section addresses nine main questions, one per main subsection, and provides insights on each. Each main subsection begins with its most important findings, and the remaining parts provide the supporting details (plots, tables, etc.).

Tip

Note that some code, analyses, results, and other details are hidden in collapsible sections (collapsed by default). These are:

  • either parts with many results that would clutter the report,
  • or less important, supplementary parts, e.g., evidence supporting some claims in the text.

1.1 Setup

Code: The main Python setup
# Automatically reload certain modules
%reload_ext autoreload
%autoreload 1

# Plotting
%matplotlib inline

# Packages and modules -------------------------------
import os
import re
import warnings

# Working with SQL database
import sqlite3

# EDA
import ydata_profiling as eda
from skimpy import skim
import missingno as msno

# Data wrangling, maths
import numpy as np
import pandas as pd
import janitor  # imports additional Pandas methods

# Statistical analysis
import scipy.stats as sps

# Machine learning
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.colors import LinearSegmentedColormap

# Maps
import geopandas as gpd
from shapely.geometry import Polygon

# Enable ability to run R in Python
os.environ["R_HOME"] = "C:/PROGRA~1/R/R-4.2.3"

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    import rpy2

    %load_ext rpy2.ipython

# Custom functions
import functions.fun_utils as my
import functions.pandas_methods
import functions.fun_analysis as an
import functions.fun_ml as ml

%aimport functions.fun_utils
%aimport functions.pandas_methods
%aimport functions.fun_analysis
%aimport functions.fun_ml

# Settings --------------------------------------------
# Default plot options
plt.rc("figure", titleweight="bold")
plt.rc("axes", labelweight="bold", titleweight="bold")
plt.rc("font", weight="normal", size=10)
plt.rc("figure", figsize=(7, 3))

# Pandas options
pd.set_option("display.max_rows", 1000)
pd.set_option("display.max_columns", 300)
pd.set_option("display.max_colwidth", 50)  # Possible option: None
pd.set_option("display.float_format", lambda x: f"{x:.2f}")
pd.set_option("styler.format.thousands", ",")

# colors
green, blue, orange, red = "tab:green", "tab:blue", "tab:orange", "tab:red"

# Analysis parameters
do_eda = True
Code
"""Various functions for data pre-processing, analysis and plotting."""

# OS module
import os

# Enable ability to run R code in Python
os.environ["R_HOME"] = "C:/PROGRA~1/R/R-4.2.3"
import rpy2.robjects as r_obj
from rpy2.robjects.conversion import localconverter

# Other Python libraries and modules
import pathlib
import pickle
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.stats.api as sms
import scipy.stats as sps
from scipy.stats import median_abs_deviation
from typing import Union
from IPython.display import display, HTML
from matplotlib.ticker import MaxNLocator

# Utilities ==================================================================

# For Pandas objects
def index_has_names(obj):
    """Check if index of an object has names.

    Args:
        obj: Object that has `.index` attribute

    Returns:
        bool: True if index has names, False otherwise.
    """
    return None not in list(obj.index.names)


# Display in Jupyter notebook
def display_collapsible(x, summary: str = "", sep=" ", is_open: bool = False):
    """Display data frame or other object surrounded by `<details>` tags

    (I.e., display in collapsible way)

    Args:
        x (pd.DataFrame, str, list): Object to display
        summary (str, optional): Collapsed section name. Defaults to "".
        sep (str, optional): Symbol used to join strings (when x is a list).
             Defaults to " ".
        is_open (bool, optional): Should the section be open by default
            Defaults to False.
    """
    if is_open:
        is_open = " open"
    else:
        is_open = ""

    if hasattr(x, "to_html") and callable(x.to_html):
        html_str = x.to_html()
    elif type(x) == str:
        html_str = x
    else:
        html_str = sep.join([str(i) for i in x])

    display(
        HTML(
            f"<details{is_open}><summary>{summary}</summary>"
            + html_str
            + "</details>"
        )
    )


def cached_results(file, fun, **kwargs):
    """If file does not exist, take results from file, otherwise
       calculate them, save to file and return the calculated result.

    Args:
        file (str): File name.
        fun (function): function.
        **kwargs: arguments passed to `fun`.

    Returns:
        The result of `fun()`
    """
    if pathlib.Path(file).is_file():
        with open(file, "rb") as f:
            results = pickle.load(f)

    else:
        results = fun(**kwargs)
        with open(file, "wb") as f:
            pickle.dump(results, f)

    return results


# Helper functions to work with R in Python -----------------------------------
def r_to_python(obj: str):
    """Import object from R environment to Python

    Import object from R environment created in ipynb cells via `rpy2` package.

    Args:
        obj (str): Object name in R global environment.

    Returns:
        Analogous Python object (NOTE: tested with data frames only).
    """
    return r_obj.pandas2ri.rpy2py(r_obj.globalenv[obj])


# Format values ------------------------------------------------------------
def format_p(p):
    """Format p values at 3 decimal places.

    Args:
        p (float): p value (number between 0 and 1).
    """
    if p < 0.001:
        return "p < 0.001"
    elif p > 0.999:
        return "p > 0.999"
    else:
        return f"p = {p:.3f}"


def format_percent(x: float):
    """Round percentages to 1 decimal place and format as strings

    Values between 0 (exclusive) and 0.05 are printed as "<0.1%".
    Values between 99.95 and 100 (exclusive) are printed as ">99.9%".

    Args:
        x (float): A sequence of percentage values ranging from 0 to 100.

    Returns:
        pd.Series[str]: Pandas series of formatted values.
        Values equal to 0 are formatted as "0%", values between
        0 and 0.05 are formatted as "<0.1%", values between 99.95 and 100
        are formatted as ">99.9%", and values equal to 100 are formatted
        as "100%".

    Author: Vilmantas Gėgžna
    """
    return pd.Series(
        [
            "0%"
            if i == 0
            else "<0.1%"
            if i < 0.05
            else ">99.9%"
            if 99.95 <= i < 100
            else f"{i:.1f}%"
            for i in x
        ],
        index=x.index,
    )


# Analysis =================================================================

# Exploratory analysis
def count_unique(data: pd.DataFrame):
    """Get number and percentage of unique values

    Args:
        data (pd.DataFrame): Data frame to analyze.

    Return: data frame with columns `n_unique` (int) and `percent_unique` (str)
    """
    n_unique = data.nunique()
    return pd.concat(
        [
            n_unique.rename("n_unique"),
            format_percent((n_unique / data.shape[0]).multiply(100)).rename(
                "percent_unique"
            ),
        ],
        axis=1,
    )


# Descriptive statistics ----------------------------------------------------
def calc_summaries(x, ndigits=None):
    """Calculate some common summary statistics.

    Args:
        x (pandas.Series): Numeric variable to summarize.
        ndigits (int, None, optional): Number of decimal digits to round to.
                Defaults to None.
    Return:
       pandas.DataFrame with summary statistics.
    """

    def mad(x):
        return median_abs_deviation(x)

    def range(x):
        return x.max() - x.min()

    res = x.agg(
        ["count", "min", "max", range, "mean", "median", "std", mad, "skew"]
    )

    if ndigits is not None:
        summary = pd.DataFrame(round(res, ndigits=ndigits)).T
    else:
        summary = pd.DataFrame(res).T
    # Present count data as integer:
    summary = summary.assign(count=lambda d: d["count"].astype(int))

    return summary


# Plot counts ---------------------------------------------------------------
def plot_counts_with_labels(
    counts,
    title="",
    x=None,
    y="n",
    x_lab=None,
    y_lab="Count",
    label="percent",
    label_rotation=0,
    title_fontsize=13,
    legend=False,
    ec="black",
    y_lim_max=None,
    ax=None,
    **kwargs,
):
    """Plot count data as bar plots with labels.

    Args:
        counts (pandas.DataFrame): Data frame with counts data.
        title (str, optional): Figure title. Defaults to "".
        x (str, optional): Column name from `counts` to plot on x axis.
                Defaults to None: first column.
        y (str, optional): Column name from `counts` to plot on y axis.
                Defaults to "n".
        x_lab (str, optional): X axis label.
              Defaults to value of `x` with capitalized first letter.
        y_lab (str, optional): Y axis label. Defaults to "Count".
        label (str, None, optional): Column name from `counts` for value labels.
                Defaults to "percent".
                If None, label is not added.
        label_rotation (int, optional): Angle of label rotation. Defaults to 0.
        title_fontsize (int, optional): Font size of the title. Defaults to 13.
        legend (bool, optional): Should legend be shown? Defaults to False.
        ec (str, optional): Edge color. Defaults to "black".
        y_lim_max (float, optional): Upper limit for Y axis.
                Defaults to None: do not change.
        ax (matplotlib.axes.Axes, optional): Axes object. Defaults to None.
        **kwargs: further arguments to pandas.DataFrame.plot.bar()

    Returns:
        matplotlib.axes.Axes: Axes object of the generated plot.

    Author: Vilmantas Gėgžna
    """
    if x is None:
        x = counts.columns[0]

    if x_lab is None:
        x_lab = x.capitalize()

    if y_lim_max is None:
        y_lim_max = counts[y].max() * 1.15

    ax = counts.plot.bar(x=x, y=y, legend=legend, ax=ax, ec=ec, **kwargs)
    ax.set_title(title, fontsize=title_fontsize)
    ax.set_xlabel(x_lab)
    ax.set_ylabel(y_lab)
    if label is not None:
        ax_add_value_labels_ab(
            ax, labels=counts[label], rotation=label_rotation
        )
    ax.set_ylim(0, y_lim_max)

    return ax


def ax_xaxis_integer_ticks(min_n_ticks: int, rot: int = 0):
    """Ensure that x axis ticks has integer values

    Args:
        min_n_ticks (int): Minimal number of ticks to use.
        rot (int, optional): Rotation angle of x axis tick labels.
        Defaults to 0.
    """
    ax = plt.gca()
    ax.xaxis.set_major_locator(
        MaxNLocator(min_n_ticks=min_n_ticks, integer=True)
    )
    plt.xticks(rotation=rot)


def ax_axis_comma_format(axis: str = "xy", ax=None):
    """Write values of X axis ticks with comma as thousands separator

    Args:
        axis (str, optional): which axis should be formatted:
           "x" X axis, "y" Y axis or "xy" (default) both axes.
        ax (axis object, None, optional):Axis of plot.
            Defaults to None: current axis.
    """

    if ax is None:
        ax = plt.gca()

    fmt = "{x:,.0f}"
    formatter = plt.matplotlib.ticker.StrMethodFormatter(fmt)
    if "x" in axis:
        ax.xaxis.set_major_formatter(formatter)

    if "y" in axis:
        ax.yaxis.set_major_formatter(formatter)


def ax_add_value_labels_ab(
    ax, labels=None, spacing=2, size=9, weight="bold", **kwargs
):
    """Add value labels above/below each bar in a bar chart.

    Arguments:
        ax (matplotlib.Axes): Plot (axes) to annotate.
        labels (list-like, optional): Values to be used as labels.
            Defaults to None: bar heights are used.
        spacing (int): Number of points between bar and label.
        size (int): font size.
        weight (str): font weight.
        **kwargs: further arguments to axis.annotate.

    Source:
        This function is based on https://stackoverflow.com/a/48372659/4783029
    """

    # If no labels are given, use bar heights formatted to one decimal place
    if labels is None:
        labels = ["{:.1f}".format(rect.get_height()) for rect in ax.patches]

    # For each bar: place a label
    for rect, label in zip(ax.patches, labels):
        # Get X and Y placement of label from rect.
        y_value = rect.get_height()
        x_value = rect.get_x() + rect.get_width() / 2

        space = spacing

        # Vertical alignment for positive values
        va = "bottom"

        # If the value of a bar is negative: Place label below the bar
        if y_value < 0:
            # Invert space to place label below
            space *= -1
            # Vertical alignment
            va = "top"

        # Create annotation
        ax.annotate(
            label,
            (x_value, y_value),
            xytext=(0, space),
            textcoords="offset points",
            ha="center",
            va=va,
            fontsize=size,
            fontweight=weight,
            **kwargs,
        )


# Inferential statistics -----------------------------------------------------
def ci_proportion_multinomial(
    counts,
    method: str = "goodman",
    n_label: str = "n",
    percent_label: str = "percent",
) -> pd.DataFrame:
    """Calculate  simultaneous confidence intervals for multinomial proportion.

    More information in documentation of statsmodels'
    multinomial_proportions_confint.

    Args:
        x (int): ps.Series, list or tuple with count data.
        method (str, optional): Method. Defaults to "goodman".
       n_label (str, optional): Name for column for counts.
       percent_label (str, optional): Name for column for percentage values.

    Returns:
        pd.DataFrame: _description_

    Examples:
    >>> ci_proportion_multinomial([62, 33, 55])
    """
    assert type(counts) in [pd.Series, list, tuple]
    if type(counts) is not pd.Series:
        counts = pd.Series(counts)

    return pd.concat(
        [
            (counts).rename(n_label),
            (counts / sum(counts)).rename(percent_label) * 100,
            pd.DataFrame(
                sms.multinomial_proportions_confint(counts, method=method),
                index=counts.index,
                columns=["ci_lower", "ci_upper"],
            )
            * 100,
        ],
        axis=1,
    )


def test_chi_square_gof(
    f_obs: list[int], f_exp: Union[str, list[float]] = "all equal"
) -> str:
    """Chi squared (χ²) goodness-of-fit (gof) test

    Args:
        f_obs (list[int]): Observed frequencies
        f_exp (str, list[float]): List of expected frequencies, or "all equal"
              to use equal expected frequencies (the mean of the observed
              frequencies). Defaults to "all equal".

    Returns:
        str: formatted test results including p value.
    """
    k = len(f_obs)
    n = sum(f_obs)
    exp = n / k
    dof = k - 1
    if f_exp == "all equal":
        f_exp = [exp for _ in range(k)]
    stat, p = sps.chisquare(f_obs=f_obs, f_exp=f_exp)
    # May also be formatted this way:
    return (
        f"Chi square test, χ²({dof}, n = {n}) = {round(stat, 2)}, {format_p(p)}"
    )


def pairwise_chisq_gof_test(x: pd.Series):
    """Post-hoc Pairwise chi-squared Test

    Interface to R function `rstatix::pairwise_chisq_gof_test()`.

    Args:
        x (pandas.Series): data with group counts

    Returns:
        pandas.DataFrame: Data frame with pairwise comparison results
        (compared groups and adjusted p values).
    """
    # Loading R package
    rstatix = r_obj.packages.importr("rstatix")
    dplyr = r_obj.packages.importr("dplyr")

    # Converting Pandas obj to R obj
    with localconverter(r_obj.default_converter + r_obj.pandas2ri.converter):
        x_in_r = r_obj.conversion.py2rpy(x)

    # Invoking the R function and getting the result
    df_result_r = rstatix.pairwise_chisq_gof_test(x_in_r)
    df_result_r = dplyr.relocate(df_result_r, "group1", "group2")

    # Converting the result to a Pandas dataframe
    return r_obj.pandas2ri.rpy2py(df_result_r)


def convert_pairwise_p_to_cld(
    data,
    group1: str = "group1",
    group2: str = "group2",
    p_name: str = "p.adj",
    output_gr_var: str = "group",
):
    """Convert p values from pairwise comparisons to CLD

    CLD - compact letter display: shared letter shows that difference
    is not significant. Interface to R function `convert_pairwise_p_to_cld()`.

    Args:
        data (pandas.DataFrame): Data frame with at least 3 columns:
              the first 2 columns contain names of both groups, one more
              column should contain p values.
        group1 (str, optional): Name of the first column with group names.
               Defaults to "group1".
        group2 (str, optional): Name of the second column with group names.
               Defaults to "group2".
        p_name (str, optional): Name of column with p values.
               Defaults to "p.adj".
        output_gr_var (str, optional): Name of column in output dataset
               with group names. Defaults to "group".

    Returns:
        pandas.DataFrame: DataFrame with CLD results.
    """
    # Loading R function from file
    r_obj.r["source"]("functions/functions.R")
    convert_pairwise_p_to_cld = r_obj.globalenv["convert_pairwise_p_to_cld"]

    # Converting Pandas data frame to R data frame
    with localconverter(r_obj.default_converter + r_obj.pandas2ri.converter):
        df_in_r = r_obj.conversion.py2rpy(data)

    # Invoking the R function and getting the result
    df_result_r = convert_pairwise_p_to_cld(
        df_in_r,
        group1=group1,
        group2=group2,
        p_name=p_name,
        output_gr_var=output_gr_var,
    )

    # Converting the result back to a Pandas dataframe
    return r_obj.pandas2ri.rpy2py(df_result_r)
Code
"""Classes to perform statistical analysis and output the results."""

import pandas as pd
import numpy as np
import pingouin as pg
import statsmodels.stats.api as sms
import scikit_posthocs as sp
import matplotlib.pyplot as plt

import functions.fun_utils as my  # Custom module
import functions.pandas_methods  # Custom module; imports method .to_df()


# Analyze count data ---------------------------------------------------------
class AnalyzeCounts:
    """The class to analyze count data.

    - Performs omnibus chi-squared and post-hoc pair-wise chi-squared test.
    - Compactly presents results of post-hoc test as compact letter display, CLD
      NOTE: for CLD calculations, R is required.
      (A shared CLD letter shows no significant difference between groups.)
    - Calculates percentages and their confidence intervals by using Goodman's
    method.
    - Creates summary of grouped values (group counts and percentages).
    - Plots results as bar plots with percentage labels.
    """

    def __init__(self, counts, by=None, counts_of=None):
        """
        Object initialization function.

        Args:
            counts (pandas.Series[int]): Count data to analyze.
            by (str, optional): Grouping variable name. Used to create labels.
                      If None, defaults to "Group"
            counts_of (str, optional): The thing that was counted.
                    This name is used for labels in plots and tables.
                    Defaults to `counts.name`.
        """
        assert isinstance(counts, pd.Series)

        # Set defaults
        if by is None:
            by = "Group"

        if counts_of is None:
            counts_of = counts.name

        # Set attributes: user inputs or defaults
        self.counts = counts
        self.counts_of = counts_of
        self.by = by

        # Set attributes: created/calculated
        self.n_label = f"n_{counts_of}"  # Create label for counts

        # Set attributes: results to be calculated
        self.results_are_calculated = False
        self.omnibus = None
        self.n_ci_and_cld = None
        self.descriptive_stats = None

    def fit(self):
        """Perform count data analysis: calculate the results."""

        # Alias attributes
        counts = self.counts
        by = self.by
        n_label = self.n_label

        # Omnibus test: perform and save the results
        self.omnibus = my.test_chi_square_gof(counts)

        # Post-hoc (pairwise chi-square): perform
        posthoc_p = my.pairwise_chisq_gof_test(counts)
        posthoc_cld = my.convert_pairwise_p_to_cld(posthoc_p, output_gr_var=by)

        # Confidence interval: calculate
        ci = (
            my.ci_proportion_multinomial(
                counts, method="goodman", n_label=n_label
            )
            .rename_axis(by)
            .reset_index()
        )

        # Make sure datasets are mergeable
        ci[by] = ci[by].astype(str)
        posthoc_cld[by] = posthoc_cld[by].astype(str)

        # Merge results
        n_ci_and_cld = pd.merge(ci, posthoc_cld, on=by)

        # Format percentages and counts
        vars = ["percent", "ci_lower", "ci_upper"]
        n_ci_and_cld[vars] = n_ci_and_cld[vars].apply(my.format_percent)

        # Save results
        self.n_ci_and_cld = n_ci_and_cld

        # Descriptive statistics: calculate
        to_format = ["min", "max", "range", "mean", "median", "std", "mad"]

        def format_0f(x):
            return [f"{i:,.0f}" for i in x]

        summary_count = my.calc_summaries(ci[n_label])
        summary_count[to_format] = summary_count[to_format].apply(format_0f)

        summary_perc = my.calc_summaries(ci["percent"])
        summary_perc[to_format] = summary_perc[to_format].apply(
            my.format_percent
        )
        # Save results
        self.descriptive_stats = pd.concat([summary_count, summary_perc])

        # Initialization status
        self.results_are_calculated = True

        # Output
        return self

    def print(
        self,
        omnibus: bool = True,
        posthoc: bool = True,
        descriptives: bool = True,
    ):
        """Print numeric results.

        Args:
            omnibus (bool, optional): Flag to print omnibus test results.
                                      Defaults to True.
            posthoc (bool, optional): Flag to print post-hoc test results.
                                      Defaults to True.
            descriptives (bool, optional): Flag to print descriptive statistics.
                                      Defaults to True.

        Raises:
            Exception: if calculations with `.fit()` method were not performed.
        """
        if not self.results_are_calculated:
            raise Exception("No results. Run `.fit()` first.")

        # Omnibus test
        if omnibus:
            print("Omnibus (chi-squared) test results:")
            print(self.omnibus, "\n")

        # Post-hoc and CI
        if posthoc:
            print(
                f"Counts of {self.counts_of} with 95% CI "
                "and post-hoc (pairwise chi-squared) test results:"
            )
            print(self.n_ci_and_cld, "\n")

        # Descriptive statistics: display
        if descriptives:
            print(f"Descriptive statistics of group ({self.by}) counts:")
            print(self.descriptive_stats, "\n")

    def display(
        self,
        omnibus: bool = True,
        posthoc: bool = True,
        descriptives: bool = True,
    ):
        """Display numeric results in Jupyter Notebooks.

        Args:
            omnibus (bool, optional): Flag to print omnibus test results.
                                      Defaults to True.
            posthoc (bool, optional): Flag to print post-hoc test results.
                                      Defaults to True.
            descriptives (bool, optional): Flag to print descriptive statistics.
                                      Defaults to True.

        Raises:
            Exception: if calculations with `.fit()` method were
            not performed.
        """
        if not self.results_are_calculated:
            raise Exception("No results. Run `.fit()` first.")

        # Omnibus test
        if omnibus:
            my.display_collapsible(
                self.omnibus, "Omnibus (chi-squared) test results"
            )

        # Post-hoc and CI
        if posthoc:
            my.display_collapsible(
                self.n_ci_and_cld.style.format({self.n_label: "{:,.0f}"}),
                f"Counts of {self.counts_of} with 95% CI and post-hoc "
                " (pairwise chi-squared) test results",
            )

        # Descriptive statistics: display
        if descriptives:
            my.display_collapsible(
                self.descriptive_stats,
                f"Descriptive statistics of group ({self.by}) counts",
            )

    def plot(self, xlabel=None, ylabel=None, **kwargs):
        """Plot analysis results.

        Args:
            xlabel (str, None, optional): X axis label.
                    Defaults to None: autogenerated label.
            ylabel (str, None, optional): Y axis label.
                    Defaults to None: autogenerated label.
            **kwargs: further arguments passed to `my.plot_counts_with_labels()`

        Raises:
            Exception: if calculations with `.fit()` method were
            not performed.

        Returns:
            matplotlib.axes object
        """
        if not self.results_are_calculated:
            raise Exception("No results. Run `.fit()` first.")

        # Plot
        if xlabel is None:
            xlabel = self.by.capitalize()

        if ylabel is None:
            ylabel = f"Number of {self.counts_of}"

        ax = my.plot_counts_with_labels(
            self.n_ci_and_cld,
            x_lab=xlabel,
            y_lab=ylabel,
            y=self.n_label,
            **kwargs,
        )

        my.ax_axis_comma_format("y")

        return ax


# Analyze numeric groups ------------------------------------------------------
class AnalyzeNumericGroups:
    """Class to analyze numeric/continuous data by groups.

    - Calculates mean values per group and their confidence intervals using
        the t distribution.
    - Performs omnibus (Kruskal-Wallis) and post-hoc (Conover-Iman) tests.
    - Compactly presents results of post-hoc test as compact letter display, CLD
      NOTE: for CLD calculations, R is required.
      (A shared CLD letter shows no significant difference between groups.)
    - Creates a summary of group means (descriptive statistics).
    - Plots results as points with 95% confidence interval error bars.
    """

    def __init__(self, data, y: str, by: str):
        """Initialize the class.

        Args:
            y (str): Name of numeric/continuous (dependent) variable.
            by (str): Name of grouping (independent) variable.
            data (pandas.DataFrame): data frame with variables indicated in
                `y` and `by`.
        """
        assert isinstance(data, pd.DataFrame)

        # Set attributes: user inputs
        self.data = data
        self.y = y
        self.by = by

        # Set attributes: results to be calculated
        self.results_are_calculated = False
        self.omnibus = None
        self.ci_and_cld = None
        self.descriptive_stats = None

    def fit(self):
        """Perform the analysis: calculate the results."""

        # Aliases:
        data = self.data
        y = self.y
        by = self.by

        # Omnibus test: Kruskal-Wallis test
        omnibus = pg.kruskal(data=data, dv=y, between=by)
        omnibus["p-unc"] = my.format_p(omnibus["p-unc"][0])

        self.omnibus = omnibus

        # Confidence intervals
        ci_raw = data.groupby(by)[y].apply(
            lambda x: [np.mean(x), *sms.DescrStatsW(x).tconfint_mean()]
        )
        ci = pd.DataFrame(
            list(ci_raw),
            index=ci_raw.index,
            columns=["mean", "ci_lower", "ci_upper"],
        ).reset_index()

        # Post-hoc test: Conover-Iman test
        posthoc_p_matrix = sp.posthoc_conover(
            data, val_col=y, group_col=by, p_adjust="holm"
        )
        posthoc_p_df = posthoc_p_matrix.stack().to_df(
            "p.adj", ["group1", "group2"]
        )
        posthoc_cld = my.convert_pairwise_p_to_cld(
            posthoc_p_df, output_gr_var=by
        )

        # Make sure datasets are mergeable
        ci[by] = ci[by].astype(str)
        posthoc_cld[by] = posthoc_cld[by].astype(str)

        self.ci_and_cld = pd.merge(posthoc_cld, ci, on=by)

        # Descriptive statistics of means
        self.descriptive_stats = my.calc_summaries(ci["mean"])

        # Results are present
        self.results_are_calculated = True

        # Output:
        return self

    def print(
        self,
        omnibus: bool = True,
        posthoc: bool = True,
        descriptives: bool = True,
    ):
        """Print numeric results.

        Args:
            omnibus (bool, optional): Flag to print omnibus test results.
                                      Defaults to True.
            posthoc (bool, optional): Flag to print post-hoc test results.
                                      Defaults to True.
            descriptives (bool, optional): Flag to print descriptive statistics.
                                      Defaults to True.

        Raises:
            Exception: if calculations with `.fit()` method were
            not performed.
        """
        if not self.results_are_calculated:
            raise Exception("No results. Run `.fit()` first.")

        # Omnibus test
        if omnibus:
            print("Omnibus (Kruskal-Wallis) test results:")
            print(self.omnibus, "\n")

        # Post-hoc and CI
        if posthoc:
            print(
                "Post-hoc (Conover-Iman) test results as CLD and "
                "Confidence intervals (CI):",
            )
            print(self.ci_and_cld, "\n")

        # Descriptive statistics
        if descriptives:
            print(f"Descriptive statistics of group ({self.by}) means:")
            print(self.descriptive_stats, "\n")

    def display(
        self,
        omnibus: bool = True,
        posthoc: bool = True,
        descriptives: bool = True,
    ):
        """Display numeric results in Jupyter Notebooks.

        Args:
            omnibus (bool, optional): Flag to print omnibus test results.
                                      Defaults to True.
            posthoc (bool, optional): Flag to print post-hoc test results.
                                      Defaults to True.
            descriptives (bool, optional): Flag to print descriptive statistics.
                                      Defaults to True.

        Raises:
            Exception: if calculations with `.fit()` method were
            not performed.
        """
        if not self.results_are_calculated:
            raise Exception("No results. Run `.fit()` first.")

        # Omnibus test
        if omnibus:
            my.display_collapsible(
                self.omnibus, "Omnibus (Kruskal-Wallis) test results"
            )

        # Post-hoc and CI
        if posthoc:
            my.display_collapsible(
                self.ci_and_cld,
                "Post-hoc (Conover-Iman) test results as CLD and "
                "Confidence intervals (CI)",
            )

        # Descriptive statistics of means
        if descriptives:
            my.display_collapsible(
                self.descriptive_stats,
                f"Descriptive statistics of group ({self.by}) means",
            )

    def plot(self, title=None, xlabel=None, ylabel=None, **kwargs):
        """Plot the results

        Args:

            xlabel (str, None, optional): X axis label.
                    Defaults to None: capitalized value of `by`.
            ylabel (str, None, optional): Y axis label.
                    Defaults to None: capitalized value of `y`.
            title (str, None, optional): The title of the plot.
                    Defaults to None.

        Returns:
            Tuple with matplotlib figure and axis objects (fig, ax).
        """
        if not self.results_are_calculated:
            raise Exception("No results. Run `.fit()` first.")

        # Aliases:
        ci = self.ci_and_cld
        by = self.by
        y = self.y

        # Create figure and axes
        fig, ax = plt.subplots()

        # Construct plot
        x = ci.iloc[:, 0]

        ax.errorbar(
            x=x,
            y=ci["mean"],
            yerr=[ci["mean"] - ci["ci_lower"], ci["ci_upper"] - ci["mean"]],
            mfc="red",
            ms=2,
            mew=1,
            fmt="ko",
            zorder=3,
        )

        if xlabel is None:
            xlabel = by.capitalize()

        if ylabel is None:
            ylabel = y.capitalize()

        ax.set_xlabel(xlabel)
        ax.set_ylabel(ylabel)
        ax.set_ylim([0, None])
        ax.set_title(title)

        # Output
        return (fig, ax)
Code
from typing import Union

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker

# Machine learning
from sklearn.metrics import mean_squared_error as mse, r2_score
from sklearn.metrics import f1_score, accuracy_score, balanced_accuracy_score
from sklearn.metrics import cohen_kappa_score
from sklearn.metrics import classification_report, confusion_matrix
from mlxtend.feature_selection import SequentialFeatureSelector
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


# Helpers
def as_formula(
    target: str = None,
    include: Union[list[str], pd.DataFrame] = None,
    exclude: list[str] = None,
    add: str = "",
):
    """
    Generates the R style formula for statsmodels (patsy) given
    the dataframe, dependent variable and optional excluded columns
    as strings.

    Args:
        target (str): name of target variable.
        include (pandas.DataFrame or list[str]):
            dataframe of column names to include.
        exclude (list[str], optional):
            columns to exclude.
        add (str): string to add to formula, e.g., "+ 0"

    Return:
        String with R style formula for `patsy` (e.g., "target ~ x1 + x2").

    See also: https://stackoverflow.com/a/44866142/4783029
    """
    if isinstance(include, pd.DataFrame):
        include = list(include.columns.values)

    if target in include:
        include.remove(target)

    if exclude is not None:
        for col in exclude:
            include.remove(col)

    return target + " ~ " + " + ".join(include) + add


def get_columns_by_purpose(data, target: str):
    """Split data frame to 3 data frames: for target, numeric, and remaining
    variables.

    Examples:
    >>> # Split
    >>> d_target, d_num, d_other = get_columns_by_purpose(data, "class")

    >>> # Merge back
    >>> pd.concat([d_target, d_num, d_other], axis=1)
    """
    d_num = data.drop(columns=target).select_dtypes("number")
    d_other = data.drop(columns=[target, *d_num.columns.values])

    return data[target].to_frame(), d_num, d_other


# Functions for feature selection
def sfs(estimator, est_type, k_features="parsimonious", forward=True):
    """Create SFS instance for classification

    Args.:
        est_type (str): classification or regression
        other arguments: see mlextend.SequentialFeatureSelector()
    """

    if est_type == "regression":
        scoring = "neg_root_mean_squared_error"
    elif est_type == "classification":
        scoring = "balanced_accuracy"
    else:
        raise Exception(f"Unrecognized learner/estimator type: {type}")

    return SequentialFeatureSelector(
        estimator,
        k_features=k_features,  # "parsimonious",
        forward=forward,
        floating=False,
        scoring=scoring,
        verbose=1,
        cv=5,
        n_jobs=-1,
    )


def sfs_get_score(sfs_object, k_features):
    """Return performance score achieved with certain number of features.

    Args.:
        sfs_object: a fitted SequentialFeatureSelector object (e.g., created
            with `sfs()`).
        k_features (int): number of features.
    """
    md = round(
        np.median(sfs_object.get_metric_dict()[k_features]["cv_scores"]), 3
    )
    return {
        "k_features": k_features,
        "mean_score": round(
            sfs_object.get_metric_dict()[k_features]["avg_score"], 3
        ),
        "median_score": md,
        "sd_score": round(
            sfs_object.get_metric_dict()[k_features]["std_dev"], 3
        ),
    }


def sfs_plot_results(sfs_object, sub_title="", ref_y=None):
    """Plot results from SFS object

    Args.:
      sfs_object: object with SFS results.
      sub_title (str): second line of title.
      ref_y (float): Y coordinate of reference line.
    """

    scoring = sfs_object.get_params()["scoring"]

    if scoring == "neg_root_mean_squared_error":
        metric = "RMSE"
        sign = -1
    elif scoring == "balanced_accuracy":
        metric = "BAcc"
        sign = 1
    else:
        raise Exception(f"Unsupported scoring metric: {scoring}")

    if sfs_object.forward:
        sfs_type = "Forward"
    else:
        sfs_type = "Backward"

    fig, ax = plt.subplots(1, 2, sharey=True)

    xlab = "Number of predictors included"

    if ref_y is not None:
        ax[0].axhline(y=ref_y, color="darkred", linestyle="--", lw=0.5)
        ax[1].axhline(y=ref_y, color="darkred", linestyle="--", lw=0.5)

    avg_score = [
        (int(i), sign * c["avg_score"]) for i, c in sfs_object.subsets_.items()
    ]

    averages = pd.DataFrame(avg_score, columns=["k_features", "avg_score"])

    (
        averages.plot.scatter(
            x="k_features",
            y="avg_score",
            xlabel=xlab,
            ylabel=metric,
            title=f"Average {metric}",
            ax=ax[0],
        )
    )

    cv_scores = {
        int(i): sign * c["cv_scores"] for i, c in sfs_object.subsets_.items()
    }
    (
        pd.DataFrame(cv_scores).plot.box(
            xlabel=xlab,
            title=f"{metric} in CV splits",
            ax=ax[1],
        )
    )

    ax[0].xaxis.set_major_locator(mticker.MaxNLocator(integer=True))
    ax[1].xaxis.set_major_locator(mticker.MaxNLocator(integer=True))

    if not sfs_object.forward:
        ax[1].invert_xaxis()

    main_title = (
        f"{sfs_type} Feature Selection with {sfs_object.cv}-fold CV "
        + f"\n{sub_title}"
    )

    fig.suptitle(main_title)
    plt.tight_layout()
    plt.show()

    # Print results
    if not sfs_object.interrupted_:
        if sfs_object.is_parsimonious:
            note = "[Parsimonious]"
            k_selected = f"k = {len(sfs_object.k_feature_names_)}"
            score_at_k = f"avg. {metric} = {sign * sfs_object.k_score_:.3f}"
            note_2 = "Smallest number of predictors at best ± 1 SE score"
        else:
            note = "[Best]"
            if sign < 0:
                best = averages.nsmallest(1, "avg_score")
            else:
                best = averages.nlargest(1, "avg_score")
            k_selected = f"k = {int(best.k_features.values)}"
            score_at_k = f"avg. {metric} = {float(best.avg_score.values):.3f}"
            note_2 = "Number of predictors at best score"

        print(f"{k_selected}, {score_at_k} {note}\n({note_2})")


def sfs_list_features(sfs_result):
    """List features by order when they were added.
    Current implementation correctly works with forward selection only.

    Args:
        sfs_result (SFS object)
    """

    def rename_metric(x):
        return x.replace("score", metric)

    scoring = sfs_result.get_params()["scoring"]

    if scoring == "neg_root_mean_squared_error":
        metric = "RMSE"
        sign = -1
    elif scoring == "balanced_accuracy":
        metric = "BAcc"
        sign = 1
    else:
        raise Exception(f"Unsupported scoring metric: {scoring}")

    feature_dict = sfs_result.get_metric_dict()
    lst = [[*feature_dict[i]["feature_names"]] for i in feature_dict]
    feature = []
    for x, y in zip(lst[0::], lst[1::]):
        feature.append(*set(y).difference(x))

    return (
        pd.DataFrame(
            {
                "added_feature": [*lst[0], *feature],
                "score": [
                    sign * feature_dict[i]["avg_score"] for i in feature_dict
                ],
            }
        )
        .assign(score_improvement=lambda x: sign * x.score.diff())
        .assign(
            score_percentage_change=lambda x: sign * x.score.pct_change() * 100
        )
        .index_start_at(1)
        .rename_axis("step")
        .rename(columns=rename_metric)
    )


# Functions for regression/classification
def get_regression_performance(y_true, y_pred, name=""):
    """Evaluate regression model performance

    Calculate R², RMSE, and SD of predicted variable

    Args.:
      y_true, y_pred: true and predicted numeric values.
      name (str): the name of investigated set.
    """
    return (
        pd.DataFrame(
            {
                "set": name,
                "n": len(y_true),
                "SD": [float(np.std(y_true))],
                "RMSE": [float(np.sqrt(mse(y_true, y_pred)))],
                "R²": [r2_score(y_true, y_pred)],
            }
        )
        .eval("RMSE_SD_ratio = RMSE/SD")
        .eval("SD_RMSE_ratio = SD/RMSE")
    )


def get_classification_performance(true_class, predicted_class, name=""):
    """Evaluate classification model performance

    Calculate accuracy (Acc),
    Balanced accuracy (BAcc),
    Balanced accuracy adjusted to be between 0 and 1 (BAcc_01),
    F1 macro average (F1_macro),
    F1 weighted macro average (F1_weighted),
    Cohen's Kappa.

    Args.:
      true_class, predicted_class: true and predicted numeric values.
      name (str): the name of investigated set.
    """
    acc = accuracy_score(true_class, predicted_class)
    bacc = balanced_accuracy_score(true_class, predicted_class)
    bacc01 = balanced_accuracy_score(true_class, predicted_class, adjusted=True)
    f1_macro = f1_score(true_class, predicted_class, average="macro")
    f1_weighted = f1_score(true_class, predicted_class, average="weighted")
    kappa = cohen_kappa_score(true_class, predicted_class)

    return pd.DataFrame(
        {
            "set": name,
            "n": len(true_class),
            "Accuracy": [acc],
            "BAcc": [bacc],
            "BAcc_01": [bacc01],
            "f1_macro": [f1_macro],
            "f1_weighted": [f1_weighted],
            "Kappa": [kappa],
        }
    )


def print_classification_report(true_class, predicted_class, name=""):
    """Print summary of classification performance

    Args.:
        true_class, predicted_class: data sequences of the same length:
                                     with class names/indicators.
        name (str): the name of investigated set of data.
    """
    print(
        get_classification_performance(true_class, predicted_class, name=name)
    )
    print("")
    print(classification_report(true_class, predicted_class, zero_division=0))
    print("")
    print("Confusion matrix (rows - true, columns - predicted):")
    print(confusion_matrix(true_class, predicted_class))


# For Random Forests
def get_rf_importances(obj):
    """Get random forest feature importance

    Args:
        obj (fitted instance of RandomForestRegressor()):
            Random Forest.

    Returns:
        pandas.DataFrame: dataframe with feature names and their importance.
    """
    return pd.DataFrame(
        {
            "features": obj.feature_names_in_,
            "importance": obj.feature_importances_,
        }
    ).sort_values("importance", ascending=False)


def plot_importances(data, n=20):
    """Plot 2 plots with feature importance: "overview" and zoomed plot.

    Args:
        data (pandas.DataFrame):
            dataframe with columns `features` and `importance`
        n (int, optional):
            number of top features shown in the zoomed (bottom) plot.
            Defaults to 20.
    """
    fig, ax = plt.subplots(2, 1, height_ratios=(1, 3))

    data.plot.bar(
        x="features",
        y="importance",
        ylabel="",
        xlabel="",
        legend=False,
        ax=ax[0],
    )

    ax[0].xaxis.set_ticklabels([])

    data.head(n).plot.bar(
        x="features", y="importance", ylabel="", xlabel="Features", ax=ax[1]
    )

    fig.suptitle("Feature Importance: All and Several Top Variables")

    return fig, ax


# PCA
def pca_screeplot(data, n_components=30):
    """Plot PCA screeplot

    Args:
        data (pandas.Dataframe): Numeric data
        n_components (int, optional):
            Max number of principal components to extract.
            Defaults to 30.

    Returns:
        3 objects: plot (fig and ax) and pca object.
    """
    scale = StandardScaler()
    pca = PCA(n_components=n_components)

    scaled_data = scale.fit_transform(data)
    pca.fit(scaled_data)

    pct_explained = pca.explained_variance_ratio_ * 100

    fig, ax = plt.subplots(figsize=(8, 4))
    ax.plot(pct_explained, "-o", color="tab:green")

    ax.set_xlabel("Number of components")
    ax.set_ylabel("% of explained variance")

    return fig, ax, pca


def do_pca(data, target: str, n_components: int = 10, scale=None, pca=None):
    """Do PCA on numeric non-target variables

    Args:
        data (pandas.Dataframe): data
        target (str): Target variable name
        n_components (int, optional):
            Number of PCA components to extract.
            Defaults to 10.
            n_components is ignored if `pca` is not None.
        scale (instance of sklearn.preprocessing.StandardScaler or None):
            Fitted object to scale data.
        pca (instance of sklearn.decomposition.PCA or None):
            Fitted PCA object.

    Returns:
        tuple with 6 elements:
          - 4 data frames: d_target, d_num, d_other, d_pca
          - fitted instance of sklearn.preprocessing.StandardScaler.
          - fitted instance of sklearn.decomposition.PCA.
    """
    d_target, d_num, d_other = get_columns_by_purpose(data, target)

    if scale is None:
        scale = StandardScaler()
        sc_data = scale.fit_transform(d_num)
    else:
        sc_data = scale.transform(d_num)

    if pca is None:
        pca = PCA(n_components=n_components)
        pc_num = pca.fit_transform(sc_data)
    else:
        pc_num = pca.transform(sc_data)
        n_components = pc_num.shape[1]

    # Convert to DataFrame and name columns (pc_1, pc_2, etc.)
    d_pca = pd.DataFrame(
        pc_num,
        index=d_num.index,
        columns=[f"pc_{i}" for i in np.arange(1, n_components + 1)],
    )

    return (d_target, d_num, d_other, d_pca, scale, pca)
Code
"""New methods for Pandas Series and DataFrames"""

# Setup -----------------------------------------------------------------
import warnings
import pandas as pd
import pandas_flavor as pf
from typing import Union
import janitor  # imports additional Pandas methods

import functions.fun_utils as my  # Custom module

# Series methods --------------------------------------------------------
@pf.register_series_method
def to_df(
    self: pd.Series,
    values_name: str = None,
    key_name: Union[str, list[str], tuple[str, ...]] = None,
) -> pd.DataFrame:
    """Convert Series to DataFrame with desired or default column names.

    Similar to `pandas.Series.to_frame()`, but the main purpose of this method
    is to be used with the result of `.value_counts()`. So appropriate default
    column names are pre-defined. And index is always reset.

    Args:
        self (pandas.Series):
            The object the method is applied to.
        values_name (str):
            Name for series values (applied before conversion to DataFrame).
            Defaults "count".
        key_name (str or sequence of str):
            New name for the columns, that are created from Series index
            that was present before the conversion to DataFrame.
            Defaults to `self.index.names`, if index has names,
            to `self.name` if index has no names but series has name,
            or to "value" otherwise.

    Return:
        pandas.DataFrame

    Examples:
    >>> import pandas as pd
    >>> df = pd.Series({'right': 138409, 'left': 44733}).rename("foot")

    >>> df.to_df()

    >>> # Compared to .to_frame()
    >>> df.to_frame()
    """

    k_name = None
    v_name = None

    # Check if defaults can be set based on non-missing attribute values
    if my.index_has_names(self):
        k_name = self.index.names
        if self.name is not None:
            v_name = self.name
    else:
        k_name = self.name

    # Set user-defined values or defaults
    if key_name is not None:
        k_name = key_name
    elif k_name is None:
        k_name = "value"  # Default

    if values_name is not None:
        v_name = values_name
    elif v_name is None:
        v_name = "count"  # Default

    # Output
    return self.rename_axis(k_name).rename(v_name).reset_index()


@pf.register_series_method
def to_category(self, categories=None, ordered=False):
    """Convert variable to categorical one.

    NOTE: method with the same name but for DataFrame also exists.

    Args:
        self (pandas.Series):
            The object the method is applied to.
        categories (list of values, optional):
            Categories listed here will become the first categories.
            The remaining ones (not in this list) will follow.
            Defaults to None: use default order.
        ordered (bool, optional):
            Whether or not this categorical is treated as ordered categorical.
            Defaults to False.

    Return:
        pandas.Series
    """
    self = self.astype("category")
    all_cats = self.cat.categories.values
    if categories is not None:
        # new order
        all_cats = [
            *categories,
            *sorted(list(set(all_cats).difference(categories))),
        ]
    return self.cat.reorder_categories(all_cats, ordered=ordered)


# DataFrame methods --------------------------------------------------------
@pf.register_dataframe_method
def relocate(self, col, before=0):
    """Change position of a column.
    Do transformations in-place and return a data frame.

    Args:
        self (pd.DataFrame):
            The object the method is applied to.
        col (str):
            The name of column to relocate.
        before (int|str):
            The name or index of the column before which `col` will be inserted.

    Return:
        pandas.DataFrame

    Examples:
        >>> import pandas as pd
        >>> data = pd.DataFrame({"a": 1, "b":2, "c":3})
        >>> data.relocate("c")
        >>> data
        >>> data.relocate("b", before="a")
        >>> data
    """
    columns = self.columns
    assert col in columns

    if before is None:
        position = 0

    if isinstance(before, int) or isinstance(before, float):
        position = int(before)

    if isinstance(before, str):
        assert before in columns
        position = columns.get_loc(before)
        col_position = columns.get_loc(col)
        if col_position <= position:
            position -= 1

    col_to_relocate = self.pop(col)
    self.insert(loc=position, column=col, value=col_to_relocate)

    return self


@pf.register_dataframe_method
def index_start_at(self, start=1):
    """Create a new sequential index that starts at indicated number.

    Args.:
        self (pd.DataFrame):
            The object the method is applied to.
        start (int):
            The start of an index

    Return:
        pandas.DataFrame
    """
    i = self.index
    self.index = range(start, len(i) + start)
    return self


with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    # There is a deprecated method with the same name in `pyjanitor` package
    # So warning is suppressed
    @pf.register_dataframe_method
    def to_datetime(self, columns, **kwargs):
        """Convert indicated columns to datetime.

        Args.:
            self (pd.DataFrame):
                The object the method is applied to.
            columns (str or list[str]):
                Column names to convert to datetime.
            **kwargs:
                Named arguments to be passed to pandas.to_datetime()

        Return:
            pandas.DataFrame
        """
        if isinstance(columns, str):
            columns = [columns]

        self[columns] = self[columns].apply(pd.to_datetime, **kwargs)
        return self


@pf.register_dataframe_method
def to_category(self, columns, categories=None, ordered=False):
    """Convert indicated columns to categorical variables.

    NOTE: method with the same name but for Series also exists.

    Args.:
        self (pd.DataFrame):
            The object the method is applied to.
        columns (str or list[str]):
            Column names to convert to category.
        categories (list of values, optional):
            Categories listed here will become the first categories.
            The remaining ones (not in this list) will follow.
            Defaults to None: use default order.
        ordered (bool, optional):
            Whether or not this categorical is treated as ordered categorical.
            Defaults to False.

    Return:
        pandas.DataFrame
    """
    if isinstance(columns, str):
        columns = [columns]

    # res = self.loc[:, columns].copy(deep=True)

    # self.loc[:, columns] = res.apply(lambda x: x.to_category(
    #     categories=categories, ordered=ordered
    # ))

    self.transform_columns(
        columns,
        lambda x: x.to_category(categories=categories, ordered=ordered),
        elementwise=False,
    )

    return self


@pf.register_dataframe_method
def make_dummies(
    self, exclude=None, drop_first=True, prefix_sep="__", **kwargs
):
    """Convert categorical variables in data frame to dummies and remove
       target variable.

    Args:
        exclude (str or None, optional):
            Name of target variable (exclude from the feature list).
            Defaults to None.
        drop_first (bool, optional):
            See pandas.get_dummies(). Defaults to True.
        prefix_sep (str, optional):
            See pandas.get_dummies(). Defaults to "__".

    Returns:
        If `exclude` is not present: pandas.DataFrame (the X)
        If `exclude` is present: pandas.DataFrame and pandas.Series
                              (the X and y)
    """
    if exclude is not None:
        y = self[exclude]
        self = self.drop(columns=exclude)

    df_with_dummies = pd.get_dummies(
        self, drop_first=drop_first, prefix_sep=prefix_sep, **kwargs
    )

    if exclude is not None:
        return (df_with_dummies, y)

    return df_with_dummies
Code
# R functions for data analysis

#' Convert Post-Hoc Test Results to CLD
#'
#' Convert p values from pairwise comparisons to CLD.
#'
#' CLD - compact letter display.
#' This function is a wrapper around [multcompView::multcompLetters()].
#'
#' @note
#' No hyphens are allowed in group names
#' (values of columns `group1` and `group2`).
#'
#' @param .data (data frame with at least 3 columns)
#'        The result of pairwise comparison test usually from \pkg{rstatix}
#'        package.
#' @param group1,group2 Name of the columns in `.data`, which contain the names
#'        of first and second group. Defaults to "group1" and "group2".
#' @param p_name Name of the column, which contains p values.
#'        Defaults to `p.adj`.
#' @param alpha Significance level. Defaults to 0.05.
#'
#' @return Data frame with compared group names and CLD representation of
#'         test results. Contains columns with group names and CLD results
#'         (`cld` and `spaced_cld`).

convert_pairwise_p_to_cld <- function(.data,
                                      group1 = "group1",
                                      group2 = "group2",
                                      p_name = "p.adj",
                                      output_gr_var = "group",
                                      alpha = 0.05) {

  # Checking input
  col_names <- c(group1, group2, p_name)
  missing_col <- !col_names %in% colnames(.data)

  if (any(missing_col)) {
    stop(
      "Check your input: these columns are not present in the data: ",
      paste(col_names[missing_col], collapse = ", ")
    )
  }

  # Analysis
  pair_names <- stringr::str_glue("{.data[[group1]]}-{.data[[group2]]}")

  # Prepare input data
  cld_obj <- purrr::set_names(.data[[p_name]], pair_names) |>
    # Get CLD
    multcompView::multcompLetters(threshold = alpha)

  # If no differences are detected, "$monospacedLetters" is not created,
  # so "$Letters" is used instead.
  if (is.null(cld_obj$monospacedLetters)) {
    cld_obj$monospacedLetters <- cld_obj$Letters
  }

  # Format the results
  cld_obj |>
    with(
      dplyr::full_join(
        Letters |>
          tibble::enframe(output_gr_var, "cld"),
        monospacedLetters |>
          tibble::enframe(output_gr_var, "spaced_cld") |>
          dplyr::mutate(
            spaced_cld = stringr::str_replace_all(spaced_cld, " ", "_")
          ),
        by = output_gr_var
      )
    )
}

2 Methods

This section briefly introduces the main aspects of inferential statistics and predictive modeling used in this project.

2.1 Statistical Inference

For differences in proportions, the χ² (chi-squared) test was performed, with pairwise χ² tests as the post-hoc procedure. Goodman’s method was used to calculate confidence intervals of multinomial proportions.

For differences between groups of numeric variables, the Kruskal-Wallis test was performed, followed by the Conover-Iman test as the post-hoc procedure. Confidence intervals of means were calculated using a t-distribution-based method.
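
A minimal Python sketch of this workflow is shown below (illustrative only: the scikit-posthocs package is an assumption, as it is not part of the setup above, and the project’s own post-hoc computations may be done in R instead, e.g., via the CLD helper defined earlier):

# Sketch: Kruskal-Wallis omnibus test + Conover-Iman post-hoc (toy data)
# NOTE: scikit-posthocs is an assumption; it is not imported in the setup.
import scipy.stats as sps
import scikit_posthocs as sp

toy = pd.DataFrame({
    "value": [1.2, 1.5, 1.1, 2.4, 2.8, 2.2, 1.3, 1.6, 1.4],
    "group": ["A"] * 3 + ["B"] * 3 + ["C"] * 3,
})

groups = [g["value"].to_numpy() for _, g in toy.groupby("group")]
h_stat, p = sps.kruskal(*groups)  # omnibus test for a difference among groups

# Pairwise Conover-Iman comparisons with Holm-adjusted p-values
p_adj_matrix = sp.posthoc_conover(
    toy, val_col="value", group_col="group", p_adjust="holm"
)
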

In this project, the confidence level is 95% and the significance level is 0.05.

2.2 Predictive Modelling

For predictive modeling, training (data from all seasons except the last one) and test (data from the last season only) sets were used. The training set was used for model selection and the test set for performance evaluation of the selected models.

For the regression task, linear regression and random forests (RF) were used; for the classification task, logistic regression and RF. Forward sequential feature selection (SFS) with 5-fold cross-validation (CV) was used to find an optimal combination of variables. The optimized metric was RMSE (root mean squared error) in regression and BAcc (balanced accuracy), which takes class imbalance into account, in classification.
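
A minimal sketch of this selection step with scikit-learn is shown below (illustrative only: the names X_train and y_train and the feature limit are placeholders, and an alternative SFS implementation, e.g. from mlxtend, could equally well have been used):

# Sketch: forward SFS with 5-fold CV for the regression task
# (X_train, y_train and the feature limit are placeholders/assumptions)
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

sfs = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=10,                # upper limit on selected features
    direction="forward",
    scoring="neg_root_mean_squared_error",  # optimizes RMSE
    cv=5,
)
sfs.fit(X_train, y_train)
selected_features = sfs.get_feature_names_out()

# For the classification task, the analogous call would use, e.g.,
# LogisticRegression() with scoring="balanced_accuracy".
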

As some calculations take a lot of time, in some analyses the total number of available features, the number of features allowed into the model, or both were limited so that the computations fit into a reasonable amount of time. These decisions were based either on RF feature importance analysis or on the results of previous calculations (the number of potentially valuable features and the time needed to perform a certain amount of computation).

Models with higher performance were desirable, but less complex models with almost the same performance as the best one were preferred.

3 Initial Exploration

In this section, the database is presented. Data summaries as well as database tables are explored to better understand the data itself and to determine which pre-processing steps are needed.

3.1 Database

The “Ultimate 25k+ Matches Football Database – European” (v2) was downloaded from Kaggle. The database consists of 7 tables. The entity relationship diagram (ERD) is shown below (Fig. 3.1); note that some columns of the Match table are not shown in the ERD.

Code: Create connection to SQL database
db = sqlite3.connect("data/database.sqlite")
Code
query = """--sql
SELECT name 
FROM sqlite_master 
WHERE type = 'table' AND name != 'sqlite_sequence';
"""
cursor = db.cursor()
cursor.execute(query)

print("Data tables in the database: ")
for i, tbl in enumerate(cursor.fetchall(), start=1):
    print("  ", i, ". ", *tbl, sep="")
Data tables in the database: 
  1. Player_Attributes
  2. Player
  3. Match
  4. League
  5. Country
  6. Team
  7. Team_Attributes

Fig. 3.1. ERD of the European Football Matches Database created with dbSchema. Some columns of the Match table are hidden. Notation: # – numeric variable, t – text variable; key icons mark foreign-key references to and from other tables.

3.2 Tables Country and League

Tables country and league have 11 distinct records each. As Scotland and England are regions of the United Kingdom (UK), there are only 10 countries.

Code
# Working with SQL database
import sqlite3

query = """--sql
SELECT
    (SELECT COUNT(DISTINCT name) FROM Country) n_regions,
    (SELECT COUNT(DISTINCT name) FROM League) n_leagues;
"""
pd.read_sql_query(query, db).style.hide(axis="index")
Table 3.1. Inspection: number of unique items in country and league tables.
n_regions n_leagues
11 11
Code
pd.read_sql_query("SELECT * FROM Country", db).index_start_at(1).style
Table 3.2. Inspection: table country.
  id name
1 1 Belgium
2 1,729 England
3 4,769 France
4 7,809 Germany
5 10,257 Italy
6 13,274 Netherlands
7 15,722 Poland
8 17,642 Portugal
9 19,694 Scotland
10 21,518 Spain
11 24,558 Switzerland
Code
pd.read_sql_query("SELECT * FROM League", db).index_start_at(1).style
Table 3.3. Inspection: table league.
  id country_id name
1 1 1 Belgium Jupiler League
2 1,729 1,729 England Premier League
3 4,769 4,769 France Ligue 1
4 7,809 7,809 Germany 1. Bundesliga
5 10,257 10,257 Italy Serie A
6 13,274 13,274 Netherlands Eredivisie
7 15,722 15,722 Poland Ekstraklasa
8 17,642 17,642 Portugal Liga ZON Sagres
9 19,694 19,694 Scotland Premier League
10 21,518 21,518 Spain LIGA BBVA
11 24,558 24,558 Switzerland Super League

League and country/region id codes coincide, so these variables contain redundant information.

Details: Country/Region and league IDs are the same.
Code
query = """--sql
SELECT 
    id league_id, 
    country_id region_id, 
    IIF(id==country_id, 'yes', 'no') id_are_equal
FROM League;
"""
pd.read_sql_query(query, db).index_start_at(1).style
  league_id region_id id_are_equal
1 1 1 yes
2 1,729 1,729 yes
3 4,769 4,769 yes
4 7,809 7,809 yes
5 10,257 10,257 yes
6 13,274 13,274 yes
7 15,722 15,722 yes
8 17,642 17,642 yes
9 19,694 19,694 yes
10 21,518 21,518 yes
11 24,558 24,558 yes

3.3 Table Match

Table match includes information on 25,979 matches played from 2008-07-18 to 2016-05-25 (seasons 2008/2009 to 2015/2016), with approximately 3,200–3,400 matches per season (except for the 2013/2014 season, where some data is likely missing). More details on the match dataset are given in Tables 3.4–3.5.

Code
query = """--sql
SELECT 
    (SELECT COUNT(1) FROM Match) n_records,
    (SELECT COUNT(DISTINCT country_id) FROM Match) n_regions,
    (SELECT COUNT(DISTINCT league_id)  FROM Match) n_leagues,
    (SELECT COUNT(DISTINCT season)     FROM Match) n_seasons,
    (SELECT COUNT(DISTINCT team) FROM (
        SELECT home_team_api_id team FROM Match UNION
        SELECT away_team_api_id team FROM Match
    )) n_teams,
    (SELECT COUNT(DISTINCT player) FROM (
        SELECT home_player_1  player FROM Match UNION
        SELECT home_player_2  player FROM Match UNION
        SELECT home_player_3  player FROM Match UNION
        SELECT home_player_4  player FROM Match UNION
        SELECT home_player_5  player FROM Match UNION
        SELECT home_player_6  player FROM Match UNION
        SELECT home_player_7  player FROM Match UNION
        SELECT home_player_8  player FROM Match UNION
        SELECT home_player_9  player FROM Match UNION
        SELECT home_player_10 player FROM Match UNION
        SELECT home_player_11 player FROM Match UNION
        SELECT away_player_1  player FROM Match UNION
        SELECT away_player_2  player FROM Match UNION
        SELECT away_player_3  player FROM Match UNION
        SELECT away_player_4  player FROM Match UNION
        SELECT away_player_5  player FROM Match UNION
        SELECT away_player_6  player FROM Match UNION
        SELECT away_player_7  player FROM Match UNION
        SELECT away_player_8  player FROM Match UNION
        SELECT away_player_9  player FROM Match UNION
        SELECT away_player_10 player FROM Match UNION
        SELECT away_player_11 player FROM Match
    )) n_players,
    (SELECT COUNT(DISTINCT match_api_id) FROM Match) n_matches;
"""
n_matches = pd.read_sql_query(query, db)
n_matches.style.hide(axis="index")
Table 3.4. Inspection: number of unique items in match table.
n_records n_regions n_leagues n_seasons n_teams n_players n_matches
25,979 11 11 8 299 11,060 25,979
Code
query = """--sql
SELECT season, COUNT(season) n_matches FROM Match GROUP BY season;
"""
pd.read_sql_query(query, db).index_start_at(1).style
Table 3.5. Number of matches per season in match table.
  season n_matches
1 2008/2009 3,326
2 2009/2010 3,230
3 2010/2011 3,260
4 2011/2012 3,220
5 2012/2013 3,260
6 2013/2014 3,032
7 2014/2015 3,325
8 2015/2016 3,326
Code: Import match
match = pd.read_sql_query("SELECT * FROM Match", db)
# Fix datetime data type
match = match.to_datetime("date")
# Print
match.head(2)
Table 3.6. Inspection: a few rows of table match.
id country_id league_id season stage date match_api_id home_team_api_id away_team_api_id home_team_goal away_team_goal home_player_X1 home_player_X2 home_player_X3 home_player_X4 home_player_X5 home_player_X6 home_player_X7 home_player_X8 home_player_X9 home_player_X10 home_player_X11 away_player_X1 away_player_X2 away_player_X3 away_player_X4 away_player_X5 away_player_X6 away_player_X7 away_player_X8 away_player_X9 away_player_X10 away_player_X11 home_player_Y1 home_player_Y2 home_player_Y3 home_player_Y4 home_player_Y5 home_player_Y6 home_player_Y7 home_player_Y8 home_player_Y9 home_player_Y10 home_player_Y11 away_player_Y1 away_player_Y2 away_player_Y3 away_player_Y4 away_player_Y5 away_player_Y6 away_player_Y7 away_player_Y8 away_player_Y9 away_player_Y10 away_player_Y11 home_player_1 home_player_2 home_player_3 home_player_4 home_player_5 home_player_6 home_player_7 home_player_8 home_player_9 home_player_10 home_player_11 away_player_1 away_player_2 away_player_3 away_player_4 away_player_5 away_player_6 away_player_7 away_player_8 away_player_9 away_player_10 away_player_11 goal shoton shotoff foulcommit card cross corner possession B365H B365D B365A BWH BWD BWA IWH IWD IWA LBH LBD LBA PSH PSD PSA WHH WHD WHA SJH SJD SJA VCH VCD VCA GBH GBD GBA BSH BSD BSA
0 1 1 1 2008/2009 1 2008-08-17 492473 9987 9993 1 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN None None None None None None None None 1.73 3.40 5.00 1.75 3.35 4.20 1.85 3.20 3.50 1.80 3.30 3.75 NaN NaN NaN 1.70 3.30 4.33 1.90 3.30 4.00 1.65 3.40 4.50 1.78 3.25 4.00 1.73 3.40 4.20
1 2 1 1 2008/2009 1 2008-08-16 492474 10000 9994 0 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN None None None None None None None None 1.95 3.20 3.60 1.80 3.30 3.95 1.90 3.20 3.50 1.90 3.20 3.50 NaN NaN NaN 1.83 3.30 3.60 1.95 3.30 3.80 2.00 3.25 3.25 1.85 3.25 3.75 1.91 3.25 3.60


The following variables have uncleaned HTML/XML-like text values and many missing values (45% of cases have NAs), so they will not be included in the further analysis:

  • goal
  • shoton
  • shotoff
  • foulcommit
  • card
  • cross
  • corner
  • possession

Variables with player coordinates (such as home_player_X1 through away_player_Y11) will be excluded too.

The dataset contains columns with betting odds from various betting websites. In the betting odds-related variable names (e.g., B365H), the first few characters indicate the betting website and the last letter has the following meaning:

  • A – Away wins,
  • D – Draw,
  • H – Home wins.

These variables can be renamed to give easier-to-understand variable names. Note also that betting odds from some websites, abbreviated as PS (57% NAs), SJ (34%), GB (45%), and BS (45%), have many missing values.
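
A minimal pandas sketch of these clean-up steps is shown below (illustrative only: the exact renaming scheme, e.g. B365_home, is an assumption made for this example and may differ from the naming used later in the report):

# Sketch: drop uncleaned text and coordinate columns, rename betting odds
import re

text_cols = ["goal", "shoton", "shotoff", "foulcommit",
             "card", "cross", "corner", "possession"]
coord_cols = match.filter(regex=r"_player_[XY]\d+$").columns
match_clean = match.drop(columns=text_cols + list(coord_cols))

# Rename betting odds columns, e.g., B365H -> B365_home
suffix_map = {"H": "home", "D": "draw", "A": "away"}
odds_pattern = re.compile(r"^(B365|BW|IW|LB|PS|WH|SJ|VC|GB|BS)([HDA])$")
match_clean = match_clean.rename(
    columns=lambda c: f"{m.group(1)}_{suffix_map[m.group(2)]}"
    if (m := odds_pattern.match(c))
    else c
)
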

Other highlights from the profiling report:

  • as expected, the distribution of matches shows yearly patterns (see the section on variable date in the data profiling report);
  • correlations between various betting odds are high (see the correlation section in the report); this could be investigated in more detail.
Details: Text columns to exclude

This is just a short illustration of the issue (see column top with the most frequent values in the rows goal and below). See also the missing-values column in the overview of the match table. More details can be explored in the data profiling report for the match table.

Code
match.describe(include="O").T
count unique top freq
season 25979 8 2008/2009 3326
goal 14217 13225 <goal /> 993
shoton 14217 8464 <shoton /> 5754
shotoff 14217 8464 <shotoff /> 5754
foulcommit 14217 8466 <foulcommit /> 5752
card 14217 13777 <card /> 441
cross 14217 8466 <cross /> 5752
corner 14217 8465 <corner /> 5753
possession 14217 8420 <possession /> 5798
EDA: Overview of match table
Code
skim(match)
╭──────────────────────────────────────────────── skimpy summary ─────────────────────────────────────────────────╮
│          Data Summary                Data Types                                                                 │
│ ┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┓ ┏━━━━━━━━━━━━━┳━━━━━━━┓                                                          │
│ ┃ dataframe          Values ┃ ┃ Column Type  Count ┃                                                          │
│ ┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━┩ ┡━━━━━━━━━━━━━╇━━━━━━━┩                                                          │
│ │ Number of rows    │ 25979  │ │ float64     │ 96    │                                                          │
│ │ Number of columns │ 115    │ │ int32       │ 9     │                                                          │
│ └───────────────────┴────────┘ │ string      │ 9     │                                                          │
│                                │ datetime64  │ 1     │                                                          │
│                                └─────────────┴───────┘                                                          │
│                                                     number                                                      │
│ ┏━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┓  │
│ ┃ column_name         NA      NA %   mean      sd       p0       p25      p75       p100     hist   ┃  │
│ ┡━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━┩  │
│ │ id                     0    0   13000   7500      1   6500   19000  26000██████ │  │
│ │ country_id             0    0   12000   7600      1   4800   18000  25000▇█▄▆▆▇ │  │
│ │ league_id              0    0   12000   7600      1   4800   18000  25000▇█▄▆▆▇ │  │
│ │ stage                  0    0      18     10      1      9      27     38█▇▇▇▇▅ │  │
│ │ match_api_id           0    0 1200000 490000 480000 770000 17000002200000█▇▅▄▄▄ │  │
│ │ home_team_api_id       0    0   10000  14000   1600   8500    9900 270000 │  │
│ │ away_team_api_id       0    0   10000  14000   1600   8500    9900 270000 │  │
│ │ home_team_goal         0    0     1.5    1.3      0      1       2     10 █▅▁   │  │
│ │ away_team_goal         0    0     1.2    1.1      0      0       2      9 █▂▁   │  │
│ │ home_player_X1      1800    7       1  0.022      0      1       1      2 │  │
│ │ home_player_X2      1800    7     2.1   0.39      0      2       2      8  █▁   │  │
│ │ home_player_X3      1800  7.1     4.1   0.39      1      4       4      8 │  │
│ │ home_player_X4      1800  7.1       6   0.45      2      6       6      8    █▁ │  │
│ │ home_player_X5      1800  7.1     7.5    1.6      1      8       8      9▁    █ │  │
│ │ home_player_X6      1800  7.1     3.2    1.2      1      2       4      9 █▆▇▂  │  │
│ │ home_player_X7      1800  7.1     4.8    1.1      1      4       6      9  ▁▅█  │  │
│ │ home_player_X8      1800  7.1     5.3    1.7      1      3       7      9 ▅▁█▅▁ │  │
│ │ home_player_X9      1800  7.1     5.8      2      1      5       8      9 ▄▁█▁█ │  │
│ │ home_player_X10     1800  7.1     5.4    1.5      1      4       7      9  █▅▅▁ │  │
│ │ home_player_X11     1800  7.1     5.8   0.76      1      5       6      7    ▅█ │  │
│ │ away_player_X1      1800  7.1       1  0.033      1      1       1      6 │  │
│ │ away_player_X2      1800  7.1     2.1    0.4      1      2       2      8  █▁   │  │
│ │ away_player_X3      1800  7.1     4.1   0.39      2      4       4      9 │  │
│ │ away_player_X4      1800  7.1     6.1   0.45      1      6       6      8    █▁ │  │
│ │ away_player_X5      1800  7.1     7.5    1.6      1      8       8      9▁    █ │  │
│ │ away_player_X6      1800  7.1     3.2    1.3      1      2       4      9 █▆▇▂  │  │
│ │ away_player_X7      1800  7.1     4.7    1.1      1      4       6      9 ▁▁▅█  │  │
│ │ away_player_X8      1800  7.1     5.3    1.7      1      3       7      9 ▅▁█▅▁ │  │
│ │ away_player_X9      1800  7.1     5.8      2      1      5       8      9 ▄▁█▂█ │  │
│ │ away_player_X10     1800  7.1     5.5    1.5      1      4       7      9  █▆▆▂ │  │
│ │ away_player_X11     1800  7.1     5.8   0.76      3      5       6      8  █▇▄  │  │
│ │ home_player_Y1      1800    7       1  0.025      0      1       1      3 │  │
│ │ home_player_Y2      1800    7       3  0.064      0      3       3      3 │  │
│ │ home_player_Y3      1800  7.1       3  0.013      3      3       3      5 │  │
│ │ home_player_Y4      1800  7.1       3  0.029      3      3       3      5 │  │
│ │ home_player_Y5      1800  7.1     3.2   0.94      3      3       3      8 │  │
│ │ home_player_Y6      1800  7.1     6.5   0.74      3      6       7      9  ▁▅█  │  │
│ │ home_player_Y7      1800  7.1     6.7   0.59      3      6       7      9   ▄█  │  │
│ │ home_player_Y8      1800  7.1     7.2   0.59      3      7       8     10   █▄  │  │
│ │ home_player_Y9      1800  7.1       8    1.1      1      7       8     10    █▃ │  │
│ │ home_player_Y10     1800  7.1     9.2    1.1      3      8      10     11   ▅▁█ │  │
│ │ home_player_Y11     1800  7.1      10   0.51      1     10      11     11 │  │
│ │ away_player_Y1      1800  7.1       1  0.022      1      1       1      3 │  │
│ │ away_player_Y2      1800  7.1       3      0      3      3       3      3 │  │
│ │ away_player_Y3      1800  7.1       3  0.026      3      3       3      7 │  │
│ │ away_player_Y4      1800  7.1       3  0.029      3      3       3      7 │  │
│ │ away_player_Y5      1800  7.1     3.2   0.96      3      3       3      9█   ▁  │  │
│ │ away_player_Y6      1800  7.1     6.5   0.76      3      6       7     10  ▁▅█  │  │
│ │ away_player_Y7      1800  7.1     6.7   0.59      3      6       7     10   ▄█  │  │
│ │ away_player_Y8      1800  7.1     7.2   0.58      3      7       8     10   █▄  │  │
│ │ away_player_Y9      1800  7.1       8    1.1      5      7       8     11  █▆▁▄ │  │
│ │ away_player_Y10     1800  7.1     9.2    1.1      6      8      10     11 ▁▄▁█  │  │
│ │ away_player_Y11     1800  7.1      10    0.5      7     10      11     11    █▇ │  │
│ │ home_player_1       1200  4.7   77000  88000   3000  31000   97000 700000  █▁   │  │
│ │ home_player_2       1300  5.1  110000 110000   2800  33000  160000 750000 █▂▁   │  │
│ │ home_player_3       1300  4.9   92000 100000   2800  31000  130000 710000 █▂▁   │  │
│ │ home_player_4       1300  5.1   95000 100000   2800  31000  150000 720000 █▂▁   │  │
│ │ home_player_5       1300  5.1  110000 110000   2800  34000  160000 730000 █▂▁   │  │
│ │ home_player_6       1300  5.1  100000 110000   2600  31000  150000 750000 █▂▁   │  │
│ │ home_player_7       1200  4.7   97000 110000   2600  31000  140000 690000 █▂▁   │  │
│ │ home_player_8       1300    5  110000 110000   2600  33000  160000 690000 █▂▁   │  │
│ │ home_player_9       1300  4.9  110000 120000   2600  33000  160000 730000 █▃▁   │  │
│ │ home_player_10      1400  5.5  110000 110000   2600  32000  160000 740000 █▂▁   │  │
│ │ home_player_11      1600    6  100000 110000   2800  33000  160000 730000 █▂▁   │  │
│ │ away_player_1       1200  4.7   77000  87000   2800  31000   97000 700000  █▁   │  │
│ │ away_player_2       1300  4.9  110000 110000   2800  33000  160000 750000 █▃▁   │  │
│ │ away_player_3       1300    5   91000 100000   2800  30000  120000 710000 █▂▁   │  │
│ │ away_player_4       1300  5.1   95000 100000   2800  31000  150000 730000 █▂▁   │  │
│ │ away_player_5       1300  5.1  110000 110000   2800  33000  160000 750000 █▂▁   │  │
│ │ away_player_6       1300  5.1  100000 110000   2600  31000  150000 720000 █▂▁   │  │
│ │ away_player_7       1200  4.8   98000 110000   2600  31000  140000 750000  █▂   │  │
│ │ away_player_8       1300  5.2  110000 120000   2600  33000  160000 720000 █▂▁   │  │
│ │ away_player_9       1300  5.1  110000 120000   2600  33000  160000 720000 █▃▁   │  │
│ │ away_player_10      1400  5.5  110000 110000   2800  33000  160000 720000 █▂▁   │  │
│ │ away_player_11      1600    6  100000 110000   2800  33000  160000 730000 █▂▁   │  │
│ │ B365H               3400   13     2.6    1.8      1    1.7     2.8     26 │  │
│ │ B365D               3400   13     3.8    1.1    1.4    3.3       4     17  █▂   │  │
│ │ B365A               3400   13     4.7    3.7    1.1    2.5     5.2     51  █▁   │  │
│ │ BWH                 3400   13     2.6    1.6      1    1.6     2.8     34 │  │
│ │ BWD                 3400   13     3.7      1    1.6    3.2     3.8     20  █▁   │  │
│ │ BWA                 3400   13     4.4    3.3    1.1    2.5       5     51  █▁   │  │
│ │ IWH                 3500   13     2.5    1.4      1    1.6     2.6     20  █▁   │  │
│ │ IWD                 3500   13     3.6    0.8    1.5    3.2     3.7     11 ▁█▁   │  │
│ │ IWA                 3500   13     4.2    2.9    1.1    2.5     4.6     25  █▂   │  │
│ │ LBH                 3400   13     2.5    1.6      1    1.7     2.7     26 │  │
│ │ LBD                 3400   13     3.7      1    1.4    3.2     3.8     19  █▁   │  │
│ │ LBA                 3400   13     4.4    3.4    1.1    2.5       5     51  █▁   │  │
│ │ PSH                15000   57     2.8    2.2      1    1.7       3     36 │  │
│ │ PSD                15000   57     4.1    1.5    2.2    3.4     4.2     29 │  │
│ │ PSA                15000   57       5    4.5    1.1    2.6     5.4     48  █▁   │  │
│ │ WHH                 3400   13     2.6    1.7      1    1.7     2.8     26 │  │
│ │ WHD                 3400   13     3.7   0.96      1    3.2     3.8     17  █▃   │  │
│ │ WHA                 3400   13     4.5    3.6    1.1    2.5       5     51  █▁   │  │
│ │ SJH                 8900   34     2.6    1.7      1    1.7     2.8     23  █▁   │  │
│ │ SJD                 8900   34     3.8      1    1.4    3.2     3.8     15  █▃   │  │
│ │ SJA                 8900   34     4.6    3.6    1.1    2.5     5.2     41  █▁   │  │
│ │ VCH                 3400   13     2.7    1.9      1    1.7     2.8     36 │  │
│ │ VCD                 3400   13     3.9    1.2    1.6    3.3       4     26  █▁   │  │
│ │ VCA                 3400   13     4.8    4.3    1.1    2.5     5.4     67 │  │
│ │ GBH                12000   45     2.5    1.5    1.1    1.7     2.6     21  █▁   │  │
│ │ GBD                12000   45     3.6   0.87    1.4    3.2     3.8     11 ▁█▁   │  │
│ │ GBA                12000   45     4.4      3    1.1    2.5       5     34  █▁   │  │
│ │ BSH                12000   45     2.5    1.5      1    1.7     2.6     17  █▁   │  │
│ │ BSD                12000   45     3.7   0.87    1.3    3.2     3.8     13 ▅█▁   │  │
│ │ BSA                12000   45     4.4    3.2    1.1    2.5       5     34  █▁   │  │
│ └────────────────────┴────────┴───────┴──────────┴─────────┴─────────┴─────────┴──────────┴─────────┴────────┘  │
│                                                    datetime                                                     │
│ ┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓  │
│ ┃ column_name             NA      NA %       first                last                 frequency        ┃  │
│ ┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩  │
│ │ date                       0        0    2008-07-18         2016-05-25     None             │  │
│ └────────────────────────┴────────┴───────────┴─────────────────────┴─────────────────────┴──────────────────┘  │
│                                                     string                                                      │
│ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┓  │
│ ┃ column_name               NA            NA %        words per row               total words            ┃  │
│ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━┩  │
│ │ season                             0         0                         1                 26000 │  │
│ │ goal                           12000        45                         1                 26000 │  │
│ │ shoton                         12000        45                         1                 26000 │  │
│ │ shotoff                        12000        45                         1                 26000 │  │
│ │ foulcommit                     12000        45                         1                 26000 │  │
│ │ card                           12000        45                         1                 26000 │  │
│ │ cross                          12000        45                         1                 26000 │  │
│ │ corner                         12000        45                         1                 26000 │  │
│ │ possession                     12000        45                         1                 26000 │  │
│ └──────────────────────────┴──────────────┴────────────┴────────────────────────────┴────────────────────────┘  │
╰────────────────────────────────────────────────────── End ──────────────────────────────────────────────────────╯
EDA: Data Profiling Report of match
Code
if do_eda:
    eda.ProfileReport(
        match,
        title="Data Profiling Report: match",
        config_file="_config/ydata_profile_config--mini.yaml",
    )

3.4 Table Player

The database includes information on 11,060 European football players. No missing values or other obvious discrepancies were found in this dataset.

Code
query = """--sql
SELECT COUNT(*) n_records, COUNT(player_api_id) n_players FROM Player;
"""
n_players = pd.read_sql_query(query, db)

# Print
n_players.style.hide(axis="index")
Table 3.7. Inspection: number of unique items in player table.
n_records n_players
11,060 11,060
Code: Import player
player = pd.read_sql_query("SELECT * FROM Player;", db)
# Fix datetime data type
player = player.to_datetime("birthday")
# Print
player.head(2)
Table 3.8. Inspection: a few rows of table player.
id player_api_id player_name player_fifa_api_id birthday height weight
0 1 505942 Aaron Appindangoye 218353 1992-02-29 182.88 187
1 2 155782 Aaron Cresswell 189615 1989-12-15 170.18 146
EDA: Overview of player table
Code
skim(player)
╭──────────────────────────────────────────────── skimpy summary ─────────────────────────────────────────────────╮
│          Data Summary                Data Types                                                                 │
│ ┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┓ ┏━━━━━━━━━━━━━┳━━━━━━━┓                                                          │
│ ┃ dataframe          Values ┃ ┃ Column Type  Count ┃                                                          │
│ ┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━┩ ┡━━━━━━━━━━━━━╇━━━━━━━┩                                                          │
│ │ Number of rows    │ 11060  │ │ int32       │ 4     │                                                          │
│ │ Number of columns │ 7      │ │ string      │ 1     │                                                          │
│ └───────────────────┴────────┘ │ datetime64  │ 1     │                                                          │
│                                │ float64     │ 1     │                                                          │
│                                └─────────────┴───────┘                                                          │
│                                                     number                                                      │
│ ┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━┳━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┓  │
│ ┃ column_name             NA   NA %   mean      sd        p0     p25      p75      p100     hist    ┃  │
│ ┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━╇━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━┩  │
│ │ id                      0    0    5500    3200    1   2800   8300  11000██████  │  │
│ │ player_api_id           0    0  160000  160000 2600  36000 210000 750000 █▃▁▁▁  │  │
│ │ player_fifa_api_id      0    0  170000   59000    2 150000 200000 230000▂▁▁▂█▇  │  │
│ │ height                  0    0     180     6.4  160    180    190    210  ▂▆█▁  │  │
│ │ weight                  0    0     170      15  120    160    180    240  ▃█▃   │  │
│ └────────────────────────┴─────┴───────┴──────────┴──────────┴───────┴─────────┴─────────┴─────────┴─────────┘  │
│                                                    datetime                                                     │
│ ┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓  │
│ ┃ column_name             NA      NA %       first                last                 frequency        ┃  │
│ ┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩  │
│ │ birthday                   0        0    1967-01-23         1999-04-24     None             │  │
│ └────────────────────────┴────────┴───────────┴─────────────────────┴─────────────────────┴──────────────────┘  │
│                                                     string                                                      │
│ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┓  │
│ ┃ column_name                NA       NA %        words per row                 total words              ┃  │
│ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━┩  │
│ │ player_name                    0         0                           2                   22000 │  │
│ └───────────────────────────┴─────────┴────────────┴──────────────────────────────┴──────────────────────────┘  │
╰────────────────────────────────────────────────────── End ──────────────────────────────────────────────────────╯
EDA: Data Profiling Report of player
Code
if do_eda:
    eda.ProfileReport(
        player,
        title="Data Profiling Report: player",
        config_file="_config/ydata_profile_config--mini.yaml",
    )

3.5 Table Player_Attributes

Table Player_Attributes contains 183,978 records describing various attributes of 11,060 players. Variables in this dataset have from 0.45% to 1.46% missing values.

Some numeric variables are bimodal, which might indicate that their importance differs between player roles:

  • ball_control
  • interceptions
  • marking
  • standing_tackle
  • sliding_tackle

Goalkeeper-related variables also have a distinct distribution: a few players have high scores (most probably goalkeepers) and many have low scores (most probably players in the remaining roles):

  • gk_diving
  • gk_handling
  • gk_kicking
  • gk_positioning
  • gk_reflexes

There are 3 categorical variables, including 2 related to work rates. FIFA defines work rate categories as either “low”, “medium”, or “high” [1]. In the dataset, these columns contain additional values, which in most cases can be treated as errors, especially when the values make no sense. What is more, comparing the attacking and defensive work rate columns, some errors in one column indicate what kind of errors appear in the other column. Some of these errors are characteristic only of data dated before 2012, which suggests data scraping errors or missing information on the scraped webpages.
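
One simple way to handle the extra values is to keep only the three FIFA categories and treat everything else as missing; a sketch of this option is shown below (an illustration only, not necessarily the choice made later in this report):

# Sketch: keep only valid FIFA work-rate categories, set the rest to NaN
valid_wr = ["low", "medium", "high"]

for col in ["attacking_work_rate", "defensive_work_rate"]:
    player_attributes[col] = player_attributes[col].where(
        player_attributes[col].isin(valid_wr)
    )
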

Code
query = """--sql
SELECT COUNT(1) n_records, COUNT(DISTINCT player_api_id) n_players
FROM Player_Attributes;
"""

n_player_attributes = pd.read_sql_query(query, db)
n_player_attributes.style.hide(axis="index")
Table 3.9. Inspection: number of unique items in player_attributes table.
n_records n_players
183,978 11,060
Code: Import player_attributes
# Import
player_attributes = pd.read_sql_query("SELECT * FROM Player_Attributes", db)

# Fix datetime data type
player_attributes = player_attributes.to_datetime("date")

# Print
player_attributes.head(2)
Table 3.10. Inspection: a few rows of table player_attributes.
id player_fifa_api_id player_api_id date overall_rating potential preferred_foot attacking_work_rate defensive_work_rate crossing finishing heading_accuracy short_passing volleys dribbling curve free_kick_accuracy long_passing ball_control acceleration sprint_speed agility reactions balance shot_power jumping stamina strength long_shots aggression interceptions positioning vision penalties marking standing_tackle sliding_tackle gk_diving gk_handling gk_kicking gk_positioning gk_reflexes
0 1 218353 505942 2016-02-18 67.00 71.00 right medium medium 49.00 44.00 71.00 61.00 44.00 51.00 45.00 39.00 64.00 49.00 60.00 64.00 59.00 47.00 65.00 55.00 58.00 54.00 76.00 35.00 71.00 70.00 45.00 54.00 48.00 65.00 69.00 69.00 6.00 11.00 10.00 8.00 8.00
1 2 218353 505942 2015-11-19 67.00 71.00 right medium medium 49.00 44.00 71.00 61.00 44.00 51.00 45.00 39.00 64.00 49.00 60.00 64.00 59.00 47.00 65.00 55.00 58.00 54.00 76.00 35.00 71.00 70.00 45.00 54.00 48.00 65.00 69.00 69.00 6.00 11.00 10.00 8.00 8.00


EDA: Overview of player_attributes table
Code
skim(player_attributes)
╭──────────────────────────────────────────────── skimpy summary ─────────────────────────────────────────────────╮
│          Data Summary                Data Types                                                                 │
│ ┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┓ ┏━━━━━━━━━━━━━┳━━━━━━━┓                                                          │
│ ┃ dataframe          Values ┃ ┃ Column Type  Count ┃                                                          │
│ ┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━┩ ┡━━━━━━━━━━━━━╇━━━━━━━┩                                                          │
│ │ Number of rows    │ 183978 │ │ float64     │ 35    │                                                          │
│ │ Number of columns │ 42     │ │ int32       │ 3     │                                                          │
│ └───────────────────┴────────┘ │ string      │ 3     │                                                          │
│                                │ datetime64  │ 1     │                                                          │
│                                └─────────────┴───────┘                                                          │
│                                                     number                                                      │
│ ┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┓  │
│ ┃ column_name             NA     NA %   mean      sd       p0     p25      p75      p100     hist   ┃  │
│ ┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━┩  │
│ │ id                        0    0   92000  53000    1  46000 140000 180000██████ │  │
│ │ player_fifa_api_id        0    0  170000  54000    2 160000 200000 230000▁▁▁▂█▅ │  │
│ │ player_api_id             0    0  140000 140000 2600  35000 190000 750000 █▃▁▁  │  │
│ │ overall_rating          840 0.45      69      7   33     64     73     94  ▃█▃  │  │
│ │ potential               840 0.45      73    6.6   39     69     78     97  ▃█▄  │  │
│ │ crossing                840 0.45      55     17    1     45     68     95▁▂▄██▁ │  │
│ │ finishing               840 0.45      50     19    1     34     65     97▁▅▆█▇▁ │  │
│ │ heading_accuracy        840 0.45      57     16    1     49     68     98▁▁▃█▆▁ │  │
│ │ short_passing           840 0.45      62     14    3     57     72     97 ▁▁▇█▁ │  │
│ │ volleys                2700  1.5      49     18    1     35     64     93▁▄▅█▆▁ │  │
│ │ dribbling               840 0.45      59     18    1     52     72     97▁▁▂▆█▁ │  │
│ │ curve                  2700  1.5      53     18    2     41     67     94▁▃▅█▇▁ │  │
│ │ free_kick_accuracy      840 0.45      49     18    1     36     63     97▁▄▇█▆▁ │  │
│ │ long_passing            840 0.45      57     14    3     49     67     97 ▂▃█▅  │  │
│ │ ball_control            840 0.45      63     15    5     58     73     97 ▁▁▆█▁ │  │
│ │ acceleration            840 0.45      68     13   10     61     77     97 ▁▂▅█▂ │  │
│ │ sprint_speed            840 0.45      68     13   12     62     77     97 ▁▂▆█▂ │  │
│ │ agility                2700  1.5      66     13   11     58     75     96 ▁▂▇█▂ │  │
│ │ reactions               840 0.45      66    9.2   17     61     72     96  ▂█▆  │  │
│ │ balance                2700  1.5      65     13   12     58     74     96 ▁▃▇█▂ │  │
│ │ shot_power              840 0.45      62     16    2     54     73     97 ▁▂▆█▁ │  │
│ │ jumping                2700  1.5      67     11   14     60     74     96  ▂██▁ │  │
│ │ stamina                 840 0.45      67     13   10     61     76     96 ▁▁▆█▂ │  │
│ │ strength                840 0.45      67     12   10     60     76     96  ▁▆█▂ │  │
│ │ long_shots              840 0.45      53     18    1     41     67     96▁▃▄█▇▁ │  │
│ │ aggression              840 0.45      61     16    6     51     73     97 ▂▃▇█▂ │  │
│ │ interceptions           840 0.45      52     19    1     34     68     96 ▆▄▇█▁ │  │
│ │ positioning             840 0.45      56     18    2     45     69     96▁▃▃▇█▁ │  │
│ │ vision                 2700  1.5      58     15    1     49     69     97 ▁▃█▇▁ │  │
│ │ penalties               840 0.45      55     16    2     45     67     96 ▂▄█▆▁ │  │
│ │ marking                 840 0.45      47     21    1     25     66     96▂█▄▇▇▁ │  │
│ │ standing_tackle         840 0.45      50     21    1     29     69     95▁▆▃▄█▁ │  │
│ │ sliding_tackle         2700  1.5      48     22    2     25     67     95▂▇▄▅█▁ │  │
│ │ gk_diving               840 0.45      15     17    1      7     13     94 │  │
│ │ gk_handling             840 0.45      16     16    1      8     15     93█▁  ▁  │  │
│ │ gk_kicking              840 0.45      21     21    1      8     15     97█  ▁▁  │  │
│ │ gk_positioning          840 0.45      16     16    1      8     15     96  █▁   │  │
│ │ gk_reflexes             840 0.45      16     17    1      8     15     96█▁  ▁  │  │
│ └────────────────────────┴───────┴───────┴──────────┴─────────┴───────┴─────────┴─────────┴─────────┴────────┘  │
│                                                    datetime                                                     │
│ ┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓  │
│ ┃ column_name             NA      NA %       first                last                 frequency        ┃  │
│ ┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩  │
│ │ date                       0        0    2007-02-22         2016-07-07     None             │  │
│ └────────────────────────┴────────┴───────────┴─────────────────────┴─────────────────────┴──────────────────┘  │
│                                                     string                                                      │
│ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓  │
│ ┃ column_name                         NA         NA %       words per row            total words         ┃  │
│ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩  │
│ │ preferred_foot                          840     0.45                      1             180000 │  │
│ │ attacking_work_rate                    3200      1.8                      1             180000 │  │
│ │ defensive_work_rate                     840     0.45                      1             180000 │  │
│ └────────────────────────────────────┴───────────┴───────────┴─────────────────────────┴─────────────────────┘  │
╰────────────────────────────────────────────────────── End ──────────────────────────────────────────────────────╯
EDA for categorical variables in player_attributes (1: existing values)
Code
player_attributes.preferred_foot.value_counts().to_df()
preferred_foot count
0 right 138409
1 left 44733
Code
player_attributes.attacking_work_rate.value_counts().to_df()
attacking_work_rate count
0 medium 125070
1 high 42823
2 low 8569
3 None 3639
4 norm 348
5 y 106
6 le 104
7 stoc 89
Code
player_attributes.defensive_work_rate.value_counts().to_df()
defensive_work_rate count
0 medium 130846
1 high 27041
2 low 18432
3 _0 2394
4 o 1550
5 1 441
6 ormal 348
7 2 342
8 3 258
9 5 234
10 7 217
11 0 197
12 6 197
13 9 152
14 4 116
15 es 106
16 ean 104
17 tocky 89
18 8 78
EDA for categorical variables in player_attributes (2: patterns)

Cells with zero values are in pastel red.

Code
# wr - work rate
wr_cats = ["low", "medium", "high"]

(
    pd.crosstab(
        player_attributes.defensive_work_rate.to_category(wr_cats),
        player_attributes.attacking_work_rate.to_category(wr_cats),
    )
    .style.background_gradient()
    .highlight_between(left=0, right=0, color="#FFBBBB")
)
attacking_work_rate low medium high None le norm stoc y
defensive_work_rate                
low 695 12,003 5,727 7 0 0 0 0
medium 4,525 97,154 29,085 82 0 0 0 0
high 3,319 15,714 7,939 69 0 0 0 0
0 0 9 11 177 0 0 0 0
1 0 35 9 397 0 0 0 0
2 0 76 13 253 0 0 0 0
3 12 11 0 235 0 0 0 0
4 18 9 0 89 0 0 0 0
5 0 11 17 206 0 0 0 0
6 0 21 13 163 0 0 0 0
7 0 9 5 203 0 0 0 0
8 0 5 0 73 0 0 0 0
9 0 13 4 135 0 0 0 0
ean 0 0 0 0 104 0 0 0
es 0 0 0 0 0 0 0 106
o 0 0 0 1,550 0 0 0 0
ormal 0 0 0 0 0 348 0 0
tocky 0 0 0 0 0 0 89 0
Code
pd.crosstab(
    player_attributes.attacking_work_rate.to_category(wr_cats),
    player_attributes.date.dt.year.rename("year"),
).style.highlight_between(left=0, right=0, color="#FFBBBB")
year 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016
attacking_work_rate                    
low 719 302 490 560 640 613 1,729 1,565 1,352 599
medium 10,445 3,687 5,851 7,329 8,717 9,384 27,524 22,285 20,955 8,893
high 2,432 929 1,524 1,854 2,182 2,497 9,175 8,606 9,162 4,462
None 735 338 436 498 270 131 402 336 349 144
le 35 17 21 23 8 0 0 0 0 0
norm 111 56 64 94 23 0 0 0 0 0
stoc 25 13 20 21 10 0 0 0 0 0
y 32 15 24 25 10 0 0 0 0 0
Code
pd.crosstab(
    player_attributes.defensive_work_rate.to_category(wr_cats),
    player_attributes.date.dt.year.rename("year"),
).style.highlight_between(left=0, right=0, color="#FFBBBB")
year 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016
defensive_work_rate                    
low 1,368 567 890 1,050 1,170 1,236 3,986 3,347 3,325 1,493
medium 10,539 3,633 5,812 7,313 8,805 9,582 28,658 23,858 22,847 9,799
high 1,674 711 1,154 1,370 1,559 1,667 5,764 5,199 5,281 2,662
0 19 10 15 12 8 9 29 41 37 17
1 42 11 18 27 27 19 89 85 88 35
2 30 8 11 19 17 21 72 56 80 28
3 39 16 21 21 21 23 51 37 25 4
4 18 8 8 14 13 10 21 14 8 2
5 29 6 8 13 13 20 49 57 25 14
6 26 10 13 14 12 10 27 33 32 20
7 22 9 14 18 14 8 32 34 49 17
8 9 4 6 6 7 8 20 9 5 4
9 20 8 12 16 11 12 32 22 16 3
_0 868 439 560 419 108 0 0 0 0 0
ean 35 17 21 23 8 0 0 0 0 0
es 32 15 24 25 10 0 0 0 0 0
o 496 255 319 348 132 0 0 0 0 0
ormal 111 56 64 94 23 0 0 0 0 0
tocky 25 13 20 21 10 0 0 0 0 0
EDA: Data Profiling Report of player_attributes
Code
if do_eda:
    eda.ProfileReport(
        player_attributes,
        title="Data Profiling Report: player_attributes",
        config_file="_config/ydata_profile_config--default.yaml",
    )

3.6 Table Team

Table team contains records on 299 football teams.

Code
query = """--sql
SELECT COUNT(1) n_records, COUNT(DISTINCT team_api_id) n_teams FROM Team;
"""
n_teams = pd.read_sql_query(query, db)
n_teams.style.hide(axis="index")
Table 3.11. Inspection: number of unique items in team table.
n_records n_teams
299 299
Code: Import team
team = pd.read_sql_query("SELECT * FROM Team ", db)
# Print
team.head(2).style.hide(axis="index").format(precision=1)
Table 3.12. Inspection: a few rows of table team.
id team_api_id team_fifa_api_id team_long_name team_short_name
1 9987 673.0 KRC Genk GEN
2 9993 675.0 Beerschot AC BAC
EDA: Overview of team table
Code
skim(team)
╭──────────────────────────────────────────────── skimpy summary ─────────────────────────────────────────────────╮
│          Data Summary                Data Types                                                                 │
│ ┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┓ ┏━━━━━━━━━━━━━┳━━━━━━━┓                                                          │
│ ┃ dataframe          Values ┃ ┃ Column Type  Count ┃                                                          │
│ ┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━┩ ┡━━━━━━━━━━━━━╇━━━━━━━┩                                                          │
│ │ Number of rows    │ 299    │ │ int32       │ 2     │                                                          │
│ │ Number of columns │ 5      │ │ string      │ 2     │                                                          │
│ └───────────────────┴────────┘ │ float64     │ 1     │                                                          │
│                                └─────────────┴───────┘                                                          │
│                                                     number                                                      │
│ ┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┓  │
│ ┃ column_name             NA    NA %    mean     sd       p0      p25     p75      p100     hist    ┃  │
│ ┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━┩  │
│ │ id                       0     0  24000  15000     1  9600  36000  52000██▅▇▆▆  │  │
│ │ team_api_id              0     0  12000  26000  1600  8300   9900 270000 │  │
│ │ team_fifa_api_id        11   3.7  22000  42000     1   180   1900 110000█    ▂  │  │
│ └────────────────────────┴──────┴────────┴─────────┴─────────┴────────┴────────┴─────────┴─────────┴─────────┘  │
│                                                     string                                                      │
│ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┓  │
│ ┃ column_name                      NA      NA %        words per row               total words           ┃  │
│ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━┩  │
│ │ team_long_name                      0         0                       2.1                  610 │  │
│ │ team_short_name                     0         0                       2.1                  610 │  │
│ └─────────────────────────────────┴────────┴────────────┴────────────────────────────┴───────────────────────┘  │
╰────────────────────────────────────────────────────── End ──────────────────────────────────────────────────────╯
EDA: Data Profiling Report of team
Code
if do_eda:
    eda.ProfileReport(
        team,
        title="Data Profiling Report: team",
        config_file="_config/ydata_profile_config--mini.yaml",
    )

3.7 Table Team_Attributes

Table team_attributes contains 1,458 records on 288 teams, i.e., 11 teams fewer than in the team table. What is more, the data is available only from 2010 onwards.

Some variables, like buildUpPlayDribbling and buildUpPlayDribblingClass, have both a numeric version (without the word Class in the name) and a categorical version (with the word Class). Graphical inspection shows that the numeric values of different categorical classes do not overlap.
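
The non-overlap can also be cross-checked numerically by comparing per-class ranges; a small sketch (complementing the boxplots further below, with only two of the pairs shown) is:

# Sketch: per-class min/max of a numeric variable should form
# non-overlapping ranges if the class is a clean cut-off of the numeric version
pairs = [
    ("buildUpPlaySpeed", "buildUpPlaySpeedClass"),
    ("buildUpPlayDribbling", "buildUpPlayDribblingClass"),
]
for num_col, cls_col in pairs:
    print(team_attributes.groupby(cls_col)[num_col].agg(["min", "max"]), "\n")
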

Categorical variables buildUpPlayPositioningClass, chanceCreationPositioningClass, and defenceDefenderLineClass do not have numeric equivalents.

Code
query = """--sql
SELECT COUNT(1) n_records, COUNT(DISTINCT team_api_id) n_teams
FROM Team_Attributes;
"""
pd.read_sql_query(query, db).style.hide(axis="index")
Table 3.13. Inspection: number of unique items in team_attributes table.
n_records n_teams
1,458 288
Code: Import team_attributes
# Import
team_attributes = pd.read_sql_query("SELECT * FROM Team_Attributes;", db)
# Pre-process
team_attributes = team_attributes.to_datetime("date")
# Print
team_attributes.head(2)
Table 3.14. Inspection: a few rows of table team_attributes.
id team_fifa_api_id team_api_id date buildUpPlaySpeed buildUpPlaySpeedClass buildUpPlayDribbling buildUpPlayDribblingClass buildUpPlayPassing buildUpPlayPassingClass buildUpPlayPositioningClass chanceCreationPassing chanceCreationPassingClass chanceCreationCrossing chanceCreationCrossingClass chanceCreationShooting chanceCreationShootingClass chanceCreationPositioningClass defencePressure defencePressureClass defenceAggression defenceAggressionClass defenceTeamWidth defenceTeamWidthClass defenceDefenderLineClass
0 1 434 9930 2010-02-22 60 Balanced NaN Little 50 Mixed Organised 60 Normal 65 Normal 55 Normal Organised 50 Medium 55 Press 45 Normal Cover
1 2 434 9930 2014-09-19 52 Balanced 48.00 Normal 56 Mixed Organised 54 Normal 63 Normal 64 Normal Organised 47 Medium 44 Press 54 Normal Cover


EDA: Overview of team_attributes table
Code
skim(team_attributes)
╭──────────────────────────────────────────────── skimpy summary ─────────────────────────────────────────────────╮
│          Data Summary                Data Types                                                                 │
│ ┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┓ ┏━━━━━━━━━━━━━┳━━━━━━━┓                                                          │
│ ┃ dataframe          Values ┃ ┃ Column Type  Count ┃                                                          │
│ ┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━┩ ┡━━━━━━━━━━━━━╇━━━━━━━┩                                                          │
│ │ Number of rows    │ 1458   │ │ string      │ 12    │                                                          │
│ │ Number of columns │ 25     │ │ int32       │ 11    │                                                          │
│ └───────────────────┴────────┘ │ datetime64  │ 1     │                                                          │
│                                │ float64     │ 1     │                                                          │
│                                └─────────────┴───────┘                                                          │
│                                                     number                                                      │
│ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┓  │
│ ┃ column_name                NA     NA %    mean     sd       p0     p25    p75    p100     hist    ┃  │
│ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━━━╇━━━━━━━━━┩  │
│ │ id                           0     0    730    420    1  370 1100   1500██████  │  │
│ │ team_fifa_api_id             0     0  18000  39000    1  110 1900 110000█    ▁  │  │
│ │ team_api_id                  0     0  10000  13000 1600 8500 9900 270000 │  │
│ │ buildUpPlaySpeed             0     0     52     12   20   45   62     80 ▄▆█▆▂  │  │
│ │ buildUpPlayDribbling       970    66     49    9.7   24   42   55     77▁▄██▂▁  │  │
│ │ buildUpPlayPassing           0     0     48     11   20   40   55     80 ▅▆█▃▁  │  │
│ │ chanceCreationPassin         0     0     52     10   21   46   59     80 ▁▃▇█▅  │  │
│ │ chanceCreationCrossi         0     0     54     11   20   47   62     80 ▂▄█▅▂  │  │
│ │ chanceCreationShooti         0     0     54     10   22   48   61     80 ▂▅█▅▁  │  │
│ │ defencePressure              0     0     46     10   23   39   51     72▂▅█▅▃▂  │  │
│ │ defenceAggression            0     0     49    9.7   24   44   55     72▁▂█▇▃▂  │  │
│ │ defenceTeamWidth             0     0     52    9.6   29   47   58     73▂▂▇█▄▂  │  │
│ └───────────────────────────┴───────┴────────┴─────────┴─────────┴───────┴───────┴───────┴─────────┴─────────┘  │
│                                                    datetime                                                     │
│ ┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓  │
│ ┃ column_name             NA      NA %       first                last                 frequency        ┃  │
│ ┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩  │
│ │ date                       0        0    2010-02-22         2015-09-10     None             │  │
│ └────────────────────────┴────────┴───────────┴─────────────────────┴─────────────────────┴──────────────────┘  │
│                                                     string                                                      │
│ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓  │
│ ┃ column_name                            NA     NA %       words per row             total words         ┃  │
│ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩  │
│ │ buildUpPlaySpeedClas                     0        0                       1               1500 │  │
│ │ buildUpPlayDribbling                     0        0                       1               1500 │  │
│ │ buildUpPlayPassingCl                     0        0                       1               1500 │  │
│ │ buildUpPlayPositioni                     0        0                       1               1500 │  │
│ │ chanceCreationPassin                     0        0                       1               1500 │  │
│ │ chanceCreationCrossi                     0        0                       1               1500 │  │
│ │ chanceCreationShooti                     0        0                       1               1500 │  │
│ │ chanceCreationPositi                     0        0                       1               1500 │  │
│ │ defencePressureClass                     0        0                       1               1500 │  │
│ │ defenceAggressionCla                     0        0                       1               1500 │  │
│ │ defenceTeamWidthClas                     0        0                       1               1500 │  │
│ │ defenceDefenderLineC                     0        0                       1               1500 │  │
│ └───────────────────────────────────────┴───────┴───────────┴──────────────────────────┴─────────────────────┘  │
╰────────────────────────────────────────────────────── End ──────────────────────────────────────────────────────╯
EDA: Data Profiling Report of team_attributes
Code
if do_eda:
    eda.ProfileReport(
        team_attributes,
        title="Data Profiling Report: team_attributes",
        config_file="_config/ydata_profile_config--default.yaml",
    )
EDA: boxplots of variable pairs (numeric and categorical counterparts)

The boxplots show that the numeric values do not overlap between the corresponding classes.

Code
sns.boxplot(team_attributes, x="buildUpPlaySpeed", y="buildUpPlaySpeedClass");

Code
sns.boxplot(
    team_attributes, x="buildUpPlayDribbling", y="buildUpPlayDribblingClass"
);

Code
sns.boxplot(
    team_attributes, x="buildUpPlayPassing", y="buildUpPlayPassingClass"
);

Code
sns.boxplot(
    team_attributes, x="chanceCreationPassing", y="chanceCreationPassingClass"
);

Code
sns.boxplot(
    team_attributes, x="chanceCreationCrossing", y="chanceCreationCrossingClass"
);

Code
sns.boxplot(
    team_attributes, x="chanceCreationShooting", y="chanceCreationShootingClass"
);

Code
sns.boxplot(team_attributes, x="defencePressure", y="defencePressureClass");

Code
sns.boxplot(team_attributes, x="defenceAggression", y="defenceAggressionClass");

Code
sns.boxplot(team_attributes, x="defenceTeamWidth", y="defenceTeamWidthClass");

3.8 Delete Tables

The tables in this section were imported for exploratory purposes only. In the next section, they will be re-imported in the form needed to answer the main questions of this analysis.

Code
# Remove the exploratory copies of the tables to free memory
del match, player, player_attributes, team, team_attributes

4 Data Import & Pre-Processing

In this section, data will be imported and pre-processed to create the following tables required for the main analyses:

  1. To present the analyzed countries and leagues:
    • leagues
  2. To compare resultativeness by leagues and seasons:
    • goals_summary
  3. To identify and analyze top teams:
    • teams_top_bottom_goals
    • teams_wins_per_season
  4. To identify the top players in 2015/2016 and the factors that make them the best:
    • players
  5. To investigate whether home advantage exists:
    • matches
  6. To investigate the relationship between betting odds from different companies/websites:
    • matches_betting_odds
  7. For team score prediction in a match:
    • team_train
    • team_test
  8. For match outcome (home wins, draw, away wins) prediction:
    • match_train
    • match_test

Some additional tables will be created ad-hoc in the analysis section.

4.1 Import

This section contains code that imports data to Python. Some pre-processing in SQL is also performed.

Before importing into Python:

  • the country and league tables were merged, and the result was named leagues;
  • a new column country was created, in which England and Scotland were treated as a single country, the United Kingdom (UK);
  • a column region was created to indicate the regions of the UK.
Code: Import leagues (country + league)
query = """--sql
SELECT 
    l.id league_id,
    CASE 
        WHEN c.name IN ('England', 'Scotland') THEN 'United Kingdom'
        ELSE c.name
    END country,
    CASE  
        WHEN c.name IN ('England', 'Scotland') THEN c.name
        ELSE ''
    END region,
    l.name league
FROM Country c FULL JOIN League l ON ( l.country_id = c.id );
"""
leagues = pd.read_sql_query(query, db)

# Print
leagues.head(2)
Table 4.1. Inspection: a few rows of table leagues.
   league_id  country         region   league
0          1  Belgium                  Belgium Jupiler League
1       1729  United Kingdom  England  England Premier League
Code
leagues.shape
(11, 4)

Before importing into Python, the team and team_attributes tables were merged.

Code: Import teams (team + team_attributes)
# EXCLUDE [t.id, t.team_fifa_api_id, ta.id, ta.team_fifa_api_id, ta.team_api_id]
query = """--sql
SELECT
    t.team_api_id team_id,
    t.team_long_name team_name,
    t.team_short_name,
    
    ta.date team_info_date, 
    ta.buildUpPlayPositioningClass,
    ta.chanceCreationPositioningClass,
    ta.defenceDefenderLineClass,
    
    ta.buildUpPlaySpeed,
    ta.buildUpPlayDribbling,
    ta.buildUpPlayPassing,
    ta.chanceCreationPassing,
    ta.chanceCreationCrossing,
    ta.chanceCreationShooting,
    ta.defencePressure,
    ta.defenceAggression,
    ta.defenceTeamWidth,
    
    ta.buildUpPlaySpeedClass,
    ta.buildUpPlayDribblingClass,
    ta.buildUpPlayPassingClass,
    ta.chanceCreationPassingClass,
    ta.chanceCreationCrossingClass,
    ta.chanceCreationShootingClass,
    ta.defencePressureClass,
    ta.defenceAggressionClass,
    ta.defenceTeamWidthClass

FROM Team t FULL JOIN Team_Attributes ta 
ON ( ta.team_api_id = t.team_api_id );
"""
teams = pd.read_sql_query(query, db)

# Print
teams.head(2).style.hide(axis="index").format(precision=1)
Table 4.2. Inspection: a few rows of table teams.
team_id team_name team_short_name team_info_date buildUpPlayPositioningClass chanceCreationPositioningClass defenceDefenderLineClass buildUpPlaySpeed buildUpPlayDribbling buildUpPlayPassing chanceCreationPassing chanceCreationCrossing chanceCreationShooting defencePressure defenceAggression defenceTeamWidth buildUpPlaySpeedClass buildUpPlayDribblingClass buildUpPlayPassingClass chanceCreationPassingClass chanceCreationCrossingClass chanceCreationShootingClass defencePressureClass defenceAggressionClass defenceTeamWidthClass
9987 KRC Genk GEN 2010-02-22 00:00:00 Organised Organised Cover 45.0 nan 45.0 50.0 35.0 60.0 70.0 65.0 70.0 Balanced Little Mixed Normal Normal Normal High Press Wide
9987 KRC Genk GEN 2011-02-22 00:00:00 Organised Organised Offside Trap 66.0 nan 52.0 65.0 66.0 51.0 48.0 47.0 54.0 Balanced Little Mixed Normal Normal Normal Medium Press Normal
Code
teams.shape
(1469, 25)

Before importing into Python, data about players were pre-processed:

  • Weight was converted to kilograms.
  • Birth year was extracted as a separate column.
  • Body mass index (BMI) was calculated (a quick numeric check of these conversions is sketched after this list).
  • player and player_attributes tables were merged.
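
The weight conversion and the BMI formula can be sanity-checked with a few lines of Python; the input values below are hypothetical and were only chosen so that the result matches the first row of Table 4.3 further down.

Code: Quick check of the weight and BMI conversions (illustrative sketch)
weight_lb, height_cm = 187, 182.9             # hypothetical example values
weight_kg = weight_lb / 2.205                 # pounds -> kilograms
bmi = weight_kg / (height_cm / 100) ** 2      # BMI = kg / m**2
print(round(weight_kg, 1), round(bmi, 1))     # approx. 84.8 and 25.4
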
Code: Import players (player + player_attributes)
# EXCLUDE [p.id, p.player_fifa_api_id, pa.id, pa.player_fifa_api_id]
query = """--sql
SELECT 
    -- id info
    p.player_api_id player_id, 
    pa.date player_info_date,
    -- player
    p.player_name, 
    p.birthday,
    STRFTIME('%Y', p.birthday) birth_year,
    p.height,
    p.weight/2.205 weight_kg, 
    (p.weight/2.205) / ((p.height/100)*(p.height/100)) bmi,
    -- player attributes
    pa.overall_rating, pa.potential, 
    pa.preferred_foot,pa.attacking_work_rate, pa.defensive_work_rate, 
    pa.crossing, pa.finishing, pa.heading_accuracy, pa.short_passing, 
    pa.volleys, pa.dribbling, pa.curve, pa.free_kick_accuracy, 
    pa.long_passing, pa.ball_control, pa.acceleration, pa.sprint_speed, 
    pa.agility, pa.reactions, pa.balance, pa.shot_power, pa.jumping, 
    pa.stamina, pa.strength, pa.long_shots, pa.aggression, pa.interceptions, 
    pa.positioning, pa.vision, pa.penalties, pa.marking, pa.standing_tackle,
    pa.sliding_tackle, pa.gk_diving, pa.gk_handling, pa.gk_kicking, 
    pa.gk_positioning, pa.gk_reflexes
FROM Player p JOIN Player_Attributes pa 
ON ( pa.player_api_id = p.player_api_id );
"""
players = pd.read_sql_query(query, db)

# Print
players.head(2).style.hide(axis="index").format(precision=1)
Table 4.3. Inspection: a few rows of table players.
player_id player_info_date player_name birthday birth_year height weight_kg bmi overall_rating potential preferred_foot attacking_work_rate defensive_work_rate crossing finishing heading_accuracy short_passing volleys dribbling curve free_kick_accuracy long_passing ball_control acceleration sprint_speed agility reactions balance shot_power jumping stamina strength long_shots aggression interceptions positioning vision penalties marking standing_tackle sliding_tackle gk_diving gk_handling gk_kicking gk_positioning gk_reflexes
505942 2016-02-18 00:00:00 Aaron Appindangoye 1992-02-29 00:00:00 1992 182.9 84.8 25.4 67.0 71.0 right medium medium 49.0 44.0 71.0 61.0 44.0 51.0 45.0 39.0 64.0 49.0 60.0 64.0 59.0 47.0 65.0 55.0 58.0 54.0 76.0 35.0 71.0 70.0 45.0 54.0 48.0 65.0 69.0 69.0 6.0 11.0 10.0 8.0 8.0
505942 2015-11-19 00:00:00 Aaron Appindangoye 1992-02-29 00:00:00 1992 182.9 84.8 25.4 67.0 71.0 right medium medium 49.0 44.0 71.0 61.0 44.0 51.0 45.0 39.0 64.0 49.0 60.0 64.0 59.0 47.0 65.0 55.0 58.0 54.0 76.0 35.0 71.0 70.0 45.0 54.0 48.0 65.0 69.0 69.0 6.0 11.0 10.0 8.0 8.0
Code
players.shape
(183978, 46)

From the match table, only the columns of interest were imported. The resulting table was named matches.

Code: Import matches
query = """--sql
SELECT 
    -- match info
    m.id match_id, m.league_id, m.season, m.stage, m.date match_date,
    -- team info
    m.home_team_api_id home_team_id, m.away_team_api_id away_team_id,
    m.home_team_goal, m.away_team_goal,
    -- players
    m.home_player_1, m.home_player_2, m.home_player_3, m.home_player_4,
    m.home_player_5, m.home_player_6, m.home_player_7, m.home_player_8, 
    m.home_player_9, m.home_player_10, m.home_player_11, 
    m.away_player_1, m.away_player_2, m.away_player_3, m.away_player_4,
    m.away_player_5, m.away_player_6, m.away_player_7, m.away_player_8,
    m.away_player_9, m.away_player_10, m.away_player_11,
    -- betting odds
    m.B365H, m.B365D, m.B365A, m.BWH, m.BWD, m.BWA, m.IWH, m.IWD, m.IWA, 
    m.LBH, m.LBD, m.LBA, m.PSH, m.PSD, m.PSA, m.WHH, m.WHD, m.WHA, 
    m.SJH, m.SJD, m.SJA, m.VCH, m.VCD, m.VCA, m.GBH, m.GBD, m.GBA,
    m.BSH, m.BSD, m.BSA
FROM Match m;
"""
matches = pd.read_sql_query(query, db)

# Print
matches.head(2).style.hide(axis="index").format(precision=1)
Table 4.4. Inspection: a few rows of table matches (1).
match_id league_id season stage match_date home_team_id away_team_id home_team_goal away_team_goal home_player_1 home_player_2 home_player_3 home_player_4 home_player_5 home_player_6 home_player_7 home_player_8 home_player_9 home_player_10 home_player_11 away_player_1 away_player_2 away_player_3 away_player_4 away_player_5 away_player_6 away_player_7 away_player_8 away_player_9 away_player_10 away_player_11 B365H B365D B365A BWH BWD BWA IWH IWD IWA LBH LBD LBA PSH PSD PSA WHH WHD WHA SJH SJD SJA VCH VCD VCA GBH GBD GBA BSH BSD BSA
1 1 2008/2009 1 2008-08-17 00:00:00 9987 9993 1 1 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 1.7 3.4 5.0 1.8 3.4 4.2 1.9 3.2 3.5 1.8 3.3 3.8 nan nan nan 1.7 3.3 4.3 1.9 3.3 4.0 1.6 3.4 4.5 1.8 3.2 4.0 1.7 3.4 4.2
2 1 2008/2009 1 2008-08-16 00:00:00 10000 9994 0 0 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 1.9 3.2 3.6 1.8 3.3 4.0 1.9 3.2 3.5 1.9 3.2 3.5 nan nan nan 1.8 3.3 3.6 1.9 3.3 3.8 2.0 3.2 3.2 1.9 3.2 3.8 1.9 3.2 3.6
Code
matches.shape
(25979, 61)

4.2 Pre-Process in Python

This section contains code that pre-processes data in Python.

Pre-process teams table.

Code: Pre-process teams
teams = teams.to_datetime("team_info_date").sort_values(["team_info_date"])

Pre-process players table.

Code: Pre-process players
# For work rate (wr) variables' pre-processing
wr_categories = pd.CategoricalDtype(
    categories=["low", "medium", "high"], ordered=True
)

# Pre-process
players = (
    players.to_datetime(["birthday", "player_info_date"])
    .astype({"birth_year": int})
    .sort_values(["player_info_date"])
    .to_category("preferred_foot", ["left", "right"])
    .astype(
        {
            "defensive_work_rate": wr_categories,
            "attacking_work_rate": wr_categories,
        }
    )
)

players.head(2)
Table 4.5. Inspection: a few rows of table players (2).
player_id player_info_date player_name birthday birth_year height weight_kg bmi overall_rating potential preferred_foot attacking_work_rate defensive_work_rate crossing finishing heading_accuracy short_passing volleys dribbling curve free_kick_accuracy long_passing ball_control acceleration sprint_speed agility reactions balance shot_power jumping stamina strength long_shots aggression interceptions positioning vision penalties marking standing_tackle sliding_tackle gk_diving gk_handling gk_kicking gk_positioning gk_reflexes
183977 39902 2007-02-22 Zvjezdan Misimovic 1982-06-05 1982 180.34 79.82 24.54 80.00 81.00 right medium low 74.00 68.00 57.00 88.00 77.00 87.00 86.00 53.00 78.00 91.00 58.00 64.00 77.00 66.00 73.00 72.00 58.00 67.00 59.00 78.00 63.00 63.00 68.00 88.00 53.00 38.00 32.00 30.00 9.00 9.00 78.00 7.00 15.00
79627 38343 2007-02-22 Jef Delen 1976-06-29 1976 175.26 63.04 20.52 67.00 69.00 left medium medium 63.00 62.00 59.00 63.00 38.00 68.00 53.00 65.00 51.00 62.00 61.00 68.00 65.00 64.00 61.00 66.00 69.00 83.00 54.00 61.00 58.00 62.00 60.00 63.00 65.00 44.00 64.00 64.00 7.00 15.00 51.00 8.00 6.00


Merge the leagues dataset into matches and pre-process matches (this dataset contains one row per match):

Code: Pre-process matches
# Prepare for pre-processing ----------------------------------------------
# Recode match goal difference (goal_diff > 0, if home wins) into words.
def who_wins(goal_diff):
    """Recode outcome of match to text values"""
    if goal_diff < 0:
        return "Away Wins"
    elif goal_diff == 0:
        return "Draw"
    else:
        return "Home Wins"


# Objects to create categorical variables
# fmt: off
season_categories=[
    "2008/2009", "2009/2010", "2010/2011", "2011/2012", 
    "2012/2013", "2013/2014", "2014/2015", "2015/2016",
]

# Objects to rename betting odds
## Old names 
betting_odds_names_old = [
    "B365H", "B365D", "B365A", 
    "BWH", "BWD", "BWA", "IWH", "IWD", "IWA", "LBH", "LBD", "LBA", 
    "PSH", "PSD", "PSA", "WHH", "WHD", "WHA", "SJH", "SJD", "SJA",
    "VCH", "VCD", "VCA", "GBH", "GBD", "GBA", "BSH", "BSD", "BSA",
]

## New names 
betting_odds_names_new = [
    f"{i[:-1]}_home_wins"      if (i.endswith("H"))
    else f"{i[:-1]}_draw"      if (i.endswith("D"))
    else f"{i[:-1]}_away_wins" if (i.endswith("A"))
    else "error"
    for i in betting_odds_names_old
]
# fmt: on

## Names map
odds_names_map = dict(zip(betting_odds_names_old, betting_odds_names_new))

# Pre-process `matches` dataset -------------------------------------------
matches = (
    # Merge matches and leagues
    pd.merge(matches, leagues, on="league_id")
    # Drop columns
    .drop(columns="league_id")
    # Rename columns
    .rename(columns=odds_names_map)
    # Fix data types
    .to_datetime("match_date")
    .to_category("season", season_categories, ordered=True)
    .to_category("league")
    # Create new variables
    .assign(
        goal_sum=lambda x: x.home_team_goal + x.away_team_goal,
        # goal_diff > 0, if home wins:
        goal_diff=lambda x: x.home_team_goal - x.away_team_goal,
        goal_diff_sign=lambda x: np.sign(x.goal_diff),
        match_winner=lambda x: x.goal_diff.apply(who_wins).to_category(),
        # Ratio ha: "home wins / away wins"
        B365_ratio_ha=lambda x: x.B365_home_wins / x.B365_away_wins,
        BW_ratio_ha=lambda x: x.BW_home_wins / x.BW_away_wins,
        PS_ratio_ha=lambda x: x.PS_home_wins / x.PS_away_wins,
        VC_ratio_ha=lambda x: x.VC_home_wins / x.VC_away_wins,
        IW_ratio_ha=lambda x: x.IW_home_wins / x.IW_away_wins,
        WH_ratio_ha=lambda x: x.WH_home_wins / x.WH_away_wins,
        GB_ratio_ha=lambda x: x.GB_home_wins / x.GB_away_wins,
        LB_ratio_ha=lambda x: x.LB_home_wins / x.LB_away_wins,
        SJ_ratio_ha=lambda x: x.SJ_home_wins / x.SJ_away_wins,
        BS_ratio_ha=lambda x: x.BS_home_wins / x.BS_away_wins,
        # Log-ratios of ha
        B365_log_ratio_ha=lambda x: np.log(x.B365_home_wins / x.B365_away_wins),
        BW_log_ratio_ha=lambda x: np.log(x.BW_home_wins / x.BW_away_wins),
        PS_log_ratio_ha=lambda x: np.log(x.PS_home_wins / x.PS_away_wins),
        VC_log_ratio_ha=lambda x: np.log(x.VC_home_wins / x.VC_away_wins),
        IW_log_ratio_ha=lambda x: np.log(x.IW_home_wins / x.IW_away_wins),
        WH_log_ratio_ha=lambda x: np.log(x.WH_home_wins / x.WH_away_wins),
        GB_log_ratio_ha=lambda x: np.log(x.GB_home_wins / x.GB_away_wins),
        LB_log_ratio_ha=lambda x: np.log(x.LB_home_wins / x.LB_away_wins),
        SJ_log_ratio_ha=lambda x: np.log(x.SJ_home_wins / x.SJ_away_wins),
        BS_log_ratio_ha=lambda x: np.log(x.BS_home_wins / x.BS_away_wins),
    )
    # Change position of columns
    .relocate("league", before="season")
    .relocate("region", before="league")
    .relocate("country", before="region")
    .relocate("goal_sum", before="home_player_1")
    .relocate("goal_diff", before="home_player_1")
    .relocate("goal_diff_sign", before="home_player_1")
    .relocate("match_winner", before="home_player_1")
    # Sort rows by date
    .sort_values("match_date")
)

matches.head(2)
Table 4.6. Inspection: a few rows of table matches (2).
match_id country region league season stage match_date home_team_id away_team_id home_team_goal away_team_goal goal_sum goal_diff goal_diff_sign match_winner home_player_1 home_player_2 home_player_3 home_player_4 home_player_5 home_player_6 home_player_7 home_player_8 home_player_9 home_player_10 home_player_11 away_player_1 away_player_2 away_player_3 away_player_4 away_player_5 away_player_6 away_player_7 away_player_8 away_player_9 away_player_10 away_player_11 B365_home_wins B365_draw B365_away_wins BW_home_wins BW_draw BW_away_wins IW_home_wins IW_draw IW_away_wins LB_home_wins LB_draw LB_away_wins PS_home_wins PS_draw PS_away_wins WH_home_wins WH_draw WH_away_wins SJ_home_wins SJ_draw SJ_away_wins VC_home_wins VC_draw VC_away_wins GB_home_wins GB_draw GB_away_wins BS_home_wins BS_draw BS_away_wins B365_ratio_ha BW_ratio_ha PS_ratio_ha VC_ratio_ha IW_ratio_ha WH_ratio_ha GB_ratio_ha LB_ratio_ha SJ_ratio_ha BS_ratio_ha B365_log_ratio_ha BW_log_ratio_ha PS_log_ratio_ha VC_log_ratio_ha IW_log_ratio_ha WH_log_ratio_ha GB_log_ratio_ha LB_log_ratio_ha SJ_log_ratio_ha BS_log_ratio_ha
24558 24559 Switzerland Switzerland Super League 2008/2009 1 2008-07-18 10192 9931 1 2 3 -1 -1 Away Wins NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
24559 24560 Switzerland Switzerland Super League 2008/2009 1 2008-07-19 9930 10179 3 1 4 2 1 Home Wins NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
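
For interpretation of the derived _ratio_ha and _log_ratio_ha columns: decimal betting odds are lower for outcomes that the bookmaker considers more likely, so a home/away ratio below 1 (a negative log-ratio) means the home team is favoured. A small worked example using the B365 odds of the first match in Table 4.4 (home 1.7, away 5.0):

Code: Worked example of the home/away odds ratio (illustrative)
import numpy as np

b365_home_wins, b365_away_wins = 1.7, 5.0   # odds from the first row of Table 4.4
ratio_ha = b365_home_wins / b365_away_wins  # 0.34
log_ratio_ha = np.log(ratio_ha)             # about -1.08
# ratio < 1 and log-ratio < 0: the bookmaker favours the home team
print(round(ratio_ha, 2), round(log_ratio_ha, 2))
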
Code: Create goals_summary
# Goals summary dataset for each league and season
goals_summary = (
    matches.groupby(["league", "season"])
    .goal_sum.agg(["count", "sum"])
    .rename(columns={"count": "n_matches_total", "sum": "n_goals_total"})
    .eval("n_goals_per_match = n_goals_total/n_matches_total")
)

# Rank leagues by average number of goals per match per season
leagues_by_goals = [
    *goals_summary.groupby("league")
    .n_goals_per_match.mean()
    .sort_values(ascending=False)
    .index
]

# Sort leagues by "resultativeness": update goals summary table
goals_summary = (
    goals_summary.reset_index()
    .assign(league=lambda x: x.league.to_category(leagues_by_goals))
    .set_index(["league", "season"])
    .sort_index()
)

# Reorder categories (leagues) in `matches`
matches = matches.to_category("league", leagues_by_goals)
Code: Create matches_betting_odds table
# Dataset for betting odds analysis

# Variable names
cols_to_include_for_odds = [
    "date",
    "stage",
    # Goal statistics/match outcomes
    "home_team_goal",
    "away_team_goal",
    "goal_sum",
    "goal_diff",
    "goal_diff_sign",
    "match_winner",
    # Betting odds
    "B365_home_wins",
    "BW_home_wins",
    "IW_home_wins",
    "LB_home_wins",
    "PS_home_wins",
    "WH_home_wins",
    "SJ_home_wins",
    "VC_home_wins",
    "GB_home_wins",
    "BS_home_wins",
    "B365_draw",
    "BW_draw",
    "IW_draw",
    "LB_draw",
    "PS_draw",
    "WH_draw",
    "SJ_draw",
    "VC_draw",
    "GB_draw",
    "BS_draw",
    "B365_away_wins",
    "BW_away_wins",
    "IW_away_wins",
    "LB_away_wins",
    "PS_away_wins",
    "WH_away_wins",
    "SJ_away_wins",
    "VC_away_wins",
    "GB_away_wins",
    "BS_away_wins",
    # Derivative/Calculated variables;
    # "ha" means Home/Away betting odds ratio
    "B365_ratio_ha",
    "BW_ratio_ha",
    "PS_ratio_ha",
    "VC_ratio_ha",
    "IW_ratio_ha",
    "WH_ratio_ha",
    "GB_ratio_ha",
    "LB_ratio_ha",
    "SJ_ratio_ha",
    "BS_ratio_ha",
    "B365_log_ratio_ha",
    "BW_log_ratio_ha",
    "PS_log_ratio_ha",
    "VC_log_ratio_ha",
    "IW_log_ratio_ha",
    "WH_log_ratio_ha",
    "GB_log_ratio_ha",
    "LB_log_ratio_ha",
    "SJ_log_ratio_ha",
    "BS_log_ratio_ha",
]

matches_betting_odds = matches.filter(cols_to_include_for_odds)

From matches, let’s create a dataset matches_long_team with one row per team per match:
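
The reshaping below relies on pyjanitor’s pivot_longer with a ".value" placeholder: the home_/away_ prefix becomes a team_type column, while the rest of each column name is kept as the value column. A toy sketch of this mechanism with made-up data (illustrative only, not part of the pipeline):

Code: Toy illustration of the home_/away_ reshape (illustrative sketch)
import pandas as pd
import janitor  # noqa: F401  (registers .pivot_longer on DataFrames)

toy = pd.DataFrame(
    {"match_id": [1], "home_team_goal": [2], "away_team_goal": [0]}
)
toy_long = toy.pivot_longer(
    column_names=["home_team_goal", "away_team_goal"],
    names_pattern="^(home|away)_(.+)",
    names_to=("team_type", ".value"),
)
print(toy_long)  # two rows: team_type "home"/"away" with team_goal 2 and 0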

Code: Create matches_long_team
# Add column `won_or_lost`, which indicates match status for the team:
def team_won_or_lost(df):
    """Return outcome if a team won or lost a match or there was draw."""
    if df.match_winner == "Draw":
        return "draw"
    elif (df.team_type == "home") and (df.match_winner == "Home Wins"):
        return "won"
    elif (df.team_type == "away") and (df.match_winner == "Away Wins"):
        return "won"
    else:
        return "lost"


def negate_for_away_team(df):
    """Negate goal difference for away team.
    Negative goal difference here means that the team lost.
    """
    if df.team_type == "away":
        return -df.goal_diff
    else:
        return df.goal_diff


matches_long_team = (
    matches.pivot_longer(
        column_names=re.compile("^(home|away)_(.+)"),
        names_pattern="^(home|away)_(.+)",
        names_to=("team_type", ".value"),
        sort_by_appearance=True,
    )
    .to_category("team_type")
    .rename(columns={"team_goal": "team_goals"})
    .assign(
        team_outcome=lambda x: x.apply(team_won_or_lost, axis=1),
        team_goal_diff=lambda x: x.apply(negate_for_away_team, axis=1),
        team_goal_diff_sign=lambda x: np.sign(x.team_goal_diff),
    )
    .relocate("team_id", before="B365_home_wins")
    .relocate("team_type", before="B365_home_wins")
    .relocate("team_goals", before="B365_home_wins")
    .relocate("team_goal_diff", before="B365_home_wins")
    .relocate("team_goal_diff_sign", before="B365_home_wins")
    .relocate("team_outcome", before="B365_home_wins")
)

# Check output
print(
    "Expected ratio is 2, got: ", matches_long_team.shape[0] / matches.shape[0]
)
matches_long_team.head(2)
Expected ratio is 2, got:  2.0
Table 4.7. Inspection: a few rows of table matches_long_team (1).
match_id country region league season stage match_date goal_sum goal_diff goal_diff_sign match_winner team_id team_type team_goals team_goal_diff team_goal_diff_sign team_outcome B365_home_wins B365_draw B365_away_wins BW_home_wins BW_draw BW_away_wins IW_home_wins IW_draw IW_away_wins LB_home_wins LB_draw LB_away_wins PS_home_wins PS_draw PS_away_wins WH_home_wins WH_draw WH_away_wins SJ_home_wins SJ_draw SJ_away_wins VC_home_wins VC_draw VC_away_wins GB_home_wins GB_draw GB_away_wins BS_home_wins BS_draw BS_away_wins B365_ratio_ha BW_ratio_ha PS_ratio_ha VC_ratio_ha IW_ratio_ha WH_ratio_ha GB_ratio_ha LB_ratio_ha SJ_ratio_ha BS_ratio_ha B365_log_ratio_ha BW_log_ratio_ha PS_log_ratio_ha VC_log_ratio_ha IW_log_ratio_ha WH_log_ratio_ha GB_log_ratio_ha LB_log_ratio_ha SJ_log_ratio_ha BS_log_ratio_ha player_1 player_2 player_3 player_4 player_5 player_6 player_7 player_8 player_9 player_10 player_11
0 24559 Switzerland Switzerland Super League 2008/2009 1 2008-07-18 3 -1 -1 Away Wins 10192 home 1 -1 -1 lost NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 24559 Switzerland Switzerland Super League 2008/2009 1 2008-07-18 3 -1 -1 Away Wins 9931 away 2 1 1 won NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN


To each match, merge the last information about each team that was known before that particular match.

Note that only teams with records in team_attributes will be merged (the as-of merge requires non-null team_info_date values). Since some teams lack these records, their names, although present in the teams table, are not merged either. If this becomes an issue, team and team_attributes should be merged into matches_long_team separately.
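
The merge below uses pd.merge_asof which, for each match row, picks the most recent team record dated on or before the match date. A toy sketch of that behaviour with made-up data (illustrative only, not part of the pipeline):

Code: Toy illustration of the as-of merge (illustrative sketch)
import pandas as pd

matches_toy = pd.DataFrame(
    {"team_id": [1, 1], "match_date": pd.to_datetime(["2010-05-01", "2012-05-01"])}
)
teams_toy = pd.DataFrame(
    {
        "team_id": [1, 1],
        "team_info_date": pd.to_datetime(["2010-02-22", "2011-02-22"]),
        "buildUpPlaySpeed": [45, 66],
    }
)
# Both frames must be sorted by the "on" keys for merge_asof
merged_toy = pd.merge_asof(
    matches_toy.sort_values("match_date"),
    teams_toy.sort_values("team_info_date"),
    left_on="match_date",
    right_on="team_info_date",
    by="team_id",
)
print(merged_toy)  # the 2010 match gets the 2010-02-22 record; the 2012 match gets the 2011-02-22 one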

Code: Merge matches_long_team and teams
matches_long_team = pd.merge_asof(
    left=matches_long_team,
    right=teams.dropna(subset=["team_info_date"]),
    left_on="match_date",
    right_on="team_info_date",
    by="team_id",
).relocate("team_info_date", before="goal_sum")

matches_long_team.tail(2)
Table 4.8. Inspection: a few rows of table matches_long_team (2).
match_id country region league season stage match_date team_info_date goal_sum goal_diff goal_diff_sign match_winner team_id team_type team_goals team_goal_diff team_goal_diff_sign team_outcome B365_home_wins B365_draw B365_away_wins BW_home_wins BW_draw BW_away_wins IW_home_wins IW_draw IW_away_wins LB_home_wins LB_draw LB_away_wins PS_home_wins PS_draw PS_away_wins WH_home_wins WH_draw WH_away_wins SJ_home_wins SJ_draw SJ_away_wins VC_home_wins VC_draw VC_away_wins GB_home_wins GB_draw GB_away_wins BS_home_wins BS_draw BS_away_wins B365_ratio_ha BW_ratio_ha PS_ratio_ha VC_ratio_ha IW_ratio_ha WH_ratio_ha GB_ratio_ha LB_ratio_ha SJ_ratio_ha BS_ratio_ha B365_log_ratio_ha BW_log_ratio_ha PS_log_ratio_ha VC_log_ratio_ha IW_log_ratio_ha WH_log_ratio_ha GB_log_ratio_ha LB_log_ratio_ha SJ_log_ratio_ha BS_log_ratio_ha player_1 player_2 player_3 player_4 player_5 player_6 player_7 player_8 player_9 player_10 player_11 team_name team_short_name buildUpPlayPositioningClass chanceCreationPositioningClass defenceDefenderLineClass buildUpPlaySpeed buildUpPlayDribbling buildUpPlayPassing chanceCreationPassing chanceCreationCrossing chanceCreationShooting defencePressure defenceAggression defenceTeamWidth buildUpPlaySpeedClass buildUpPlayDribblingClass buildUpPlayPassingClass chanceCreationPassingClass chanceCreationCrossingClass chanceCreationShootingClass defencePressureClass defenceAggressionClass defenceTeamWidthClass
51956 25949 Switzerland Switzerland Super League 2015/2016 36 2016-05-25 2015-09-10 4 2 1 Home Wins 10243 home 3 2 1 won NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 7621.00 197757.00 115700.00 113235.00 121080.00 41116.00 632356.00 465399.00 462608.00 198082.00 3517.00 FC Zürich ZUR Organised Organised Cover 62.00 49.00 46.00 47.00 50.00 54.00 47.00 43.00 56.00 Balanced Normal Mixed Normal Normal Normal Medium Press Normal
51957 25949 Switzerland Switzerland Super League 2015/2016 36 2016-05-25 2015-09-10 4 2 1 Home Wins 9824 away 1 -2 -1 lost NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 154261.00 294256.00 257845.00 41621.00 214344.00 114011.00 56868.00 488297.00 113227.00 531309.00 493418.00 FC Vaduz VAD Organised Organised Cover 53.00 32.00 56.00 38.00 53.00 46.00 42.00 33.00 58.00 Balanced Little Mixed Normal Normal Normal Medium Contain Normal

Let’s calculate goal-scoring statistics for each team per season, for the team analysis.

Code: Create teams_goals_per_team
teams_goals_per_team = (
    matches_long_team.groupby(["team_name", "season", "league"])
    .team_goals.agg(["count", "sum"])
    .rename({"count": "n_matches", "sum": "n_goals"}, axis=1)
    .reset_index()
    # Exclude teams that did not play in that season
    .query("n_matches > 0")
    # Goals per match
    .eval("n_goals_per_match = n_goals/n_matches")
)

teams_goals_per_team.head(n=2)
Table 4.9. Inspection: a few rows of table teams_goals_per_team.
team_name season league n_matches n_goals n_goals_per_match
0 1. FC Kaiserslautern 2010/2011 Germany 1. Bundesliga 34 48 1.41
1 1. FC Kaiserslautern 2011/2012 Germany 1. Bundesliga 34 24 0.71

Let’s find the several best and worst performing teams across all leagues in each season.

Code: Create teams_top_bottom_goals
# Select Top 5 and Bottom 5 teams (by **goals per match**)
# in each season (all leagues)
def select_5(data, column: str, best: bool = True):
    """Select best/worst teams

    Args:
        data (pandas.DataFrame): input data.
        column (str): column name to perform computations on.
        best (bool): Should the best (if True) or worst (if False) be found?

    If several teams share the same result as the 5-th, then more than 5
    teams are returned.
    """
    if best:
        return data.nlargest(5, column, keep="all")
    else:
        return data.nsmallest(5, column, keep="all")


def select_5_per_season_by_goals(best: bool):
    """Select best/worst teams in each season"""
    return (
        teams_goals_per_team.groupby("season", as_index=False)
        .apply(select_5, "n_goals_per_match", best=best)
        .sort_values(["season", "n_goals_per_match"], ascending=[True, False])
        .reset_index(drop=True)
    )


teams_top_bottom_goals = pd.concat(
    [
        select_5_per_season_by_goals(best=True).assign(which="Top 5"),
        select_5_per_season_by_goals(best=False).assign(which="Bottom 5"),
    ]
).index_start_at(1)

# Preview
pd.concat([teams_top_bottom_goals.head(n=2), teams_top_bottom_goals.tail(n=2)])
Table 4.10. Inspection: a few rows of table teams_top_bottom_goals.
team_name season league n_matches n_goals n_goals_per_match which
1 Ajax 2009/2010 Netherlands Eredivisie 10 37 3.70 Top 5
2 Chelsea 2009/2010 England Premier League 11 40 3.64 Top 5
72 Aston Villa 2015/2016 England Premier League 38 27 0.71 Bottom 5
73 Boavista FC 2015/2016 Portugal Liga ZON Sagres 31 21 0.68 Bottom 5

Let’s find the top teams by the percentage of matches that they won.

Code: Create teams_wins_per_season
teams_wins_per_season = (
    matches_long_team.groupby(["season", "league", "team_name"])
    .team_outcome.value_counts(normalize=True)
    .apply(lambda x: x * 100)
    .unstack("team_outcome")
    .relocate("lost", before="draw")
    .sort_values("won", ascending=False)
    .rename_axis(columns=None)
    .reset_index()
    .fillna(0)  # NaN = 0%
    .groupby(["season"], as_index=False)
    .apply(select_5, "won", best=True)
    .rename(columns=str.capitalize)
    .rename(columns={"Team_name": "Team"})
    .set_index(["Season", "League", "Team"])
    .sort_values(["Season", "Won"], ascending=[True, False])
)

teams_wins_per_season.head(2)
Table 4.11. Inspection: a few rows of table teams_wins_per_season. Columns Lost, Draw, Won indicate percentage of games per season with the indicated outcome.
Lost Draw Won
Season League Team
2009/2010 Belgium Jupiler League RSC Anderlecht 0.00 0.00 100.00
Netherlands Eredivisie Ajax 0.00 0.00 100.00

From matches_long_team, let’s create the dataset matches_long_player with one row per player per match:

Code: Create matches_long_player
matches_long_player = matches_long_team.pivot_longer(
    column_names=re.compile("player_.+"),
    names_pattern="player_(.+)",
    names_to="player_no",
    values_to="player_id",
    sort_by_appearance=True,
)

print(
    "Expected ratio is 11, got: ",
    matches_long_player.shape[0] / matches_long_team.shape[0],
)
matches_long_player.head(2)
Expected ratio is 11, got:  11.0
Table 4.12. Inspection: a few rows of table matches_long_player (1).
match_id country region league season stage match_date team_info_date goal_sum goal_diff goal_diff_sign match_winner team_id team_type team_goals team_goal_diff team_goal_diff_sign team_outcome B365_home_wins B365_draw B365_away_wins BW_home_wins BW_draw BW_away_wins IW_home_wins IW_draw IW_away_wins LB_home_wins LB_draw LB_away_wins PS_home_wins PS_draw PS_away_wins WH_home_wins WH_draw WH_away_wins SJ_home_wins SJ_draw SJ_away_wins VC_home_wins VC_draw VC_away_wins GB_home_wins GB_draw GB_away_wins BS_home_wins BS_draw BS_away_wins B365_ratio_ha BW_ratio_ha PS_ratio_ha VC_ratio_ha IW_ratio_ha WH_ratio_ha GB_ratio_ha LB_ratio_ha SJ_ratio_ha BS_ratio_ha B365_log_ratio_ha BW_log_ratio_ha PS_log_ratio_ha VC_log_ratio_ha IW_log_ratio_ha WH_log_ratio_ha GB_log_ratio_ha LB_log_ratio_ha SJ_log_ratio_ha BS_log_ratio_ha team_name team_short_name buildUpPlayPositioningClass chanceCreationPositioningClass defenceDefenderLineClass buildUpPlaySpeed buildUpPlayDribbling buildUpPlayPassing chanceCreationPassing chanceCreationCrossing chanceCreationShooting defencePressure defenceAggression defenceTeamWidth buildUpPlaySpeedClass buildUpPlayDribblingClass buildUpPlayPassingClass chanceCreationPassingClass chanceCreationCrossingClass chanceCreationShootingClass defencePressureClass defenceAggressionClass defenceTeamWidthClass player_no player_id
0 24559 Switzerland Switzerland Super League 2008/2009 1 2008-07-18 NaT 3 -1 -1 Away Wins 10192 home 1 -1 -1 lost NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1 NaN
1 24559 Switzerland Switzerland Super League 2008/2009 1 2008-07-18 NaT 3 -1 -1 Away Wins 10192 home 1 -1 -1 lost NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2 NaN

To each match, merge the last information about each player that was known at the time of the match.

Code: Merge matches_long_player and players
matches_long_player = (
    pd.merge_asof(
        left=(
            matches_long_player.dropna(subset=["player_id"]).astype(
                {"player_id": np.int64}
            )
        ),
        right=players,
        left_on="match_date",
        right_on="player_info_date",
        by="player_id",
    )
    .relocate("player_info_date", before="goal_sum")
    .assign(
        player_age=lambda x: (
            (x.match_date - x.birthday) / np.timedelta64(1, "Y")
        )
    )
)

matches_long_player.tail(2)
Table 4.13. Inspection: a few rows of table matches_long_player (2).
match_id country region league season stage match_date team_info_date player_info_date goal_sum goal_diff goal_diff_sign match_winner team_id team_type team_goals team_goal_diff team_goal_diff_sign team_outcome B365_home_wins B365_draw B365_away_wins BW_home_wins BW_draw BW_away_wins IW_home_wins IW_draw IW_away_wins LB_home_wins LB_draw LB_away_wins PS_home_wins PS_draw PS_away_wins WH_home_wins WH_draw WH_away_wins SJ_home_wins SJ_draw SJ_away_wins VC_home_wins VC_draw VC_away_wins GB_home_wins GB_draw GB_away_wins BS_home_wins BS_draw BS_away_wins B365_ratio_ha BW_ratio_ha PS_ratio_ha VC_ratio_ha IW_ratio_ha WH_ratio_ha GB_ratio_ha LB_ratio_ha SJ_ratio_ha BS_ratio_ha B365_log_ratio_ha BW_log_ratio_ha PS_log_ratio_ha VC_log_ratio_ha IW_log_ratio_ha WH_log_ratio_ha GB_log_ratio_ha LB_log_ratio_ha SJ_log_ratio_ha BS_log_ratio_ha team_name team_short_name buildUpPlayPositioningClass chanceCreationPositioningClass defenceDefenderLineClass buildUpPlaySpeed buildUpPlayDribbling buildUpPlayPassing chanceCreationPassing chanceCreationCrossing chanceCreationShooting defencePressure defenceAggression defenceTeamWidth buildUpPlaySpeedClass buildUpPlayDribblingClass buildUpPlayPassingClass chanceCreationPassingClass chanceCreationCrossingClass chanceCreationShootingClass defencePressureClass defenceAggressionClass defenceTeamWidthClass player_no player_id player_name birthday birth_year height weight_kg bmi overall_rating potential preferred_foot attacking_work_rate defensive_work_rate crossing finishing heading_accuracy short_passing volleys dribbling curve free_kick_accuracy long_passing ball_control acceleration sprint_speed agility reactions balance shot_power jumping stamina strength long_shots aggression interceptions positioning vision penalties marking standing_tackle sliding_tackle gk_diving gk_handling gk_kicking gk_positioning gk_reflexes player_age
542279 25949 Switzerland Switzerland Super League 2015/2016 36 2016-05-25 2015-09-10 2016-04-21 4 2 1 Home Wins 9824 away 1 -2 -1 lost NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN FC Vaduz VAD Organised Organised Cover 53.00 32.00 56.00 38.00 53.00 46.00 42.00 33.00 58.00 Balanced Little Mixed Normal Normal Normal Medium Contain Normal 10 531309 Robin Kamber 1996-02-15 1996 187.96 82.99 23.49 52.00 65.00 right high low 49.00 47.00 49.00 63.00 47.00 50.00 53.00 48.00 66.00 56.00 65.00 60.00 56.00 48.00 55.00 48.00 31.00 45.00 71.00 40.00 43.00 31.00 46.00 60.00 50.00 43.00 45.00 42.00 12.00 11.00 8.00 9.00 8.00 20.27
542280 25949 Switzerland Switzerland Super League 2015/2016 36 2016-05-25 2015-09-10 2016-03-03 4 2 1 Home Wins 9824 away 1 -2 -1 lost NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN FC Vaduz VAD Organised Organised Cover 53.00 32.00 56.00 38.00 53.00 46.00 42.00 33.00 58.00 Balanced Little Mixed Normal Normal Normal Medium Contain Normal 11 493418 Albion Avdijaj 1994-01-12 1994 190.50 79.82 21.99 54.00 61.00 right medium medium 29.00 60.00 72.00 40.00 47.00 47.00 30.00 23.00 23.00 49.00 48.00 53.00 62.00 47.00 49.00 49.00 67.00 60.00 66.00 42.00 50.00 24.00 56.00 48.00 56.00 27.00 20.00 21.00 7.00 6.00 15.00 10.00 8.00 22.37
Code
matches_long_player.shape
(542281, 139)

Let’s aggregate player information (numeric variables) to get one row per team-match combination.

Code: Create table team_player_summary
# Prepare dataset for aggregation
include = [
    "height",
    "weight_kg",
    "bmi",
    "overall_rating",
    "potential",
    "crossing",
    "finishing",
    "heading_accuracy",
    "short_passing",
    "volleys",
    "dribbling",
    "curve",
    "free_kick_accuracy",
    "long_passing",
    "ball_control",
    "acceleration",
    "sprint_speed",
    "agility",
    "reactions",
    "balance",
    "shot_power",
    "jumping",
    "stamina",
    "strength",
    "long_shots",
    "aggression",
    "interceptions",
    "positioning",
    "vision",
    "penalties",
    "marking",
    "standing_tackle",
    "sliding_tackle",
    "gk_diving",
    "gk_handling",
    "gk_kicking",
    "gk_positioning",
    "gk_reflexes",
    "player_age",
]

for_agg = matches_long_player.groupby(["match_id", "team_id"])[include]

# Evaluate which cases include all 11 players
n_player_ok = for_agg.count().min(axis=1).to_frame("players_summarized")
percent_ok = round(n_player_ok.eval("players_summarized==11").mean(), 3) * 100
print(
    f"In {percent_ok:.1f}% cases, summaries include all 11 players. \n"
    "Only these cases will be analyzed next."
)

# Calculate summary statistics for each selected attribute.
# Include only those cases where 11 players are aggregated.
team_player_summary = for_agg.agg(["min", "mean", "std", "max"])
team_player_summary.columns = (
    team_player_summary.columns.to_flat_index().str.join("__")
)
team_player_summary = (
    n_player_ok.join(team_player_summary)
    .query("players_summarized==11")
    .drop(columns="players_summarized")
)
team_player_summary.head(2)
In 82.9% cases, summaries include all 11 players. 
Only these cases will be analyzed next.
Table 4.14. Inspection: a few rows of table team_player_summary.
height__min height__mean height__std height__max weight_kg__min weight_kg__mean weight_kg__std weight_kg__max bmi__min bmi__mean bmi__std bmi__max overall_rating__min overall_rating__mean overall_rating__std overall_rating__max potential__min potential__mean potential__std potential__max crossing__min crossing__mean crossing__std crossing__max finishing__min finishing__mean finishing__std finishing__max heading_accuracy__min heading_accuracy__mean heading_accuracy__std heading_accuracy__max short_passing__min short_passing__mean short_passing__std short_passing__max volleys__min volleys__mean volleys__std volleys__max dribbling__min dribbling__mean dribbling__std dribbling__max curve__min curve__mean curve__std curve__max free_kick_accuracy__min free_kick_accuracy__mean free_kick_accuracy__std free_kick_accuracy__max long_passing__min long_passing__mean long_passing__std long_passing__max ball_control__min ball_control__mean ball_control__std ball_control__max acceleration__min acceleration__mean acceleration__std acceleration__max sprint_speed__min sprint_speed__mean sprint_speed__std sprint_speed__max agility__min agility__mean agility__std agility__max reactions__min reactions__mean reactions__std reactions__max balance__min balance__mean balance__std balance__max shot_power__min shot_power__mean shot_power__std shot_power__max jumping__min jumping__mean jumping__std jumping__max stamina__min stamina__mean stamina__std stamina__max strength__min strength__mean strength__std strength__max long_shots__min long_shots__mean long_shots__std long_shots__max aggression__min aggression__mean aggression__std aggression__max interceptions__min interceptions__mean interceptions__std interceptions__max positioning__min positioning__mean positioning__std positioning__max vision__min vision__mean vision__std vision__max penalties__min penalties__mean penalties__std penalties__max marking__min marking__mean marking__std marking__max standing_tackle__min standing_tackle__mean standing_tackle__std standing_tackle__max sliding_tackle__min sliding_tackle__mean sliding_tackle__std sliding_tackle__max gk_diving__min gk_diving__mean gk_diving__std gk_diving__max gk_handling__min gk_handling__mean gk_handling__std gk_handling__max gk_kicking__min gk_kicking__mean gk_kicking__std gk_kicking__max gk_positioning__min gk_positioning__mean gk_positioning__std gk_positioning__max gk_reflexes__min gk_reflexes__mean gk_reflexes__std gk_reflexes__max player_age__min player_age__mean player_age__std player_age__max
match_id team_id
145 8635 167.64 183.34 7.08 193.04 60.77 78.87 9.00 93.88 21.62 23.39 1.32 25.59 57.00 69.45 4.80 75.00 69.00 74.36 2.84 78.00 29.00 57.82 14.86 78.00 23.00 49.27 18.31 71.00 25.00 61.00 16.77 83.00 51.00 65.73 8.91 78.00 9.00 47.82 21.34 69.00 23.00 55.45 18.62 85.00 11.00 48.27 19.80 77.00 23.00 51.18 19.39 79.00 48.00 62.82 10.22 79.00 51.00 64.45 10.52 82.00 48.00 66.00 9.83 78.00 58.00 68.91 6.49 77.00 48.00 64.73 10.53 82.00 57.00 67.64 6.77 82.00 47.00 68.64 13.04 91.00 25.00 62.00 16.96 85.00 61.00 67.82 4.75 77.00 55.00 73.45 9.17 85.00 42.00 68.73 16.60 91.00 23.00 53.27 17.77 74.00 32.00 67.45 16.62 93.00 31.00 62.09 15.47 82.00 13.00 59.64 20.44 83.00 49.00 67.64 10.49 84.00 42.00 62.64 13.46 83.00 24.00 55.36 19.07 74.00 22.00 57.09 20.20 78.00 12.00 56.55 22.62 74.00 1.00 14.64 18.03 67.00 20.00 25.82 13.71 67.00 48.00 62.82 10.22 79.00 20.00 25.64 13.11 65.00 20.00 26.09 14.61 70.00 18.90 25.71 3.56 31.00
146 9987 170.18 181.26 6.84 193.04 60.77 73.92 8.41 89.80 20.34 22.42 1.16 24.10 54.00 64.09 6.11 72.00 62.00 71.27 5.39 83.00 22.00 54.64 14.66 75.00 22.00 48.91 18.53 74.00 22.00 51.09 13.92 75.00 26.00 56.00 12.77 72.00 25.00 53.73 14.28 72.00 22.00 54.18 19.54 77.00 25.00 53.18 13.47 70.00 11.00 48.18 17.97 72.00 42.00 56.00 9.27 67.00 22.00 58.45 17.30 76.00 56.00 68.18 6.55 77.00 48.00 69.27 8.87 79.00 37.00 65.36 10.46 75.00 56.00 64.27 5.82 72.00 51.00 63.73 7.79 77.00 22.00 56.18 17.67 81.00 61.00 67.09 3.91 73.00 43.00 66.82 9.71 83.00 47.00 63.27 12.55 89.00 22.00 51.73 16.66 69.00 44.00 62.36 11.41 82.00 30.00 59.27 11.59 72.00 30.00 57.45 16.59 77.00 25.00 60.82 13.47 74.00 31.00 58.55 15.44 82.00 21.00 45.09 17.70 65.00 21.00 47.36 17.32 74.00 22.00 49.27 20.23 72.00 1.00 13.82 16.52 62.00 20.00 24.91 12.68 63.00 42.00 56.27 9.57 67.00 20.00 24.55 11.48 59.00 20.00 25.00 12.98 64.00 18.78 23.39 3.22 27.43

Let’s prepare the datasets for predictive modelling. I followed several principles:

  • Mainly numeric variables will be included in the analysis. The exception is the variable team_type, which indicates whether the team plays at home or away.
  • Some variables with many missing values (especially betting odds, which are highly inter-correlated) were excluded in order to retain more complete cases.

Prepare the betting odds for the team analysis: instead of the home_wins and away_wins odds, which are less appropriate in this per-team view, each team’s _win (victory) and _loose (loss) odds will be used. Ratios will be calculated from these new variables accordingly; their names end in _ratio_wl (the ratio of the odds that the team wins to the odds that it loses).
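
As an illustrative sketch (not part of the pipeline), the mapping for a single bookmaker can be expressed with np.where; the actual code below repeats it explicitly for every bookmaker via two query/assign branches. The sketch assumes matches_long_team, created earlier, is still in memory; the local names are for illustration only.

Code: Sketch of the win/loss odds mapping for one bookmaker (illustrative)
import numpy as np

is_home = matches_long_team["team_type"] == "home"
# For the home team, the "win" odds are the odds that the home team wins;
# for the away team, they are the odds that the away team wins
# (and vice versa for the "loose" odds).
b365_win = np.where(
    is_home,
    matches_long_team["B365_home_wins"],
    matches_long_team["B365_away_wins"],
)
b365_loose = np.where(
    is_home,
    matches_long_team["B365_away_wins"],
    matches_long_team["B365_home_wins"],
)
b365_ratio_wl = b365_win / b365_loose
b365_log_ratio_wl = np.log(b365_ratio_wl)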

Code: Create table team_betting_odds
team_betting_odds_pre1 = matches_long_team.set_index(["match_id", "team_id"])

team_betting_odds = (
    pd.concat(
        [
            # Transformations for home team
            team_betting_odds_pre1.query("team_type == 'home'").assign(
                # Team wins
                B365_win=lambda x: x.B365_home_wins,
                BW_win=lambda x: x.BW_home_wins,
                VC_win=lambda x: x.VC_home_wins,
                IW_win=lambda x: x.IW_home_wins,
                WH_win=lambda x: x.WH_home_wins,
                LB_win=lambda x: x.LB_home_wins,
                # Team looses
                B365_loose=lambda x: x.B365_away_wins,
                BW_loose=lambda x: x.BW_away_wins,
                VC_loose=lambda x: x.VC_away_wins,
                IW_loose=lambda x: x.IW_away_wins,
                WH_loose=lambda x: x.WH_away_wins,
                LB_loose=lambda x: x.LB_away_wins,
            ),
            # Transformations for away team
            team_betting_odds_pre1.query("team_type == 'away'").assign(
                # Team wins
                B365_win=lambda x: x.B365_away_wins,
                BW_win=lambda x: x.BW_away_wins,
                VC_win=lambda x: x.VC_away_wins,
                IW_win=lambda x: x.IW_away_wins,
                WH_win=lambda x: x.WH_away_wins,
                LB_win=lambda x: x.LB_away_wins,
                # Team looses
                B365_loose=lambda x: x.B365_home_wins,
                BW_loose=lambda x: x.BW_home_wins,
                VC_loose=lambda x: x.VC_home_wins,
                IW_loose=lambda x: x.IW_home_wins,
                WH_loose=lambda x: x.WH_home_wins,
                LB_loose=lambda x: x.LB_home_wins,
            ),
        ]
    )
    .assign(
        # Ratio wl: "team wins / team looses"
        B365_ratio_wl=lambda x: x.B365_win / x.B365_loose,
        BW_ratio_wl=lambda x: x.BW_win / x.BW_loose,
        VC_ratio_wl=lambda x: x.VC_win / x.VC_loose,
        IW_ratio_wl=lambda x: x.IW_win / x.IW_loose,
        WH_ratio_wl=lambda x: x.WH_win / x.WH_loose,
        LB_ratio_wl=lambda x: x.LB_win / x.LB_loose,
        # Log-ratios of wl
        B365_log_ratio_wl=lambda x: np.log(x.B365_win / x.B365_loose),
        BW_log_ratio_wl=lambda x: np.log(x.BW_win / x.BW_loose),
        VC_log_ratio_wl=lambda x: np.log(x.VC_win / x.VC_loose),
        IW_log_ratio_wl=lambda x: np.log(x.IW_win / x.IW_loose),
        WH_log_ratio_wl=lambda x: np.log(x.WH_win / x.WH_loose),
        LB_log_ratio_wl=lambda x: np.log(x.LB_win / x.LB_loose),
    )
    # Keep just betting odds of interest
    .filter(regex="(?<!PS|GB|SJ|BS)_(win$|draw$|loose$|(log_)?ratio_wl$)")
    .dropna()
)

del [team_betting_odds_pre1]

# Inspect
team_betting_odds.tail(2)
Table 4.15. Inspection: a few rows of table team_betting_odds.
B365_draw BW_draw IW_draw LB_draw WH_draw VC_draw B365_win BW_win VC_win IW_win WH_win LB_win B365_loose BW_loose VC_loose IW_loose WH_loose LB_loose B365_ratio_wl BW_ratio_wl VC_ratio_wl IW_ratio_wl WH_ratio_wl LB_ratio_wl B365_log_ratio_wl BW_log_ratio_wl VC_log_ratio_wl IW_log_ratio_wl WH_log_ratio_wl LB_log_ratio_wl
match_id team_id
24491 8305 3.80 3.80 4.00 3.80 3.80 4.00 1.70 1.70 1.73 1.60 1.60 1.70 5.00 4.60 5.00 4.80 5.00 4.50 0.34 0.37 0.35 0.33 0.32 0.38 -1.08 -1.00 -1.06 -1.10 -1.14 -0.97
4702 8678 4.20 4.00 3.70 3.80 4.00 4.10 5.25 5.00 5.20 4.50 4.75 5.25 1.67 1.67 1.67 1.70 1.67 1.65 3.14 2.99 3.11 2.65 2.84 3.18 1.15 1.10 1.14 0.97 1.05 1.16
Code
team_betting_odds.shape
(44864, 30)
Code: Create table matches_long_team1
cols = matches_long_team.columns
# Remove player info and column with many NA values
condition_1 = ("player_", "buildUpPlayDribbling")
# Remove some categorical variables and betting odds info
condition_2 = ("Class", "_draw", "_wins", "_ha")
col_index = ~(cols.str.startswith(condition_1) | cols.str.endswith(condition_2))
# Remove rows with no team info
row_index = ~matches_long_team.team_info_date.isna()

matches_long_team1 = matches_long_team.loc[row_index, col_index]

# Join team info, player info and betting odds info
matches_long_team1 = (
    matches_long_team1.set_index(["match_id", "team_id"])
    .join(team_player_summary)
    .join(team_betting_odds)
)

# Remove intermediate results
del [cols, condition_1, condition_2, col_index, row_index]

# Inspect
matches_long_team1.head(2)
Table 4.16. Inspection: a few rows of table matches_long_team1.
country region league season stage match_date team_info_date goal_sum goal_diff goal_diff_sign match_winner team_type team_goals team_goal_diff team_goal_diff_sign team_outcome team_name team_short_name buildUpPlaySpeed buildUpPlayPassing chanceCreationPassing chanceCreationCrossing chanceCreationShooting defencePressure defenceAggression defenceTeamWidth height__min height__mean height__std height__max weight_kg__min weight_kg__mean weight_kg__std weight_kg__max bmi__min bmi__mean bmi__std bmi__max overall_rating__min overall_rating__mean overall_rating__std overall_rating__max potential__min potential__mean potential__std potential__max crossing__min crossing__mean crossing__std crossing__max finishing__min finishing__mean finishing__std finishing__max heading_accuracy__min heading_accuracy__mean heading_accuracy__std heading_accuracy__max short_passing__min short_passing__mean short_passing__std short_passing__max volleys__min volleys__mean volleys__std volleys__max dribbling__min dribbling__mean dribbling__std dribbling__max curve__min curve__mean curve__std curve__max free_kick_accuracy__min free_kick_accuracy__mean free_kick_accuracy__std free_kick_accuracy__max long_passing__min long_passing__mean long_passing__std long_passing__max ball_control__min ball_control__mean ball_control__std ball_control__max acceleration__min acceleration__mean acceleration__std acceleration__max sprint_speed__min sprint_speed__mean sprint_speed__std sprint_speed__max agility__min agility__mean agility__std agility__max reactions__min reactions__mean reactions__std reactions__max balance__min balance__mean balance__std balance__max shot_power__min shot_power__mean shot_power__std shot_power__max jumping__min jumping__mean jumping__std jumping__max stamina__min stamina__mean stamina__std stamina__max strength__min strength__mean strength__std strength__max long_shots__min long_shots__mean long_shots__std long_shots__max aggression__min aggression__mean aggression__std aggression__max interceptions__min interceptions__mean interceptions__std interceptions__max positioning__min positioning__mean positioning__std positioning__max vision__min vision__mean vision__std vision__max penalties__min penalties__mean penalties__std penalties__max marking__min marking__mean marking__std marking__max standing_tackle__min standing_tackle__mean standing_tackle__std standing_tackle__max sliding_tackle__min sliding_tackle__mean sliding_tackle__std sliding_tackle__max gk_diving__min gk_diving__mean gk_diving__std gk_diving__max gk_handling__min gk_handling__mean gk_handling__std gk_handling__max gk_kicking__min gk_kicking__mean gk_kicking__std gk_kicking__max gk_positioning__min gk_positioning__mean gk_positioning__std gk_positioning__max gk_reflexes__min gk_reflexes__mean gk_reflexes__std gk_reflexes__max player_age__min player_age__mean player_age__std player_age__max B365_draw BW_draw IW_draw LB_draw WH_draw VC_draw B365_win BW_win VC_win IW_win WH_win LB_win B365_loose BW_loose VC_loose IW_loose WH_loose LB_loose B365_ratio_wl BW_ratio_wl VC_ratio_wl IW_ratio_wl WH_ratio_wl LB_ratio_wl B365_log_ratio_wl BW_log_ratio_wl VC_log_ratio_wl IW_log_ratio_wl WH_log_ratio_wl LB_log_ratio_wl
match_id team_id
22055 10267 Spain Spain LIGA BBVA 2009/2010 23 2010-02-22 2010-02-22 3 1 1 Home Wins home 2 1 1 won Valencia CF VAL 30.00 30.00 55.00 60.00 70.00 55.00 60.00 60.00 170.18 178.95 4.87 185.42 67.12 74.99 4.62 82.09 22.37 23.40 0.65 24.67 77.00 80.55 3.64 88.00 80.00 84.82 3.66 91.00 21.00 62.91 24.16 89.00 21.00 55.09 26.52 94.00 21.00 65.00 17.37 82.00 21.00 73.00 18.74 90.00 9.00 57.18 23.11 87.00 21.00 68.18 22.03 89.00 21.00 64.64 20.46 88.00 11.00 58.55 22.02 86.00 55.00 67.18 8.39 86.00 32.00 75.18 16.22 91.00 35.00 73.18 15.08 88.00 35.00 72.73 14.96 87.00 45.00 68.82 14.90 87.00 59.00 77.00 8.93 93.00 56.00 74.45 8.15 82.00 21.00 69.64 20.05 91.00 58.00 69.09 9.20 85.00 46.00 74.36 10.71 85.00 58.00 73.18 9.52 85.00 21.00 63.27 24.01 88.00 52.00 72.36 10.68 89.00 60.00 76.09 8.93 89.00 11.00 74.82 22.26 93.00 58.00 75.64 10.76 90.00 66.00 74.27 6.10 86.00 21.00 55.36 29.53 86.00 21.00 54.82 28.98 85.00 8.00 56.18 27.17 81.00 7.00 16.00 20.37 77.00 20.00 26.82 15.05 72.00 55.00 67.18 8.39 86.00 20.00 28.27 19.86 88.00 20.00 27.09 15.95 75.00 21.65 28.60 4.54 38.48 4.00 3.80 3.70 3.60 3.50 3.75 1.57 1.50 1.53 1.55 1.53 1.50 5.50 6.50 6.50 6.00 6.00 5.50 0.29 0.23 0.24 0.26 0.26 0.27 -1.25 -1.47 -1.45 -1.35 -1.37 -1.30
8305 Spain Spain LIGA BBVA 2009/2010 23 2010-02-22 2010-02-22 3 1 1 Home Wins away 1 -1 -1 lost Getafe CF GET 30.00 35.00 35.00 50.00 70.00 40.00 30.00 50.00 175.26 182.42 3.74 187.96 68.03 75.82 3.93 81.18 21.83 22.77 0.61 23.87 72.00 75.45 2.38 80.00 75.00 80.18 3.71 86.00 24.00 62.36 17.15 89.00 21.00 51.64 22.73 80.00 24.00 64.18 15.67 82.00 24.00 68.18 16.82 83.00 11.00 55.91 20.40 74.00 24.00 63.09 17.31 83.00 7.00 59.00 20.98 86.00 12.00 59.45 19.65 80.00 49.00 68.09 12.36 85.00 26.00 67.45 14.20 75.00 68.00 74.18 4.87 82.00 63.00 72.82 6.24 83.00 57.00 66.91 6.44 80.00 59.00 71.73 5.39 77.00 62.00 71.55 5.92 81.00 24.00 66.18 20.86 92.00 49.00 69.55 8.72 81.00 48.00 72.55 10.57 85.00 66.00 74.73 6.17 87.00 24.00 60.09 17.19 83.00 27.00 66.82 16.46 82.00 58.00 70.64 5.68 76.00 11.00 68.27 19.60 83.00 55.00 66.91 11.09 84.00 55.00 65.09 5.03 73.00 23.00 57.00 24.42 87.00 24.00 59.73 23.25 84.00 12.00 56.82 26.10 84.00 1.00 13.18 20.49 74.00 20.00 27.55 16.45 77.00 49.00 68.45 12.44 85.00 20.00 27.36 15.85 75.00 20.00 27.91 17.65 81.00 20.85 26.45 3.64 33.95 4.00 3.80 3.70 3.60 3.50 3.75 5.50 6.50 6.50 6.00 6.00 5.50 1.57 1.50 1.53 1.55 1.53 1.50 3.50 4.33 4.25 3.87 3.92 3.67 1.25 1.47 1.45 1.35 1.37 1.30
Code
matches_long_team1.shape
(39840, 212)

In the following code blocks:

  • Separate datasets were created for team-related and match-related analysis.
  • Data were split into train and test sets:
    • The training set included all seasons except the last one.
    • The test set contained data from the last season only (2015/2016).
  • Incomplete cases were removed.
Code: Tables and variables for team-related predictive modeling
# For predictive modelling (teams)
# Target:
team_target = "team_goals"

# Predictors by type:
df = matches_long_team1
team_vars_team = df.loc[:, "buildUpPlaySpeed":"defenceTeamWidth"].columns
team_vars_player = df.loc[:, "height__min":"player_age__max"].columns
team_vars_betting_odds = df.loc[:, "B365_win":"LB_log_ratio_wl"].columns

team_predictors = [
    "team_type",
    *team_vars_team,
    *team_vars_player,
    *team_vars_betting_odds,
]

# Whole dataset
team_model = df.filter(
    ["season", team_target, *team_predictors], axis=1
).dropna()

# Training/Test sets
team_train = team_model.query("season != '2015/2016'").drop(columns="season")
team_test = team_model.query("season == '2015/2016'").drop(columns="season")

# Remove intermediate results
del [df, team_model]

# Inspect
team_train.head(2)
Table 4.17. Inspection: a few rows of table team_train.
team_goals team_type buildUpPlaySpeed buildUpPlayPassing chanceCreationPassing chanceCreationCrossing chanceCreationShooting defencePressure defenceAggression defenceTeamWidth height__min height__mean height__std height__max weight_kg__min weight_kg__mean weight_kg__std weight_kg__max bmi__min bmi__mean bmi__std bmi__max overall_rating__min overall_rating__mean overall_rating__std overall_rating__max potential__min potential__mean potential__std potential__max crossing__min crossing__mean crossing__std crossing__max finishing__min finishing__mean finishing__std finishing__max heading_accuracy__min heading_accuracy__mean heading_accuracy__std heading_accuracy__max short_passing__min short_passing__mean short_passing__std short_passing__max volleys__min volleys__mean volleys__std volleys__max dribbling__min dribbling__mean dribbling__std dribbling__max curve__min curve__mean curve__std curve__max free_kick_accuracy__min free_kick_accuracy__mean free_kick_accuracy__std free_kick_accuracy__max long_passing__min long_passing__mean long_passing__std long_passing__max ball_control__min ball_control__mean ball_control__std ball_control__max acceleration__min acceleration__mean acceleration__std acceleration__max sprint_speed__min sprint_speed__mean sprint_speed__std sprint_speed__max agility__min agility__mean agility__std agility__max reactions__min reactions__mean reactions__std reactions__max balance__min balance__mean balance__std balance__max shot_power__min shot_power__mean shot_power__std shot_power__max jumping__min jumping__mean jumping__std jumping__max stamina__min stamina__mean stamina__std stamina__max strength__min strength__mean strength__std strength__max long_shots__min long_shots__mean long_shots__std long_shots__max aggression__min aggression__mean aggression__std aggression__max interceptions__min interceptions__mean interceptions__std interceptions__max positioning__min positioning__mean positioning__std positioning__max vision__min vision__mean vision__std vision__max penalties__min penalties__mean penalties__std penalties__max marking__min marking__mean marking__std marking__max standing_tackle__min standing_tackle__mean standing_tackle__std standing_tackle__max sliding_tackle__min sliding_tackle__mean sliding_tackle__std sliding_tackle__max gk_diving__min gk_diving__mean gk_diving__std gk_diving__max gk_handling__min gk_handling__mean gk_handling__std gk_handling__max gk_kicking__min gk_kicking__mean gk_kicking__std gk_kicking__max gk_positioning__min gk_positioning__mean gk_positioning__std gk_positioning__max gk_reflexes__min gk_reflexes__mean gk_reflexes__std gk_reflexes__max player_age__min player_age__mean player_age__std player_age__max B365_win BW_win VC_win IW_win WH_win LB_win B365_loose BW_loose VC_loose IW_loose WH_loose LB_loose B365_ratio_wl BW_ratio_wl VC_ratio_wl IW_ratio_wl WH_ratio_wl LB_ratio_wl B365_log_ratio_wl BW_log_ratio_wl VC_log_ratio_wl IW_log_ratio_wl WH_log_ratio_wl LB_log_ratio_wl
match_id team_id
22055 10267 2 home 30.00 30.00 55.00 60.00 70.00 55.00 60.00 60.00 170.18 178.95 4.87 185.42 67.12 74.99 4.62 82.09 22.37 23.40 0.65 24.67 77.00 80.55 3.64 88.00 80.00 84.82 3.66 91.00 21.00 62.91 24.16 89.00 21.00 55.09 26.52 94.00 21.00 65.00 17.37 82.00 21.00 73.00 18.74 90.00 9.00 57.18 23.11 87.00 21.00 68.18 22.03 89.00 21.00 64.64 20.46 88.00 11.00 58.55 22.02 86.00 55.00 67.18 8.39 86.00 32.00 75.18 16.22 91.00 35.00 73.18 15.08 88.00 35.00 72.73 14.96 87.00 45.00 68.82 14.90 87.00 59.00 77.00 8.93 93.00 56.00 74.45 8.15 82.00 21.00 69.64 20.05 91.00 58.00 69.09 9.20 85.00 46.00 74.36 10.71 85.00 58.00 73.18 9.52 85.00 21.00 63.27 24.01 88.00 52.00 72.36 10.68 89.00 60.00 76.09 8.93 89.00 11.00 74.82 22.26 93.00 58.00 75.64 10.76 90.00 66.00 74.27 6.10 86.00 21.00 55.36 29.53 86.00 21.00 54.82 28.98 85.00 8.00 56.18 27.17 81.00 7.00 16.00 20.37 77.00 20.00 26.82 15.05 72.00 55.00 67.18 8.39 86.00 20.00 28.27 19.86 88.00 20.00 27.09 15.95 75.00 21.65 28.60 4.54 38.48 1.57 1.50 1.53 1.55 1.53 1.50 5.50 6.50 6.50 6.00 6.00 5.50 0.29 0.23 0.24 0.26 0.26 0.27 -1.25 -1.47 -1.45 -1.35 -1.37 -1.30
8305 1 away 30.00 35.00 35.00 50.00 70.00 40.00 30.00 50.00 175.26 182.42 3.74 187.96 68.03 75.82 3.93 81.18 21.83 22.77 0.61 23.87 72.00 75.45 2.38 80.00 75.00 80.18 3.71 86.00 24.00 62.36 17.15 89.00 21.00 51.64 22.73 80.00 24.00 64.18 15.67 82.00 24.00 68.18 16.82 83.00 11.00 55.91 20.40 74.00 24.00 63.09 17.31 83.00 7.00 59.00 20.98 86.00 12.00 59.45 19.65 80.00 49.00 68.09 12.36 85.00 26.00 67.45 14.20 75.00 68.00 74.18 4.87 82.00 63.00 72.82 6.24 83.00 57.00 66.91 6.44 80.00 59.00 71.73 5.39 77.00 62.00 71.55 5.92 81.00 24.00 66.18 20.86 92.00 49.00 69.55 8.72 81.00 48.00 72.55 10.57 85.00 66.00 74.73 6.17 87.00 24.00 60.09 17.19 83.00 27.00 66.82 16.46 82.00 58.00 70.64 5.68 76.00 11.00 68.27 19.60 83.00 55.00 66.91 11.09 84.00 55.00 65.09 5.03 73.00 23.00 57.00 24.42 87.00 24.00 59.73 23.25 84.00 12.00 56.82 26.10 84.00 1.00 13.18 20.49 74.00 20.00 27.55 16.45 77.00 49.00 68.45 12.44 85.00 20.00 27.36 15.85 75.00 20.00 27.91 17.65 81.00 20.85 26.45 3.64 33.95 5.50 6.50 6.50 6.00 6.00 5.50 1.57 1.50 1.53 1.55 1.53 1.50 3.50 4.33 4.25 3.87 3.92 3.67 1.25 1.47 1.45 1.35 1.37 1.30
Code
team_train.shape
(27200, 190)

In a similar way, the data for match-outcome analysis will be prepared.

Code: Create table matches_long_team2
cols = matches_long_team.columns
condition_1 = ("PS_", "GB_", "SJ_", "BS_", "player_", "buildUpPlayDribbling")

col_index = ~(cols.str.startswith(condition_1) | cols.str.endswith("Class"))
row_index = ~matches_long_team.team_info_date.isna()

matches_long_team2 = matches_long_team.loc[row_index, col_index]
matches_long_team2 = (
    matches_long_team2.set_index(["match_id", "team_id"])
    .join(team_player_summary)
    .relocate("team_short_name", before="goal_sum")
    .relocate("team_name", before="team_short_name")
)

# Remove intermediate results
del [cols, condition_1, col_index, row_index]

# Inspect
matches_long_team2.head(2)
Table 4.18. Inspection: a few rows of table matches_long_team2.
country region league season stage match_date team_info_date team_name team_short_name goal_sum goal_diff goal_diff_sign match_winner team_type team_goals team_goal_diff team_goal_diff_sign team_outcome B365_home_wins B365_draw B365_away_wins BW_home_wins BW_draw BW_away_wins IW_home_wins IW_draw IW_away_wins LB_home_wins LB_draw LB_away_wins WH_home_wins WH_draw WH_away_wins VC_home_wins VC_draw VC_away_wins B365_ratio_ha BW_ratio_ha VC_ratio_ha IW_ratio_ha WH_ratio_ha LB_ratio_ha B365_log_ratio_ha BW_log_ratio_ha VC_log_ratio_ha IW_log_ratio_ha WH_log_ratio_ha LB_log_ratio_ha buildUpPlaySpeed buildUpPlayPassing chanceCreationPassing chanceCreationCrossing chanceCreationShooting defencePressure defenceAggression defenceTeamWidth height__min height__mean height__std height__max weight_kg__min weight_kg__mean weight_kg__std weight_kg__max bmi__min bmi__mean bmi__std bmi__max overall_rating__min overall_rating__mean overall_rating__std overall_rating__max potential__min potential__mean potential__std potential__max crossing__min crossing__mean crossing__std crossing__max finishing__min finishing__mean finishing__std finishing__max heading_accuracy__min heading_accuracy__mean heading_accuracy__std heading_accuracy__max short_passing__min short_passing__mean short_passing__std short_passing__max volleys__min volleys__mean volleys__std volleys__max dribbling__min dribbling__mean dribbling__std dribbling__max curve__min curve__mean curve__std curve__max free_kick_accuracy__min free_kick_accuracy__mean free_kick_accuracy__std free_kick_accuracy__max long_passing__min long_passing__mean long_passing__std long_passing__max ball_control__min ball_control__mean ball_control__std ball_control__max acceleration__min acceleration__mean acceleration__std acceleration__max sprint_speed__min sprint_speed__mean sprint_speed__std sprint_speed__max agility__min agility__mean agility__std agility__max reactions__min reactions__mean reactions__std reactions__max balance__min balance__mean balance__std balance__max shot_power__min shot_power__mean shot_power__std shot_power__max jumping__min jumping__mean jumping__std jumping__max stamina__min stamina__mean stamina__std stamina__max strength__min strength__mean strength__std strength__max long_shots__min long_shots__mean long_shots__std long_shots__max aggression__min aggression__mean aggression__std aggression__max interceptions__min interceptions__mean interceptions__std interceptions__max positioning__min positioning__mean positioning__std positioning__max vision__min vision__mean vision__std vision__max penalties__min penalties__mean penalties__std penalties__max marking__min marking__mean marking__std marking__max standing_tackle__min standing_tackle__mean standing_tackle__std standing_tackle__max sliding_tackle__min sliding_tackle__mean sliding_tackle__std sliding_tackle__max gk_diving__min gk_diving__mean gk_diving__std gk_diving__max gk_handling__min gk_handling__mean gk_handling__std gk_handling__max gk_kicking__min gk_kicking__mean gk_kicking__std gk_kicking__max gk_positioning__min gk_positioning__mean gk_positioning__std gk_positioning__max gk_reflexes__min gk_reflexes__mean gk_reflexes__std gk_reflexes__max player_age__min player_age__mean player_age__std player_age__max
match_id team_id
22055 10267 Spain Spain LIGA BBVA 2009/2010 23 2010-02-22 2010-02-22 Valencia CF VAL 3 1 1 Home Wins home 2 1 1 won 1.57 4.00 5.50 1.50 3.80 6.50 1.55 3.70 6.00 1.50 3.60 5.50 1.53 3.50 6.00 1.53 3.75 6.50 0.29 0.23 0.24 0.26 0.26 0.27 -1.25 -1.47 -1.45 -1.35 -1.37 -1.30 30.00 30.00 55.00 60.00 70.00 55.00 60.00 60.00 170.18 178.95 4.87 185.42 67.12 74.99 4.62 82.09 22.37 23.40 0.65 24.67 77.00 80.55 3.64 88.00 80.00 84.82 3.66 91.00 21.00 62.91 24.16 89.00 21.00 55.09 26.52 94.00 21.00 65.00 17.37 82.00 21.00 73.00 18.74 90.00 9.00 57.18 23.11 87.00 21.00 68.18 22.03 89.00 21.00 64.64 20.46 88.00 11.00 58.55 22.02 86.00 55.00 67.18 8.39 86.00 32.00 75.18 16.22 91.00 35.00 73.18 15.08 88.00 35.00 72.73 14.96 87.00 45.00 68.82 14.90 87.00 59.00 77.00 8.93 93.00 56.00 74.45 8.15 82.00 21.00 69.64 20.05 91.00 58.00 69.09 9.20 85.00 46.00 74.36 10.71 85.00 58.00 73.18 9.52 85.00 21.00 63.27 24.01 88.00 52.00 72.36 10.68 89.00 60.00 76.09 8.93 89.00 11.00 74.82 22.26 93.00 58.00 75.64 10.76 90.00 66.00 74.27 6.10 86.00 21.00 55.36 29.53 86.00 21.00 54.82 28.98 85.00 8.00 56.18 27.17 81.00 7.00 16.00 20.37 77.00 20.00 26.82 15.05 72.00 55.00 67.18 8.39 86.00 20.00 28.27 19.86 88.00 20.00 27.09 15.95 75.00 21.65 28.60 4.54 38.48
8305 Spain Spain LIGA BBVA 2009/2010 23 2010-02-22 2010-02-22 Getafe CF GET 3 1 1 Home Wins away 1 -1 -1 lost 1.57 4.00 5.50 1.50 3.80 6.50 1.55 3.70 6.00 1.50 3.60 5.50 1.53 3.50 6.00 1.53 3.75 6.50 0.29 0.23 0.24 0.26 0.26 0.27 -1.25 -1.47 -1.45 -1.35 -1.37 -1.30 30.00 35.00 35.00 50.00 70.00 40.00 30.00 50.00 175.26 182.42 3.74 187.96 68.03 75.82 3.93 81.18 21.83 22.77 0.61 23.87 72.00 75.45 2.38 80.00 75.00 80.18 3.71 86.00 24.00 62.36 17.15 89.00 21.00 51.64 22.73 80.00 24.00 64.18 15.67 82.00 24.00 68.18 16.82 83.00 11.00 55.91 20.40 74.00 24.00 63.09 17.31 83.00 7.00 59.00 20.98 86.00 12.00 59.45 19.65 80.00 49.00 68.09 12.36 85.00 26.00 67.45 14.20 75.00 68.00 74.18 4.87 82.00 63.00 72.82 6.24 83.00 57.00 66.91 6.44 80.00 59.00 71.73 5.39 77.00 62.00 71.55 5.92 81.00 24.00 66.18 20.86 92.00 49.00 69.55 8.72 81.00 48.00 72.55 10.57 85.00 66.00 74.73 6.17 87.00 24.00 60.09 17.19 83.00 27.00 66.82 16.46 82.00 58.00 70.64 5.68 76.00 11.00 68.27 19.60 83.00 55.00 66.91 11.09 84.00 55.00 65.09 5.03 73.00 23.00 57.00 24.42 87.00 24.00 59.73 23.25 84.00 12.00 56.82 26.10 84.00 1.00 13.18 20.49 74.00 20.00 27.55 16.45 77.00 49.00 68.45 12.44 85.00 20.00 27.36 15.85 75.00 20.00 27.91 17.65 81.00 20.85 26.45 3.64 33.95
Code
matches_long_team2.shape
(39840, 212)
Code: Tables and variables for match-related predictive modeling
# For predictive modelling (match)
# Target:
match_target = "match_winner"

# Variable groups for transformation
vars_predictors = matches_long_team2.loc[
    :, "B365_home_wins":"player_age__max"
].columns
vars_betting_odds = matches_long_team2.loc[
    :, "B365_home_wins":"LB_log_ratio_ha"
].columns

# Whole dataset
match_model = (
    matches_long_team2.filter(
        ["season", "team_type", match_target, *vars_predictors], axis=1
    )
    .reset_index()
    .drop(columns="team_id")
    .pivot_wider(
        index=["match_id", match_target, "season", *vars_betting_odds],
        names_from="team_type",
    )
    .set_index("match_id")
    .dropna()
)

# Predictors by type
match_vars_team = match_model.loc[
    :, "buildUpPlaySpeed_away":"defenceTeamWidth_home"
].columns
match_vars_player = match_model.loc[
    :, "height__min_away":"player_age__max_home"
].columns
match_vars_betting_odds = match_model.loc[
    :, "B365_home_wins":"LB_log_ratio_ha"
].columns

match_predictors = [
    "team_type",
    *match_vars_team,
    *match_vars_player,
    *match_vars_betting_odds,
]

# Training/Test sets
match_train = match_model.query("season != '2015/2016'").drop(columns="season")
match_test = match_model.query("season == '2015/2016'").drop(columns="season")

# Remove intermediate results
del [match_model, vars_predictors, vars_betting_odds]

# Inspect
match_train.head(2)
Table 4.19. Inspection: a few rows of table match_train.
match_winner B365_home_wins B365_draw B365_away_wins BW_home_wins BW_draw BW_away_wins IW_home_wins IW_draw IW_away_wins LB_home_wins LB_draw LB_away_wins WH_home_wins WH_draw WH_away_wins VC_home_wins VC_draw VC_away_wins B365_ratio_ha BW_ratio_ha VC_ratio_ha IW_ratio_ha WH_ratio_ha LB_ratio_ha B365_log_ratio_ha BW_log_ratio_ha VC_log_ratio_ha IW_log_ratio_ha WH_log_ratio_ha LB_log_ratio_ha buildUpPlaySpeed_away buildUpPlaySpeed_home buildUpPlayPassing_away buildUpPlayPassing_home chanceCreationPassing_away chanceCreationPassing_home chanceCreationCrossing_away chanceCreationCrossing_home chanceCreationShooting_away chanceCreationShooting_home defencePressure_away defencePressure_home defenceAggression_away defenceAggression_home defenceTeamWidth_away defenceTeamWidth_home height__min_away height__min_home height__mean_away height__mean_home height__std_away height__std_home height__max_away height__max_home weight_kg__min_away weight_kg__min_home weight_kg__mean_away weight_kg__mean_home weight_kg__std_away weight_kg__std_home weight_kg__max_away weight_kg__max_home bmi__min_away bmi__min_home bmi__mean_away bmi__mean_home bmi__std_away bmi__std_home bmi__max_away bmi__max_home overall_rating__min_away overall_rating__min_home overall_rating__mean_away overall_rating__mean_home overall_rating__std_away overall_rating__std_home overall_rating__max_away overall_rating__max_home potential__min_away potential__min_home potential__mean_away potential__mean_home potential__std_away potential__std_home potential__max_away potential__max_home crossing__min_away crossing__min_home crossing__mean_away crossing__mean_home crossing__std_away crossing__std_home crossing__max_away crossing__max_home finishing__min_away finishing__min_home finishing__mean_away finishing__mean_home finishing__std_away finishing__std_home finishing__max_away finishing__max_home heading_accuracy__min_away heading_accuracy__min_home heading_accuracy__mean_away heading_accuracy__mean_home heading_accuracy__std_away heading_accuracy__std_home heading_accuracy__max_away heading_accuracy__max_home short_passing__min_away short_passing__min_home short_passing__mean_away short_passing__mean_home short_passing__std_away short_passing__std_home short_passing__max_away short_passing__max_home volleys__min_away volleys__min_home volleys__mean_away volleys__mean_home volleys__std_away volleys__std_home volleys__max_away volleys__max_home dribbling__min_away dribbling__min_home dribbling__mean_away dribbling__mean_home dribbling__std_away dribbling__std_home dribbling__max_away dribbling__max_home curve__min_away curve__min_home curve__mean_away curve__mean_home curve__std_away curve__std_home curve__max_away curve__max_home free_kick_accuracy__min_away free_kick_accuracy__min_home free_kick_accuracy__mean_away free_kick_accuracy__mean_home free_kick_accuracy__std_away free_kick_accuracy__std_home free_kick_accuracy__max_away ... 
shot_power__mean_away shot_power__mean_home shot_power__std_away shot_power__std_home shot_power__max_away shot_power__max_home jumping__min_away jumping__min_home jumping__mean_away jumping__mean_home jumping__std_away jumping__std_home jumping__max_away jumping__max_home stamina__min_away stamina__min_home stamina__mean_away stamina__mean_home stamina__std_away stamina__std_home stamina__max_away stamina__max_home strength__min_away strength__min_home strength__mean_away strength__mean_home strength__std_away strength__std_home strength__max_away strength__max_home long_shots__min_away long_shots__min_home long_shots__mean_away long_shots__mean_home long_shots__std_away long_shots__std_home long_shots__max_away long_shots__max_home aggression__min_away aggression__min_home aggression__mean_away aggression__mean_home aggression__std_away aggression__std_home aggression__max_away aggression__max_home interceptions__min_away interceptions__min_home interceptions__mean_away interceptions__mean_home interceptions__std_away interceptions__std_home interceptions__max_away interceptions__max_home positioning__min_away positioning__min_home positioning__mean_away positioning__mean_home positioning__std_away positioning__std_home positioning__max_away positioning__max_home vision__min_away vision__min_home vision__mean_away vision__mean_home vision__std_away vision__std_home vision__max_away vision__max_home penalties__min_away penalties__min_home penalties__mean_away penalties__mean_home penalties__std_away penalties__std_home penalties__max_away penalties__max_home marking__min_away marking__min_home marking__mean_away marking__mean_home marking__std_away marking__std_home marking__max_away marking__max_home standing_tackle__min_away standing_tackle__min_home standing_tackle__mean_away standing_tackle__mean_home standing_tackle__std_away standing_tackle__std_home standing_tackle__max_away standing_tackle__max_home sliding_tackle__min_away sliding_tackle__min_home sliding_tackle__mean_away sliding_tackle__mean_home sliding_tackle__std_away sliding_tackle__std_home sliding_tackle__max_away sliding_tackle__max_home gk_diving__min_away gk_diving__min_home gk_diving__mean_away gk_diving__mean_home gk_diving__std_away gk_diving__std_home gk_diving__max_away gk_diving__max_home gk_handling__min_away gk_handling__min_home gk_handling__mean_away gk_handling__mean_home gk_handling__std_away gk_handling__std_home gk_handling__max_away gk_handling__max_home gk_kicking__min_away gk_kicking__min_home gk_kicking__mean_away gk_kicking__mean_home gk_kicking__std_away gk_kicking__std_home gk_kicking__max_away gk_kicking__max_home gk_positioning__min_away gk_positioning__min_home gk_positioning__mean_away gk_positioning__mean_home gk_positioning__std_away gk_positioning__std_home gk_positioning__max_away gk_positioning__max_home gk_reflexes__min_away gk_reflexes__min_home gk_reflexes__mean_away gk_reflexes__mean_home gk_reflexes__std_away gk_reflexes__std_home gk_reflexes__max_away gk_reflexes__max_home player_age__min_away player_age__min_home player_age__mean_away player_age__mean_home player_age__std_away player_age__std_home player_age__max_away player_age__max_home
match_id
449 Home Wins 2.50 3.25 2.80 2.40 3.30 2.60 2.40 3.10 2.50 2.62 3.25 2.30 2.62 3.20 2.50 2.50 3.20 2.62 0.89 0.92 0.95 0.96 1.05 1.14 -0.11 -0.08 -0.05 -0.04 0.05 0.13 45.00 45.00 45.00 35.00 50.00 70.00 35.00 45.00 60.00 55.00 70.00 65.00 65.00 60.00 70.00 70.00 175.26 177.80 184.03 184.03 5.94 5.13 198.12 193.04 68.03 68.93 77.14 78.71 6.60 6.72 87.98 88.89 20.34 20.31 22.75 23.24 1.21 1.72 24.40 26.58 57.00 63.00 65.45 65.09 4.06 2.02 70.00 70.00 59.00 62.00 72.09 67.73 4.93 3.41 77.00 76.00 24.00 20.00 54.36 53.27 12.92 14.00 69.00 69.00 22.00 22.00 50.91 47.09 18.28 16.53 73.00 73.00 36.00 27.00 57.27 59.36 10.69 13.76 75.00 76.00 38.00 38.00 62.09 61.91 10.77 9.80 74.00 73.00 18.00 22.00 52.91 49.00 18.40 18.08 72.00 76.00 27.00 23.00 56.82 49.91 15.99 16.14 74.00 72.00 18.00 27.00 54.18 47.27 20.13 14.55 77.00 69.00 21.00 30.00 50.82 52.27 17.69 12.71 71.00 ... 63.00 64.55 12.67 6.67 85.00 72.00 55.00 55.00 65.45 62.55 4.37 4.57 70.00 71.00 52.00 43.00 65.45 70.09 5.72 11.45 74.00 85.00 52.00 53.00 64.73 66.18 8.88 10.68 79.00 81.00 21.00 21.00 54.00 51.45 17.56 17.11 74.00 73.00 23.00 37.00 54.82 62.00 15.44 13.11 71.00 78.00 48.00 52.00 62.64 65.73 6.85 9.84 73.00 80.00 37.00 53.00 62.36 65.18 11.16 7.99 82.00 78.00 37.00 52.00 64.82 66.82 11.16 8.67 77.00 83.00 39.00 53.00 61.18 65.09 12.66 8.97 83.00 83.00 21.00 20.00 42.45 51.45 19.43 18.91 65.00 71.00 22.00 20.00 47.18 55.45 19.61 16.23 76.00 69.00 14.00 19.00 48.36 52.64 20.52 18.46 72.00 70.00 1.00 5.00 10.73 15.36 14.96 16.19 55.00 63.00 20.00 20.00 24.27 25.73 9.60 12.55 53.00 63.00 48.00 51.00 60.36 59.18 7.37 7.00 69.00 71.00 20.00 20.00 24.55 25.55 10.50 11.95 56.00 61.00 20.00 20.00 25.18 26.00 12.60 13.44 63.00 66.00 17.80 25.07 23.69 29.20 3.71 3.16 29.02 33.64
451 Draw 2.15 3.30 3.40 2.15 3.25 3.05 2.20 3.10 2.80 2.10 3.20 3.00 2.20 3.20 3.10 2.05 3.20 3.25 0.63 0.70 0.63 0.79 0.71 0.70 -0.46 -0.35 -0.46 -0.24 -0.34 -0.36 50.00 65.00 60.00 60.00 50.00 50.00 50.00 40.00 50.00 50.00 60.00 60.00 60.00 70.00 65.00 60.00 175.26 175.26 183.11 181.73 4.88 5.83 190.50 193.04 68.03 68.93 76.60 78.95 5.47 5.48 84.81 88.89 20.88 22.44 22.83 23.89 0.95 0.76 24.01 24.96 62.00 57.00 65.00 63.73 1.61 2.90 67.00 67.00 64.00 63.00 69.64 68.27 4.01 2.53 78.00 72.00 21.00 25.00 52.36 53.00 15.35 13.65 70.00 71.00 26.00 25.00 52.91 46.82 15.67 14.74 67.00 67.00 8.00 25.00 53.09 53.36 18.21 11.88 73.00 67.00 21.00 25.00 56.73 58.73 13.24 12.02 70.00 69.00 14.00 10.00 50.09 45.73 18.01 17.59 66.00 67.00 13.00 25.00 53.64 54.27 18.92 13.87 76.00 68.00 12.00 16.00 50.91 51.36 18.17 16.92 72.00 72.00 13.00 12.00 50.64 49.64 16.24 17.51 69.00 ... 63.27 61.00 9.96 13.18 74.00 72.00 60.00 53.00 66.45 67.00 5.61 9.10 79.00 83.00 56.00 40.00 70.18 66.73 5.72 9.31 75.00 74.00 42.00 53.00 62.91 65.55 12.43 8.50 78.00 79.00 21.00 25.00 54.45 52.64 15.86 15.41 69.00 69.00 27.00 48.00 56.18 63.73 17.39 7.25 81.00 73.00 45.00 40.00 58.27 60.27 9.03 9.12 73.00 72.00 21.00 33.00 56.09 58.82 13.41 9.37 69.00 66.00 45.00 50.00 61.55 62.09 8.47 6.77 75.00 71.00 47.00 48.00 60.45 61.00 7.49 6.24 68.00 69.00 21.00 20.00 43.73 44.82 17.88 16.86 64.00 65.00 21.00 25.00 50.55 47.18 17.27 14.68 68.00 64.00 10.00 26.00 47.09 51.55 20.79 15.10 67.00 66.00 1.00 2.00 13.27 12.18 19.57 17.74 70.00 65.00 20.00 20.00 26.18 24.64 12.96 11.52 65.00 59.00 40.00 42.00 57.00 57.00 8.57 7.39 67.00 66.00 20.00 20.00 26.00 24.64 12.36 11.52 63.00 59.00 20.00 20.00 26.73 25.55 14.76 14.51 71.00 69.00 19.24 21.66 24.71 25.14 3.00 2.26 28.89 29.60

2 rows × 359 columns

Code
match_train.shape
(12634, 359)

Let’s clean up: remove unnecessary (intermediate) datasets.

Code
del [
    teams,
    teams_goals_per_team,
    team_player_summary,
    team_betting_odds,
    matches_long_team,
    matches_long_team1,
    matches_long_team2,
    matches_long_player,
]

5 Analysis

This is the main part where the most important data-based insights are created.

  1. At the top of each subsection, one or several questions are posed.
  2. Next, the summary and the main findings of that subsection follow.
  3. Next, the details (plots, tables, etc.) are provided.

Each subsection may contain ad-hoc data preprocessing and analysis code.

5.1 Included Countries and Leagues

  • Which leagues are in which countries?

There are 10 countries (Scotland and England are parts of the United Kingdom, UK) and 11 leagues in the database: one league per country, except the UK, which has two. See the details in the map (Figure 5.1) and in Table 5.1.

Code
europe_boundaries = Polygon([(-25, 35), (40, 35), (40, 75), (-25, 75)])
map_world = gpd.read_file(gpd.datasets.get_path("naturalearth_lowres"))
map_europe = (
    map_world.query("continent=='Europe'")
    .pipe(gpd.clip, europe_boundaries)
    .assign(included_countries=lambda x: x.name.isin(leagues.country))
)
cmap = LinearSegmentedColormap.from_list("", ["#DDD", "limegreen"])
ax = map_europe.plot(
    edgecolor="black", column="included_countries", cmap=cmap, figsize=(5, 5)
)
ax.set_axis_off();

Fig. 5.1. Map of the European countries (in green) whose leagues are included in this analysis.
Code
(
    leagues.drop(columns="league_id")
    .rename(columns=str.capitalize)
    .index_start_at(1)
    .style
)
Table 5.1. Football leagues in each country.
  Country Region League
1 Belgium Belgium Jupiler League
2 United Kingdom England England Premier League
3 France France Ligue 1
4 Germany Germany 1. Bundesliga
5 Italy Italy Serie A
6 Netherlands Netherlands Eredivisie
7 Poland Poland Ekstraklasa
8 Portugal Portugal Liga ZON Sagres
9 United Kingdom Scotland Scotland Premier League
10 Spain Spain LIGA BBVA
11 Switzerland Switzerland Super League

5.2 Comparing Leagues and Seasons

  1. Which leagues score the most and the fewest goals?
  2. Are there any goal-scoring patterns between seasons?

Main points of this section:

  1. Leagues differ in the number of matches per season:
    • The fewest games are played in Switzerland (180 matches per season);
    • The most games are played in Italy, France, England, and Spain (380 matches per season).
  2. As some matches are missing from the dataset and the leagues are of different sizes, it is more correct to compare leagues by goals per match than by total goals scored.
  3. Leagues differ in scoring rate (goals per match):
    • The highest-scoring league is the Netherlands Eredivisie (3.08 goals per match), and it does not differ significantly from the leagues in Switzerland and Germany.
    • The lowest-scoring league is the Poland Ekstraklasa (2.42 goals per match), and it does not differ significantly from the leagues in France, Portugal, Italy, and Scotland.
  4. Comparing seasons, no significant patterns (differences) were found.

Find the details in the following subsections.

5.2.1 Both (Leagues and Seasons)

First, slices of each league and season were analyzed (Table 5.2). The results revealed that some games are clearly missing: e.g., for the Belgium Jupiler League 2013/2014, the Wikipedia article indicates that 299 matches were played, while Table 5.2 shows only 12. Looking at the Wikipedia pages of some other seasons and leagues, it is clear that in some cases all games are included in the dataset, while in other cases some games are missing. So it is not correct to compare leagues by total matches and total goals; the average number of goals per match is a more appropriate measure.
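The goals_summary table used just below was prepared in earlier pre-processing steps. A minimal sketch of the underlying computation, assuming that the matches table contains the columns league, season and goal_sum (home goals plus away goals):

Code (sketch)
# Assumption: `matches` has columns `league`, `season` and `goal_sum`
# (home goals + away goals), as the EDA output further below suggests.
goals_per_match_sketch = (
    matches.groupby(["league", "season"])
    .agg(
        n_matches_total=("goal_sum", "size"),
        n_goals_total=("goal_sum", "sum"),
    )
    .assign(n_goals_per_match=lambda d: d.n_goals_total / d.n_matches_total)
    .round(2)
)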

Code
(
    goals_summary.style.format(
        {"n_goals_per_match": "{:.2f}", "n_goals_total": "{0:,.0f}"}
    ).bar(cmap="RdYlGn", height=80, width=50)
)
Table 5.2. Goal statistics for each league and season.
    n_matches_total n_goals_total n_goals_per_match
league season      
Netherlands Eredivisie 2008/2009 306 870 2.84
2009/2010 306 892 2.92
2010/2011 306 987 3.23
2011/2012 306 997 3.26
2012/2013 306 964 3.15
2013/2014 306 978 3.20
2014/2015 306 942 3.08
2015/2016 306 912 2.98
Switzerland Super League 2008/2009 180 540 3.00
2009/2010 180 599 3.33
2010/2011 180 537 2.98
2011/2012 162 425 2.62
2012/2013 180 462 2.57
2013/2014 180 520 2.89
2014/2015 180 517 2.87
2015/2016 180 566 3.14
Germany 1. Bundesliga 2008/2009 306 894 2.92
2009/2010 306 866 2.83
2010/2011 306 894 2.92
2011/2012 306 875 2.86
2012/2013 306 898 2.93
2013/2014 306 967 3.16
2014/2015 306 843 2.75
2015/2016 306 866 2.83
Spain LIGA BBVA 2008/2009 380 1,101 2.90
2009/2010 380 1,031 2.71
2010/2011 380 1,042 2.74
2011/2012 380 1,050 2.76
2012/2013 380 1,091 2.87
2013/2014 380 1,045 2.75
2014/2015 380 1,009 2.66
2015/2016 380 1,043 2.74
Belgium Jupiler League 2008/2009 306 855 2.79
2009/2010 210 565 2.69
2010/2011 240 635 2.65
2011/2012 240 691 2.88
2012/2013 240 703 2.93
2013/2014 12 30 2.50
2014/2015 240 668 2.78
2015/2016 240 694 2.89
England Premier League 2008/2009 380 942 2.48
2009/2010 380 1,053 2.77
2010/2011 380 1,063 2.80
2011/2012 380 1,066 2.81
2012/2013 380 1,063 2.80
2013/2014 380 1,052 2.77
2014/2015 380 975 2.57
2015/2016 380 1,026 2.70
Scotland Premier League 2008/2009 228 548 2.40
2009/2010 228 585 2.57
2010/2011 228 584 2.56
2011/2012 228 601 2.64
2012/2013 228 623 2.73
2013/2014 228 626 2.75
2014/2015 228 587 2.57
2015/2016 228 650 2.85
Italy Serie A 2008/2009 380 988 2.60
2009/2010 380 992 2.61
2010/2011 380 955 2.51
2011/2012 358 925 2.58
2012/2013 380 1,003 2.64
2013/2014 380 1,035 2.72
2014/2015 379 1,018 2.69
2015/2016 380 979 2.58
Portugal Liga ZON Sagres 2008/2009 240 552 2.30
2009/2010 240 601 2.50
2010/2011 240 584 2.43
2011/2012 240 634 2.64
2012/2013 240 667 2.78
2013/2014 240 569 2.37
2014/2015 306 763 2.49
2015/2016 306 831 2.72
France Ligue 1 2008/2009 380 858 2.26
2009/2010 380 916 2.41
2010/2011 380 890 2.34
2011/2012 380 956 2.52
2012/2013 380 967 2.54
2013/2014 380 933 2.46
2014/2015 380 947 2.49
2015/2016 380 960 2.53
Poland Ekstraklasa 2008/2009 240 524 2.18
2009/2010 240 532 2.22
2010/2011 240 578 2.41
2011/2012 240 527 2.20
2012/2013 240 598 2.49
2013/2014 240 634 2.64
2014/2015 240 628 2.62
2015/2016 240 635 2.65

5.2.2 Leagues

This subsection concentrates on leagues. It compares leagues by size, i.e., the number of games per season (Figure 5.2), and by scoring rate, i.e., the average number of goals per match (Figure 5.3). Numeric summaries are displayed in Table 5.3.

Code
ax = (
    goals_summary.reset_index()
    .assign(tmp=lambda x: x.groupby("league").n_matches_total.transform("mean"))
    .sort_values("tmp", ascending=True)
    .plot.scatter(
        x="league", y="n_matches_total", c=green, alpha=0.4, edgecolor="black"
    )
)
ax.tick_params(axis="x", rotation=90)
ax.set_xlabel("League")
ax.set_ylabel("Number of matches\nper season")
ax.set_ylim([0, 400]);

Fig. 5.2. Number of matches per season in each league. Darker points indicate that more seasons had this number of matches in that league. Lighter (more transparent) points indicate that there is some variation in the number of games: possibly some missing data or natural changes in the leagues' rules.
Code
ax = goals_summary.reset_index().plot.scatter(
    x="league", y="n_goals_per_match", c=blue, alpha=0.5, edgecolor="darkblue"
)

ax.tick_params(axis="x", rotation=90)
ax.set_xlabel("League")
ax.set_ylabel("Number of goals per match")
ax.set_ylim([1, 3.5]);

Fig. 5.3. Average performance (number of goals per match) in each league. Each point represents the mean of one season.
Code
res_goals_by_league = an.AnalyzeNumericGroups(
    goals_summary.reset_index(), y="n_goals_per_match", by="league"
).fit()

res_goals_by_league.display()
Omnibus (Kruskal-Wallis) test results
Source ddof1 H p-unc
Kruskal league 10 58.83 p < 0.001
Post-hoc (Conover-Iman) test results as CLD and Confidence intervals (CI)
league cld spaced_cld mean ci_lower ci_upper
0 Netherlands Eredivisie a a____ 3.08 2.95 3.21
1 Switzerland Super League ab ab___ 2.93 2.72 3.14
2 Germany 1. Bundesliga ab ab___ 2.90 2.80 3.00
3 Spain LIGA BBVA bc _bc__ 2.77 2.70 2.83
4 Belgium Jupiler League bc _bc__ 2.76 2.64 2.89
5 England Premier League bcd _bcd_ 2.71 2.61 2.81
6 Scotland Premier League cde __cde 2.63 2.52 2.75
7 Italy Serie A cde __cde 2.62 2.56 2.67
8 Portugal Liga ZON Sagres de ___de 2.53 2.39 2.67
9 France Ligue 1 e ____e 2.44 2.36 2.53
10 Poland Ekstraklasa e ____e 2.42 2.25 2.60
Descriptive statistics of group (league) means
count min max range mean median std mad skew
mean 11 2.42 3.08 0.66 2.71 2.71 0.21 0.18 0.28
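The omnibus and post-hoc results above come from the custom helper an.AnalyzeNumericGroups. A rough equivalent using general-purpose packages (a sketch, assuming pingouin and scikit-posthocs are installed; the helper's CLD and confidence-interval construction is not reproduced here):

Code (sketch)
import pingouin as pg
import scikit_posthocs as sp

df_leagues = goals_summary.reset_index()

# Omnibus Kruskal-Wallis test: do leagues differ in goals per match?
print(pg.kruskal(data=df_leagues, dv="n_goals_per_match", between="league"))

# Pairwise Conover-Iman post-hoc comparisons
# (the helper may use a different p-value adjustment)
print(
    sp.posthoc_conover(
        df_leagues, val_col="n_goals_per_match", group_col="league", p_adjust="holm"
    )
)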
Code
(
    goals_summary.groupby("league")
    .agg(["mean", "std"])
    .style.format(precision=1)
    .format("{:.2f}", subset=["n_goals_per_match"])
    .highlight_max(color="#FFFF77", subset="n_goals_per_match")
    .highlight_min(color="#FFBBBB", subset="n_goals_per_match")
)
Table 5.3. Statistics of performance in each league and season: summaries for each league. Yellow cells indicate the maximum and pale-red ones the minimum values in each column.
  n_matches_total n_goals_total n_goals_per_match
  mean std mean std mean std
league            
Netherlands Eredivisie 306.0 0.0 942.8 46.9 3.08 0.15
Switzerland Super League 177.8 6.4 520.8 55.3 2.93 0.25
Germany 1. Bundesliga 306.0 0.0 887.9 37.0 2.90 0.12
Spain LIGA BBVA 380.0 0.0 1051.5 30.3 2.77 0.08
Belgium Jupiler League 216.0 86.7 605.1 246.3 2.76 0.15
England Premier League 380.0 0.0 1030.0 46.7 2.71 0.12
Scotland Premier League 228.0 0.0 600.5 31.8 2.63 0.14
Italy Serie A 377.1 7.7 986.9 34.8 2.62 0.07
Portugal Liga ZON Sagres 256.5 30.6 650.1 99.3 2.53 0.17
France Ligue 1 380.0 0.0 928.4 38.2 2.44 0.10
Poland Ekstraklasa 240.0 0.0 582.0 49.0 2.42 0.20

5.2.3 Seasons

This subsection concentrates on seasons. It compares seasons by size, i.e., the number of games per league (Figure 5.4), and by scoring rate, i.e., the average number of goals per match (Figure 5.5). Numerical summaries are displayed in Table 5.4.

Code
ax = goals_summary.reset_index().plot.scatter(
    x="season", y="n_matches_total", alpha=0.4, c=green, edgecolor="black"
)
ax.tick_params(axis="x", rotation=90)
ax.set_xlabel("Season")
ax.set_ylabel("Number of matches\nper league")
ax.set_ylim([0, 400]);

Fig. 5.4. Number of matches per league in each season. Darker points indicate that more leagues had this number of matches in that season.
Code
ax = goals_summary.reset_index().plot.scatter(
    x="season", y="n_goals_per_match", c=blue, alpha=0.5, edgecolor="darkblue"
)
ax.tick_params(axis="x", rotation=90)
ax.set_xlabel("Season")
ax.set_ylabel("Number of goals\nper match")
ax.set_ylim([1, 3.5]);

Fig. 5.5. Average performance (number of goals per match) in each season. Each point represents the mean of one league.
Code
res_goals_by_season = an.AnalyzeNumericGroups(
    goals_summary.reset_index(), y="n_goals_per_match", by="season"
).fit()

res_goals_by_season.display()
Omnibus (Kruskal-Wallis) test results
Source ddof1 H p-unc
Kruskal season 7 3.49 p = 0.836
Post-hoc (Conover-Iman) test results as CLD and Confidence intervals (CI)
season cld spaced_cld mean ci_lower ci_upper
0 2008/2009 a a 2.61 2.41 2.81
1 2009/2010 a a 2.69 2.49 2.88
2 2010/2011 a a 2.69 2.50 2.87
3 2011/2012 a a 2.71 2.53 2.88
4 2012/2013 a a 2.77 2.63 2.90
5 2013/2014 a a 2.75 2.57 2.92
6 2014/2015 a a 2.69 2.57 2.81
7 2015/2016 a a 2.78 2.66 2.90
Descriptive statistics of group (season) means
count min max range mean median std mad skew
mean 8 2.61 2.78 0.18 2.71 2.70 0.06 0.03 -0.48
Code
(
    goals_summary.groupby("season")
    .agg(["mean", "std"])
    .style.format(precision=1)
    .format("{:.2f}", subset=["n_goals_per_match"])
    .highlight_max(color="#FFFF77", subset="n_goals_per_match")
    .highlight_min(color="#FFBBBB", subset="n_goals_per_match")
)
Table 5.4. Statistics of performance in each league and season: summaries for each season. Yellow cells indicate the maximum and pale-red ones the minimum values in each column.
  n_matches_total n_goals_total n_goals_per_match
  mean std mean std mean std
season            
2008/2009 302.4 72.4 788.4 208.2 2.61 0.30
2009/2010 293.6 77.5 784.7 207.7 2.69 0.29
2010/2011 296.4 74.8 795.4 210.3 2.69 0.28
2011/2012 292.7 75.6 795.2 226.2 2.71 0.26
2012/2013 296.4 74.8 821.7 216.3 2.77 0.20
2013/2014 275.6 113.5 762.6 319.9 2.75 0.26
2014/2015 302.3 72.3 808.8 184.0 2.69 0.18
2015/2016 302.4 72.4 832.9 170.1 2.78 0.18

5.3 Top Teams

  1. Which teams show the best performance?
  2. How do the best and the worst teams differ in scoring rate?
  3. How many matches do the best teams win and lose?
  • The analysis of teams included data from 7 seasons.

  • To evaluate a team's performance, the number of seasons in which the team appeared among the Top 5 scoring teams (in terms of goals per match in that season) was counted; a minimal sketch of this counting approach is given after this list. Table 5.5 shows that 13 teams appear in that list, and Real Madrid CF (7 times in 7 seasons), FC Barcelona (6 times), and PSV (5 times) are the 3 leaders. Comparing the best and the worst teams, their performance differs by about 2 goals per match (Figure 5.6, Table 5.6).

  • Comparing teams by the highest percentage of won matches per season, SL Benfica (5 times in 7 seasons), FC Barcelona (5 times), Real Madrid CF (4 times), and Celtic (4 times) were most frequently among the Top 5 (Table 5.7).

  • To get among the Top 5 winners, in some cases it was sufficient to win as little as 73.7 % of matches, but no more than 15.8 % of matches could be lost (Table 5.8).

See the details below.
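The Top-5 counting behind Table 5.5 relies on helper tables built in earlier pre-processing. A minimal sketch of the idea, using a hypothetical team_season_goals table with one row per team and season and an n_goals_per_match column:

Code (sketch)
# Hypothetical input `team_season_goals`: one row per (season, team_name)
# with that season's goals per match.
top5_per_season = (
    team_season_goals.sort_values("n_goals_per_match", ascending=False)
    .groupby("season")
    .head(5)  # Top 5 scoring teams of each season
)

# Number of seasons each team appears among the Top 5 (compare to Table 5.5)
top5_counts = top5_per_season.team_name.value_counts()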

Code
print(f"Seasons in this analysis: {teams_top_bottom_goals.season.nunique()}")
Seasons in this analysis: 7
Code
(
    teams_top_bottom_goals.query("which == 'Top 5'")
    .team_name.value_counts()
    .to_df("Number of seasons (out of 7)", "Team")
    .index_start_at(1)
    .style
)
Table 5.5. Number of seasons a team was among the Top 5 by number of goals per match.
  Team Number of seasons (out of 7)
1 Real Madrid CF 7
2 FC Barcelona 6
3 PSV 5
4 FC Bayern Munich 4
5 SL Benfica 3
6 Ajax 2
7 FC Porto 2
8 Manchester City 2
9 Chelsea 1
10 Roda JC Kerkrade 1
11 Celtic 1
12 Liverpool 1
13 Paris Saint-Germain 1

Compare the performance of Top 5 and Bottom 5 teams:

Code
ax = sns.scatterplot(
    teams_top_bottom_goals,
    x="season",
    y="n_goals_per_match",
    hue="which",
)

ax.tick_params(axis="x", rotation=90)
ax.set_xlabel("Season")
ax.set_ylabel("Number of goals per match")
ax.set_ylim([0, 4.5])
ax.get_legend().set_title(None)

Fig. 5.6. Comparison of Top 5 and Bottom 5 teams by number of goals per match.
Code
teams_top_bottom_goals_summary = (
    teams_top_bottom_goals.groupby(["which"])
    .n_goals_per_match.agg(["min", "mean", "std", "max"])
    .sort_index(ascending=False)
    .reset_index()
)

teams_top_bottom_goals_summary.columns = pd.MultiIndex.from_tuples(
    [
        ("", "Group of teams"),
        ("Goals per match", "Min"),
        ("Goals per match", "Mean"),
        ("Goals per match", "SD"),
        ("Goals per match", "Max"),
    ]
)

teams_top_bottom_goals_summary.style.format(precision=2).hide(axis="index")
Table 5.6. Summaries of Top 5 and Bottom 5 teams by number of goals per match in all 7 seasons.
Goals per match
Group of teams Min Mean SD Max
Top 5 2.32 2.77 0.31 3.70
Bottom 5 0.30 0.68 0.11 0.87
Code
(
    teams_wins_per_season.reset_index()
    .Team.value_counts()
    .to_df("Number of seasons (out of 7)", "Team")
    .index_start_at(1)
    .style
)
Table 5.7. Number of seasons a team was among the Top 5 by percentage of won matches.
  Team Number of seasons (out of 7)
1 SL Benfica 5
2 FC Barcelona 5
3 Real Madrid CF 4
4 Celtic 4
5 FC Porto 3
6 FC Bayern Munich 3
7 Manchester United 2
8 PSV 2
9 RSC Anderlecht 1
10 Ajax 1
11 Rangers 1
12 Manchester City 1
13 Juventus 1
14 Atlético Madrid 1
15 Sporting CP 1
16 Paris Saint-Germain 1
Code
variables = ["Lost", "Draw", "Won"]

(
    teams_wins_per_season.style.format("{:.1f} %", subset=variables)
    .highlight_max(subset=variables, color="#FFFF77")
    .highlight_min(subset=variables, color="#FFBBBB")
)
Table 5.8. Top 5 teams by percentage of won matches in each season. Highest values in each column are in yellow and lowest ones are in pale red.
      Lost Draw Won
Season League Team      
2009/2010 Belgium Jupiler League RSC Anderlecht 0.0 % 0.0 % 100.0 %
Netherlands Eredivisie Ajax 0.0 % 0.0 % 100.0 %
Portugal Liga ZON Sagres SL Benfica 10.0 % 0.0 % 90.0 %
Spain LIGA BBVA Real Madrid CF 6.7 % 6.7 % 86.7 %
FC Barcelona 0.0 % 13.3 % 86.7 %
2010/2011 Portugal Liga ZON Sagres FC Porto 0.0 % 10.0 % 90.0 %
Spain LIGA BBVA FC Barcelona 5.3 % 15.8 % 78.9 %
Scotland Premier League Rangers 13.2 % 7.9 % 78.9 %
Spain LIGA BBVA Real Madrid CF 10.5 % 13.2 % 76.3 %
Scotland Premier League Celtic 10.5 % 13.2 % 76.3 %
2011/2012 Spain LIGA BBVA Real Madrid CF 5.3 % 10.5 % 84.2 %
Scotland Premier League Celtic 13.2 % 7.9 % 78.9 %
Portugal Liga ZON Sagres FC Porto 3.3 % 20.0 % 76.7 %
England Premier League Manchester City 13.2 % 13.2 % 73.7 %
Spain LIGA BBVA FC Barcelona 7.9 % 18.4 % 73.7 %
England Premier League Manchester United 13.2 % 13.2 % 73.7 %
2012/2013 Germany 1. Bundesliga FC Bayern Munich 2.9 % 11.8 % 85.3 %
Spain LIGA BBVA FC Barcelona 5.3 % 10.5 % 84.2 %
Portugal Liga ZON Sagres SL Benfica 3.3 % 16.7 % 80.0 %
FC Porto 0.0 % 20.0 % 80.0 %
England Premier League Manchester United 13.2 % 13.2 % 73.7 %
2013/2014 Italy Serie A Juventus 5.3 % 7.9 % 86.8 %
Germany 1. Bundesliga FC Bayern Munich 5.9 % 8.8 % 85.3 %
Scotland Premier League Celtic 2.6 % 15.8 % 81.6 %
Portugal Liga ZON Sagres SL Benfica 6.7 % 16.7 % 76.7 %
Spain LIGA BBVA Atlético Madrid 10.5 % 15.8 % 73.7 %
2014/2015 Netherlands Eredivisie PSV 11.8 % 2.9 % 85.3 %
Portugal Liga ZON Sagres SL Benfica 8.8 % 11.8 % 79.4 %
Spain LIGA BBVA Real Madrid CF 15.8 % 5.3 % 78.9 %
FC Barcelona 10.5 % 10.5 % 78.9 %
Scotland Premier League Celtic 10.5 % 13.2 % 76.3 %
2015/2016 Portugal Liga ZON Sagres SL Benfica 11.8 % 2.9 % 85.3 %
Germany 1. Bundesliga FC Bayern Munich 5.9 % 11.8 % 82.4 %
Portugal Liga ZON Sagres Sporting CP 5.9 % 14.7 % 79.4 %
France Ligue 1 Paris Saint-Germain 5.3 % 15.8 % 78.9 %
Netherlands Eredivisie PSV 5.9 % 17.6 % 76.5 %

5.4 Players in 2015/2016

This subsection deals with the analysis of football players in the most recent available season. It also contains a link to the dashboard (a technical requirement of this project).

5.5 Analysis of Players

  1. Which players are the best ones?
  2. Which player attributes are related to being a good player?
  3. How do various player attributes relate to each other?
  • This analysis included players from season 2015/2016 (player records dated 2015-07-01 or later; if several records exist for a player, the most recent one is used).

  • Among the 7057 included players, the Top 5 players by overall rating were Lionel Messi, Cristiano Ronaldo, Neymar, Manuel Neuer, and Luis Suarez (Table 5.9). Player reactions (r=0.81) and potential (r=0.80) were the attributes most strongly correlated with the overall rating (Table 5.11).

  • Correlation and hierarchical cluster analysis revealed at least 2 groups of related player attributes: one of the major clusters seems to be associated with goalkeeping-related features, and larger values of physiological properties like body mass (variable “weight_kg”) and height are positively related to better goalkeeping characteristics (Figure 5.7). The other cluster has several sub-clusters, which might also be related to different player roles, but this idea should be investigated in more detail.

Find the details below in this sub-section.

Code
player_info_2015_2016 = players.query("player_info_date >= '2015-07-01'")
n_players_last_season = player_info_2015_2016.player_id.nunique()
print(f"Number of players included: {n_players_last_season}")
Number of players included: 7057
Code
players_2015_2016 = (
    player_info_2015_2016.assign(
        rank=lambda x: (
            x.groupby("player_id").player_info_date.rank(
                method="first", ascending=False
            )
        )
    )
    .query("rank == 1")
    .drop(columns=["rank"])
    .sort_values("overall_rating", ascending=False)
)

players_2015_2016.query("overall_rating >= 90")
Table 5.9. Top players in season 2015/2016: players with an overall rating of 90 or higher.
player_id player_info_date player_name birthday birth_year height weight_kg bmi overall_rating potential preferred_foot attacking_work_rate defensive_work_rate crossing finishing heading_accuracy short_passing volleys dribbling curve free_kick_accuracy long_passing ball_control acceleration sprint_speed agility reactions balance shot_power jumping stamina strength long_shots aggression interceptions positioning vision penalties marking standing_tackle sliding_tackle gk_diving gk_handling gk_kicking gk_positioning gk_reflexes
102482 30981 2015-12-17 Lionel Messi 1987-06-24 1987 170.18 72.11 24.90 94.00 94.00 left medium low 80.00 93.00 71.00 88.00 85.00 96.00 89.00 90.00 79.00 96.00 95.00 90.00 92.00 92.00 95.00 80.00 68.00 75.00 59.00 88.00 48.00 22.00 90.00 90.00 74.00 13.00 23.00 21.00 6.00 11.00 15.00 14.00 8.00
33330 30893 2015-10-16 Cristiano Ronaldo 1985-02-05 1985 185.42 79.82 23.22 93.00 93.00 right high low 82.00 95.00 86.00 81.00 87.00 93.00 88.00 77.00 72.00 91.00 91.00 93.00 90.00 92.00 62.00 94.00 94.00 90.00 79.00 93.00 62.00 29.00 93.00 81.00 85.00 22.00 31.00 23.00 7.00 11.00 15.00 14.00 11.00
131464 19533 2016-02-04 Neymar 1992-02-05 1992 175.26 68.03 22.15 90.00 94.00 right high medium 72.00 88.00 62.00 78.00 83.00 94.00 78.00 79.00 74.00 93.00 91.00 90.00 92.00 86.00 84.00 78.00 61.00 79.00 45.00 73.00 56.00 36.00 89.00 79.00 81.00 21.00 24.00 33.00 9.00 9.00 15.00 15.00 11.00
109033 27299 2016-04-21 Manuel Neuer 1986-03-27 1986 193.04 92.06 24.71 90.00 90.00 right medium medium 15.00 13.00 25.00 48.00 11.00 16.00 14.00 11.00 47.00 31.00 58.00 61.00 43.00 87.00 35.00 25.00 78.00 44.00 83.00 16.00 29.00 30.00 12.00 70.00 37.00 10.00 10.00 11.00 85.00 87.00 91.00 90.00 87.00
105983 40636 2015-10-16 Luis Suarez 1987-01-24 1987 182.88 84.81 25.36 90.00 90.00 right high medium 77.00 90.00 77.00 82.00 87.00 88.00 86.00 84.00 64.00 91.00 88.00 78.00 86.00 91.00 60.00 88.00 69.00 88.00 76.00 85.00 78.00 41.00 91.00 84.00 85.00 30.00 45.00 38.00 27.00 25.00 31.00 33.00 37.00
Code
players_2015_2016.query("overall_rating < 50")
Table 5.10. Players with overall rating below 50 in season 2015/2016.
player_id player_info_date player_name birthday birth_year height weight_kg bmi overall_rating potential preferred_foot attacking_work_rate defensive_work_rate crossing finishing heading_accuracy short_passing volleys dribbling curve free_kick_accuracy long_passing ball_control acceleration sprint_speed agility reactions balance shot_power jumping stamina strength long_shots aggression interceptions positioning vision penalties marking standing_tackle sliding_tackle gk_diving gk_handling gk_kicking gk_positioning gk_reflexes
98716 215085 2015-07-10 Kyrylo Petrov 1990-06-22 1990 182.88 76.19 22.78 49.00 55.00 right medium medium 30.00 22.00 63.00 40.00 28.00 28.00 30.00 30.00 26.00 51.00 60.00 55.00 43.00 51.00 59.00 37.00 72.00 58.00 66.00 26.00 60.00 54.00 27.00 29.00 37.00 55.00 62.00 64.00 11.00 11.00 14.00 9.00 13.00
76938 696435 2016-04-14 Jan Bamert 1998-03-09 1998 180.34 69.84 21.47 48.00 67.00 right medium medium 38.00 23.00 40.00 26.00 31.00 59.00 30.00 30.00 22.00 31.00 66.00 59.00 53.00 47.00 61.00 30.00 59.00 56.00 50.00 24.00 55.00 48.00 37.00 34.00 39.00 46.00 57.00 52.00 7.00 11.00 6.00 9.00 7.00
101928 674221 2016-02-04 Liam Grimshaw 1995-02-02 1995 177.80 74.83 23.67 48.00 60.00 right medium high 33.00 32.00 42.00 55.00 27.00 41.00 33.00 37.00 42.00 49.00 68.00 67.00 58.00 54.00 65.00 52.00 63.00 64.00 59.00 33.00 67.00 48.00 34.00 48.00 44.00 45.00 54.00 49.00 13.00 8.00 15.00 12.00 11.00
157494 659742 2016-05-12 Sandro Lauper 1996-10-25 1996 185.42 69.84 20.31 48.00 64.00 right medium medium 47.00 45.00 39.00 52.00 47.00 59.00 44.00 38.00 47.00 53.00 65.00 67.00 57.00 42.00 61.00 55.00 45.00 53.00 50.00 42.00 33.00 35.00 49.00 45.00 54.00 22.00 36.00 25.00 14.00 7.00 13.00 6.00 11.00
172 528212 2016-02-25 Aaron Lennox 1993-02-19 1993 190.50 82.09 22.62 48.00 56.00 right medium medium 12.00 15.00 16.00 23.00 14.00 15.00 14.00 18.00 18.00 22.00 15.00 26.00 31.00 45.00 24.00 26.00 38.00 18.00 44.00 12.00 21.00 19.00 14.00 15.00 41.00 15.00 15.00 12.00 53.00 41.00 39.00 51.00 53.00
151371 614951 2016-03-03 Robin Huser 1998-01-24 1998 180.34 69.84 21.47 47.00 63.00 right medium medium 34.00 27.00 44.00 53.00 35.00 47.00 44.00 35.00 47.00 41.00 65.00 66.00 67.00 52.00 75.00 57.00 64.00 48.00 47.00 29.00 51.00 47.00 39.00 37.00 56.00 42.00 48.00 50.00 13.00 6.00 15.00 11.00 16.00
Code
cor_data_players = [
    (i, round(players_2015_2016.overall_rating.corr(players_2015_2016[i]), 3))
    for i in (
        players_2015_2016.select_dtypes("number").drop(
            columns=["player_id", "overall_rating"]
        )
    )
]

(
    pd.DataFrame(cor_data_players, columns=["variable", "r"])
    .sort_values("r", ascending=False)
    .index_start_at(1)
    .style.format({"r": "{:.2f}"})
    .bar(vmin=-1, vmax=1, cmap="BrBG", subset=["r"])
)
Table 5.11. Correlation of player attributes with the overall rating.
  variable r
1 reactions 0.81
2 potential 0.80
3 vision 0.41
4 short_passing 0.40
5 long_passing 0.39
6 ball_control 0.36
7 shot_power 0.33
8 long_shots 0.32
9 curve 0.32
10 volleys 0.30
11 free_kick_accuracy 0.29
12 crossing 0.29
13 dribbling 0.28
14 aggression 0.27
15 positioning 0.27
16 penalties 0.27
17 finishing 0.25
18 stamina 0.25
19 heading_accuracy 0.24
20 jumping 0.23
21 strength 0.22
22 interceptions 0.21
23 agility 0.20
24 sprint_speed 0.18
25 acceleration 0.17
26 standing_tackle 0.15
27 sliding_tackle 0.13
28 marking 0.13
29 bmi 0.09
30 balance 0.09
31 weight_kg 0.07
32 gk_handling 0.02
33 gk_reflexes 0.02
34 gk_diving 0.02
35 gk_positioning 0.02
36 gk_kicking 0.02
37 height 0.01
38 birth_year -0.23
Note

Heatmaps and clustered heatmaps in this project are very big, as they contain many variables. I tried a smaller plot size, but then every second variable name was hidden.

Code
sns.clustermap(
    players_2015_2016.corr(numeric_only=True),
    vmin=-1,
    vmax=1,
    annot=False,
    cmap="BrBG",
    method="centroid",
    figsize=(15, 15),
);

Fig. 5.7. Clustered heatmap of correlation coefficients between player attributes.

5.5.1 Dashboard

Some additional exploration of football players is available via this Looker Studio dashboard (preview in Figure 5.8). Only players with no missing data in their attributes are included in the dashboard.

Fig. 5.8. Screenshot of the Looker Studio dashboard (link).

5.6 Home Advantage: Is It Real?

  1. Is there such a thing as home advantage?
  2. If yes, can we quantify it?

The analysis of 25,979 matches revealed that:

  • Teams that play at home win 45.9% (CI 45.1%–46.6%) of matches, compared to 28.7% away wins and 25.4% draws. This difference is statistically significant (χ² test, p < 0.001); a quick re-check of these numbers with plain scipy/statsmodels is sketched below.
  • On average, home teams score 0.38 goals more than away teams. This shift toward home advantage is statistically significant (t-test, p < 0.001).
  • Leagues do differ in the degree of home advantage: e.g., in Spain LIGA BBVA the home advantage is as high as 0.50 goals per match, while in the Scotland Premier League it is as low as 0.22.
  • Comparing different seasons, no significant differences were found.

Find the details below.
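Before the detailed output, a quick re-check of the headline numbers with plain scipy and statsmodels (a sketch; statsmodels is an extra dependency here, and the confidence-interval method used by the custom helper an.AnalyzeCounts may differ):

Code (sketch)
from statsmodels.stats.proportion import proportion_confint

outcome_counts = matches.match_winner.value_counts()
n_total = outcome_counts.sum()

# Chi-squared goodness-of-fit test against equal probabilities of the 3 outcomes
chi2, p = sps.chisquare(outcome_counts)
print(f"χ²(2, n = {n_total}) = {chi2:.2f}, p = {p:.3g}")

# 95% CI for the share of home wins (Wilson interval assumed here)
ci_low, ci_high = proportion_confint(
    outcome_counts["Home Wins"], n_total, alpha=0.05, method="wilson"
)
print(f"Home wins: {outcome_counts['Home Wins'] / n_total:.1%} (CI {ci_low:.1%}-{ci_high:.1%})")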

Code
# Count of Home wins, Draws and Away wins
counts = matches.match_winner.value_counts(sort=False).rename("matches")
res_counts = an.AnalyzeCounts(counts, "Match outcome").fit()
res_counts.display()
Omnibus (chi-squared) test results
Chi square test, χ²(2, n = 25979) = 1881.57, p < 0.001
Counts of matches with 95% CI and post-hoc (pairwise chi-squared) test results
  Match outcome n_matches percent ci_lower ci_upper cld spaced_cld
0 Away Wins 7,466 28.7% 28.1% 29.4% a a__
1 Draw 6,596 25.4% 24.7% 26.0% b _b_
2 Home Wins 11,917 45.9% 45.1% 46.6% c __c
Descriptive statistics of group (Match outcome) counts
count min max range mean median std mad skew
n_matches 3 6,596 11,917 5,321 8,660 7,466 2,854 870 1.55
percent 3 25.4% 45.9% 20.5% 33.3% 28.7% 11.0% 3.3% 1.55
Code
res_counts.plot(rot=0, color=[blue, blue, green]);

Fig. 5.9. Distribution of match outcomes. The most common outcome is highlighted in green.
Code
mean_goal_diff = matches.goal_diff.mean()

ax = matches.goal_diff.plot.hist(
    edgecolor="black", label="_nolegend_", bins=np.arange(-6.5, 6.5)
)

ax.set_xlabel("Goal difference (home wins, if >0)")
ax.set_ylabel("Number of matches")

ax.axvline(
    x=mean_goal_diff,
    color="red",
    linestyle="--",
    label="Mean",
    zorder=1,
)

ax.axvline(
    x=0,
    color="gold",
    markeredgecolor="grey",
    linestyle="--",
    label="Zero (draw)",
    linewidth=1.5,
    zorder=2,
)

ax.legend(frameon=False, loc="upper right")

# Print results
(t, p) = sps.ttest_1samp(matches.goal_diff, 0)
print(
    f"On average, home teams score {mean_goal_diff:.2f} goals more than away "
    "teams. \nThis shift toward home advantage is statistically significant \n"
    f"(t-test, {my.format_p(p)})."
)
On average, home teams score 0.38 goals more than away teams. 
This shift toward home advantage is statistically significant 
(t-test, p < 0.001).

Fig. 5.10. Distribution of the goal difference in each match: negative when the away team wins, 0 when it is a draw, positive when the home team wins.
Code
res_by_league = an.AnalyzeNumericGroups(matches, "goal_diff", by="league").fit()
res_by_league.display()
Omnibus (Kruskal-Wallis) test results
Source ddof1 H p-unc
Kruskal league 10 44.87 p < 0.001
Post-hoc (Conover-Iman) test results as CLD and Confidence intervals (CI)
league cld spaced_cld mean ci_lower ci_upper
0 Belgium Jupiler League ab ab_ 0.42 0.33 0.50
1 England Premier League ab ab_ 0.39 0.33 0.46
2 France Ligue 1 abc abc 0.36 0.31 0.42
3 Germany 1. Bundesliga abc abc 0.35 0.28 0.43
4 Italy Serie A ab ab_ 0.38 0.33 0.44
5 Netherlands Eredivisie a a__ 0.48 0.40 0.56
6 Poland Ekstraklasa abc abc 0.36 0.29 0.44
7 Portugal Liga ZON Sagres bc _bc 0.28 0.21 0.36
8 Scotland Premier League c __c 0.22 0.14 0.31
9 Spain LIGA BBVA a a__ 0.50 0.43 0.56
10 Switzerland Super League abc abc 0.40 0.30 0.49
Descriptive statistics of group (league) means
count min max range mean median std mad skew
mean 11 0.22 0.50 0.27 0.38 0.38 0.08 0.03 -0.44
Code
(_, ax) = res_by_league.plot(ylabel="Goal difference \n(home wins, if >0)")
ax.tick_params(axis="x", rotation=90)
ax.axhline(
    y=0,
    color="lightgray",
    linestyle="--",
    label="Draw",
    zorder=1,
)
ax.legend(frameon=False, loc="lower right")
ax.set_ylim([-0.15, 0.6]);

Fig. 5.11. Degree of home advantage in different leagues. Mean goal difference (home minus away) of a match with 95% confidence interval.
Code
res_by_season = an.AnalyzeNumericGroups(matches, "goal_diff", by="season").fit()

res_by_season.display()
Omnibus (Kruskal-Wallis) test results
Source ddof1 H p-unc
Kruskal season 7 12.85 p = 0.076
Post-hoc (Conover-Iman) test results as CLD and Confidence intervals (CI)
season cld spaced_cld mean ci_lower ci_upper
0 2008/2009 a a 0.40 0.35 0.46
1 2009/2010 a a 0.41 0.35 0.47
2 2010/2011 a a 0.41 0.35 0.47
3 2011/2012 a a 0.43 0.37 0.49
4 2012/2013 a a 0.33 0.27 0.39
5 2013/2014 a a 0.39 0.33 0.46
6 2014/2015 a a 0.36 0.30 0.43
7 2015/2016 a a 0.33 0.27 0.40
Descriptive statistics of group (season) means
count min max range mean median std mad skew
mean 8 0.33 0.43 0.10 0.38 0.40 0.04 0.02 -0.62
Code
(_, ax) = res_by_season.plot(ylabel="Goal difference \n(home wins, if >0)")
ax.tick_params(axis="x", rotation=90)
ax.axhline(
    y=0,
    color="lightgray",
    linestyle="--",
    label="Draw",
    zorder=1,
)
ax.legend(frameon=False, loc="lower right")
ax.set_ylim([-0.15, 0.6]);

Fig. 5.12. Degree of home advantage in different seasons. Mean goal difference (home minus away) of a match with 95% confidence interval.

5.7 Relationship Between Betting Odds

  1. What is the relationship between betting odds from different websites?
  2. How strongly are betting odds related to match outcomes?

The betting odds from different websites, as well as the ratio and log-ratio of the “home wins” versus “away wins” odds, are investigated in this subsection. The analysis (Figure 5.13) shows that:

  • odds of the same type (e.g., “home wins”) from different websites are strongly correlated with each other;
  • odds of a “draw” are more strongly related to “away wins” and almost uncorrelated with “home wins”;
  • the log-ratio of betting odds shows the strongest correlation with the match outcome: r = -0.46 for B365 (bet365.com) when the log-ratio of the betting odds is correlated with the goal difference (home goals minus away goals), as shown in Table 5.12; a minimal sketch of this calculation follows below.

See details below.
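A minimal sketch of the log-ratio calculation referred to above; the columns B365_home_wins, B365_away_wins and goal_diff are present in matches_betting_odds (see the skimpy summary below):

Code (sketch)
# Log-ratio of the "home wins" vs "away wins" odds for one bookmaker (B365)
b365_log_ratio = np.log(
    matches_betting_odds.B365_home_wins / matches_betting_odds.B365_away_wins
)

# Pearson correlation with the goal difference (home minus away);
# this should be close to the r = -0.46 reported in Table 5.12
print(f"r = {b365_log_ratio.corr(matches_betting_odds.goal_diff):.2f}")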

EDA: Overview of matches_betting_odds table
Code
skim(matches_betting_odds)
╭──────────────────────────────────────────────── skimpy summary ─────────────────────────────────────────────────╮
│          Data Summary                Data Types               Categories                                        │
│ ┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┓ ┏━━━━━━━━━━━━━┳━━━━━━━┓ ┏━━━━━━━━━━━━━━━━━━━━━━━┓                                │
│ ┃ dataframe          Values ┃ ┃ Column Type  Count ┃ ┃ Categorical Variables ┃                                │
│ ┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━┩ ┡━━━━━━━━━━━━━╇━━━━━━━┩ ┡━━━━━━━━━━━━━━━━━━━━━━━┩                                │
│ │ Number of rows    │ 25979  │ │ float64     │ 50    │ │ match_winner          │                                │
│ │ Number of columns │ 57     │ │ int32       │ 6     │ └───────────────────────┘                                │
│ └───────────────────┴────────┘ │ category    │ 1     │                                                          │
│                                └─────────────┴───────┘                                                          │
│                                                     number                                                      │
│ ┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━┳━━━━━━━━━┓  │
│ ┃ column_name             NA       NA %    mean     sd      p0       p25     p75     p100   hist    ┃  │
│ ┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━╇━━━━━━━━━┩  │
│ │ stage                       0     0     18    10      1     9    27   38█▇▇▇▇▅  │  │
│ │ home_team_goal              0     0    1.5   1.3      0     1     2   10  █▅▁   │  │
│ │ away_team_goal              0     0    1.2   1.1      0     0     2    9  █▂▁   │  │
│ │ goal_sum                    0     0    2.7   1.7      0     2     4   12 ▄█▄▁   │  │
│ │ goal_diff                   0     0   0.38   1.8     -9    -1     1   10  ▁█▇▁  │  │
│ │ goal_diff_sign              0     0   0.17  0.85     -1    -1     1    1▅  ▄ █  │  │
│ │ B365_home_wins           3400    13    2.6   1.8      1   1.7   2.8   26 │  │
│ │ BW_home_wins             3400    13    2.6   1.6      1   1.6   2.8   34 │  │
│ │ IW_home_wins             3500    13    2.5   1.4      1   1.6   2.6   20  █▁    │  │
│ │ LB_home_wins             3400    13    2.5   1.6      1   1.7   2.7   26 │  │
│ │ PS_home_wins            15000    57    2.8   2.2      1   1.7     3   36 │  │
│ │ WH_home_wins             3400    13    2.6   1.7      1   1.7   2.8   26 │  │
│ │ SJ_home_wins             8900    34    2.6   1.7      1   1.7   2.8   23  █▁    │  │
│ │ VC_home_wins             3400    13    2.7   1.9      1   1.7   2.8   36 │  │
│ │ GB_home_wins            12000    45    2.5   1.5    1.1   1.7   2.6   21  █▁    │  │
│ │ BS_home_wins            12000    45    2.5   1.5      1   1.7   2.6   17  █▁    │  │
│ │ B365_draw                3400    13    3.8   1.1    1.4   3.3     4   17  █▂    │  │
│ │ BW_draw                  3400    13    3.7     1    1.6   3.2   3.8   20  █▁    │  │
│ │ IW_draw                  3500    13    3.6   0.8    1.5   3.2   3.7   11  ▁█▁   │  │
│ │ LB_draw                  3400    13    3.7     1    1.4   3.2   3.8   19  █▁    │  │
│ │ PS_draw                 15000    57    4.1   1.5    2.2   3.4   4.2   29 │  │
│ │ WH_draw                  3400    13    3.7  0.96      1   3.2   3.8   17  █▃    │  │
│ │ SJ_draw                  8900    34    3.8     1    1.4   3.2   3.8   15  █▃    │  │
│ │ VC_draw                  3400    13    3.9   1.2    1.6   3.3     4   26  █▁    │  │
│ │ GB_draw                 12000    45    3.6  0.87    1.4   3.2   3.8   11  ▁█▁   │  │
│ │ BS_draw                 12000    45    3.7  0.87    1.3   3.2   3.8   13  ▅█▁   │  │
│ │ B365_away_wins           3400    13    4.7   3.7    1.1   2.5   5.2   51  █▁    │  │
│ │ BW_away_wins             3400    13    4.4   3.3    1.1   2.5     5   51  █▁    │  │
│ │ IW_away_wins             3500    13    4.2   2.9    1.1   2.5   4.6   25  █▂    │  │
│ │ LB_away_wins             3400    13    4.4   3.4    1.1   2.5     5   51  █▁    │  │
│ │ PS_away_wins            15000    57      5   4.5    1.1   2.6   5.4   48  █▁    │  │
│ │ WH_away_wins             3400    13    4.5   3.6    1.1   2.5     5   51  █▁    │  │
│ │ SJ_away_wins             8900    34    4.6   3.6    1.1   2.5   5.2   41  █▁    │  │
│ │ VC_away_wins             3400    13    4.8   4.3    1.1   2.5   5.4   67 │  │
│ │ GB_away_wins            12000    45    4.4     3    1.1   2.5     5   34  █▁    │  │
│ │ BS_away_wins            12000    45    4.4   3.2    1.1   2.5     5   34  █▁    │  │
│ │ B365_ratio_ha            3400    13    1.1   1.5  0.021  0.32   1.1   24 │  │
│ │ BW_ratio_ha              3400    13    1.1   1.4  0.021  0.33   1.1   31 │  │
│ │ PS_ratio_ha             15000    57    1.2   1.9  0.022  0.32   1.2   33 │  │
│ │ VC_ratio_ha              3400    13    1.1   1.7  0.015  0.32   1.1   33 │  │
│ │ IW_ratio_ha              3500    13      1   1.3  0.042  0.36     1   18 │  │
│ │ WH_ratio_ha              3400    13    1.1   1.5  0.021  0.34   1.1   24 │  │
│ │ GB_ratio_ha             12000    45      1   1.3  0.031  0.33     1   19 │  │
│ │ LB_ratio_ha              3400    13      1   1.4  0.021  0.34   1.1   24 │  │
│ │ SJ_ratio_ha              8900    34      1   1.4  0.026  0.32   1.1   20 │  │
│ │ BS_ratio_ha             12000    45      1   1.3  0.031  0.33     1   15  █▁    │  │
│ │ B365_log_ratio_ha        3400    13   -0.5     1   -3.9  -1.1 0.097  3.2  ▃█▆▂  │  │
│ │ BW_log_ratio_ha          3400    13  -0.48     1   -3.9  -1.1  0.08  3.4  ▃█▅▁  │  │
│ │ PS_log_ratio_ha         15000    57  -0.49   1.1   -3.8  -1.1  0.15  3.5 ▁▃█▅▁  │  │
│ │ VC_log_ratio_ha          3400    13  -0.51   1.1   -4.2  -1.1   0.1  3.5  ▂█▆▂  │  │
│ │ IW_log_ratio_ha          3500    13  -0.46  0.95   -3.2    -1 0.039  2.9 ▁▃█▅▁  │  │
│ │ WH_log_ratio_ha          3400    13  -0.48     1   -3.9  -1.1 0.074  3.2  ▃█▇▂  │  │
│ │ GB_log_ratio_ha         12000    45   -0.5  0.97   -3.5  -1.1 0.039  2.9  ▃█▆▂  │  │
│ │ LB_log_ratio_ha          3400    13  -0.48     1   -3.9  -1.1 0.077  3.2  ▃█▇▂  │  │
│ │ SJ_log_ratio_ha          8900    34  -0.51     1   -3.7  -1.1 0.056    3  ▃█▆▂  │  │
│ │ BS_log_ratio_ha         12000    45   -0.5  0.98   -3.5  -1.1 0.039  2.7 ▁▃█▇▂  │  │
│ └────────────────────────┴─────────┴────────┴─────────┴────────┴─────────┴────────┴────────┴───────┴─────────┘  │
│                                                    category                                                     │
│ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓  │
│ ┃ column_name                         NA         NA %           ordered                unique            ┃  │
│ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩  │
│ │ match_winner                              0            0False                                3 │  │
│ └────────────────────────────────────┴───────────┴───────────────┴───────────────────────┴───────────────────┘  │
╰────────────────────────────────────────────────────── End ──────────────────────────────────────────────────────╯
EDA: Missing Value Plots of matches_betting_odds table

The missing-value structure appears to be characteristic of each betting website: the betting-odds variables from the same website cluster together by their missing-value pattern.

Code
msno.matrix(matches_betting_odds, figsize=(10, 5));

Code
msno.dendrogram(matches_betting_odds);

EDA: Data Profiling Report of matches_betting_odds
Code
matches_betting_odds.shape
(25979, 57)
Code
if do_eda:
    profile_match_odds = eda.ProfileReport(
        matches_betting_odds,
        title="Data Profiling Report: matches_betting_odds",
        config_file="_config/ydata_profile_config--mini.yaml",
    )

    profile_match_odds
Note

Heatmaps and clustered heatmaps in this project are very large because they contain many variables. With a smaller plot size, every second variable name got hidden.

Code
cor_betting = matches_betting_odds.select_dtypes("number").corr()

There are 5 types of variables related to betting odds (odds for home wins, away wins, and a draw, plus the ratio and log ratio of home versus away win odds). Within each type, the odds from different betting websites are highly correlated (see Fig. 5.13).
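The ratio and log-ratio columns are derived variables. Below is a minimal sketch of how such columns could be recreated for one bookmaker, assuming they are simply the ratio of home-win to away-win odds and its natural logarithm (the helper name add_odds_ratios is hypothetical, not part of the project code):

import numpy as np
import pandas as pd


def add_odds_ratios(df: pd.DataFrame, bookmaker: str = "B365") -> pd.DataFrame:
    """Add the home/away odds ratio and its log for one bookmaker (assumed derivation)."""
    home = df[f"{bookmaker}_home_wins"]
    away = df[f"{bookmaker}_away_wins"]
    return df.assign(
        **{
            f"{bookmaker}_ratio_ha": home / away,
            f"{bookmaker}_log_ratio_ha": np.log(home / away),
        }
    )


# Hypothetical usage: matches_betting_odds = add_odds_ratios(matches_betting_odds)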

Code
plt.figure(figsize=(15, 11))
mask = np.triu(np.ones_like(cor_betting, dtype=np.bool_))
sns.heatmap(cor_betting.round(2), vmin=-1, vmax=1, annot=False, cmap="BrBG");

Fig. 5.13. Heatmap of Pearson’s correlation coefficients between betting odds, goal statistics, and some other variables.
Code
def name_replace(x):
    return x.replace("B365_", "B365 ")


new_names = {"goal_diff": "Match goal difference (home–away)"}

(
    matches_betting_odds.filter(regex="B365_|diff$")
    .corr()
    .rename(columns=new_names, index=new_names)
    .rename(columns=name_replace, index=name_replace)
)
Table 5.12. Correlation between betting odds from the B365 (bet365.com) website. ratio_ha is the ratio of betting odds for home versus away wins.
Match goal difference (home–away) B365 home_wins B365 draw B365 away_wins B365 ratio_ha B365 log_ratio_ha
Match goal difference (home–away) 1.00 -0.38 0.24 0.40 -0.35 -0.46
B365 home_wins -0.38 1.00 0.02 -0.47 0.99 0.82
B365 draw 0.24 0.02 1.00 0.82 0.09 -0.45
B365 away_wins 0.40 -0.47 0.82 1.00 -0.41 -0.83
B365 ratio_ha -0.35 0.99 0.09 -0.41 1.00 0.77
B365 log_ratio_ha -0.46 0.82 -0.45 -0.83 0.77 1.00
Details: Clustered version of the heatmap

Clustered heatmap of Pearson’s correlation coefficients between betting odds, goal statistics, and some other variables.

Code
sns.clustermap(
    cor_betting.round(1),
    vmin=-1,
    vmax=1,
    annot=False,
    cmap="BrBG",
    method="centroid",
    figsize=(15, 15),
);

Code
del cor_betting

5.8 Team Score Prediction

  • Can we predict how many goals each team will score in each match?
  • In this section, the number of goals each team scores in a match is modeled.
  • As a reference, the standard deviation of goals was calculated: SD = 1.26 goals.
  • The initial idea was to select 4 final models, one for each type of variables that is available at a different time before the match (team-related features, player-related features, and betting odds), plus one model based on all types of variables, so:
    • Three separate models for 3 predictor types (team-related features, player-related features and betting odds) were created.
    • Models with all 3 feature types as well as PCA features were also among the candidates, but they did not improve cross-validation performance and were discarded (see Table 5.13).
  • Finally, only a single model was selected:
    • Models with team-related (train RMSE = 1.24, R² = 0.03) and player-related features (train RMSE = 1.20, R² = 0.09) had very poor performance and barely explained any variation in the target variable (R² < 0.15), so they were also discarded (see Table 5.14).
    • In two cases (a. the betting-odds-based model and b. the model where all variables were among the candidate predictors) the same RF model with a single variable B365_win (betting odds that the team wins) was selected. Its test performance is RMSE = 1.16, R² = 0.15.
  • It is debatable whether betting odds are a reliable predictor because of their nature (they are the output of another model, they change frequently, etc.). Yet, in this analysis betting odds were the only type of predictor that allowed reaching a model with a minimally reasonable amount of explained variance (R² ≥ 0.15).
  • Conclusion: there is a lot of randomness in the game, so based on the available data it is hard to reliably predict in advance how many goals a team will score.

A summary of the results is presented in Tables 5.13 and 5.14.

The details are in the subsections below.

Code
target_sd = team_train[team_target].std()

print(
    "Standard deviation (SD) of target variable in training set: "
    f"{round(target_sd, 2)} goals"
)
Standard deviation (SD) of target variable in training set: 1.26 goals
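As a sanity check, this SD can be read as the RMSE of a naive baseline that always predicts the training-set mean; any useful model should do better than it. A minimal sketch, reusing team_train and team_target from above:

import numpy as np

y = team_train[team_target].to_numpy()
baseline_rmse = np.sqrt(np.mean((y - y.mean()) ** 2))  # RMSE of a mean-only predictor

print(f"Mean-only baseline RMSE: {baseline_rmse:.2f} (compare with SD = {y.std(ddof=1):.2f})")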

5.8.3 Betting Odds as Predictors

Linear Regression

Code
# Do SFS or take results from cache
def fun_sfs_res_team_betting():
    np.random.seed(250)
    estimator = LinearRegression()
    subset = [team_target, "team_type", *team_vars_betting_odds]
    X, y = team_train[subset].make_dummies(exclude=team_target)
    return ml.sfs(estimator, "regression").fit(X, y)


file = "saved-output/sfs_res_team_betting.pickle"
sfs_res_team_betting = my.cached_results(file, fun_sfs_res_team_betting)
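my.cached_results is a project helper. A plausible minimal sketch of such a pickle-based cache, assuming it loads the saved result if the file exists and otherwise computes and saves it (this is an assumption, not the project's actual implementation):

import os
import pickle


def cached_results(path, fun):
    """Load the pickled result at `path` if it exists; otherwise compute, save, and return it."""
    if os.path.exists(path):
        with open(path, "rb") as file:
            return pickle.load(file)
    result = fun()
    directory = os.path.dirname(path)
    if directory:
        os.makedirs(directory, exist_ok=True)
    with open(path, "wb") as file:
        pickle.dump(result, file)
    return result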
Code
ml.sfs_plot_results(
    sfs_res_team_betting,
    "Predictors: Betting Odds (Linear Regression)",
);

Fig. 5.18. SFS results.
k = 2, avg. RMSE = 1.162 [Parsimonious]
(Smallest number of predictors at best ± 1 SE score)
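The "Parsimonious" rule above picks the smallest number of predictors whose cross-validated score is within one standard error of the best score. A minimal sketch of that rule, using the mean RMSE values from the table below and purely illustrative standard errors (the report's actual SE values are not shown here):

import numpy as np

rmse_mean = np.array([1.168, 1.162, 1.162, 1.161, 1.161])  # CV mean RMSE for k = 1..5 (from the SFS table)
rmse_se = np.full(5, 0.003)                                 # illustrative SE values only

best = int(rmse_mean.argmin())
threshold = rmse_mean[best] + rmse_se[best]
k_parsimonious = int(np.argmax(rmse_mean <= threshold)) + 1  # smallest k within 1 SE of the best

print(f"Best k: {best + 1}, parsimonious k: {k_parsimonious}")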
Details: Numeric values of RMSE
Code
(
    ml.sfs_list_features(sfs_res_team_betting)
    .style.format(precision=3)
    .highlight_min(subset=["RMSE"])
)
  added_feature RMSE RMSE_improvement RMSE_percentage_change
step        
1 VC_log_ratio_wl 1.168 nan nan
2 B365_loose 1.162 0.006 0.524
3 IW_ratio_wl 1.162 0.000 0.028
4 IW_win 1.161 0.000 0.031
5 VC_win 1.161 0.000 0.013
6 VC_ratio_wl 1.161 0.000 0.034
7 BW_ratio_wl 1.161 -0.000 -0.000
8 LB_ratio_wl 1.161 -0.000 -0.000
9 LB_log_ratio_wl 1.161 -0.000 -0.000
10 LB_loose 1.161 0.000 0.000
11 B365_log_ratio_wl 1.161 0.000 0.000
12 B365_ratio_wl 1.161 -0.000 -0.001
13 WH_ratio_wl 1.161 -0.000 -0.001
14 WH_win 1.161 0.000 0.002
15 IW_log_ratio_wl 1.161 -0.000 -0.002
16 WH_log_ratio_wl 1.161 -0.000 -0.003
17 BW_win 1.161 -0.000 -0.006
18 B365_win 1.161 -0.000 -0.004
19 WH_loose 1.161 -0.000 -0.006
20 VC_loose 1.161 -0.000 -0.003
21 IW_loose 1.161 -0.000 -0.003
22 BW_loose 1.161 -0.000 -0.003
23 BW_log_ratio_wl 1.161 -0.000 -0.003
24 LB_win 1.161 -0.000 -0.009
25 team_type__home 1.161 -0.000 -0.010

Random Forests

Details: Feature importances
Code
def fun_rf_team_betting():
    np.random.seed(250)
    rf = RandomForestRegressor(n_jobs=-1)
    subset = [team_target, "team_type", *team_vars_betting_odds]
    X, y = team_train[subset].make_dummies(exclude=team_target)
    return rf.fit(X, y)


file = "saved-output/rf_team_betting.pickle"
rf_team_betting = my.cached_results(file, fun_rf_team_betting)

rf_team_betting_importances = ml.get_rf_importances(rf_team_betting)

ml.plot_importances(rf_team_betting_importances, n=30);

Code
(rf_team_betting_importances.style.format(precision=4).bar())
  features importance
2 VC_win 0.0928
0 B365_win 0.0646
4 WH_win 0.0600
1 BW_win 0.0539
19 BW_log_ratio_wl 0.0484
13 BW_ratio_wl 0.0482
14 VC_ratio_wl 0.0421
20 VC_log_ratio_wl 0.0415
17 LB_ratio_wl 0.0408
23 LB_log_ratio_wl 0.0400
16 WH_ratio_wl 0.0398
22 WH_log_ratio_wl 0.0393
7 BW_loose 0.0378
12 B365_ratio_wl 0.0358
18 B365_log_ratio_wl 0.0357
5 LB_win 0.0351
15 IW_ratio_wl 0.0344
21 IW_log_ratio_wl 0.0339
8 VC_loose 0.0289
9 IW_loose 0.0286
11 LB_loose 0.0283
3 IW_win 0.0273
10 WH_loose 0.0259
6 B365_loose 0.0233
24 team_type__home 0.0135
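ml.get_rf_importances and ml.plot_importances are project helpers. For reference, a comparable importance table can be built with plain scikit-learn and pandas; a minimal sketch, assuming X holds the dummy-encoded predictors used to fit rf_team_betting:

import pandas as pd

rf_importances = (
    pd.DataFrame(
        {"features": X.columns, "importance": rf_team_betting.feature_importances_}
    )
    .sort_values("importance", ascending=False)
    .reset_index(drop=True)
)
rf_importances.head(10)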
Code
# Do SFS or take results from cache
def fun_sfs_res_team_betting_rf():
    np.random.seed(250)
    estimator = RandomForestRegressor()
    subset = [team_target, "team_type", *team_vars_betting_odds]
    X, y = team_train[subset].make_dummies(exclude=team_target)
    return ml.sfs(estimator, "regression").fit(X, y)


file = "saved-output/sfs_res_team_betting_rf.pickle"
sfs_res_team_betting_rf = my.cached_results(file, fun_sfs_res_team_betting_rf)
Code
ml.sfs_plot_results(
    sfs_res_team_betting_rf,
    "Predictors: Betting Odds (Random Forests)",
    team_train[team_target].std(),
);

Fig. 5.19. SFS results. Red dashed reference line indicates SD of target variable.
k = 1, avg. RMSE = 1.162 [Parsimonious]
(Smallest number of predictors at best ± 1 SE score)
Details: Numeric values of RMSE
Code
(
    ml.sfs_list_features(sfs_res_team_betting_rf)
    .style.format(precision=3)
    .highlight_min(subset=["RMSE"])
)
  added_feature RMSE RMSE_improvement RMSE_percentage_change
step        
1 B365_win 1.162 nan nan
2 team_type__home 1.165 -0.003 -0.245
3 WH_win 1.194 -0.029 -2.496
4 B365_ratio_wl 1.257 -0.063 -5.243
5 B365_loose 1.256 0.001 0.072
6 B365_log_ratio_wl 1.255 0.001 0.041
7 BW_log_ratio_wl 1.271 -0.016 -1.244
8 LB_ratio_wl 1.227 0.044 3.449
9 VC_log_ratio_wl 1.211 0.016 1.343
10 IW_log_ratio_wl 1.203 0.008 0.645
11 WH_loose 1.202 0.001 0.106
12 BW_ratio_wl 1.199 0.002 0.177
13 LB_log_ratio_wl 1.200 -0.001 -0.058
14 VC_loose 1.200 -0.000 -0.019
15 BW_loose 1.200 0.000 0.007
16 BW_win 1.200 0.001 0.056
17 LB_loose 1.199 0.001 0.066
18 WH_log_ratio_wl 1.199 -0.000 -0.041
19 LB_win 1.198 0.001 0.102
20 IW_ratio_wl 1.199 -0.001 -0.070
21 IW_win 1.199 0.000 0.014
22 VC_ratio_wl 1.199 -0.001 -0.055
23 WH_ratio_wl 1.199 0.001 0.059
24 IW_loose 1.200 -0.002 -0.142
25 VC_win 1.201 -0.001 -0.087

5.8.4 All Variables as Predictors

Linear Regression

Code
# Do SFS or take results from cache
def fun_sfs_res_team_all():
    np.random.seed(250)
    estimator = LinearRegression()
    X, y = team_train.make_dummies(exclude=team_target)
    return ml.sfs(estimator, "regression", 50).fit(X, y)


file = "saved-output/sfs_res_team_all.pickle"
sfs_res_team_all = my.cached_results(file, fun_sfs_res_team_all)
Code
ml.sfs_plot_results(
    sfs_res_team_all, "Predictors: All Variables (Linear Regression)"
);

Fig. 5.20. SFS results.
k = 50, avg. RMSE = 1.156 [Best]
(Number of predictors at best score)
Details: Numeric values of RMSE
Code
(
    ml.sfs_list_features(sfs_res_team_all)
    .head(20)
    .style.format(precision=3)
    .highlight_min(subset=["RMSE"])
)
  added_feature RMSE RMSE_improvement RMSE_percentage_change
step        
1 VC_log_ratio_wl 1.168 nan nan
2 B365_loose 1.162 0.006 0.524
3 dribbling__mean 1.161 0.001 0.076
4 standing_tackle__mean 1.159 0.002 0.135
5 weight_kg__mean 1.159 0.001 0.060
6 gk_positioning__std 1.158 0.000 0.026
7 short_passing__mean 1.158 0.000 0.028
8 BW_ratio_wl 1.158 0.000 0.023
9 acceleration__max 1.158 0.000 0.019
10 ball_control__min 1.157 0.000 0.016
11 gk_diving__max 1.157 0.000 0.013
12 BW_win 1.157 0.000 0.011
13 curve__std 1.157 0.000 0.009
14 standing_tackle__max 1.157 0.000 0.007
15 marking__mean 1.157 0.000 0.010
16 volleys__std 1.157 0.000 0.009
17 height__std 1.157 0.000 0.008
18 LB_win 1.157 0.000 0.005
19 LB_log_ratio_wl 1.157 0.000 0.011
20 vision__min 1.156 0.000 0.008

Random Forests

Details: Feature importances

The plot below contains all importances sorted (upper subplot) and the top 30 features (lower subplot).

Code
def fun_rf_team_all():
    np.random.seed(250)
    rf = RandomForestRegressor(n_jobs=-1)
    X, y = team_train.make_dummies(team_target)
    return rf.fit(X, y)


file = "saved-output/rf_team_all.pickle"
rf_team_all = my.cached_results(file, fun_rf_team_all)


rf_team_all_importances = ml.get_rf_importances(rf_team_all)

ml.plot_importances(rf_team_all_importances);

Code
(
    rf_team_all_importances.nlargest(20, "importance")
    .style.format(precision=4)
    .bar()
)
  features importance
166 VC_win 0.0639
164 B365_win 0.0417
168 WH_win 0.0327
165 BW_win 0.0161
161 player_age__mean 0.0107
160 player_age__min 0.0105
163 player_age__max 0.0094
82 reactions__std 0.0091
26 potential__std 0.0088
17 bmi__mean 0.0088
22 overall_rating__std 0.0087
162 player_age__std 0.0085
94 jumping__std 0.0084
10 height__std 0.0082
110 aggression__std 0.0081
34 finishing__std 0.0081
14 weight_kg__std 0.0079
58 free_kick_accuracy__std 0.0079
93 jumping__mean 0.0078
106 long_shots__std 0.0078
Code
# Do SFS or take results from cache
def fun_sfs_res_team_all_rf():
    np.random.seed(250)
    estimator = RandomForestRegressor()
    # fmt: off
    subset = [
       team_target,
       "VC_win", "B365_win", "WH_win", "BW_win", "player_age__mean",
       "player_age__min", "player_age__max", "reactions__std",
       "potential__std", "bmi__mean", 
       "dribbling__mean", "team_type"
    ]
    # fmt: on
    X, y = team_train[subset].make_dummies(exclude=team_target)
    return ml.sfs(estimator, "regression").fit(X, y)


file = "saved-output/sfs_res_team_all_rf.pickle"
sfs_res_team_all_rf = my.cached_results(file, fun_sfs_res_team_all_rf)
Code
ml.sfs_plot_results(
    sfs_res_team_all_rf,
    "Predictors: All Variables (Random Forest)",
    target_sd,
);

Fig. 5.21. SFS results. Red dashed reference line indicates SD of target variable.
k = 1, avg. RMSE = 1.162 [Parsimonious]
(Smallest number of predictors at best ± 1 SE score)
Details: Numeric values of RMSE
Code
(
    ml.sfs_list_features(sfs_res_team_all_rf)
    .style.format(precision=3)
    .highlight_min(subset=["RMSE"])
)
  added_feature RMSE RMSE_improvement RMSE_percentage_change
step        
1 B365_win 1.162 nan nan
2 team_type__home 1.165 -0.003 -0.264
3 WH_win 1.194 -0.029 -2.499
4 player_age__mean 1.257 -0.063 -5.246
5 bmi__mean 1.209 0.048 3.834
6 player_age__max 1.196 0.013 1.095
7 dribbling__mean 1.188 0.008 0.671
8 player_age__min 1.185 0.002 0.194
9 potential__std 1.184 0.001 0.124
10 VC_win 1.183 0.001 0.056
11 reactions__std 1.183 0.000 0.006
12 BW_win 1.183 0.000 0.005

5.8.5 PCA Features of All Variables as Predictors

A predictive model based on principal components instead of the original numeric variables was also tried. The PCA scree plot suggests that it is reasonable to use either 4 or 6 components, as an “elbow” can be seen at these points. Six components explain 56% of the variance; to explain 80% of the variance, 27 components are needed.

Code
_, team_train_num, _ = ml.get_columns_by_purpose(team_train, team_target)
_, _, pca_obj = ml.pca_screeplot(team_train_num, 60);

Fig. 5.22. PCA screeplot.
Code
pcs_6 = pca_obj.explained_variance_ratio_.cumsum()[5] * 100
print(f"First 6 PCs explain {pcs_6:.1f} % of variance.")
First 6 PCs explain 56.0 % of variance.
Code
n_pcs_80 = np.argwhere(pca_obj.explained_variance_ratio_.cumsum() >= 0.80).min()
print(f"Number of PCs needed to explain at least 80% of variance: {n_pcs_80}")
Number of PCs needed to explain at least 80% of variance: 27
Code
d_target, d_num, d_other, d_pca, team_scale, team_pca = ml.do_pca(
    team_train, team_target, n_components=50
)
team_train_with_pca = pd.concat([d_target, d_other, d_pca], axis=1)
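ml.do_pca is a project helper. A minimal sketch of its presumed core step, standardizing the numeric predictors and projecting them onto principal components with scikit-learn (the variable names below are illustrative, not the helper's actual internals):

import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

numeric = team_train.drop(columns=[team_target]).select_dtypes("number")

pca = PCA(n_components=50, random_state=250)
scores = pca.fit_transform(StandardScaler().fit_transform(numeric))

d_pca_sketch = pd.DataFrame(
    scores,
    columns=[f"pc_{i + 1}" for i in range(scores.shape[1])],
    index=numeric.index,
)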

Linear Regression

Include 6 PCs in SFS.

Code
# Do SFS or take results from cache
def fun_sfs_res_team_pca_2():
    np.random.seed(250)
    estimator = LinearRegression()
    # fmt: off
    subset = [
        team_target,
        "team_type",
        "pc_1", "pc_2", "pc_3", "pc_4", "pc_5", "pc_6", 
    ]
    # fmt: on
    X, y = team_train_with_pca[subset].make_dummies(exclude=team_target)
    return ml.sfs(estimator, "regression").fit(X, y)


file = "saved-output/sfs_res_team_pca_2.pickle"
sfs_res_team_pca_2 = my.cached_results(file, fun_sfs_res_team_pca_2)
Code
ml.sfs_plot_results(
    sfs_res_team_pca_2,
    "Predictors: PCs of All Variables (Linear Regression)",
)

Fig. 5.23. SFS results.
k = 4, avg. RMSE = 1.176 [Parsimonious]
(Smallest number of predictors at best ± 1 SE score)
Details: Numeric values of RMSE
Code
(
    ml.sfs_list_features(sfs_res_team_pca_2)
    .head(20)
    .style.format(precision=3)
    .highlight_min(subset=["RMSE"])
)
  added_feature RMSE RMSE_improvement RMSE_percentage_change
step        
1 pc_2 1.218 nan nan
2 pc_1 1.181 0.037 3.020
3 pc_6 1.178 0.004 0.324
4 team_type__home 1.176 0.001 0.124
5 pc_4 1.176 0.000 0.007
6 pc_3 1.175 0.001 0.077
7 pc_5 1.185 -0.010 -0.832

Include 27 PCs in SFS.

Code
# Do SFS or take results from cache
def fun_sfs_res_team_pca():
    np.random.seed(250)
    estimator = LinearRegression()
    # fmt: off
    subset = [
        team_target,
        "team_type",
        "pc_1", "pc_2", "pc_3", "pc_4", "pc_5",
        "pc_6", "pc_7", "pc_8", "pc_9", "pc_10",
        "pc_11", "pc_12", "pc_13", "pc_14", "pc_15",
        "pc_16", "pc_17", "pc_18", "pc_19", "pc_20",
        "pc_21", "pc_22", "pc_23", "pc_24", "pc_25",
        "pc_26", "pc_27",
    ]
    # fmt: on
    X, y = team_train_with_pca[subset].make_dummies(exclude=team_target)
    return ml.sfs(estimator, "regression").fit(X, y)


file = "saved-output/sfs_res_team_pca.pickle"
sfs_res_team_pca = my.cached_results(file, fun_sfs_res_team_pca)
Code
ml.sfs_plot_results(
    sfs_res_team_pca, "Predictors: PCs of All Variables (Linear Regression)"
)

Fig. 5.24. SFS results.
k = 5, avg. RMSE = 1.160 [Parsimonious]
(Smallest number of predictors at best ± 1 SE score)
Details: Numeric values of RMSE
Code
(
    ml.sfs_list_features(sfs_res_team_pca)
    .style.format(precision=3)
    .highlight_min(subset=["RMSE"])
)
  added_feature RMSE RMSE_improvement RMSE_percentage_change
step        
1 pc_2 1.218 nan nan
2 pc_1 1.181 0.037 3.020
3 pc_7 1.171 0.011 0.903
4 pc_8 1.164 0.006 0.553
5 pc_6 1.160 0.004 0.340
6 pc_4 1.160 0.001 0.054
7 pc_3 1.159 0.001 0.069
8 pc_14 1.159 0.000 0.012
9 pc_15 1.159 0.000 0.011
10 team_type__home 1.159 0.000 0.012
11 pc_20 1.158 0.000 0.005
12 pc_13 1.158 0.000 0.003
13 pc_16 1.158 0.000 0.003
14 pc_10 1.158 0.000 0.001
15 pc_22 1.158 0.000 0.000
16 pc_11 1.158 0.000 0.000
17 pc_26 1.158 0.000 0.002
18 pc_12 1.158 -0.000 -0.001
19 pc_27 1.158 -0.000 -0.001
20 pc_24 1.158 -0.000 -0.002
21 pc_21 1.158 -0.000 -0.002
22 pc_5 1.158 -0.000 -0.002
23 pc_17 1.158 0.000 0.006
24 pc_19 1.158 -0.000 -0.002
25 pc_25 1.158 -0.000 -0.006
26 pc_18 1.159 -0.000 -0.008
27 pc_23 1.159 -0.000 -0.007
28 pc_9 1.159 -0.000 -0.011

Random Forests

Details: Feature importances

Random forest feature importance of principal components and categorical variables.

Code
def fun_rf_team_pca():
    np.random.seed(250)
    rf = RandomForestRegressor(n_jobs=-1)
    # fmt: off
    subset = [
        team_target,
        "team_type",
        "pc_1", "pc_2", "pc_3", "pc_4", "pc_5",
        "pc_6", "pc_7", "pc_8", "pc_9", "pc_10",
        "pc_11", "pc_12", "pc_13", "pc_14", "pc_15",
        "pc_16", "pc_17", "pc_18", "pc_19", "pc_20",
        "pc_21", "pc_22", "pc_23", "pc_24", "pc_25",
        "pc_26", "pc_27",
    ]
    # fmt: on
    X, y = team_train_with_pca[subset].make_dummies(team_target)
    return rf.fit(X, y)


file = "saved-output/rf_team_pca.pickle"
rf_team_pca = my.cached_results(file, fun_rf_team_pca)

rf_team_pca_importances = ml.get_rf_importances(rf_team_pca)

ml.plot_importances(rf_team_pca_importances, n=10);

Code
rf_team_pca_importances.style.format(precision=4).bar()
  features importance
1 pc_2 0.1069
0 pc_1 0.1031
6 pc_7 0.0420
7 pc_8 0.0360
14 pc_15 0.0326
13 pc_14 0.0326
24 pc_25 0.0326
25 pc_26 0.0320
10 pc_11 0.0320
16 pc_17 0.0319
23 pc_24 0.0319
15 pc_16 0.0317
9 pc_10 0.0316
5 pc_6 0.0315
11 pc_12 0.0310
22 pc_23 0.0308
20 pc_21 0.0306
21 pc_22 0.0305
8 pc_9 0.0304
18 pc_19 0.0299
19 pc_20 0.0298
26 pc_27 0.0298
17 pc_18 0.0298
12 pc_13 0.0294
3 pc_4 0.0292
2 pc_3 0.0290
4 pc_5 0.0279
27 team_type__home 0.0036

Include 6 PCs in SFS.

Code
# Do SFS or take results from cache
def fun_sfs_res_team_pca_2_rf():
    np.random.seed(250)
    estimator = RandomForestRegressor()
    # fmt: off
    subset = [
        team_target,
        "team_type",
        "pc_1", "pc_2", "pc_3", "pc_4", "pc_5", "pc_6", 
    ]
    # fmt: on
    X, y = team_train_with_pca[subset].make_dummies(exclude=team_target)
    return ml.sfs(estimator, "regression").fit(X, y)


file = "saved-output/sfs_res_team_pca_2_rf.pickle"
sfs_res_team_pca_2_rf = my.cached_results(file, fun_sfs_res_team_pca_2_rf)
Code
ml.sfs_plot_results(
    sfs_res_team_pca_2_rf,
    "Predictors: Selected PCs (Random Forest)",
    team_train[team_target].std(),
);

Fig. 5.25. SFS results.
k = 6, avg. RMSE = 1.201 [Parsimonious]
(Smallest number of predictors at best ± 1 SE score)
Details: Numeric values of RMSE
Code
(
    ml.sfs_list_features(sfs_res_team_pca_2_rf)
    .style.format(precision=3)
    .highlight_min(subset=["RMSE"])
)
  added_feature RMSE RMSE_improvement RMSE_percentage_change
step        
1 team_type__home 1.245 nan nan
2 pc_1 1.457 -0.213 -17.073
3 pc_2 1.259 0.198 13.604
4 pc_3 1.214 0.045 3.562
5 pc_6 1.206 0.008 0.648
6 pc_5 1.201 0.006 0.458
7 pc_4 1.204 -0.003 -0.225

Include 27 PCs in SFS.

Code
# Do SFS or take results from cache
def fun_sfs_res_team_pca_rf():
    np.random.seed(250)
    estimator = RandomForestRegressor()
    # fmt: off
    subset = [
        team_target,
        "team_type",
        "pc_1", "pc_2", "pc_3", "pc_4", "pc_5",
        "pc_6", "pc_7", "pc_8", "pc_9", "pc_10",
        "pc_11", "pc_12", "pc_13", "pc_14", "pc_15",
        "pc_16", "pc_17", "pc_18", "pc_19", "pc_20",
        "pc_21", "pc_22", "pc_23", "pc_24", "pc_25",
        "pc_26", "pc_27",
    ]
    # fmt: on
    X, y = team_train_with_pca[subset].make_dummies(exclude=team_target)
    return ml.sfs(estimator, "regression").fit(X, y)


file = "saved-output/sfs_res_team_pca_rf.pickle"
sfs_res_team_pca_rf = my.cached_results(file, fun_sfs_res_team_pca_rf)
Code
ml.sfs_plot_results(
    sfs_res_team_pca_rf,
    "Predictors: Selected PCs (Random Forest)",
    team_train[team_target].std(),
);

Fig. 5.26. SFS results.
k = 14, avg. RMSE = 1.180 [Parsimonious]
(Smallest number of predictors at best ± 1 SE score)
Details: Numeric values of RMSE
Code
(
    ml.sfs_list_features(sfs_res_team_pca_rf)
    .style.format(precision=3)
    .highlight_min(subset=["RMSE"])
)
  added_feature RMSE RMSE_improvement RMSE_percentage_change
step        
1 team_type__home 1.245 nan nan
2 pc_1 1.459 -0.214 -17.194
3 pc_2 1.259 0.200 13.676
4 pc_3 1.216 0.044 3.474
5 pc_7 1.199 0.017 1.369
6 pc_9 1.192 0.007 0.591
7 pc_10 1.188 0.004 0.350
8 pc_11 1.186 0.002 0.163
9 pc_6 1.183 0.003 0.248
10 pc_20 1.183 0.000 0.006
11 pc_12 1.181 0.002 0.141
12 pc_8 1.181 0.000 0.029
13 pc_17 1.180 0.001 0.055
14 pc_24 1.180 -0.000 -0.007
15 pc_15 1.179 0.001 0.098
16 pc_16 1.179 0.000 0.002
17 pc_26 1.178 0.001 0.089
18 pc_14 1.179 -0.001 -0.071
19 pc_18 1.178 0.001 0.061
20 pc_19 1.179 -0.001 -0.101
21 pc_23 1.178 0.001 0.082
22 pc_27 1.179 -0.001 -0.057
23 pc_22 1.180 -0.001 -0.125
24 pc_21 1.180 0.000 0.034
25 pc_25 1.180 -0.000 -0.025
26 pc_5 1.181 -0.000 -0.017
27 pc_13 1.181 -0.000 -0.004
28 pc_4 1.184 -0.003 -0.268

5.8.6 Final Models

This subsection summarizes the results from the subsections above and evaluates the performance on the whole training and test sets.

Based on the training CV RMSE:

  • Comparing 3 groups of predictors (team-related features, player-related features and betting odds), betting odds show the best predictive abilities and team-related features show the worst ones (see Table 5.13).
  • Comparing predictions based on original variables and PCs of these variables, PCs did not improve the predictions.
  • For further investigation, 3 models were selected.
Table 5.13. Regression model selection results: selected models for each feature type and algorithm.
Features type Method Number of features selected Training CV RMSE Selected as final model Note
Team-related Linear regression k = 2 1.241 No⁴
  Random forest k = 2 1.242 No
Player-related Linear regression k = 10 1.203 No⁴ Included¹: all
Max. allowed²: 30
With k = 20, RMSE: 1.199
  Random forest k = 10 1.224 No Included¹: 10
Betting odds Linear regression k = 2 1.162 No
  Random forest k = 1 1.162 Yes⁵
All variables Linear regression k = 6 1.158 No Included¹: all
Max. allowed²: 50
  Random forest k = 1 1.162 Yes⁵ Included¹: 12
The same model as in “Betting odds | Random forest”
6 PCs of all³ variables Linear regression k = 4 1.176 No Included¹: 7
  Random forest k = 6 1.201 No Included¹: 7
27 PCs of all³ variables Linear regression k = 5 1.160 No Included¹: 28
  Random forest k = 13 1.180 No Included¹: 28
With k = 17, RMSE: 1.178

¹ – Number of features included in SFS selection.
² – Maximum allowed number of features to be selected.
³ – PCs of all numeric variables.
⁴ – Model was a candidate to become a final model but rejected due to low performance.
⁵ – In both cases the same model was selected.


Two of the 3 candidate final models were discarded due to low explained variance (R² < 0.15; see Table 5.14).

Code
np.random.seed(250)

# -----------------------------------------------------------------------

subset_1 = [team_target, "team_type", "buildUpPlayPassing"]
X_train_1, y_train_1 = team_train[subset_1].make_dummies(exclude=team_target)

model_team_team = LinearRegression()
model_team_team.fit(X_train_1, y_train_1)

y_pred_train_1 = model_team_team.predict(X_train_1)

# -----------------------------------------------------------------------

subset_2 = [
    team_target,
    "dribbling__mean",
    "team_type",
    "agility__mean",
    "height__max",
    "standing_tackle__mean",
    "short_passing__mean",
    "ball_control__min",
    "gk_reflexes__max",
    "overall_rating__max",
    "player_age__mean",
]
X_train_2, y_train_2 = team_train[subset_2].make_dummies(exclude=team_target)

model_team_player = LinearRegression()
model_team_player.fit(X_train_2, y_train_2)

y_pred_train_2 = model_team_player.predict(X_train_2)

# -----------------------------------------------------------------------

subset_3 = [team_target, "B365_win"]
X_train_3, y_train_3 = team_train[subset_3].make_dummies(exclude=team_target)
X_test_3, y_test_3 = team_test[subset_3].make_dummies(exclude=team_target)

model_team_all = RandomForestRegressor(n_jobs=-1)
model_team_all.fit(X_train_3, y_train_3)

y_pred_train_3 = model_team_all.predict(X_train_3)
y_pred_test_3 = model_team_all.predict(X_test_3)

# -----------------------------------------------------------------------

pd.concat(
    [
        ml.get_regression_performance(
            y_train_1, y_pred_train_1, "Train (team-related features)"
        ),
        ml.get_regression_performance(
            y_train_2, y_pred_train_2, "Train (player-related features)"
        ),
        ml.get_regression_performance(
            y_train_3, y_pred_train_3, "Train (all features/betting odds)"
        ),
        ml.get_regression_performance(
            y_test_3, y_pred_test_3, "Test (all features/betting odds)"
        ),
    ]
).index_start_at(1).style.format(precision=2)
Table 5.14. Final evaluation of selected models for team goal prediction.
  set n SD RMSE R² RMSE_SD_ratio SD_RMSE_ratio
1 Train (team-related features) 27200 1.26 1.24 0.03 0.98 1.02
2 Train (player-related features) 27200 1.26 1.20 0.09 0.95 1.05
3 Train (all features/betting odds) 27200 1.26 1.16 0.16 0.92 1.09
4 Test (all features/betting odds) 5455 1.26 1.16 0.15 0.92 1.08
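For reference, the metrics above follow standard definitions. A minimal sketch of how the test-set row could be reproduced with scikit-learn, reusing y_test_3 and y_pred_test_3 from the code block above (assuming ml.get_regression_performance reports these quantities):

import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

rmse = mean_squared_error(y_test_3, y_pred_test_3, squared=False)
r2 = r2_score(y_test_3, y_pred_test_3)
sd = np.std(y_test_3, ddof=1)

print(f"RMSE = {rmse:.2f}, R² = {r2:.2f}, RMSE/SD = {rmse / sd:.2f}, SD/RMSE = {sd / rmse:.2f}")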

5.9 Match Outcome Prediction

  • Can we predict which team will win the match?
  • In this section, the outcome of the match (home wins, draw, away wins) is modeled.
  • The initial idea was to select 4 models, one from each feature type group, as these features become available at different times before the match. However, the model based on team-related features showed low performance, and the model based on all types of variables was rejected due to possible overfitting, in preference to a less complex model with 1 betting-odds variable: both models share the same most important feature, and the additional features only slightly improved performance on the training set. So only 2 final models were selected.
  • The test performance of the final models:
    • for the model based on player attributes, accuracy is 50% and balanced accuracy is 42%;
    • for the model based on betting odds, accuracy is 52% and balanced accuracy is 45%.
  • These models can be used in different situations, depending on which types of variables are available.
  • Unfortunately, both models are unable to predict the outcome “draw” correctly. This might be related to the findings in the section “Relationship Between Betting Odds” that the betting odds for a “draw” are correlated with those for “away wins”.
  • Conclusion: despite the large amount of randomness in the game, data-based decisions can improve predictions of the football match outcome. Still, these predictions are far from perfect.

The results of classification model selection are in Table 5.15. The performance of the selected models is presented in Table 5.16 and in the output below this table.

The details are in the subsections below.

5.9.3 Betting Odds as Predictors

Logistic Regression

Code
# Do SFS or take results from cache
def fun_sfs_res_match_betting():
    np.random.seed(250)
    estimator = LogisticRegression(
        solver="newton-cg", multi_class="multinomial"
    )
    subset = [match_target, *match_vars_betting_odds]
    X, y = match_train[subset].make_dummies(exclude=match_target)
    return ml.sfs(estimator, "classification").fit(X, y)


file = "saved-output/sfs_res_match_betting.pickle"
sfs_res_match_betting = my.cached_results(file, fun_sfs_res_match_betting)
Code
ml.sfs_plot_results(
    sfs_res_match_betting, "Predictors: Betting Odds (Logistic Regression)"
);

Fig. 5.31. SFS results.
k = 1, avg. BAcc = 0.454 [Parsimonious]
(Smallest number of predictors at best ± 1 SE score)
Details: Numeric values of BAcc
Code
(
    ml.sfs_list_features(sfs_res_match_betting)
    .style.format(precision=3)
    .highlight_max(subset=["BAcc"])
)
  added_feature BAcc BAcc_improvement BAcc_percentage_change
step        
1 BW_away_wins 0.454 nan nan
2 VC_away_wins 0.454 -0.000 -0.022
3 LB_away_wins 0.455 0.000 0.086
4 WH_away_wins 0.455 0.000 0.073
5 B365_away_wins 0.455 -0.000 -0.008
6 IW_away_wins 0.455 -0.000 -0.060
7 VC_log_ratio_ha 0.446 -0.009 -1.893
8 BW_log_ratio_ha 0.447 0.001 0.254
9 IW_draw 0.447 0.000 0.077
10 WH_log_ratio_ha 0.448 0.000 0.046
11 B365_log_ratio_ha 0.448 -0.000 -0.000
12 VC_draw 0.447 -0.000 -0.038
13 B365_draw 0.448 0.000 0.018
14 IW_log_ratio_ha 0.448 0.000 0.003
15 LB_log_ratio_ha 0.447 -0.000 -0.018
16 WH_draw 0.447 -0.000 -0.108
17 BW_draw 0.447 -0.000 -0.074
18 WH_ratio_ha 0.447 0.000 0.009
19 LB_draw 0.447 0.000 0.069
20 VC_ratio_ha 0.447 -0.000 -0.038
21 IW_home_wins 0.447 0.000 0.026
22 WH_home_wins 0.447 -0.000 -0.069
23 VC_home_wins 0.446 -0.000 -0.043
24 BW_ratio_ha 0.446 -0.000 -0.044
25 LB_ratio_ha 0.446 0.000 0.000
26 B365_ratio_ha 0.446 -0.000 -0.003
27 B365_home_wins 0.446 -0.000 -0.010
28 IW_ratio_ha 0.446 -0.000 -0.111
29 BW_home_wins 0.445 -0.001 -0.144
30 LB_home_wins 0.445 0.000 0.023

Random Forests

Details: Feature importances
Code
def fun_rf_match_betting():
    np.random.seed(250)
    rf = RandomForestClassifier(n_jobs=-1)
    subset = [match_target, *match_vars_betting_odds]
    X, y = match_train[subset].make_dummies(exclude=match_target)
    return rf.fit(X, y)


file = "saved-output/rf_match_betting.pickle"
rf_match_betting = my.cached_results(file, fun_rf_match_betting)

rf_match_betting_importances = ml.get_rf_importances(rf_match_betting)

ml.plot_importances(rf_match_betting_importances, n=30);

Code
# Do SFS or take results from cache
def fun_sfs_res_match_betting_rf():
    np.random.seed(250)
    estimator = RandomForestClassifier()
    subset = [match_target, *match_vars_betting_odds]
    X, y = match_train[subset].make_dummies(exclude=match_target)
    return ml.sfs(estimator, "classification", 10).fit(X, y)


file = "saved-output/sfs_res_match_betting_rf.pickle"
sfs_res_match_betting_rf = my.cached_results(file, fun_sfs_res_match_betting_rf)
Code
ml.sfs_plot_results(
    sfs_res_match_betting_rf, "Predictors: Betting Odds (Random Forests)"
);

Fig. 5.32. SFS results.
k = 10, avg. BAcc = 0.445 [Best]
(Number of predictors at best score)
Details: Numeric values of BAcc
Code
(
    ml.sfs_list_features(sfs_res_match_betting_rf)
    .head(20)
    .style.format(precision=3)
    .highlight_max(subset=["BAcc"])
)
  added_feature BAcc BAcc_improvement BAcc_percentage_change
step        
1 BW_home_wins 0.445 nan nan
2 B365_home_wins 0.439 -0.006 -1.251
3 WH_away_wins 0.428 -0.012 -2.671
4 WH_draw 0.426 -0.001 -0.317
5 IW_home_wins 0.434 0.008 1.782
6 LB_ratio_ha 0.442 0.008 1.936
7 WH_home_wins 0.443 0.001 0.130
8 VC_home_wins 0.444 0.002 0.349
9 B365_log_ratio_ha 0.442 -0.003 -0.606
10 BW_log_ratio_ha 0.445 0.003 0.751

5.9.4 All Variables as Predictors

Logistic Regression

Code
# Do SFS or take results from cache
def fun_sfs_res_match_all():
    np.random.seed(250)
    estimator = LogisticRegression(
        solver="newton-cg", multi_class="multinomial"
    )
    X, y = match_train.make_dummies(exclude=match_target)
    return ml.sfs(estimator, "classification", 10).fit(X, y)


file = "saved-output/sfs_res_match_all.pickle"
sfs_res_match_all = my.cached_results(file, fun_sfs_res_match_all)
Code
ml.sfs_plot_results(
    sfs_res_match_all, "Predictors: All Features (Logistic Regression)"
);

Fig. 5.33. SFS results.
k = 9, avg. BAcc = 0.459 [Best]
(Number of predictors at best score)
Details: Numeric values of BAcc
Code
(
    ml.sfs_list_features(sfs_res_match_all)
    .style.format(precision=3)
    .highlight_max(subset=["BAcc"])
)
  added_feature BAcc BAcc_improvement BAcc_percentage_change
step        
1 BW_away_wins 0.454 nan nan
2 chanceCreationCrossing_home 0.456 0.002 0.399
3 jumping__mean_home 0.457 0.001 0.272
4 defenceTeamWidth_home 0.458 0.001 0.115
5 gk_kicking__mean_away 0.458 0.000 0.065
6 gk_kicking__min_home 0.459 0.001 0.120
7 balance__max_home 0.458 -0.000 -0.048
8 buildUpPlaySpeed_home 0.459 0.000 0.080
9 finishing__std_home 0.459 0.000 0.040
10 agility__std_home 0.459 -0.000 -0.045

Random Forests

Details: Feature importances
Code
def fun_rf_match_all():
    np.random.seed(250)
    rf = RandomForestClassifier(n_jobs=-1)
    X, y = match_train.make_dummies(exclude=match_target)
    return rf.fit(X, y)


file = "saved-output/rf_match_all.pickle"
rf_match_all = my.cached_results(file, fun_rf_match_all)

rf_match_all_importances = ml.get_rf_importances(rf_match_all)

ml.plot_importances(rf_match_all_importances, n=50);

Code
(
    rf_match_all_importances.nlargest(25, "importance")
    .style.format(precision=4)
    .bar()
)
  features importance
25 BW_log_ratio_ha 0.0096
19 BW_ratio_ha 0.0081
9 LB_home_wins 0.0071
18 B365_ratio_ha 0.0070
17 VC_away_wins 0.0069
29 LB_log_ratio_ha 0.0068
26 VC_log_ratio_ha 0.0067
22 WH_ratio_ha 0.0064
24 B365_log_ratio_ha 0.0062
20 VC_ratio_ha 0.0060
2 B365_away_wins 0.0057
15 VC_home_wins 0.0056
28 WH_log_ratio_ha 0.0056
8 IW_away_wins 0.0055
5 BW_away_wins 0.0050
12 WH_home_wins 0.0047
11 LB_away_wins 0.0047
23 LB_ratio_ha 0.0046
3 BW_home_wins 0.0045
0 B365_home_wins 0.0041
21 IW_ratio_ha 0.0040
355 player_age__std_home 0.0040
64 bmi__mean_away 0.0039
66 bmi__std_away 0.0038
354 player_age__std_away 0.0038
Code
# Do SFS or take results from cache
def fun_sfs_res_match_all_rf():
    np.random.seed(250)
    estimator = RandomForestClassifier()
    # fmt: off
    subset = [
        match_target, 'BW_log_ratio_ha', 'BW_ratio_ha', 'LB_home_wins', 
        'B365_ratio_ha', 'VC_away_wins', 'LB_log_ratio_ha', 'VC_log_ratio_ha',
        'WH_ratio_ha', 'B365_log_ratio_ha', 'VC_ratio_ha',
        'B365_away_wins', 'VC_home_wins', 'WH_log_ratio_ha',
        'IW_away_wins', 'BW_away_wins', 'WH_home_wins', 'LB_away_wins',
        'LB_ratio_ha', 'BW_home_wins', 'B365_home_wins', 'IW_ratio_ha',
        'player_age__std_home', 'bmi__mean_away', 'bmi__std_away',
        'player_age__std_away', 'player_age__min_away',
        'agility__std_home', 'free_kick_accuracy__std_home',
        'potential__std_away', 'overall_rating__std_away',
        'overall_rating__std_home', 'agility__std_away',
        'acceleration__std_home', 'player_age__min_home', 'bmi__mean_home',
        'weight_kg__std_home', 'IW_home_wins', 'dribbling__mean_away',
        'long_shots__std_away', 'reactions__std_away',
        'long_shots__std_home', 'player_age__mean_away', 'bmi__std_home',
        'potential__std_home', 'heading_accuracy__std_away',
        'player_age__mean_home', 'strength__std_away',
        'weight_kg__std_away', 'shot_power__std_home',
        'long_passing__std_away'
    ]
    # fmt: on
    X, y = match_train[subset].make_dummies(exclude=match_target)
    return ml.sfs(estimator, "classification", 10).fit(X, y)


file = "saved-output/sfs_res_match_all_rf.pickle"
sfs_res_match_all_rf = my.cached_results(file, fun_sfs_res_match_all_rf)
Code
ml.sfs_plot_results(
    sfs_res_match_all_rf, "All Features as Predictors (Random Forests)"
);

Fig. 5.34. SFS results.
k = 10, avg. BAcc = 0.452 [Best]
(Number of predictors at best score)
Details: Numeric values of BAcc
Code
(
    ml.sfs_list_features(sfs_res_match_all_rf)
    .head(20)
    .style.format(precision=3)
    .highlight_max(subset=["BAcc"])
)
  added_feature BAcc BAcc_improvement BAcc_percentage_change
step        
1 BW_home_wins 0.445 nan nan
2 B365_home_wins 0.439 -0.006 -1.245
3 long_shots__std_home 0.425 -0.014 -3.252
4 player_age__std_away 0.438 0.013 3.029
5 free_kick_accuracy__std_home 0.443 0.005 1.210
6 VC_home_wins 0.446 0.003 0.587
7 BW_ratio_ha 0.451 0.005 1.028
8 player_age__min_home 0.451 0.001 0.129
9 weight_kg__std_home 0.451 -0.001 -0.132
10 LB_away_wins 0.452 0.002 0.358

5.9.5 Final Models

This subsection summarizes the results from the subsections above and evaluates the performance on the whole training and test sets.

Table 5.15. Classification model selection results: selected models for each feature type and algorithm.
Features type Method Number of features selected Training CV BAcc Selected as final model Notes
Team-related Logistic regression k = 9 0.347 No
  Random forest k = 3 0.350 No³
Player-related Logistic regression k = 9 0.451 Yes Included¹: all
Max. allowed²: 30
With k = 28, BAcc: 0.454.
  Random forest k = 7 0.430 No Included¹: 21
Max. allowed²: 10
With k = 10, BAcc: 0.431.
Betting odds Logistic regression k = 1 0.454 Yes
  Random forest k = 1 0.445 No Included¹: all
Max. allowed²: 10
All Variables Logistic regression k = 4 0.458 No⁴ Included¹: all
Max. allowed²: 10
  Random forest k = 7 0.451 No Included¹: 50
Max. allowed²: 10

¹ – Number of features included in SFS selection.
² – Maximum allowed number of features to be selected.
³ – This model was a candidate to become the final model but it was rejected due to low performance.
⁴ – This model was a candidate to become a final model, but it shares the most important variable with the betting-odds-based model, and its 3 additional variables increased performance only slightly. So the model was rejected due to possible overfitting, in preference to the less complex model with 1 variable.

Code
np.random.seed(250)

# -----------------------------------------------------------------------

subset_1 = [
    match_target,
    "dribbling__mean_away",
    "overall_rating__mean_home",
    "overall_rating__mean_away",
    "stamina__max_away",
    "gk_positioning__std_home",
    "long_shots__max_away",
    "weight_kg__std_home",
    "jumping__min_away",
    "strength__std_away",
]
X_train_1, y_train_1 = match_train[subset_1].make_dummies(exclude=match_target)
X_test_1, y_test_1 = match_test[subset_1].make_dummies(exclude=match_target)

model_match_player = LogisticRegression(
    solver="newton-cg", multi_class="multinomial"
)
model_match_player.fit(X_train_1, y_train_1)

y_pred_train_1 = model_match_player.predict(X_train_1)
y_pred_test_1 = model_match_player.predict(X_test_1)

# -----------------------------------------------------------------------

subset_2 = [match_target, "BW_away_wins"]
X_train_2, y_train_2 = match_train[subset_2].make_dummies(exclude=match_target)
X_test_2, y_test_2 = match_test[subset_2].make_dummies(exclude=match_target)

model_match_betting = LogisticRegression(
    solver="newton-cg", multi_class="multinomial"
)
model_match_betting.fit(X_train_2, y_train_2)

y_pred_train_2 = model_match_betting.predict(X_train_2)
y_pred_test_2 = model_match_betting.predict(X_test_2)

# -----------------------------------------------------------------------

pd.concat(
    [
        ml.get_classification_performance(
            y_train_1, y_pred_train_1, "Train (player-related variables)"
        ),
        ml.get_classification_performance(
            y_test_1, y_pred_test_1, "Test (player-related variables)"
        ),
        ml.get_classification_performance(
            y_train_2, y_pred_train_2, "Train (betting odds based prediction)"
        ),
        ml.get_classification_performance(
            y_test_2, y_pred_test_2, "Test (betting odds based prediction)"
        ),
    ]
).index_start_at(1)
Table 5.16. Final evaluation of selected models for match outcome prediction.
set n Accuracy BAcc BAcc_01 f1_macro f1_weighted Kappa
1 Train (player-related variables) 12634 0.53 0.45 0.17 0.39 0.45 0.21
2 Test (player-related variables) 2575 0.50 0.42 0.14 0.37 0.42 0.17
3 Train (betting odds based prediction) 12634 0.53 0.45 0.18 0.39 0.45 0.22
4 Test (betting odds based prediction) 2575 0.52 0.45 0.18 0.39 0.45 0.21
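The BAcc_01 column rescales balanced accuracy so that chance-level performance for the 3 classes maps to 0 and perfect prediction to 1. This appears to correspond to scikit-learn's adjusted balanced accuracy, which is an assumption, though it is consistent with the values above, e.g. (0.45 - 1/3) / (1 - 1/3) ≈ 0.18. A minimal sketch, reusing y_test_2 and y_pred_test_2 from the code block above:

from sklearn.metrics import balanced_accuracy_score

bacc = balanced_accuracy_score(y_test_2, y_pred_test_2)
bacc_01 = balanced_accuracy_score(y_test_2, y_pred_test_2, adjusted=True)

print(f"BAcc = {bacc:.2f}, BAcc_01 = {bacc_01:.2f}")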
Code
print("Classification Report\nTest set (player-related variables)\n")
ml.print_classification_report(y_test_1, y_pred_test_1, "test")
Classification Report
Test set (player-related variables)

    set     n  Accuracy  BAcc  BAcc_01  f1_macro  f1_weighted  Kappa
0  test  2575      0.50  0.42     0.14      0.37         0.42   0.17

              precision    recall  f1-score   support

   Away Wins       0.49      0.45      0.47       790
        Draw       1.00      0.00      0.00       641
   Home Wins       0.51      0.83      0.63      1144

    accuracy                           0.50      2575
   macro avg       0.67      0.42      0.37      2575
weighted avg       0.63      0.50      0.42      2575


Confusion matrix (rows - true, columns - predicted):
[[352   0 438]
 [168   1 472]
 [199   0 945]]
Code
print("Classification Report\nTest set (betting odds based prediction)\n")
ml.print_classification_report(y_test_2, y_pred_test_2, "test")
Classification Report
Test set (betting odds based prediction)

    set     n  Accuracy  BAcc  BAcc_01  f1_macro  f1_weighted  Kappa
0  test  2575      0.52  0.45     0.18      0.39         0.45   0.21

              precision    recall  f1-score   support

   Away Wins       0.48      0.56      0.52       790
        Draw       0.00      0.00      0.00       641
   Home Wins       0.54      0.79      0.64      1144

    accuracy                           0.52      2575
   macro avg       0.34      0.45      0.39      2575
weighted avg       0.39      0.52      0.45      2575


Confusion matrix (rows - true, columns - predicted):
[[446   0 344]
 [231   0 410]
 [244   0 900]]

6 Summary

In this project, the European Football database, which includes data from seasons 2008/2009 to 2015/2016, was analyzed. Nine main questions in the “Analysis” section were answered, and at the beginning of each main subsection the most important findings were summarized. The game involves a lot of randomness, but in some situations a data-based approach can provide additional valuable information about European Football.

6.1 Things to Improve

  • Some pre-processing steps were performed, but the data from those steps was not included in the final analysis. These pre-processing steps could be removed from the analysis.
  • Some pre-processing steps should be explained in more detail in a written form.
  • I preferred .eval() over .assign() where possible and used .assign() elsewhere. Some users may find this an inconsistent coding style.
  • Some tables have names that are technical (n_goals) rather than natural for humans (e.g., “Number of goals”).
  • Variable names in the last part (e.g., y_train_1, y_pred_train_1) could have been more human-friendly.
  • Parameter tuning may improve RF performance (see the sketch after this list).
  • Other types of machine learning algorithms (e.g., SVM, XGBoost) may capture the trends better and lead to better performance. This should be tested.
  • Some parts of this database (e.g., tables with player data) could be investigated in more detail to get even more insights.
  • Some plots (e.g., heatmaps or cluster maps) are very large so that no variable names are lost. But these plots may not fit on the screen; to fit them, the user should make the browser window narrower while keeping its height. On the other hand, some HTML output (the profiling report) can be studied effectively only on wide screens.
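A sketch of the parameter-tuning idea mentioned above: a randomized search over a few RandomForestRegressor hyperparameters. The grid values are illustrative only, and X_train_3/y_train_3 refer to the betting-odds predictors from section 5.8.6:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    "n_estimators": [100, 300, 500],
    "max_depth": [None, 5, 10, 20],
    "min_samples_leaf": [1, 5, 20],
    "max_features": ["sqrt", 0.3, 0.5],
}

search = RandomizedSearchCV(
    RandomForestRegressor(n_jobs=-1, random_state=250),
    param_distributions=param_distributions,
    n_iter=20,
    scoring="neg_root_mean_squared_error",
    cv=5,
    random_state=250,
)
search.fit(X_train_3, y_train_3)
print(search.best_params_, "CV RMSE:", -search.best_score_)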