Data analysis tools: Python, SQL, Looker Studio Helper tools: VS Code, Quarto, Git Skills:
data pre-processing
exploratory data analysis (EDA):
descriptive statistics
data visualization
inferential statistics:
hypothesis testing
confidence intervals
predictive modeling:
classification task
regression task
object-oriented programming (OOP)
statistical programming
literate programming
dashboarding
Abbreviations
Acc – accuracy.
BAcc – balanced accuracy.
BAcc_01 – balanced accuracy where 0 is the worst and 1 is the best result.
CI – 95% confidence interval.
CLD – compact letter display.
CV – cross-validation.
EDA – exploratory data analysis.
FIFA – International Federation of Association Football.
k – number of variables/features.
ML – machine learning.
n – either sample or group size.
NA, NAs – missing value(s).
p – p-value.
p_adj – p-value (adjusted).
PC, PCs – principal component(s).
PCA – principal component analysis.
r – Pearson’s correlation coefficient.
R² – coefficient of determination, r squared.
RMSE – root mean squared error.
RNG – (pseudo)random number generator.
SD – standard deviation.
SE – standard error.
SFS – sequential feature selection.
UK – United Kingdom.
1 Introduction
European Football (also known as Soccer) is one of the most popular games in Europe. Football is a big market with revenues of €27.6 billion in 2020/21 (source). The money is earned by, e.g., selling tickets to matches and rights to broadcast games, participating in betting, and advertising.
In this project, European Football data from seasons 2008/2009 to 2015/2016 was analyzed to get a better data-based understanding of this game. In each subsection of the “Analysis” section of this project, nine main questions are analyzed and insights are provided. At the beginning of each main subsection, the most important findings are presented and further parts of that subsection provide the details (plots, tables, etc.) on those findings.
Tip
Pay attention that some codes, analyses, results, or other details are hidden in collapsible sections (that are collapsed by default). These are:
either parts that have a lot of results that can clutter the report,
or less important or supplementary parts, e.g., to prove some claims in the text.
1.1 Setup
Code: The main Python setup
# Automatically reload certain modules%reload_ext autoreload%autoreload 1# Plotting%matplotlib inline# Packages and modules -------------------------------import osimport reimport warnings# Working with SQL databaseimport sqlite3# EDAimport ydata_profiling as edafrom skimpy import skimimport missingno as msno# Data wrangling, mathsimport numpy as npimport pandas as pdimport janitor # imports additional Pandas methods# Statistical analysisimport scipy.stats as sps# Machine learningfrom sklearn.linear_model import LinearRegression, LogisticRegressionfrom sklearn.ensemble import RandomForestRegressor, RandomForestClassifier# Visualizationimport matplotlib.pyplot as pltimport seaborn as snsfrom matplotlib.colors import LinearSegmentedColormap# Mapsimport geopandas as gpdfrom shapely.geometry import Polygon# Enable ability to run R in Pythonos.environ["R_HOME"] ="C:/PROGRA~1/R/R-4.2.3"with warnings.catch_warnings(): warnings.simplefilter("ignore")import rpy2%load_ext rpy2.ipython# Custom functionsimport functions.fun_utils as myimport functions.pandas_methodsimport functions.fun_analysis as animport functions.fun_ml as ml%aimport functions.fun_utils%aimport functions.pandas_methods%aimport functions.fun_analysis%aimport functions.fun_ml# Settings --------------------------------------------# Default plot optionsplt.rc("figure", titleweight="bold")plt.rc("axes", labelweight="bold", titleweight="bold")plt.rc("font", weight="normal", size=10)plt.rc("figure", figsize=(7, 3))# Pandas optionspd.set_option("display.max_rows", 1000)pd.set_option("display.max_columns", 300)pd.set_option("display.max_colwidth", 50) # Possible option: Nonepd.set_option("display.float_format", lambda x: f"{x:.2f}")pd.set_option("styler.format.thousands", ",")# colorsgreen, blue, orange, red ="tab:green", "tab:blue", "tab:orange", "tab:red"# Analysis parametersdo_eda =True
Code
"""Various functions for data pre-processing, analysis and plotting."""# OS moduleimport os# Enable ability to run R code in Pythonos.environ["R_HOME"] ="C:/PROGRA~1/R/R-4.2.3"import rpy2.robjects as r_objfrom rpy2.robjects.conversion import localconverter# Other Python libraries and modulesimport pathlibimport pickleimport pandas as pdimport matplotlib.pyplot as pltimport statsmodels.stats.api as smsimport scipy.stats as spsfrom scipy.stats import median_abs_deviationfrom typing import Unionfrom IPython.display import display, HTMLfrom matplotlib.ticker import MaxNLocator# Utilities ==================================================================# For Pandas objectsdef index_has_names(obj):"""Check if index of an object has names. Args: obj: Object that has `.index` attribute Returns: bool: True if index has names, False otherwise. """returnNonenotinlist(obj.index.names)# Display in Jupyter notebookdef display_collapsible(x, summary: str="", sep=" ", is_open: bool=False):"""Display data frame or other object surrounded by `<details>` tags (I.e., display in collapsible way) Args: x (pd.DataDrame, str, list): Object to display summary (str, optional): Collapsed section name. Defaults to "". sep (str, optional): Symbol used to join strings (when x is a list). Defaults to " ". is_open (bool, optional): Should the section be open by default Defaults to False. """if is_open: is_open =" open"else: is_open =""ifhasattr(x, "to_html") andcallable(x.to_html): html_str = x.to_html()eliftype(x) ==str: html_str = xelse: html_str = sep.join([str(i) for i in x]) display( HTML(f"<details{is_open}><summary>{summary}</summary>"+ html_str+"</details>" ) )def cached_results(file, fun, **kwargs):"""If file does not exist, take results from file, otherwise calculate them, save to file and return the calculated result. Args: file (str): File name. fun (function): function. **kwargs: arguments passed to `fun`. Returns: The result of `fun()` """if pathlib.Path(file).is_file():withopen(file, "rb") as f: results = pickle.load(f)else: results = fun(**kwargs)withopen(file, "wb") as f: pickle.dump(results, f)return results# Helper functions to work with R in Python -----------------------------------def r_to_python(obj: str):"""Import object from R environment to Python Import object from R environment created in ipynb cells via `rpy2` package. Args: obj (str): Object name in R global environment. Returns: Analogous Python object (NOTE: tested with data frames only). """return r_obj.pandas2ri.rpy2py(r_obj.globalenv[obj])# Format values ------------------------------------------------------------def format_p(p):"""Format p values at 3 decimal places. Args: p (float): p value (number between 0 and 1). """if p <0.001:return"p < 0.001"elif p >0.999:return"p > 0.999"else:returnf"p = {p:.3f}"def format_percent(x: float):"""Round percentages to 1 decimal place and format as strings Values between 0 and 0.05 are printed as <0.1% Values between 99.95 and 100 are printed as >100% Args: x (float): A sequence of percentage values ranging from 0 to 100. Returns: pd.Series[str]: Pandas series of formatted values. Values equal to 0 are formatted as "0%", values between 0 and 0.05 are formatted as "<0.1%", values between 99.95 and 100 are formatted as ">99.9%", and values equal to 100 are formatted as "100%". Author: Vilmantas Gėgžna """return pd.Series( ["0%"if i ==0else"<0.1%"if i <0.05else">99.9%"if99.95<= i <100elsef"{i:.1f}%"for i in x ], index=x.index, )# Analysis =================================================================# Exploratory analysisdef count_unique(data: pd.DataFrame):"""Get number and percentage of unique values Args: data (pd.DataFrame): Data frame to analyze. Return: data frame with columns `n_unique` (int) and `percent_unique` (str) """ n_unique = data.nunique()return pd.concat( [ n_unique.rename("n_unique"), format_percent((n_unique / data.shape[0]).multiply(100)).rename("percent_unique" ), ], axis=1, )# Descriptive statistics ----------------------------------------------------def calc_summaries(x, ndigits=None):"""Calculate some common summary statistics. Args: x (pandas.Series): Numeric variable to summarize. ndigits (int, None, optional): Number of decimal digits to round to. Defaults to None. Return: pandas.DataFrame with summary statistics. """def mad(x):return median_abs_deviation(x)defrange(x):return x.max() - x.min() res = x.agg( ["count", "min", "max", range, "mean", "median", "std", mad, "skew"] )if ndigits isnotNone: summary = pd.DataFrame(round(res, ndigits=ndigits)).Telse: summary = pd.DataFrame(res).T# Present count data as integer: summary = summary.assign(count=lambda d: d["count"].astype(int))return summary# Plot counts ---------------------------------------------------------------def plot_counts_with_labels( counts, title="", x=None, y="n", x_lab=None, y_lab="Count", label="percent", label_rotation=0, title_fontsize=13, legend=False, ec="black", y_lim_max=None, ax=None,**kwargs,):"""Plot count data as bar plots with labels. Args: counts (pandas.DataFrame): Data frame with counts data. title (str, optional): Figure title. Defaults to "". x (str, optional): Column name from `counts` to plot on x axis. Defaults to None: first column. y (str, optional): Column name from `counts` to plot on y axis. Defaults to "n". x_lab (str, optional): X axis label. Defaults to value of `x` with capitalized first letter. y_lab (str, optional): Y axis label. Defaults to "Count". label (str, None, optional): Column name from `counts` for value labels. Defaults to "percent". If None, label is not added. label_rotation (int, optional): Angle of label rotation. Defaults to 0. legend (bool, optional): Should legend be shown?. Defaults to False. ec (str, optional): Edge color. Defaults to "black". y_lim_max (float, optional): Upper limit for Y axis. Defaults to None: do not change. ax (matplotlib.axes.Axes, optional): Axes object. Defaults to None. **kwargs: further arguments to pandas.DataFrame.plot.bar() Returns: matplotlib.axes.Axes: Axes object of the generate plot. Author: Vilmantas Gėgžna """if x isNone: x = counts.columns[0]if x_lab isNone: x_lab = x.capitalize()if y_lim_max isNone: y_lim_max = counts[y].max() *1.15 ax = counts.plot.bar(x=x, y=y, legend=legend, ax=ax, ec=ec, **kwargs) ax.set_title(title, fontsize=title_fontsize) ax.set_xlabel(x_lab) ax.set_ylabel(y_lab)if label isnotNone: ax_add_value_labels_ab( ax, labels=counts[label], rotation=label_rotation ) ax.set_ylim(0, y_lim_max)return axdef ax_xaxis_integer_ticks(min_n_ticks: int, rot: int=0):"""Ensure that x axis ticks has integer values Args: min_n_ticks (int): Minimal number of ticks to use. rot (int, optional): Rotation angle of x axis tick labels. Defaults to 0. """ ax = plt.gca() ax.xaxis.set_major_locator( MaxNLocator(min_n_ticks=min_n_ticks, integer=True) ) plt.xticks(rotation=rot)def ax_axis_comma_format(axis: str="xy", ax=None):"""Write values of X axis ticks with comma as thousands separator Args: axis (str, optional): which axis should be formatted: "x" X axis, "y" Y axis or "xy" (default) both axes. ax (axis object, None, optional):Axis of plot. Defaults to None: current axis. """if ax isNone: ax = plt.gca() fmt ="{x:,.0f}" formatter = plt.matplotlib.ticker.StrMethodFormatter(fmt)if"x"in axis: ax.xaxis.set_major_formatter(formatter)if"y"in axis: ax.yaxis.set_major_formatter(formatter)def ax_add_value_labels_ab( ax, labels=None, spacing=2, size=9, weight="bold", **kwargs):"""Add value labels above/below each bar in a bar chart. Arguments: ax (matplotlib.Axes): Plot (axes) to annotate. label (str or similar): Values to be used as labels. spacing (int): Number of points between bar and label. size (int): font size. weight (str): font weight. **kwargs: further arguments to axis.annotate. Source: This function is based on https://stackoverflow.com/a/48372659/4783029 """# For each bar: Place a labelfor rect, label inzip(ax.patches, labels):# Get X and Y placement of label from rect. y_value = rect.get_height() x_value = rect.get_x() + rect.get_width() /2 space = spacing# Vertical alignment for positive values va ="bottom"# If the value of a bar is negative: Place label below the barif y_value <0:# Invert space to place label below space *=-1# Vertical alignment va ="top"# Use Y value as label and format number with one decimal placeif labels isNone: label ="{:.1f}".format(y_value)# Create annotation ax.annotate( label, (x_value, y_value), xytext=(0, space), textcoords="offset points", ha="center", va=va, fontsize=size, fontweight=weight,**kwargs, )# Inferential statistics -----------------------------------------------------def ci_proportion_multinomial( counts, method: str="goodman", n_label: str="n", percent_label: str="percent",) -> pd.DataFrame:"""Calculate simultaneous confidence intervals for multinomial proportion. More information in documentation of statsmodels' multinomial_proportions_confint. Args: x (int): ps.Series, list or tuple with count data. method (str, optional): Method. Defaults to "goodman". n_label (str, optional): Name for column for counts. percent_label (str, optional): Name for column for percentage values. Returns: pd.DataFrame: _description_ Examples: >>> ci_proportion_multinomial([62, 33, 55]) """asserttype(counts) in [pd.Series, list, tuple]iftype(counts) isnot pd.Series: counts = pd.Series(counts)return pd.concat( [ (counts).rename(n_label), (counts /sum(counts)).rename(percent_label) *100, pd.DataFrame( sms.multinomial_proportions_confint(counts, method=method), index=counts.index, columns=["ci_lower", "ci_upper"], )*100, ], axis=1, )def test_chi_square_gof( f_obs: list[int], f_exp: Union[str, list[float]] ="all equal") ->str:"""Chi squared (χ²) goodness-of-fit (gof) test Args: f_obs (list[int]): Observed frequencies f_exp str, list[int]: List of expected frequencies or "all equal" if all frequencies are equal to the mean of observed frequencies. Defaults to "all equal". Returns: str: formatted test results including p value. """ k =len(f_obs) n =sum(f_obs) exp = n / k dof = k -1if f_exp =="all equal": f_exp = [exp for _ inrange(k)] stat, p = sps.chisquare(f_obs=f_obs, f_exp=f_exp)# May also be formatted this way:return (f"Chi square test, χ²({dof}, n = {n}) = {round(stat, 2)}, {format_p(p)}" )def pairwise_chisq_gof_test(x: pd.Series):"""Post-hoc Pairwise chi-squared Test Interface to R function `rstatix::pairwise_chisq_gof_test()`. Args: x (pandas.Series): data with group counts Returns: pandas.DataFrame: DataFrame with CLD results. """# Loading R package rstatix = r_obj.packages.importr("rstatix") dplyr = r_obj.packages.importr("dplyr")# Converting Pandas obj to R objwith localconverter(r_obj.default_converter + r_obj.pandas2ri.converter): x_in_r = r_obj.conversion.py2rpy(x)# Invoking the R function and getting the result df_result_r = rstatix.pairwise_chisq_gof_test(x_in_r) df_result_r = dplyr.relocate(df_result_r, "group1", "group2")# Converting the result to a Pandas dataframereturn r_obj.pandas2ri.rpy2py(df_result_r)def convert_pairwise_p_to_cld( data, group1: str="group1", group2: str="group2", p_name: str="p.adj", output_gr_var: str="group",):"""Convert p values from pairwise comparisons to CLD CLD - compact letter display: shared letter shows that difference is not significant. Interface to R function `convert_pairwise_p_to_cld()`. Args: data (pandas.DataFrame): Data frame with at least 3 columns: the first 2 columns contain names of both groups, one more column should contain p values. group1 (str, optional): Name of the first column with group names. Defaults to "group1". group2 (str, optional): Name of the first column with group names. Defaults to "group2". p_name (str, optional): Name of column with p values. Defaults to "p.adj". output_gr_var (str, optional): Name of column in output dataset with group names. Defaults to "group". Returns: pandas.DataFrame: DataFrame with CLD results. """# Loading R function from file r_obj.r["source"]("functions/functions.R") convert_pairwise_p_to_cld = r_obj.globalenv["convert_pairwise_p_to_cld"]# Converting Pandas data frame to R data framewith localconverter(r_obj.default_converter + r_obj.pandas2ri.converter): df_in_r = r_obj.conversion.py2rpy(data)# Invoking the R function and getting the result df_result_r = convert_pairwise_p_to_cld( df_in_r, group1=group1, group2=group2, p_name=p_name, output_gr_var=output_gr_var, )# Converting the result back to a Pandas dataframereturn r_obj.pandas2ri.rpy2py(df_result_r)
Code
"""Classes to perform statistical analysis and output the results."""import pandas as pdimport numpy as npimport pingouin as pgimport statsmodels.stats.api as smsimport scikit_posthocs as spimport matplotlib.pyplot as pltimport functions.fun_utils as my # Custom moduleimport functions.pandas_methods # Custom module; imports method .to_df()# Analyze count data ---------------------------------------------------------class AnalyzeCounts:"""The class to analyze count data. - Performs omnibus chi-squared and post-hoc pair-wise chi-squared test. - Compactly presents results of post-hoc test as compact letter display, CLDNOTE: for CLD calculations, R is required. (Shared CLD letter show no significant difference between groups). - Calculates percentages and their confidence intervals by using Goodman's method. - Creates summary of grouped values (group counts and percentages). - Plots results as bar plots with percentage labels. """def__init__(self, counts, by=None, counts_of=None):""" Object initialization function. Args: counts (pandas.Series[int]): Count data to analyze. by (str, optional): Grouping variable name. Used to create labels. If None, defaults to "Group" counts_of (str, optional): The thing that was counted. This name is used for labels in plots and tables. Defaults to `counts.name`. """assertisinstance(counts, pd.Series)# Set defaultsif by isNone: by ="Group"if counts_of isNone: counts_of = counts.name# Set attributes: user inputs or defaultsself.counts = countsself.counts_of = counts_ofself.by = by# Set attributes: created/calculatedself.n_label =f"n_{counts_of}"# Create label for counts# Set attributes: results to be calculatedself.results_are_calculated =Falseself.omnibus =Noneself.n_ci_and_cld =Noneself.descriptive_stats =Nonedef fit(self):"""Perform count data analysis: calculate the results."""# Alias attributes counts =self.counts by =self.by n_label =self.n_label# Omnibus test: perform and save the resultsself.omnibus = my.test_chi_square_gof(counts)# Post-hoc (pairwise chi-square): perform posthoc_p = my.pairwise_chisq_gof_test(counts) posthoc_cld = my.convert_pairwise_p_to_cld(posthoc_p, output_gr_var=by)# Confidence interval: calculate ci = ( my.ci_proportion_multinomial( counts, method="goodman", n_label=n_label ) .rename_axis(by) .reset_index() )# Make sure datasets are mergeable ci[by] = ci[by].astype(str) posthoc_cld[by] = posthoc_cld[by].astype(str)# Merge results n_ci_and_cld = pd.merge(ci, posthoc_cld, on=by)# Format percentages and countsvars= ["percent", "ci_lower", "ci_upper"] n_ci_and_cld[vars] = n_ci_and_cld[vars].apply(my.format_percent)# Save resultsself.n_ci_and_cld = n_ci_and_cld# Descriptive statistics: calculate to_format = ["min", "max", "range", "mean", "median", "std", "mad"]def format_0f(x):return [f"{i:,.0f}"for i in x] summary_count = my.calc_summaries(ci[n_label]) summary_count[to_format] = summary_count[to_format].apply(format_0f) summary_perc = my.calc_summaries(ci["percent"]) summary_perc[to_format] = summary_perc[to_format].apply( my.format_percent )# Save resultsself.descriptive_stats = pd.concat([summary_count, summary_perc])# Initialization statusself.results_are_calculated =True# Outputreturnselfdefprint(self, omnibus: bool=True, posthoc: bool=True, descriptives: bool=True, ):"""Print numeric results. Args: omnibus (bool, optional): Flag to print omnibus test results. Defaults to True. posthoc (bool, optional): Flag to print post-hoc test results. Defaults to True. descriptives (bool, optional): Flag to print descriptive statistics. Defaults to True. Raises: Exception: if calculations with `.fit()` method were not performed. """ifnotself.results_are_calculated:raiseException("No results. Run `.fit()` first.")# Omnibus testif omnibus:print("Omnibus (chi-squared) test results:")print(self.omnibus, "\n")# Post-hoc and CIif posthoc:print(f"Counts of {self.counts_of} with 95% CI ""and post-hoc (pairwise chi-squared) test results:" )print(self.n_ci_and_cld, "\n")# Descriptive statistics: displayif descriptives:print(f"Descriptive statistics of group ({self.by}) counts:")print(self.descriptive_stats, "\n")def display(self, omnibus: bool=True, posthoc: bool=True, descriptives: bool=True, ):"""Display numeric results in Jupyter Notebooks. Args: omnibus (bool, optional): Flag to print omnibus test results. Defaults to True. posthoc (bool, optional): Flag to print post-hoc test results. Defaults to True. descriptives (bool, optional): Flag to print descriptive statistics. Defaults to True. Raises: Exception: if calculations with `.analyze()` method were not performed. """ifnotself.results_are_calculated:raiseException("No results. Run `.fit()` first.")# Omnibus testif omnibus: my.display_collapsible(self.omnibus, "Omnibus (chi-squared) test results" )# Post-hoc and CIif posthoc: my.display_collapsible(self.n_ci_and_cld.style.format({self.n_label: "{:,.0f}"}),f"Counts of {self.counts_of} with 95% CI and post-hoc "" (pairwise chi-squared) test results", )# Descriptive statistics: displayif descriptives: my.display_collapsible(self.descriptive_stats,f"Descriptive statistics of group ({self.by}) counts", )def plot(self, xlabel=None, ylabel=None, **kwargs):"""Plot analysis results. Args: xlabel (str, None, optional): X axis label. Defaults to None: autogenerated label. ylabel (str, None, optional): Y axis label. Defaults to None: autogenerated label. **kwargs: further arguments passed to `my.plot_counts_with_labels()` Raises: Exception: if calculations with `.fit()` method were not performed. Returns: matplotlib.axes object """ifnotself.results_are_calculated:raiseException("No results. Run `.fit()` first.")# Plotif xlabel isNone: xlabel =self.by.capitalize()if ylabel isNone: ylabel =f"Number of {self.counts_of}" ax = my.plot_counts_with_labels(self.n_ci_and_cld, x_lab=xlabel, y_lab=ylabel, y=self.n_label,**kwargs, ) my.ax_axis_comma_format("y")return ax# Analyze numeric groups ------------------------------------------------------class AnalyzeNumericGroups:"""Class to analyze numeric/continuous data by groups. - Calculates mean ratings per group and their confidence intervals using t distribution. - Performs omnibus (Kruskal-Wallis) and post-hoc (Conover-Iman) tests. - Compactly presents results of post-hoc test as compact letter display, CLDNOTE: for CLD calculations, R is required. (Shared CLD letter show no significant difference between groups). - Creates summary of grouped values (group counts and percentages). - Plots results as points with 95% confidence interval error bars. """def__init__(self, data, y: str, by: str):"""Initialize the class. Args: y (str): Name of numeric/continuous (dependent) variable. by (str): Name of grouping (independent) variable. data (pandas.DataFrame): data frame with variables indicated in `y` and `by`. """assertisinstance(data, pd.DataFrame)# Set attributes: user inputsself.data = dataself.y = yself.by = by# Set attributes: results to be calculatedself.results_are_calculated =Falseself.omnibus =Noneself.ci_and_cld =Noneself.descriptive_stats =Nonedef fit(self):# Aliases: data =self.data y =self.y by =self.by# Omnibus test: Kruskal-Wallis test omnibus = pg.kruskal(data=data, dv=y, between=by) omnibus["p-unc"] = my.format_p(omnibus["p-unc"][0])self.omnibus = omnibus# Confidence intervals ci_raw = data.groupby(by)[y].apply(lambda x: [np.mean(x), *sms.DescrStatsW(x).tconfint_mean()] ) ci = pd.DataFrame(list(ci_raw), index=ci_raw.index, columns=["mean", "ci_lower", "ci_upper"], ).reset_index()# Post-hoc test: Conover-Iman test posthoc_p_matrix = sp.posthoc_conover( data, val_col=y, group_col=by, p_adjust="holm" ) posthoc_p_df = posthoc_p_matrix.stack().to_df("p.adj", ["group1", "group2"] ) posthoc_cld = my.convert_pairwise_p_to_cld( posthoc_p_df, output_gr_var=by )# Make sure datasets are mergeable ci[by] = ci[by].astype(str) posthoc_cld[by] = posthoc_cld[by].astype(str)self.ci_and_cld = pd.merge(posthoc_cld, ci, on=by)# Descriptive statistics of meansself.descriptive_stats = my.calc_summaries(ci["mean"])# Results are presentself.results_are_calculated =True# Output:returnselfdefprint(self, omnibus: bool=True, posthoc: bool=True, descriptives: bool=True, ):"""Print numeric results. Args: omnibus (bool, optional): Flag to print omnibus test results. Defaults to True. posthoc (bool, optional): Flag to print post-hoc test results. Defaults to True. descriptives (bool, optional): Flag to print descriptive statistics. Defaults to True. Raises: Exception: if calculations with `.fit()` method were not performed. """ifnotself.results_are_calculated:raiseException("No results. Run `.fit()` first.")# Omnibus testif omnibus:print("Omnibus (Kruskal-Wallis) test results:")print(self.omnibus, "\n")# Post-hoc and CIif posthoc:print("Post-hoc (Conover-Iman) test results as CLD and ""Confidence intervals (CI):", )print(self.ci_and_cld, "\n")# Descriptive statisticsif descriptives:print(f"Descriptive statistics of group ({self.by}) means:")print(self.descriptive_stats, "\n")def display(self, omnibus: bool=True, posthoc: bool=True, descriptives: bool=True, ):"""Display numeric results in Jupyter Notebooks. Args: omnibus (bool, optional): Flag to print omnibus test results. Defaults to True. posthoc (bool, optional): Flag to print post-hoc test results. Defaults to True. descriptives (bool, optional): Flag to print descriptive statistics. Defaults to True. Raises: Exception: if calculations with `.fit()` method were not performed. """ifnotself.results_are_calculated:raiseException("No results. Run `.fit()` first.")# Omnibus testif omnibus: my.display_collapsible(self.omnibus, "Omnibus (Kruskal-Wallis) test results" )# Post-hoc and CIif posthoc: my.display_collapsible(self.ci_and_cld,"Post-hoc (Conover-Iman) test results as CLD and ""Confidence intervals (CI)", )# Descriptive statistics of meansif descriptives: my.display_collapsible(self.descriptive_stats,f"Descriptive statistics of group ({self.by}) means", )def plot(self, title=None, xlabel=None, ylabel=None, **kwargs):"""Plot the results Args: xlabel (str, None, optional): X axis label. Defaults to None: capitalized value of `by`. ylabel (str, None, optional): Y axis label. Defaults to None: capitalized value of `y`. title (str, None, optional): The title of the plot. Defaults to None. Returns: Tuple with matplotlib figure and axis objects (fig, ax). """ifnotself.results_are_calculated:raiseException("No results. Run `.fit()` first.")# Aliases: ci =self.ci_and_cld by =self.by y =self.y# Create figure and axes fig, ax = plt.subplots()# Construct plot x = ci.iloc[:, 0] ax.errorbar( x=x, y=ci["mean"], yerr=[ci["mean"] - ci["ci_lower"], ci["ci_upper"] - ci["mean"]], mfc="red", ms=2, mew=1, fmt="ko", zorder=3, )if xlabel isNone: xlabel = by.capitalize()if ylabel isNone: ylabel = y.capitalize() ax.set_xlabel(xlabel) ax.set_ylabel(ylabel) ax.set_ylim([0, None]) ax.set_title(title)# Outputreturn (fig, ax)
Code
from typing import Unionimport numpy as npimport pandas as pdimport matplotlib.pyplot as pltimport matplotlib.ticker as mticker# Machine learningfrom sklearn.metrics import mean_squared_error as mse, r2_scorefrom sklearn.metrics import f1_score, accuracy_score, balanced_accuracy_scorefrom sklearn.metrics import cohen_kappa_scorefrom sklearn.metrics import classification_report, confusion_matrixfrom mlxtend.feature_selection import SequentialFeatureSelectorfrom sklearn.decomposition import PCAfrom sklearn.preprocessing import StandardScaler# Helpersdef as_formula( target: str=None, include: Union[list[str], pd.DataFrame] =None, exclude: list[str] =None, add: str="",):""" Generates the R style formula for statsmodels (patsy) given the dataframe, dependent variable and optional excluded columns as strings. Args: target (str): name of target variable. include (pandas.DataFrame or list[str]): dataframe of column names to include. exclude (list[str], optional): columns to exclude. add (str): string to add to formula, e.g., "+ 0" Return: String with R style formula for `patsy` (e.g., "target ~ x1 + x2"). See also: https://stackoverflow.com/a/44866142/4783029 """ifisinstance(include, pd.DataFrame): include =list(include.columns.values)if target in include: include.remove(target)if exclude isnotNone:for col in exclude: include.remove(col)return target +" ~ "+" + ".join(include) + adddef get_columns_by_purpose(data, target: str):"""Split data frame to 3 data frames: for target, numeric, and remaining variables. Examples: >>> # Split >>> d_target, d_num, d_other = get_columns_by_purpose(data, "class") >>> # Merge back >>> pd.concat([d_target, d_num, d_other], axis=1) """ d_num = data.drop(columns=target).select_dtypes("number") d_other = data.drop(columns=[target, *d_num.columns.values])return data[target].to_frame(), d_num, d_other# Functions for feature selectiondef sfs(estimator, est_type, k_features="parsimonious", forward=True):"""Create SFS instance for classification Args.: est_type (str): classification or regression other arguments: see mlextend.SequentialFeatureSelector() """if est_type =="regression": scoring ="neg_root_mean_squared_error"elif est_type =="classification": scoring ="balanced_accuracy"else:raiseException(f"Unrecognized learner/estimator type: {type}")return SequentialFeatureSelector( estimator, k_features=k_features, # "parsimonious", forward=forward, floating=False, scoring=scoring, verbose=1, cv=5, n_jobs=-1, )def sfs_get_score(sfs_object, k_features):"""Return performance score achieved with certain number of features. Args.: sfs_object: result of function do_sfs_lin_reg() k_features (int): number of features. """ md =round( np.median(sfs_object.get_metric_dict()[k_features]["cv_scores"]), 3 )return {"k_features": k_features,"mean_score": round( sfs_object.get_metric_dict()[k_features]["avg_score"], 3 ),"median_score": md,"sd_score": round( sfs_object.get_metric_dict()[k_features]["std_dev"], 3 ), }def sfs_plot_results(sfs_object, sub_title="", ref_y=None):"""Plot results from SFS object Args.: sfs_object: object with SFS results. sub_title (str): second line of title. ref_y (float): Y coordinate of reference line. """ scoring = sfs_object.get_params()["scoring"]if scoring =="neg_root_mean_squared_error": metric ="RMSE" sign =-1elif scoring =="balanced_accuracy": metric ="BAcc" sign =1else:raiseException(f"Unsupported scoring metric: {scoring}")if sfs_object.forward: sfs_type ="Forward"else: sfs_type ="Backward" fig, ax = plt.subplots(1, 2, sharey=True) xlab ="Number of predictors included"if ref_y isnotNone: ax[0].axhline(y=ref_y, color="darkred", linestyle="--", lw=0.5) ax[1].axhline(y=ref_y, color="darkred", linestyle="--", lw=0.5) avg_score = [ (int(i), sign * c["avg_score"]) for i, c in sfs_object.subsets_.items() ] averages = pd.DataFrame(avg_score, columns=["k_features", "avg_score"]) ( averages.plot.scatter( x="k_features", y="avg_score", xlabel=xlab, ylabel=metric, title=f"Average {metric}", ax=ax[0], ) ) cv_scores = {int(i): sign * c["cv_scores"] for i, c in sfs_object.subsets_.items() } ( pd.DataFrame(cv_scores).plot.box( xlabel=xlab, title=f"{metric} in CV splits", ax=ax[1], ) ) ax[0].xaxis.set_major_locator(mticker.MaxNLocator(integer=True)) ax[1].xaxis.set_major_locator(mticker.MaxNLocator(integer=True))ifnot sfs_object.forward: ax[1].invert_xaxis() main_title = (f"{sfs_type} Feature Selection with {sfs_object.cv}-fold CV "+f"\n{sub_title}" ) fig.suptitle(main_title) plt.tight_layout() plt.show()# Print resultsifnot sfs_object.interrupted_:if sfs_object.is_parsimonious: note ="[Parsimonious]" k_selected =f"k = {len(sfs_object.k_feature_names_)}" score_at_k =f"avg. {metric} = {sign * sfs_object.k_score_:.3f}" note_2 ="Smallest number of predictors at best ± 1 SE score"else: note ="[Best]"if sign <0: best = averages.nsmallest(1, "avg_score")else: best = averages.nlargest(1, "avg_score") k_selected =f"k = {int(best.k_features.values)}" score_at_k =f"avg. {metric} = {float(best.avg_score.values):.3f}" note_2 ="Number of predictors at best score"print(f"{k_selected}, {score_at_k}{note}\n({note_2})")def sfs_list_features(sfs_result):"""List features by order when they were added. Current implementation correctly works with forward selection only. Args: sfs_result (SFS object) """def rename_metric(x):return x.replace("score", metric) scoring = sfs_result.get_params()["scoring"]if scoring =="neg_root_mean_squared_error": metric ="RMSE" sign =-1elif scoring =="balanced_accuracy": metric ="BAcc" sign =1else:raiseException(f"Unsupported scoring metric: {scoring}") feature_dict = sfs_result.get_metric_dict() lst = [[*feature_dict[i]["feature_names"]] for i in feature_dict] feature = []for x, y inzip(lst[0::], lst[1::]): feature.append(*set(y).difference(x))return ( pd.DataFrame( {"added_feature": [*lst[0], *feature],"score": [ sign * feature_dict[i]["avg_score"] for i in feature_dict ], } ) .assign(score_improvement=lambda x: sign * x.score.diff()) .assign( score_percentage_change=lambda x: sign * x.score.pct_change() *100 ) .index_start_at(1) .rename_axis("step") .rename(columns=rename_metric) )# Functions for regression/classificationdef get_regression_performance(y_true, y_pred, name=""):"""Evaluate regression model performance Calculate R², RMSE, and SD of predicted variable Args.: y_true, y_pred: true and predicted numeric values. name (str): the name of investigated set. """return ( pd.DataFrame( {"set": name,"n": len(y_true),"SD": [float(np.std(y_true))],"RMSE": [float(np.sqrt(mse(y_true, y_pred)))],"R²": [r2_score(y_true, y_pred)], } ) .eval("RMSE_SD_ratio = RMSE/SD") .eval("SD_RMSE_ratio = SD/RMSE") )def get_classification_performance(true_class, predicted_class, name=""):"""Evaluate classification model performance Calculate accuracy (Acc), Balanced accuracy (BAcc), Balanced accuracy adjusted to be between 0 and 1 (BAcc_01), F1 macro average (F1_macro), F1 weighted macro average (F1_weighted), Cohen's Kappa. Args.: true_class, predicted_class: true and predicted numeric values. name (str): the name of investigated set. """ acc = accuracy_score(true_class, predicted_class) bacc = balanced_accuracy_score(true_class, predicted_class) bacc01 = balanced_accuracy_score(true_class, predicted_class, adjusted=True) f1_macro = f1_score(true_class, predicted_class, average="macro") f1_weighted = f1_score(true_class, predicted_class, average="weighted") kappa = cohen_kappa_score(true_class, predicted_class)return pd.DataFrame( {"set": name,"n": len(true_class),"Accuracy": [acc],"BAcc": [bacc],"BAcc_01": [bacc01],"f1_macro": [f1_macro],"f1_weighted": [f1_weighted],"Kappa": [kappa], } )def print_classification_report(true_class, predicted_class, name=""):"""Print summary of classification performance Args.: true_class, predicted_class: data sequences of the same length: with class names/indicators. name (str): the name of investigated set of data. """print( get_classification_performance(true_class, predicted_class, name=name) )print("")print(classification_report(true_class, predicted_class, zero_division=0))print("")print("Confusion matrix (rows - true, columns - predicted):")print(confusion_matrix(true_class, predicted_class))# For Random Forestsdef get_rf_importances(obj):"""Get random forest feature importance Args: obj (fitted instance of RandomForestRegressor()): Random Forest. Returns: pandas.DataFrame: dataframe with feature names and their importance. """return pd.DataFrame( {"features": obj.feature_names_in_,"importance": obj.feature_importances_, } ).sort_values("importance", ascending=False)def plot_importances(data, n=20):"""Plot 2 plots with feature importance: "overview" and zoomed plot. Args: data (pandas.DataFrame): dataframe with columns `features` and `importance` """ fig, ax = plt.subplots(2, 1, height_ratios=(1, 3)) data.plot.bar( x="features", y="importance", ylabel="", xlabel="", legend=False, ax=ax[0], ) ax[0].xaxis.set_ticklabels([]) data.head(n).plot.bar( x="features", y="importance", ylabel="", xlabel="Features", ax=ax[1] ) fig.suptitle("Feature Importance: All and Several Top Variables")return fig, ax# PCAdef pca_screeplot(data, n_components=30):"""Plot PCA screeplot Args: data (pandas.Dataframe): Numeric data n_components (int, optional): Max number of principal components to extract. Defaults to 30. Returns: 3 objects: plot (fig and ax) and pca object. """ scale = StandardScaler() pca = PCA(n_components=n_components) scaled_data = scale.fit_transform(data) pca.fit(scaled_data) pct_explained = pca.explained_variance_ratio_ *100 fig, ax = plt.subplots(figsize=(8, 4)) ax.plot(pct_explained, "-o", color="tab:green") ax.set_xlabel("Number of components") ax.set_ylabel("% of explained variance")return fig, ax, pcadef do_pca(data, target: str, n_components: int=10, scale=None, pca=None):"""Do PCA on numeric non-target variables Args: data (pandas.Dataframe): data target (str): Target variable name n_components (int, optional): Number of PCA components to extract. Defaults to 10. n_components is ignored if `pca` is not None. scale (instance of sklearn.preprocessing.StandardScaler or None): Fitted object to scale data. pca (instance of sklearn.decomposition.PCA or None): Fitted PCA object. Returns: tuple with 6 elements: - 4 data frames: d_target, d_num, d_other, d_pca - fitted instance of sklearn.preprocessing.StandardScaler. - fitted instance of sklearn.decomposition.PCA. """ d_target, d_num, d_other = get_columns_by_purpose(data, target)if scale isNone: scale = StandardScaler() sc_data = scale.fit_transform(d_num)else: sc_data = scale.transform(d_num)if pca isNone: pca = PCA(n_components=n_components) pc_num = pca.fit_transform(sc_data)else: pc_num = pca.transform(sc_data) n_components = pc_num.shape[1]# Convert to DataFrame and name columns (pc_1, pc_2, etc.) d_pca = pd.DataFrame( pc_num, index=d_num.index, columns=[f"pc_{i}"for i in np.arange(1, n_components +1)], )return (d_target, d_num, d_other, d_pca, scale, pca)
Code
"""New methods for Pandas Series and DataFrames"""# Setup -----------------------------------------------------------------import warningsimport pandas as pdimport pandas_flavor as pffrom typing import Unionimport janitor # imports additional Pandas methodsimport functions.fun_utils as my # Custom module# Series methods --------------------------------------------------------@pf.register_series_methoddef to_df(self: pd.Series, values_name: str=None, key_name: Union[str, list[str], tuple[str, ...]] =None,) -> pd.DataFrame:"""Convert Series to DataFrame with desired or default column names. Similar to `pandas.Series.to_frame()`, but the main purpose of this method is to be used with the result of `.value_counts()`. So appropriate default column names are pre-defined. And index is always reset. Args: self (pandas.Series): The object the method is applied to. values_name (str): Name for series values (applied before conversion to DataFrame). Defaults "count". key_name (str or sequence of str): New name for the columns, that are created from Series index that was present before the conversion to DataFrame. Defaults to `self.index.names`, if index has names, to `self.name` if index has no names but series has name, or to "value" otherwise. Return: pandas.DataFrame Examples: >>> import pandas as pd >>> df = pd.Series({'right': 138409, 'left': 44733}).rename("foot") >>> df.to_df() >>> # Compared to .to_frame() >>> df.to_frame() """ k_name =None v_name =None# Check if defaults can be set based on non-missing attribute valuesif my.index_has_names(self): k_name =self.index.namesifself.name isnotNone: v_name =self.nameelse: k_name =self.name# Set user-defined values or defaultsif key_name isnotNone: k_name = key_nameelif k_name isNone: k_name ="value"# Defaultif values_name isnotNone: v_name = values_nameelif v_name isNone: v_name ="count"# Default# Outputreturnself.rename_axis(k_name).rename(v_name).reset_index()@pf.register_series_methoddef to_category(self, categories=None, ordered=False):"""Convert variable to categorical one.NOTE: method with the same name but for DataFrame also exists. Args: self (pandas.Series): The object the method is applied to. categories (list of values, optional): Categories listed here will become the first categories. The remaining ones (not in this list) will follow. Defaults to None: use default order. ordered (bool, optional): Whether or not this categorical is treated as ordered categorical. Defaults to False. Return: pandas.Series """self=self.astype("category") all_cats =self.cat.categories.valuesif categories isnotNone:# new order all_cats = [*categories,*sorted(list(set(all_cats).difference(categories))), ]returnself.cat.reorder_categories(all_cats, ordered=ordered)# DataFrame methods --------------------------------------------------------@pf.register_dataframe_methoddef relocate(self, col, before=0):"""Change position of a column. Do transformations in-place and return a data frame. Args: self (pd.DataFrame): The object the method is applied to. col (str): The name of column to relocate. before (int|str): The name or index of the column before which `col` will be inserted. Return: pandas.DataFrame Examples: >>> import pandas as pd >>> data = pd.DataFrame({"a": 1, "b":2, "c":3}) >>> data.relocate("c") >>> data >>> data.relocate("b", before="a") >>> data """ columns =self.columnsassert col in columnsif before isNone: position =0ifisinstance(before, int) orisinstance(before, float): position =int(before)ifisinstance(before, str):assert before in columns position = columns.get_loc(before) col_position = columns.get_loc(col)if col_position <= position: position -=1 col_to_relocate =self.pop(col)self.insert(loc=position, column=col, value=col_to_relocate)returnself@pf.register_dataframe_methoddef index_start_at(self, start=1):"""Create a new sequential index that starts at indicated number. Args.: self (pd.DataFrame): The object the method is applied to. start (int): The start of an index Return: pandas.DataFrame """ i =self.indexself.index =range(start, len(i) + start)returnselfwith warnings.catch_warnings(): warnings.simplefilter("ignore")# There is a deprecated method with the same name in `pyjanitor` package# So warning is suppressed@pf.register_dataframe_methoddef to_datetime(self, columns, **kwargs):"""Convert indicated columns to datetime. Args.: self (pd.DataFrame): The object the method is applied to. columns (str or list[str]): Column names to convert to datetime. **kwargs: Named arguments to be passed to pandas.to_datetime() Return: pandas.DataFrame """ifisinstance(columns, str): columns = [columns]self[columns] =self[columns].apply(pd.to_datetime, **kwargs)returnself@pf.register_dataframe_methoddef to_category(self, columns, categories=None, ordered=False):"""Convert indicated columns to categorical variables.NOTE: method with the same name but for Series also exists. Args.: self (pd.DataFrame): The object the method is applied to. column (str or list[str]): Column names to convert to category. categories (list of values, optional): Categories listed here will become the first categories. The remaining ones (not in this list) will follow. Defaults to None: use default order. ordered (bool, optional): Whether or not this categorical is treated as ordered categorical. Defaults to False. Return: pandas.DataFrame """ifisinstance(columns, str): columns = [columns]# res = self.loc[:, columns].copy(deep=True)# self.loc[:, columns] = res.apply(lambda x: x.to_category(# categories=categories, ordered=ordered# ))self.transform_columns( columns,lambda x: x.to_category(categories=categories, ordered=ordered), elementwise=False, )returnself@pf.register_dataframe_methoddef make_dummies(self, exclude=None, drop_first=True, prefix_sep="__", **kwargs):"""Convert categorical variables in data frame to dummies and remove target variable. Args: exclude (str or None, optional): Name of target variable (exclude from the feature list). Defaults to None. drop_first (bool, optional): See pandas.get_dummies(). Defaults to True. prefix_sep (str, optional): See pandas.get_dummies(). Defaults to "__". Returns: If `exclude` is not present: pandas.DataFrame (the X) If `exclude` is present: pandas.DataFrame and pandas.Series (the X and y) """if exclude isnotNone: y =self[exclude]self=self.drop(columns=exclude) df_with_dummies = pd.get_dummies(self, drop_first=drop_first, prefix_sep=prefix_sep, **kwargs )if exclude isnotNone:return (df_with_dummies, y)return df_with_dummies
Code
# R functions for data analysis#' Convert Post-Hoc Test Results to CLD#'#' Convert p values from pairwise comparisons to CLD.#'#' CLD - compact letter display.#' This function is a wrapper around [multcompView::multcompLetters()].#'#' @note#' No hyphens are allowed in group names#' (vaues of culumns `group1` and `group2`).#'#' @param .data (data frame with at least 3 columns)#' The result of pairwise comparison test usually from \pkg{rstatix}#' package.#' @param group1,group2 Name of the columns in `.data`, which contain the names#' of first and second group. Defaults to "group1" and "group2".#' @param p_name Name of the column, which contains p values.#' Defaults to `p.adj`.#' @param alpha Significance level. Defaults to 0.05.#'#' @return Data frame with compared group names and CLD representation of#' test results. Contains columns with group names and CLD results#' (`cld` and `spaced_cld`).convert_pairwise_p_to_cld <-function(.data,group1 ="group1",group2 ="group2",p_name ="p.adj",output_gr_var ="group",alpha =0.05) {# Checking input col_names <-c(group1, group2, p_name) missing_col <-!col_names %in%colnames(.data)if (any(missing_col)) {stop("Check you input as these columns are not present in data: ",paste(col_names[missing_col], sep =",") ) }# Analysis pair_names <- stringr::str_glue("{.data[[group1]]}-{.data[[group2]]}")# Prepare input data cld_obj <- purrr::set_names(.data[[p_name]], pair_names) |># Get CLD multcompView::multcompLetters(threshold = alpha) # If no differences are detected, then "$monospacedLetters" is not created,# then "$Letters" is used instead.if (is.null(cld_obj$monospacedLetters)) { cld_obj$monospacedLetters <- cld_obj$Letters }# Format the results cld_obj |>with( dplyr::full_join( Letters |> tibble::enframe(output_gr_var, "cld"), monospacedLetters |> tibble::enframe(output_gr_var, "spaced_cld") |> dplyr::mutate(spaced_cld = stringr::str_replace_all(spaced_cld, " ", "_") ),by = output_gr_var ) )}
2 Methods
This section shortly introduces the main aspects of inferential statistics and predictive modeling used in this project.
2.1 Statistical Inference
For difference in proportions, χ² (chi-squared) test was performed with pair-wise χ² as post-hoc. Goodman’s method was used to calculate confidence intervals of multinomial proportions.
For differences between groups of numeric variables, Kruskal-Wallis test was performed followed by Conover-Iman test as post-hoc. Confidence intervals of means were calculating using t-distribution based method.
In this project, confidence level is 95%, significance level is 0.05.
2.2 Predictive Modelling
For predictive modeling, training (data from all seasons except the last one) and test (data from the last season only) sets were used. The training set was used for model selection and the test set for performance evaluation of the selected models.
For the regression task, linear regression and random forests (RF) were used. For the classification task, logistic regression and RF were used. Forward sequential feature selection (SFS) with 5-fold cross-validation (CV) was used to find an optimal combination of variables. The optimized metric in the regression was RMSE** (root mean squared error), in classification BAcc (balanced accuracy), which takes into account class imbalance.
As some calculations take a lot of time, in some analyses either the total available number of features or the number of features allowed to be included in the analysis, or both were limited to fit into a reasonable amount of available time: the decision was made either based on the RF feature importance analysis or the results of previous calculations (number of possibly valuable features and time that was needed to perform a certain amount of calculations).
Models with greater performance were desirable but less complex models with almost the same level of performance as the best one were preferred.
3 Initial Exploration
In this section, the database is presented. Data summaries as well as database tables are explored to better understand the data itself and what steps of pre-precessing are needed.
3.1 Database
The “Ultimate 25k+ Matches Football Database – European” (v2) was downloaded from Kaggle. The database consists of 7 tables. The entity relationship diagram (ERD) is shown below (Fig. 3.1): pay attention that some columns from table Match are not shown in the ERD.
Code: Create connection to SQL database
db = sqlite3.connect("data/database.sqlite")
Code
query ="""--sqlSELECT name FROM sqlite_master WHERE type = 'table' AND name != 'sqlite_sequence';"""cursor = db.cursor()cursor.execute(query)print("Data tables in the database: ")for i, tbl inenumerate(cursor.fetchall(), start=1):print(" ", i, ". ", *tbl, sep="")
Data tables in the database:
1. Player_Attributes
2. Player
3. Match
4. League
5. Country
6. Team
7. Team_Attributes
3.2 Tables Country and League
In tables country and league has 11 distinct records each. As Scotland and England are regions of the United Kingdom, UK, there are 10 countries only.
Code
# Working with SQL databaseimport sqlite3query ="""--sqlSELECT (SELECT COUNT(DISTINCT name) FROM Country) n_regions, (SELECT COUNT(DISTINCT name) FROM League) n_leagues;"""pd.read_sql_query(query, db).style.hide(axis="index")
Table 3.1. Inspection: number of unique items in country and league tables.
n_regions
n_leagues
11
11
Code
pd.read_sql_query("SELECT * FROM Country", db).index_start_at(1).style
Table 3.2. Inspection: table country.
id
name
1
1
Belgium
2
1,729
England
3
4,769
France
4
7,809
Germany
5
10,257
Italy
6
13,274
Netherlands
7
15,722
Poland
8
17,642
Portugal
9
19,694
Scotland
10
21,518
Spain
11
24,558
Switzerland
Code
pd.read_sql_query("SELECT * FROM League", db).index_start_at(1).style
Table 3.3. Inspection: table league.
id
country_id
name
1
1
1
Belgium Jupiler League
2
1,729
1,729
England Premier League
3
4,769
4,769
France Ligue 1
4
7,809
7,809
Germany 1. Bundesliga
5
10,257
10,257
Italy Serie A
6
13,274
13,274
Netherlands Eredivisie
7
15,722
15,722
Poland Ekstraklasa
8
17,642
17,642
Portugal Liga ZON Sagres
9
19,694
19,694
Scotland Premier League
10
21,518
21,518
Spain LIGA BBVA
11
24,558
24,558
Switzerland Super League
League and county/region id codes coincide so these variables contain redundant information.
Details: Country/Region and league IDs are the same.
Table match includes information on 25,979 matches from 2008-07-18 to 2016-05-25 (seasons from 2008/2009 to 2015/2016), approximately 3,200-3,400 matches per season (except the season 2013/2014, where some data is likely to be missing). More details on match dataset in Tables 3.4–3.5.
Code
query ="""--sqlSELECT (SELECT COUNT(1) FROM Match) n_records, (SELECT COUNT(DISTINCT country_id) FROM Match) n_regions, (SELECT COUNT(DISTINCT league_id) FROM Match) n_leagues, (SELECT COUNT(DISTINCT season) FROM Match) n_seasons, (SELECT COUNT(DISTINCT team) FROM ( SELECT home_team_api_id team FROM Match UNION SELECT away_team_api_id team FROM Match )) n_teams, (SELECT COUNT(DISTINCT player) FROM ( SELECT home_player_1 player FROM Match UNION SELECT home_player_2 player FROM Match UNION SELECT home_player_3 player FROM Match UNION SELECT home_player_4 player FROM Match UNION SELECT home_player_5 player FROM Match UNION SELECT home_player_6 player FROM Match UNION SELECT home_player_7 player FROM Match UNION SELECT home_player_8 player FROM Match UNION SELECT home_player_9 player FROM Match UNION SELECT home_player_10 player FROM Match UNION SELECT home_player_11 player FROM Match UNION SELECT away_player_1 player FROM Match UNION SELECT away_player_2 player FROM Match UNION SELECT away_player_3 player FROM Match UNION SELECT away_player_4 player FROM Match UNION SELECT away_player_5 player FROM Match UNION SELECT away_player_6 player FROM Match UNION SELECT away_player_7 player FROM Match UNION SELECT away_player_8 player FROM Match UNION SELECT away_player_9 player FROM Match UNION SELECT away_player_10 player FROM Match UNION SELECT away_player_11 player FROM Match )) n_players, (SELECT COUNT(DISTINCT match_api_id) FROM Match) n_matches;"""n_matches = pd.read_sql_query(query, db)n_matches.style.hide(axis="index")
Table 3.4. Inspection: number of unique items in match table.
n_records
n_regions
n_leagues
n_seasons
n_teams
n_players
n_matches
25,979
11
11
8
299
11,060
25,979
Code
query ="""--sqlSELECT season, COUNT(season) n_matches FROM Match GROUP BY season;"""pd.read_sql_query(query, db).index_start_at(1).style
Table 3.5. Number of matches per season in match table.
season
n_matches
1
2008/2009
3,326
2
2009/2010
3,230
3
2010/2011
3,260
4
2011/2012
3,220
5
2012/2013
3,260
6
2013/2014
3,032
7
2014/2015
3,325
8
2015/2016
3,326
Code: Import match
match = pd.read_sql_query("SELECT * FROM Match", db)# Fix datetime data typematch = match.to_datetime("date")# Printmatch.head(2)
Table 3.6. Inspection: a few rows of table match.
id
country_id
league_id
season
stage
date
match_api_id
home_team_api_id
away_team_api_id
home_team_goal
away_team_goal
home_player_X1
home_player_X2
home_player_X3
home_player_X4
home_player_X5
home_player_X6
home_player_X7
home_player_X8
home_player_X9
home_player_X10
home_player_X11
away_player_X1
away_player_X2
away_player_X3
away_player_X4
away_player_X5
away_player_X6
away_player_X7
away_player_X8
away_player_X9
away_player_X10
away_player_X11
home_player_Y1
home_player_Y2
home_player_Y3
home_player_Y4
home_player_Y5
home_player_Y6
home_player_Y7
home_player_Y8
home_player_Y9
home_player_Y10
home_player_Y11
away_player_Y1
away_player_Y2
away_player_Y3
away_player_Y4
away_player_Y5
away_player_Y6
away_player_Y7
away_player_Y8
away_player_Y9
away_player_Y10
away_player_Y11
home_player_1
home_player_2
home_player_3
home_player_4
home_player_5
home_player_6
home_player_7
home_player_8
home_player_9
home_player_10
home_player_11
away_player_1
away_player_2
away_player_3
away_player_4
away_player_5
away_player_6
away_player_7
away_player_8
away_player_9
away_player_10
away_player_11
goal
shoton
shotoff
foulcommit
card
cross
corner
possession
B365H
B365D
B365A
BWH
BWD
BWA
IWH
IWD
IWA
LBH
LBD
LBA
PSH
PSD
PSA
WHH
WHD
WHA
SJH
SJD
SJA
VCH
VCD
VCA
GBH
GBD
GBA
BSH
BSD
BSA
0
1
1
1
2008/2009
1
2008-08-17
492473
9987
9993
1
1
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
None
None
None
None
None
None
None
None
1.73
3.40
5.00
1.75
3.35
4.20
1.85
3.20
3.50
1.80
3.30
3.75
NaN
NaN
NaN
1.70
3.30
4.33
1.90
3.30
4.00
1.65
3.40
4.50
1.78
3.25
4.00
1.73
3.40
4.20
1
2
1
1
2008/2009
1
2008-08-16
492474
10000
9994
0
0
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
None
None
None
None
None
None
None
None
1.95
3.20
3.60
1.80
3.30
3.95
1.90
3.20
3.50
1.90
3.20
3.50
NaN
NaN
NaN
1.83
3.30
3.60
1.95
3.30
3.80
2.00
3.25
3.25
1.85
3.25
3.75
1.91
3.25
3.60
The following variables have non-cleaned HTML/XML-like text values and many missing values (45% cases with NAs), so they will not be included in the further analysis:
goal
shoton
shotoff
foulcommit
card
cross
corner
possession
Variables with player coordinates (such as home_player_X1 through away_player_Y11) will be excluded too.
Dataset contains columns with betting odds information from various betting websites. In betting odds-related variable names (e.g.: B365H), the first few symbols indicates betting websites and the meaning of the last letter is following:
A – Away wins,
D – Draw,
H – Home wins.
These variables can renamed to make easier-to-understand variable names. Next, betting odds from some websites abbreviated as PS (57% NAs), SJ (34%), GB (45%), BS (45%) have many missing values.
Other highlights from the profiling report:
as expected, distribution of matches show yearly patterns (section on variable date in data profiling report).
correlation between various betting odds is high (section on correlation in the report). This could be investigated in more detail.
Details: Text columns to exclude
This is just a short illustration of the issue (see column top with the most frequent values of lines goal and below). See the column of missing values in the overview of match table. More details can be explored in the data profiling report for match table.
if do_eda: eda.ProfileReport( player, title="Data Profiling Report: player", config_file="_config/ydata_profile_config--mini.yaml", )
3.5 Table Player_Attributes
Table Player_Attributes contains 183,978 records on 11,060 players about various properties of theirs. Variables in this dataset have from 0.45% to 1.46% of missing values.
Some numeric variables are bimodal and they might indicate different importance for different player roles:
ball_control
interceptions
marking
standing_tackle
sliding_tackle
Goalkeeper-related variables also have distinct distribution: a few players high scores (most probably they are goalkeepers) and many with low scores (most probably the remaining roles):
gk_diving
gk_handling
gk_kicking
gk_positioning
gk_reflexes
There are 3 categorical variables including 2 related to working rates. FIFA defines working rates categories as either “low”, “medium” or “high” [1]. In the dataset there are more values in these columns, and the additional values can be treated as errors in most cases especially when values make no sense. What is more, comparing attacking and defensive work rate columns, some errors in one column indicate what kind of errors will be in the other column. Some of those errors are characteristic only to data dated before 2012, which indicates that this might be data scraping errors or missing information on the scrapped webpages.
if do_eda: eda.ProfileReport( team, title="Data Profiling Report: team", config_file="_config/ydata_profile_config--mini.yaml", )
3.7 Table Team_Attributes
Table teams_attributes dataset contains 1,458 records on 288 teams. It is 11 teams less than in teams dataset. What is more, data is available only from year 2010.
Some variables like buildUpPlayDribbling and buildUpPlayDribblingClass have both numeric (without word Class in the name) and categorical (with word Class) versions. Graphical inspection show that numeric values in categorical classes do not overlap.
Categorical variables buildUpPlayPositioningClass, chanceCreationPositioningClass, and defenceDefenderLineClass do not have numeric equivalents.
The tables in this section were imported for exploratory purposes only. In the next section they will be imported in the form that is needed to answer the main questions of this analysis.
In this section, data will be imported and pre-processed to create the following tables required for the main analyses:
To present analyzed counties and leagues:
leagues
To compare resultativeness by leagues and seasons:
goals_summary
To identify and analyze top teams:
teams_top_bottom_goals
teams_wins_per_season
To identify top players in 2015/2016 and what factors make them best:
players
To investigate, if home advantage exists:
matches
To investigate relationship between betting odds from different companies/websites:
matches_betting_odds
For team score prediction in a match:
team_train
team_test
For match outcome (home wins, draw, away wins) prediction:
match_train
match_test
Some additional tables will be created ad-hoc in the analysis section.
4.1 Import
This section contains code that imports data to Python. Some pre-processing in SQL is also performed.
Before importing into Python:
tables country and league were merged and the result was called leagues.
new column country in table was created where Scotland and England were treated as the same country United Kingdom, UK,
column region was created to indicate regions of UK.
Code: Import leagues (country + league)
query ="""--sqlSELECT l.id league_id, CASE WHEN c.name IN ('England', 'Scotland') THEN 'United Kingdom' ELSE c.name END country, CASE WHEN c.name IN ('England', 'Scotland') THEN c.name ELSE '' END region, l.name leagueFROM Country c FULL JOIN League l ON ( l.country_id = c.id );"""leagues = pd.read_sql_query(query, db)# Printleagues.head(2)
Table 4.1. Inspection: a few rows of table leagues.
league_id
country
region
league
0
1
Belgium
Belgium Jupiler League
1
1729
United Kingdom
England
England Premier League
Code
leagues.shape
(11, 4)
Before importing into Python, team, and team_attributes tables were merged.
Table 4.6. Inspection: a few rows of table matches (2).
match_id
country
region
league
season
stage
match_date
home_team_id
away_team_id
home_team_goal
away_team_goal
goal_sum
goal_diff
goal_diff_sign
match_winner
home_player_1
home_player_2
home_player_3
home_player_4
home_player_5
home_player_6
home_player_7
home_player_8
home_player_9
home_player_10
home_player_11
away_player_1
away_player_2
away_player_3
away_player_4
away_player_5
away_player_6
away_player_7
away_player_8
away_player_9
away_player_10
away_player_11
B365_home_wins
B365_draw
B365_away_wins
BW_home_wins
BW_draw
BW_away_wins
IW_home_wins
IW_draw
IW_away_wins
LB_home_wins
LB_draw
LB_away_wins
PS_home_wins
PS_draw
PS_away_wins
WH_home_wins
WH_draw
WH_away_wins
SJ_home_wins
SJ_draw
SJ_away_wins
VC_home_wins
VC_draw
VC_away_wins
GB_home_wins
GB_draw
GB_away_wins
BS_home_wins
BS_draw
BS_away_wins
B365_ratio_ha
BW_ratio_ha
PS_ratio_ha
VC_ratio_ha
IW_ratio_ha
WH_ratio_ha
GB_ratio_ha
LB_ratio_ha
SJ_ratio_ha
BS_ratio_ha
B365_log_ratio_ha
BW_log_ratio_ha
PS_log_ratio_ha
VC_log_ratio_ha
IW_log_ratio_ha
WH_log_ratio_ha
GB_log_ratio_ha
LB_log_ratio_ha
SJ_log_ratio_ha
BS_log_ratio_ha
24558
24559
Switzerland
Switzerland Super League
2008/2009
1
2008-07-18
10192
9931
1
2
3
-1
-1
Away Wins
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
24559
24560
Switzerland
Switzerland Super League
2008/2009
1
2008-07-19
9930
10179
3
1
4
2
1
Home Wins
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
Code: Create goals_summary
# Goals summary dataset for each league and seasongoals_summary = ( matches.groupby(["league", "season"]) .goal_sum.agg(["count", "sum"]) .rename(columns={"count": "n_matches_total", "sum": "n_goals_total"}) .eval("n_goals_per_match = n_goals_total/n_matches_total"))# Rank leagues by average number of goals per match per seasonleagues_by_goals = [*goals_summary.groupby("league") .n_goals_per_match.mean() .sort_values(ascending=False) .index]# Sort leagues by "resultativeness": update goals summary tablegoals_summary = ( goals_summary.reset_index() .assign(league=lambda x: x.league.to_category(leagues_by_goals)) .set_index(["league", "season"]) .sort_index())# Reorder categories (leagues) in `match`matches = matches.to_category("league", leagues_by_goals)
From matches, let’s create a dataset matches_long_team with one row per team:
Code: Create matches_long_team
# Add column `won_or_lost`, which indicates match status for the team:def team_won_or_lost(df):"""Return outcome if a team won or lost a match or there was draw."""if df.match_winner =="Draw":return"draw"elif (df.team_type =="home") and (df.match_winner =="Home Wins"):return"won"elif (df.team_type =="away") and (df.match_winner =="Away Wins"):return"won"else:return"lost"def negate_for_away_team(df):"""Negate goal difference for away team. Negative goal difference here means that the team lost. """if df.team_type =="away":return-df.goal_diffelse:return df.goal_diffmatches_long_team = ( matches.pivot_longer( column_names=re.compile("^(home|away)_(.+)"), names_pattern="^(home|away)_(.+)", names_to=("team_type", ".value"), sort_by_appearance=True, ) .to_category("team_type") .rename(columns={"team_goal": "team_goals"}) .assign( team_outcome=lambda x: x.apply(team_won_or_lost, axis=1), team_goal_diff=lambda x: x.apply(negate_for_away_team, axis=1), team_goal_diff_sign=lambda x: np.sign(x.team_goal_diff), ) .relocate("team_id", before="B365_home_wins") .relocate("team_type", before="B365_home_wins") .relocate("team_goals", before="B365_home_wins") .relocate("team_goal_diff", before="B365_home_wins") .relocate("team_goal_diff_sign", before="B365_home_wins") .relocate("team_outcome", before="B365_home_wins"))# Check outputprint("Expected ratio is 2, got: ", matches_long_team.shape[0] / matches.shape[0])matches_long_team.head(2)
Expected ratio is 2, got: 2.0
Table 4.7. Inspection: a few rows of table matches_long_team (1).
match_id
country
region
league
season
stage
match_date
goal_sum
goal_diff
goal_diff_sign
match_winner
team_id
team_type
team_goals
team_goal_diff
team_goal_diff_sign
team_outcome
B365_home_wins
B365_draw
B365_away_wins
BW_home_wins
BW_draw
BW_away_wins
IW_home_wins
IW_draw
IW_away_wins
LB_home_wins
LB_draw
LB_away_wins
PS_home_wins
PS_draw
PS_away_wins
WH_home_wins
WH_draw
WH_away_wins
SJ_home_wins
SJ_draw
SJ_away_wins
VC_home_wins
VC_draw
VC_away_wins
GB_home_wins
GB_draw
GB_away_wins
BS_home_wins
BS_draw
BS_away_wins
B365_ratio_ha
BW_ratio_ha
PS_ratio_ha
VC_ratio_ha
IW_ratio_ha
WH_ratio_ha
GB_ratio_ha
LB_ratio_ha
SJ_ratio_ha
BS_ratio_ha
B365_log_ratio_ha
BW_log_ratio_ha
PS_log_ratio_ha
VC_log_ratio_ha
IW_log_ratio_ha
WH_log_ratio_ha
GB_log_ratio_ha
LB_log_ratio_ha
SJ_log_ratio_ha
BS_log_ratio_ha
player_1
player_2
player_3
player_4
player_5
player_6
player_7
player_8
player_9
player_10
player_11
0
24559
Switzerland
Switzerland Super League
2008/2009
1
2008-07-18
3
-1
-1
Away Wins
10192
home
1
-1
-1
lost
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
1
24559
Switzerland
Switzerland Super League
2008/2009
1
2008-07-18
3
-1
-1
Away Wins
9931
away
2
1
1
won
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
To each match, merge last known information about the team before that particular match.
Note, that only teams, which had information in team_attributes, will be merged (merge requires non-null values in team_info_date). As some teams did not have this information, their names were not merged too, as they were present in teams table. if this is an issue, team and team_attributes should be merged separately to matches_long_team.
Table 4.8. Inspection: a few rows of table matches_long_team (2).
match_id
country
region
league
season
stage
match_date
team_info_date
goal_sum
goal_diff
goal_diff_sign
match_winner
team_id
team_type
team_goals
team_goal_diff
team_goal_diff_sign
team_outcome
B365_home_wins
B365_draw
B365_away_wins
BW_home_wins
BW_draw
BW_away_wins
IW_home_wins
IW_draw
IW_away_wins
LB_home_wins
LB_draw
LB_away_wins
PS_home_wins
PS_draw
PS_away_wins
WH_home_wins
WH_draw
WH_away_wins
SJ_home_wins
SJ_draw
SJ_away_wins
VC_home_wins
VC_draw
VC_away_wins
GB_home_wins
GB_draw
GB_away_wins
BS_home_wins
BS_draw
BS_away_wins
B365_ratio_ha
BW_ratio_ha
PS_ratio_ha
VC_ratio_ha
IW_ratio_ha
WH_ratio_ha
GB_ratio_ha
LB_ratio_ha
SJ_ratio_ha
BS_ratio_ha
B365_log_ratio_ha
BW_log_ratio_ha
PS_log_ratio_ha
VC_log_ratio_ha
IW_log_ratio_ha
WH_log_ratio_ha
GB_log_ratio_ha
LB_log_ratio_ha
SJ_log_ratio_ha
BS_log_ratio_ha
player_1
player_2
player_3
player_4
player_5
player_6
player_7
player_8
player_9
player_10
player_11
team_name
team_short_name
buildUpPlayPositioningClass
chanceCreationPositioningClass
defenceDefenderLineClass
buildUpPlaySpeed
buildUpPlayDribbling
buildUpPlayPassing
chanceCreationPassing
chanceCreationCrossing
chanceCreationShooting
defencePressure
defenceAggression
defenceTeamWidth
buildUpPlaySpeedClass
buildUpPlayDribblingClass
buildUpPlayPassingClass
chanceCreationPassingClass
chanceCreationCrossingClass
chanceCreationShootingClass
defencePressureClass
defenceAggressionClass
defenceTeamWidthClass
51956
25949
Switzerland
Switzerland Super League
2015/2016
36
2016-05-25
2015-09-10
4
2
1
Home Wins
10243
home
3
2
1
won
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
7621.00
197757.00
115700.00
113235.00
121080.00
41116.00
632356.00
465399.00
462608.00
198082.00
3517.00
FC Zürich
ZUR
Organised
Organised
Cover
62.00
49.00
46.00
47.00
50.00
54.00
47.00
43.00
56.00
Balanced
Normal
Mixed
Normal
Normal
Normal
Medium
Press
Normal
51957
25949
Switzerland
Switzerland Super League
2015/2016
36
2016-05-25
2015-09-10
4
2
1
Home Wins
9824
away
1
-2
-1
lost
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
154261.00
294256.00
257845.00
41621.00
214344.00
114011.00
56868.00
488297.00
113227.00
531309.00
493418.00
FC Vaduz
VAD
Organised
Organised
Cover
53.00
32.00
56.00
38.00
53.00
46.00
42.00
33.00
58.00
Balanced
Little
Mixed
Normal
Normal
Normal
Medium
Contain
Normal
Let’s calculate goal scoring statistics of each team per season for team analysis.
Code: Create teams_goals_per_team
teams_goals_per_team = ( matches_long_team.groupby(["team_name", "season", "league"]) .team_goals.agg(["count", "sum"]) .rename({"count": "n_matches", "sum": "n_goals"}, axis=1) .reset_index()# Exclude teams that did not play in that season .query("n_matches > 0")# Goals per match .eval("n_goals_per_match = n_goals/n_matches"))teams_goals_per_team.head(n=2)
Table 4.9. Inspection: a few rows of table teams_goals_per_team.
team_name
season
league
n_matches
n_goals
n_goals_per_match
0
1. FC Kaiserslautern
2010/2011
Germany 1. Bundesliga
34
48
1.41
1
1. FC Kaiserslautern
2011/2012
Germany 1. Bundesliga
34
24
0.71
Let’s find several best and worst performing teams in all leagues per season.
Code: Create teams_top_bottom_goals
# Select Top 5 and Bottom 5 teams (by **goals per match**)# in each season (all leagues)def select_5(data, column: str, best: bool=True):"""Select best/worst teams Args: data (pandas.dataframe) column (str): column name to perform computations on. best(bool): Should the best (if True) of worst (if False) be found? If several teams share the same result as the 5-th, then more than 5 teams are returned. """if best:return data.nlargest(5, column, keep="all")else:return data.nsmallest(5, column, keep="all")def select_5_per_season_by_goals(best: bool):"""Select best/worst teams in each season"""return ( teams_goals_per_team.groupby("season", as_index=False) .apply(select_5, "n_goals_per_match", best=best) .sort_values(["season", "n_goals_per_match"], ascending=[True, False]) .reset_index(drop=True) )teams_top_bottom_goals = pd.concat( [ select_5_per_season_by_goals(best=True).assign(which="Top 5"), select_5_per_season_by_goals(best=False).assign(which="Bottom 5"), ]).index_start_at(1)# Previewpd.concat([teams_top_bottom_goals.head(n=2), teams_top_bottom_goals.tail(n=2)])
Table 4.10. Inspection: a few rows of table teams_top_bottom_goals.
team_name
season
league
n_matches
n_goals
n_goals_per_match
which
1
Ajax
2009/2010
Netherlands Eredivisie
10
37
3.70
Top 5
2
Chelsea
2009/2010
England Premier League
11
40
3.64
Top 5
72
Aston Villa
2015/2016
England Premier League
38
27
0.71
Bottom 5
73
Boavista FC
2015/2016
Portugal Liga ZON Sagres
31
21
0.68
Bottom 5
Let’s find Top teams by percentage of matches that they won.
Table 4.11. Inspection: a few rows of table teams_wins_per_season. Columns Lost, Draw, Won indicate percentage of games per season with the indicated outcome.
Lost
Draw
Won
Season
League
Team
2009/2010
Belgium Jupiler League
RSC Anderlecht
0.00
0.00
100.00
Netherlands Eredivisie
Ajax
0.00
0.00
100.00
From matches_long_team, let’s create dataset matches_long_player with one row per player:
Code: Create matches_long_player
matches_long_player = matches_long_team.pivot_longer( column_names=re.compile("player_.+"), names_pattern="player_(.+)", names_to="player_no", values_to="player_id", sort_by_appearance=True,)print("Expected ratio is 11, got: ", matches_long_player.shape[0] / matches_long_team.shape[0],)matches_long_player.head(2)
Expected ratio is 11, got: 11.0
Table 4.12. Inspection: a few rows of table matches_long_player (1).
match_id
country
region
league
season
stage
match_date
team_info_date
goal_sum
goal_diff
goal_diff_sign
match_winner
team_id
team_type
team_goals
team_goal_diff
team_goal_diff_sign
team_outcome
B365_home_wins
B365_draw
B365_away_wins
BW_home_wins
BW_draw
BW_away_wins
IW_home_wins
IW_draw
IW_away_wins
LB_home_wins
LB_draw
LB_away_wins
PS_home_wins
PS_draw
PS_away_wins
WH_home_wins
WH_draw
WH_away_wins
SJ_home_wins
SJ_draw
SJ_away_wins
VC_home_wins
VC_draw
VC_away_wins
GB_home_wins
GB_draw
GB_away_wins
BS_home_wins
BS_draw
BS_away_wins
B365_ratio_ha
BW_ratio_ha
PS_ratio_ha
VC_ratio_ha
IW_ratio_ha
WH_ratio_ha
GB_ratio_ha
LB_ratio_ha
SJ_ratio_ha
BS_ratio_ha
B365_log_ratio_ha
BW_log_ratio_ha
PS_log_ratio_ha
VC_log_ratio_ha
IW_log_ratio_ha
WH_log_ratio_ha
GB_log_ratio_ha
LB_log_ratio_ha
SJ_log_ratio_ha
BS_log_ratio_ha
team_name
team_short_name
buildUpPlayPositioningClass
chanceCreationPositioningClass
defenceDefenderLineClass
buildUpPlaySpeed
buildUpPlayDribbling
buildUpPlayPassing
chanceCreationPassing
chanceCreationCrossing
chanceCreationShooting
defencePressure
defenceAggression
defenceTeamWidth
buildUpPlaySpeedClass
buildUpPlayDribblingClass
buildUpPlayPassingClass
chanceCreationPassingClass
chanceCreationCrossingClass
chanceCreationShootingClass
defencePressureClass
defenceAggressionClass
defenceTeamWidthClass
player_no
player_id
0
24559
Switzerland
Switzerland Super League
2008/2009
1
2008-07-18
NaT
3
-1
-1
Away Wins
10192
home
1
-1
-1
lost
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
1
NaN
1
24559
Switzerland
Switzerland Super League
2008/2009
1
2008-07-18
NaT
3
-1
-1
Away Wins
10192
home
1
-1
-1
lost
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
2
NaN
To each match, merge at that time last known information about each player.
Table 4.13. Inspection: a few rows of table matches_long_player (2).
match_id
country
region
league
season
stage
match_date
team_info_date
player_info_date
goal_sum
goal_diff
goal_diff_sign
match_winner
team_id
team_type
team_goals
team_goal_diff
team_goal_diff_sign
team_outcome
B365_home_wins
B365_draw
B365_away_wins
BW_home_wins
BW_draw
BW_away_wins
IW_home_wins
IW_draw
IW_away_wins
LB_home_wins
LB_draw
LB_away_wins
PS_home_wins
PS_draw
PS_away_wins
WH_home_wins
WH_draw
WH_away_wins
SJ_home_wins
SJ_draw
SJ_away_wins
VC_home_wins
VC_draw
VC_away_wins
GB_home_wins
GB_draw
GB_away_wins
BS_home_wins
BS_draw
BS_away_wins
B365_ratio_ha
BW_ratio_ha
PS_ratio_ha
VC_ratio_ha
IW_ratio_ha
WH_ratio_ha
GB_ratio_ha
LB_ratio_ha
SJ_ratio_ha
BS_ratio_ha
B365_log_ratio_ha
BW_log_ratio_ha
PS_log_ratio_ha
VC_log_ratio_ha
IW_log_ratio_ha
WH_log_ratio_ha
GB_log_ratio_ha
LB_log_ratio_ha
SJ_log_ratio_ha
BS_log_ratio_ha
team_name
team_short_name
buildUpPlayPositioningClass
chanceCreationPositioningClass
defenceDefenderLineClass
buildUpPlaySpeed
buildUpPlayDribbling
buildUpPlayPassing
chanceCreationPassing
chanceCreationCrossing
chanceCreationShooting
defencePressure
defenceAggression
defenceTeamWidth
buildUpPlaySpeedClass
buildUpPlayDribblingClass
buildUpPlayPassingClass
chanceCreationPassingClass
chanceCreationCrossingClass
chanceCreationShootingClass
defencePressureClass
defenceAggressionClass
defenceTeamWidthClass
player_no
player_id
player_name
birthday
birth_year
height
weight_kg
bmi
overall_rating
potential
preferred_foot
attacking_work_rate
defensive_work_rate
crossing
finishing
heading_accuracy
short_passing
volleys
dribbling
curve
free_kick_accuracy
long_passing
ball_control
acceleration
sprint_speed
agility
reactions
balance
shot_power
jumping
stamina
strength
long_shots
aggression
interceptions
positioning
vision
penalties
marking
standing_tackle
sliding_tackle
gk_diving
gk_handling
gk_kicking
gk_positioning
gk_reflexes
player_age
542279
25949
Switzerland
Switzerland Super League
2015/2016
36
2016-05-25
2015-09-10
2016-04-21
4
2
1
Home Wins
9824
away
1
-2
-1
lost
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
FC Vaduz
VAD
Organised
Organised
Cover
53.00
32.00
56.00
38.00
53.00
46.00
42.00
33.00
58.00
Balanced
Little
Mixed
Normal
Normal
Normal
Medium
Contain
Normal
10
531309
Robin Kamber
1996-02-15
1996
187.96
82.99
23.49
52.00
65.00
right
high
low
49.00
47.00
49.00
63.00
47.00
50.00
53.00
48.00
66.00
56.00
65.00
60.00
56.00
48.00
55.00
48.00
31.00
45.00
71.00
40.00
43.00
31.00
46.00
60.00
50.00
43.00
45.00
42.00
12.00
11.00
8.00
9.00
8.00
20.27
542280
25949
Switzerland
Switzerland Super League
2015/2016
36
2016-05-25
2015-09-10
2016-03-03
4
2
1
Home Wins
9824
away
1
-2
-1
lost
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
FC Vaduz
VAD
Organised
Organised
Cover
53.00
32.00
56.00
38.00
53.00
46.00
42.00
33.00
58.00
Balanced
Little
Mixed
Normal
Normal
Normal
Medium
Contain
Normal
11
493418
Albion Avdijaj
1994-01-12
1994
190.50
79.82
21.99
54.00
61.00
right
medium
medium
29.00
60.00
72.00
40.00
47.00
47.00
30.00
23.00
23.00
49.00
48.00
53.00
62.00
47.00
49.00
49.00
67.00
60.00
66.00
42.00
50.00
24.00
56.00
48.00
56.00
27.00
20.00
21.00
7.00
6.00
15.00
10.00
8.00
22.37
Code
matches_long_player.shape
(542281, 139)
Let’s aggregate player information (numeric variables) to get one row for team-match combination.
Code: Create table team_player_summary
# Prepare dataset for aggregationinclude = ["height","weight_kg","bmi","overall_rating","potential","crossing","finishing","heading_accuracy","short_passing","volleys","dribbling","curve","free_kick_accuracy","long_passing","ball_control","acceleration","sprint_speed","agility","reactions","balance","shot_power","jumping","stamina","strength","long_shots","aggression","interceptions","positioning","vision","penalties","marking","standing_tackle","sliding_tackle","gk_diving","gk_handling","gk_kicking","gk_positioning","gk_reflexes","player_age",]for_agg = matches_long_player.groupby(["match_id", "team_id"])[include]# Evaluate which cases include all 11 playersn_player_ok = for_agg.count().min(axis=1).to_frame("players_summarized")percent_ok =round(n_player_ok.eval("players_summarized==11").mean(), 3) *100print(f"In {percent_ok:.1f}% cases, summaries include all 11 players. \n""Only these cases will be analyzed next.")# Calculate summary statistics for each selected attribute.# Include only those cases where 11 players are aggregated.team_player_summary = for_agg.agg(["min", "mean", "std", "max"])team_player_summary.columns = ( team_player_summary.columns.to_flat_index().str.join("__"))team_player_summary = ( n_player_ok.join(team_player_summary) .query("players_summarized==11") .drop(columns="players_summarized"))team_player_summary.head(2)
In 82.9% cases, summaries include all 11 players.
Only these cases will be analyzed next.
Table 4.14. Inspection: a few rows of table team_player_summary.
height__min
height__mean
height__std
height__max
weight_kg__min
weight_kg__mean
weight_kg__std
weight_kg__max
bmi__min
bmi__mean
bmi__std
bmi__max
overall_rating__min
overall_rating__mean
overall_rating__std
overall_rating__max
potential__min
potential__mean
potential__std
potential__max
crossing__min
crossing__mean
crossing__std
crossing__max
finishing__min
finishing__mean
finishing__std
finishing__max
heading_accuracy__min
heading_accuracy__mean
heading_accuracy__std
heading_accuracy__max
short_passing__min
short_passing__mean
short_passing__std
short_passing__max
volleys__min
volleys__mean
volleys__std
volleys__max
dribbling__min
dribbling__mean
dribbling__std
dribbling__max
curve__min
curve__mean
curve__std
curve__max
free_kick_accuracy__min
free_kick_accuracy__mean
free_kick_accuracy__std
free_kick_accuracy__max
long_passing__min
long_passing__mean
long_passing__std
long_passing__max
ball_control__min
ball_control__mean
ball_control__std
ball_control__max
acceleration__min
acceleration__mean
acceleration__std
acceleration__max
sprint_speed__min
sprint_speed__mean
sprint_speed__std
sprint_speed__max
agility__min
agility__mean
agility__std
agility__max
reactions__min
reactions__mean
reactions__std
reactions__max
balance__min
balance__mean
balance__std
balance__max
shot_power__min
shot_power__mean
shot_power__std
shot_power__max
jumping__min
jumping__mean
jumping__std
jumping__max
stamina__min
stamina__mean
stamina__std
stamina__max
strength__min
strength__mean
strength__std
strength__max
long_shots__min
long_shots__mean
long_shots__std
long_shots__max
aggression__min
aggression__mean
aggression__std
aggression__max
interceptions__min
interceptions__mean
interceptions__std
interceptions__max
positioning__min
positioning__mean
positioning__std
positioning__max
vision__min
vision__mean
vision__std
vision__max
penalties__min
penalties__mean
penalties__std
penalties__max
marking__min
marking__mean
marking__std
marking__max
standing_tackle__min
standing_tackle__mean
standing_tackle__std
standing_tackle__max
sliding_tackle__min
sliding_tackle__mean
sliding_tackle__std
sliding_tackle__max
gk_diving__min
gk_diving__mean
gk_diving__std
gk_diving__max
gk_handling__min
gk_handling__mean
gk_handling__std
gk_handling__max
gk_kicking__min
gk_kicking__mean
gk_kicking__std
gk_kicking__max
gk_positioning__min
gk_positioning__mean
gk_positioning__std
gk_positioning__max
gk_reflexes__min
gk_reflexes__mean
gk_reflexes__std
gk_reflexes__max
player_age__min
player_age__mean
player_age__std
player_age__max
match_id
team_id
145
8635
167.64
183.34
7.08
193.04
60.77
78.87
9.00
93.88
21.62
23.39
1.32
25.59
57.00
69.45
4.80
75.00
69.00
74.36
2.84
78.00
29.00
57.82
14.86
78.00
23.00
49.27
18.31
71.00
25.00
61.00
16.77
83.00
51.00
65.73
8.91
78.00
9.00
47.82
21.34
69.00
23.00
55.45
18.62
85.00
11.00
48.27
19.80
77.00
23.00
51.18
19.39
79.00
48.00
62.82
10.22
79.00
51.00
64.45
10.52
82.00
48.00
66.00
9.83
78.00
58.00
68.91
6.49
77.00
48.00
64.73
10.53
82.00
57.00
67.64
6.77
82.00
47.00
68.64
13.04
91.00
25.00
62.00
16.96
85.00
61.00
67.82
4.75
77.00
55.00
73.45
9.17
85.00
42.00
68.73
16.60
91.00
23.00
53.27
17.77
74.00
32.00
67.45
16.62
93.00
31.00
62.09
15.47
82.00
13.00
59.64
20.44
83.00
49.00
67.64
10.49
84.00
42.00
62.64
13.46
83.00
24.00
55.36
19.07
74.00
22.00
57.09
20.20
78.00
12.00
56.55
22.62
74.00
1.00
14.64
18.03
67.00
20.00
25.82
13.71
67.00
48.00
62.82
10.22
79.00
20.00
25.64
13.11
65.00
20.00
26.09
14.61
70.00
18.90
25.71
3.56
31.00
146
9987
170.18
181.26
6.84
193.04
60.77
73.92
8.41
89.80
20.34
22.42
1.16
24.10
54.00
64.09
6.11
72.00
62.00
71.27
5.39
83.00
22.00
54.64
14.66
75.00
22.00
48.91
18.53
74.00
22.00
51.09
13.92
75.00
26.00
56.00
12.77
72.00
25.00
53.73
14.28
72.00
22.00
54.18
19.54
77.00
25.00
53.18
13.47
70.00
11.00
48.18
17.97
72.00
42.00
56.00
9.27
67.00
22.00
58.45
17.30
76.00
56.00
68.18
6.55
77.00
48.00
69.27
8.87
79.00
37.00
65.36
10.46
75.00
56.00
64.27
5.82
72.00
51.00
63.73
7.79
77.00
22.00
56.18
17.67
81.00
61.00
67.09
3.91
73.00
43.00
66.82
9.71
83.00
47.00
63.27
12.55
89.00
22.00
51.73
16.66
69.00
44.00
62.36
11.41
82.00
30.00
59.27
11.59
72.00
30.00
57.45
16.59
77.00
25.00
60.82
13.47
74.00
31.00
58.55
15.44
82.00
21.00
45.09
17.70
65.00
21.00
47.36
17.32
74.00
22.00
49.27
20.23
72.00
1.00
13.82
16.52
62.00
20.00
24.91
12.68
63.00
42.00
56.27
9.57
67.00
20.00
24.55
11.48
59.00
20.00
25.00
12.98
64.00
18.78
23.39
3.22
27.43
Let’s prepare datasets for predictive modelling. Several principles I followed:
mainly numeric variables will be included in the analysis. Exception will be for variable team_type witch indicates if the team is playing at home or away.
some variables (especially betting odds which are highly inter-correlated) with may missing values were also excluded in order not to have more complete cases.
Prepare betting odds for team analysis: instead of home_wins, away_wins which are less correct in this analysis, here _win (victory) and _loose (loss) betting odds will be used. Ratios will be calculate using the new variables accordingly. The ratios will wave names ending in _ratio_wl (ratio betting odds to win / betting odds to loose).
Table 4.15. Inspection: a few rows of table team_betting_odds.
B365_draw
BW_draw
IW_draw
LB_draw
WH_draw
VC_draw
B365_win
BW_win
VC_win
IW_win
WH_win
LB_win
B365_loose
BW_loose
VC_loose
IW_loose
WH_loose
LB_loose
B365_ratio_wl
BW_ratio_wl
VC_ratio_wl
IW_ratio_wl
WH_ratio_wl
LB_ratio_wl
B365_log_ratio_wl
BW_log_ratio_wl
VC_log_ratio_wl
IW_log_ratio_wl
WH_log_ratio_wl
LB_log_ratio_wl
match_id
team_id
24491
8305
3.80
3.80
4.00
3.80
3.80
4.00
1.70
1.70
1.73
1.60
1.60
1.70
5.00
4.60
5.00
4.80
5.00
4.50
0.34
0.37
0.35
0.33
0.32
0.38
-1.08
-1.00
-1.06
-1.10
-1.14
-0.97
4702
8678
4.20
4.00
3.70
3.80
4.00
4.10
5.25
5.00
5.20
4.50
4.75
5.25
1.67
1.67
1.67
1.70
1.67
1.65
3.14
2.99
3.11
2.65
2.84
3.18
1.15
1.10
1.14
0.97
1.05
1.16
Code
team_betting_odds.shape
(44864, 30)
Code: Create table matches_long_team1
cols = matches_long_team.columns# Remove player info and column with many NA valuescondition_1 = ("player_", "buildUpPlayDribbling")# Remove some categorical variables and betting odds infocondition_2 = ("Class", "_draw", "_wins", "_ha")col_index =~(cols.str.startswith(condition_1) | cols.str.endswith(condition_2))# Remove rows with no team inforow_index =~matches_long_team.team_info_date.isna()matches_long_team1 = matches_long_team.loc[row_index, col_index]# Join team info, player info and betting odds infomatches_long_team1 = ( matches_long_team1.set_index(["match_id", "team_id"]) .join(team_player_summary) .join(team_betting_odds))# Remove intermediate resultsdel [cols, condition_1, condition_2, col_index, row_index]# Inspectmatches_long_team1.head(2)
Table 4.16. Inspection: a few rows of table matches_long_team1.
country
region
league
season
stage
match_date
team_info_date
goal_sum
goal_diff
goal_diff_sign
match_winner
team_type
team_goals
team_goal_diff
team_goal_diff_sign
team_outcome
team_name
team_short_name
buildUpPlaySpeed
buildUpPlayPassing
chanceCreationPassing
chanceCreationCrossing
chanceCreationShooting
defencePressure
defenceAggression
defenceTeamWidth
height__min
height__mean
height__std
height__max
weight_kg__min
weight_kg__mean
weight_kg__std
weight_kg__max
bmi__min
bmi__mean
bmi__std
bmi__max
overall_rating__min
overall_rating__mean
overall_rating__std
overall_rating__max
potential__min
potential__mean
potential__std
potential__max
crossing__min
crossing__mean
crossing__std
crossing__max
finishing__min
finishing__mean
finishing__std
finishing__max
heading_accuracy__min
heading_accuracy__mean
heading_accuracy__std
heading_accuracy__max
short_passing__min
short_passing__mean
short_passing__std
short_passing__max
volleys__min
volleys__mean
volleys__std
volleys__max
dribbling__min
dribbling__mean
dribbling__std
dribbling__max
curve__min
curve__mean
curve__std
curve__max
free_kick_accuracy__min
free_kick_accuracy__mean
free_kick_accuracy__std
free_kick_accuracy__max
long_passing__min
long_passing__mean
long_passing__std
long_passing__max
ball_control__min
ball_control__mean
ball_control__std
ball_control__max
acceleration__min
acceleration__mean
acceleration__std
acceleration__max
sprint_speed__min
sprint_speed__mean
sprint_speed__std
sprint_speed__max
agility__min
agility__mean
agility__std
agility__max
reactions__min
reactions__mean
reactions__std
reactions__max
balance__min
balance__mean
balance__std
balance__max
shot_power__min
shot_power__mean
shot_power__std
shot_power__max
jumping__min
jumping__mean
jumping__std
jumping__max
stamina__min
stamina__mean
stamina__std
stamina__max
strength__min
strength__mean
strength__std
strength__max
long_shots__min
long_shots__mean
long_shots__std
long_shots__max
aggression__min
aggression__mean
aggression__std
aggression__max
interceptions__min
interceptions__mean
interceptions__std
interceptions__max
positioning__min
positioning__mean
positioning__std
positioning__max
vision__min
vision__mean
vision__std
vision__max
penalties__min
penalties__mean
penalties__std
penalties__max
marking__min
marking__mean
marking__std
marking__max
standing_tackle__min
standing_tackle__mean
standing_tackle__std
standing_tackle__max
sliding_tackle__min
sliding_tackle__mean
sliding_tackle__std
sliding_tackle__max
gk_diving__min
gk_diving__mean
gk_diving__std
gk_diving__max
gk_handling__min
gk_handling__mean
gk_handling__std
gk_handling__max
gk_kicking__min
gk_kicking__mean
gk_kicking__std
gk_kicking__max
gk_positioning__min
gk_positioning__mean
gk_positioning__std
gk_positioning__max
gk_reflexes__min
gk_reflexes__mean
gk_reflexes__std
gk_reflexes__max
player_age__min
player_age__mean
player_age__std
player_age__max
B365_draw
BW_draw
IW_draw
LB_draw
WH_draw
VC_draw
B365_win
BW_win
VC_win
IW_win
WH_win
LB_win
B365_loose
BW_loose
VC_loose
IW_loose
WH_loose
LB_loose
B365_ratio_wl
BW_ratio_wl
VC_ratio_wl
IW_ratio_wl
WH_ratio_wl
LB_ratio_wl
B365_log_ratio_wl
BW_log_ratio_wl
VC_log_ratio_wl
IW_log_ratio_wl
WH_log_ratio_wl
LB_log_ratio_wl
match_id
team_id
22055
10267
Spain
Spain LIGA BBVA
2009/2010
23
2010-02-22
2010-02-22
3
1
1
Home Wins
home
2
1
1
won
Valencia CF
VAL
30.00
30.00
55.00
60.00
70.00
55.00
60.00
60.00
170.18
178.95
4.87
185.42
67.12
74.99
4.62
82.09
22.37
23.40
0.65
24.67
77.00
80.55
3.64
88.00
80.00
84.82
3.66
91.00
21.00
62.91
24.16
89.00
21.00
55.09
26.52
94.00
21.00
65.00
17.37
82.00
21.00
73.00
18.74
90.00
9.00
57.18
23.11
87.00
21.00
68.18
22.03
89.00
21.00
64.64
20.46
88.00
11.00
58.55
22.02
86.00
55.00
67.18
8.39
86.00
32.00
75.18
16.22
91.00
35.00
73.18
15.08
88.00
35.00
72.73
14.96
87.00
45.00
68.82
14.90
87.00
59.00
77.00
8.93
93.00
56.00
74.45
8.15
82.00
21.00
69.64
20.05
91.00
58.00
69.09
9.20
85.00
46.00
74.36
10.71
85.00
58.00
73.18
9.52
85.00
21.00
63.27
24.01
88.00
52.00
72.36
10.68
89.00
60.00
76.09
8.93
89.00
11.00
74.82
22.26
93.00
58.00
75.64
10.76
90.00
66.00
74.27
6.10
86.00
21.00
55.36
29.53
86.00
21.00
54.82
28.98
85.00
8.00
56.18
27.17
81.00
7.00
16.00
20.37
77.00
20.00
26.82
15.05
72.00
55.00
67.18
8.39
86.00
20.00
28.27
19.86
88.00
20.00
27.09
15.95
75.00
21.65
28.60
4.54
38.48
4.00
3.80
3.70
3.60
3.50
3.75
1.57
1.50
1.53
1.55
1.53
1.50
5.50
6.50
6.50
6.00
6.00
5.50
0.29
0.23
0.24
0.26
0.26
0.27
-1.25
-1.47
-1.45
-1.35
-1.37
-1.30
8305
Spain
Spain LIGA BBVA
2009/2010
23
2010-02-22
2010-02-22
3
1
1
Home Wins
away
1
-1
-1
lost
Getafe CF
GET
30.00
35.00
35.00
50.00
70.00
40.00
30.00
50.00
175.26
182.42
3.74
187.96
68.03
75.82
3.93
81.18
21.83
22.77
0.61
23.87
72.00
75.45
2.38
80.00
75.00
80.18
3.71
86.00
24.00
62.36
17.15
89.00
21.00
51.64
22.73
80.00
24.00
64.18
15.67
82.00
24.00
68.18
16.82
83.00
11.00
55.91
20.40
74.00
24.00
63.09
17.31
83.00
7.00
59.00
20.98
86.00
12.00
59.45
19.65
80.00
49.00
68.09
12.36
85.00
26.00
67.45
14.20
75.00
68.00
74.18
4.87
82.00
63.00
72.82
6.24
83.00
57.00
66.91
6.44
80.00
59.00
71.73
5.39
77.00
62.00
71.55
5.92
81.00
24.00
66.18
20.86
92.00
49.00
69.55
8.72
81.00
48.00
72.55
10.57
85.00
66.00
74.73
6.17
87.00
24.00
60.09
17.19
83.00
27.00
66.82
16.46
82.00
58.00
70.64
5.68
76.00
11.00
68.27
19.60
83.00
55.00
66.91
11.09
84.00
55.00
65.09
5.03
73.00
23.00
57.00
24.42
87.00
24.00
59.73
23.25
84.00
12.00
56.82
26.10
84.00
1.00
13.18
20.49
74.00
20.00
27.55
16.45
77.00
49.00
68.45
12.44
85.00
20.00
27.36
15.85
75.00
20.00
27.91
17.65
81.00
20.85
26.45
3.64
33.95
4.00
3.80
3.70
3.60
3.50
3.75
5.50
6.50
6.50
6.00
6.00
5.50
1.57
1.50
1.53
1.55
1.53
1.50
3.50
4.33
4.25
3.87
3.92
3.67
1.25
1.47
1.45
1.35
1.37
1.30
Code
matches_long_team1.shape
(39840, 212)
In the following code blocks:
Separate datasets were created for team-related and match-related analysis.
Data were split ro train and sets sets:
Training data was included all seasons except the last one.
In test set there were data from the last season only (2015/2016).
Non-complete cases were removed.
Code: Tables and variables for team-related predictive modeling`
del [ teams, teams_goals_per_team, team_player_summary, team_betting_odds, matches_long_team, matches_long_team1, matches_long_team2, matches_long_player,]
5 Analysis
This is the main part there the most important data-based insights are created.
At the top of each subsection, there will be one or several questions provided.
Next, the summary and the main findings of that subsection will follow.
Next, the details (plots, tables, etc.) will be provided.
Each subsection main contain ad-hoc data preprocessing and analysis code.
5.1 Included Countries and Leagues
Which leagues are in which countries?
There are 10 countries (Scotland and England are parts of United Kingdom, UK) and 11 leagues in the database: 1 league per country except UK. See details in map 5.1 and Table 5.1.
Which leagues score the most and the fewest goals?
Are there any goal-scoring patterns between seasons?
Main points of this section:
Leagues differ in number of matches per season:
Fewest games are played in Switzerland (180 matches per season);
Most games are played in Italy, France, England, and Spain (380 matches per season).
As some matches are missing from the datasets and leagues are of different size, it is more correct to compare leagues by the goals per match than by total goals scored.
Leagues differ by resultativeness:
Most scoring league is in the Netherlands (3.08 goals per match) and it does not differ significantly from leagues in Switzerland, and Germany.
Least scoring league is in Poland (2.42 goals per match) and it does not differ significantly from France, Portugal, Italy, and Scotland.
By comparing seasons, no significant patterns (differences) were found.
Find the details in the following subsections.
5.2.1 Both (Leagues and Seasons)
First, slices of each league and season were analyzed (Table 5.2). The result revealed that, e.g., in Belgium Jupiler League 2013/2014, some games are clearly missing: Wikipedia article indicates that 299 matches were played and Table 5.2 shows only 12. Looking at Wikipedia pages of some other seasons and leagues, it is clear that in some cases all games are included in the dataset, but in other cases some games are missing. So it is not correct to compare total matches and total goals per league. Average number of goals per match is a more appropriate measure.
Table 5.2. Goal statistics for each league and season.
n_matches_total
n_goals_total
n_goals_per_match
league
season
Netherlands Eredivisie
2008/2009
306
870
2.84
2009/2010
306
892
2.92
2010/2011
306
987
3.23
2011/2012
306
997
3.26
2012/2013
306
964
3.15
2013/2014
306
978
3.20
2014/2015
306
942
3.08
2015/2016
306
912
2.98
Switzerland Super League
2008/2009
180
540
3.00
2009/2010
180
599
3.33
2010/2011
180
537
2.98
2011/2012
162
425
2.62
2012/2013
180
462
2.57
2013/2014
180
520
2.89
2014/2015
180
517
2.87
2015/2016
180
566
3.14
Germany 1. Bundesliga
2008/2009
306
894
2.92
2009/2010
306
866
2.83
2010/2011
306
894
2.92
2011/2012
306
875
2.86
2012/2013
306
898
2.93
2013/2014
306
967
3.16
2014/2015
306
843
2.75
2015/2016
306
866
2.83
Spain LIGA BBVA
2008/2009
380
1,101
2.90
2009/2010
380
1,031
2.71
2010/2011
380
1,042
2.74
2011/2012
380
1,050
2.76
2012/2013
380
1,091
2.87
2013/2014
380
1,045
2.75
2014/2015
380
1,009
2.66
2015/2016
380
1,043
2.74
Belgium Jupiler League
2008/2009
306
855
2.79
2009/2010
210
565
2.69
2010/2011
240
635
2.65
2011/2012
240
691
2.88
2012/2013
240
703
2.93
2013/2014
12
30
2.50
2014/2015
240
668
2.78
2015/2016
240
694
2.89
England Premier League
2008/2009
380
942
2.48
2009/2010
380
1,053
2.77
2010/2011
380
1,063
2.80
2011/2012
380
1,066
2.81
2012/2013
380
1,063
2.80
2013/2014
380
1,052
2.77
2014/2015
380
975
2.57
2015/2016
380
1,026
2.70
Scotland Premier League
2008/2009
228
548
2.40
2009/2010
228
585
2.57
2010/2011
228
584
2.56
2011/2012
228
601
2.64
2012/2013
228
623
2.73
2013/2014
228
626
2.75
2014/2015
228
587
2.57
2015/2016
228
650
2.85
Italy Serie A
2008/2009
380
988
2.60
2009/2010
380
992
2.61
2010/2011
380
955
2.51
2011/2012
358
925
2.58
2012/2013
380
1,003
2.64
2013/2014
380
1,035
2.72
2014/2015
379
1,018
2.69
2015/2016
380
979
2.58
Portugal Liga ZON Sagres
2008/2009
240
552
2.30
2009/2010
240
601
2.50
2010/2011
240
584
2.43
2011/2012
240
634
2.64
2012/2013
240
667
2.78
2013/2014
240
569
2.37
2014/2015
306
763
2.49
2015/2016
306
831
2.72
France Ligue 1
2008/2009
380
858
2.26
2009/2010
380
916
2.41
2010/2011
380
890
2.34
2011/2012
380
956
2.52
2012/2013
380
967
2.54
2013/2014
380
933
2.46
2014/2015
380
947
2.49
2015/2016
380
960
2.53
Poland Ekstraklasa
2008/2009
240
524
2.18
2009/2010
240
532
2.22
2010/2011
240
578
2.41
2011/2012
240
527
2.20
2012/2013
240
598
2.49
2013/2014
240
634
2.64
2014/2015
240
628
2.62
2015/2016
240
635
2.65
5.2.2 Leagues
This subsection concentrates more on leagues. It compares leagues by size, i.e., number of games per season (Figure 5.2) and by resultativeness, i.e., average number of goals per match (Figure 5.3). Numeric summaries are displayed in Table 5.3.
Table 5.3. Statistics of performance in each league and season: summaries for each league. Yellow cells indicate maximum and pale-red ones indicate minimum values in column.
n_matches_total
n_goals_total
n_goals_per_match
mean
std
mean
std
mean
std
league
Netherlands Eredivisie
306.0
0.0
942.8
46.9
3.08
0.15
Switzerland Super League
177.8
6.4
520.8
55.3
2.93
0.25
Germany 1. Bundesliga
306.0
0.0
887.9
37.0
2.90
0.12
Spain LIGA BBVA
380.0
0.0
1051.5
30.3
2.77
0.08
Belgium Jupiler League
216.0
86.7
605.1
246.3
2.76
0.15
England Premier League
380.0
0.0
1030.0
46.7
2.71
0.12
Scotland Premier League
228.0
0.0
600.5
31.8
2.63
0.14
Italy Serie A
377.1
7.7
986.9
34.8
2.62
0.07
Portugal Liga ZON Sagres
256.5
30.6
650.1
99.3
2.53
0.17
France Ligue 1
380.0
0.0
928.4
38.2
2.44
0.10
Poland Ekstraklasa
240.0
0.0
582.0
49.0
2.42
0.20
5.2.3 Seasons
This subsection concentrates more on seasons. It compares seasons by size, i.e., number of games per league (Figure 5.4) and by resultativeness, i.e., average number of goals per match (Figure 5.5). Numerical summaries are displayed in Table 5.4.
Table 5.4. Statistics of performance in each league and season: summaries for each season. Yellow cells indicate maximum and pale red ones indicate minimum values in column.
n_matches_total
n_goals_total
n_goals_per_match
mean
std
mean
std
mean
std
season
2008/2009
302.4
72.4
788.4
208.2
2.61
0.30
2009/2010
293.6
77.5
784.7
207.7
2.69
0.29
2010/2011
296.4
74.8
795.4
210.3
2.69
0.28
2011/2012
292.7
75.6
795.2
226.2
2.71
0.26
2012/2013
296.4
74.8
821.7
216.3
2.77
0.20
2013/2014
275.6
113.5
762.6
319.9
2.75
0.26
2014/2015
302.3
72.3
808.8
184.0
2.69
0.18
2015/2016
302.4
72.4
832.9
170.1
2.78
0.18
5.3 Top Teams
Which teams shows the best performance?
How do best and worst teams differ in resultativeness?
How many matches do the best teams win and loose?
The analysis of teams included data from 7 seasons.
To evaluate team’s performance, it was decided to count in how many seasons the team appeared between the Top 5 scoring teams (in terms of goals per match in that season). Table 5.5 shows that 12 teams appears in that list and Real Madrid CF (7 times in 7 seasons), FC Barcelona (6 times), and PSV (5 times) are the 3 leaders. Comparing best and worst teams, they performance differ by about 2 goals per match (Figure 5.6, Table 5.6).
Comparing teams that had highest percentage of won matches per season, most frequently SL Benfica (5 times in 7 seasons), FC Barcelona (5 times), Real Madrid CF (4 times), Celtic (4 times) were between Top 5 (Table 5.7).
To get among Top 5 winners, in some cases, it was sufficient to win as little as 73.7 % of matches but to loose no more than 15.8 % of matches (Table 5.7).
See the details below.
Code
print(f"Seasons in this analysis: {teams_top_bottom_goals.season.nunique()}")
Seasons in this analysis: 7
Code
( teams_top_bottom_goals.query("which == 'Top 5'") .team_name.value_counts() .to_df("Number of seasons (out of 7)", "Team") .index_start_at(1) .style)
Table 5.5. Number of seasons a teams was among Top 5 by number of goals per match.
Team
Number of seasons (out of 7)
1
Real Madrid CF
7
2
FC Barcelona
6
3
PSV
5
4
FC Bayern Munich
4
5
SL Benfica
3
6
Ajax
2
7
FC Porto
2
8
Manchester City
2
9
Chelsea
1
10
Roda JC Kerkrade
1
11
Celtic
1
12
Liverpool
1
13
Paris Saint-Germain
1
Compare the performance of Top 5 and Bottom 5 teams:
Code
ax = sns.scatterplot( teams_top_bottom_goals, x="season", y="n_goals_per_match", hue="which",)ax.tick_params(axis="x", rotation=90)ax.set_xlabel("Season")ax.set_ylabel("Number of goals per match")ax.set_ylim([0, 4.5])ax.get_legend().set_title(None)
Code
teams_top_bottom_goals_summary = ( teams_top_bottom_goals.groupby(["which"]) .n_goals_per_match.agg(["min", "mean", "std", "max"]) .sort_index(ascending=False) .reset_index())teams_top_bottom_goals_summary.columns = pd.MultiIndex.from_tuples( [ ("", "Group of teams"), ("Goals per match", "Min"), ("Goals per match", "Mean"), ("Goals per match", "SD"), ("Goals per match", "Max"), ])teams_top_bottom_goals_summary.style.format(precision=2).hide(axis="index")
Table 5.6. Summaries of Top 5 and Bottom 5 teams by number of goals per match in all 7 seasons.
Goals per match
Group of teams
Min
Mean
SD
Max
Top 5
2.32
2.77
0.31
3.70
Bottom 5
0.30
0.68
0.11
0.87
Code
( teams_wins_per_season.reset_index() .Team.value_counts() .to_df("Number of seasons (out of 7)", "Team") .index_start_at(1) .style)
Table 5.7. Number of seasons a team was among Top 5 by percentage of won matches.
Table 5.8. Top 5 teams by percentage of won matches in each season. Highest values in each column are in yellow and lowest ones are in pale red.
Lost
Draw
Won
Season
League
Team
2009/2010
Belgium Jupiler League
RSC Anderlecht
0.0 %
0.0 %
100.0 %
Netherlands Eredivisie
Ajax
0.0 %
0.0 %
100.0 %
Portugal Liga ZON Sagres
SL Benfica
10.0 %
0.0 %
90.0 %
Spain LIGA BBVA
Real Madrid CF
6.7 %
6.7 %
86.7 %
FC Barcelona
0.0 %
13.3 %
86.7 %
2010/2011
Portugal Liga ZON Sagres
FC Porto
0.0 %
10.0 %
90.0 %
Spain LIGA BBVA
FC Barcelona
5.3 %
15.8 %
78.9 %
Scotland Premier League
Rangers
13.2 %
7.9 %
78.9 %
Spain LIGA BBVA
Real Madrid CF
10.5 %
13.2 %
76.3 %
Scotland Premier League
Celtic
10.5 %
13.2 %
76.3 %
2011/2012
Spain LIGA BBVA
Real Madrid CF
5.3 %
10.5 %
84.2 %
Scotland Premier League
Celtic
13.2 %
7.9 %
78.9 %
Portugal Liga ZON Sagres
FC Porto
3.3 %
20.0 %
76.7 %
England Premier League
Manchester City
13.2 %
13.2 %
73.7 %
Spain LIGA BBVA
FC Barcelona
7.9 %
18.4 %
73.7 %
England Premier League
Manchester United
13.2 %
13.2 %
73.7 %
2012/2013
Germany 1. Bundesliga
FC Bayern Munich
2.9 %
11.8 %
85.3 %
Spain LIGA BBVA
FC Barcelona
5.3 %
10.5 %
84.2 %
Portugal Liga ZON Sagres
SL Benfica
3.3 %
16.7 %
80.0 %
FC Porto
0.0 %
20.0 %
80.0 %
England Premier League
Manchester United
13.2 %
13.2 %
73.7 %
2013/2014
Italy Serie A
Juventus
5.3 %
7.9 %
86.8 %
Germany 1. Bundesliga
FC Bayern Munich
5.9 %
8.8 %
85.3 %
Scotland Premier League
Celtic
2.6 %
15.8 %
81.6 %
Portugal Liga ZON Sagres
SL Benfica
6.7 %
16.7 %
76.7 %
Spain LIGA BBVA
Atlético Madrid
10.5 %
15.8 %
73.7 %
2014/2015
Netherlands Eredivisie
PSV
11.8 %
2.9 %
85.3 %
Portugal Liga ZON Sagres
SL Benfica
8.8 %
11.8 %
79.4 %
Spain LIGA BBVA
Real Madrid CF
15.8 %
5.3 %
78.9 %
FC Barcelona
10.5 %
10.5 %
78.9 %
Scotland Premier League
Celtic
10.5 %
13.2 %
76.3 %
2015/2016
Portugal Liga ZON Sagres
SL Benfica
11.8 %
2.9 %
85.3 %
Germany 1. Bundesliga
FC Bayern Munich
5.9 %
11.8 %
82.4 %
Portugal Liga ZON Sagres
Sporting CP
5.9 %
14.7 %
79.4 %
France Ligue 1
Paris Saint-Germain
5.3 %
15.8 %
78.9 %
Netherlands Eredivisie
PSV
5.9 %
17.6 %
76.5 %
5.4 Players in 2015/2016
This subsection deals with the analysis of football players in the most recent available season. It contains a link to the dashboard (a technical requirement of this project) too.
5.5 Analysis of Players
Which players are the best ones?
Which player attributes are related to being a good player?
How various player attributes relate to each other?
This analysis included players from season 2015/2016 (information of players announced after 2015-07-01, if several records are present, the most recent one is used).
Among 7057 included players, Top 5 players by the overall rating were: Lionel Messi, Cristiano Ronaldo, Neymar, Manuel Neuer, and Luis Suarez (Table 5.9). Player reactions (r=0.81) and potential (r=0.80) were the attributes most strongly correlated to overall rating (Table 5.11).
Correlation and hierarchical cluster analysis revealed that there are at least 2 groups of related player attributes: one of the major clusters seems to be associated to goal-keeping-related features and bigger values of physiological properties like body mass (variable “weight_kg”) and height are positive related to better goal keeping characteristics (Figure 5.7). The other cluster has several sub-clusters which might also be related to different roles of players but this idea should be investigated in more detail.
Heatmaps and clustered heatmaps in this project are very big as they contain many variables. I tried smaller plot size, but then every second variable name got hidden.
Some additional exploration of football players is available via this Looker Studio dashboard (preview in Figure 5.8). Only players with no missing data in their attributes are included in the dashboard.
5.6 Home Advantage: Is It Real?
Is there such a thing as home advantage?
If yes, can we quantify it?
The analysis of 25,979 matches revealed, that:
Teams that play at home wins 45.9% (CI 45.1%–46.6%) matches compared to 28.7% away winning and 25.4% draws. This difference is statistically significant (χ² test, p < 0.001).
On average, home teams score 0.38 goals more than away teams. This shift toward the home advantage is statistically significant (t-test, p < 0.001).
Comparing different leagues, they do differ by the degree of home advantage. E.g., in Spain LIGA BBVA home advantage is as high as 0.50 goals and in Scotland Premier League it is as low as 0.22.
Comparing different seasons, no significant differences were found.
Find the details below.
Code
# Count of Home wins, Draws and Away winscounts = matches.match_winner.value_counts(sort=False).rename("matches")res_counts = an.AnalyzeCounts(counts, "Match outcome").fit()res_counts.display()
Omnibus (chi-squared) test resultsChi square test, χ²(2, n = 25979) = 1881.57, p < 0.001
Counts of matches with 95% CI and post-hoc (pairwise chi-squared) test results
Match outcome
n_matches
percent
ci_lower
ci_upper
cld
spaced_cld
0
Away Wins
7,466
28.7%
28.1%
29.4%
a
a__
1
Draw
6,596
25.4%
24.7%
26.0%
b
_b_
2
Home Wins
11,917
45.9%
45.1%
46.6%
c
__c
Descriptive statistics of group (Match outcome) counts
What is the relationship between betting odds from different websites?
How strongly are betting odds related to match outcomes?
Odds ratios from different websites as well as ratio and log-ratio of home wins versus away wins betting odds are investigated in this subsection. The analysis (Figure 5.13) shows that:
odds of the same type (e.g., “home wins”) from different websites are strongly correlates between each other (Fig. 5.13).
odds of “draw” are more strongly related to “away wins” and almost not correlated to “home wins”.
the log-ratio of betting odds shows the highest correlation to football match outcome: r=-0.46 in case of B365 (bet365.com), log ratio of betting odds vs. difference of goals (home goals minus away goals), as shown in Table 5.12.
EDA: Missing Value Plots of matches_betting_odds table
It seems that missing value structure is characteristic for each betting website as betting odds variables of each betting website are clustered together by their missing value structure.
Heatmaps and clustered heatmaps in this project are very big as they contain many variables. I tried smaller plot size, but then every second variable name got hidden.
There are 5 types of variables related to betting odds (odds for home and away wins, draw, ratio and log ratio of home versus away wins).These types of odds are highly correlated in each category (see plot 5.13).
Can we predict how many goals each team will score in each match?
In this section, number of goals each team scores in a match is modeled.
As a reference, standard deviation of goals was calculated: SD = 1.26 goals.
The initial idea was to select 4 final models for the types of variables, that are available and different times before the match (team-related features, player-related features and betting odds and one model based on all types of variables), so:
Three separate models for 3 predictor types (team-related features, player-related features and betting odds) were created.
Models with all 3 feature types as well as PCA features were also among the candidates, but they did not improve cross-validation performance and were discarded (see Table 5.13).
Finally, only a single model was selected:
Models with team-related (train RMSE=1.24, R²=0.03) and player-related features (train RMSE=1.20, R²=0.09) had really poor performance and barely explained any variation in target variable (R²<0.15), so were also discarded (see Table 5.14).
In cases two cases (a. betting odds based model and b. model where all variables were between the candidates as possible predictors) the same RF model with a single variable B365_win (betting odds that team wins) was selected. Its test performance is RMSE=1.16, R²=0.15.
There is a debate if betting odds is a reliable predictor due to its nature (it is the output of other model, it changes frequently, etc.). Yet, in this analysis betting odds was the only type of predictors that allowed achieving model with minimum reasonable amount of explained variance (R²≥0.15).
Conclusion: there is a lot of randomness in the game, so basing on the available data it is hard to make reliable predictions in advance on how many goals a team will core..
The summary of the results is present in Tables 5.13 and 5.14.
The details are in the subsections below.
Code
target_sd = team_train[team_target].std()print("Standard deviation (SD) of target variable in training set: "f"{round(target_sd, 2)} goals")
Standard deviation (SD) of target variable in training set: 1.26 goals
5.8.1 Team-Related Features as Predictors
Linear Regression
Code
# Do SFS or take results from cachedef fun_sfs_res_team_team(): np.random.seed(250) estimator = LinearRegression() subset = [team_target, "team_type", *team_vars_team] X, y = team_train[subset].make_dummies(exclude=team_target)return ml.sfs(estimator, "regression").fit(X, y)file="saved-output/sfs_res_team_team.pickle"sfs_res_team_team = my.cached_results(file, fun_sfs_res_team_team)
Code
ml.sfs_plot_results( sfs_res_team_team,"Predictors: Team-Related Features (Linear Regression)", target_sd,);
k = 2, avg. RMSE = 1.241 [Parsimonious]
(Smallest number of predictors at best ± 1 SE score)
# Do SFS or take results from cachedef fun_sfs_res_team_all(): np.random.seed(250) estimator = LinearRegression() X, y = team_train.make_dummies(exclude=team_target)return ml.sfs(estimator, "regression", 50).fit(X, y)file="saved-output/sfs_res_team_all.pickle"sfs_res_team_all = my.cached_results(file, fun_sfs_res_team_all)
Code
ml.sfs_plot_results( sfs_res_team_all, "Predictors: All Variables (Linear Regression)");
k = 50, avg. RMSE = 1.156 [Best]
(Number of predictors at best score)
It was tried to create predictive model based on principal components instead of original numeric variables. PCA scree plot suggests that it is reasonable to use that 4 or 6 components as at these points the “elbow” point can be visible. Six components explain 56 % of variance. To explain 80% of variance, 27 components are needed.
n_pcs_80 = np.argwhere(pca_obj.explained_variance_ratio_.cumsum() >=0.80).min()print(f"Number of PCs needed to explain at least 80% of variance: {n_pcs_80}")
Number of PCs needed to explain at least 80% of variance: 27
This subsection summarizes the results from the subsections above and evaluates the performance on the whole training and test sets.
Basing on training CV RMSE:
Comparing 3 groups of predictors (team-related features, player-related features and betting odds), betting odds show the best predictive abilities and team-related features show the worst ones (see Table 5.13).
Comparing predictions based on original variables and PCs of these variables, PCs did not improve the predictions.
For the further investigation, 3 models were selected.
Table 5.13. Regression model selection results: selected models for each feature type and algorithm.
Features type
Method
Number of features selected
Training CV RMSE
Selected as final model
Note
Team-related
Linear regression
k = 2
1.241
No⁴
Random forest
k = 2
1.242
No
Player-related
Linear regression
k = 10
1.203
No⁴
Included¹: all Max. allowed²: 30 With k = 20, RMSE: 1.199
Random forest
k = 10
1.224
No
Included¹: 10
Betting odds
Linear regression
k = 2
1.162
No
Random forest
k = 1
1.162
Yes⁵
All variables
Linear regression
k = 6
1.158
No
Included¹: all Max. allowed²: 50
Random forest
k = 1
1.162
Yes⁵
Included¹: 12 The same model as in “Betting odds | Random forest”
6 PCs of all³ variables
Linear regression
k = 4
1.176
No
Included¹: 7
Random forest
k = 6
1.201
No
Included¹: 7
27 PCs of all³ variables
Linear regression
k = 5
1.160
No
Included¹: 28
Random forest
k = 13
1.180
No
Included¹: 28 With k = 17, RMSE: 1.178
¹ – Number of features included in SFS selection.
² – Maximum allowed number of features to be selected.
³ – PCs of all numeric variables.
⁴ – Model was a candidate to become a final model but rejected due to low performance.
⁵ – In both cases the same model was selected.
Two candidates to final models out of 3 were discarded due to low explained variance (R²<0.15; see Table 5.14).
Table 5.14. Final evaluation of selected models for team goal prediction.
set
n
SD
RMSE
R²
RMSE_SD_ratio
SD_RMSE_ratio
1
Train (team-related features)
27200
1.26
1.24
0.03
0.98
1.02
2
Train (player-related features)
27200
1.26
1.20
0.09
0.95
1.05
3
Train (all features/betting odds)
27200
1.26
1.16
0.16
0.92
1.09
4
Test (all features/betting odds)
5455
1.26
1.16
0.15
0.92
1.08
5.9 Match Outcome Prediction
Can we predict which team will win the match?
In this section, the output of the match (home wins, draw, away wins) is modeled.
The initial idea was to select 4 models: one from each feature type group as these features are available at different time before the match. But model based on team-related features showed low performance. And model based on all type of variables was rejected due to possible overfitting in preference to less complex model with 1 variable based on betting odds: these models share she same most important feature and inclusion of additional features only slightly improved model performance on training set. So only 2 final models were selected.
The test performance of the final models:
for the model based on player attributes accuracy is 50%, balanced accuracy is 42%;
for the model based on betting odds is as follows: accuracy 52%, balanced accuracy is 45%.
These models can be used in different situations when different typos of variables are available.
Unfortunately, both models are unable to predict outcome “draw” correctly. This might be related to the findings in section Relationship Between Betting Odds that betting odds of “draw” are correlated to the outcome “away wins”.
Conclusion: despite the fact that there is a lot of randomness in the game, decisions based on data can improve predictions on the football match outcome. Still, this prediction is not perfect.
The results of classification model selection are in Table 5.15. The performance of the selected models is presented in Table 5.16 and in the output below this table.
The details are in the subsections below.
5.9.1 Team-Related Features as Predictors
Logistic Regression
Code
# Do SFS or take results from cachedef fun_sfs_res_match_team(): np.random.seed(250) estimator = LogisticRegression( solver="newton-cg", multi_class="multinomial" ) subset = [match_target, *match_vars_team] X, y = match_train[subset].make_dummies(exclude=match_target)return ml.sfs(estimator, "classification").fit(X, y)file="saved-output/sfs_res_match_team.pickle"sfs_res_match_team = my.cached_results(file, fun_sfs_res_match_team)
Code
ml.sfs_plot_results( sfs_res_match_team,"Predictors: Team-Related Features (Logistic Regression)",);
k = 9, avg. BAcc = 0.347 [Parsimonious]
(Smallest number of predictors at best ± 1 SE score)
This subsection summarizes the results from the subsections above and evaluates the performance on whole training and test sets.
Table 5.15. Classification model selection results: selected models for each feature type and algorithm.
Features type
Method
Number of features selected
Training CV BAcc
Selected as final model
Notes
Team-related
Logistic regression
k = 9
0.347
No
Random forest
k = 3
0.350
No³
Player-related
Logistic regression
k = 9
0.451
Yes
Included¹: all Max. allowed²: 30 With k = 28, BAcc: 0.454.
Random forest
k = 7
0.430
No
Included¹: 21 Max. allowed²: 10 With k = 10, BAcc: 0.431.
Betting odds
Logistic regression
k = 1
0.454
Yes
Random forest
k = 1
0.445
No
Included¹: all Max. allowed²: 10
All Variables
Logistic regression
k = 4
0.458
No⁴
Included¹: all Max. allowed²: 10
Random forest
k = 7
0.451
No
Included¹: 50 Max. allowed²: 10
¹ – Number of features included in SFS selection.
² – Maximum allowed number of features to be selected.
³ – This model was a candidate to become the final model but it was rejected due to low performance.
⁴ – This model was a candidate to become the final model but it shares the same variable as betting odds based model and with 3 additional variables the performance increased only slightly. So model was rejected due to possible overfitting in preference to less complex model with 1 variable.
print("Classification Report\nTest set (betting odds based prediction)\n")ml.print_classification_report(y_test_2, y_pred_test_2, "test")
Classification Report
Test set (betting odds based prediction)
set n Accuracy BAcc BAcc_01 f1_macro f1_weighted Kappa
0 test 2575 0.52 0.45 0.18 0.39 0.45 0.21
precision recall f1-score support
Away Wins 0.48 0.56 0.52 790
Draw 0.00 0.00 0.00 641
Home Wins 0.54 0.79 0.64 1144
accuracy 0.52 2575
macro avg 0.34 0.45 0.39 2575
weighted avg 0.39 0.52 0.45 2575
Confusion matrix (rows - true, columns - predicted):
[[446 0 344]
[231 0 410]
[244 0 900]]
6 Summary
In this project, the European Football database, which includes data from seasons 2008/2009 to 2015/2016 was analyzed. Nine main questions in the “Analysis” section were answered. At the beginning of each main subsection, the most important findings were summarized. The game includes a lot of randomness but in some situations data-based approach can give additional valuable information about the European Football game.
6.1 Things to Improve
Some pre-precessing steps were performed but data from those steps were not included in the final analysis. These pre-processing steps could be removed from the analysis.
Some pre-processing steps should be explained in more detail in a written form.
I preferred .eval() over .assign() were possible and used .assign() elsewhere. Some users may find this as inconsistent coding style.
Some tables have names that are technical (n_goals) rather that natural for humans (e.g., “Number of goals”).
Variable names in the last part (e.g., y_train_1, y_pred_train_1) could have been more human-friendly.
Parameter tuning may improve RF performance.
Other types of machine learning algorithms (e.g., SVM, xgBoost) may capture the rends better and lead to better performance. This should be tested.
Some parts of this database (e.g., tables with player data) could be investigated in more detail to get even more insights.
Some plots (e.g., heat maps or cluster maps) are very large in order not to loose variable names. But these plots may not fin on the screen. So to fit then onto a screen, the user should make browser window narrower but as tall as it was before. On the other hand, some HTML output (profiling report) can be effectively studied only on wide screens.