Visually finding a statistical Toni Kroos replacement

If you have watched any of Real Madrid’s first few games in this season (2024/2025) and also previous seasons, you must have noticed a hasty and uncontrolled playing style in recent games. Very clearly this is due to the absence of Toni Kroos in midfield, and you don’t have to be a tactico to observe that. Even though he is truly one of a kind, I would like to use football data to find the most similar player in terms of statistics. A recent trip to PyData Amsterdam 2024 and the talk of Jeroen Janssens about ‘How I hacked UMAP and won at a plotting contest’ has inspired me to try and play around with Polars for a small side project, and this seems like the perfect one! Also the use of UMAP suits this project particularly well.

In an earlier post with football data we used a type of football data which is great for dimensionality reduction, which is what UMAP entails as a technique. The only problem with that data is that it was for the 2022-2023 season, and I want to use the most recent completed season. Fortunately the Kaggle dataset also showed the source and it was quite easy to find the source data for the 2023-2024 season.

Step 1: Obtaining the data

After clicking around on FBREF, I found the player stats of the 2023-2024 season, so let’s get digging!

… Diglett Dig Diglett Dig Diglett Dig Diglett Dig … — … *Diglett Dig Diglett Dig Diglett Dig Diglett Dig* …

The data is not yet in an easy format to work with. We could opt for pd.read_html() method, but that’s no fun! Besides, I think doing manual HTML inspecting and then extracting is a good skill to maintain (from time to time). As it is simply a matter of extracting some HTML content, beautifulsoup should be sufficient!

Terminal

pip install bs4 polars

Let’s first inspect the stats. There are several categories with each their own URL:

standard stats
goalkeeping
advanced goalkeeping
shooting
passing
pass types
goal and shot creation
possession
playing time
miscellaneous stats

Even though Toni Kroos is not a goalkeeper, I think it’s good to keep the goalkeeping stats for the final UMAP plot. We will, however, disregard the ‘standard stats’ and ‘miscellaneous stats’. Fortunately the URLs are similar and only differ by a simple substring.

Code

import requests
from bs4 import BeautifulSoup
import polars as pl


def extract_table_and_columns(stat: str) -> (pl.DataFrame, dict):
    """Extract table from url.

    Args:
        stat (str): The statistic of the table

    Returns:
        pl.DataFrame: A Polars dataframe with the information of the players.
        dict: A dictionary of columns and their descriptions.
    """
    base_url = "https://fbref.com/en/comps/Big5/2023-2024/{stat}/players/2023-2024-Big-5-European-Leagues-Stats"
    url = base_url.format(stat=stat)
    resp = requests.get(url)
    soup = BeautifulSoup(resp.content, features="lxml")

    # The HTML of the table
    stats_table = soup.find_all("table")[-1]

    # Parsing headers
    tr_header = stats_table.findChildren(name="tr")[1]
    column_descriptions = {
        cell.get("aria-label"): cell.get("data-tip")
        for cell in tr_header.find_all(["th", "td"])
    }
    columns = [key for key in column_descriptions.keys()]

    if len(columns) != len(set(columns)):
        raise AssertionError("Column names must be unique.")

    # Parsing stats
    table_body = stats_table.find_next(name="tbody")
    rows_players = table_body.findChildren(name="tr", attrs={"class": False})

    table_data: list = []
    for row in rows_players:
        cells = row.find_all(
            ["th", "td"]
        )  # Collect all cell data, both headers and regular cells
        table_data.append([cell.get_text(strip=True) for cell in cells])

    return pl.DataFrame(table_data, schema=columns, orient="row"), column_descriptions

Let’s look at the passing stats!

Code

df, descriptions = extract_table_and_columns("passing")
df

shape: (2_852, 33)

Rk	Player	Nation	Position	Squad	Competition	Age	Year of birth	90s Played	Passes Completed	Passes Attempted	Pass Completion %	Total Passing Distance	Progressive Passing Distance	Passes Completed (Short)	Passes Attempted (Short)	Pass Completion % (Short)	Passes Completed (Medium)	Passes Attempted (Medium)	Pass Completion % (Medium)	Passes Completed (Long)	Passes Attempted (Long)	Pass Completion % (Long)	Assists	xAG: Exp. Assisted Goals	xA: Expected Assists	Assists - xAG	Key Passes	Passes into Final Third	Passes into Penalty Area	Crosses into Penalty Area	Progressive Passes	Matches
str	str	str	str	str	str	str	str	str	str	str	str	str	str	str	str	str	str	str	str	str	str	str	str	str	str	str	str	str	str	str	str	str
"1"	"Max Aarons"	"engENG"	"DF"	"Bournemouth"	"engPremier League"	"23"	"2000"	"13.7"	"450"	"581"	"77.5"	"7402"	"2789"	"220"	"248"	"88.7"	"188"	"235"	"80.0"	"34"	"63"	"54.0"	"1"	"0.8"	"0.9"	"+0.2"	"7"	"25"	"13"	"2"	"43"	"Matches"
"2"	"Brenden Aaronson"	"usUSA"	"MF,FW"	"Union Berlin"	"deBundesliga"	"22"	"2000"	"14.1"	"365"	"472"	"77.3"	"4890"	"1506"	"206"	"240"	"85.8"	"105"	"130"	"80.8"	"19"	"32"	"59.4"	"2"	"1.9"	"2.0"	"+0.1"	"22"	"30"	"14"	"3"	"56"	"Matches"
"3"	"Paxten Aaronson"	"usUSA"	"MF"	"Eint Frankfurt"	"deBundesliga"	"19"	"2003"	"1.1"	"41"	"50"	"82.0"	"576"	"71"	"20"	"25"	"80.0"	"20"	"22"	"90.9"	"0"	"2"	"0.0"	"1"	"0.1"	"0.1"	"+0.9"	"1"	"4"	"2"	"0"	"5"	"Matches"
"4"	"Keyliane Abdallah"	"frFRA"	"FW"	"Marseille"	"frLigue 1"	"17"	"2006"	"0.0"	"1"	"1"	"100.0"	"8"	"0"	"1"	"1"	"100.0"	"0"	"0"	""	"0"	"0"	""	"0"	"0.0"	"0.0"	"0.0"	"0"	"0"	"0"	"0"	"0"	"Matches"
"5"	"Yunis Abdelhamid"	"maMAR"	"DF"	"Reims"	"frLigue 1"	"35"	"1987"	"30.9"	"1552"	"1836"	"84.5"	"29618"	"9672"	"487"	"548"	"88.9"	"893"	"976"	"91.5"	"141"	"252"	"56.0"	"0"	"0.3"	"0.6"	"-0.3"	"8"	"129"	"3"	"0"	"137"	"Matches"
…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…
"2848"	"Lovro Zvonarek"	"hrCRO"	"FW,MF"	"Bayern Munich"	"deBundesliga"	"18"	"2005"	"1.8"	"50"	"60"	"83.3"	"766"	"123"	"20"	"23"	"87.0"	"25"	"29"	"86.2"	"2"	"2"	"100.0"	"0"	"0.2"	"0.3"	"-0.2"	"2"	"5"	"1"	"0"	"7"	"Matches"
"2849"	"Martin Ødegaard"	"noNOR"	"MF"	"Arsenal"	"engPremier League"	"24"	"1998"	"34.3"	"1692"	"2006"	"84.3"	"25780"	"7898"	"907"	"992"	"91.4"	"635"	"744"	"85.3"	"91"	"140"	"65.0"	"10"	"9.6"	"11.2"	"+0.4"	"102"	"160"	"130"	"13"	"344"	"Matches"
"2850"	"Milan Đurić"	"baBIH"	"FW"	"Hellas Verona"	"itSerie A"	"33"	"1990"	"13.4"	"181"	"334"	"54.2"	"2116"	"392"	"130"	"213"	"61.0"	"32"	"69"	"46.4"	"5"	"12"	"41.7"	"1"	"1.2"	"0.5"	"-0.2"	"16"	"18"	"3"	"0"	"15"	"Matches"
"2851"	"Milan Đurić"	"baBIH"	"FW"	"Monza"	"itSerie A"	"33"	"1990"	"14.0"	"140"	"272"	"51.5"	"1700"	"251"	"93"	"156"	"59.6"	"35"	"86"	"40.7"	"0"	"3"	"0.0"	"1"	"0.7"	"0.2"	"+0.3"	"5"	"3"	"2"	"0"	"8"	"Matches"
"2852"	"Mateusz Łęgowski"	"plPOL"	"MF"	"Salernitana"	"itSerie A"	"20"	"2003"	"12.7"	"292"	"378"	"77.2"	"4268"	"1261"	"169"	"196"	"86.2"	"95"	"120"	"79.2"	"16"	"31"	"51.6"	"0"	"0.2"	"0.4"	"-0.2"	"7"	"30"	"5"	"1"	"37"	"Matches"

Damn, looks kinda ugly with the automatically inferred strings for every column. We are actually going to try something quick and nasty for this, like saving these as .csv files and then reading them again for a better inference.¹

Code

import os


def parse_columns(df: pl.DataFrame) -> pl.DataFrame:
    """Parse columns of a dataframe for better inference.

    Args:
        df (pl.DataFrame): A Polars dataframe which has structurally incorrect columns.

    Returns:
        pl.DataFrame: A Polars dataframe with better inferred column types.
    """
    df.write_csv("temp.csv", quote_style="necessary")
    parsed_df = pl.read_csv("temp.csv", use_pyarrow=True)
    if os.path.exists("temp.csv"):
        os.remove("temp.csv")
    return parsed_df


def extract_and_parse_table(stat) -> pl.DataFrame:
    df, descriptions = extract_table_and_columns(stat)
    return parse_columns(df)

extract_and_parse_table("passing")

shape: (2_852, 33)

Rk	Player	Nation	Position	Squad	Competition	Age	Year of birth	90s Played	Passes Completed	Passes Attempted	Pass Completion %	Total Passing Distance	Progressive Passing Distance	Passes Completed (Short)	Passes Attempted (Short)	Pass Completion % (Short)	Passes Completed (Medium)	Passes Attempted (Medium)	Pass Completion % (Medium)	Passes Completed (Long)	Passes Attempted (Long)	Pass Completion % (Long)	Assists	xAG: Exp. Assisted Goals	xA: Expected Assists	Assists - xAG	Key Passes	Passes into Final Third	Passes into Penalty Area	Crosses into Penalty Area	Progressive Passes	Matches
i64	str	str	str	str	str	i64	i64	f64	i64	i64	f64	i64	i64	i64	i64	f64	i64	i64	f64	i64	i64	f64	i64	f64	f64	f64	i64	i64	i64	i64	i64	str
1	"Max Aarons"	"engENG"	"DF"	"Bournemouth"	"engPremier League"	23	2000	13.7	450	581	77.5	7402	2789	220	248	88.7	188	235	80.0	34	63	54.0	1	0.8	0.9	0.2	7	25	13	2	43	"Matches"
2	"Brenden Aaronson"	"usUSA"	"MF,FW"	"Union Berlin"	"deBundesliga"	22	2000	14.1	365	472	77.3	4890	1506	206	240	85.8	105	130	80.8	19	32	59.4	2	1.9	2.0	0.1	22	30	14	3	56	"Matches"
3	"Paxten Aaronson"	"usUSA"	"MF"	"Eint Frankfurt"	"deBundesliga"	19	2003	1.1	41	50	82.0	576	71	20	25	80.0	20	22	90.9	0	2	0.0	1	0.1	0.1	0.9	1	4	2	0	5	"Matches"
4	"Keyliane Abdallah"	"frFRA"	"FW"	"Marseille"	"frLigue 1"	17	2006	0.0	1	1	100.0	8	0	1	1	100.0	0	0	null	0	0	null	0	0.0	0.0	0.0	0	0	0	0	0	"Matches"
5	"Yunis Abdelhamid"	"maMAR"	"DF"	"Reims"	"frLigue 1"	35	1987	30.9	1552	1836	84.5	29618	9672	487	548	88.9	893	976	91.5	141	252	56.0	0	0.3	0.6	-0.3	8	129	3	0	137	"Matches"
…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…
2848	"Lovro Zvonarek"	"hrCRO"	"FW,MF"	"Bayern Munich"	"deBundesliga"	18	2005	1.8	50	60	83.3	766	123	20	23	87.0	25	29	86.2	2	2	100.0	0	0.2	0.3	-0.2	2	5	1	0	7	"Matches"
2849	"Martin Ødegaard"	"noNOR"	"MF"	"Arsenal"	"engPremier League"	24	1998	34.3	1692	2006	84.3	25780	7898	907	992	91.4	635	744	85.3	91	140	65.0	10	9.6	11.2	0.4	102	160	130	13	344	"Matches"
2850	"Milan Đurić"	"baBIH"	"FW"	"Hellas Verona"	"itSerie A"	33	1990	13.4	181	334	54.2	2116	392	130	213	61.0	32	69	46.4	5	12	41.7	1	1.2	0.5	-0.2	16	18	3	0	15	"Matches"
2851	"Milan Đurić"	"baBIH"	"FW"	"Monza"	"itSerie A"	33	1990	14.0	140	272	51.5	1700	251	93	156	59.6	35	86	40.7	0	3	0.0	1	0.7	0.2	0.3	5	3	2	0	8	"Matches"
2852	"Mateusz Łęgowski"	"plPOL"	"MF"	"Salernitana"	"itSerie A"	20	2003	12.7	292	378	77.2	4268	1261	169	196	86.2	95	120	79.2	16	31	51.6	0	0.2	0.4	-0.2	7	30	5	1	37	"Matches"

This is much better! Now let’s refactor a little and put this together into a class:

Code

class StatsExtractor:
    def __init__(self, stat: str):
        self.stat: str = stat
        self.base_url: str = (
            "https://fbref.com/en/comps/Big5/2023-2024/{stat}/players/2023-2024-Big-5-European-Leagues-Stats"
        )
        self.url: str = self.base_url.format(stat=self.stat)
        self.df = None
        self.column_descriptions = None
        self.fetch_data()
        self.extract_column_descriptions()
        self.extract_stats()
        self.parse_columns()

    def fetch_data(self) -> None:
        """Fetch the HTML content from the URL."""
        url = self.base_url.format(stat=self.stat)
        resp = requests.get(url)
        soup = BeautifulSoup(resp.content, features="lxml")
        self.html = soup.find_all("table")[-1]

    def extract_column_descriptions(self) -> None:
        """Extract column descriptions"""
        tr_header = self.html.findChildren(name="tr")[1]
        column_descriptions = {
            cell.get("aria-label"): cell.get("data-tip")
            for cell in tr_header.find_all(["th", "td"])
        }
        columns = [key for key in column_descriptions.keys()]

        if len(columns) != len(set(columns)):
            raise AssertionError("Column names must be unique.")
        self.columns = columns
        self.column_descriptions = column_descriptions

    def extract_stats(self) -> None:
        """Extract table from URL and store it in a DataFrame."""
        table_body = self.html.find_next(name="tbody")
        rows_players = table_body.findChildren(name="tr", attrs={"class": False})

        table_data: list = []
        for row in rows_players:
            cells = row.find_all(["th", "td"])
            table_data.append([cell.get_text(strip=True) for cell in cells])

        self.df = pl.DataFrame(table_data, schema=self.columns, orient="row")

    def parse_columns(self) -> None:
        """Parse columns of a DataFrame for better inference."""
        self.df.write_csv("temp.csv", quote_style="necessary")
        parsed_df = pl.read_csv("temp.csv", use_pyarrow=True)
        # if os.path.exists("temp.csv"):
            # os.remove("temp.csv")
        self.df = parsed_df

We can now invoke the methods after creating a StatsExtractor instance.

Code

stats_to_extract = [
        "passing",
        "keepers",
        "shooting",
        "passing_types",
        "gca",
        "defense",
        "possession",
]
extractors = {stat: StatsExtractor(stat) for stat in stats_to_extract}
dfs = {stat: extractor.df for stat, extractor in extractors.items()}

With dfs we only need to create one combined dataframe, and actually get to learn more about the syntax of polars.

Step 2: Exploring `polars` by data wrangling

After viewing the documentation and some sample code, polars looks very similar to dplyr in R - and I have to say I am quite pleased. My career started with loads of R and in my own time and projects I have transitioned more towards Python. However, the one and only thing I really miss about R is how straightforward and concise the syntax is compared to pandas. Take the following two dataframes

Code

df_passing = dfs["passing"]
df_possession = dfs["possession"]

I am so used to not having duplicate columns when performing joins. However in Python using either pandas or polars, I have not found any other way than to drop the duplicate columns. Let’s join these dataframes on the columns Player and Squad ², and drop all the other duplicate columns. In polars we can do so using the selector cs.ends_with().

Code

import polars.selectors as cs

df_passing.join(df_possession, how="left", on=["Player", "Squad"]).drop(cs.ends_with("_right"))

shape: (2_852, 55)

Rk	Player	Nation	Position	Squad	Competition	Age	Year of birth	90s Played	Passes Completed	Passes Attempted	Pass Completion %	Total Passing Distance	Progressive Passing Distance	Passes Completed (Short)	Passes Attempted (Short)	Pass Completion % (Short)	Passes Completed (Medium)	Passes Attempted (Medium)	Pass Completion % (Medium)	Passes Completed (Long)	Passes Attempted (Long)	Pass Completion % (Long)	Assists	xAG: Exp. Assisted Goals	xA: Expected Assists	Assists - xAG	Key Passes	Passes into Final Third	Passes into Penalty Area	Crosses into Penalty Area	Progressive Passes	Matches	Touches	Touches (Def Pen)	Touches (Def 3rd)	Touches (Mid 3rd)	Touches (Att 3rd)	Touches (Att Pen)	Touches (Live-Ball)	Take-Ons Attempted	Successful Take-Ons	Successful Take-On %	Times Tackled During Take-On	Tackled During Take-On Percentage	Carries	Total Carrying Distance	Progressive Carrying Distance	Progressive Carries	Carries into Final Third	Carries into Penalty Area	Miscontrols	Dispossessed	Passes Received	Progressive Passes Rec
i64	str	str	str	str	str	i64	i64	f64	i64	i64	f64	i64	i64	i64	i64	f64	i64	i64	f64	i64	i64	f64	i64	f64	f64	f64	i64	i64	i64	i64	i64	str	i64	i64	i64	i64	i64	i64	i64	i64	i64	f64	i64	f64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64
1	"Max Aarons"	"engENG"	"DF"	"Bournemouth"	"engPremier League"	23	2000	13.7	450	581	77.5	7402	2789	220	248	88.7	188	235	80.0	34	63	54.0	1	0.8	0.9	0.2	7	25	13	2	43	"Matches"	711	43	252	303	165	11	711	34	14	41.2	12	35.3	364	2174	1121	22	12	7	13	8	371	26
2	"Brenden Aaronson"	"usUSA"	"MF,FW"	"Union Berlin"	"deBundesliga"	22	2000	14.1	365	472	77.3	4890	1506	206	240	85.8	105	130	80.8	19	32	59.4	2	1.9	2.0	0.1	22	30	14	3	56	"Matches"	675	11	108	301	293	47	675	77	34	44.2	41	53.2	406	2721	1387	37	29	9	41	38	457	91
3	"Paxten Aaronson"	"usUSA"	"MF"	"Eint Frankfurt"	"deBundesliga"	19	2003	1.1	41	50	82.0	576	71	20	25	80.0	20	22	90.9	0	2	0.0	1	0.1	0.1	0.9	1	4	2	0	5	"Matches"	72	0	6	39	28	5	72	7	2	28.6	4	57.1	43	193	50	2	0	1	5	4	46	7
4	"Keyliane Abdallah"	"frFRA"	"FW"	"Marseille"	"frLigue 1"	17	2006	0.0	1	1	100.0	8	0	1	1	100.0	0	0	null	0	0	null	0	0.0	0.0	0.0	0	0	0	0	0	"Matches"	1	0	0	1	0	0	1	0	0	null	0	null	1	3	0	0	0	0	0	0	1	0
5	"Yunis Abdelhamid"	"maMAR"	"DF"	"Reims"	"frLigue 1"	35	1987	30.9	1552	1836	84.5	29618	9672	487	548	88.9	893	976	91.5	141	252	56.0	0	0.3	0.6	-0.3	8	129	3	0	137	"Matches"	2185	293	976	1119	114	35	2184	15	8	53.3	7	46.7	1506	8663	4921	36	19	0	23	4	1403	9
…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…
2848	"Lovro Zvonarek"	"hrCRO"	"FW,MF"	"Bayern Munich"	"deBundesliga"	18	2005	1.8	50	60	83.3	766	123	20	23	87.0	25	29	86.2	2	2	100.0	0	0.2	0.3	-0.2	2	5	1	0	7	"Matches"	73	2	9	38	26	8	73	3	1	33.3	2	66.7	43	207	86	7	1	4	3	1	55	9
2849	"Martin Ødegaard"	"noNOR"	"MF"	"Arsenal"	"engPremier League"	24	1998	34.3	1692	2006	84.3	25780	7898	907	992	91.4	635	744	85.3	91	140	65.0	10	9.6	11.2	0.4	102	160	130	13	344	"Matches"	2326	24	244	926	1173	165	2324	81	37	45.7	38	46.9	1563	8190	3514	91	83	37	53	58	1827	202
2850	"Milan Đurić"	"baBIH"	"FW"	"Hellas Verona"	"itSerie A"	33	1990	13.4	181	334	54.2	2116	392	130	213	61.0	32	69	46.4	5	12	41.7	1	1.2	0.5	-0.2	16	18	3	0	15	"Matches"	421	14	41	227	153	46	418	3	2	66.7	1	33.3	183	529	151	3	6	1	22	8	352	33
2851	"Milan Đurić"	"baBIH"	"FW"	"Monza"	"itSerie A"	33	1990	14.0	140	272	51.5	1700	251	93	156	59.6	35	86	40.7	0	3	0.0	1	0.7	0.2	0.3	5	3	2	0	8	"Matches"	381	32	51	227	104	37	381	4	2	50.0	2	50.0	157	458	108	2	2	1	24	14	323	31
2852	"Mateusz Łęgowski"	"plPOL"	"MF"	"Salernitana"	"itSerie A"	20	2003	12.7	292	378	77.2	4268	1261	169	196	86.2	95	120	79.2	16	31	51.6	0	0.2	0.4	-0.2	7	30	5	1	37	"Matches"	504	31	120	283	111	11	504	21	6	28.6	12	57.1	265	1526	757	16	13	3	20	12	294	17

Exactly what I wanted! Now we can iterate this process until there are no dataframes left to merge on.

Code

from functools import reduce

df_stats = reduce(
    lambda left, right: left.join(right, how="left", on=["Player", "Squad"]).drop(
        cs.ends_with("_right")
    ),
    dfs.values()
)
df_stats

shape: (2_852, 133)

Rk	Player	Nation	Position	Squad	Competition	Age	Year of birth	90s Played	Passes Completed	Passes Attempted	Pass Completion %	Total Passing Distance	Progressive Passing Distance	Passes Completed (Short)	Passes Attempted (Short)	Pass Completion % (Short)	Passes Completed (Medium)	Passes Attempted (Medium)	Pass Completion % (Medium)	Passes Completed (Long)	Passes Attempted (Long)	Pass Completion % (Long)	Assists	xAG: Exp. Assisted Goals	xA: Expected Assists	Assists - xAG	Key Passes	Passes into Final Third	Passes into Penalty Area	Crosses into Penalty Area	Progressive Passes	Matches	Matches Played	Starts	Minutes	Goals Against	…	Tackles	Tackles Won	Tackles (Def 3rd)	Tackles (Mid 3rd)	Tackles (Att 3rd)	Dribblers Tackled	Dribbles Challenged	% of Dribblers Tackled	Challenges Lost	Blocks	Shots Blocked	Interceptions	Tkl+Int	Clearances	Errors	Touches	Touches (Def Pen)	Touches (Def 3rd)	Touches (Mid 3rd)	Touches (Att 3rd)	Touches (Att Pen)	Touches (Live-Ball)	Take-Ons Attempted	Successful Take-Ons	Successful Take-On %	Times Tackled During Take-On	Tackled During Take-On Percentage	Carries	Total Carrying Distance	Progressive Carrying Distance	Progressive Carries	Carries into Final Third	Carries into Penalty Area	Miscontrols	Dispossessed	Passes Received	Progressive Passes Rec
i64	str	str	str	str	str	i64	i64	f64	i64	i64	f64	i64	i64	i64	i64	f64	i64	i64	f64	i64	i64	f64	i64	f64	f64	f64	i64	i64	i64	i64	i64	str	i64	i64	str	i64	…	i64	i64	i64	i64	i64	i64	i64	f64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	f64	i64	f64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64
1	"Max Aarons"	"engENG"	"DF"	"Bournemouth"	"engPremier League"	23	2000	13.7	450	581	77.5	7402	2789	220	248	88.7	188	235	80.0	34	63	54.0	1	0.8	0.9	0.2	7	25	13	2	43	"Matches"	null	null	null	null	…	29	19	20	7	2	20	34	58.8	14	9	5	8	37	27	0	711	43	252	303	165	11	711	34	14	41.2	12	35.3	364	2174	1121	22	12	7	13	8	371	26
2	"Brenden Aaronson"	"usUSA"	"MF,FW"	"Union Berlin"	"deBundesliga"	22	2000	14.1	365	472	77.3	4890	1506	206	240	85.8	105	130	80.8	19	32	59.4	2	1.9	2.0	0.1	22	30	14	3	56	"Matches"	null	null	null	null	…	32	18	13	13	6	16	32	50.0	16	26	1	2	34	4	0	675	11	108	301	293	47	675	77	34	44.2	41	53.2	406	2721	1387	37	29	9	41	38	457	91
3	"Paxten Aaronson"	"usUSA"	"MF"	"Eint Frankfurt"	"deBundesliga"	19	2003	1.1	41	50	82.0	576	71	20	25	80.0	20	22	90.9	0	2	0.0	1	0.1	0.1	0.9	1	4	2	0	5	"Matches"	null	null	null	null	…	2	2	1	1	0	1	1	100.0	0	2	0	0	2	0	0	72	0	6	39	28	5	72	7	2	28.6	4	57.1	43	193	50	2	0	1	5	4	46	7
4	"Keyliane Abdallah"	"frFRA"	"FW"	"Marseille"	"frLigue 1"	17	2006	0.0	1	1	100.0	8	0	1	1	100.0	0	0	null	0	0	null	0	0.0	0.0	0.0	0	0	0	0	0	"Matches"	null	null	null	null	…	0	0	0	0	0	0	0	null	0	0	0	0	0	0	0	1	0	0	1	0	0	1	0	0	null	0	null	1	3	0	0	0	0	0	0	1	0
5	"Yunis Abdelhamid"	"maMAR"	"DF"	"Reims"	"frLigue 1"	35	1987	30.9	1552	1836	84.5	29618	9672	487	548	88.9	893	976	91.5	141	252	56.0	0	0.3	0.6	-0.3	8	129	3	0	137	"Matches"	null	null	null	null	…	64	35	36	23	5	26	45	57.8	19	51	32	39	103	109	2	2185	293	976	1119	114	35	2184	15	8	53.3	7	46.7	1506	8663	4921	36	19	0	23	4	1403	9
…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…
2848	"Lovro Zvonarek"	"hrCRO"	"FW,MF"	"Bayern Munich"	"deBundesliga"	18	2005	1.8	50	60	83.3	766	123	20	23	87.0	25	29	86.2	2	2	100.0	0	0.2	0.3	-0.2	2	5	1	0	7	"Matches"	null	null	null	null	…	0	0	0	0	0	0	5	0.0	5	3	0	0	0	2	0	73	2	9	38	26	8	73	3	1	33.3	2	66.7	43	207	86	7	1	4	3	1	55	9
2849	"Martin Ødegaard"	"noNOR"	"MF"	"Arsenal"	"engPremier League"	24	1998	34.3	1692	2006	84.3	25780	7898	907	992	91.4	635	744	85.3	91	140	65.0	10	9.6	11.2	0.4	102	160	130	13	344	"Matches"	null	null	null	null	…	49	17	18	21	10	17	58	29.3	41	23	1	15	64	5	0	2326	24	244	926	1173	165	2324	81	37	45.7	38	46.9	1563	8190	3514	91	83	37	53	58	1827	202
2850	"Milan Đurić"	"baBIH"	"FW"	"Hellas Verona"	"itSerie A"	33	1990	13.4	181	334	54.2	2116	392	130	213	61.0	32	69	46.4	5	12	41.7	1	1.2	0.5	-0.2	16	18	3	0	15	"Matches"	null	null	null	null	…	1	0	0	0	1	0	3	0.0	3	0	0	1	2	13	0	421	14	41	227	153	46	418	3	2	66.7	1	33.3	183	529	151	3	6	1	22	8	352	33
2851	"Milan Đurić"	"baBIH"	"FW"	"Monza"	"itSerie A"	33	1990	14.0	140	272	51.5	1700	251	93	156	59.6	35	86	40.7	0	3	0.0	1	0.7	0.2	0.3	5	3	2	0	8	"Matches"	null	null	null	null	…	5	5	0	5	0	2	4	50.0	2	6	2	0	5	27	0	381	32	51	227	104	37	381	4	2	50.0	2	50.0	157	458	108	2	2	1	24	14	323	31
2852	"Mateusz Łęgowski"	"plPOL"	"MF"	"Salernitana"	"itSerie A"	20	2003	12.7	292	378	77.2	4268	1261	169	196	86.2	95	120	79.2	16	31	51.6	0	0.2	0.4	-0.2	7	30	5	1	37	"Matches"	null	null	null	null	…	23	12	8	12	3	11	26	42.3	15	13	2	7	30	16	0	504	31	120	283	111	11	504	21	6	28.6	12	57.1	265	1526	757	16	13	3	20	12	294	17

Now it’s time to filter and select relevant features and change them slightly. Before doing alterations, I think it is fair to only include players who have played more than 90 minutes 15 times. Perhaps fair is not the right word, but including players who only played a handful of matches is misleading when evaluating their overall season performance. It does not capture consistency which is what we do want to include.

Code

df_stats.filter(pl.col("90s Played") >= 15)

shape: (1_215, 133)

Rk	Player	Nation	Position	Squad	Competition	Age	Year of birth	90s Played	Passes Completed	Passes Attempted	Pass Completion %	Total Passing Distance	Progressive Passing Distance	Passes Completed (Short)	Passes Attempted (Short)	Pass Completion % (Short)	Passes Completed (Medium)	Passes Attempted (Medium)	Pass Completion % (Medium)	Passes Completed (Long)	Passes Attempted (Long)	Pass Completion % (Long)	Assists	xAG: Exp. Assisted Goals	xA: Expected Assists	Assists - xAG	Key Passes	Passes into Final Third	Passes into Penalty Area	Crosses into Penalty Area	Progressive Passes	Matches	Matches Played	Starts	Minutes	Goals Against	…	Tackles	Tackles Won	Tackles (Def 3rd)	Tackles (Mid 3rd)	Tackles (Att 3rd)	Dribblers Tackled	Dribbles Challenged	% of Dribblers Tackled	Challenges Lost	Blocks	Shots Blocked	Interceptions	Tkl+Int	Clearances	Errors	Touches	Touches (Def Pen)	Touches (Def 3rd)	Touches (Mid 3rd)	Touches (Att 3rd)	Touches (Att Pen)	Touches (Live-Ball)	Take-Ons Attempted	Successful Take-Ons	Successful Take-On %	Times Tackled During Take-On	Tackled During Take-On Percentage	Carries	Total Carrying Distance	Progressive Carrying Distance	Progressive Carries	Carries into Final Third	Carries into Penalty Area	Miscontrols	Dispossessed	Passes Received	Progressive Passes Rec
i64	str	str	str	str	str	i64	i64	f64	i64	i64	f64	i64	i64	i64	i64	f64	i64	i64	f64	i64	i64	f64	i64	f64	f64	f64	i64	i64	i64	i64	i64	str	i64	i64	str	i64	…	i64	i64	i64	i64	i64	i64	i64	f64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	f64	i64	f64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64
5	"Yunis Abdelhamid"	"maMAR"	"DF"	"Reims"	"frLigue 1"	35	1987	30.9	1552	1836	84.5	29618	9672	487	548	88.9	893	976	91.5	141	252	56.0	0	0.3	0.6	-0.3	8	129	3	0	137	"Matches"	null	null	null	null	…	64	35	36	23	5	26	45	57.8	19	51	32	39	103	109	2	2185	293	976	1119	114	35	2184	15	8	53.3	7	46.7	1506	8663	4921	36	19	0	23	4	1403	9
6	"Salis Abdul Samed"	"ghGHA"	"MF"	"Lens"	"frLigue 1"	23	2000	16.9	796	895	88.9	12470	3008	393	433	90.8	330	360	91.7	41	54	75.9	0	0.5	1.1	-0.5	6	87	5	2	78	"Matches"	null	null	null	null	…	21	14	8	10	3	8	18	44.4	10	12	1	12	33	18	0	1022	21	168	647	218	7	1022	19	7	36.8	11	57.9	823	4361	1683	9	33	0	23	11	780	20
8	"Laurent Abergel"	"frFRA"	"MF"	"Lorient"	"frLigue 1"	30	1993	31.8	1551	1836	84.5	27382	8613	629	707	89.0	711	802	88.7	150	233	64.4	1	2.2	1.9	-1.2	30	188	23	3	194	"Matches"	null	null	null	null	…	85	52	43	34	8	38	96	39.6	58	29	6	61	146	61	3	2164	105	553	1288	349	10	2164	65	44	67.7	18	27.7	1649	8707	3892	38	49	3	32	30	1502	51
10	"Abner"	"brBRA"	"DF"	"Betis"	"esLa Liga"	23	2000	15.6	615	797	77.2	9211	4552	351	394	89.1	203	265	76.6	37	79	46.8	1	1.0	1.0	0.0	8	20	7	2	33	"Matches"	null	null	null	null	…	25	19	15	9	1	17	34	50.0	17	23	5	15	40	62	0	966	72	344	405	226	28	966	13	7	53.8	5	38.5	447	1802	949	14	10	5	18	11	502	58
12	"Abdel Abqar"	"maMAR"	"DF"	"Alavés"	"esLa Liga"	24	1999	25.7	553	762	72.6	12370	4872	141	164	86.0	302	339	89.1	99	225	44.0	0	0.2	0.2	-0.2	3	61	2	0	49	"Matches"	null	null	null	null	…	35	19	28	5	2	21	35	60.0	14	31	26	23	58	115	0	1032	191	570	422	42	21	1032	5	3	60.0	1	20.0	492	2460	1418	7	6	0	15	1	499	1
…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…
2839	"Joshua Zirkzee"	"nlNED"	"FW"	"Bologna"	"itSerie A"	22	2001	30.7	710	917	77.4	9106	1752	446	539	82.7	166	213	77.9	32	46	69.6	4	5.2	2.5	-1.2	43	40	28	0	86	"Matches"	null	null	null	null	…	32	20	8	15	9	13	24	54.2	11	25	2	3	35	36	0	1374	46	135	628	630	136	1372	128	52	40.6	72	56.3	814	4132	1772	53	46	22	114	64	1033	182
2844	"Kurt Zouma"	"frFRA"	"DF"	"West Ham"	"engPremier League"	28	1994	31.5	962	1148	83.8	18843	6791	335	365	91.8	509	558	91.2	111	198	56.1	0	0.2	0.1	-0.2	4	57	1	0	53	"Matches"	null	null	null	null	…	23	17	14	9	0	10	16	62.5	6	38	28	29	52	160	0	1455	275	798	626	34	28	1455	3	2	66.7	1	33.3	794	3751	2019	7	5	1	13	3	873	1
2845	"Igor Zubeldia"	"esESP"	"DF"	"Real Sociedad"	"esLa Liga"	26	1997	28.0	1476	1746	84.5	30351	9423	387	450	86.0	913	989	92.3	167	267	62.5	2	1.1	0.8	0.9	10	99	4	0	129	"Matches"	null	null	null	null	…	32	25	20	10	2	19	26	73.1	7	28	19	32	64	97	3	1983	162	908	1034	50	18	1983	8	4	50.0	4	50.0	1229	7537	4725	28	2	0	14	3	1316	0
2846	"Martín Zubimendi"	"esESP"	"MF"	"Real Sociedad"	"esLa Liga"	24	1999	29.5	1321	1542	85.7	22165	6269	607	677	89.7	607	677	89.7	86	126	68.3	1	1.5	2.0	-0.5	16	136	15	1	152	"Matches"	null	null	null	null	…	49	27	18	29	2	23	42	54.8	19	33	9	37	86	54	0	1793	91	397	1144	267	32	1793	18	12	66.7	5	27.8	942	5305	2671	34	41	5	26	16	1185	27
2849	"Martin Ødegaard"	"noNOR"	"MF"	"Arsenal"	"engPremier League"	24	1998	34.3	1692	2006	84.3	25780	7898	907	992	91.4	635	744	85.3	91	140	65.0	10	9.6	11.2	0.4	102	160	130	13	344	"Matches"	null	null	null	null	…	49	17	18	21	10	17	58	29.3	41	23	1	15	64	5	0	2326	24	244	926	1173	165	2324	81	37	45.7	38	46.9	1563	8190	3514	91	83	37	53	58	1827	202

Although players can play multiple positions, for the plots I prefer we keep it on a down low. The Position column now has these unique values:

Code

df_stats.select(pl.col("Position")).unique().to_series().to_list()

['DF,FW', 'GK', 'DF', 'MF,FW', 'FW,DF', 'MF', 'DF,MF', 'FW,MF', 'FW', 'MF,DF']

For the players that play multiple positions, we take the position specified first assuming it is their primary position. This will bring down the number of different positions to four!

Code

df_stats_relevant = (df_stats
    .filter(pl.col("90s Played") >= 15)
    .with_columns(pl.col("Position").str.slice(0, 2))
)

UMAP requires numerical data, so let’s only select numerical columns, which in this case are all non-string columns.

Code

df_stats_relevant.select(~cs.by_dtype(pl.String))

shape: (1_215, 126)

Rk	Age	Year of birth	90s Played	Passes Completed	Passes Attempted	Pass Completion %	Total Passing Distance	Progressive Passing Distance	Passes Completed (Short)	Passes Attempted (Short)	Pass Completion % (Short)	Passes Completed (Medium)	Passes Attempted (Medium)	Pass Completion % (Medium)	Passes Completed (Long)	Passes Attempted (Long)	Pass Completion % (Long)	Assists	xAG: Exp. Assisted Goals	xA: Expected Assists	Assists - xAG	Key Passes	Passes into Final Third	Passes into Penalty Area	Crosses into Penalty Area	Progressive Passes	Matches Played	Starts	Goals Against	Goals Against/90	Shots on Target Against	Saves	Save Percentage	Wins	Draws	Losses	…	Tackles	Tackles Won	Tackles (Def 3rd)	Tackles (Mid 3rd)	Tackles (Att 3rd)	Dribblers Tackled	Dribbles Challenged	% of Dribblers Tackled	Challenges Lost	Blocks	Shots Blocked	Interceptions	Tkl+Int	Clearances	Errors	Touches	Touches (Def Pen)	Touches (Def 3rd)	Touches (Mid 3rd)	Touches (Att 3rd)	Touches (Att Pen)	Touches (Live-Ball)	Take-Ons Attempted	Successful Take-Ons	Successful Take-On %	Times Tackled During Take-On	Tackled During Take-On Percentage	Carries	Total Carrying Distance	Progressive Carrying Distance	Progressive Carries	Carries into Final Third	Carries into Penalty Area	Miscontrols	Dispossessed	Passes Received	Progressive Passes Rec
i64	i64	i64	f64	i64	i64	f64	i64	i64	i64	i64	f64	i64	i64	f64	i64	i64	f64	i64	f64	f64	f64	i64	i64	i64	i64	i64	i64	i64	i64	f64	i64	i64	f64	i64	i64	i64	…	i64	i64	i64	i64	i64	i64	i64	f64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	f64	i64	f64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64
5	35	1987	30.9	1552	1836	84.5	29618	9672	487	548	88.9	893	976	91.5	141	252	56.0	0	0.3	0.6	-0.3	8	129	3	0	137	null	null	null	null	null	null	null	null	null	null	…	64	35	36	23	5	26	45	57.8	19	51	32	39	103	109	2	2185	293	976	1119	114	35	2184	15	8	53.3	7	46.7	1506	8663	4921	36	19	0	23	4	1403	9
6	23	2000	16.9	796	895	88.9	12470	3008	393	433	90.8	330	360	91.7	41	54	75.9	0	0.5	1.1	-0.5	6	87	5	2	78	null	null	null	null	null	null	null	null	null	null	…	21	14	8	10	3	8	18	44.4	10	12	1	12	33	18	0	1022	21	168	647	218	7	1022	19	7	36.8	11	57.9	823	4361	1683	9	33	0	23	11	780	20
8	30	1993	31.8	1551	1836	84.5	27382	8613	629	707	89.0	711	802	88.7	150	233	64.4	1	2.2	1.9	-1.2	30	188	23	3	194	null	null	null	null	null	null	null	null	null	null	…	85	52	43	34	8	38	96	39.6	58	29	6	61	146	61	3	2164	105	553	1288	349	10	2164	65	44	67.7	18	27.7	1649	8707	3892	38	49	3	32	30	1502	51
10	23	2000	15.6	615	797	77.2	9211	4552	351	394	89.1	203	265	76.6	37	79	46.8	1	1.0	1.0	0.0	8	20	7	2	33	null	null	null	null	null	null	null	null	null	null	…	25	19	15	9	1	17	34	50.0	17	23	5	15	40	62	0	966	72	344	405	226	28	966	13	7	53.8	5	38.5	447	1802	949	14	10	5	18	11	502	58
12	24	1999	25.7	553	762	72.6	12370	4872	141	164	86.0	302	339	89.1	99	225	44.0	0	0.2	0.2	-0.2	3	61	2	0	49	null	null	null	null	null	null	null	null	null	null	…	35	19	28	5	2	21	35	60.0	14	31	26	23	58	115	0	1032	191	570	422	42	21	1032	5	3	60.0	1	20.0	492	2460	1418	7	6	0	15	1	499	1
…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…
2839	22	2001	30.7	710	917	77.4	9106	1752	446	539	82.7	166	213	77.9	32	46	69.6	4	5.2	2.5	-1.2	43	40	28	0	86	null	null	null	null	null	null	null	null	null	null	…	32	20	8	15	9	13	24	54.2	11	25	2	3	35	36	0	1374	46	135	628	630	136	1372	128	52	40.6	72	56.3	814	4132	1772	53	46	22	114	64	1033	182
2844	28	1994	31.5	962	1148	83.8	18843	6791	335	365	91.8	509	558	91.2	111	198	56.1	0	0.2	0.1	-0.2	4	57	1	0	53	null	null	null	null	null	null	null	null	null	null	…	23	17	14	9	0	10	16	62.5	6	38	28	29	52	160	0	1455	275	798	626	34	28	1455	3	2	66.7	1	33.3	794	3751	2019	7	5	1	13	3	873	1
2845	26	1997	28.0	1476	1746	84.5	30351	9423	387	450	86.0	913	989	92.3	167	267	62.5	2	1.1	0.8	0.9	10	99	4	0	129	null	null	null	null	null	null	null	null	null	null	…	32	25	20	10	2	19	26	73.1	7	28	19	32	64	97	3	1983	162	908	1034	50	18	1983	8	4	50.0	4	50.0	1229	7537	4725	28	2	0	14	3	1316	0
2846	24	1999	29.5	1321	1542	85.7	22165	6269	607	677	89.7	607	677	89.7	86	126	68.3	1	1.5	2.0	-0.5	16	136	15	1	152	null	null	null	null	null	null	null	null	null	null	…	49	27	18	29	2	23	42	54.8	19	33	9	37	86	54	0	1793	91	397	1144	267	32	1793	18	12	66.7	5	27.8	942	5305	2671	34	41	5	26	16	1185	27
2849	24	1998	34.3	1692	2006	84.3	25780	7898	907	992	91.4	635	744	85.3	91	140	65.0	10	9.6	11.2	0.4	102	160	130	13	344	null	null	null	null	null	null	null	null	null	null	…	49	17	18	21	10	17	58	29.3	41	23	1	15	64	5	0	2326	24	244	926	1173	165	2324	81	37	45.7	38	46.9	1563	8190	3514	91	83	37	53	58	1827	202

There are still some irrelevant columns, namely Rank, Age and Year of Birth.

Code

df_stats_relevant.select(
    ~cs.by_dtype(pl.String) & ~cs.by_name(["Age", "Rk", "Year of birth"])
)

shape: (1_215, 123)

90s Played	Passes Completed	Passes Attempted	Pass Completion %	Total Passing Distance	Progressive Passing Distance	Passes Completed (Short)	Passes Attempted (Short)	Pass Completion % (Short)	Passes Completed (Medium)	Passes Attempted (Medium)	Pass Completion % (Medium)	Passes Completed (Long)	Passes Attempted (Long)	Pass Completion % (Long)	Assists	xAG: Exp. Assisted Goals	xA: Expected Assists	Assists - xAG	Key Passes	Passes into Final Third	Passes into Penalty Area	Crosses into Penalty Area	Progressive Passes	Matches Played	Starts	Goals Against	Goals Against/90	Shots on Target Against	Saves	Save Percentage	Wins	Draws	Losses	Clean Sheets	Clean Sheet Percentage	Penalty Kicks Attempted	…	Tackles	Tackles Won	Tackles (Def 3rd)	Tackles (Mid 3rd)	Tackles (Att 3rd)	Dribblers Tackled	Dribbles Challenged	% of Dribblers Tackled	Challenges Lost	Blocks	Shots Blocked	Interceptions	Tkl+Int	Clearances	Errors	Touches	Touches (Def Pen)	Touches (Def 3rd)	Touches (Mid 3rd)	Touches (Att 3rd)	Touches (Att Pen)	Touches (Live-Ball)	Take-Ons Attempted	Successful Take-Ons	Successful Take-On %	Times Tackled During Take-On	Tackled During Take-On Percentage	Carries	Total Carrying Distance	Progressive Carrying Distance	Progressive Carries	Carries into Final Third	Carries into Penalty Area	Miscontrols	Dispossessed	Passes Received	Progressive Passes Rec
f64	i64	i64	f64	i64	i64	i64	i64	f64	i64	i64	f64	i64	i64	f64	i64	f64	f64	f64	i64	i64	i64	i64	i64	i64	i64	i64	f64	i64	i64	f64	i64	i64	i64	i64	f64	i64	…	i64	i64	i64	i64	i64	i64	i64	f64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	f64	i64	f64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64
30.9	1552	1836	84.5	29618	9672	487	548	88.9	893	976	91.5	141	252	56.0	0	0.3	0.6	-0.3	8	129	3	0	137	null	null	null	null	null	null	null	null	null	null	null	null	null	…	64	35	36	23	5	26	45	57.8	19	51	32	39	103	109	2	2185	293	976	1119	114	35	2184	15	8	53.3	7	46.7	1506	8663	4921	36	19	0	23	4	1403	9
16.9	796	895	88.9	12470	3008	393	433	90.8	330	360	91.7	41	54	75.9	0	0.5	1.1	-0.5	6	87	5	2	78	null	null	null	null	null	null	null	null	null	null	null	null	null	…	21	14	8	10	3	8	18	44.4	10	12	1	12	33	18	0	1022	21	168	647	218	7	1022	19	7	36.8	11	57.9	823	4361	1683	9	33	0	23	11	780	20
31.8	1551	1836	84.5	27382	8613	629	707	89.0	711	802	88.7	150	233	64.4	1	2.2	1.9	-1.2	30	188	23	3	194	null	null	null	null	null	null	null	null	null	null	null	null	null	…	85	52	43	34	8	38	96	39.6	58	29	6	61	146	61	3	2164	105	553	1288	349	10	2164	65	44	67.7	18	27.7	1649	8707	3892	38	49	3	32	30	1502	51
15.6	615	797	77.2	9211	4552	351	394	89.1	203	265	76.6	37	79	46.8	1	1.0	1.0	0.0	8	20	7	2	33	null	null	null	null	null	null	null	null	null	null	null	null	null	…	25	19	15	9	1	17	34	50.0	17	23	5	15	40	62	0	966	72	344	405	226	28	966	13	7	53.8	5	38.5	447	1802	949	14	10	5	18	11	502	58
25.7	553	762	72.6	12370	4872	141	164	86.0	302	339	89.1	99	225	44.0	0	0.2	0.2	-0.2	3	61	2	0	49	null	null	null	null	null	null	null	null	null	null	null	null	null	…	35	19	28	5	2	21	35	60.0	14	31	26	23	58	115	0	1032	191	570	422	42	21	1032	5	3	60.0	1	20.0	492	2460	1418	7	6	0	15	1	499	1
…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…
30.7	710	917	77.4	9106	1752	446	539	82.7	166	213	77.9	32	46	69.6	4	5.2	2.5	-1.2	43	40	28	0	86	null	null	null	null	null	null	null	null	null	null	null	null	null	…	32	20	8	15	9	13	24	54.2	11	25	2	3	35	36	0	1374	46	135	628	630	136	1372	128	52	40.6	72	56.3	814	4132	1772	53	46	22	114	64	1033	182
31.5	962	1148	83.8	18843	6791	335	365	91.8	509	558	91.2	111	198	56.1	0	0.2	0.1	-0.2	4	57	1	0	53	null	null	null	null	null	null	null	null	null	null	null	null	null	…	23	17	14	9	0	10	16	62.5	6	38	28	29	52	160	0	1455	275	798	626	34	28	1455	3	2	66.7	1	33.3	794	3751	2019	7	5	1	13	3	873	1
28.0	1476	1746	84.5	30351	9423	387	450	86.0	913	989	92.3	167	267	62.5	2	1.1	0.8	0.9	10	99	4	0	129	null	null	null	null	null	null	null	null	null	null	null	null	null	…	32	25	20	10	2	19	26	73.1	7	28	19	32	64	97	3	1983	162	908	1034	50	18	1983	8	4	50.0	4	50.0	1229	7537	4725	28	2	0	14	3	1316	0
29.5	1321	1542	85.7	22165	6269	607	677	89.7	607	677	89.7	86	126	68.3	1	1.5	2.0	-0.5	16	136	15	1	152	null	null	null	null	null	null	null	null	null	null	null	null	null	…	49	27	18	29	2	23	42	54.8	19	33	9	37	86	54	0	1793	91	397	1144	267	32	1793	18	12	66.7	5	27.8	942	5305	2671	34	41	5	26	16	1185	27
34.3	1692	2006	84.3	25780	7898	907	992	91.4	635	744	85.3	91	140	65.0	10	9.6	11.2	0.4	102	160	130	13	344	null	null	null	null	null	null	null	null	null	null	null	null	null	…	49	17	18	21	10	17	58	29.3	41	23	1	15	64	5	0	2326	24	244	926	1173	165	2324	81	37	45.7	38	46.9	1563	8190	3514	91	83	37	53	58	1827	202

For a good comparison, it is fair to condition all stats per 90 minutes played. I am happy to say there is a column for that, so we can divide each column by that column.

Code

df_stats_relevant.select(
    ~cs.by_dtype(pl.String) & ~cs.by_name(["Age", "Rk", "Year of birth"])
).with_columns(pl.all() / pl.col("90s Played").exclude("90s Played"))

shape: (1_215, 123)

90s Played	Passes Completed	Passes Attempted	Pass Completion %	Total Passing Distance	Progressive Passing Distance	Passes Completed (Short)	Passes Attempted (Short)	Pass Completion % (Short)	Passes Completed (Medium)	Passes Attempted (Medium)	Pass Completion % (Medium)	Passes Completed (Long)	Passes Attempted (Long)	Pass Completion % (Long)	Assists	xAG: Exp. Assisted Goals	xA: Expected Assists	Assists - xAG	Key Passes	Passes into Final Third	Passes into Penalty Area	Crosses into Penalty Area	Progressive Passes	Matches Played	Starts	Goals Against	Goals Against/90	Shots on Target Against	Saves	Save Percentage	Wins	Draws	Losses	Clean Sheets	Clean Sheet Percentage	Penalty Kicks Attempted	…	Tackles	Tackles Won	Tackles (Def 3rd)	Tackles (Mid 3rd)	Tackles (Att 3rd)	Dribblers Tackled	Dribbles Challenged	% of Dribblers Tackled	Challenges Lost	Blocks	Shots Blocked	Interceptions	Tkl+Int	Clearances	Errors	Touches	Touches (Def Pen)	Touches (Def 3rd)	Touches (Mid 3rd)	Touches (Att 3rd)	Touches (Att Pen)	Touches (Live-Ball)	Take-Ons Attempted	Successful Take-Ons	Successful Take-On %	Times Tackled During Take-On	Tackled During Take-On Percentage	Carries	Total Carrying Distance	Progressive Carrying Distance	Progressive Carries	Carries into Final Third	Carries into Penalty Area	Miscontrols	Dispossessed	Passes Received	Progressive Passes Rec
f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	…	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64
30.9	50.226537	59.417476	2.734628	958.511327	313.009709	15.760518	17.734628	2.877023	28.899676	31.585761	2.961165	4.563107	8.15534	1.812298	0.0	0.009709	0.019417	-0.009709	0.2589	4.174757	0.097087	0.0	4.433657	null	null	null	null	null	null	null	null	null	null	null	null	null	…	2.071197	1.132686	1.165049	0.744337	0.161812	0.841424	1.456311	1.87055	0.614887	1.650485	1.035599	1.262136	3.333333	3.527508	0.064725	70.711974	9.482201	31.585761	36.213592	3.68932	1.132686	70.679612	0.485437	0.2589	1.724919	0.226537	1.511327	48.737864	280.355987	159.255663	1.165049	0.614887	0.0	0.744337	0.12945	45.404531	0.291262
16.9	47.100592	52.95858	5.260355	737.869822	177.988166	23.254438	25.621302	5.372781	19.526627	21.301775	5.426036	2.426036	3.195266	4.491124	0.0	0.029586	0.065089	-0.029586	0.35503	5.147929	0.295858	0.118343	4.615385	null	null	null	null	null	null	null	null	null	null	null	null	null	…	1.242604	0.828402	0.473373	0.591716	0.177515	0.473373	1.065089	2.627219	0.591716	0.710059	0.059172	0.710059	1.952663	1.065089	0.0	60.473373	1.242604	9.940828	38.284024	12.899408	0.414201	60.473373	1.12426	0.414201	2.177515	0.650888	3.426036	48.698225	258.047337	99.585799	0.532544	1.952663	0.0	1.360947	0.650888	46.153846	1.183432
31.8	48.773585	57.735849	2.657233	861.069182	270.849057	19.779874	22.232704	2.798742	22.358491	25.220126	2.789308	4.716981	7.327044	2.025157	0.031447	0.069182	0.059748	-0.037736	0.943396	5.91195	0.72327	0.09434	6.100629	null	null	null	null	null	null	null	null	null	null	null	null	null	…	2.672956	1.63522	1.352201	1.069182	0.251572	1.194969	3.018868	1.245283	1.823899	0.91195	0.188679	1.918239	4.591195	1.918239	0.09434	68.050314	3.301887	17.389937	40.503145	10.974843	0.314465	68.050314	2.044025	1.383648	2.128931	0.566038	0.871069	51.855346	273.805031	122.389937	1.194969	1.540881	0.09434	1.006289	0.943396	47.232704	1.603774
15.6	39.423077	51.089744	4.948718	590.448718	291.794872	22.5	25.25641	5.711538	13.012821	16.987179	4.910256	2.371795	5.064103	3.0	0.064103	0.064103	0.064103	0.0	0.512821	1.282051	0.448718	0.128205	2.115385	null	null	null	null	null	null	null	null	null	null	null	null	null	…	1.602564	1.217949	0.961538	0.576923	0.064103	1.089744	2.179487	3.205128	1.089744	1.474359	0.320513	0.961538	2.564103	3.974359	0.0	61.923077	4.615385	22.051282	25.961538	14.487179	1.794872	61.923077	0.833333	0.448718	3.448718	0.320513	2.467949	28.653846	115.512821	60.833333	0.897436	0.641026	0.320513	1.153846	0.705128	32.179487	3.717949
25.7	21.51751	29.649805	2.824903	481.322957	189.571984	5.486381	6.381323	3.346304	11.750973	13.190661	3.466926	3.85214	8.754864	1.712062	0.0	0.007782	0.007782	-0.007782	0.116732	2.373541	0.077821	0.0	1.906615	null	null	null	null	null	null	null	null	null	null	null	null	null	…	1.361868	0.7393	1.089494	0.194553	0.077821	0.817121	1.361868	2.33463	0.544747	1.206226	1.011673	0.894942	2.256809	4.474708	0.0	40.155642	7.431907	22.178988	16.420233	1.634241	0.817121	40.155642	0.194553	0.116732	2.33463	0.038911	0.77821	19.143969	95.719844	55.175097	0.272374	0.233463	0.0	0.583658	0.038911	19.416342	0.038911
…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…
30.7	23.127036	29.869707	2.521173	296.612378	57.068404	14.527687	17.557003	2.693811	5.407166	6.938111	2.537459	1.042345	1.498371	2.267101	0.130293	0.169381	0.081433	-0.039088	1.400651	1.302932	0.912052	0.0	2.801303	null	null	null	null	null	null	null	null	null	null	null	null	null	…	1.042345	0.651466	0.260586	0.488599	0.29316	0.423453	0.781759	1.765472	0.358306	0.814332	0.065147	0.09772	1.140065	1.172638	0.0	44.7557	1.498371	4.397394	20.456026	20.521173	4.429967	44.690554	4.169381	1.693811	1.322476	2.345277	1.833876	26.514658	134.592834	57.71987	1.726384	1.498371	0.716612	3.713355	2.084691	33.648208	5.928339
31.5	30.539683	36.444444	2.660317	598.190476	215.587302	10.634921	11.587302	2.914286	16.15873	17.714286	2.895238	3.52381	6.285714	1.780952	0.0	0.006349	0.003175	-0.006349	0.126984	1.809524	0.031746	0.0	1.68254	null	null	null	null	null	null	null	null	null	null	null	null	null	…	0.730159	0.539683	0.444444	0.285714	0.0	0.31746	0.507937	1.984127	0.190476	1.206349	0.888889	0.920635	1.650794	5.079365	0.0	46.190476	8.730159	25.333333	19.873016	1.079365	0.888889	46.190476	0.095238	0.063492	2.11746	0.031746	1.057143	25.206349	119.079365	64.095238	0.222222	0.15873	0.031746	0.412698	0.095238	27.714286	0.031746
28.0	52.714286	62.357143	3.017857	1083.964286	336.535714	13.821429	16.071429	3.071429	32.607143	35.321429	3.296429	5.964286	9.535714	2.232143	0.071429	0.039286	0.028571	0.032143	0.357143	3.535714	0.142857	0.0	4.607143	null	null	null	null	null	null	null	null	null	null	null	null	null	…	1.142857	0.892857	0.714286	0.357143	0.071429	0.678571	0.928571	2.610714	0.25	1.0	0.678571	1.142857	2.285714	3.464286	0.107143	70.821429	5.785714	32.428571	36.928571	1.785714	0.642857	70.821429	0.285714	0.142857	1.785714	0.142857	1.785714	43.892857	269.178571	168.75	1.0	0.071429	0.0	0.5	0.107143	47.0	0.0
29.5	44.779661	52.271186	2.905085	751.355932	212.508475	20.576271	22.949153	3.040678	20.576271	22.949153	3.040678	2.915254	4.271186	2.315254	0.033898	0.050847	0.067797	-0.016949	0.542373	4.610169	0.508475	0.033898	5.152542	null	null	null	null	null	null	null	null	null	null	null	null	null	…	1.661017	0.915254	0.610169	0.983051	0.067797	0.779661	1.423729	1.857627	0.644068	1.118644	0.305085	1.254237	2.915254	1.830508	0.0	60.779661	3.084746	13.457627	38.779661	9.050847	1.084746	60.779661	0.610169	0.40678	2.261017	0.169492	0.942373	31.932203	179.830508	90.542373	1.152542	1.389831	0.169492	0.881356	0.542373	40.169492	0.915254
34.3	49.329446	58.483965	2.457726	751.603499	230.262391	26.443149	28.921283	2.664723	18.51312	21.690962	2.48688	2.653061	4.081633	1.895044	0.291545	0.279883	0.326531	0.011662	2.973761	4.664723	3.790087	0.379009	10.029155	null	null	null	null	null	null	null	null	null	null	null	null	null	…	1.428571	0.495627	0.524781	0.612245	0.291545	0.495627	1.690962	0.854227	1.195335	0.670554	0.029155	0.437318	1.865889	0.145773	0.0	67.813411	0.699708	7.113703	26.997085	34.198251	4.810496	67.755102	2.361516	1.078717	1.332362	1.107872	1.367347	45.568513	238.77551	102.44898	2.653061	2.419825	1.078717	1.54519	1.690962	53.265306	5.889213

The stat of number of 90s played can be ignored if we disregard ‘fitness’ as a trait Toni Kroos had. This brings us to the following dataframe:

Code

df_stats_relevant_umap = (
    df_stats_relevant.select(
        ~cs.by_dtype(pl.String) & ~cs.by_name(["Age", "Rk", "Year of birth"])
    )
    .with_columns(pl.all() / pl.col("90s Played").exclude("90s Played"))
    .drop(pl.col("90s Played"))
)
df_stats_relevant_umap

shape: (1_215, 122)

Passes Completed	Passes Attempted	Pass Completion %	Total Passing Distance	Progressive Passing Distance	Passes Completed (Short)	Passes Attempted (Short)	Pass Completion % (Short)	Passes Completed (Medium)	Passes Attempted (Medium)	Pass Completion % (Medium)	Passes Completed (Long)	Passes Attempted (Long)	Pass Completion % (Long)	Assists	xAG: Exp. Assisted Goals	xA: Expected Assists	Assists - xAG	Key Passes	Passes into Final Third	Passes into Penalty Area	Crosses into Penalty Area	Progressive Passes	Matches Played	Starts	Goals Against	Goals Against/90	Shots on Target Against	Saves	Save Percentage	Wins	Draws	Losses	Clean Sheets	Clean Sheet Percentage	Penalty Kicks Attempted	Penalty Kicks Allowed	…	Tackles	Tackles Won	Tackles (Def 3rd)	Tackles (Mid 3rd)	Tackles (Att 3rd)	Dribblers Tackled	Dribbles Challenged	% of Dribblers Tackled	Challenges Lost	Blocks	Shots Blocked	Interceptions	Tkl+Int	Clearances	Errors	Touches	Touches (Def Pen)	Touches (Def 3rd)	Touches (Mid 3rd)	Touches (Att 3rd)	Touches (Att Pen)	Touches (Live-Ball)	Take-Ons Attempted	Successful Take-Ons	Successful Take-On %	Times Tackled During Take-On	Tackled During Take-On Percentage	Carries	Total Carrying Distance	Progressive Carrying Distance	Progressive Carries	Carries into Final Third	Carries into Penalty Area	Miscontrols	Dispossessed	Passes Received	Progressive Passes Rec
f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	…	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64
50.226537	59.417476	2.734628	958.511327	313.009709	15.760518	17.734628	2.877023	28.899676	31.585761	2.961165	4.563107	8.15534	1.812298	0.0	0.009709	0.019417	-0.009709	0.2589	4.174757	0.097087	0.0	4.433657	null	null	null	null	null	null	null	null	null	null	null	null	null	null	…	2.071197	1.132686	1.165049	0.744337	0.161812	0.841424	1.456311	1.87055	0.614887	1.650485	1.035599	1.262136	3.333333	3.527508	0.064725	70.711974	9.482201	31.585761	36.213592	3.68932	1.132686	70.679612	0.485437	0.2589	1.724919	0.226537	1.511327	48.737864	280.355987	159.255663	1.165049	0.614887	0.0	0.744337	0.12945	45.404531	0.291262
47.100592	52.95858	5.260355	737.869822	177.988166	23.254438	25.621302	5.372781	19.526627	21.301775	5.426036	2.426036	3.195266	4.491124	0.0	0.029586	0.065089	-0.029586	0.35503	5.147929	0.295858	0.118343	4.615385	null	null	null	null	null	null	null	null	null	null	null	null	null	null	…	1.242604	0.828402	0.473373	0.591716	0.177515	0.473373	1.065089	2.627219	0.591716	0.710059	0.059172	0.710059	1.952663	1.065089	0.0	60.473373	1.242604	9.940828	38.284024	12.899408	0.414201	60.473373	1.12426	0.414201	2.177515	0.650888	3.426036	48.698225	258.047337	99.585799	0.532544	1.952663	0.0	1.360947	0.650888	46.153846	1.183432
48.773585	57.735849	2.657233	861.069182	270.849057	19.779874	22.232704	2.798742	22.358491	25.220126	2.789308	4.716981	7.327044	2.025157	0.031447	0.069182	0.059748	-0.037736	0.943396	5.91195	0.72327	0.09434	6.100629	null	null	null	null	null	null	null	null	null	null	null	null	null	null	…	2.672956	1.63522	1.352201	1.069182	0.251572	1.194969	3.018868	1.245283	1.823899	0.91195	0.188679	1.918239	4.591195	1.918239	0.09434	68.050314	3.301887	17.389937	40.503145	10.974843	0.314465	68.050314	2.044025	1.383648	2.128931	0.566038	0.871069	51.855346	273.805031	122.389937	1.194969	1.540881	0.09434	1.006289	0.943396	47.232704	1.603774
39.423077	51.089744	4.948718	590.448718	291.794872	22.5	25.25641	5.711538	13.012821	16.987179	4.910256	2.371795	5.064103	3.0	0.064103	0.064103	0.064103	0.0	0.512821	1.282051	0.448718	0.128205	2.115385	null	null	null	null	null	null	null	null	null	null	null	null	null	null	…	1.602564	1.217949	0.961538	0.576923	0.064103	1.089744	2.179487	3.205128	1.089744	1.474359	0.320513	0.961538	2.564103	3.974359	0.0	61.923077	4.615385	22.051282	25.961538	14.487179	1.794872	61.923077	0.833333	0.448718	3.448718	0.320513	2.467949	28.653846	115.512821	60.833333	0.897436	0.641026	0.320513	1.153846	0.705128	32.179487	3.717949
21.51751	29.649805	2.824903	481.322957	189.571984	5.486381	6.381323	3.346304	11.750973	13.190661	3.466926	3.85214	8.754864	1.712062	0.0	0.007782	0.007782	-0.007782	0.116732	2.373541	0.077821	0.0	1.906615	null	null	null	null	null	null	null	null	null	null	null	null	null	null	…	1.361868	0.7393	1.089494	0.194553	0.077821	0.817121	1.361868	2.33463	0.544747	1.206226	1.011673	0.894942	2.256809	4.474708	0.0	40.155642	7.431907	22.178988	16.420233	1.634241	0.817121	40.155642	0.194553	0.116732	2.33463	0.038911	0.77821	19.143969	95.719844	55.175097	0.272374	0.233463	0.0	0.583658	0.038911	19.416342	0.038911
…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…
23.127036	29.869707	2.521173	296.612378	57.068404	14.527687	17.557003	2.693811	5.407166	6.938111	2.537459	1.042345	1.498371	2.267101	0.130293	0.169381	0.081433	-0.039088	1.400651	1.302932	0.912052	0.0	2.801303	null	null	null	null	null	null	null	null	null	null	null	null	null	null	…	1.042345	0.651466	0.260586	0.488599	0.29316	0.423453	0.781759	1.765472	0.358306	0.814332	0.065147	0.09772	1.140065	1.172638	0.0	44.7557	1.498371	4.397394	20.456026	20.521173	4.429967	44.690554	4.169381	1.693811	1.322476	2.345277	1.833876	26.514658	134.592834	57.71987	1.726384	1.498371	0.716612	3.713355	2.084691	33.648208	5.928339
30.539683	36.444444	2.660317	598.190476	215.587302	10.634921	11.587302	2.914286	16.15873	17.714286	2.895238	3.52381	6.285714	1.780952	0.0	0.006349	0.003175	-0.006349	0.126984	1.809524	0.031746	0.0	1.68254	null	null	null	null	null	null	null	null	null	null	null	null	null	null	…	0.730159	0.539683	0.444444	0.285714	0.0	0.31746	0.507937	1.984127	0.190476	1.206349	0.888889	0.920635	1.650794	5.079365	0.0	46.190476	8.730159	25.333333	19.873016	1.079365	0.888889	46.190476	0.095238	0.063492	2.11746	0.031746	1.057143	25.206349	119.079365	64.095238	0.222222	0.15873	0.031746	0.412698	0.095238	27.714286	0.031746
52.714286	62.357143	3.017857	1083.964286	336.535714	13.821429	16.071429	3.071429	32.607143	35.321429	3.296429	5.964286	9.535714	2.232143	0.071429	0.039286	0.028571	0.032143	0.357143	3.535714	0.142857	0.0	4.607143	null	null	null	null	null	null	null	null	null	null	null	null	null	null	…	1.142857	0.892857	0.714286	0.357143	0.071429	0.678571	0.928571	2.610714	0.25	1.0	0.678571	1.142857	2.285714	3.464286	0.107143	70.821429	5.785714	32.428571	36.928571	1.785714	0.642857	70.821429	0.285714	0.142857	1.785714	0.142857	1.785714	43.892857	269.178571	168.75	1.0	0.071429	0.0	0.5	0.107143	47.0	0.0
44.779661	52.271186	2.905085	751.355932	212.508475	20.576271	22.949153	3.040678	20.576271	22.949153	3.040678	2.915254	4.271186	2.315254	0.033898	0.050847	0.067797	-0.016949	0.542373	4.610169	0.508475	0.033898	5.152542	null	null	null	null	null	null	null	null	null	null	null	null	null	null	…	1.661017	0.915254	0.610169	0.983051	0.067797	0.779661	1.423729	1.857627	0.644068	1.118644	0.305085	1.254237	2.915254	1.830508	0.0	60.779661	3.084746	13.457627	38.779661	9.050847	1.084746	60.779661	0.610169	0.40678	2.261017	0.169492	0.942373	31.932203	179.830508	90.542373	1.152542	1.389831	0.169492	0.881356	0.542373	40.169492	0.915254
49.329446	58.483965	2.457726	751.603499	230.262391	26.443149	28.921283	2.664723	18.51312	21.690962	2.48688	2.653061	4.081633	1.895044	0.291545	0.279883	0.326531	0.011662	2.973761	4.664723	3.790087	0.379009	10.029155	null	null	null	null	null	null	null	null	null	null	null	null	null	null	…	1.428571	0.495627	0.524781	0.612245	0.291545	0.495627	1.690962	0.854227	1.195335	0.670554	0.029155	0.437318	1.865889	0.145773	0.0	67.813411	0.699708	7.113703	26.997085	34.198251	4.810496	67.755102	2.361516	1.078717	1.332362	1.107872	1.367347	45.568513	238.77551	102.44898	2.653061	2.419825	1.078717	1.54519	1.690962	53.265306	5.889213

There is only one thing left I would like to change. For the plots in the next step, it is useful to highlight Toni Kroos. By changing the value of the Position column, we can still use that column to dictate the colours and also have this clear distinction for Kroos.

Code

df_stats_relevant = df_stats_relevant.with_columns(
    pl.when(pl.col("Player") == "Toni Kroos")
    .then(pl.lit("Toni Kroos"))
    .otherwise(pl.col("Position"))
    .alias("Position")
)

Step 3: Interpreting UMAP results and plots

We’ve got the dataframe and it is very tempting to just plug that into UMAP. Yet it generally is wiser to scale the data first. We can use the standard preprocessing tools that we know from sklearn.

Code

from sklearn.preprocessing import StandardScaler

df_stats_numeric_scaled = StandardScaler().fit_transform(df_stats_relevant_umap)

Let’s get on with the first plot!

Code

import umap.plot

reducer = umap.umap_.UMAP(random_state=10)
mapper = reducer.fit(df_stats_numeric_scaled, force_all_finite="allow-nan")
umap.plot.points(mapper, labels=df_stats_relevant["Position"], background="black")

Oh man…

What even is this?! Four positions… we would expect four ‘clusters’, right? Of course you can argue about the generality of the positions, but we would at least expect some type of separation between the positions. The current one is all over the place. But hope’s not lost yet! Checking the documentation of the arguments we can provide to UMAP shows ‘euclidean’ as default metric. I would rather not go into too much detail, so essentially: the Euclidean metric in this case does not make sense. It would mean that people that have similar ‘mean’ stats across all stats will be packed closer to each other. This is not what we want: there should be more weights on the qualities that make a player unique, and that should be compared to others. A better metric to use would be the Chebyshev metric. Again without going into too much detail: the Chebyshev metric measures the greatest (absolute) difference between any stat of two players. For example, if player A is a better dribbler and player B is a better passer, these individual qualities will be more prominent with the Chebyshev metric. Let’s see with a plot!

Code

import umap.plot

reducer = umap.umap_.UMAP(random_state=10, metric="chebyshev")
mapper = reducer.fit(df_stats_numeric_scaled, force_all_finite="allow-nan")
umap.plot.points(mapper, labels=df_stats_relevant["Position"], background="black")

Ah, much better.

But still something we can improve on! Now what about the other parameters? We have already touched upon the metric. The other frequently used parameters are:

n_components;
n_neighbors;
min_dist.

For the components we can show both a 2D and 3D case. Plotting in higher dimensions is a little more cumbersome. That leaves n_neighbors and min_dist. The former is set at 15 by default. A rule of thumb is that the lower this value, the more the local structure will be preserved. Conversely the higher this number, the more UMAP sees the overall structure of the data. After some tweaking³, it was clear from the results that a value of around 30 is quite good for both the local and global structure.

Afs for the min_dist, the rule of thumb is: the lower the value, the clumpier the embeddings. As is stated by the documentation of UMAP⁴, this can be useful for clustering. A default value of 0.1 is fine for our use case, which leads us to the following instances for 2D and 3D UMAP embeddings.

Code

reducer_2d = umap.umap_.UMAP(
    n_components=2, metric="chebyshev", n_neighbors=30, random_state=10
)
reducer_3d = umap.umap_.UMAP(
    n_components=3, metric="chebyshev", n_neighbors=30, random_state=10
)

embeddings_2d = reducer_2d.fit_transform(
    df_stats_numeric_scaled, force_all_finite="allow-nan"
)
embeddings_3d = reducer_3d.fit_transform(
    df_stats_numeric_scaled, force_all_finite="allow-nan"
)

This leads us to two interactive figures underneath!

Code

import plotly.express as px

df_plot_2d = df_stats_relevant.hstack(pl.DataFrame(embeddings_2d, schema=["x", "y"]))
df_plot_3d = df_stats_relevant.hstack(
    pl.DataFrame(embeddings_3d, schema=["x", "y", "z"])
)

fig_2d = px.scatter(
    df_plot_2d,
    title="2D",
    x="x",
    y="y",
    color=df_stats_relevant["Position"],
    hover_data=["Player", "Squad", "Competition"],
    labels={"color": ""},
)
fig_2d.update_layout(
    plot_bgcolor="white",
    margin=dict(l=20, r=20, t=50, b=20),
    xaxis=dict(showticklabels=False, title=None, linecolor="black"),
    yaxis=dict(showticklabels=False, title=None, linecolor="black"),
)
fig_2d.update_traces(
    hovertemplate="<br>".join(
        [
            "Player: %{customdata[0]}",
            "Squad: %{customdata[1]}",
            "Competition: %{customdata[2]}",
        ]
    )
)

fig_3d = px.scatter_3d(
    df_plot_3d,
    title="3D",
    x="x",
    y="y",
    z="z",
    color=df_stats_relevant["Position"],
    hover_data=["Player", "Squad", "Competition"],
    labels={"color": ""},
)
fig_3d.update_traces(
    marker_size=3,
    hovertemplate="<br>".join(
        [
            "Player: %{customdata[0]}",
            "Squad: %{customdata[1]}",
            "Competition: %{customdata[2]}",
        ]
    ),
)
fig_3d.update_layout(
    scene=dict(
        xaxis_title="",
        yaxis_title="",
        zaxis_title="",
        xaxis_showspikes=False,
        yaxis_showspikes=False,
        xaxis=dict(
            backgroundcolor="rgb(200, 200, 230)",
            showticklabels=False,
            linecolor="black",
        ),
        yaxis=dict(
            backgroundcolor="rgb(230, 200,230)",
            showticklabels=False,
            linecolor="black",
        ),
        zaxis=dict(
            backgroundcolor="rgb(230, 230,200)",
            showticklabels=False,
            linecolor="black",
        ),
    ),
    margin=dict(l=20, r=20, t=50, b=20),
)

fig_2d.show()
fig_3d.show()

Aleix Garcia is statistically⁵ the most similar player to Toni Kroos, which can be seen in both the 2D and 3D embedding. He is a central midfielder who played for Girona in the 2023/24 season. If you followed LaLiga last leason you should know that Girona was a serious title contender for the majority of the run. Only at the final third part did they not keep up with Real Madrid anymore. There have been many praises for this Girona team, with Aleix Garcia as one of the most prominent players. It does not surprise me the least that he is most similar to Kroos, and for his tremendous season he has earned a transfer to Bayer Leverkusen, the 2023/24 Bundesliga champions.

In the 3D embedding, it is quite extraordinary to see a defenders close to him as well. The one that is closes is Trent Alexander-Arnold, who plays for Liverpool in the Premier League. He is known to be an amazing creator, but usually plays right-back. There are some rumours about him joining Real Madrid as a free agent in 2025, so who knows… 👀

TL;DR

Aleix Garcia. Trent Alexander-Arnold was the defender that was the closest. Keep in mind that the source does not have all statistics. Also we took the entirety of stats. For a more scoped look in the key things that are missing for Real Madrid, we could opt to include only those stats, such as passing and possession.

That’s all for today, thanks for reading!

Footnotes

The only reason I put this up is because I spent too much time messing around with Polars, and I just did not get it to work…↩︎
Joining on the player name is trivial, but the squad is less obvious: the data could contain multiple records of the same player. This is due to either late transfers or a transfer in the winter window.↩︎
You can do so by running a simple for loop and inspecting all the plotted results.↩︎
https://umap-learn.readthedocs.io/en/latest/parameters.html#n-neighbors ↩︎
According to the Chebyshev metric and our choice for min_dist and n_neighbors.↩︎

Step 1: Obtaining the data

Step 2: Exploring polars by data wrangling

Step 3: Interpreting UMAP results and plots

TL;DR

Footnotes

Step 2: Exploring `polars` by data wrangling