Visually finding a statistical Toni Kroos replacement

Irreplacable
visualisation
Author

Li Yong Pan

Published

October 7, 2024

If you have watched any of Real Madrid’s first few games in this season (2024/2025) and also previous seasons, you must have noticed a hasty and uncontrolled playing style in recent games. Very clearly this is due to the absence of Toni Kroos in midfield, and you don’t have to be a tactico to observe that. Even though he is truly one of a kind, I would like to use football data to find the most similar player in terms of statistics. A recent trip to PyData Amsterdam 2024 and the talk of Jeroen Janssens about ‘How I hacked UMAP and won at a plotting contest’ has inspired me to try and play around with Polars for a small side project, and this seems like the perfect one! Also the use of UMAP suits this project particularly well.

Finding a Toni Kroos replacement

Finding a Toni Kroos replacement

In an earlier post with football data we used a type of football data which is great for dimensionality reduction, which is what UMAP entails as a technique. The only problem with that data is that it was for the 2022-2023 season, and I want to use the most recent completed season. Fortunately the Kaggle dataset also showed the source and it was quite easy to find the source data for the 2023-2024 season.

Step 1: Obtaining the data

After clicking around on FBREF, I found the player stats of the 2023-2024 season, so let’s get digging!

… Diglett Dig Diglett Dig Diglett Dig Diglett Dig …

Diglett Dig Diglett Dig Diglett Dig Diglett Dig

The data is not yet in an easy format to work with. We could opt for pd.read_html() method, but that’s no fun! Besides, I think doing manual HTML inspecting and then extracting is a good skill to maintain (from time to time). As it is simply a matter of extracting some HTML content, beautifulsoup should be sufficient!

Terminal
pip install bs4 polars

Let’s first inspect the stats. There are several categories with each their own URL:

  • standard stats
  • goalkeeping
  • advanced goalkeeping
  • shooting
  • passing
  • pass types
  • goal and shot creation
  • possession
  • playing time
  • miscellaneous stats

Even though Toni Kroos is not a goalkeeper, I think it’s good to keep the goalkeeping stats for the final UMAP plot. We will, however, disregard the ‘standard stats’ and ‘miscellaneous stats’. Fortunately the URLs are similar and only differ by a simple substring.

Code
import requests
from bs4 import BeautifulSoup
import polars as pl


def extract_table_and_columns(stat: str) -> (pl.DataFrame, dict):
    """Extract table from url.

    Args:
        stat (str): The statistic of the table

    Returns:
        pl.DataFrame: A Polars dataframe with the information of the players.
        dict: A dictionary of columns and their descriptions.
    """
    base_url = "https://fbref.com/en/comps/Big5/2023-2024/{stat}/players/2023-2024-Big-5-European-Leagues-Stats"
    url = base_url.format(stat=stat)
    resp = requests.get(url)
    soup = BeautifulSoup(resp.content, features="lxml")

    # The HTML of the table
    stats_table = soup.find_all("table")[-1]

    # Parsing headers
    tr_header = stats_table.findChildren(name="tr")[1]
    column_descriptions = {
        cell.get("aria-label"): cell.get("data-tip")
        for cell in tr_header.find_all(["th", "td"])
    }
    columns = [key for key in column_descriptions.keys()]

    if len(columns) != len(set(columns)):
        raise AssertionError("Column names must be unique.")

    # Parsing stats
    table_body = stats_table.find_next(name="tbody")
    rows_players = table_body.findChildren(name="tr", attrs={"class": False})

    table_data: list = []
    for row in rows_players:
        cells = row.find_all(
            ["th", "td"]
        )  # Collect all cell data, both headers and regular cells
        table_data.append([cell.get_text(strip=True) for cell in cells])

    return pl.DataFrame(table_data, schema=columns, orient="row"), column_descriptions

Let’s look at the passing stats!

Code
df, descriptions = extract_table_and_columns("passing")
df
shape: (2_852, 33)
Rk Player Nation Position Squad Competition Age Year of birth 90s Played Passes Completed Passes Attempted Pass Completion % Total Passing Distance Progressive Passing Distance Passes Completed (Short) Passes Attempted (Short) Pass Completion % (Short) Passes Completed (Medium) Passes Attempted (Medium) Pass Completion % (Medium) Passes Completed (Long) Passes Attempted (Long) Pass Completion % (Long) Assists xAG: Exp. Assisted Goals xA: Expected Assists Assists - xAG Key Passes Passes into Final Third Passes into Penalty Area Crosses into Penalty Area Progressive Passes Matches
str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str
"1" "Max Aarons" "engENG" "DF" "Bournemouth" "engPremier League" "23" "2000" "13.7" "450" "581" "77.5" "7402" "2789" "220" "248" "88.7" "188" "235" "80.0" "34" "63" "54.0" "1" "0.8" "0.9" "+0.2" "7" "25" "13" "2" "43" "Matches"
"2" "Brenden Aaronson" "usUSA" "MF,FW" "Union Berlin" "deBundesliga" "22" "2000" "14.1" "365" "472" "77.3" "4890" "1506" "206" "240" "85.8" "105" "130" "80.8" "19" "32" "59.4" "2" "1.9" "2.0" "+0.1" "22" "30" "14" "3" "56" "Matches"
"3" "Paxten Aaronson" "usUSA" "MF" "Eint Frankfurt" "deBundesliga" "19" "2003" "1.1" "41" "50" "82.0" "576" "71" "20" "25" "80.0" "20" "22" "90.9" "0" "2" "0.0" "1" "0.1" "0.1" "+0.9" "1" "4" "2" "0" "5" "Matches"
"4" "Keyliane Abdallah" "frFRA" "FW" "Marseille" "frLigue 1" "17" "2006" "0.0" "1" "1" "100.0" "8" "0" "1" "1" "100.0" "0" "0" "" "0" "0" "" "0" "0.0" "0.0" "0.0" "0" "0" "0" "0" "0" "Matches"
"5" "Yunis Abdelhamid" "maMAR" "DF" "Reims" "frLigue 1" "35" "1987" "30.9" "1552" "1836" "84.5" "29618" "9672" "487" "548" "88.9" "893" "976" "91.5" "141" "252" "56.0" "0" "0.3" "0.6" "-0.3" "8" "129" "3" "0" "137" "Matches"
"2848" "Lovro Zvonarek" "hrCRO" "FW,MF" "Bayern Munich" "deBundesliga" "18" "2005" "1.8" "50" "60" "83.3" "766" "123" "20" "23" "87.0" "25" "29" "86.2" "2" "2" "100.0" "0" "0.2" "0.3" "-0.2" "2" "5" "1" "0" "7" "Matches"
"2849" "Martin Ødegaard" "noNOR" "MF" "Arsenal" "engPremier League" "24" "1998" "34.3" "1692" "2006" "84.3" "25780" "7898" "907" "992" "91.4" "635" "744" "85.3" "91" "140" "65.0" "10" "9.6" "11.2" "+0.4" "102" "160" "130" "13" "344" "Matches"
"2850" "Milan Đurić" "baBIH" "FW" "Hellas Verona" "itSerie A" "33" "1990" "13.4" "181" "334" "54.2" "2116" "392" "130" "213" "61.0" "32" "69" "46.4" "5" "12" "41.7" "1" "1.2" "0.5" "-0.2" "16" "18" "3" "0" "15" "Matches"
"2851" "Milan Đurić" "baBIH" "FW" "Monza" "itSerie A" "33" "1990" "14.0" "140" "272" "51.5" "1700" "251" "93" "156" "59.6" "35" "86" "40.7" "0" "3" "0.0" "1" "0.7" "0.2" "+0.3" "5" "3" "2" "0" "8" "Matches"
"2852" "Mateusz Łęgowski" "plPOL" "MF" "Salernitana" "itSerie A" "20" "2003" "12.7" "292" "378" "77.2" "4268" "1261" "169" "196" "86.2" "95" "120" "79.2" "16" "31" "51.6" "0" "0.2" "0.4" "-0.2" "7" "30" "5" "1" "37" "Matches"

Damn, looks kinda ugly with the automatically inferred strings for every column. We are actually going to try something quick and nasty for this, like saving these as .csv files and then reading them again for a better inference.

Code
import os


def parse_columns(df: pl.DataFrame) -> pl.DataFrame:
    """Parse columns of a dataframe for better inference.

    Args:
        df (pl.DataFrame): A Polars dataframe which has structurally incorrect columns.

    Returns:
        pl.DataFrame: A Polars dataframe with better inferred column types.
    """
    df.write_csv("temp.csv", quote_style="necessary")
    parsed_df = pl.read_csv("temp.csv", use_pyarrow=True)
    if os.path.exists("temp.csv"):
        os.remove("temp.csv")
    return parsed_df


def extract_and_parse_table(stat) -> pl.DataFrame:
    df, descriptions = extract_table_and_columns(stat)
    return parse_columns(df)

extract_and_parse_table("passing")
shape: (2_852, 33)
Rk Player Nation Position Squad Competition Age Year of birth 90s Played Passes Completed Passes Attempted Pass Completion % Total Passing Distance Progressive Passing Distance Passes Completed (Short) Passes Attempted (Short) Pass Completion % (Short) Passes Completed (Medium) Passes Attempted (Medium) Pass Completion % (Medium) Passes Completed (Long) Passes Attempted (Long) Pass Completion % (Long) Assists xAG: Exp. Assisted Goals xA: Expected Assists Assists - xAG Key Passes Passes into Final Third Passes into Penalty Area Crosses into Penalty Area Progressive Passes Matches
i64 str str str str str i64 i64 f64 i64 i64 f64 i64 i64 i64 i64 f64 i64 i64 f64 i64 i64 f64 i64 f64 f64 f64 i64 i64 i64 i64 i64 str
1 "Max Aarons" "engENG" "DF" "Bournemouth" "engPremier League" 23 2000 13.7 450 581 77.5 7402 2789 220 248 88.7 188 235 80.0 34 63 54.0 1 0.8 0.9 0.2 7 25 13 2 43 "Matches"
2 "Brenden Aaronson" "usUSA" "MF,FW" "Union Berlin" "deBundesliga" 22 2000 14.1 365 472 77.3 4890 1506 206 240 85.8 105 130 80.8 19 32 59.4 2 1.9 2.0 0.1 22 30 14 3 56 "Matches"
3 "Paxten Aaronson" "usUSA" "MF" "Eint Frankfurt" "deBundesliga" 19 2003 1.1 41 50 82.0 576 71 20 25 80.0 20 22 90.9 0 2 0.0 1 0.1 0.1 0.9 1 4 2 0 5 "Matches"
4 "Keyliane Abdallah" "frFRA" "FW" "Marseille" "frLigue 1" 17 2006 0.0 1 1 100.0 8 0 1 1 100.0 0 0 null 0 0 null 0 0.0 0.0 0.0 0 0 0 0 0 "Matches"
5 "Yunis Abdelhamid" "maMAR" "DF" "Reims" "frLigue 1" 35 1987 30.9 1552 1836 84.5 29618 9672 487 548 88.9 893 976 91.5 141 252 56.0 0 0.3 0.6 -0.3 8 129 3 0 137 "Matches"
2848 "Lovro Zvonarek" "hrCRO" "FW,MF" "Bayern Munich" "deBundesliga" 18 2005 1.8 50 60 83.3 766 123 20 23 87.0 25 29 86.2 2 2 100.0 0 0.2 0.3 -0.2 2 5 1 0 7 "Matches"
2849 "Martin Ødegaard" "noNOR" "MF" "Arsenal" "engPremier League" 24 1998 34.3 1692 2006 84.3 25780 7898 907 992 91.4 635 744 85.3 91 140 65.0 10 9.6 11.2 0.4 102 160 130 13 344 "Matches"
2850 "Milan Đurić" "baBIH" "FW" "Hellas Verona" "itSerie A" 33 1990 13.4 181 334 54.2 2116 392 130 213 61.0 32 69 46.4 5 12 41.7 1 1.2 0.5 -0.2 16 18 3 0 15 "Matches"
2851 "Milan Đurić" "baBIH" "FW" "Monza" "itSerie A" 33 1990 14.0 140 272 51.5 1700 251 93 156 59.6 35 86 40.7 0 3 0.0 1 0.7 0.2 0.3 5 3 2 0 8 "Matches"
2852 "Mateusz Łęgowski" "plPOL" "MF" "Salernitana" "itSerie A" 20 2003 12.7 292 378 77.2 4268 1261 169 196 86.2 95 120 79.2 16 31 51.6 0 0.2 0.4 -0.2 7 30 5 1 37 "Matches"

This is much better! Now let’s refactor a little and put this together into a class:

Code
class StatsExtractor:
    def __init__(self, stat: str):
        self.stat: str = stat
        self.base_url: str = (
            "https://fbref.com/en/comps/Big5/2023-2024/{stat}/players/2023-2024-Big-5-European-Leagues-Stats"
        )
        self.url: str = self.base_url.format(stat=self.stat)
        self.df = None
        self.column_descriptions = None
        self.fetch_data()
        self.extract_column_descriptions()
        self.extract_stats()
        self.parse_columns()

    def fetch_data(self) -> None:
        """Fetch the HTML content from the URL."""
        url = self.base_url.format(stat=self.stat)
        resp = requests.get(url)
        soup = BeautifulSoup(resp.content, features="lxml")
        self.html = soup.find_all("table")[-1]

    def extract_column_descriptions(self) -> None:
        """Extract column descriptions"""
        tr_header = self.html.findChildren(name="tr")[1]
        column_descriptions = {
            cell.get("aria-label"): cell.get("data-tip")
            for cell in tr_header.find_all(["th", "td"])
        }
        columns = [key for key in column_descriptions.keys()]

        if len(columns) != len(set(columns)):
            raise AssertionError("Column names must be unique.")
        self.columns = columns
        self.column_descriptions = column_descriptions

    def extract_stats(self) -> None:
        """Extract table from URL and store it in a DataFrame."""
        table_body = self.html.find_next(name="tbody")
        rows_players = table_body.findChildren(name="tr", attrs={"class": False})

        table_data: list = []
        for row in rows_players:
            cells = row.find_all(["th", "td"])
            table_data.append([cell.get_text(strip=True) for cell in cells])

        self.df = pl.DataFrame(table_data, schema=self.columns, orient="row")

    def parse_columns(self) -> None:
        """Parse columns of a DataFrame for better inference."""
        self.df.write_csv("temp.csv", quote_style="necessary")
        parsed_df = pl.read_csv("temp.csv", use_pyarrow=True)
        # if os.path.exists("temp.csv"):
            # os.remove("temp.csv")
        self.df = parsed_df

We can now invoke the methods after creating a StatsExtractor instance.

Code
stats_to_extract = [
        "passing",
        "keepers",
        "shooting",
        "passing_types",
        "gca",
        "defense",
        "possession",
]
extractors = {stat: StatsExtractor(stat) for stat in stats_to_extract}
dfs = {stat: extractor.df for stat, extractor in extractors.items()}

With dfs we only need to create one combined dataframe, and actually get to learn more about the syntax of polars.

Step 2: Exploring polars by data wrangling

After viewing the documentation and some sample code, polars looks very similar to dplyr in R - and I have to say I am quite pleased. My career started with loads of R and in my own time and projects I have transitioned more towards Python. However, the one and only thing I really miss about R is how straightforward and concise the syntax is compared to pandas. Take the following two dataframes

Code
df_passing = dfs["passing"]
df_possession = dfs["possession"]

I am so used to not having duplicate columns when performing joins. However in Python using either pandas or polars, I have not found any other way than to drop the duplicate columns. Let’s join these dataframes on the columns Player and Squad , and drop all the other duplicate columns. In polars we can do so using the selector cs.ends_with().

Code
import polars.selectors as cs

df_passing.join(df_possession, how="left", on=["Player", "Squad"]).drop(cs.ends_with("_right"))
shape: (2_852, 55)
Rk Player Nation Position Squad Competition Age Year of birth 90s Played Passes Completed Passes Attempted Pass Completion % Total Passing Distance Progressive Passing Distance Passes Completed (Short) Passes Attempted (Short) Pass Completion % (Short) Passes Completed (Medium) Passes Attempted (Medium) Pass Completion % (Medium) Passes Completed (Long) Passes Attempted (Long) Pass Completion % (Long) Assists xAG: Exp. Assisted Goals xA: Expected Assists Assists - xAG Key Passes Passes into Final Third Passes into Penalty Area Crosses into Penalty Area Progressive Passes Matches Touches Touches (Def Pen) Touches (Def 3rd) Touches (Mid 3rd) Touches (Att 3rd) Touches (Att Pen) Touches (Live-Ball) Take-Ons Attempted Successful Take-Ons Successful Take-On % Times Tackled During Take-On Tackled During Take-On Percentage Carries Total Carrying Distance Progressive Carrying Distance Progressive Carries Carries into Final Third Carries into Penalty Area Miscontrols Dispossessed Passes Received Progressive Passes Rec
i64 str str str str str i64 i64 f64 i64 i64 f64 i64 i64 i64 i64 f64 i64 i64 f64 i64 i64 f64 i64 f64 f64 f64 i64 i64 i64 i64 i64 str i64 i64 i64 i64 i64 i64 i64 i64 i64 f64 i64 f64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64
1 "Max Aarons" "engENG" "DF" "Bournemouth" "engPremier League" 23 2000 13.7 450 581 77.5 7402 2789 220 248 88.7 188 235 80.0 34 63 54.0 1 0.8 0.9 0.2 7 25 13 2 43 "Matches" 711 43 252 303 165 11 711 34 14 41.2 12 35.3 364 2174 1121 22 12 7 13 8 371 26
2 "Brenden Aaronson" "usUSA" "MF,FW" "Union Berlin" "deBundesliga" 22 2000 14.1 365 472 77.3 4890 1506 206 240 85.8 105 130 80.8 19 32 59.4 2 1.9 2.0 0.1 22 30 14 3 56 "Matches" 675 11 108 301 293 47 675 77 34 44.2 41 53.2 406 2721 1387 37 29 9 41 38 457 91
3 "Paxten Aaronson" "usUSA" "MF" "Eint Frankfurt" "deBundesliga" 19 2003 1.1 41 50 82.0 576 71 20 25 80.0 20 22 90.9 0 2 0.0 1 0.1 0.1 0.9 1 4 2 0 5 "Matches" 72 0 6 39 28 5 72 7 2 28.6 4 57.1 43 193 50 2 0 1 5 4 46 7
4 "Keyliane Abdallah" "frFRA" "FW" "Marseille" "frLigue 1" 17 2006 0.0 1 1 100.0 8 0 1 1 100.0 0 0 null 0 0 null 0 0.0 0.0 0.0 0 0 0 0 0 "Matches" 1 0 0 1 0 0 1 0 0 null 0 null 1 3 0 0 0 0 0 0 1 0
5 "Yunis Abdelhamid" "maMAR" "DF" "Reims" "frLigue 1" 35 1987 30.9 1552 1836 84.5 29618 9672 487 548 88.9 893 976 91.5 141 252 56.0 0 0.3 0.6 -0.3 8 129 3 0 137 "Matches" 2185 293 976 1119 114 35 2184 15 8 53.3 7 46.7 1506 8663 4921 36 19 0 23 4 1403 9
2848 "Lovro Zvonarek" "hrCRO" "FW,MF" "Bayern Munich" "deBundesliga" 18 2005 1.8 50 60 83.3 766 123 20 23 87.0 25 29 86.2 2 2 100.0 0 0.2 0.3 -0.2 2 5 1 0 7 "Matches" 73 2 9 38 26 8 73 3 1 33.3 2 66.7 43 207 86 7 1 4 3 1 55 9
2849 "Martin Ødegaard" "noNOR" "MF" "Arsenal" "engPremier League" 24 1998 34.3 1692 2006 84.3 25780 7898 907 992 91.4 635 744 85.3 91 140 65.0 10 9.6 11.2 0.4 102 160 130 13 344 "Matches" 2326 24 244 926 1173 165 2324 81 37 45.7 38 46.9 1563 8190 3514 91 83 37 53 58 1827 202
2850 "Milan Đurić" "baBIH" "FW" "Hellas Verona" "itSerie A" 33 1990 13.4 181 334 54.2 2116 392 130 213 61.0 32 69 46.4 5 12 41.7 1 1.2 0.5 -0.2 16 18 3 0 15 "Matches" 421 14 41 227 153 46 418 3 2 66.7 1 33.3 183 529 151 3 6 1 22 8 352 33
2851 "Milan Đurić" "baBIH" "FW" "Monza" "itSerie A" 33 1990 14.0 140 272 51.5 1700 251 93 156 59.6 35 86 40.7 0 3 0.0 1 0.7 0.2 0.3 5 3 2 0 8 "Matches" 381 32 51 227 104 37 381 4 2 50.0 2 50.0 157 458 108 2 2 1 24 14 323 31
2852 "Mateusz Łęgowski" "plPOL" "MF" "Salernitana" "itSerie A" 20 2003 12.7 292 378 77.2 4268 1261 169 196 86.2 95 120 79.2 16 31 51.6 0 0.2 0.4 -0.2 7 30 5 1 37 "Matches" 504 31 120 283 111 11 504 21 6 28.6 12 57.1 265 1526 757 16 13 3 20 12 294 17

Exactly what I wanted! Now we can iterate this process until there are no dataframes left to merge on.

Code
from functools import reduce

df_stats = reduce(
    lambda left, right: left.join(right, how="left", on=["Player", "Squad"]).drop(
        cs.ends_with("_right")
    ),
    dfs.values()
)
df_stats
shape: (2_852, 133)
Rk Player Nation Position Squad Competition Age Year of birth 90s Played Passes Completed Passes Attempted Pass Completion % Total Passing Distance Progressive Passing Distance Passes Completed (Short) Passes Attempted (Short) Pass Completion % (Short) Passes Completed (Medium) Passes Attempted (Medium) Pass Completion % (Medium) Passes Completed (Long) Passes Attempted (Long) Pass Completion % (Long) Assists xAG: Exp. Assisted Goals xA: Expected Assists Assists - xAG Key Passes Passes into Final Third Passes into Penalty Area Crosses into Penalty Area Progressive Passes Matches Matches Played Starts Minutes Goals Against Tackles Tackles Won Tackles (Def 3rd) Tackles (Mid 3rd) Tackles (Att 3rd) Dribblers Tackled Dribbles Challenged % of Dribblers Tackled Challenges Lost Blocks Shots Blocked Interceptions Tkl+Int Clearances Errors Touches Touches (Def Pen) Touches (Def 3rd) Touches (Mid 3rd) Touches (Att 3rd) Touches (Att Pen) Touches (Live-Ball) Take-Ons Attempted Successful Take-Ons Successful Take-On % Times Tackled During Take-On Tackled During Take-On Percentage Carries Total Carrying Distance Progressive Carrying Distance Progressive Carries Carries into Final Third Carries into Penalty Area Miscontrols Dispossessed Passes Received Progressive Passes Rec
i64 str str str str str i64 i64 f64 i64 i64 f64 i64 i64 i64 i64 f64 i64 i64 f64 i64 i64 f64 i64 f64 f64 f64 i64 i64 i64 i64 i64 str i64 i64 str i64 i64 i64 i64 i64 i64 i64 i64 f64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 f64 i64 f64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64
1 "Max Aarons" "engENG" "DF" "Bournemouth" "engPremier League" 23 2000 13.7 450 581 77.5 7402 2789 220 248 88.7 188 235 80.0 34 63 54.0 1 0.8 0.9 0.2 7 25 13 2 43 "Matches" null null null null 29 19 20 7 2 20 34 58.8 14 9 5 8 37 27 0 711 43 252 303 165 11 711 34 14 41.2 12 35.3 364 2174 1121 22 12 7 13 8 371 26
2 "Brenden Aaronson" "usUSA" "MF,FW" "Union Berlin" "deBundesliga" 22 2000 14.1 365 472 77.3 4890 1506 206 240 85.8 105 130 80.8 19 32 59.4 2 1.9 2.0 0.1 22 30 14 3 56 "Matches" null null null null 32 18 13 13 6 16 32 50.0 16 26 1 2 34 4 0 675 11 108 301 293 47 675 77 34 44.2 41 53.2 406 2721 1387 37 29 9 41 38 457 91
3 "Paxten Aaronson" "usUSA" "MF" "Eint Frankfurt" "deBundesliga" 19 2003 1.1 41 50 82.0 576 71 20 25 80.0 20 22 90.9 0 2 0.0 1 0.1 0.1 0.9 1 4 2 0 5 "Matches" null null null null 2 2 1 1 0 1 1 100.0 0 2 0 0 2 0 0 72 0 6 39 28 5 72 7 2 28.6 4 57.1 43 193 50 2 0 1 5 4 46 7
4 "Keyliane Abdallah" "frFRA" "FW" "Marseille" "frLigue 1" 17 2006 0.0 1 1 100.0 8 0 1 1 100.0 0 0 null 0 0 null 0 0.0 0.0 0.0 0 0 0 0 0 "Matches" null null null null 0 0 0 0 0 0 0 null 0 0 0 0 0 0 0 1 0 0 1 0 0 1 0 0 null 0 null 1 3 0 0 0 0 0 0 1 0
5 "Yunis Abdelhamid" "maMAR" "DF" "Reims" "frLigue 1" 35 1987 30.9 1552 1836 84.5 29618 9672 487 548 88.9 893 976 91.5 141 252 56.0 0 0.3 0.6 -0.3 8 129 3 0 137 "Matches" null null null null 64 35 36 23 5 26 45 57.8 19 51 32 39 103 109 2 2185 293 976 1119 114 35 2184 15 8 53.3 7 46.7 1506 8663 4921 36 19 0 23 4 1403 9
2848 "Lovro Zvonarek" "hrCRO" "FW,MF" "Bayern Munich" "deBundesliga" 18 2005 1.8 50 60 83.3 766 123 20 23 87.0 25 29 86.2 2 2 100.0 0 0.2 0.3 -0.2 2 5 1 0 7 "Matches" null null null null 0 0 0 0 0 0 5 0.0 5 3 0 0 0 2 0 73 2 9 38 26 8 73 3 1 33.3 2 66.7 43 207 86 7 1 4 3 1 55 9
2849 "Martin Ødegaard" "noNOR" "MF" "Arsenal" "engPremier League" 24 1998 34.3 1692 2006 84.3 25780 7898 907 992 91.4 635 744 85.3 91 140 65.0 10 9.6 11.2 0.4 102 160 130 13 344 "Matches" null null null null 49 17 18 21 10 17 58 29.3 41 23 1 15 64 5 0 2326 24 244 926 1173 165 2324 81 37 45.7 38 46.9 1563 8190 3514 91 83 37 53 58 1827 202
2850 "Milan Đurić" "baBIH" "FW" "Hellas Verona" "itSerie A" 33 1990 13.4 181 334 54.2 2116 392 130 213 61.0 32 69 46.4 5 12 41.7 1 1.2 0.5 -0.2 16 18 3 0 15 "Matches" null null null null 1 0 0 0 1 0 3 0.0 3 0 0 1 2 13 0 421 14 41 227 153 46 418 3 2 66.7 1 33.3 183 529 151 3 6 1 22 8 352 33
2851 "Milan Đurić" "baBIH" "FW" "Monza" "itSerie A" 33 1990 14.0 140 272 51.5 1700 251 93 156 59.6 35 86 40.7 0 3 0.0 1 0.7 0.2 0.3 5 3 2 0 8 "Matches" null null null null 5 5 0 5 0 2 4 50.0 2 6 2 0 5 27 0 381 32 51 227 104 37 381 4 2 50.0 2 50.0 157 458 108 2 2 1 24 14 323 31
2852 "Mateusz Łęgowski" "plPOL" "MF" "Salernitana" "itSerie A" 20 2003 12.7 292 378 77.2 4268 1261 169 196 86.2 95 120 79.2 16 31 51.6 0 0.2 0.4 -0.2 7 30 5 1 37 "Matches" null null null null 23 12 8 12 3 11 26 42.3 15 13 2 7 30 16 0 504 31 120 283 111 11 504 21 6 28.6 12 57.1 265 1526 757 16 13 3 20 12 294 17

Now it’s time to filter and select relevant features and change them slightly. Before doing alterations, I think it is fair to only include players who have played more than 90 minutes 15 times. Perhaps fair is not the right word, but including players who only played a handful of matches is misleading when evaluating their overall season performance. It does not capture consistency which is what we do want to include.

Code
df_stats.filter(pl.col("90s Played") >= 15)
shape: (1_215, 133)
Rk Player Nation Position Squad Competition Age Year of birth 90s Played Passes Completed Passes Attempted Pass Completion % Total Passing Distance Progressive Passing Distance Passes Completed (Short) Passes Attempted (Short) Pass Completion % (Short) Passes Completed (Medium) Passes Attempted (Medium) Pass Completion % (Medium) Passes Completed (Long) Passes Attempted (Long) Pass Completion % (Long) Assists xAG: Exp. Assisted Goals xA: Expected Assists Assists - xAG Key Passes Passes into Final Third Passes into Penalty Area Crosses into Penalty Area Progressive Passes Matches Matches Played Starts Minutes Goals Against Tackles Tackles Won Tackles (Def 3rd) Tackles (Mid 3rd) Tackles (Att 3rd) Dribblers Tackled Dribbles Challenged % of Dribblers Tackled Challenges Lost Blocks Shots Blocked Interceptions Tkl+Int Clearances Errors Touches Touches (Def Pen) Touches (Def 3rd) Touches (Mid 3rd) Touches (Att 3rd) Touches (Att Pen) Touches (Live-Ball) Take-Ons Attempted Successful Take-Ons Successful Take-On % Times Tackled During Take-On Tackled During Take-On Percentage Carries Total Carrying Distance Progressive Carrying Distance Progressive Carries Carries into Final Third Carries into Penalty Area Miscontrols Dispossessed Passes Received Progressive Passes Rec
i64 str str str str str i64 i64 f64 i64 i64 f64 i64 i64 i64 i64 f64 i64 i64 f64 i64 i64 f64 i64 f64 f64 f64 i64 i64 i64 i64 i64 str i64 i64 str i64 i64 i64 i64 i64 i64 i64 i64 f64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 f64 i64 f64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64
5 "Yunis Abdelhamid" "maMAR" "DF" "Reims" "frLigue 1" 35 1987 30.9 1552 1836 84.5 29618 9672 487 548 88.9 893 976 91.5 141 252 56.0 0 0.3 0.6 -0.3 8 129 3 0 137 "Matches" null null null null 64 35 36 23 5 26 45 57.8 19 51 32 39 103 109 2 2185 293 976 1119 114 35 2184 15 8 53.3 7 46.7 1506 8663 4921 36 19 0 23 4 1403 9
6 "Salis Abdul Samed" "ghGHA" "MF" "Lens" "frLigue 1" 23 2000 16.9 796 895 88.9 12470 3008 393 433 90.8 330 360 91.7 41 54 75.9 0 0.5 1.1 -0.5 6 87 5 2 78 "Matches" null null null null 21 14 8 10 3 8 18 44.4 10 12 1 12 33 18 0 1022 21 168 647 218 7 1022 19 7 36.8 11 57.9 823 4361 1683 9 33 0 23 11 780 20
8 "Laurent Abergel" "frFRA" "MF" "Lorient" "frLigue 1" 30 1993 31.8 1551 1836 84.5 27382 8613 629 707 89.0 711 802 88.7 150 233 64.4 1 2.2 1.9 -1.2 30 188 23 3 194 "Matches" null null null null 85 52 43 34 8 38 96 39.6 58 29 6 61 146 61 3 2164 105 553 1288 349 10 2164 65 44 67.7 18 27.7 1649 8707 3892 38 49 3 32 30 1502 51
10 "Abner" "brBRA" "DF" "Betis" "esLa Liga" 23 2000 15.6 615 797 77.2 9211 4552 351 394 89.1 203 265 76.6 37 79 46.8 1 1.0 1.0 0.0 8 20 7 2 33 "Matches" null null null null 25 19 15 9 1 17 34 50.0 17 23 5 15 40 62 0 966 72 344 405 226 28 966 13 7 53.8 5 38.5 447 1802 949 14 10 5 18 11 502 58
12 "Abdel Abqar" "maMAR" "DF" "Alavés" "esLa Liga" 24 1999 25.7 553 762 72.6 12370 4872 141 164 86.0 302 339 89.1 99 225 44.0 0 0.2 0.2 -0.2 3 61 2 0 49 "Matches" null null null null 35 19 28 5 2 21 35 60.0 14 31 26 23 58 115 0 1032 191 570 422 42 21 1032 5 3 60.0 1 20.0 492 2460 1418 7 6 0 15 1 499 1
2839 "Joshua Zirkzee" "nlNED" "FW" "Bologna" "itSerie A" 22 2001 30.7 710 917 77.4 9106 1752 446 539 82.7 166 213 77.9 32 46 69.6 4 5.2 2.5 -1.2 43 40 28 0 86 "Matches" null null null null 32 20 8 15 9 13 24 54.2 11 25 2 3 35 36 0 1374 46 135 628 630 136 1372 128 52 40.6 72 56.3 814 4132 1772 53 46 22 114 64 1033 182
2844 "Kurt Zouma" "frFRA" "DF" "West Ham" "engPremier League" 28 1994 31.5 962 1148 83.8 18843 6791 335 365 91.8 509 558 91.2 111 198 56.1 0 0.2 0.1 -0.2 4 57 1 0 53 "Matches" null null null null 23 17 14 9 0 10 16 62.5 6 38 28 29 52 160 0 1455 275 798 626 34 28 1455 3 2 66.7 1 33.3 794 3751 2019 7 5 1 13 3 873 1
2845 "Igor Zubeldia" "esESP" "DF" "Real Sociedad" "esLa Liga" 26 1997 28.0 1476 1746 84.5 30351 9423 387 450 86.0 913 989 92.3 167 267 62.5 2 1.1 0.8 0.9 10 99 4 0 129 "Matches" null null null null 32 25 20 10 2 19 26 73.1 7 28 19 32 64 97 3 1983 162 908 1034 50 18 1983 8 4 50.0 4 50.0 1229 7537 4725 28 2 0 14 3 1316 0
2846 "Martín Zubimendi" "esESP" "MF" "Real Sociedad" "esLa Liga" 24 1999 29.5 1321 1542 85.7 22165 6269 607 677 89.7 607 677 89.7 86 126 68.3 1 1.5 2.0 -0.5 16 136 15 1 152 "Matches" null null null null 49 27 18 29 2 23 42 54.8 19 33 9 37 86 54 0 1793 91 397 1144 267 32 1793 18 12 66.7 5 27.8 942 5305 2671 34 41 5 26 16 1185 27
2849 "Martin Ødegaard" "noNOR" "MF" "Arsenal" "engPremier League" 24 1998 34.3 1692 2006 84.3 25780 7898 907 992 91.4 635 744 85.3 91 140 65.0 10 9.6 11.2 0.4 102 160 130 13 344 "Matches" null null null null 49 17 18 21 10 17 58 29.3 41 23 1 15 64 5 0 2326 24 244 926 1173 165 2324 81 37 45.7 38 46.9 1563 8190 3514 91 83 37 53 58 1827 202

Although players can play multiple positions, for the plots I prefer we keep it on a down low. The Position column now has these unique values:

Code
df_stats.select(pl.col("Position")).unique().to_series().to_list()
['DF,FW', 'GK', 'DF', 'MF,FW', 'FW,DF', 'MF', 'DF,MF', 'FW,MF', 'FW', 'MF,DF']

For the players that play multiple positions, we take the position specified first assuming it is their primary position. This will bring down the number of different positions to four!

Code
df_stats_relevant = (df_stats
    .filter(pl.col("90s Played") >= 15)
    .with_columns(pl.col("Position").str.slice(0, 2))
)

UMAP requires numerical data, so let’s only select numerical columns, which in this case are all non-string columns.

Code
df_stats_relevant.select(~cs.by_dtype(pl.String))
shape: (1_215, 126)
Rk Age Year of birth 90s Played Passes Completed Passes Attempted Pass Completion % Total Passing Distance Progressive Passing Distance Passes Completed (Short) Passes Attempted (Short) Pass Completion % (Short) Passes Completed (Medium) Passes Attempted (Medium) Pass Completion % (Medium) Passes Completed (Long) Passes Attempted (Long) Pass Completion % (Long) Assists xAG: Exp. Assisted Goals xA: Expected Assists Assists - xAG Key Passes Passes into Final Third Passes into Penalty Area Crosses into Penalty Area Progressive Passes Matches Played Starts Goals Against Goals Against/90 Shots on Target Against Saves Save Percentage Wins Draws Losses Tackles Tackles Won Tackles (Def 3rd) Tackles (Mid 3rd) Tackles (Att 3rd) Dribblers Tackled Dribbles Challenged % of Dribblers Tackled Challenges Lost Blocks Shots Blocked Interceptions Tkl+Int Clearances Errors Touches Touches (Def Pen) Touches (Def 3rd) Touches (Mid 3rd) Touches (Att 3rd) Touches (Att Pen) Touches (Live-Ball) Take-Ons Attempted Successful Take-Ons Successful Take-On % Times Tackled During Take-On Tackled During Take-On Percentage Carries Total Carrying Distance Progressive Carrying Distance Progressive Carries Carries into Final Third Carries into Penalty Area Miscontrols Dispossessed Passes Received Progressive Passes Rec
i64 i64 i64 f64 i64 i64 f64 i64 i64 i64 i64 f64 i64 i64 f64 i64 i64 f64 i64 f64 f64 f64 i64 i64 i64 i64 i64 i64 i64 i64 f64 i64 i64 f64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 f64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 f64 i64 f64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64
5 35 1987 30.9 1552 1836 84.5 29618 9672 487 548 88.9 893 976 91.5 141 252 56.0 0 0.3 0.6 -0.3 8 129 3 0 137 null null null null null null null null null null 64 35 36 23 5 26 45 57.8 19 51 32 39 103 109 2 2185 293 976 1119 114 35 2184 15 8 53.3 7 46.7 1506 8663 4921 36 19 0 23 4 1403 9
6 23 2000 16.9 796 895 88.9 12470 3008 393 433 90.8 330 360 91.7 41 54 75.9 0 0.5 1.1 -0.5 6 87 5 2 78 null null null null null null null null null null 21 14 8 10 3 8 18 44.4 10 12 1 12 33 18 0 1022 21 168 647 218 7 1022 19 7 36.8 11 57.9 823 4361 1683 9 33 0 23 11 780 20
8 30 1993 31.8 1551 1836 84.5 27382 8613 629 707 89.0 711 802 88.7 150 233 64.4 1 2.2 1.9 -1.2 30 188 23 3 194 null null null null null null null null null null 85 52 43 34 8 38 96 39.6 58 29 6 61 146 61 3 2164 105 553 1288 349 10 2164 65 44 67.7 18 27.7 1649 8707 3892 38 49 3 32 30 1502 51
10 23 2000 15.6 615 797 77.2 9211 4552 351 394 89.1 203 265 76.6 37 79 46.8 1 1.0 1.0 0.0 8 20 7 2 33 null null null null null null null null null null 25 19 15 9 1 17 34 50.0 17 23 5 15 40 62 0 966 72 344 405 226 28 966 13 7 53.8 5 38.5 447 1802 949 14 10 5 18 11 502 58
12 24 1999 25.7 553 762 72.6 12370 4872 141 164 86.0 302 339 89.1 99 225 44.0 0 0.2 0.2 -0.2 3 61 2 0 49 null null null null null null null null null null 35 19 28 5 2 21 35 60.0 14 31 26 23 58 115 0 1032 191 570 422 42 21 1032 5 3 60.0 1 20.0 492 2460 1418 7 6 0 15 1 499 1
2839 22 2001 30.7 710 917 77.4 9106 1752 446 539 82.7 166 213 77.9 32 46 69.6 4 5.2 2.5 -1.2 43 40 28 0 86 null null null null null null null null null null 32 20 8 15 9 13 24 54.2 11 25 2 3 35 36 0 1374 46 135 628 630 136 1372 128 52 40.6 72 56.3 814 4132 1772 53 46 22 114 64 1033 182
2844 28 1994 31.5 962 1148 83.8 18843 6791 335 365 91.8 509 558 91.2 111 198 56.1 0 0.2 0.1 -0.2 4 57 1 0 53 null null null null null null null null null null 23 17 14 9 0 10 16 62.5 6 38 28 29 52 160 0 1455 275 798 626 34 28 1455 3 2 66.7 1 33.3 794 3751 2019 7 5 1 13 3 873 1
2845 26 1997 28.0 1476 1746 84.5 30351 9423 387 450 86.0 913 989 92.3 167 267 62.5 2 1.1 0.8 0.9 10 99 4 0 129 null null null null null null null null null null 32 25 20 10 2 19 26 73.1 7 28 19 32 64 97 3 1983 162 908 1034 50 18 1983 8 4 50.0 4 50.0 1229 7537 4725 28 2 0 14 3 1316 0
2846 24 1999 29.5 1321 1542 85.7 22165 6269 607 677 89.7 607 677 89.7 86 126 68.3 1 1.5 2.0 -0.5 16 136 15 1 152 null null null null null null null null null null 49 27 18 29 2 23 42 54.8 19 33 9 37 86 54 0 1793 91 397 1144 267 32 1793 18 12 66.7 5 27.8 942 5305 2671 34 41 5 26 16 1185 27
2849 24 1998 34.3 1692 2006 84.3 25780 7898 907 992 91.4 635 744 85.3 91 140 65.0 10 9.6 11.2 0.4 102 160 130 13 344 null null null null null null null null null null 49 17 18 21 10 17 58 29.3 41 23 1 15 64 5 0 2326 24 244 926 1173 165 2324 81 37 45.7 38 46.9 1563 8190 3514 91 83 37 53 58 1827 202

There are still some irrelevant columns, namely Rank, Age and Year of Birth.

Code
df_stats_relevant.select(
    ~cs.by_dtype(pl.String) & ~cs.by_name(["Age", "Rk", "Year of birth"])
)
shape: (1_215, 123)
90s Played Passes Completed Passes Attempted Pass Completion % Total Passing Distance Progressive Passing Distance Passes Completed (Short) Passes Attempted (Short) Pass Completion % (Short) Passes Completed (Medium) Passes Attempted (Medium) Pass Completion % (Medium) Passes Completed (Long) Passes Attempted (Long) Pass Completion % (Long) Assists xAG: Exp. Assisted Goals xA: Expected Assists Assists - xAG Key Passes Passes into Final Third Passes into Penalty Area Crosses into Penalty Area Progressive Passes Matches Played Starts Goals Against Goals Against/90 Shots on Target Against Saves Save Percentage Wins Draws Losses Clean Sheets Clean Sheet Percentage Penalty Kicks Attempted Tackles Tackles Won Tackles (Def 3rd) Tackles (Mid 3rd) Tackles (Att 3rd) Dribblers Tackled Dribbles Challenged % of Dribblers Tackled Challenges Lost Blocks Shots Blocked Interceptions Tkl+Int Clearances Errors Touches Touches (Def Pen) Touches (Def 3rd) Touches (Mid 3rd) Touches (Att 3rd) Touches (Att Pen) Touches (Live-Ball) Take-Ons Attempted Successful Take-Ons Successful Take-On % Times Tackled During Take-On Tackled During Take-On Percentage Carries Total Carrying Distance Progressive Carrying Distance Progressive Carries Carries into Final Third Carries into Penalty Area Miscontrols Dispossessed Passes Received Progressive Passes Rec
f64 i64 i64 f64 i64 i64 i64 i64 f64 i64 i64 f64 i64 i64 f64 i64 f64 f64 f64 i64 i64 i64 i64 i64 i64 i64 i64 f64 i64 i64 f64 i64 i64 i64 i64 f64 i64 i64 i64 i64 i64 i64 i64 i64 f64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 f64 i64 f64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64
30.9 1552 1836 84.5 29618 9672 487 548 88.9 893 976 91.5 141 252 56.0 0 0.3 0.6 -0.3 8 129 3 0 137 null null null null null null null null null null null null null 64 35 36 23 5 26 45 57.8 19 51 32 39 103 109 2 2185 293 976 1119 114 35 2184 15 8 53.3 7 46.7 1506 8663 4921 36 19 0 23 4 1403 9
16.9 796 895 88.9 12470 3008 393 433 90.8 330 360 91.7 41 54 75.9 0 0.5 1.1 -0.5 6 87 5 2 78 null null null null null null null null null null null null null 21 14 8 10 3 8 18 44.4 10 12 1 12 33 18 0 1022 21 168 647 218 7 1022 19 7 36.8 11 57.9 823 4361 1683 9 33 0 23 11 780 20
31.8 1551 1836 84.5 27382 8613 629 707 89.0 711 802 88.7 150 233 64.4 1 2.2 1.9 -1.2 30 188 23 3 194 null null null null null null null null null null null null null 85 52 43 34 8 38 96 39.6 58 29 6 61 146 61 3 2164 105 553 1288 349 10 2164 65 44 67.7 18 27.7 1649 8707 3892 38 49 3 32 30 1502 51
15.6 615 797 77.2 9211 4552 351 394 89.1 203 265 76.6 37 79 46.8 1 1.0 1.0 0.0 8 20 7 2 33 null null null null null null null null null null null null null 25 19 15 9 1 17 34 50.0 17 23 5 15 40 62 0 966 72 344 405 226 28 966 13 7 53.8 5 38.5 447 1802 949 14 10 5 18 11 502 58
25.7 553 762 72.6 12370 4872 141 164 86.0 302 339 89.1 99 225 44.0 0 0.2 0.2 -0.2 3 61 2 0 49 null null null null null null null null null null null null null 35 19 28 5 2 21 35 60.0 14 31 26 23 58 115 0 1032 191 570 422 42 21 1032 5 3 60.0 1 20.0 492 2460 1418 7 6 0 15 1 499 1
30.7 710 917 77.4 9106 1752 446 539 82.7 166 213 77.9 32 46 69.6 4 5.2 2.5 -1.2 43 40 28 0 86 null null null null null null null null null null null null null 32 20 8 15 9 13 24 54.2 11 25 2 3 35 36 0 1374 46 135 628 630 136 1372 128 52 40.6 72 56.3 814 4132 1772 53 46 22 114 64 1033 182
31.5 962 1148 83.8 18843 6791 335 365 91.8 509 558 91.2 111 198 56.1 0 0.2 0.1 -0.2 4 57 1 0 53 null null null null null null null null null null null null null 23 17 14 9 0 10 16 62.5 6 38 28 29 52 160 0 1455 275 798 626 34 28 1455 3 2 66.7 1 33.3 794 3751 2019 7 5 1 13 3 873 1
28.0 1476 1746 84.5 30351 9423 387 450 86.0 913 989 92.3 167 267 62.5 2 1.1 0.8 0.9 10 99 4 0 129 null null null null null null null null null null null null null 32 25 20 10 2 19 26 73.1 7 28 19 32 64 97 3 1983 162 908 1034 50 18 1983 8 4 50.0 4 50.0 1229 7537 4725 28 2 0 14 3 1316 0
29.5 1321 1542 85.7 22165 6269 607 677 89.7 607 677 89.7 86 126 68.3 1 1.5 2.0 -0.5 16 136 15 1 152 null null null null null null null null null null null null null 49 27 18 29 2 23 42 54.8 19 33 9 37 86 54 0 1793 91 397 1144 267 32 1793 18 12 66.7 5 27.8 942 5305 2671 34 41 5 26 16 1185 27
34.3 1692 2006 84.3 25780 7898 907 992 91.4 635 744 85.3 91 140 65.0 10 9.6 11.2 0.4 102 160 130 13 344 null null null null null null null null null null null null null 49 17 18 21 10 17 58 29.3 41 23 1 15 64 5 0 2326 24 244 926 1173 165 2324 81 37 45.7 38 46.9 1563 8190 3514 91 83 37 53 58 1827 202

For a good comparison, it is fair to condition all stats per 90 minutes played. I am happy to say there is a column for that, so we can divide each column by that column.

Code
df_stats_relevant.select(
    ~cs.by_dtype(pl.String) & ~cs.by_name(["Age", "Rk", "Year of birth"])
).with_columns(pl.all() / pl.col("90s Played").exclude("90s Played"))
shape: (1_215, 123)
90s Played Passes Completed Passes Attempted Pass Completion % Total Passing Distance Progressive Passing Distance Passes Completed (Short) Passes Attempted (Short) Pass Completion % (Short) Passes Completed (Medium) Passes Attempted (Medium) Pass Completion % (Medium) Passes Completed (Long) Passes Attempted (Long) Pass Completion % (Long) Assists xAG: Exp. Assisted Goals xA: Expected Assists Assists - xAG Key Passes Passes into Final Third Passes into Penalty Area Crosses into Penalty Area Progressive Passes Matches Played Starts Goals Against Goals Against/90 Shots on Target Against Saves Save Percentage Wins Draws Losses Clean Sheets Clean Sheet Percentage Penalty Kicks Attempted Tackles Tackles Won Tackles (Def 3rd) Tackles (Mid 3rd) Tackles (Att 3rd) Dribblers Tackled Dribbles Challenged % of Dribblers Tackled Challenges Lost Blocks Shots Blocked Interceptions Tkl+Int Clearances Errors Touches Touches (Def Pen) Touches (Def 3rd) Touches (Mid 3rd) Touches (Att 3rd) Touches (Att Pen) Touches (Live-Ball) Take-Ons Attempted Successful Take-Ons Successful Take-On % Times Tackled During Take-On Tackled During Take-On Percentage Carries Total Carrying Distance Progressive Carrying Distance Progressive Carries Carries into Final Third Carries into Penalty Area Miscontrols Dispossessed Passes Received Progressive Passes Rec
f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64
30.9 50.226537 59.417476 2.734628 958.511327 313.009709 15.760518 17.734628 2.877023 28.899676 31.585761 2.961165 4.563107 8.15534 1.812298 0.0 0.009709 0.019417 -0.009709 0.2589 4.174757 0.097087 0.0 4.433657 null null null null null null null null null null null null null 2.071197 1.132686 1.165049 0.744337 0.161812 0.841424 1.456311 1.87055 0.614887 1.650485 1.035599 1.262136 3.333333 3.527508 0.064725 70.711974 9.482201 31.585761 36.213592 3.68932 1.132686 70.679612 0.485437 0.2589 1.724919 0.226537 1.511327 48.737864 280.355987 159.255663 1.165049 0.614887 0.0 0.744337 0.12945 45.404531 0.291262
16.9 47.100592 52.95858 5.260355 737.869822 177.988166 23.254438 25.621302 5.372781 19.526627 21.301775 5.426036 2.426036 3.195266 4.491124 0.0 0.029586 0.065089 -0.029586 0.35503 5.147929 0.295858 0.118343 4.615385 null null null null null null null null null null null null null 1.242604 0.828402 0.473373 0.591716 0.177515 0.473373 1.065089 2.627219 0.591716 0.710059 0.059172 0.710059 1.952663 1.065089 0.0 60.473373 1.242604 9.940828 38.284024 12.899408 0.414201 60.473373 1.12426 0.414201 2.177515 0.650888 3.426036 48.698225 258.047337 99.585799 0.532544 1.952663 0.0 1.360947 0.650888 46.153846 1.183432
31.8 48.773585 57.735849 2.657233 861.069182 270.849057 19.779874 22.232704 2.798742 22.358491 25.220126 2.789308 4.716981 7.327044 2.025157 0.031447 0.069182 0.059748 -0.037736 0.943396 5.91195 0.72327 0.09434 6.100629 null null null null null null null null null null null null null 2.672956 1.63522 1.352201 1.069182 0.251572 1.194969 3.018868 1.245283 1.823899 0.91195 0.188679 1.918239 4.591195 1.918239 0.09434 68.050314 3.301887 17.389937 40.503145 10.974843 0.314465 68.050314 2.044025 1.383648 2.128931 0.566038 0.871069 51.855346 273.805031 122.389937 1.194969 1.540881 0.09434 1.006289 0.943396 47.232704 1.603774
15.6 39.423077 51.089744 4.948718 590.448718 291.794872 22.5 25.25641 5.711538 13.012821 16.987179 4.910256 2.371795 5.064103 3.0 0.064103 0.064103 0.064103 0.0 0.512821 1.282051 0.448718 0.128205 2.115385 null null null null null null null null null null null null null 1.602564 1.217949 0.961538 0.576923 0.064103 1.089744 2.179487 3.205128 1.089744 1.474359 0.320513 0.961538 2.564103 3.974359 0.0 61.923077 4.615385 22.051282 25.961538 14.487179 1.794872 61.923077 0.833333 0.448718 3.448718 0.320513 2.467949 28.653846 115.512821 60.833333 0.897436 0.641026 0.320513 1.153846 0.705128 32.179487 3.717949
25.7 21.51751 29.649805 2.824903 481.322957 189.571984 5.486381 6.381323 3.346304 11.750973 13.190661 3.466926 3.85214 8.754864 1.712062 0.0 0.007782 0.007782 -0.007782 0.116732 2.373541 0.077821 0.0 1.906615 null null null null null null null null null null null null null 1.361868 0.7393 1.089494 0.194553 0.077821 0.817121 1.361868 2.33463 0.544747 1.206226 1.011673 0.894942 2.256809 4.474708 0.0 40.155642 7.431907 22.178988 16.420233 1.634241 0.817121 40.155642 0.194553 0.116732 2.33463 0.038911 0.77821 19.143969 95.719844 55.175097 0.272374 0.233463 0.0 0.583658 0.038911 19.416342 0.038911
30.7 23.127036 29.869707 2.521173 296.612378 57.068404 14.527687 17.557003 2.693811 5.407166 6.938111 2.537459 1.042345 1.498371 2.267101 0.130293 0.169381 0.081433 -0.039088 1.400651 1.302932 0.912052 0.0 2.801303 null null null null null null null null null null null null null 1.042345 0.651466 0.260586 0.488599 0.29316 0.423453 0.781759 1.765472 0.358306 0.814332 0.065147 0.09772 1.140065 1.172638 0.0 44.7557 1.498371 4.397394 20.456026 20.521173 4.429967 44.690554 4.169381 1.693811 1.322476 2.345277 1.833876 26.514658 134.592834 57.71987 1.726384 1.498371 0.716612 3.713355 2.084691 33.648208 5.928339
31.5 30.539683 36.444444 2.660317 598.190476 215.587302 10.634921 11.587302 2.914286 16.15873 17.714286 2.895238 3.52381 6.285714 1.780952 0.0 0.006349 0.003175 -0.006349 0.126984 1.809524 0.031746 0.0 1.68254 null null null null null null null null null null null null null 0.730159 0.539683 0.444444 0.285714 0.0 0.31746 0.507937 1.984127 0.190476 1.206349 0.888889 0.920635 1.650794 5.079365 0.0 46.190476 8.730159 25.333333 19.873016 1.079365 0.888889 46.190476 0.095238 0.063492 2.11746 0.031746 1.057143 25.206349 119.079365 64.095238 0.222222 0.15873 0.031746 0.412698 0.095238 27.714286 0.031746
28.0 52.714286 62.357143 3.017857 1083.964286 336.535714 13.821429 16.071429 3.071429 32.607143 35.321429 3.296429 5.964286 9.535714 2.232143 0.071429 0.039286 0.028571 0.032143 0.357143 3.535714 0.142857 0.0 4.607143 null null null null null null null null null null null null null 1.142857 0.892857 0.714286 0.357143 0.071429 0.678571 0.928571 2.610714 0.25 1.0 0.678571 1.142857 2.285714 3.464286 0.107143 70.821429 5.785714 32.428571 36.928571 1.785714 0.642857 70.821429 0.285714 0.142857 1.785714 0.142857 1.785714 43.892857 269.178571 168.75 1.0 0.071429 0.0 0.5 0.107143 47.0 0.0
29.5 44.779661 52.271186 2.905085 751.355932 212.508475 20.576271 22.949153 3.040678 20.576271 22.949153 3.040678 2.915254 4.271186 2.315254 0.033898 0.050847 0.067797 -0.016949 0.542373 4.610169 0.508475 0.033898 5.152542 null null null null null null null null null null null null null 1.661017 0.915254 0.610169 0.983051 0.067797 0.779661 1.423729 1.857627 0.644068 1.118644 0.305085 1.254237 2.915254 1.830508 0.0 60.779661 3.084746 13.457627 38.779661 9.050847 1.084746 60.779661 0.610169 0.40678 2.261017 0.169492 0.942373 31.932203 179.830508 90.542373 1.152542 1.389831 0.169492 0.881356 0.542373 40.169492 0.915254
34.3 49.329446 58.483965 2.457726 751.603499 230.262391 26.443149 28.921283 2.664723 18.51312 21.690962 2.48688 2.653061 4.081633 1.895044 0.291545 0.279883 0.326531 0.011662 2.973761 4.664723 3.790087 0.379009 10.029155 null null null null null null null null null null null null null 1.428571 0.495627 0.524781 0.612245 0.291545 0.495627 1.690962 0.854227 1.195335 0.670554 0.029155 0.437318 1.865889 0.145773 0.0 67.813411 0.699708 7.113703 26.997085 34.198251 4.810496 67.755102 2.361516 1.078717 1.332362 1.107872 1.367347 45.568513 238.77551 102.44898 2.653061 2.419825 1.078717 1.54519 1.690962 53.265306 5.889213

The stat of number of 90s played can be ignored if we disregard ‘fitness’ as a trait Toni Kroos had. This brings us to the following dataframe:

Code
df_stats_relevant_umap = (
    df_stats_relevant.select(
        ~cs.by_dtype(pl.String) & ~cs.by_name(["Age", "Rk", "Year of birth"])
    )
    .with_columns(pl.all() / pl.col("90s Played").exclude("90s Played"))
    .drop(pl.col("90s Played"))
)
df_stats_relevant_umap
shape: (1_215, 122)
Passes Completed Passes Attempted Pass Completion % Total Passing Distance Progressive Passing Distance Passes Completed (Short) Passes Attempted (Short) Pass Completion % (Short) Passes Completed (Medium) Passes Attempted (Medium) Pass Completion % (Medium) Passes Completed (Long) Passes Attempted (Long) Pass Completion % (Long) Assists xAG: Exp. Assisted Goals xA: Expected Assists Assists - xAG Key Passes Passes into Final Third Passes into Penalty Area Crosses into Penalty Area Progressive Passes Matches Played Starts Goals Against Goals Against/90 Shots on Target Against Saves Save Percentage Wins Draws Losses Clean Sheets Clean Sheet Percentage Penalty Kicks Attempted Penalty Kicks Allowed Tackles Tackles Won Tackles (Def 3rd) Tackles (Mid 3rd) Tackles (Att 3rd) Dribblers Tackled Dribbles Challenged % of Dribblers Tackled Challenges Lost Blocks Shots Blocked Interceptions Tkl+Int Clearances Errors Touches Touches (Def Pen) Touches (Def 3rd) Touches (Mid 3rd) Touches (Att 3rd) Touches (Att Pen) Touches (Live-Ball) Take-Ons Attempted Successful Take-Ons Successful Take-On % Times Tackled During Take-On Tackled During Take-On Percentage Carries Total Carrying Distance Progressive Carrying Distance Progressive Carries Carries into Final Third Carries into Penalty Area Miscontrols Dispossessed Passes Received Progressive Passes Rec
f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64
50.226537 59.417476 2.734628 958.511327 313.009709 15.760518 17.734628 2.877023 28.899676 31.585761 2.961165 4.563107 8.15534 1.812298 0.0 0.009709 0.019417 -0.009709 0.2589 4.174757 0.097087 0.0 4.433657 null null null null null null null null null null null null null null 2.071197 1.132686 1.165049 0.744337 0.161812 0.841424 1.456311 1.87055 0.614887 1.650485 1.035599 1.262136 3.333333 3.527508 0.064725 70.711974 9.482201 31.585761 36.213592 3.68932 1.132686 70.679612 0.485437 0.2589 1.724919 0.226537 1.511327 48.737864 280.355987 159.255663 1.165049 0.614887 0.0 0.744337 0.12945 45.404531 0.291262
47.100592 52.95858 5.260355 737.869822 177.988166 23.254438 25.621302 5.372781 19.526627 21.301775 5.426036 2.426036 3.195266 4.491124 0.0 0.029586 0.065089 -0.029586 0.35503 5.147929 0.295858 0.118343 4.615385 null null null null null null null null null null null null null null 1.242604 0.828402 0.473373 0.591716 0.177515 0.473373 1.065089 2.627219 0.591716 0.710059 0.059172 0.710059 1.952663 1.065089 0.0 60.473373 1.242604 9.940828 38.284024 12.899408 0.414201 60.473373 1.12426 0.414201 2.177515 0.650888 3.426036 48.698225 258.047337 99.585799 0.532544 1.952663 0.0 1.360947 0.650888 46.153846 1.183432
48.773585 57.735849 2.657233 861.069182 270.849057 19.779874 22.232704 2.798742 22.358491 25.220126 2.789308 4.716981 7.327044 2.025157 0.031447 0.069182 0.059748 -0.037736 0.943396 5.91195 0.72327 0.09434 6.100629 null null null null null null null null null null null null null null 2.672956 1.63522 1.352201 1.069182 0.251572 1.194969 3.018868 1.245283 1.823899 0.91195 0.188679 1.918239 4.591195 1.918239 0.09434 68.050314 3.301887 17.389937 40.503145 10.974843 0.314465 68.050314 2.044025 1.383648 2.128931 0.566038 0.871069 51.855346 273.805031 122.389937 1.194969 1.540881 0.09434 1.006289 0.943396 47.232704 1.603774
39.423077 51.089744 4.948718 590.448718 291.794872 22.5 25.25641 5.711538 13.012821 16.987179 4.910256 2.371795 5.064103 3.0 0.064103 0.064103 0.064103 0.0 0.512821 1.282051 0.448718 0.128205 2.115385 null null null null null null null null null null null null null null 1.602564 1.217949 0.961538 0.576923 0.064103 1.089744 2.179487 3.205128 1.089744 1.474359 0.320513 0.961538 2.564103 3.974359 0.0 61.923077 4.615385 22.051282 25.961538 14.487179 1.794872 61.923077 0.833333 0.448718 3.448718 0.320513 2.467949 28.653846 115.512821 60.833333 0.897436 0.641026 0.320513 1.153846 0.705128 32.179487 3.717949
21.51751 29.649805 2.824903 481.322957 189.571984 5.486381 6.381323 3.346304 11.750973 13.190661 3.466926 3.85214 8.754864 1.712062 0.0 0.007782 0.007782 -0.007782 0.116732 2.373541 0.077821 0.0 1.906615 null null null null null null null null null null null null null null 1.361868 0.7393 1.089494 0.194553 0.077821 0.817121 1.361868 2.33463 0.544747 1.206226 1.011673 0.894942 2.256809 4.474708 0.0 40.155642 7.431907 22.178988 16.420233 1.634241 0.817121 40.155642 0.194553 0.116732 2.33463 0.038911 0.77821 19.143969 95.719844 55.175097 0.272374 0.233463 0.0 0.583658 0.038911 19.416342 0.038911
23.127036 29.869707 2.521173 296.612378 57.068404 14.527687 17.557003 2.693811 5.407166 6.938111 2.537459 1.042345 1.498371 2.267101 0.130293 0.169381 0.081433 -0.039088 1.400651 1.302932 0.912052 0.0 2.801303 null null null null null null null null null null null null null null 1.042345 0.651466 0.260586 0.488599 0.29316 0.423453 0.781759 1.765472 0.358306 0.814332 0.065147 0.09772 1.140065 1.172638 0.0 44.7557 1.498371 4.397394 20.456026 20.521173 4.429967 44.690554 4.169381 1.693811 1.322476 2.345277 1.833876 26.514658 134.592834 57.71987 1.726384 1.498371 0.716612 3.713355 2.084691 33.648208 5.928339
30.539683 36.444444 2.660317 598.190476 215.587302 10.634921 11.587302 2.914286 16.15873 17.714286 2.895238 3.52381 6.285714 1.780952 0.0 0.006349 0.003175 -0.006349 0.126984 1.809524 0.031746 0.0 1.68254 null null null null null null null null null null null null null null 0.730159 0.539683 0.444444 0.285714 0.0 0.31746 0.507937 1.984127 0.190476 1.206349 0.888889 0.920635 1.650794 5.079365 0.0 46.190476 8.730159 25.333333 19.873016 1.079365 0.888889 46.190476 0.095238 0.063492 2.11746 0.031746 1.057143 25.206349 119.079365 64.095238 0.222222 0.15873 0.031746 0.412698 0.095238 27.714286 0.031746
52.714286 62.357143 3.017857 1083.964286 336.535714 13.821429 16.071429 3.071429 32.607143 35.321429 3.296429 5.964286 9.535714 2.232143 0.071429 0.039286 0.028571 0.032143 0.357143 3.535714 0.142857 0.0 4.607143 null null null null null null null null null null null null null null 1.142857 0.892857 0.714286 0.357143 0.071429 0.678571 0.928571 2.610714 0.25 1.0 0.678571 1.142857 2.285714 3.464286 0.107143 70.821429 5.785714 32.428571 36.928571 1.785714 0.642857 70.821429 0.285714 0.142857 1.785714 0.142857 1.785714 43.892857 269.178571 168.75 1.0 0.071429 0.0 0.5 0.107143 47.0 0.0
44.779661 52.271186 2.905085 751.355932 212.508475 20.576271 22.949153 3.040678 20.576271 22.949153 3.040678 2.915254 4.271186 2.315254 0.033898 0.050847 0.067797 -0.016949 0.542373 4.610169 0.508475 0.033898 5.152542 null null null null null null null null null null null null null null 1.661017 0.915254 0.610169 0.983051 0.067797 0.779661 1.423729 1.857627 0.644068 1.118644 0.305085 1.254237 2.915254 1.830508 0.0 60.779661 3.084746 13.457627 38.779661 9.050847 1.084746 60.779661 0.610169 0.40678 2.261017 0.169492 0.942373 31.932203 179.830508 90.542373 1.152542 1.389831 0.169492 0.881356 0.542373 40.169492 0.915254
49.329446 58.483965 2.457726 751.603499 230.262391 26.443149 28.921283 2.664723 18.51312 21.690962 2.48688 2.653061 4.081633 1.895044 0.291545 0.279883 0.326531 0.011662 2.973761 4.664723 3.790087 0.379009 10.029155 null null null null null null null null null null null null null null 1.428571 0.495627 0.524781 0.612245 0.291545 0.495627 1.690962 0.854227 1.195335 0.670554 0.029155 0.437318 1.865889 0.145773 0.0 67.813411 0.699708 7.113703 26.997085 34.198251 4.810496 67.755102 2.361516 1.078717 1.332362 1.107872 1.367347 45.568513 238.77551 102.44898 2.653061 2.419825 1.078717 1.54519 1.690962 53.265306 5.889213

There is only one thing left I would like to change. For the plots in the next step, it is useful to highlight Toni Kroos. By changing the value of the Position column, we can still use that column to dictate the colours and also have this clear distinction for Kroos.

Code
df_stats_relevant = df_stats_relevant.with_columns(
    pl.when(pl.col("Player") == "Toni Kroos")
    .then(pl.lit("Toni Kroos"))
    .otherwise(pl.col("Position"))
    .alias("Position")
)

Step 3: Interpreting UMAP results and plots

We’ve got the dataframe and it is very tempting to just plug that into UMAP. Yet it generally is wiser to scale the data first. We can use the standard preprocessing tools that we know from sklearn.

Code
from sklearn.preprocessing import StandardScaler

df_stats_numeric_scaled = StandardScaler().fit_transform(df_stats_relevant_umap)

Let’s get on with the first plot!

Code
import umap.plot

reducer = umap.umap_.UMAP(random_state=10)
mapper = reducer.fit(df_stats_numeric_scaled, force_all_finite="allow-nan")
umap.plot.points(mapper, labels=df_stats_relevant["Position"], background="black")

Oh man…

What’s that?!

What’s that?!

What even is this?! Four positions… we would expect four ‘clusters’, right? Of course you can argue about the generality of the positions, but we would at least expect some type of separation between the positions. The current one is all over the place. But hope’s not lost yet! Checking the documentation of the arguments we can provide to UMAP shows ‘euclidean’ as default metric. I would rather not go into too much detail, so essentially: the Euclidean metric in this case does not make sense. It would mean that people that have similar ‘mean’ stats across all stats will be packed closer to each other. This is not what we want: there should be more weights on the qualities that make a player unique, and that should be compared to others. A better metric to use would be the Chebyshev metric. Again without going into too much detail: the Chebyshev metric measures the greatest (absolute) difference between any stat of two players. For example, if player A is a better dribbler and player B is a better passer, these individual qualities will be more prominent with the Chebyshev metric. Let’s see with a plot!

Code
import umap.plot

reducer = umap.umap_.UMAP(random_state=10, metric="chebyshev")
mapper = reducer.fit(df_stats_numeric_scaled, force_all_finite="allow-nan")
umap.plot.points(mapper, labels=df_stats_relevant["Position"], background="black")

Ah, much better.

Phew

Phew

But still something we can improve on! Now what about the other parameters? We have already touched upon the metric. The other frequently used parameters are:

  • n_components;
  • n_neighbors;
  • min_dist.

For the components we can show both a 2D and 3D case. Plotting in higher dimensions is a little more cumbersome. That leaves n_neighbors and min_dist. The former is set at 15 by default. A rule of thumb is that the lower this value, the more the local structure will be preserved. Conversely the higher this number, the more UMAP sees the overall structure of the data. After some tweaking, it was clear from the results that a value of around 30 is quite good for both the local and global structure.

Afs for the min_dist, the rule of thumb is: the lower the value, the clumpier the embeddings. As is stated by the documentation of UMAP, this can be useful for clustering. A default value of 0.1 is fine for our use case, which leads us to the following instances for 2D and 3D UMAP embeddings.

Code
reducer_2d = umap.umap_.UMAP(
    n_components=2, metric="chebyshev", n_neighbors=30, random_state=10
)
reducer_3d = umap.umap_.UMAP(
    n_components=3, metric="chebyshev", n_neighbors=30, random_state=10
)

embeddings_2d = reducer_2d.fit_transform(
    df_stats_numeric_scaled, force_all_finite="allow-nan"
)
embeddings_3d = reducer_3d.fit_transform(
    df_stats_numeric_scaled, force_all_finite="allow-nan"
)

This leads us to two interactive figures underneath!

Code
import plotly.express as px

df_plot_2d = df_stats_relevant.hstack(pl.DataFrame(embeddings_2d, schema=["x", "y"]))
df_plot_3d = df_stats_relevant.hstack(
    pl.DataFrame(embeddings_3d, schema=["x", "y", "z"])
)

fig_2d = px.scatter(
    df_plot_2d,
    title="2D",
    x="x",
    y="y",
    color=df_stats_relevant["Position"],
    hover_data=["Player", "Squad", "Competition"],
    labels={"color": ""},
)
fig_2d.update_layout(
    plot_bgcolor="white",
    margin=dict(l=20, r=20, t=50, b=20),
    xaxis=dict(showticklabels=False, title=None, linecolor="black"),
    yaxis=dict(showticklabels=False, title=None, linecolor="black"),
)
fig_2d.update_traces(
    hovertemplate="<br>".join(
        [
            "Player: %{customdata[0]}",
            "Squad: %{customdata[1]}",
            "Competition: %{customdata[2]}",
        ]
    )
)

fig_3d = px.scatter_3d(
    df_plot_3d,
    title="3D",
    x="x",
    y="y",
    z="z",
    color=df_stats_relevant["Position"],
    hover_data=["Player", "Squad", "Competition"],
    labels={"color": ""},
)
fig_3d.update_traces(
    marker_size=3,
    hovertemplate="<br>".join(
        [
            "Player: %{customdata[0]}",
            "Squad: %{customdata[1]}",
            "Competition: %{customdata[2]}",
        ]
    ),
)
fig_3d.update_layout(
    scene=dict(
        xaxis_title="",
        yaxis_title="",
        zaxis_title="",
        xaxis_showspikes=False,
        yaxis_showspikes=False,
        xaxis=dict(
            backgroundcolor="rgb(200, 200, 230)",
            showticklabels=False,
            linecolor="black",
        ),
        yaxis=dict(
            backgroundcolor="rgb(230, 200,230)",
            showticklabels=False,
            linecolor="black",
        ),
        zaxis=dict(
            backgroundcolor="rgb(230, 230,200)",
            showticklabels=False,
            linecolor="black",
        ),
    ),
    margin=dict(l=20, r=20, t=50, b=20),
)

fig_2d.show()
fig_3d.show()

Aleix Garcia is statistically the most similar player to Toni Kroos, which can be seen in both the 2D and 3D embedding. He is a central midfielder who played for Girona in the 2023/24 season. If you followed LaLiga last leason you should know that Girona was a serious title contender for the majority of the run. Only at the final third part did they not keep up with Real Madrid anymore. There have been many praises for this Girona team, with Aleix Garcia as one of the most prominent players. It does not surprise me the least that he is most similar to Kroos, and for his tremendous season he has earned a transfer to Bayer Leverkusen, the 2023/24 Bundesliga champions.

In the 3D embedding, it is quite extraordinary to see a defenders close to him as well. The one that is closes is Trent Alexander-Arnold, who plays for Liverpool in the Premier League. He is known to be an amazing creator, but usually plays right-back. There are some rumours about him joining Real Madrid as a free agent in 2025, so who knows… 👀

TL;DR

Aleix Garcia. Trent Alexander-Arnold was the defender that was the closest. Keep in mind that the source does not have all statistics. Also we took the entirety of stats. For a more scoped look in the key things that are missing for Real Madrid, we could opt to include only those stats, such as passing and possession.

That’s all for today, thanks for reading!

Back to top

Footnotes

  1. The only reason I put this up is because I spent too much time messing around with Polars, and I just did not get it to work…↩︎

  2. Joining on the player name is trivial, but the squad is less obvious: the data could contain multiple records of the same player. This is due to either late transfers or a transfer in the winter window.↩︎

  3. You can do so by running a simple for loop and inspecting all the plotted results.↩︎

  4. https://umap-learn.readthedocs.io/en/latest/parameters.html#n-neighbors↩︎

  5. According to the Chebyshev metric and our choice for min_dist and n_neighbors.↩︎