If you have watched any of Real Madrid’s first few games in this season (2024/2025) and also previous seasons, you must have noticed a hasty and uncontrolled playing style in recent games. Very clearly this is due to the absence of Toni Kroos in midfield, and you don’t have to be a tactico to observe that. Even though he is truly one of a kind, I would like to use football data to find the most similar player in terms of statistics. A recent trip to PyData Amsterdam 2024 and the talk of Jeroen Janssens about ‘How I hacked UMAP and won at a plotting contest’ has inspired me to try and play around with Polars for a small side project, and this seems like the perfect one! Also the use of UMAP suits this project particularly well.
Finding a Toni Kroos replacement
In an earlier post with football data we used a type of football data which is great for dimensionality reduction, which is what UMAP entails as a technique. The only problem with that data is that it was for the 2022-2023 season, and I want to use the most recent completed season. Fortunately the Kaggle dataset also showed the source and it was quite easy to find the source data for the 2023-2024 season.
Step 1: Obtaining the data
After clicking around on FBREF, I found the player stats of the 2023-2024 season, so let’s get digging!
The data is not yet in an easy format to work with. We could opt for pd.read_html() method, but that’s no fun! Besides, I think doing manual HTML inspecting and then extracting is a good skill to maintain (from time to time). As it is simply a matter of extracting some HTML content, beautifulsoup should be sufficient!
Terminal
pip install bs4 polars
Let’s first inspect the stats. There are several categories with each their own URL:
standard stats
goalkeeping
advanced goalkeeping
shooting
passing
pass types
goal and shot creation
possession
playing time
miscellaneous stats
Even though Toni Kroos is not a goalkeeper, I think it’s good to keep the goalkeeping stats for the final UMAP plot. We will, however, disregard the ‘standard stats’ and ‘miscellaneous stats’. Fortunately the URLs are similar and only differ by a simple substring.
Code
import requestsfrom bs4 import BeautifulSoupimport polars as pldef extract_table_and_columns(stat: str) -> (pl.DataFrame, dict):"""Extract table from url. Args: stat (str): The statistic of the table Returns: pl.DataFrame: A Polars dataframe with the information of the players. dict: A dictionary of columns and their descriptions. """ base_url ="https://fbref.com/en/comps/Big5/2023-2024/{stat}/players/2023-2024-Big-5-European-Leagues-Stats" url = base_url.format(stat=stat) resp = requests.get(url) soup = BeautifulSoup(resp.content, features="lxml")# The HTML of the table stats_table = soup.find_all("table")[-1]# Parsing headers tr_header = stats_table.findChildren(name="tr")[1] column_descriptions = { cell.get("aria-label"): cell.get("data-tip")for cell in tr_header.find_all(["th", "td"]) } columns = [key for key in column_descriptions.keys()]iflen(columns) !=len(set(columns)):raiseAssertionError("Column names must be unique.")# Parsing stats table_body = stats_table.find_next(name="tbody") rows_players = table_body.findChildren(name="tr", attrs={"class": False}) table_data: list= []for row in rows_players: cells = row.find_all( ["th", "td"] ) # Collect all cell data, both headers and regular cells table_data.append([cell.get_text(strip=True) for cell in cells])return pl.DataFrame(table_data, schema=columns, orient="row"), column_descriptions
Damn, looks kinda ugly with the automatically inferred strings for every column. We are actually going to try something quick and nasty for this, like saving these as .csv files and then reading them again for a better inference.1
Code
import osdef parse_columns(df: pl.DataFrame) -> pl.DataFrame:"""Parse columns of a dataframe for better inference. Args: df (pl.DataFrame): A Polars dataframe which has structurally incorrect columns. Returns: pl.DataFrame: A Polars dataframe with better inferred column types. """ df.write_csv("temp.csv", quote_style="necessary") parsed_df = pl.read_csv("temp.csv", use_pyarrow=True)if os.path.exists("temp.csv"): os.remove("temp.csv")return parsed_dfdef extract_and_parse_table(stat) -> pl.DataFrame: df, descriptions = extract_table_and_columns(stat)return parse_columns(df)extract_and_parse_table("passing")
shape: (2_852, 33)
Rk
Player
Nation
Position
Squad
Competition
Age
Year of birth
90s Played
Passes Completed
Passes Attempted
Pass Completion %
Total Passing Distance
Progressive Passing Distance
Passes Completed (Short)
Passes Attempted (Short)
Pass Completion % (Short)
Passes Completed (Medium)
Passes Attempted (Medium)
Pass Completion % (Medium)
Passes Completed (Long)
Passes Attempted (Long)
Pass Completion % (Long)
Assists
xAG: Exp. Assisted Goals
xA: Expected Assists
Assists - xAG
Key Passes
Passes into Final Third
Passes into Penalty Area
Crosses into Penalty Area
Progressive Passes
Matches
i64
str
str
str
str
str
i64
i64
f64
i64
i64
f64
i64
i64
i64
i64
f64
i64
i64
f64
i64
i64
f64
i64
f64
f64
f64
i64
i64
i64
i64
i64
str
1
"Max Aarons"
"engENG"
"DF"
"Bournemouth"
"engPremier League"
23
2000
13.7
450
581
77.5
7402
2789
220
248
88.7
188
235
80.0
34
63
54.0
1
0.8
0.9
0.2
7
25
13
2
43
"Matches"
2
"Brenden Aaronson"
"usUSA"
"MF,FW"
"Union Berlin"
"deBundesliga"
22
2000
14.1
365
472
77.3
4890
1506
206
240
85.8
105
130
80.8
19
32
59.4
2
1.9
2.0
0.1
22
30
14
3
56
"Matches"
3
"Paxten Aaronson"
"usUSA"
"MF"
"Eint Frankfurt"
"deBundesliga"
19
2003
1.1
41
50
82.0
576
71
20
25
80.0
20
22
90.9
0
2
0.0
1
0.1
0.1
0.9
1
4
2
0
5
"Matches"
4
"Keyliane Abdallah"
"frFRA"
"FW"
"Marseille"
"frLigue 1"
17
2006
0.0
1
1
100.0
8
0
1
1
100.0
0
0
null
0
0
null
0
0.0
0.0
0.0
0
0
0
0
0
"Matches"
5
"Yunis Abdelhamid"
"maMAR"
"DF"
"Reims"
"frLigue 1"
35
1987
30.9
1552
1836
84.5
29618
9672
487
548
88.9
893
976
91.5
141
252
56.0
0
0.3
0.6
-0.3
8
129
3
0
137
"Matches"
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
2848
"Lovro Zvonarek"
"hrCRO"
"FW,MF"
"Bayern Munich"
"deBundesliga"
18
2005
1.8
50
60
83.3
766
123
20
23
87.0
25
29
86.2
2
2
100.0
0
0.2
0.3
-0.2
2
5
1
0
7
"Matches"
2849
"Martin Ødegaard"
"noNOR"
"MF"
"Arsenal"
"engPremier League"
24
1998
34.3
1692
2006
84.3
25780
7898
907
992
91.4
635
744
85.3
91
140
65.0
10
9.6
11.2
0.4
102
160
130
13
344
"Matches"
2850
"Milan Đurić"
"baBIH"
"FW"
"Hellas Verona"
"itSerie A"
33
1990
13.4
181
334
54.2
2116
392
130
213
61.0
32
69
46.4
5
12
41.7
1
1.2
0.5
-0.2
16
18
3
0
15
"Matches"
2851
"Milan Đurić"
"baBIH"
"FW"
"Monza"
"itSerie A"
33
1990
14.0
140
272
51.5
1700
251
93
156
59.6
35
86
40.7
0
3
0.0
1
0.7
0.2
0.3
5
3
2
0
8
"Matches"
2852
"Mateusz Łęgowski"
"plPOL"
"MF"
"Salernitana"
"itSerie A"
20
2003
12.7
292
378
77.2
4268
1261
169
196
86.2
95
120
79.2
16
31
51.6
0
0.2
0.4
-0.2
7
30
5
1
37
"Matches"
This is much better! Now let’s refactor a little and put this together into a class:
Code
class StatsExtractor:def__init__(self, stat: str):self.stat: str= statself.base_url: str= ("https://fbref.com/en/comps/Big5/2023-2024/{stat}/players/2023-2024-Big-5-European-Leagues-Stats" )self.url: str=self.base_url.format(stat=self.stat)self.df =Noneself.column_descriptions =Noneself.fetch_data()self.extract_column_descriptions()self.extract_stats()self.parse_columns()def fetch_data(self) ->None:"""Fetch the HTML content from the URL.""" url =self.base_url.format(stat=self.stat) resp = requests.get(url) soup = BeautifulSoup(resp.content, features="lxml")self.html = soup.find_all("table")[-1]def extract_column_descriptions(self) ->None:"""Extract column descriptions""" tr_header =self.html.findChildren(name="tr")[1] column_descriptions = { cell.get("aria-label"): cell.get("data-tip")for cell in tr_header.find_all(["th", "td"]) } columns = [key for key in column_descriptions.keys()]iflen(columns) !=len(set(columns)):raiseAssertionError("Column names must be unique.")self.columns = columnsself.column_descriptions = column_descriptionsdef extract_stats(self) ->None:"""Extract table from URL and store it in a DataFrame.""" table_body =self.html.find_next(name="tbody") rows_players = table_body.findChildren(name="tr", attrs={"class": False}) table_data: list= []for row in rows_players: cells = row.find_all(["th", "td"]) table_data.append([cell.get_text(strip=True) for cell in cells])self.df = pl.DataFrame(table_data, schema=self.columns, orient="row")def parse_columns(self) ->None:"""Parse columns of a DataFrame for better inference."""self.df.write_csv("temp.csv", quote_style="necessary") parsed_df = pl.read_csv("temp.csv", use_pyarrow=True)# if os.path.exists("temp.csv"):# os.remove("temp.csv")self.df = parsed_df
We can now invoke the methods after creating a StatsExtractor instance.
Code
stats_to_extract = ["passing","keepers","shooting","passing_types","gca","defense","possession",]extractors = {stat: StatsExtractor(stat) for stat in stats_to_extract}dfs = {stat: extractor.df for stat, extractor in extractors.items()}
With dfs we only need to create one combined dataframe, and actually get to learn more about the syntax of polars.
Step 2: Exploring polars by data wrangling
After viewing the documentation and some sample code, polars looks very similar to dplyr in R - and I have to say I am quite pleased. My career started with loads of R and in my own time and projects I have transitioned more towards Python. However, the one and only thing I really miss about R is how straightforward and concise the syntax is compared to pandas. Take the following two dataframes
I am so used to not having duplicate columns when performing joins. However in Python using either pandas or polars, I have not found any other way than to drop the duplicate columns. Let’s join these dataframes on the columns Player and Squad2, and drop all the other duplicate columns. In polars we can do so using the selector cs.ends_with().
Code
import polars.selectors as csdf_passing.join(df_possession, how="left", on=["Player", "Squad"]).drop(cs.ends_with("_right"))
shape: (2_852, 55)
Rk
Player
Nation
Position
Squad
Competition
Age
Year of birth
90s Played
Passes Completed
Passes Attempted
Pass Completion %
Total Passing Distance
Progressive Passing Distance
Passes Completed (Short)
Passes Attempted (Short)
Pass Completion % (Short)
Passes Completed (Medium)
Passes Attempted (Medium)
Pass Completion % (Medium)
Passes Completed (Long)
Passes Attempted (Long)
Pass Completion % (Long)
Assists
xAG: Exp. Assisted Goals
xA: Expected Assists
Assists - xAG
Key Passes
Passes into Final Third
Passes into Penalty Area
Crosses into Penalty Area
Progressive Passes
Matches
Touches
Touches (Def Pen)
Touches (Def 3rd)
Touches (Mid 3rd)
Touches (Att 3rd)
Touches (Att Pen)
Touches (Live-Ball)
Take-Ons Attempted
Successful Take-Ons
Successful Take-On %
Times Tackled During Take-On
Tackled During Take-On Percentage
Carries
Total Carrying Distance
Progressive Carrying Distance
Progressive Carries
Carries into Final Third
Carries into Penalty Area
Miscontrols
Dispossessed
Passes Received
Progressive Passes Rec
i64
str
str
str
str
str
i64
i64
f64
i64
i64
f64
i64
i64
i64
i64
f64
i64
i64
f64
i64
i64
f64
i64
f64
f64
f64
i64
i64
i64
i64
i64
str
i64
i64
i64
i64
i64
i64
i64
i64
i64
f64
i64
f64
i64
i64
i64
i64
i64
i64
i64
i64
i64
i64
1
"Max Aarons"
"engENG"
"DF"
"Bournemouth"
"engPremier League"
23
2000
13.7
450
581
77.5
7402
2789
220
248
88.7
188
235
80.0
34
63
54.0
1
0.8
0.9
0.2
7
25
13
2
43
"Matches"
711
43
252
303
165
11
711
34
14
41.2
12
35.3
364
2174
1121
22
12
7
13
8
371
26
2
"Brenden Aaronson"
"usUSA"
"MF,FW"
"Union Berlin"
"deBundesliga"
22
2000
14.1
365
472
77.3
4890
1506
206
240
85.8
105
130
80.8
19
32
59.4
2
1.9
2.0
0.1
22
30
14
3
56
"Matches"
675
11
108
301
293
47
675
77
34
44.2
41
53.2
406
2721
1387
37
29
9
41
38
457
91
3
"Paxten Aaronson"
"usUSA"
"MF"
"Eint Frankfurt"
"deBundesliga"
19
2003
1.1
41
50
82.0
576
71
20
25
80.0
20
22
90.9
0
2
0.0
1
0.1
0.1
0.9
1
4
2
0
5
"Matches"
72
0
6
39
28
5
72
7
2
28.6
4
57.1
43
193
50
2
0
1
5
4
46
7
4
"Keyliane Abdallah"
"frFRA"
"FW"
"Marseille"
"frLigue 1"
17
2006
0.0
1
1
100.0
8
0
1
1
100.0
0
0
null
0
0
null
0
0.0
0.0
0.0
0
0
0
0
0
"Matches"
1
0
0
1
0
0
1
0
0
null
0
null
1
3
0
0
0
0
0
0
1
0
5
"Yunis Abdelhamid"
"maMAR"
"DF"
"Reims"
"frLigue 1"
35
1987
30.9
1552
1836
84.5
29618
9672
487
548
88.9
893
976
91.5
141
252
56.0
0
0.3
0.6
-0.3
8
129
3
0
137
"Matches"
2185
293
976
1119
114
35
2184
15
8
53.3
7
46.7
1506
8663
4921
36
19
0
23
4
1403
9
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
2848
"Lovro Zvonarek"
"hrCRO"
"FW,MF"
"Bayern Munich"
"deBundesliga"
18
2005
1.8
50
60
83.3
766
123
20
23
87.0
25
29
86.2
2
2
100.0
0
0.2
0.3
-0.2
2
5
1
0
7
"Matches"
73
2
9
38
26
8
73
3
1
33.3
2
66.7
43
207
86
7
1
4
3
1
55
9
2849
"Martin Ødegaard"
"noNOR"
"MF"
"Arsenal"
"engPremier League"
24
1998
34.3
1692
2006
84.3
25780
7898
907
992
91.4
635
744
85.3
91
140
65.0
10
9.6
11.2
0.4
102
160
130
13
344
"Matches"
2326
24
244
926
1173
165
2324
81
37
45.7
38
46.9
1563
8190
3514
91
83
37
53
58
1827
202
2850
"Milan Đurić"
"baBIH"
"FW"
"Hellas Verona"
"itSerie A"
33
1990
13.4
181
334
54.2
2116
392
130
213
61.0
32
69
46.4
5
12
41.7
1
1.2
0.5
-0.2
16
18
3
0
15
"Matches"
421
14
41
227
153
46
418
3
2
66.7
1
33.3
183
529
151
3
6
1
22
8
352
33
2851
"Milan Đurić"
"baBIH"
"FW"
"Monza"
"itSerie A"
33
1990
14.0
140
272
51.5
1700
251
93
156
59.6
35
86
40.7
0
3
0.0
1
0.7
0.2
0.3
5
3
2
0
8
"Matches"
381
32
51
227
104
37
381
4
2
50.0
2
50.0
157
458
108
2
2
1
24
14
323
31
2852
"Mateusz Łęgowski"
"plPOL"
"MF"
"Salernitana"
"itSerie A"
20
2003
12.7
292
378
77.2
4268
1261
169
196
86.2
95
120
79.2
16
31
51.6
0
0.2
0.4
-0.2
7
30
5
1
37
"Matches"
504
31
120
283
111
11
504
21
6
28.6
12
57.1
265
1526
757
16
13
3
20
12
294
17
Exactly what I wanted! Now we can iterate this process until there are no dataframes left to merge on.
Now it’s time to filter and select relevant features and change them slightly. Before doing alterations, I think it is fair to only include players who have played more than 90 minutes 15 times. Perhaps fair is not the right word, but including players who only played a handful of matches is misleading when evaluating their overall season performance. It does not capture consistency which is what we do want to include.
Code
df_stats.filter(pl.col("90s Played") >=15)
shape: (1_215, 133)
Rk
Player
Nation
Position
Squad
Competition
Age
Year of birth
90s Played
Passes Completed
Passes Attempted
Pass Completion %
Total Passing Distance
Progressive Passing Distance
Passes Completed (Short)
Passes Attempted (Short)
Pass Completion % (Short)
Passes Completed (Medium)
Passes Attempted (Medium)
Pass Completion % (Medium)
Passes Completed (Long)
Passes Attempted (Long)
Pass Completion % (Long)
Assists
xAG: Exp. Assisted Goals
xA: Expected Assists
Assists - xAG
Key Passes
Passes into Final Third
Passes into Penalty Area
Crosses into Penalty Area
Progressive Passes
Matches
Matches Played
Starts
Minutes
Goals Against
…
Tackles
Tackles Won
Tackles (Def 3rd)
Tackles (Mid 3rd)
Tackles (Att 3rd)
Dribblers Tackled
Dribbles Challenged
% of Dribblers Tackled
Challenges Lost
Blocks
Shots Blocked
Interceptions
Tkl+Int
Clearances
Errors
Touches
Touches (Def Pen)
Touches (Def 3rd)
Touches (Mid 3rd)
Touches (Att 3rd)
Touches (Att Pen)
Touches (Live-Ball)
Take-Ons Attempted
Successful Take-Ons
Successful Take-On %
Times Tackled During Take-On
Tackled During Take-On Percentage
Carries
Total Carrying Distance
Progressive Carrying Distance
Progressive Carries
Carries into Final Third
Carries into Penalty Area
Miscontrols
Dispossessed
Passes Received
Progressive Passes Rec
i64
str
str
str
str
str
i64
i64
f64
i64
i64
f64
i64
i64
i64
i64
f64
i64
i64
f64
i64
i64
f64
i64
f64
f64
f64
i64
i64
i64
i64
i64
str
i64
i64
str
i64
…
i64
i64
i64
i64
i64
i64
i64
f64
i64
i64
i64
i64
i64
i64
i64
i64
i64
i64
i64
i64
i64
i64
i64
i64
f64
i64
f64
i64
i64
i64
i64
i64
i64
i64
i64
i64
i64
5
"Yunis Abdelhamid"
"maMAR"
"DF"
"Reims"
"frLigue 1"
35
1987
30.9
1552
1836
84.5
29618
9672
487
548
88.9
893
976
91.5
141
252
56.0
0
0.3
0.6
-0.3
8
129
3
0
137
"Matches"
null
null
null
null
…
64
35
36
23
5
26
45
57.8
19
51
32
39
103
109
2
2185
293
976
1119
114
35
2184
15
8
53.3
7
46.7
1506
8663
4921
36
19
0
23
4
1403
9
6
"Salis Abdul Samed"
"ghGHA"
"MF"
"Lens"
"frLigue 1"
23
2000
16.9
796
895
88.9
12470
3008
393
433
90.8
330
360
91.7
41
54
75.9
0
0.5
1.1
-0.5
6
87
5
2
78
"Matches"
null
null
null
null
…
21
14
8
10
3
8
18
44.4
10
12
1
12
33
18
0
1022
21
168
647
218
7
1022
19
7
36.8
11
57.9
823
4361
1683
9
33
0
23
11
780
20
8
"Laurent Abergel"
"frFRA"
"MF"
"Lorient"
"frLigue 1"
30
1993
31.8
1551
1836
84.5
27382
8613
629
707
89.0
711
802
88.7
150
233
64.4
1
2.2
1.9
-1.2
30
188
23
3
194
"Matches"
null
null
null
null
…
85
52
43
34
8
38
96
39.6
58
29
6
61
146
61
3
2164
105
553
1288
349
10
2164
65
44
67.7
18
27.7
1649
8707
3892
38
49
3
32
30
1502
51
10
"Abner"
"brBRA"
"DF"
"Betis"
"esLa Liga"
23
2000
15.6
615
797
77.2
9211
4552
351
394
89.1
203
265
76.6
37
79
46.8
1
1.0
1.0
0.0
8
20
7
2
33
"Matches"
null
null
null
null
…
25
19
15
9
1
17
34
50.0
17
23
5
15
40
62
0
966
72
344
405
226
28
966
13
7
53.8
5
38.5
447
1802
949
14
10
5
18
11
502
58
12
"Abdel Abqar"
"maMAR"
"DF"
"Alavés"
"esLa Liga"
24
1999
25.7
553
762
72.6
12370
4872
141
164
86.0
302
339
89.1
99
225
44.0
0
0.2
0.2
-0.2
3
61
2
0
49
"Matches"
null
null
null
null
…
35
19
28
5
2
21
35
60.0
14
31
26
23
58
115
0
1032
191
570
422
42
21
1032
5
3
60.0
1
20.0
492
2460
1418
7
6
0
15
1
499
1
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
2839
"Joshua Zirkzee"
"nlNED"
"FW"
"Bologna"
"itSerie A"
22
2001
30.7
710
917
77.4
9106
1752
446
539
82.7
166
213
77.9
32
46
69.6
4
5.2
2.5
-1.2
43
40
28
0
86
"Matches"
null
null
null
null
…
32
20
8
15
9
13
24
54.2
11
25
2
3
35
36
0
1374
46
135
628
630
136
1372
128
52
40.6
72
56.3
814
4132
1772
53
46
22
114
64
1033
182
2844
"Kurt Zouma"
"frFRA"
"DF"
"West Ham"
"engPremier League"
28
1994
31.5
962
1148
83.8
18843
6791
335
365
91.8
509
558
91.2
111
198
56.1
0
0.2
0.1
-0.2
4
57
1
0
53
"Matches"
null
null
null
null
…
23
17
14
9
0
10
16
62.5
6
38
28
29
52
160
0
1455
275
798
626
34
28
1455
3
2
66.7
1
33.3
794
3751
2019
7
5
1
13
3
873
1
2845
"Igor Zubeldia"
"esESP"
"DF"
"Real Sociedad"
"esLa Liga"
26
1997
28.0
1476
1746
84.5
30351
9423
387
450
86.0
913
989
92.3
167
267
62.5
2
1.1
0.8
0.9
10
99
4
0
129
"Matches"
null
null
null
null
…
32
25
20
10
2
19
26
73.1
7
28
19
32
64
97
3
1983
162
908
1034
50
18
1983
8
4
50.0
4
50.0
1229
7537
4725
28
2
0
14
3
1316
0
2846
"Martín Zubimendi"
"esESP"
"MF"
"Real Sociedad"
"esLa Liga"
24
1999
29.5
1321
1542
85.7
22165
6269
607
677
89.7
607
677
89.7
86
126
68.3
1
1.5
2.0
-0.5
16
136
15
1
152
"Matches"
null
null
null
null
…
49
27
18
29
2
23
42
54.8
19
33
9
37
86
54
0
1793
91
397
1144
267
32
1793
18
12
66.7
5
27.8
942
5305
2671
34
41
5
26
16
1185
27
2849
"Martin Ødegaard"
"noNOR"
"MF"
"Arsenal"
"engPremier League"
24
1998
34.3
1692
2006
84.3
25780
7898
907
992
91.4
635
744
85.3
91
140
65.0
10
9.6
11.2
0.4
102
160
130
13
344
"Matches"
null
null
null
null
…
49
17
18
21
10
17
58
29.3
41
23
1
15
64
5
0
2326
24
244
926
1173
165
2324
81
37
45.7
38
46.9
1563
8190
3514
91
83
37
53
58
1827
202
Although players can play multiple positions, for the plots I prefer we keep it on a down low. The Position column now has these unique values:
For the players that play multiple positions, we take the position specified first assuming it is their primary position. This will bring down the number of different positions to four!
UMAP requires numerical data, so let’s only select numerical columns, which in this case are all non-string columns.
Code
df_stats_relevant.select(~cs.by_dtype(pl.String))
shape: (1_215, 126)
Rk
Age
Year of birth
90s Played
Passes Completed
Passes Attempted
Pass Completion %
Total Passing Distance
Progressive Passing Distance
Passes Completed (Short)
Passes Attempted (Short)
Pass Completion % (Short)
Passes Completed (Medium)
Passes Attempted (Medium)
Pass Completion % (Medium)
Passes Completed (Long)
Passes Attempted (Long)
Pass Completion % (Long)
Assists
xAG: Exp. Assisted Goals
xA: Expected Assists
Assists - xAG
Key Passes
Passes into Final Third
Passes into Penalty Area
Crosses into Penalty Area
Progressive Passes
Matches Played
Starts
Goals Against
Goals Against/90
Shots on Target Against
Saves
Save Percentage
Wins
Draws
Losses
…
Tackles
Tackles Won
Tackles (Def 3rd)
Tackles (Mid 3rd)
Tackles (Att 3rd)
Dribblers Tackled
Dribbles Challenged
% of Dribblers Tackled
Challenges Lost
Blocks
Shots Blocked
Interceptions
Tkl+Int
Clearances
Errors
Touches
Touches (Def Pen)
Touches (Def 3rd)
Touches (Mid 3rd)
Touches (Att 3rd)
Touches (Att Pen)
Touches (Live-Ball)
Take-Ons Attempted
Successful Take-Ons
Successful Take-On %
Times Tackled During Take-On
Tackled During Take-On Percentage
Carries
Total Carrying Distance
Progressive Carrying Distance
Progressive Carries
Carries into Final Third
Carries into Penalty Area
Miscontrols
Dispossessed
Passes Received
Progressive Passes Rec
i64
i64
i64
f64
i64
i64
f64
i64
i64
i64
i64
f64
i64
i64
f64
i64
i64
f64
i64
f64
f64
f64
i64
i64
i64
i64
i64
i64
i64
i64
f64
i64
i64
f64
i64
i64
i64
…
i64
i64
i64
i64
i64
i64
i64
f64
i64
i64
i64
i64
i64
i64
i64
i64
i64
i64
i64
i64
i64
i64
i64
i64
f64
i64
f64
i64
i64
i64
i64
i64
i64
i64
i64
i64
i64
5
35
1987
30.9
1552
1836
84.5
29618
9672
487
548
88.9
893
976
91.5
141
252
56.0
0
0.3
0.6
-0.3
8
129
3
0
137
null
null
null
null
null
null
null
null
null
null
…
64
35
36
23
5
26
45
57.8
19
51
32
39
103
109
2
2185
293
976
1119
114
35
2184
15
8
53.3
7
46.7
1506
8663
4921
36
19
0
23
4
1403
9
6
23
2000
16.9
796
895
88.9
12470
3008
393
433
90.8
330
360
91.7
41
54
75.9
0
0.5
1.1
-0.5
6
87
5
2
78
null
null
null
null
null
null
null
null
null
null
…
21
14
8
10
3
8
18
44.4
10
12
1
12
33
18
0
1022
21
168
647
218
7
1022
19
7
36.8
11
57.9
823
4361
1683
9
33
0
23
11
780
20
8
30
1993
31.8
1551
1836
84.5
27382
8613
629
707
89.0
711
802
88.7
150
233
64.4
1
2.2
1.9
-1.2
30
188
23
3
194
null
null
null
null
null
null
null
null
null
null
…
85
52
43
34
8
38
96
39.6
58
29
6
61
146
61
3
2164
105
553
1288
349
10
2164
65
44
67.7
18
27.7
1649
8707
3892
38
49
3
32
30
1502
51
10
23
2000
15.6
615
797
77.2
9211
4552
351
394
89.1
203
265
76.6
37
79
46.8
1
1.0
1.0
0.0
8
20
7
2
33
null
null
null
null
null
null
null
null
null
null
…
25
19
15
9
1
17
34
50.0
17
23
5
15
40
62
0
966
72
344
405
226
28
966
13
7
53.8
5
38.5
447
1802
949
14
10
5
18
11
502
58
12
24
1999
25.7
553
762
72.6
12370
4872
141
164
86.0
302
339
89.1
99
225
44.0
0
0.2
0.2
-0.2
3
61
2
0
49
null
null
null
null
null
null
null
null
null
null
…
35
19
28
5
2
21
35
60.0
14
31
26
23
58
115
0
1032
191
570
422
42
21
1032
5
3
60.0
1
20.0
492
2460
1418
7
6
0
15
1
499
1
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
2839
22
2001
30.7
710
917
77.4
9106
1752
446
539
82.7
166
213
77.9
32
46
69.6
4
5.2
2.5
-1.2
43
40
28
0
86
null
null
null
null
null
null
null
null
null
null
…
32
20
8
15
9
13
24
54.2
11
25
2
3
35
36
0
1374
46
135
628
630
136
1372
128
52
40.6
72
56.3
814
4132
1772
53
46
22
114
64
1033
182
2844
28
1994
31.5
962
1148
83.8
18843
6791
335
365
91.8
509
558
91.2
111
198
56.1
0
0.2
0.1
-0.2
4
57
1
0
53
null
null
null
null
null
null
null
null
null
null
…
23
17
14
9
0
10
16
62.5
6
38
28
29
52
160
0
1455
275
798
626
34
28
1455
3
2
66.7
1
33.3
794
3751
2019
7
5
1
13
3
873
1
2845
26
1997
28.0
1476
1746
84.5
30351
9423
387
450
86.0
913
989
92.3
167
267
62.5
2
1.1
0.8
0.9
10
99
4
0
129
null
null
null
null
null
null
null
null
null
null
…
32
25
20
10
2
19
26
73.1
7
28
19
32
64
97
3
1983
162
908
1034
50
18
1983
8
4
50.0
4
50.0
1229
7537
4725
28
2
0
14
3
1316
0
2846
24
1999
29.5
1321
1542
85.7
22165
6269
607
677
89.7
607
677
89.7
86
126
68.3
1
1.5
2.0
-0.5
16
136
15
1
152
null
null
null
null
null
null
null
null
null
null
…
49
27
18
29
2
23
42
54.8
19
33
9
37
86
54
0
1793
91
397
1144
267
32
1793
18
12
66.7
5
27.8
942
5305
2671
34
41
5
26
16
1185
27
2849
24
1998
34.3
1692
2006
84.3
25780
7898
907
992
91.4
635
744
85.3
91
140
65.0
10
9.6
11.2
0.4
102
160
130
13
344
null
null
null
null
null
null
null
null
null
null
…
49
17
18
21
10
17
58
29.3
41
23
1
15
64
5
0
2326
24
244
926
1173
165
2324
81
37
45.7
38
46.9
1563
8190
3514
91
83
37
53
58
1827
202
There are still some irrelevant columns, namely Rank, Age and Year of Birth.
Code
df_stats_relevant.select(~cs.by_dtype(pl.String) &~cs.by_name(["Age", "Rk", "Year of birth"]))
shape: (1_215, 123)
90s Played
Passes Completed
Passes Attempted
Pass Completion %
Total Passing Distance
Progressive Passing Distance
Passes Completed (Short)
Passes Attempted (Short)
Pass Completion % (Short)
Passes Completed (Medium)
Passes Attempted (Medium)
Pass Completion % (Medium)
Passes Completed (Long)
Passes Attempted (Long)
Pass Completion % (Long)
Assists
xAG: Exp. Assisted Goals
xA: Expected Assists
Assists - xAG
Key Passes
Passes into Final Third
Passes into Penalty Area
Crosses into Penalty Area
Progressive Passes
Matches Played
Starts
Goals Against
Goals Against/90
Shots on Target Against
Saves
Save Percentage
Wins
Draws
Losses
Clean Sheets
Clean Sheet Percentage
Penalty Kicks Attempted
…
Tackles
Tackles Won
Tackles (Def 3rd)
Tackles (Mid 3rd)
Tackles (Att 3rd)
Dribblers Tackled
Dribbles Challenged
% of Dribblers Tackled
Challenges Lost
Blocks
Shots Blocked
Interceptions
Tkl+Int
Clearances
Errors
Touches
Touches (Def Pen)
Touches (Def 3rd)
Touches (Mid 3rd)
Touches (Att 3rd)
Touches (Att Pen)
Touches (Live-Ball)
Take-Ons Attempted
Successful Take-Ons
Successful Take-On %
Times Tackled During Take-On
Tackled During Take-On Percentage
Carries
Total Carrying Distance
Progressive Carrying Distance
Progressive Carries
Carries into Final Third
Carries into Penalty Area
Miscontrols
Dispossessed
Passes Received
Progressive Passes Rec
f64
i64
i64
f64
i64
i64
i64
i64
f64
i64
i64
f64
i64
i64
f64
i64
f64
f64
f64
i64
i64
i64
i64
i64
i64
i64
i64
f64
i64
i64
f64
i64
i64
i64
i64
f64
i64
…
i64
i64
i64
i64
i64
i64
i64
f64
i64
i64
i64
i64
i64
i64
i64
i64
i64
i64
i64
i64
i64
i64
i64
i64
f64
i64
f64
i64
i64
i64
i64
i64
i64
i64
i64
i64
i64
30.9
1552
1836
84.5
29618
9672
487
548
88.9
893
976
91.5
141
252
56.0
0
0.3
0.6
-0.3
8
129
3
0
137
null
null
null
null
null
null
null
null
null
null
null
null
null
…
64
35
36
23
5
26
45
57.8
19
51
32
39
103
109
2
2185
293
976
1119
114
35
2184
15
8
53.3
7
46.7
1506
8663
4921
36
19
0
23
4
1403
9
16.9
796
895
88.9
12470
3008
393
433
90.8
330
360
91.7
41
54
75.9
0
0.5
1.1
-0.5
6
87
5
2
78
null
null
null
null
null
null
null
null
null
null
null
null
null
…
21
14
8
10
3
8
18
44.4
10
12
1
12
33
18
0
1022
21
168
647
218
7
1022
19
7
36.8
11
57.9
823
4361
1683
9
33
0
23
11
780
20
31.8
1551
1836
84.5
27382
8613
629
707
89.0
711
802
88.7
150
233
64.4
1
2.2
1.9
-1.2
30
188
23
3
194
null
null
null
null
null
null
null
null
null
null
null
null
null
…
85
52
43
34
8
38
96
39.6
58
29
6
61
146
61
3
2164
105
553
1288
349
10
2164
65
44
67.7
18
27.7
1649
8707
3892
38
49
3
32
30
1502
51
15.6
615
797
77.2
9211
4552
351
394
89.1
203
265
76.6
37
79
46.8
1
1.0
1.0
0.0
8
20
7
2
33
null
null
null
null
null
null
null
null
null
null
null
null
null
…
25
19
15
9
1
17
34
50.0
17
23
5
15
40
62
0
966
72
344
405
226
28
966
13
7
53.8
5
38.5
447
1802
949
14
10
5
18
11
502
58
25.7
553
762
72.6
12370
4872
141
164
86.0
302
339
89.1
99
225
44.0
0
0.2
0.2
-0.2
3
61
2
0
49
null
null
null
null
null
null
null
null
null
null
null
null
null
…
35
19
28
5
2
21
35
60.0
14
31
26
23
58
115
0
1032
191
570
422
42
21
1032
5
3
60.0
1
20.0
492
2460
1418
7
6
0
15
1
499
1
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
30.7
710
917
77.4
9106
1752
446
539
82.7
166
213
77.9
32
46
69.6
4
5.2
2.5
-1.2
43
40
28
0
86
null
null
null
null
null
null
null
null
null
null
null
null
null
…
32
20
8
15
9
13
24
54.2
11
25
2
3
35
36
0
1374
46
135
628
630
136
1372
128
52
40.6
72
56.3
814
4132
1772
53
46
22
114
64
1033
182
31.5
962
1148
83.8
18843
6791
335
365
91.8
509
558
91.2
111
198
56.1
0
0.2
0.1
-0.2
4
57
1
0
53
null
null
null
null
null
null
null
null
null
null
null
null
null
…
23
17
14
9
0
10
16
62.5
6
38
28
29
52
160
0
1455
275
798
626
34
28
1455
3
2
66.7
1
33.3
794
3751
2019
7
5
1
13
3
873
1
28.0
1476
1746
84.5
30351
9423
387
450
86.0
913
989
92.3
167
267
62.5
2
1.1
0.8
0.9
10
99
4
0
129
null
null
null
null
null
null
null
null
null
null
null
null
null
…
32
25
20
10
2
19
26
73.1
7
28
19
32
64
97
3
1983
162
908
1034
50
18
1983
8
4
50.0
4
50.0
1229
7537
4725
28
2
0
14
3
1316
0
29.5
1321
1542
85.7
22165
6269
607
677
89.7
607
677
89.7
86
126
68.3
1
1.5
2.0
-0.5
16
136
15
1
152
null
null
null
null
null
null
null
null
null
null
null
null
null
…
49
27
18
29
2
23
42
54.8
19
33
9
37
86
54
0
1793
91
397
1144
267
32
1793
18
12
66.7
5
27.8
942
5305
2671
34
41
5
26
16
1185
27
34.3
1692
2006
84.3
25780
7898
907
992
91.4
635
744
85.3
91
140
65.0
10
9.6
11.2
0.4
102
160
130
13
344
null
null
null
null
null
null
null
null
null
null
null
null
null
…
49
17
18
21
10
17
58
29.3
41
23
1
15
64
5
0
2326
24
244
926
1173
165
2324
81
37
45.7
38
46.9
1563
8190
3514
91
83
37
53
58
1827
202
For a good comparison, it is fair to condition all stats per 90 minutes played. I am happy to say there is a column for that, so we can divide each column by that column.
Code
df_stats_relevant.select(~cs.by_dtype(pl.String) &~cs.by_name(["Age", "Rk", "Year of birth"])).with_columns(pl.all() / pl.col("90s Played").exclude("90s Played"))
shape: (1_215, 123)
90s Played
Passes Completed
Passes Attempted
Pass Completion %
Total Passing Distance
Progressive Passing Distance
Passes Completed (Short)
Passes Attempted (Short)
Pass Completion % (Short)
Passes Completed (Medium)
Passes Attempted (Medium)
Pass Completion % (Medium)
Passes Completed (Long)
Passes Attempted (Long)
Pass Completion % (Long)
Assists
xAG: Exp. Assisted Goals
xA: Expected Assists
Assists - xAG
Key Passes
Passes into Final Third
Passes into Penalty Area
Crosses into Penalty Area
Progressive Passes
Matches Played
Starts
Goals Against
Goals Against/90
Shots on Target Against
Saves
Save Percentage
Wins
Draws
Losses
Clean Sheets
Clean Sheet Percentage
Penalty Kicks Attempted
…
Tackles
Tackles Won
Tackles (Def 3rd)
Tackles (Mid 3rd)
Tackles (Att 3rd)
Dribblers Tackled
Dribbles Challenged
% of Dribblers Tackled
Challenges Lost
Blocks
Shots Blocked
Interceptions
Tkl+Int
Clearances
Errors
Touches
Touches (Def Pen)
Touches (Def 3rd)
Touches (Mid 3rd)
Touches (Att 3rd)
Touches (Att Pen)
Touches (Live-Ball)
Take-Ons Attempted
Successful Take-Ons
Successful Take-On %
Times Tackled During Take-On
Tackled During Take-On Percentage
Carries
Total Carrying Distance
Progressive Carrying Distance
Progressive Carries
Carries into Final Third
Carries into Penalty Area
Miscontrols
Dispossessed
Passes Received
Progressive Passes Rec
f64
f64
f64
f64
f64
f64
f64
f64
f64
f64
f64
f64
f64
f64
f64
f64
f64
f64
f64
f64
f64
f64
f64
f64
f64
f64
f64
f64
f64
f64
f64
f64
f64
f64
f64
f64
f64
…
f64
f64
f64
f64
f64
f64
f64
f64
f64
f64
f64
f64
f64
f64
f64
f64
f64
f64
f64
f64
f64
f64
f64
f64
f64
f64
f64
f64
f64
f64
f64
f64
f64
f64
f64
f64
f64
30.9
50.226537
59.417476
2.734628
958.511327
313.009709
15.760518
17.734628
2.877023
28.899676
31.585761
2.961165
4.563107
8.15534
1.812298
0.0
0.009709
0.019417
-0.009709
0.2589
4.174757
0.097087
0.0
4.433657
null
null
null
null
null
null
null
null
null
null
null
null
null
…
2.071197
1.132686
1.165049
0.744337
0.161812
0.841424
1.456311
1.87055
0.614887
1.650485
1.035599
1.262136
3.333333
3.527508
0.064725
70.711974
9.482201
31.585761
36.213592
3.68932
1.132686
70.679612
0.485437
0.2589
1.724919
0.226537
1.511327
48.737864
280.355987
159.255663
1.165049
0.614887
0.0
0.744337
0.12945
45.404531
0.291262
16.9
47.100592
52.95858
5.260355
737.869822
177.988166
23.254438
25.621302
5.372781
19.526627
21.301775
5.426036
2.426036
3.195266
4.491124
0.0
0.029586
0.065089
-0.029586
0.35503
5.147929
0.295858
0.118343
4.615385
null
null
null
null
null
null
null
null
null
null
null
null
null
…
1.242604
0.828402
0.473373
0.591716
0.177515
0.473373
1.065089
2.627219
0.591716
0.710059
0.059172
0.710059
1.952663
1.065089
0.0
60.473373
1.242604
9.940828
38.284024
12.899408
0.414201
60.473373
1.12426
0.414201
2.177515
0.650888
3.426036
48.698225
258.047337
99.585799
0.532544
1.952663
0.0
1.360947
0.650888
46.153846
1.183432
31.8
48.773585
57.735849
2.657233
861.069182
270.849057
19.779874
22.232704
2.798742
22.358491
25.220126
2.789308
4.716981
7.327044
2.025157
0.031447
0.069182
0.059748
-0.037736
0.943396
5.91195
0.72327
0.09434
6.100629
null
null
null
null
null
null
null
null
null
null
null
null
null
…
2.672956
1.63522
1.352201
1.069182
0.251572
1.194969
3.018868
1.245283
1.823899
0.91195
0.188679
1.918239
4.591195
1.918239
0.09434
68.050314
3.301887
17.389937
40.503145
10.974843
0.314465
68.050314
2.044025
1.383648
2.128931
0.566038
0.871069
51.855346
273.805031
122.389937
1.194969
1.540881
0.09434
1.006289
0.943396
47.232704
1.603774
15.6
39.423077
51.089744
4.948718
590.448718
291.794872
22.5
25.25641
5.711538
13.012821
16.987179
4.910256
2.371795
5.064103
3.0
0.064103
0.064103
0.064103
0.0
0.512821
1.282051
0.448718
0.128205
2.115385
null
null
null
null
null
null
null
null
null
null
null
null
null
…
1.602564
1.217949
0.961538
0.576923
0.064103
1.089744
2.179487
3.205128
1.089744
1.474359
0.320513
0.961538
2.564103
3.974359
0.0
61.923077
4.615385
22.051282
25.961538
14.487179
1.794872
61.923077
0.833333
0.448718
3.448718
0.320513
2.467949
28.653846
115.512821
60.833333
0.897436
0.641026
0.320513
1.153846
0.705128
32.179487
3.717949
25.7
21.51751
29.649805
2.824903
481.322957
189.571984
5.486381
6.381323
3.346304
11.750973
13.190661
3.466926
3.85214
8.754864
1.712062
0.0
0.007782
0.007782
-0.007782
0.116732
2.373541
0.077821
0.0
1.906615
null
null
null
null
null
null
null
null
null
null
null
null
null
…
1.361868
0.7393
1.089494
0.194553
0.077821
0.817121
1.361868
2.33463
0.544747
1.206226
1.011673
0.894942
2.256809
4.474708
0.0
40.155642
7.431907
22.178988
16.420233
1.634241
0.817121
40.155642
0.194553
0.116732
2.33463
0.038911
0.77821
19.143969
95.719844
55.175097
0.272374
0.233463
0.0
0.583658
0.038911
19.416342
0.038911
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
30.7
23.127036
29.869707
2.521173
296.612378
57.068404
14.527687
17.557003
2.693811
5.407166
6.938111
2.537459
1.042345
1.498371
2.267101
0.130293
0.169381
0.081433
-0.039088
1.400651
1.302932
0.912052
0.0
2.801303
null
null
null
null
null
null
null
null
null
null
null
null
null
…
1.042345
0.651466
0.260586
0.488599
0.29316
0.423453
0.781759
1.765472
0.358306
0.814332
0.065147
0.09772
1.140065
1.172638
0.0
44.7557
1.498371
4.397394
20.456026
20.521173
4.429967
44.690554
4.169381
1.693811
1.322476
2.345277
1.833876
26.514658
134.592834
57.71987
1.726384
1.498371
0.716612
3.713355
2.084691
33.648208
5.928339
31.5
30.539683
36.444444
2.660317
598.190476
215.587302
10.634921
11.587302
2.914286
16.15873
17.714286
2.895238
3.52381
6.285714
1.780952
0.0
0.006349
0.003175
-0.006349
0.126984
1.809524
0.031746
0.0
1.68254
null
null
null
null
null
null
null
null
null
null
null
null
null
…
0.730159
0.539683
0.444444
0.285714
0.0
0.31746
0.507937
1.984127
0.190476
1.206349
0.888889
0.920635
1.650794
5.079365
0.0
46.190476
8.730159
25.333333
19.873016
1.079365
0.888889
46.190476
0.095238
0.063492
2.11746
0.031746
1.057143
25.206349
119.079365
64.095238
0.222222
0.15873
0.031746
0.412698
0.095238
27.714286
0.031746
28.0
52.714286
62.357143
3.017857
1083.964286
336.535714
13.821429
16.071429
3.071429
32.607143
35.321429
3.296429
5.964286
9.535714
2.232143
0.071429
0.039286
0.028571
0.032143
0.357143
3.535714
0.142857
0.0
4.607143
null
null
null
null
null
null
null
null
null
null
null
null
null
…
1.142857
0.892857
0.714286
0.357143
0.071429
0.678571
0.928571
2.610714
0.25
1.0
0.678571
1.142857
2.285714
3.464286
0.107143
70.821429
5.785714
32.428571
36.928571
1.785714
0.642857
70.821429
0.285714
0.142857
1.785714
0.142857
1.785714
43.892857
269.178571
168.75
1.0
0.071429
0.0
0.5
0.107143
47.0
0.0
29.5
44.779661
52.271186
2.905085
751.355932
212.508475
20.576271
22.949153
3.040678
20.576271
22.949153
3.040678
2.915254
4.271186
2.315254
0.033898
0.050847
0.067797
-0.016949
0.542373
4.610169
0.508475
0.033898
5.152542
null
null
null
null
null
null
null
null
null
null
null
null
null
…
1.661017
0.915254
0.610169
0.983051
0.067797
0.779661
1.423729
1.857627
0.644068
1.118644
0.305085
1.254237
2.915254
1.830508
0.0
60.779661
3.084746
13.457627
38.779661
9.050847
1.084746
60.779661
0.610169
0.40678
2.261017
0.169492
0.942373
31.932203
179.830508
90.542373
1.152542
1.389831
0.169492
0.881356
0.542373
40.169492
0.915254
34.3
49.329446
58.483965
2.457726
751.603499
230.262391
26.443149
28.921283
2.664723
18.51312
21.690962
2.48688
2.653061
4.081633
1.895044
0.291545
0.279883
0.326531
0.011662
2.973761
4.664723
3.790087
0.379009
10.029155
null
null
null
null
null
null
null
null
null
null
null
null
null
…
1.428571
0.495627
0.524781
0.612245
0.291545
0.495627
1.690962
0.854227
1.195335
0.670554
0.029155
0.437318
1.865889
0.145773
0.0
67.813411
0.699708
7.113703
26.997085
34.198251
4.810496
67.755102
2.361516
1.078717
1.332362
1.107872
1.367347
45.568513
238.77551
102.44898
2.653061
2.419825
1.078717
1.54519
1.690962
53.265306
5.889213
The stat of number of 90s played can be ignored if we disregard ‘fitness’ as a trait Toni Kroos had. This brings us to the following dataframe:
There is only one thing left I would like to change. For the plots in the next step, it is useful to highlight Toni Kroos. By changing the value of the Position column, we can still use that column to dictate the colours and also have this clear distinction for Kroos.
We’ve got the dataframe and it is very tempting to just plug that into UMAP. Yet it generally is wiser to scale the data first. We can use the standard preprocessing tools that we know from sklearn.
Code
from sklearn.preprocessing import StandardScalerdf_stats_numeric_scaled = StandardScaler().fit_transform(df_stats_relevant_umap)
What even is this?! Four positions… we would expect four ‘clusters’, right? Of course you can argue about the generality of the positions, but we would at least expect some type of separation between the positions. The current one is all over the place. But hope’s not lost yet! Checking the documentation of the arguments we can provide to UMAP shows ‘euclidean’ as default metric. I would rather not go into too much detail, so essentially: the Euclidean metric in this case does not make sense. It would mean that people that have similar ‘mean’ stats across all stats will be packed closer to each other. This is not what we want: there should be more weights on the qualities that make a player unique, and that should be compared to others. A better metric to use would be the Chebyshev metric. Again without going into too much detail: the Chebyshev metric measures the greatest (absolute) difference between any stat of two players. For example, if player A is a better dribbler and player B is a better passer, these individual qualities will be more prominent with the Chebyshev metric. Let’s see with a plot!
But still something we can improve on! Now what about the other parameters? We have already touched upon the metric. The other frequently used parameters are:
n_components;
n_neighbors;
min_dist.
For the components we can show both a 2D and 3D case. Plotting in higher dimensions is a little more cumbersome. That leaves n_neighbors and min_dist. The former is set at 15 by default. A rule of thumb is that the lower this value, the more the local structure will be preserved. Conversely the higher this number, the more UMAP sees the overall structure of the data. After some tweaking3, it was clear from the results that a value of around 30 is quite good for both the local and global structure.
Afs for the min_dist, the rule of thumb is: the lower the value, the clumpier the embeddings. As is stated by the documentation of UMAP4, this can be useful for clustering. A default value of 0.1 is fine for our use case, which leads us to the following instances for 2D and 3D UMAP embeddings.
Aleix Garcia is statistically5 the most similar player to Toni Kroos, which can be seen in both the 2D and 3D embedding. He is a central midfielder who played for Girona in the 2023/24 season. If you followed LaLiga last leason you should know that Girona was a serious title contender for the majority of the run. Only at the final third part did they not keep up with Real Madrid anymore. There have been many praises for this Girona team, with Aleix Garcia as one of the most prominent players. It does not surprise me the least that he is most similar to Kroos, and for his tremendous season he has earned a transfer to Bayer Leverkusen, the 2023/24 Bundesliga champions.
In the 3D embedding, it is quite extraordinary to see a defenders close to him as well. The one that is closes is Trent Alexander-Arnold, who plays for Liverpool in the Premier League. He is known to be an amazing creator, but usually plays right-back. There are some rumours about him joining Real Madrid as a free agent in 2025, so who knows… 👀
TL;DR
Aleix Garcia. Trent Alexander-Arnold was the defender that was the closest. Keep in mind that the source does not have all statistics. Also we took the entirety of stats. For a more scoped look in the key things that are missing for Real Madrid, we could opt to include only those stats, such as passing and possession.
The only reason I put this up is because I spent too much time messing around with Polars, and I just did not get it to work…↩︎
Joining on the player name is trivial, but the squad is less obvious: the data could contain multiple records of the same player. This is due to either late transfers or a transfer in the winter window.↩︎
You can do so by running a simple for loop and inspecting all the plotted results.↩︎
According to the Chebyshev metric and our choice for min_dist and n_neighbors.↩︎
Source Code
---title: "Visually finding a statistical *Toni Kroos* replacement"date: "2024-10-07"categories: [visualisation]description: "Irreplacable"code-fold: show---If you have watched any of Real Madrid's first few games in this season (2024/2025) and also previous seasons, you must have noticed a hasty and uncontrolled playing style in recent games.Very clearly this is due to the absence of Toni Kroos in midfield, and you don't have to be a [*tactico*](https://www.urbandictionary.com/define.php?term=tactico) to observe that.Even though he is truly one of a kind, I would like to use football data to find the most similar player in terms of statistics.A recent trip to [PyData Amsterdam 2024](https://amsterdam.pydata.org/) and the talk of Jeroen Janssens about ['How I hacked UMAP and won at a plotting contest'](https://amsterdam2024.pydata.org/cfp/talk/SMB7J3/) has inspired me to try and play around with [Polars](https://pola.rs/) for a small side project, and this seems like the perfect one!Also the use of [UMAP](https://umap-learn.readthedocs.io/en/latest/) suits this project particularly well.In an earlier [post with football data](../006_football/index.qmd) we used a type of football data which is great for dimensionality reduction, which is what UMAP entails as a technique.The only problem with that data is that it was for the 2022-2023 season, and I want to use the most recent completed season.Fortunately the Kaggle dataset also showed the source and it was quite easy to find the source data for the 2023-2024 season.## Step 1: Obtaining the dataAfter clicking around on [FBREF](https://fbref.com/en/comps/Big5/stats/players/Big-5-European-Leagues-Stats), I found the player stats of the 2023-2024 season, so let's get digging!The data is not yet in an easy format to work with.We could opt for `pd.read_html()` method, but that's no fun!Besides, I think doing manual HTML inspecting and then extracting is a good skill to maintain (from time to time).As it is simply a matter of extracting some HTML content, `beautifulsoup` should be sufficient!``` {.bash filename="Terminal"}pip install bs4 polars```Let's first inspect the stats.There are several categories with each their own URL:- standard stats- goalkeeping- advanced goalkeeping- shooting- passing- pass types- goal and shot creation- possession- playing time- miscellaneous statsEven though Toni Kroos is not a goalkeeper, I think it's good to keep the goalkeeping stats for the final UMAP plot.We will, however, disregard the 'standard stats' and 'miscellaneous stats'.Fortunately the URLs are similar and only differ by a simple substring.```{python}import requestsfrom bs4 import BeautifulSoupimport polars as pldef extract_table_and_columns(stat: str) -> (pl.DataFrame, dict):"""Extract table from url. Args: stat (str): The statistic of the table Returns: pl.DataFrame: A Polars dataframe with the information of the players. dict: A dictionary of columns and their descriptions. """ base_url ="https://fbref.com/en/comps/Big5/2023-2024/{stat}/players/2023-2024-Big-5-European-Leagues-Stats" url = base_url.format(stat=stat) resp = requests.get(url) soup = BeautifulSoup(resp.content, features="lxml")# The HTML of the table stats_table = soup.find_all("table")[-1]# Parsing headers tr_header = stats_table.findChildren(name="tr")[1] column_descriptions = { cell.get("aria-label"): cell.get("data-tip")for cell in tr_header.find_all(["th", "td"]) } columns = [key for key in column_descriptions.keys()]iflen(columns) !=len(set(columns)):raiseAssertionError("Column names must be unique.")# Parsing stats table_body = stats_table.find_next(name="tbody") rows_players = table_body.findChildren(name="tr", attrs={"class": False}) table_data: list= []for row in rows_players: cells = row.find_all( ["th", "td"] ) # Collect all cell data, both headers and regular cells table_data.append([cell.get_text(strip=True) for cell in cells])return pl.DataFrame(table_data, schema=columns, orient="row"), column_descriptions```Let's look at the passing stats!::: column-screen-inset```{python}df, descriptions = extract_table_and_columns("passing")df```:::Damn, looks kinda ugly with the automatically inferred strings for every column.We are actually going to try something quick and nasty for this, like saving these as `.csv` files and then reading them again for a better inference.[^1][^1]: The only reason I put this up is because I spent too much time messing around with Polars, and I just did not get it to work...::: column-screen-inset```{python}import osdef parse_columns(df: pl.DataFrame) -> pl.DataFrame:"""Parse columns of a dataframe for better inference. Args: df (pl.DataFrame): A Polars dataframe which has structurally incorrect columns. Returns: pl.DataFrame: A Polars dataframe with better inferred column types. """ df.write_csv("temp.csv", quote_style="necessary") parsed_df = pl.read_csv("temp.csv", use_pyarrow=True)if os.path.exists("temp.csv"): os.remove("temp.csv")return parsed_dfdef extract_and_parse_table(stat) -> pl.DataFrame: df, descriptions = extract_table_and_columns(stat)return parse_columns(df)extract_and_parse_table("passing")```:::This is much better!Now let's refactor a little and put this together into a class:```{python}class StatsExtractor:def__init__(self, stat: str):self.stat: str= statself.base_url: str= ("https://fbref.com/en/comps/Big5/2023-2024/{stat}/players/2023-2024-Big-5-European-Leagues-Stats" )self.url: str=self.base_url.format(stat=self.stat)self.df =Noneself.column_descriptions =Noneself.fetch_data()self.extract_column_descriptions()self.extract_stats()self.parse_columns()def fetch_data(self) ->None:"""Fetch the HTML content from the URL.""" url =self.base_url.format(stat=self.stat) resp = requests.get(url) soup = BeautifulSoup(resp.content, features="lxml")self.html = soup.find_all("table")[-1]def extract_column_descriptions(self) ->None:"""Extract column descriptions""" tr_header =self.html.findChildren(name="tr")[1] column_descriptions = { cell.get("aria-label"): cell.get("data-tip")for cell in tr_header.find_all(["th", "td"]) } columns = [key for key in column_descriptions.keys()]iflen(columns) !=len(set(columns)):raiseAssertionError("Column names must be unique.")self.columns = columnsself.column_descriptions = column_descriptionsdef extract_stats(self) ->None:"""Extract table from URL and store it in a DataFrame.""" table_body =self.html.find_next(name="tbody") rows_players = table_body.findChildren(name="tr", attrs={"class": False}) table_data: list= []for row in rows_players: cells = row.find_all(["th", "td"]) table_data.append([cell.get_text(strip=True) for cell in cells])self.df = pl.DataFrame(table_data, schema=self.columns, orient="row")def parse_columns(self) ->None:"""Parse columns of a DataFrame for better inference."""self.df.write_csv("temp.csv", quote_style="necessary") parsed_df = pl.read_csv("temp.csv", use_pyarrow=True)# if os.path.exists("temp.csv"):# os.remove("temp.csv")self.df = parsed_df```We can now invoke the methods after creating a `StatsExtractor` instance.```{python}stats_to_extract = ["passing","keepers","shooting","passing_types","gca","defense","possession",]extractors = {stat: StatsExtractor(stat) for stat in stats_to_extract}dfs = {stat: extractor.df for stat, extractor in extractors.items()}```With `dfs` we only need to create one combined dataframe, and actually get to learn more about the syntax of `polars`.## Step 2: Exploring `polars` by data wranglingAfter viewing the documentation and some sample code, `polars` looks very similar to `dplyr` in *R* - and I have to say I am quite pleased.My career started with loads of R and in my own time and projects I have transitioned more towards Python.However, the one and only thing I really miss about R is how straightforward and concise the syntax is compared to `pandas`.Take the following two dataframes```{python}df_passing = dfs["passing"]df_possession = dfs["possession"]```I am so used to not having duplicate columns when performing joins.However in Python using either `pandas` or `polars`, I have not found any other way than to drop the duplicate columns.Let's join these dataframes on the columns `Player` and `Squad`[^2], and drop all the other duplicate columns.In `polars` we can do so using the selector `cs.ends_with()`.[^2]: Joining on the player name is trivial, but the squad is less obvious: the data could contain multiple records of the same player. This is due to either late transfers or a transfer in the winter window.::: column-screen-inset```{python}import polars.selectors as csdf_passing.join(df_possession, how="left", on=["Player", "Squad"]).drop(cs.ends_with("_right"))```:::Exactly what I wanted!Now we can iterate this process until there are no dataframes left to merge on.::: column-screen-inset```{python}from functools importreducedf_stats =reduce(lambda left, right: left.join(right, how="left", on=["Player", "Squad"]).drop( cs.ends_with("_right") ), dfs.values())df_stats```:::Now it's time to filter and select relevant features and change them slightly.Before doing alterations, I think it is fair to only include players who have played more than 90 minutes 15 times.Perhaps *fair* is not the right word, but including players who only played a handful of matches is misleading when evaluating their overall season performance.It does not capture consistency which is what we *do* want to include.::: column-screen-inset```{python}df_stats.filter(pl.col("90s Played") >=15)```:::Although players can play multiple positions, for the plots I prefer we keep it on a down low.The `Position` column now has these unique values:<div>```{python}df_stats.select(pl.col("Position")).unique().to_series().to_list()```</div>For the players that play multiple positions, we take the position specified first assuming it is their primary position.This will bring down the number of different positions to four!<div>```{python}df_stats_relevant = (df_stats .filter(pl.col("90s Played") >=15) .with_columns(pl.col("Position").str.slice(0, 2)))```</div>UMAP requires numerical data, so let's only select numerical columns, which in this case are all non-string columns.::: column-screen-inset```{python}df_stats_relevant.select(~cs.by_dtype(pl.String))```:::There are still some irrelevant columns, namely `Rank`, `Age` and `Year of Birth`.::: column-screen-inset```{python}df_stats_relevant.select(~cs.by_dtype(pl.String) &~cs.by_name(["Age", "Rk", "Year of birth"]))```:::For a good comparison, it is fair to condition all stats per 90 minutes played.I am happy to say there is a column for that, so we can divide each column by that column.::: column-screen-inset```{python}df_stats_relevant.select(~cs.by_dtype(pl.String) &~cs.by_name(["Age", "Rk", "Year of birth"])).with_columns(pl.all() / pl.col("90s Played").exclude("90s Played"))```:::The stat of number of 90s played can be ignored if we disregard 'fitness' as a trait Toni Kroos had.This brings us to the following dataframe:::: column-screen-inset```{python}df_stats_relevant_umap = ( df_stats_relevant.select(~cs.by_dtype(pl.String) &~cs.by_name(["Age", "Rk", "Year of birth"]) ) .with_columns(pl.all() / pl.col("90s Played").exclude("90s Played")) .drop(pl.col("90s Played")))df_stats_relevant_umap```:::There is only one thing left I would like to change.For the plots in the next step, it is useful to highlight Toni Kroos.By changing the value of the `Position` column, we can still use that column to dictate the colours and also have this clear distinction for Kroos.```{python}df_stats_relevant = df_stats_relevant.with_columns( pl.when(pl.col("Player") =="Toni Kroos") .then(pl.lit("Toni Kroos")) .otherwise(pl.col("Position")) .alias("Position"))```## Step 3: Interpreting UMAP results and plotsWe've got the dataframe and it is very tempting to just plug that into UMAP.Yet it generally is wiser to scale the data first.We can use the standard preprocessing tools that we know from `sklearn`.```{python}from sklearn.preprocessing import StandardScalerdf_stats_numeric_scaled = StandardScaler().fit_transform(df_stats_relevant_umap)```Let's get on with the first plot!```{python}#| warning: false#| message: falseimport umap.plotreducer = umap.umap_.UMAP(random_state=10)mapper = reducer.fit(df_stats_numeric_scaled, force_all_finite="allow-nan")umap.plot.points(mapper, labels=df_stats_relevant["Position"], background="black")```Oh man...What even is this?!Four positions...we would expect four 'clusters', right?Of course you can argue about the generality of the positions, but we would at least expect some type of separation between the positions.The current one is all over the place.But hope's not lost yet!Checking the documentation of the arguments we can provide to UMAP shows 'euclidean' as default metric.I would rather not go into too much detail, so essentially: the Euclidean metric in this case does not make sense.It would mean that people that have similar 'mean' stats across all stats will be packed closer to each other.This is not what we want: there should be more weights on the qualities that make a player unique, and that should be compared to others.A better metric to use would be the *Chebyshev* metric.Again without going into too much detail: the Chebyshev metric measures the greatest (absolute) difference between any stat of two players.For example, if player A is a better dribbler and player B is a better passer, these individual qualities will be more prominent with the Chebyshev metric.Let's see with a plot!```{python}#| warning: false#| classes: preview-imageimport umap.plotreducer = umap.umap_.UMAP(random_state=10, metric="chebyshev")mapper = reducer.fit(df_stats_numeric_scaled, force_all_finite="allow-nan")umap.plot.points(mapper, labels=df_stats_relevant["Position"], background="black")```Ah, much better.But still something we can improve on!Now what about the other parameters?We have already touched upon the `metric`.The other frequently used parameters are:- `n_components`;- `n_neighbors`;- `min_dist`.For the components we can show both a 2D and 3D case.Plotting in higher dimensions is a little more cumbersome.That leaves `n_neighbors` and `min_dist`.The former is set at 15 by default.A rule of thumb is that the lower this value, the more the local structure will be preserved.Conversely the higher this number, the more UMAP sees the overall structure of the data.After some tweaking[^3], it was clear from the results that a value of around 30 is quite good for both the local and global structure.[^3]: You can do so by running a simple for loop and inspecting all the plotted results.Afs for the `min_dist`, the rule of thumb is: the lower the value, the clumpier the embeddings.As is stated by the documentation of UMAP[^4], this can be useful for clustering.A default value of 0.1 is fine for our use case, which leads us to the following instances for 2D and 3D UMAP embeddings.[^4]: <https://umap-learn.readthedocs.io/en/latest/parameters.html#n-neighbors>```{python}#| error: false#| warning: falsereducer_2d = umap.umap_.UMAP( n_components=2, metric="chebyshev", n_neighbors=30, random_state=10)reducer_3d = umap.umap_.UMAP( n_components=3, metric="chebyshev", n_neighbors=30, random_state=10)embeddings_2d = reducer_2d.fit_transform( df_stats_numeric_scaled, force_all_finite="allow-nan")embeddings_3d = reducer_3d.fit_transform( df_stats_numeric_scaled, force_all_finite="allow-nan")```This leads us to two interactive figures underneath!```{python}#| column: screen-inset-shaded#| layout-nrow: 1#| code-fold: trueimport plotly.express as pxdf_plot_2d = df_stats_relevant.hstack(pl.DataFrame(embeddings_2d, schema=["x", "y"]))df_plot_3d = df_stats_relevant.hstack( pl.DataFrame(embeddings_3d, schema=["x", "y", "z"]))fig_2d = px.scatter( df_plot_2d, title="2D", x="x", y="y", color=df_stats_relevant["Position"], hover_data=["Player", "Squad", "Competition"], labels={"color": ""},)fig_2d.update_layout( plot_bgcolor="white", margin=dict(l=20, r=20, t=50, b=20), xaxis=dict(showticklabels=False, title=None, linecolor="black"), yaxis=dict(showticklabels=False, title=None, linecolor="black"),)fig_2d.update_traces( hovertemplate="<br>".join( ["Player: %{customdata[0]}","Squad: %{customdata[1]}","Competition: %{customdata[2]}", ] ))fig_3d = px.scatter_3d( df_plot_3d, title="3D", x="x", y="y", z="z", color=df_stats_relevant["Position"], hover_data=["Player", "Squad", "Competition"], labels={"color": ""},)fig_3d.update_traces( marker_size=3, hovertemplate="<br>".join( ["Player: %{customdata[0]}","Squad: %{customdata[1]}","Competition: %{customdata[2]}", ] ),)fig_3d.update_layout( scene=dict( xaxis_title="", yaxis_title="", zaxis_title="", xaxis_showspikes=False, yaxis_showspikes=False, xaxis=dict( backgroundcolor="rgb(200, 200, 230)", showticklabels=False, linecolor="black", ), yaxis=dict( backgroundcolor="rgb(230, 200,230)", showticklabels=False, linecolor="black", ), zaxis=dict( backgroundcolor="rgb(230, 230,200)", showticklabels=False, linecolor="black", ), ), margin=dict(l=20, r=20, t=50, b=20),)fig_2d.show()fig_3d.show()```*Aleix Garcia* is statistically[^5] the most similar player to Toni Kroos, which can be seen in both the 2D and 3D embedding.He is a central midfielder who played for Girona in the 2023/24 season.If you followed *LaLiga* last leason you should know that Girona was a serious title contender for the majority of the run.Only at the final third part did they not keep up with Real Madrid anymore.There have been many praises for this Girona team, with Aleix Garcia as one of the most prominent players.It does not surprise me the least that he is most similar to Kroos, and for his tremendous season he has earned a transfer to Bayer Leverkusen, the 2023/24 *Bundesliga* champions.[^5]: According to the Chebyshev metric and our choice for `min_dist` and `n_neighbors`.In the 3D embedding, it is quite extraordinary to see a defenders close to him as well.The one that is closes is *Trent Alexander-Arnold*, who plays for Liverpool in the Premier League.He is known to be an amazing creator, but usually plays right-back.There are some rumours about him joining Real Madrid as a free agent in 2025, so who knows...👀## TL;DRAleix Garcia.Trent Alexander-Arnold was the defender that was the closest.Keep in mind that the source does not have *all* statistics.Also we took the entirety of stats.For a more scoped look in the key things that are missing for Real Madrid, we could opt to include only those stats, such as passing and possession.That's all for today, thanks for reading!