Deploying an MLflow Model with Streamlit: predicting footballer positions
GK, DF, MF, or FW?
deployment
mlflow
classification
machine learning
Author
Li Yong Pan
Published
April 18, 2024
In the past I have done quite a few classification problems, but most of them were a code-along or done with trivial datasets that nearly every data scientist has done. With the expanded responsibilities as a data scientist, which already are broadly defined across the industry, I wanted to write about deploying a model with an interface. It is not uncommon to deploy models by serving them as a REST API, but I would only do that for models that could be used at larges scale. We will be using MLflow to log our model, both PySpark (on Databricks) and the scikit-learn framework to train a model, Streamlit for the interface of the app, and deployment using Docker on my Raspberry Pi. Because we use an interface and do not serve this as an API, some additional and or different steps need to be taken compared to only serving it using FastAPI or Flask on a remote server.
Today we will train and deploy a model that can predict based on footballing stats whether a player is a goalkeeper, defender, midfielder or a forward. I have always had the idea to incorporate football into these types of projects and found this project to be the perfect use case. Our aim in this post is to create a model decent enough to deploy, but ot does not necessarily have to be groundbreaking across all metrics. Let’s do it!
Step 1: Finding a dataset
Kaggle is the perfect platform to find for these niche datasets. What is Kaggle?
Kaggle is the world’s largest data science community with powerful tools and resources to help you achieve your data science goals.
There is also an API to download datasets from Kaggle, but it is a little overkill for a small project like this. It would make more sense if this dataset was continuously updated, but it will not be as it only concerns 2022-2023 season data.
Step 2: Exploring context of the data
This dataset contains 2022-2023 football player stats per 90 minutes.
Only players of Premier League, Ligue 1, Bundesliga, Serie A and La Liga are listed.
These competitions are usually denoted as the Top 5 European leagues, which is based on the UEFA rankings. Comparing a player that has played 90 minutes and has attempted 50 passes to a player that has played 45 minutes and also attempted 50 passes would be unfair. Let’s read the data
The dataset contains 124 columns, and we will not use all of them for the model.
Column descriptions
Rk : Rank
Player : Player’s name
Nation : Player’s nation
Pos : Position
Squad : Squad’s name
Comp : League that squat occupies
Age : Player’s age
Born : Year of birth
MP : Matches played
Starts : Matches started
Min : Minutes played
90s : Minutes played divided by 90
Goals : Goals scored or allowed
Shots : Shots total (Does not include penalty kicks)
SoT : Shots on target (Does not include penalty kicks)
SoT% : Shots on target percentage (Does not include penalty kicks)
G/Sh : Goals per shot
G/SoT : Goals per shot on target (Does not include penalty kicks)
ShoDist : Average distance, in yards, from goal of all shots taken (Does not include penalty kicks)
ShoFK : Shots from free kicks
ShoPK : Penalty kicks made
PKatt : Penalty kicks attempted
PasTotCmp : Passes completed
PasTotAtt : Passes attempted
PasTotCmp% : Pass completion percentage
PasTotDist : Total distance, in yards, that completed passes have traveled in any direction
PasTotPrgDist : Total distance, in yards, that completed passes have traveled towards the opponent’s goal
PasShoCmp : Passes completed (Passes between 5 and 15 yards)
PasShoAtt : Passes attempted (Passes between 5 and 15 yards)
PasShoCmp% : Pass completion percentage (Passes between 5 and 15 yards)
PasMedCmp : Passes completed (Passes between 15 and 30 yards)
PasMedAtt : Passes attempted (Passes between 15 and 30 yards)
PasMedCmp% : Pass completion percentage (Passes between 15 and 30 yards)
PasLonCmp : Passes completed (Passes longer than 30 yards)
PasLonAtt : Passes attempted (Passes longer than 30 yards)
PasLonCmp% : Pass completion percentage (Passes longer than 30 yards)
Assists : Assists
PasAss : Passes that directly lead to a shot (assisted shots)
Pas3rd : Completed passes that enter the 1/3 of the pitch closest to the goal
PPA : Completed passes into the 18-yard box
CrsPA : Completed crosses into the 18-yard box
PasProg : Completed passes that move the ball towards the opponent’s goal at least 10 yards from its furthest point in the last six passes, or any completed pass into the penalty area
PasAtt : Passes attempted
PasLive : Live-ball passes
PasDead : Dead-ball passes
PasFK : Passes attempted from free kicks
TB : Completed pass sent between back defenders into open space
Sw : Passes that travel more than 40 yards of the width of the pitch
PasCrs : Crosses
TI : Throw-Ins taken
CK : Corner kicks
CkIn : Inswinging corner kicks
CkOut : Outswinging corner kicks
CkStr : Straight corner kicks
PasCmp : Passes completed
PasOff : Offsides
PasBlocks : Blocked by the opponent who was standing it the path
SCA : Shot-creating actions
ScaPassLive : Completed live-ball passes that lead to a shot attempt
ScaPassDead : Completed dead-ball passes that lead to a shot attempt
ScaDrib : Successful dribbles that lead to a shot attempt
ScaSh : Shots that lead to another shot attempt
ScaFld : Fouls drawn that lead to a shot attempt
ScaDef : Defensive actions that lead to a shot attempt
GCA : Goal-creating actions
GcaPassLive : Completed live-ball passes that lead to a goal
GcaPassDead : Completed dead-ball passes that lead to a goal
GcaDrib : Successful dribbles that lead to a goal
GcaSh : Shots that lead to another goal-scoring shot
GcaFld : Fouls drawn that lead to a goal
GcaDef : Defensive actions that lead to a goal
Tkl : Number of players tackled
TklWon : Tackles in which the tackler’s team won possession of the ball
TklDef3rd : Tackles in defensive 1/3
TklMid3rd : Tackles in middle 1/3
TklAtt3rd : Tackles in attacking 1/3
TklDri : Number of dribblers tackled
TklDriAtt : Number of times dribbled past plus number of tackles
TklDri% : Percentage of dribblers tackled
TklDriPast : Number of times dribbled past by an opposing player
Blocks : Number of times blocking the ball by standing in its path
BlkSh : Number of times blocking a shot by standing in its path
BlkPass : Number of times blocking a pass by standing in its path
Int : Interceptions
Tkl+Int : Number of players tackled plus number of interceptions
Clr : Clearances
Err : Mistakes leading to an opponent’s shot
Touches : Number of times a player touched the ball. Note: Receiving a pass, then dribbling, then sending a pass counts as one touch
TouDefPen : Touches in defensive penalty area
TouDef3rd : Touches in defensive 1/3
TouMid3rd : Touches in middle 1/3
TouAtt3rd : Touches in attacking 1/3
TouAttPen : Touches in attacking penalty area
TouLive : Live-ball touches. Does not include corner kicks, free kicks, throw-ins, kick-offs, goal kicks or penalty kicks.
ToAtt : Number of attempts to take on defenders while dribbling
ToSuc : Number of defenders taken on successfully, by dribbling past them
ToSuc% : Percentage of take-ons Completed Successfully
ToTkl : Number of times tackled by a defender during a take-on attempt
ToTkl% : Percentage of time tackled by a defender during a take-on attempt
Carries : Number of times the player controlled the ball with their feet
CarTotDist : Total distance, in yards, a player moved the ball while controlling it with their feet, in any direction
CarPrgDist : Total distance, in yards, a player moved the ball while controlling it with their feet towards the opponent’s goal
CarProg : Carries that move the ball towards the opponent’s goal at least 5 yards, or any carry into the penalty area
Car3rd : Carries that enter the 1/3 of the pitch closest to the goal
CPA : Carries into the 18-yard box
CarMis : Number of times a player failed when attempting to gain control of a ball
CarDis : Number of times a player loses control of the ball after being tackled by an opposing player
Rec : Number of times a player successfully received a pass
RecProg : Completed passes that move the ball towards the opponent’s goal at least 10 yards from its furthest point in the last six passes, or any completed pass into the penalty area
CrdY : Yellow cards
CrdR : Red cards
2CrdY : Second yellow card
Fls : Fouls committed
Fld : Fouls drawn
Off : Offsides
Crs : Crosses
TklW : Tackles in which the tackler’s team won possession of the ball
PKwon : Penalty kicks won
PKcon : Penalty kicks conceded
OG : Own goals
Recov : Number of loose balls recovered
AerWon : Aerials won
AerLost : Aerials lost
AerWon% : Percentage of aerials won
The variable that we want to predict is in the Pos column. However, a quick glance shows that some players have more than one position defined.
Some players play as midfielder (MF) but also occasionally as a defender (DF), denoted by MFDF. And what about players that have not played many matches? On to some feature engineering!
Step 3: Feature engineering
First thing we will do is filter the players that have played at least 5 matches.
from pyspark.sql.functions import lengthdf.withColumn("PosMod", substring(col("Pos"), 1, 2)).select("PosMod").distinct().collect()
The output tells us that Pos has either length 2 or 4. For the sake of simplicity we will use the first two letters and define that as the position of the player.
from sklearn.preprocessing import LabelEncoderlabel_encoder = LabelEncoder()df["Class"] = label_encoder.fit_transform(df["PosMod"])label_encoder.classes_.tolist()
['DF', 'FW', 'GK', 'MF']
from pyspark.ml.feature import StringIndexerindexer = StringIndexer(inputCol="PosMod", outputCol="Class")df = indexer.fit(df).transform(df)
The encoding is as follows
Position (Class)
Encoding
DF = Defender
0
FW = Forward
1
GK = Goalkeeper
2
MF = Midfielder
3
It’s extremely frustrating and it grinds my gears, but I don’t think there’s a way to insert a custom mapping. I would have preferred that the order would be GK - DF - MF - FW. Makes sense right?
What remains is to define the columns to use and split the data for training and testing. With some trial and error I found a selection that leads to a decent model. Remember, the interface should be user-friendly which means that I do not want to many input variables.
Now comes the part where we integrate MLflow. Again, although MLflow allows us to evaluate iterations (runs) of models, for now we will use it to log a .pkl file which we can load into our Streamlit app. We will be using a Random Forest classifier.
MLflow autologging in Databricks
For PySpark in Databricks, we do not need to explicitly call the MLflow functions. Databricks has MLflow autologging enabled by default on ML clusters, which we are using.
These metrics seem fine!1 Not something to write home about if you’d ask me, but good enough. It could definitely be improved upon, but again: not the scope of this post. The model and the corresponding scaler are saved under a local directory /mlruns/0/<run_id>/artifacts/. We extract the corresponding .pkl files and put them in the top-level, where the other .py files reside as well.
Continuing with scikit-learn results
As I am using the Databricks Community version, I found it very hard (or perhaps it is impossible) to extract the files to a local machine. Therefore we continue with scikit-learn results and code output.
Step 5: Creating a Streamlit app
In other projects I have fiddled around with Streamlit and I really enjoy its simplicity. It allows for quick deployment and has a decent default interface. Rather than building the app.py piece by piece, I will provide the file and explain parts of the code that are not self-explanatory.
app.py (1/2)
import pickleimport pandas as pdimport streamlit as st# Load MLflow model and scalerwithopen("rf_model.pkl", "rb") as f: model = pickle.load(f)withopen("rf_scaler.pkl", "rb") as f: scaler = pickle.load(f)# Load datadf = pd.read_csv("2022_2023_Football_Player_Stats.csv", sep=";", encoding="latin1")# Positionspositions = {0: "Defender", 1: "Forward", 2: "Goalkeeper", 3: "Midfielder"}# Helper function to preprocess input datadef preprocess_input(data):# Scale data data = scaler.transform(data)return data# Function to predict using the modeldef predict(data):# Preprocess input data X = preprocess_input(data)# Make predictions predictions = model.predict(X)return predictions# Function to compare input with existing playersdef compare(data, n): cols = ["Shots", "PasMedAtt", "Pas3rd", "Clr"] mse = ((df[["Shots", "PasMedAtt", "Pas3rd", "Clr"]] - data.iloc[0]) **2).mean(axis=1) idxs = mse.nsmallest(n).indexreturn df.loc[idxs][["Player", "Age", "Pos", "Squad", "Comp"] + cols]# Function to format a line of player informationdef format_player(row):returnf"*{row['Player']}*, a {row['Age']} year old {row['Pos']} playing for {row['Squad']} in {row['Comp']}."# Function that provides text for players similar to inputdef text_similar_players(data): data = data.reset_index() text ='The players closest to your input are:\n'for index, row in data.iterrows(): player_info = format_player(row) text +=f"{index +1}. {player_info}\n"return text# ...
1
Compare user input with all players in the dataset. Use the MSE as the metric to minimise and find the n closest/most similar players with respect to input statistics.
2
Provide a list of n players that are most similar to user input.
Honestly not very much to explain.
app.py (1/2)
# Streamlit UIdef main(): st.set_page_config(page_title="Classifying Footballers", layout="wide") st.markdown(""" <style> .reportview-container { margin-top: -2em; } #MainMenu {visibility: hidden;} .stDeployButton {display:none;} footer {visibility: hidden;} #stDecoration {display:none;} </style> """, unsafe_allow_html=True) st.title("Predicting Football player positions based on 2022/23 stats") st.sidebar.success(f"Visit [blog.panliyong.nl](https://blog.panliyong.nl/posts/006_football) for the post!")# Create input form st.sidebar.header("Input Features") shots = st.sidebar.number_input("Shots", min_value=0.0, max_value=10.0, step=0.1, value=1.0) pas_med_att = st.sidebar.number_input("Passes Medium Att", min_value=0.0, max_value=70.0, step=1.0, value=20.0) pas_3rd = st.sidebar.number_input("3rd Passes", min_value=0.0, max_value=20.0, step=1.0, value=3.0) clr = st.sidebar.number_input("Clearances", min_value=0.0, max_value=15.0, step=0.1, value=3.0)# Create column layout left_column, right_column = st.columns(2)with left_column:# Players compare st.markdown("### Find similar players") n_players = st.slider("How many similar players do you want to find?", min_value=1, max_value=20, step=1, value=5)# Create prediction buttonif st.sidebar.button("Predict"):# Create a dataframe from the input input_data = pd.DataFrame({"Shots": [shots],"PasMedAtt": [pas_med_att],"Pas3rd": [pas_3rd],"Clr": [clr] })# Predict prediction = predict(input_data)# Display prediction position = positions.get(prediction[0]) position_styled =f"<span style='color:purple'>{position}</span>" st.sidebar.info(f"Predicted Position: {position}")# Check which player is closest to input players = compare(input_data, n=n_players) text = text_similar_players(players)with left_column: st.info(text)with right_column: st.markdown(f"# **Predicted a** {position_styled}", unsafe_allow_html=True) st.markdown("## Data of similar players:") st.dataframe(players.reset_index(drop=True))# Run the appif__name__=="__main__": main()
1
Remove developer settings for Streamlit.
2
Define user input interactivity.
We can try and see if the Streamlit app runs with the following command
Terminal
streamlit run app.py
This is what the app looks like:
Now it’s time to prepare it for deployment!
Step 6: Make a Dockerfile
A few things to keep in mind when writing this Dockerfile is that the requirements.txt and the other environment files stored by MLflow have many irrelevant dependencies. The only things we need are pandas, streamlit and scikit-learn. That makes building the Dockerfile quicker as there are less dependencies that have to be installed.
However, another important factor is that all the time I have been developing on a Windows (64-bit) machine. The Raspberry Pi which the app is going to run on is a Linux (arm64) machine. As these platforms are not the same, we cannot simply use docker build commands and expect them to work on the Linux device.
The default port of Streamlit apps is 8501 and we did not change that, so we have to expose this port. Furthermore only the .pkl files and app.py and .csv file are all the files we need. Then we can install the libraries we need to run the Streamlit app.
Now comes the part where we build the image and push it to Docker Hub under panliyong/football-positions. Up for grabs on the Raspberry Pi!
The image is now on Docker Hub and we can pull the image now. Let’s name the container football-positions and map port 8501 of the Pi to the exposed port. I have also had issues with containers running out of memory, so in order to keep this container running we can update the restart policy to always restart the container.2
My domain panliyong.nl and subdomains are managed through Cloudflare and are hosted on the Apache web server on the same Pi. We can serve this on football-positions.panliyong.nl for the app. If we go to the DNS settings and add a CNAME record with name football-positions and target panliyong.nl, we enable this new subdomain.
We need to add a virtual host football-positions.conf and enable it.
football-positions.conf
<VirtualHost *:80>ServerName football-positions.panliyong.nlServerAlias www.football-positions.panliyong.nlProxyPass / http://localhost:8501/ProxyPassReverse / http://localhost:8501/# This was the fix mentioned below!<Location "/_stcore/stream">ProxyPass ws://localhost:8501/_stcore/streamProxyPassReverse ws://localhost:8501/_stcore/stream</Location># ...</VirtualHost><IfModule mod_ssl.c><VirtualHost *:443>ServerAlias www.football-positions.panliyong.nlServerName football-positions.panliyong.nlSSLProxyEngine onProxyPass / http://localhost:8501/ProxyPassReverse / http://localhost:8501/# ...# Redirect HTTP to HTTPSRewriteEngine OnRewriteCond %{HTTPS} offRewriteRule ^ https://%{HTTP_HOST}%{REQUEST_URI} [L,R=301]</VirtualHost></IfModule>
It’s not as difficult as it sounds to prepare a model for deployment. Due to unforeseen changes in the streamlit==1.33 package, the app did not load due to websocket erros. We will be looking back on this soon!
Results of scikit-learn and PySpark can differ, but I am talking about the scikit-learn metrics.↩︎
This could also be done in the initial run command, but I forgot and was too lazy to delete it…↩︎
Source Code
---title: "Deploying an MLflow Model with Streamlit: predicting footballer positions"date: "2024-04-18"categories: [deployment, mlflow, classification, machine learning]description: "GK, DF, MF, or FW?"code-fold: falsecode-annotations: hover---In the past I have done quite a few classification problems, but most of them were a code-along or done with trivial datasets that nearly every data scientist has done.With the expanded responsibilities as a data scientist, which already are broadly defined across the industry, I wanted to write about deploying a model with an interface.It is not uncommon to deploy models by serving them as a REST API, but I would only do that for models that could be used at larges scale.We will be using *MLflow* to log our model, both *PySpark* (on Databricks) and the *scikit-learn* framework to train a model, *Streamlit* for the interface of the app, and deployment using *Docker* on my Raspberry Pi.Because we use an interface and do not serve this as an API, some additional and or different steps need to be taken compared to only serving it using *FastAPI* or *Flask* on a remote server.Today we will train and deploy a model that can predict based on footballing stats whether a player is a goalkeeper, defender, midfielder or a forward.I have always had the idea to incorporate football into these types of projects and found this project to be the perfect use case.Our aim in this post is to create a model decent enough to deploy, but ot does not necessarily have to be groundbreaking across all metrics.Let's do it!## Step 1: Finding a dataset*Kaggle* is the perfect platform to find for these niche datasets.What is Kaggle?> Kaggle is the world's largest data science community with powerful tools and resources to help you achieve your data science goals.It also has various user-contributed datasets.The one we will be using today is [2023-2023 Football Player Stats](https://www.kaggle.com/datasets/vivovinco/20222023-football-player-stats).::: callout-noteThere is also an API to download datasets from Kaggle, but it is a little overkill for a small project like this.It would make more sense if this dataset was continuously updated, but it will not be as it only concerns 2022-2023 season data.:::## Step 2: Exploring context of the data> This dataset contains 2022-2023 football player stats per 90 minutes.\>> Only players of Premier League, Ligue 1, Bundesliga, Serie A and La Liga are listed.These competitions are usually denoted as the Top 5 European leagues, which is based on the [UEFA rankings](https://www.uefa.com/nationalassociations/uefarankings/country/?year=2023).Comparing a player that has played 90 minutes and has attempted 50 passes to a player that has played 45 minutes and also attempted 50 passes would be unfair.Let's read the data::: panel-tabset## Pandas```{python}import pandas as pddf = pd.read_csv("2022_2023_Football_Player_Stats.csv", sep=";", encoding="latin-1")df.head()```## PySpark```{python}#| eval: falsefrom pyspark.sql import SparkSessionspark = SparkSession.builder.appName("football").getOrCreate()df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("dbfs:/FileStore/tables/2022_2023_Football_Player_Stats.csv", sep=";")df.head()```:::The dataset contains 124 columns, and we will not use all of them for the model.::: {.callout-note collapse="true"}## Column descriptions- Rk : Rank- Player : Player's name- Nation : Player's nation- Pos : Position- Squad : Squad’s name- Comp : League that squat occupies- Age : Player's age- Born : Year of birth- MP : Matches played- Starts : Matches started- Min : Minutes played- 90s : Minutes played divided by 90- Goals : Goals scored or allowed- Shots : Shots total (Does not include penalty kicks)- SoT : Shots on target (Does not include penalty kicks)- SoT% : Shots on target percentage (Does not include penalty kicks)- G/Sh : Goals per shot- G/SoT : Goals per shot on target (Does not include penalty kicks)- ShoDist : Average distance, in yards, from goal of all shots taken (Does not include penalty kicks)- ShoFK : Shots from free kicks- ShoPK : Penalty kicks made- PKatt : Penalty kicks attempted- PasTotCmp : Passes completed- PasTotAtt : Passes attempted- PasTotCmp% : Pass completion percentage- PasTotDist : Total distance, in yards, that completed passes have traveled in any direction- PasTotPrgDist : Total distance, in yards, that completed passes have traveled towards the opponent's goal- PasShoCmp : Passes completed (Passes between 5 and 15 yards)- PasShoAtt : Passes attempted (Passes between 5 and 15 yards)- PasShoCmp% : Pass completion percentage (Passes between 5 and 15 yards)- PasMedCmp : Passes completed (Passes between 15 and 30 yards)- PasMedAtt : Passes attempted (Passes between 15 and 30 yards)- PasMedCmp% : Pass completion percentage (Passes between 15 and 30 yards)- PasLonCmp : Passes completed (Passes longer than 30 yards)- PasLonAtt : Passes attempted (Passes longer than 30 yards)- PasLonCmp% : Pass completion percentage (Passes longer than 30 yards)- Assists : Assists- PasAss : Passes that directly lead to a shot (assisted shots)- Pas3rd : Completed passes that enter the 1/3 of the pitch closest to the goal- PPA : Completed passes into the 18-yard box- CrsPA : Completed crosses into the 18-yard box- PasProg : Completed passes that move the ball towards the opponent's goal at least 10 yards from its furthest point in the last six passes, or any completed pass into the penalty area- PasAtt : Passes attempted- PasLive : Live-ball passes- PasDead : Dead-ball passes- PasFK : Passes attempted from free kicks- TB : Completed pass sent between back defenders into open space- Sw : Passes that travel more than 40 yards of the width of the pitch- PasCrs : Crosses- TI : Throw-Ins taken- CK : Corner kicks- CkIn : Inswinging corner kicks- CkOut : Outswinging corner kicks- CkStr : Straight corner kicks- PasCmp : Passes completed- PasOff : Offsides- PasBlocks : Blocked by the opponent who was standing it the path- SCA : Shot-creating actions- ScaPassLive : Completed live-ball passes that lead to a shot attempt- ScaPassDead : Completed dead-ball passes that lead to a shot attempt- ScaDrib : Successful dribbles that lead to a shot attempt- ScaSh : Shots that lead to another shot attempt- ScaFld : Fouls drawn that lead to a shot attempt- ScaDef : Defensive actions that lead to a shot attempt- GCA : Goal-creating actions- GcaPassLive : Completed live-ball passes that lead to a goal- GcaPassDead : Completed dead-ball passes that lead to a goal- GcaDrib : Successful dribbles that lead to a goal- GcaSh : Shots that lead to another goal-scoring shot- GcaFld : Fouls drawn that lead to a goal- GcaDef : Defensive actions that lead to a goal- Tkl : Number of players tackled- TklWon : Tackles in which the tackler's team won possession of the ball- TklDef3rd : Tackles in defensive 1/3- TklMid3rd : Tackles in middle 1/3- TklAtt3rd : Tackles in attacking 1/3- TklDri : Number of dribblers tackled- TklDriAtt : Number of times dribbled past plus number of tackles- TklDri% : Percentage of dribblers tackled- TklDriPast : Number of times dribbled past by an opposing player- Blocks : Number of times blocking the ball by standing in its path- BlkSh : Number of times blocking a shot by standing in its path- BlkPass : Number of times blocking a pass by standing in its path- Int : Interceptions- Tkl+Int : Number of players tackled plus number of interceptions- Clr : Clearances- Err : Mistakes leading to an opponent's shot- Touches : Number of times a player touched the ball. Note: Receiving a pass, then dribbling, then sending a pass counts as one touch- TouDefPen : Touches in defensive penalty area- TouDef3rd : Touches in defensive 1/3- TouMid3rd : Touches in middle 1/3- TouAtt3rd : Touches in attacking 1/3- TouAttPen : Touches in attacking penalty area- TouLive : Live-ball touches. Does not include corner kicks, free kicks, throw-ins, kick-offs, goal kicks or penalty kicks.- ToAtt : Number of attempts to take on defenders while dribbling- ToSuc : Number of defenders taken on successfully, by dribbling past them- ToSuc% : Percentage of take-ons Completed Successfully- ToTkl : Number of times tackled by a defender during a take-on attempt- ToTkl% : Percentage of time tackled by a defender during a take-on attempt- Carries : Number of times the player controlled the ball with their feet- CarTotDist : Total distance, in yards, a player moved the ball while controlling it with their feet, in any direction- CarPrgDist : Total distance, in yards, a player moved the ball while controlling it with their feet towards the opponent's goal- CarProg : Carries that move the ball towards the opponent's goal at least 5 yards, or any carry into the penalty area- Car3rd : Carries that enter the 1/3 of the pitch closest to the goal- CPA : Carries into the 18-yard box- CarMis : Number of times a player failed when attempting to gain control of a ball- CarDis : Number of times a player loses control of the ball after being tackled by an opposing player- Rec : Number of times a player successfully received a pass- RecProg : Completed passes that move the ball towards the opponent's goal at least 10 yards from its furthest point in the last six passes, or any completed pass into the penalty area- CrdY : Yellow cards- CrdR : Red cards- 2CrdY : Second yellow card- Fls : Fouls committed- Fld : Fouls drawn- Off : Offsides- Crs : Crosses- TklW : Tackles in which the tackler's team won possession of the ball- PKwon : Penalty kicks won- PKcon : Penalty kicks conceded- OG : Own goals- Recov : Number of loose balls recovered- AerWon : Aerials won- AerLost : Aerials lost- AerWon% : Percentage of aerials won:::The variable that we want to predict is in the `Pos` column.However, a quick glance shows that some players have more than one position defined.::: panel-tabset## Pandas```{python}df["Pos"].unique().tolist()```## PySpark```{python}#| eval: falsedf.select("Pos").distinct().show()```:::Some players play as midfielder (`MF`) but also occasionally as a defender (`DF`), denoted by `MFDF`.And what about players that have not played many matches?On to some feature engineering!## Step 3: Feature engineeringFirst thing we will do is filter the players that have played at least 5 matches.::: panel-tabset## Pandas```{python}df = df[df["MP"] >=5]df.shape```## PySpark```{python}#| eval: falsedf = df.filter(df["MP"] >=5)df.count()```:::We still have 2066 observations after filtering, which is fine.Second thing to do is get rid of multiple positions for one player.::: panel-tabset## Pandas```{python}df["Pos"].apply(len).unique().tolist()```## PySpark```{python}#| eval: falsefrom pyspark.sql.functions import lengthdf.withColumn("PosMod", substring(col("Pos"), 1, 2)).select("PosMod").distinct().collect()```:::The output tells us that `Pos` has either length 2 or 4.For the sake of simplicity we will use the first two letters and define that as the position of the player.::: panel-tabset## Pandas```{python}df.loc[:, "PosMod"] = df["Pos"].str[:2]df["PosMod"].unique().tolist()```## PySpark```{python}#| eval: falsefrom pyspark.sql.functions import coldf = df.withColumn("PosMod", substring(col("Pos"), 1, 2))df.select("PosMod").distinct().show()```:::4 positions, perfect!The final thing to do with the features is encode the positions into numerical values.::: panel-tabset## Scikit-learn```{python}from sklearn.preprocessing import LabelEncoderlabel_encoder = LabelEncoder()df["Class"] = label_encoder.fit_transform(df["PosMod"])label_encoder.classes_.tolist()```## PySpark```{python}#| eval: falsefrom pyspark.ml.feature import StringIndexerindexer = StringIndexer(inputCol="PosMod", outputCol="Class")df = indexer.fit(df).transform(df)```:::The encoding is as follows| Position (Class) | Encoding ||------------------|----------|| DF = Defender | 0 || FW = Forward | 1 || GK = Goalkeeper | 2 || MF = Midfielder | 3 |It's extremely frustrating and it grinds my gears, but I don't think there's a way to insert a custom mapping.I would have preferred that the order would be GK - DF - MF - FW.Makes sense right?What remains is to define the columns to use and split the data for training and testing.With some trial and error I found a selection that leads to a decent model.Remember, the interface should be user-friendly which means that I do not want to many input variables.::: panel-tabset## Scikit-learn```{python}cols = ["Shots", "PasMedAtt", "Pas3rd", "Clr"]X = df[cols]y = df["Class"]from sklearn.model_selection import train_test_splitX_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=10)```## PySpark```{python}#| eval: falsecols = ["Shots", "PasMedAtt", "Pas3rd", "Clr", "Class"]train, test = df.select(cols).randomSplit([0.7, 0.3], seed=10)```:::| Column | Description ||-------------------------|-----------------------------------------------|| Shots | Shots total (Does not include penalty kicks) per 90 minutes || PassMedAtt | Passes attempted (Passes between 15 and 30 yards) per 90 minutes || Pas3rd | Completed passes that enter the 1/3 of the pitch closest to the goal per 90 minutes || Clr | Clearances per 90 minutes |In another post I would like to revise this selection with grid optimisation.For the current post the selection suffices.## Step 4: Training the modelBefore training the model, we should scale the columns as players usually have many more passes attempted than for example shots or clearances.::: panel-tabset## Scikit-learn```{python}from sklearn.preprocessing import StandardScalerscaler = StandardScaler()X_train_scaled = scaler.fit_transform(X_train)X_test_scaled = scaler.transform(X_test)```## PySpark```{python}#| eval: falsefrom pyspark.ml.feature import StandardScaler, VectorAssemblerassembler = VectorAssembler(inputCols=cols, outputCol="features_to_scale")train = assembler.transform(train)test = assembler.transform(test)scaler = StandardScaler(inputCol="features_to_scale", outputCol="scaled_features")scaler_model = scaler.fit(train)train_scaled = scaler_model.transform(train)test_scaled = scaler_model.transform(test)```:::Now comes the part where we integrate MLflow.Again, although MLflow allows us to evaluate iterations (runs) of models, for now we will use it to log a `.pkl` file which we can load into our Streamlit app.We will be using a Random Forest classifier.::: callout-note### MLflow autologging in DatabricksFor PySpark in Databricks, we do not need to explicitly call the MLflow functions.Databricks has MLflow autologging enabled by default on ML clusters, which we are using.:::::: panel-tabset## Scikit-learn```{python}#| warning: false#| message: falsefrom sklearn.ensemble import RandomForestClassifierfrom sklearn.metrics import accuracy_score, f1_score, recall_scoreimport mlflow.sklearnwith mlflow.start_run(): mlflow.sklearn.log_model(scaler, "rf_scaler")# Train Random Forest classifier rf = RandomForestClassifier(random_state=10) rf.fit(X_train_scaled, y_train)# Predict on test set y_pred = rf.predict(X_test_scaled)# Evaluate the model f1 = f1_score(y_test, y_pred, average="weighted") accuracy = accuracy_score(y_test, y_pred) recall = recall_score(y_test, y_pred, average="weighted")# Log metrics to MLflow mlflow.log_metric("f1_score", f1) mlflow.log_metric("accuracy", accuracy) mlflow.log_metric("recall", recall)# Log final model to MLflow mlflow.sklearn.log_model(rf, "rf_model")print(f"F1 Score: {f1}")print(f"Accuracy: {accuracy}")print(f"Recall: {recall}")```## PySpark```{python}#| eval: falsefrom pyspark.ml.evaluation import MulticlassClassificationEvaluatorfrom pyspark.ml.classification import RandomForestClassifierrf = RandomForestClassifier(featuresCol="scaled_features", labelCol="Class")rf_model = rf.fit(train_scaled)rf_results = rf_model.transform(test_scaled)f1 = MulticlassClassificationEvaluator(labelCol="Class")accuracy = MulticlassClassificationEvaluator( labelCol="Class", metricName="accuracy")recall = MulticlassClassificationEvaluator( labelCol="Class", metricName="recallByLabel")f1.evaluate(rf_results)accuracy.evaluate(rf_results)recall.evaluate(rf_results)```:::These metrics seem fine![^1]Not something to write home about if you'd ask me, but good enough.It could definitely be improved upon, but again: not the scope of this post.The model and the corresponding scaler are saved under a local directory `/mlruns/0/<run_id>/artifacts/`. We extract the corresponding `.pkl` files and put them in the top-level, where the other `.py` files reside as well.[^1]: Results of scikit-learn and PySpark can differ, but I am talking about the scikit-learn metrics.::: callout-note## Continuing with scikit-learn resultsAs I am using the Databricks Community version, I found it very hard (or perhaps it is impossible) to extract the files to a local machine.Therefore we continue with scikit-learn results and code output.:::## Step 5: Creating a Streamlit appIn other projects I have fiddled around with Streamlit and I really enjoy its simplicity.It allows for quick deployment and has a decent default interface.Rather than building the `app.py` piece by piece, I will provide the file and explain parts of the code that are not self-explanatory.``` {.python filename="app.py (1/2)"}import pickleimport pandas as pdimport streamlit as st# Load MLflow model and scalerwith open("rf_model.pkl", "rb") as f: model = pickle.load(f)with open("rf_scaler.pkl", "rb") as f: scaler = pickle.load(f)# Load datadf = pd.read_csv("2022_2023_Football_Player_Stats.csv", sep=";", encoding="latin1")# Positionspositions = {0: "Defender", 1: "Forward", 2: "Goalkeeper", 3: "Midfielder"}# Helper function to preprocess input datadef preprocess_input(data): # Scale data data = scaler.transform(data) return data# Function to predict using the modeldef predict(data): # Preprocess input data X = preprocess_input(data) # Make predictions predictions = model.predict(X) return predictions# Function to compare input with existing players # <1>def compare(data, n): # <1> cols = ["Shots", "PasMedAtt", "Pas3rd", "Clr"] # <1> mse = ((df[["Shots", "PasMedAtt", "Pas3rd", "Clr"]] - data.iloc[0]) ** 2).mean(axis=1) # <1> idxs = mse.nsmallest(n).index # <1> return df.loc[idxs][["Player", "Age", "Pos", "Squad", "Comp"] + cols] # <1># Function to format a line of player informationdef format_player(row): return f"*{row['Player']}*, a {row['Age']} year old {row['Pos']} playing for {row['Squad']} in {row['Comp']}."# Function that provides text for players similar to input # <2>def text_similar_players(data): # <2> data = data.reset_index() # <2> text = 'The players closest to your input are:\n' # <2> for index, row in data.iterrows(): # <2> player_info = format_player(row) # <2> text += f"{index + 1}. {player_info}\n" # <2> return text # <2># ...```1. Compare user input with all players in the dataset. Use the MSE as the metric to minimise and find the *n* closest/most similar players with respect to input statistics.2. Provide a list of *n* players that are most similar to user input.Honestly not very much to explain.``` {.python filename="app.py (1/2)"}# Streamlit UIdef main(): st.set_page_config(page_title="Classifying Footballers", layout="wide") st.markdown(""" # <1> <style> # <1> .reportview-container { # <1> margin-top: -2em; # <1> } # <1> #MainMenu {visibility: hidden;} # <1> .stDeployButton {display:none;} # <1> footer {visibility: hidden;} # <1> #stDecoration {display:none;} # <1> </style> # <1> """, unsafe_allow_html=True) # <1> st.title("Predicting Football player positions based on 2022/23 stats") st.sidebar.success(f"Visit [blog.panliyong.nl](https://blog.panliyong.nl/posts/006_football) for the post!") # Create input form # <2> st.sidebar.header("Input Features") # <2> shots = st.sidebar.number_input("Shots", min_value=0.0, max_value=10.0, step=0.1, value=1.0) # <2> pas_med_att = st.sidebar.number_input("Passes Medium Att", min_value=0.0, max_value=70.0, step=1.0, value=20.0) # <2> pas_3rd = st.sidebar.number_input("3rd Passes", min_value=0.0, max_value=20.0, step=1.0, value=3.0) # <2> clr = st.sidebar.number_input("Clearances", min_value=0.0, max_value=15.0, step=0.1, value=3.0) # <2> # Create column layout left_column, right_column = st.columns(2) with left_column: # Players compare st.markdown("### Find similar players") n_players = st.slider("How many similar players do you want to find?", min_value=1, max_value=20, step=1, value=5) # Create prediction button if st.sidebar.button("Predict"): # Create a dataframe from the input input_data = pd.DataFrame({ "Shots": [shots], "PasMedAtt": [pas_med_att], "Pas3rd": [pas_3rd], "Clr": [clr] }) # Predict prediction = predict(input_data) # Display prediction position = positions.get(prediction[0]) position_styled = f"<span style='color:purple'>{position}</span>" st.sidebar.info(f"Predicted Position: {position}") # Check which player is closest to input players = compare(input_data, n=n_players) text = text_similar_players(players) with left_column: st.info(text) with right_column: st.markdown(f"# **Predicted a** {position_styled}", unsafe_allow_html=True) st.markdown("## Data of similar players:") st.dataframe(players.reset_index(drop=True))# Run the appif __name__ == "__main__": main()```1. Remove developer settings for Streamlit.2. Define user input interactivity.We can try and see if the Streamlit app runs with the following command``` {.bash filename="Terminal"}streamlit run app.py```This is what the app looks like:Now it's time to prepare it for deployment!## Step 6: Make a DockerfileA few things to keep in mind when writing this Dockerfile is that the `requirements.txt` and the other environment files stored by MLflow have many irrelevant dependencies.The only things we need are `pandas`, `streamlit` and `scikit-learn`.That makes building the Dockerfile quicker as there are less dependencies that have to be installed.However, another important factor is that all the time I have been developing on a Windows (64-bit) machine.The Raspberry Pi which the app is going to run on is a Linux (arm64) machine.As these platforms are not the same, we cannot simply use `docker build` commands and expect them to work on the Linux device.Let's first consider our Dockerfile:``` dockerfileFROM python:3.10-slimWORKDIR /appCOPY app.py .COPY 2022_2023_Football_Player_Stats.csv .COPY rf_model.pkl .COPY rf_scaler.pkl .RUN pip install streamlit pandas scikit-learn==1.4.0EXPOSE 8501CMD ["streamlit", "run", "app.py", "--server.address=0.0.0.0"]```The default port of Streamlit apps is 8501 and we did not change that, so we have to expose this port.Furthermore only the `.pkl` files and `app.py` and `.csv` file are all the files we need.Then we can install the libraries we need to run the Streamlit app.Now comes the part where we build the image and push it to Docker Hub under `panliyong/football-positions`.Up for grabs on the Raspberry Pi!``` {.bash filename="Terminal"}docker buildx build --platform linux/arm64 -t panliyong/football-positions -f Dockerfile --push .```## Step 7: Deploying on Raspberry PiThe image is now on Docker Hub and we can pull the image now.Let's name the container `football-positions` and map port 8501 of the Pi to the exposed port.I have also had issues with containers running out of memory, so in order to keep this container running we can update the restart policy to always restart the container.[^2][^2]: This could also be done in the initial run command, but I forgot and was too lazy to delete it...``` {.bash filename="Terminal"}docker pull panliyong/football-positionsdocker run -d --name football-positions -p 8501:8501 panliyong/football-positionsdocker update --restart=always football-positions```My domain `panliyong.nl` and subdomains are managed through [Cloudflare](https://www.cloudflare.com/) and are hosted on the Apache web server on the same Pi.We can serve this on [football-positions.panliyong.nl](https://football-positions.panliyong.nl) for the app.If we go to the DNS settings and add a `CNAME` record with name `football-positions` and target `panliyong.nl`, we enable this new subdomain.We need to add a virtual host `football-positions.conf` and enable it.``` {.bash filename="football-positions.conf"}<VirtualHost *:80> ServerName football-positions.panliyong.nl ServerAlias www.football-positions.panliyong.nl ProxyPass / http://localhost:8501/ ProxyPassReverse / http://localhost:8501/ # This was the fix mentioned below! <Location "/_stcore/stream"> ProxyPass ws://localhost:8501/_stcore/stream ProxyPassReverse ws://localhost:8501/_stcore/stream </Location> # ...</VirtualHost><IfModule mod_ssl.c> <VirtualHost *:443> ServerAlias www.football-positions.panliyong.nl ServerName football-positions.panliyong.nl SSLProxyEngine on ProxyPass / http://localhost:8501/ ProxyPassReverse / http://localhost:8501/ # ... # Redirect HTTP to HTTPS RewriteEngine On RewriteCond %{HTTPS} off RewriteRule ^ https://%{HTTP_HOST}%{REQUEST_URI} [L,R=301] </VirtualHost></IfModule>```Enable it with these commands:``` bashsudo a2ensite football-positions.confsudo systemctl reload apache2```If everything is correctly set up the app should be accessible through [football-positions.panliyong.nl](https://football-positions.panliyong.nl).::: callout-caution## Inaccessible due to Streamlit updatesIt appears that the app is not loading on the Apache web server due to websocket errors and some API changes.I'll be looking for a fix...**(2024-04-22):** The app is hosted on [football-positions.streamlit.app](https://football-positions.streamlit.app/) for the time being!**(2024-04-29):** I have found the fix! The app is hosted on [football-positions.panliyong.nl](https://football-positions.panliyong.nl/)!:::## TL;DRIt's not as difficult as it sounds to prepare a model for deployment.Due to unforeseen changes in the `streamlit==1.33` package, the app did not load due to websocket erros.We will be looking back on this soon!::: {.callout-caution appearance="simple"}**(2024-04-22):** The app is hosted on [football-positions.streamlit.app](https://football-positions.streamlit.app/) for the time being!**(2024-04-29):** I have found the fix! The app is hosted on [football-positions.panliyong.nl](https://football-positions.panliyong.nl/)!:::That's all for today, thanks for reading!