Deploying an MLflow Model with Streamlit: predicting footballer positions

In the past I have done quite a few classification problems, but most of them were a code-along or done with trivial datasets that nearly every data scientist has done. With the expanded responsibilities as a data scientist, which already are broadly defined across the industry, I wanted to write about deploying a model with an interface. It is not uncommon to deploy models by serving them as a REST API, but I would only do that for models that could be used at larges scale. We will be using MLflow to log our model, both PySpark (on Databricks) and the scikit-learn framework to train a model, Streamlit for the interface of the app, and deployment using Docker on my Raspberry Pi. Because we use an interface and do not serve this as an API, some additional and or different steps need to be taken compared to only serving it using FastAPI or Flask on a remote server.

Today we will train and deploy a model that can predict based on footballing stats whether a player is a goalkeeper, defender, midfielder or a forward. I have always had the idea to incorporate football into these types of projects and found this project to be the perfect use case. Our aim in this post is to create a model decent enough to deploy, but ot does not necessarily have to be groundbreaking across all metrics. Let’s do it!

Step 1: Finding a dataset

Kaggle is the perfect platform to find for these niche datasets. What is Kaggle?

Kaggle is the world’s largest data science community with powerful tools and resources to help you achieve your data science goals.

It also has various user-contributed datasets. The one we will be using today is 2023-2023 Football Player Stats.

Note

There is also an API to download datasets from Kaggle, but it is a little overkill for a small project like this. It would make more sense if this dataset was continuously updated, but it will not be as it only concerns 2022-2023 season data.

Step 2: Exploring context of the data

This dataset contains 2022-2023 football player stats per 90 minutes.

Only players of Premier League, Ligue 1, Bundesliga, Serie A and La Liga are listed.

These competitions are usually denoted as the Top 5 European leagues, which is based on the UEFA rankings. Comparing a player that has played 90 minutes and has attempted 50 passes to a player that has played 45 minutes and also attempted 50 passes would be unfair. Let’s read the data

Pandas
PySpark

import pandas as pd
df = pd.read_csv("2022_2023_Football_Player_Stats.csv", sep=";", encoding="latin-1")
df.head()

	Rk	Player	Nation	Pos	Squad	Comp	Age	Born	MP	Starts	...	Off	Crs	TklW	OG	Recov	AerWon	AerLost	AerWon%
0	1	Brenden Aaronson	USA	MFFW	Leeds United	Premier League	22	2000	20	19	...	0.17	2.54	0.51	0.00	4.86	0.34	1.19	22.2
1	2	Yunis Abdelhamid	MAR	DF	Reims	Ligue 1	35	1987	22	22	...	0.05	0.18	1.59	0.00	6.64	2.18	1.23	64.0
2	3	Himad Abdelli	FRA	MFFW	Angers	Ligue 1	23	1999	14	8	...	0.00	1.05	1.40	0.00	8.14	0.93	1.05	47.1
3	4	Salis Abdul Samed	GHA	MF	Lens	Ligue 1	22	2000	20	20	...	0.00	0.35	0.80	0.05	6.60	0.50	0.50	50.0
4	5	Laurent Abergel	FRA	MF	Lorient	Ligue 1	30	1993	15	15	...	0.00	0.23	2.02	0.00	6.51	0.31	0.39	44.4

5 rows × 124 columns

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("football").getOrCreate()

df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("dbfs:/FileStore/tables/2022_2023_Football_Player_Stats.csv", sep=";")
df.head()

The dataset contains 124 columns, and we will not use all of them for the model.

Column descriptions

Rk : Rank
Player : Player’s name
Nation : Player’s nation
Pos : Position
Squad : Squad’s name
Comp : League that squat occupies
Age : Player’s age
Born : Year of birth
MP : Matches played
Starts : Matches started
Min : Minutes played
90s : Minutes played divided by 90
Goals : Goals scored or allowed
Shots : Shots total (Does not include penalty kicks)
SoT : Shots on target (Does not include penalty kicks)
SoT% : Shots on target percentage (Does not include penalty kicks)
G/Sh : Goals per shot
G/SoT : Goals per shot on target (Does not include penalty kicks)
ShoDist : Average distance, in yards, from goal of all shots taken (Does not include penalty kicks)
ShoFK : Shots from free kicks
ShoPK : Penalty kicks made
PKatt : Penalty kicks attempted
PasTotCmp : Passes completed
PasTotAtt : Passes attempted
PasTotCmp% : Pass completion percentage
PasTotDist : Total distance, in yards, that completed passes have traveled in any direction
PasTotPrgDist : Total distance, in yards, that completed passes have traveled towards the opponent’s goal
PasShoCmp : Passes completed (Passes between 5 and 15 yards)
PasShoAtt : Passes attempted (Passes between 5 and 15 yards)
PasShoCmp% : Pass completion percentage (Passes between 5 and 15 yards)
PasMedCmp : Passes completed (Passes between 15 and 30 yards)
PasMedAtt : Passes attempted (Passes between 15 and 30 yards)
PasMedCmp% : Pass completion percentage (Passes between 15 and 30 yards)
PasLonCmp : Passes completed (Passes longer than 30 yards)
PasLonAtt : Passes attempted (Passes longer than 30 yards)
PasLonCmp% : Pass completion percentage (Passes longer than 30 yards)
Assists : Assists
PasAss : Passes that directly lead to a shot (assisted shots)
Pas3rd : Completed passes that enter the 1/3 of the pitch closest to the goal
PPA : Completed passes into the 18-yard box
CrsPA : Completed crosses into the 18-yard box
PasProg : Completed passes that move the ball towards the opponent’s goal at least 10 yards from its furthest point in the last six passes, or any completed pass into the penalty area
PasAtt : Passes attempted
PasLive : Live-ball passes
PasDead : Dead-ball passes
PasFK : Passes attempted from free kicks
TB : Completed pass sent between back defenders into open space
Sw : Passes that travel more than 40 yards of the width of the pitch
PasCrs : Crosses
TI : Throw-Ins taken
CK : Corner kicks
CkIn : Inswinging corner kicks
CkOut : Outswinging corner kicks
CkStr : Straight corner kicks
PasCmp : Passes completed
PasOff : Offsides
PasBlocks : Blocked by the opponent who was standing it the path
SCA : Shot-creating actions
ScaPassLive : Completed live-ball passes that lead to a shot attempt
ScaPassDead : Completed dead-ball passes that lead to a shot attempt
ScaDrib : Successful dribbles that lead to a shot attempt
ScaSh : Shots that lead to another shot attempt
ScaFld : Fouls drawn that lead to a shot attempt
ScaDef : Defensive actions that lead to a shot attempt
GCA : Goal-creating actions
GcaPassLive : Completed live-ball passes that lead to a goal
GcaPassDead : Completed dead-ball passes that lead to a goal
GcaDrib : Successful dribbles that lead to a goal
GcaSh : Shots that lead to another goal-scoring shot
GcaFld : Fouls drawn that lead to a goal
GcaDef : Defensive actions that lead to a goal
Tkl : Number of players tackled
TklWon : Tackles in which the tackler’s team won possession of the ball
TklDef3rd : Tackles in defensive 1/3
TklMid3rd : Tackles in middle 1/3
TklAtt3rd : Tackles in attacking 1/3
TklDri : Number of dribblers tackled
TklDriAtt : Number of times dribbled past plus number of tackles
TklDri% : Percentage of dribblers tackled
TklDriPast : Number of times dribbled past by an opposing player
Blocks : Number of times blocking the ball by standing in its path
BlkSh : Number of times blocking a shot by standing in its path
BlkPass : Number of times blocking a pass by standing in its path
Int : Interceptions
Tkl+Int : Number of players tackled plus number of interceptions
Clr : Clearances
Err : Mistakes leading to an opponent’s shot
Touches : Number of times a player touched the ball. Note: Receiving a pass, then dribbling, then sending a pass counts as one touch
TouDefPen : Touches in defensive penalty area
TouDef3rd : Touches in defensive 1/3
TouMid3rd : Touches in middle 1/3
TouAtt3rd : Touches in attacking 1/3
TouAttPen : Touches in attacking penalty area
TouLive : Live-ball touches. Does not include corner kicks, free kicks, throw-ins, kick-offs, goal kicks or penalty kicks.
ToAtt : Number of attempts to take on defenders while dribbling
ToSuc : Number of defenders taken on successfully, by dribbling past them
ToSuc% : Percentage of take-ons Completed Successfully
ToTkl : Number of times tackled by a defender during a take-on attempt
ToTkl% : Percentage of time tackled by a defender during a take-on attempt
Carries : Number of times the player controlled the ball with their feet
CarTotDist : Total distance, in yards, a player moved the ball while controlling it with their feet, in any direction
CarPrgDist : Total distance, in yards, a player moved the ball while controlling it with their feet towards the opponent’s goal
CarProg : Carries that move the ball towards the opponent’s goal at least 5 yards, or any carry into the penalty area
Car3rd : Carries that enter the 1/3 of the pitch closest to the goal
CPA : Carries into the 18-yard box
CarMis : Number of times a player failed when attempting to gain control of a ball
CarDis : Number of times a player loses control of the ball after being tackled by an opposing player
Rec : Number of times a player successfully received a pass
RecProg : Completed passes that move the ball towards the opponent’s goal at least 10 yards from its furthest point in the last six passes, or any completed pass into the penalty area
CrdY : Yellow cards
CrdR : Red cards
2CrdY : Second yellow card
Fls : Fouls committed
Fld : Fouls drawn
Off : Offsides
Crs : Crosses
TklW : Tackles in which the tackler’s team won possession of the ball
PKwon : Penalty kicks won
PKcon : Penalty kicks conceded
OG : Own goals
Recov : Number of loose balls recovered
AerWon : Aerials won
AerLost : Aerials lost
AerWon% : Percentage of aerials won

The variable that we want to predict is in the Pos column. However, a quick glance shows that some players have more than one position defined.

Pandas
PySpark

df["Pos"].unique().tolist()

['MFFW', 'DF', 'MF', 'FWMF', 'FW', 'DFFW', 'MFDF', 'GK', 'DFMF', 'FWDF']

df.select("Pos").distinct().show()

Some players play as midfielder (MF) but also occasionally as a defender (DF), denoted by MFDF. And what about players that have not played many matches? On to some feature engineering!

Step 3: Feature engineering

First thing we will do is filter the players that have played at least 5 matches.

Pandas
PySpark

df = df[df["MP"] >= 5]
df.shape

(2066, 124)

df = df.filter(df["MP"] >= 5)
df.count()

We still have 2066 observations after filtering, which is fine. Second thing to do is get rid of multiple positions for one player.

Pandas
PySpark

df["Pos"].apply(len).unique().tolist()

[4, 2]

from pyspark.sql.functions import length

df.withColumn("PosMod", substring(col("Pos"), 1, 2)).select("PosMod").distinct().collect()

The output tells us that Pos has either length 2 or 4. For the sake of simplicity we will use the first two letters and define that as the position of the player.

Pandas
PySpark

df.loc[:, "PosMod"] = df["Pos"].str[:2]
df["PosMod"].unique().tolist()

['MF', 'DF', 'FW', 'GK']

from pyspark.sql.functions import col

df = df.withColumn("PosMod", substring(col("Pos"), 1, 2))
df.select("PosMod").distinct().show()

4 positions, perfect! The final thing to do with the features is encode the positions into numerical values.

Scikit-learn
PySpark

from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
df["Class"] = label_encoder.fit_transform(df["PosMod"])
label_encoder.classes_.tolist()

['DF', 'FW', 'GK', 'MF']

from pyspark.ml.feature import StringIndexer

indexer = StringIndexer(inputCol="PosMod", outputCol="Class")
df = indexer.fit(df).transform(df)

The encoding is as follows

Position (Class)	Encoding
DF = Defender	0
FW = Forward	1
GK = Goalkeeper	2
MF = Midfielder	3

It’s extremely frustrating and it grinds my gears, but I don’t think there’s a way to insert a custom mapping. I would have preferred that the order would be GK - DF - MF - FW. Makes sense right?

What remains is to define the columns to use and split the data for training and testing. With some trial and error I found a selection that leads to a decent model. Remember, the interface should be user-friendly which means that I do not want to many input variables.

Scikit-learn
PySpark

cols = ["Shots", "PasMedAtt", "Pas3rd", "Clr"]
X = df[cols]
y = df["Class"]

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=10)

cols = ["Shots", "PasMedAtt", "Pas3rd", "Clr", "Class"]
train, test = df.select(cols).randomSplit([0.7, 0.3], seed=10)

Column	Description
Shots	Shots total (Does not include penalty kicks) per 90 minutes
PassMedAtt	Passes attempted (Passes between 15 and 30 yards) per 90 minutes
Pas3rd	Completed passes that enter the 1/3 of the pitch closest to the goal per 90 minutes
Clr	Clearances per 90 minutes

In another post I would like to revise this selection with grid optimisation. For the current post the selection suffices.

Step 4: Training the model

Before training the model, we should scale the columns as players usually have many more passes attempted than for example shots or clearances.

Scikit-learn
PySpark

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

from pyspark.ml.feature import StandardScaler, VectorAssembler

assembler = VectorAssembler(inputCols=cols, outputCol="features_to_scale")
train = assembler.transform(train)
test = assembler.transform(test)

scaler = StandardScaler(inputCol="features_to_scale", outputCol="scaled_features")
scaler_model = scaler.fit(train)
train_scaled = scaler_model.transform(train)
test_scaled = scaler_model.transform(test)

Now comes the part where we integrate MLflow. Again, although MLflow allows us to evaluate iterations (runs) of models, for now we will use it to log a .pkl file which we can load into our Streamlit app. We will be using a Random Forest classifier.

MLflow autologging in Databricks

For PySpark in Databricks, we do not need to explicitly call the MLflow functions. Databricks has MLflow autologging enabled by default on ML clusters, which we are using.

Scikit-learn
PySpark

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, recall_score
import mlflow.sklearn

with mlflow.start_run():
    
    mlflow.sklearn.log_model(scaler, "rf_scaler")
    
    # Train Random Forest classifier
    rf = RandomForestClassifier(random_state=10)
    rf.fit(X_train_scaled, y_train)

    # Predict on test set
    y_pred = rf.predict(X_test_scaled)

    # Evaluate the model
    f1 = f1_score(y_test, y_pred, average="weighted")
    accuracy = accuracy_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred, average="weighted")

    # Log metrics to MLflow
    mlflow.log_metric("f1_score", f1)
    mlflow.log_metric("accuracy", accuracy)
    mlflow.log_metric("recall", recall)

    # Log final model to MLflow
    mlflow.sklearn.log_model(rf, "rf_model")

print(f"F1 Score: {f1}")
print(f"Accuracy: {accuracy}")
print(f"Recall: {recall}")

F1 Score: 0.8171709415334705
Accuracy: 0.817741935483871
Recall: 0.817741935483871

from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.classification import RandomForestClassifier

rf = RandomForestClassifier(featuresCol="scaled_features", labelCol="Class")
rf_model = rf.fit(train_scaled)
rf_results = rf_model.transform(test_scaled)


f1 = MulticlassClassificationEvaluator(labelCol="Class")
accuracy = MulticlassClassificationEvaluator(
    labelCol="Class", metricName="accuracy")
recall = MulticlassClassificationEvaluator(
    labelCol="Class", metricName="recallByLabel")

f1.evaluate(rf_results)
accuracy.evaluate(rf_results)
recall.evaluate(rf_results)

These metrics seem fine!¹ Not something to write home about if you’d ask me, but good enough. It could definitely be improved upon, but again: not the scope of this post. The model and the corresponding scaler are saved under a local directory /mlruns/0/<run_id>/artifacts/. We extract the corresponding .pkl files and put them in the top-level, where the other .py files reside as well.

Continuing with scikit-learn results

As I am using the Databricks Community version, I found it very hard (or perhaps it is impossible) to extract the files to a local machine. Therefore we continue with scikit-learn results and code output.

Step 5: Creating a Streamlit app

In other projects I have fiddled around with Streamlit and I really enjoy its simplicity. It allows for quick deployment and has a decent default interface. Rather than building the app.py piece by piece, I will provide the file and explain parts of the code that are not self-explanatory.

app.py (1/2)

import pickle
import pandas as pd
import streamlit as st

# Load MLflow model and scaler
with open("rf_model.pkl", "rb") as f:
    model = pickle.load(f)

with open("rf_scaler.pkl", "rb") as f:
    scaler = pickle.load(f)

# Load data
df = pd.read_csv("2022_2023_Football_Player_Stats.csv", sep=";", encoding="latin1")

# Positions
positions = {0: "Defender", 1: "Forward", 2: "Goalkeeper", 3: "Midfielder"}


# Helper function to preprocess input data
def preprocess_input(data):
    # Scale data
    data = scaler.transform(data)
    
    return data


# Function to predict using the model
def predict(data):
    # Preprocess input data
    X = preprocess_input(data)
    
    # Make predictions
    predictions = model.predict(X)
    
    return predictions


# Function to compare input with existing players
def compare(data, n):
    cols = ["Shots", "PasMedAtt", "Pas3rd", "Clr"]
    mse = ((df[["Shots", "PasMedAtt", "Pas3rd", "Clr"]] - data.iloc[0]) ** 2).mean(axis=1)
    
    idxs = mse.nsmallest(n).index
    return df.loc[idxs][["Player", "Age", "Pos", "Squad", "Comp"] + cols]


# Function to format a line of player information
def format_player(row):
    return f"*{row['Player']}*, a {row['Age']} year old {row['Pos']} playing for {row['Squad']} in {row['Comp']}."


# Function that provides text for players similar to input
def text_similar_players(data):
    data = data.reset_index()

    text = 'The players closest to your input are:\n'
    for index, row in data.iterrows():
        player_info = format_player(row)
        text += f"{index + 1}. {player_info}\n"
    
    return text

# ...

1: Compare user input with all players in the dataset. Use the MSE as the metric to minimise and find the n closest/most similar players with respect to input statistics.
2: Provide a list of n players that are most similar to user input.

Honestly not very much to explain.

app.py (1/2)

# Streamlit UI
def main():
    st.set_page_config(page_title="Classifying Footballers", layout="wide")
    st.markdown("""
        <style>
            .reportview-container {
                margin-top: -2em;
            }
            #MainMenu {visibility: hidden;}
            .stDeployButton {display:none;}
            footer {visibility: hidden;}
            #stDecoration {display:none;}
        </style>
    """, unsafe_allow_html=True)
    
    st.title("Predicting Football player positions based on 2022/23 stats")
    
    st.sidebar.success(f"Visit [blog.panliyong.nl](https://blog.panliyong.nl/posts/006_football) for the post!")
    
    # Create input form
    st.sidebar.header("Input Features")
    shots = st.sidebar.number_input("Shots", min_value=0.0, max_value=10.0, step=0.1, value=1.0)
    pas_med_att = st.sidebar.number_input("Passes Medium Att", min_value=0.0, max_value=70.0, step=1.0, value=20.0)
    pas_3rd = st.sidebar.number_input("3rd Passes", min_value=0.0, max_value=20.0, step=1.0, value=3.0)
    clr = st.sidebar.number_input("Clearances", min_value=0.0, max_value=15.0, step=0.1, value=3.0)
    
    # Create column layout
    left_column, right_column = st.columns(2)
    
    with left_column:
        # Players compare
        st.markdown("### Find similar players")
        n_players = st.slider("How many similar players do you want to find?", min_value=1, max_value=20, step=1,
                              value=5)
    
    # Create prediction button
    if st.sidebar.button("Predict"):
        # Create a dataframe from the input
        input_data = pd.DataFrame({
            "Shots": [shots],
            "PasMedAtt": [pas_med_att],
            "Pas3rd": [pas_3rd],
            "Clr": [clr]
        })
        
        # Predict
        prediction = predict(input_data)
        
        # Display prediction
        position = positions.get(prediction[0])
        position_styled = f"<span style='color:purple'>{position}</span>"
        
        st.sidebar.info(f"Predicted Position: {position}")
        
        # Check which player is closest to input
        players = compare(input_data, n=n_players)
        
        text = text_similar_players(players)
        
        with left_column:
            st.info(text)
        
        with right_column:
            st.markdown(f"# **Predicted a** {position_styled}",
                        unsafe_allow_html=True)
            st.markdown("## Data of similar players:")
            st.dataframe(players.reset_index(drop=True))


# Run the app
if __name__ == "__main__":
    main()

1: Remove developer settings for Streamlit.
2: Define user input interactivity.

We can try and see if the Streamlit app runs with the following command

Terminal

streamlit run app.py

This is what the app looks like:

Now it’s time to prepare it for deployment!

Step 6: Make a Dockerfile

A few things to keep in mind when writing this Dockerfile is that the requirements.txt and the other environment files stored by MLflow have many irrelevant dependencies. The only things we need are pandas, streamlit and scikit-learn. That makes building the Dockerfile quicker as there are less dependencies that have to be installed.

However, another important factor is that all the time I have been developing on a Windows (64-bit) machine. The Raspberry Pi which the app is going to run on is a Linux (arm64) machine. As these platforms are not the same, we cannot simply use docker build commands and expect them to work on the Linux device.

Let’s first consider our Dockerfile:

FROM python:3.10-slim

WORKDIR /app

COPY app.py .
COPY 2022_2023_Football_Player_Stats.csv .
COPY rf_model.pkl .
COPY rf_scaler.pkl .

RUN pip install streamlit pandas scikit-learn==1.4.0

EXPOSE 8501
CMD ["streamlit", "run", "app.py", "--server.address=0.0.0.0"]

The default port of Streamlit apps is 8501 and we did not change that, so we have to expose this port. Furthermore only the .pkl files and app.py and .csv file are all the files we need. Then we can install the libraries we need to run the Streamlit app.

Now comes the part where we build the image and push it to Docker Hub under panliyong/football-positions. Up for grabs on the Raspberry Pi!

Terminal

docker buildx build --platform linux/arm64 -t panliyong/football-positions -f Dockerfile --push .

Step 7: Deploying on Raspberry Pi

The image is now on Docker Hub and we can pull the image now. Let’s name the container football-positions and map port 8501 of the Pi to the exposed port. I have also had issues with containers running out of memory, so in order to keep this container running we can update the restart policy to always restart the container.²

Terminal

docker pull panliyong/football-positions
docker run -d --name football-positions -p 8501:8501 panliyong/football-positions
docker update --restart=always football-positions

My domain panliyong.nl and subdomains are managed through Cloudflare and are hosted on the Apache web server on the same Pi. We can serve this on football-positions.panliyong.nl for the app. If we go to the DNS settings and add a CNAME record with name football-positions and target panliyong.nl, we enable this new subdomain.

We need to add a virtual host football-positions.conf and enable it.

football-positions.conf

<VirtualHost *:80>
    ServerName football-positions.panliyong.nl
    ServerAlias www.football-positions.panliyong.nl

    ProxyPass / http://localhost:8501/
    ProxyPassReverse / http://localhost:8501/

    # This was the fix mentioned below!
    <Location "/_stcore/stream">
        ProxyPass               ws://localhost:8501/_stcore/stream
        ProxyPassReverse        ws://localhost:8501/_stcore/stream
    </Location>


    # ...
</VirtualHost>

<IfModule mod_ssl.c>
    <VirtualHost *:443>
        ServerAlias www.football-positions.panliyong.nl
        ServerName football-positions.panliyong.nl

        SSLProxyEngine on
        ProxyPass / http://localhost:8501/
        ProxyPassReverse / http://localhost:8501/

        # ...

        # Redirect HTTP to HTTPS
        RewriteEngine On
        RewriteCond %{HTTPS} off
        RewriteRule ^ https://%{HTTP_HOST}%{REQUEST_URI} [L,R=301]
    </VirtualHost>
</IfModule>

Enable it with these commands:

sudo a2ensite football-positions.conf
sudo systemctl reload apache2

If everything is correctly set up the app should be accessible through football-positions.panliyong.nl.

Inaccessible due to Streamlit updates

It appears that the app is not loading on the Apache web server due to websocket errors and some API changes. I’ll be looking for a fix…

(2024-04-22): The app is hosted on football-positions.streamlit.app for the time being!

(2024-04-29): I have found the fix! The app is hosted on football-positions.panliyong.nl!

TL;DR

It’s not as difficult as it sounds to prepare a model for deployment. Due to unforeseen changes in the streamlit==1.33 package, the app did not load due to websocket erros. We will be looking back on this soon!

(2024-04-22): The app is hosted on football-positions.streamlit.app for the time being!

(2024-04-29): I have found the fix! The app is hosted on football-positions.panliyong.nl!

That’s all for today, thanks for reading!

Footnotes

Results of scikit-learn and PySpark can differ, but I am talking about the scikit-learn metrics.↩︎
This could also be done in the initial run command, but I forgot and was too lazy to delete it…↩︎

--- title: "Deploying an MLflow Model with Streamlit: predicting footballer positions" date: "2024-04-18" categories: [deployment, mlflow, classification, machine learning] description: "GK, DF, MF, or FW?" code-fold: false code-annotations: hover --- In the past I have done quite a few classification problems, but most of them were a code-along or done with trivial datasets that nearly every data scientist has done. With the expanded responsibilities as a data scientist, which already are broadly defined across the industry, I wanted to write about deploying a model with an interface. It is not uncommon to deploy models by serving them as a REST API, but I would only do that for models that could be used at larges scale. We will be using *MLflow* to log our model, both *PySpark* (on Databricks) and the *scikit-learn* framework to train a model, *Streamlit* for the interface of the app, and deployment using *Docker* on my Raspberry Pi. Because we use an interface and do not serve this as an API, some additional and or different steps need to be taken compared to only serving it using *FastAPI* or *Flask* on a remote server. Today we will train and deploy a model that can predict based on footballing stats whether a player is a goalkeeper, defender, midfielder or a forward. I have always had the idea to incorporate football into these types of projects and found this project to be the perfect use case. Our aim in this post is to create a model decent enough to deploy, but ot does not necessarily have to be groundbreaking across all metrics. Let's do it! ## Step 1: Finding a dataset *Kaggle* is the perfect platform to find for these niche datasets. What is Kaggle? > Kaggle is the world's largest data science community with powerful tools and resources to help you achieve your data science goals. It also has various user-contributed datasets. The one we will be using today is [2023-2023 Football Player Stats](https://www.kaggle.com/datasets/vivovinco/20222023-football-player-stats). ::: callout-note There is also an API to download datasets from Kaggle, but it is a little overkill for a small project like this. It would make more sense if this dataset was continuously updated, but it will not be as it only concerns 2022-2023 season data. ::: ## Step 2: Exploring context of the data > This dataset contains 2022-2023 football player stats per 90 minutes.\ > > Only players of Premier League, Ligue 1, Bundesliga, Serie A and La Liga are listed. These competitions are usually denoted as the Top 5 European leagues, which is based on the [UEFA rankings](https://www.uefa.com/nationalassociations/uefarankings/country/?year=2023). Comparing a player that has played 90 minutes and has attempted 50 passes to a player that has played 45 minutes and also attempted 50 passes would be unfair. Let's read the data ::: panel-tabset ## Pandas ```{python} import pandas as pd df = pd.read_csv("2022_2023_Football_Player_Stats.csv", sep=";", encoding="latin-1") df.head() ``` ## PySpark ```{python} #| eval: false from pyspark.sql import SparkSession spark = SparkSession.builder.appName("football").getOrCreate() df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("dbfs:/FileStore/tables/2022_2023_Football_Player_Stats.csv", sep=";") df.head() ``` ::: The dataset contains 124 columns, and we will not use all of them for the model. ::: {.callout-note collapse="true"} ## Column descriptions - Rk : Rank - Player : Player's name - Nation : Player's nation - Pos : Position - Squad : Squad’s name - Comp : League that squat occupies - Age : Player's age - Born : Year of birth - MP : Matches played - Starts : Matches started - Min : Minutes played - 90s : Minutes played divided by 90 - Goals : Goals scored or allowed - Shots : Shots total (Does not include penalty kicks) - SoT : Shots on target (Does not include penalty kicks) - SoT% : Shots on target percentage (Does not include penalty kicks) - G/Sh : Goals per shot - G/SoT : Goals per shot on target (Does not include penalty kicks) - ShoDist : Average distance, in yards, from goal of all shots taken (Does not include penalty kicks) - ShoFK : Shots from free kicks - ShoPK : Penalty kicks made - PKatt : Penalty kicks attempted - PasTotCmp : Passes completed - PasTotAtt : Passes attempted - PasTotCmp% : Pass completion percentage - PasTotDist : Total distance, in yards, that completed passes have traveled in any direction - PasTotPrgDist : Total distance, in yards, that completed passes have traveled towards the opponent's goal - PasShoCmp : Passes completed (Passes between 5 and 15 yards) - PasShoAtt : Passes attempted (Passes between 5 and 15 yards) - PasShoCmp% : Pass completion percentage (Passes between 5 and 15 yards) - PasMedCmp : Passes completed (Passes between 15 and 30 yards) - PasMedAtt : Passes attempted (Passes between 15 and 30 yards) - PasMedCmp% : Pass completion percentage (Passes between 15 and 30 yards) - PasLonCmp : Passes completed (Passes longer than 30 yards) - PasLonAtt : Passes attempted (Passes longer than 30 yards) - PasLonCmp% : Pass completion percentage (Passes longer than 30 yards) - Assists : Assists - PasAss : Passes that directly lead to a shot (assisted shots) - Pas3rd : Completed passes that enter the 1/3 of the pitch closest to the goal - PPA : Completed passes into the 18-yard box - CrsPA : Completed crosses into the 18-yard box - PasProg : Completed passes that move the ball towards the opponent's goal at least 10 yards from its furthest point in the last six passes, or any completed pass into the penalty area - PasAtt : Passes attempted - PasLive : Live-ball passes - PasDead : Dead-ball passes - PasFK : Passes attempted from free kicks - TB : Completed pass sent between back defenders into open space - Sw : Passes that travel more than 40 yards of the width of the pitch - PasCrs : Crosses - TI : Throw-Ins taken - CK : Corner kicks - CkIn : Inswinging corner kicks - CkOut : Outswinging corner kicks - CkStr : Straight corner kicks - PasCmp : Passes completed - PasOff : Offsides - PasBlocks : Blocked by the opponent who was standing it the path - SCA : Shot-creating actions - ScaPassLive : Completed live-ball passes that lead to a shot attempt - ScaPassDead : Completed dead-ball passes that lead to a shot attempt - ScaDrib : Successful dribbles that lead to a shot attempt - ScaSh : Shots that lead to another shot attempt - ScaFld : Fouls drawn that lead to a shot attempt - ScaDef : Defensive actions that lead to a shot attempt - GCA : Goal-creating actions - GcaPassLive : Completed live-ball passes that lead to a goal - GcaPassDead : Completed dead-ball passes that lead to a goal - GcaDrib : Successful dribbles that lead to a goal - GcaSh : Shots that lead to another goal-scoring shot - GcaFld : Fouls drawn that lead to a goal - GcaDef : Defensive actions that lead to a goal - Tkl : Number of players tackled - TklWon : Tackles in which the tackler's team won possession of the ball - TklDef3rd : Tackles in defensive 1/3 - TklMid3rd : Tackles in middle 1/3 - TklAtt3rd : Tackles in attacking 1/3 - TklDri : Number of dribblers tackled - TklDriAtt : Number of times dribbled past plus number of tackles - TklDri% : Percentage of dribblers tackled - TklDriPast : Number of times dribbled past by an opposing player - Blocks : Number of times blocking the ball by standing in its path - BlkSh : Number of times blocking a shot by standing in its path - BlkPass : Number of times blocking a pass by standing in its path - Int : Interceptions - Tkl+Int : Number of players tackled plus number of interceptions - Clr : Clearances - Err : Mistakes leading to an opponent's shot - Touches : Number of times a player touched the ball. Note: Receiving a pass, then dribbling, then sending a pass counts as one touch - TouDefPen : Touches in defensive penalty area - TouDef3rd : Touches in defensive 1/3 - TouMid3rd : Touches in middle 1/3 - TouAtt3rd : Touches in attacking 1/3 - TouAttPen : Touches in attacking penalty area - TouLive : Live-ball touches. Does not include corner kicks, free kicks, throw-ins, kick-offs, goal kicks or penalty kicks. - ToAtt : Number of attempts to take on defenders while dribbling - ToSuc : Number of defenders taken on successfully, by dribbling past them - ToSuc% : Percentage of take-ons Completed Successfully - ToTkl : Number of times tackled by a defender during a take-on attempt - ToTkl% : Percentage of time tackled by a defender during a take-on attempt - Carries : Number of times the player controlled the ball with their feet - CarTotDist : Total distance, in yards, a player moved the ball while controlling it with their feet, in any direction - CarPrgDist : Total distance, in yards, a player moved the ball while controlling it with their feet towards the opponent's goal - CarProg : Carries that move the ball towards the opponent's goal at least 5 yards, or any carry into the penalty area - Car3rd : Carries that enter the 1/3 of the pitch closest to the goal - CPA : Carries into the 18-yard box - CarMis : Number of times a player failed when attempting to gain control of a ball - CarDis : Number of times a player loses control of the ball after being tackled by an opposing player - Rec : Number of times a player successfully received a pass - RecProg : Completed passes that move the ball towards the opponent's goal at least 10 yards from its furthest point in the last six passes, or any completed pass into the penalty area - CrdY : Yellow cards - CrdR : Red cards - 2CrdY : Second yellow card - Fls : Fouls committed - Fld : Fouls drawn - Off : Offsides - Crs : Crosses - TklW : Tackles in which the tackler's team won possession of the ball - PKwon : Penalty kicks won - PKcon : Penalty kicks conceded - OG : Own goals - Recov : Number of loose balls recovered - AerWon : Aerials won - AerLost : Aerials lost - AerWon% : Percentage of aerials won ::: The variable that we want to predict is in the `Pos` column. However, a quick glance shows that some players have more than one position defined. ::: panel-tabset ## Pandas ```{python} df["Pos"].unique().tolist() ``` ## PySpark ```{python} #| eval: false df.select("Pos").distinct().show() ``` ::: Some players play as midfielder (`MF`) but also occasionally as a defender (`DF`), denoted by `MFDF`. And what about players that have not played many matches? On to some feature engineering! ## Step 3: Feature engineering First thing we will do is filter the players that have played at least 5 matches. ::: panel-tabset ## Pandas ```{python} df = df[df["MP"] >= 5] df.shape ``` ## PySpark ```{python} #| eval: false df = df.filter(df["MP"] >= 5) df.count() ``` ::: We still have 2066 observations after filtering, which is fine. Second thing to do is get rid of multiple positions for one player. ::: panel-tabset ## Pandas ```{python} df["Pos"].apply(len).unique().tolist() ``` ## PySpark ```{python} #| eval: false from pyspark.sql.functions import length df.withColumn("PosMod", substring(col("Pos"), 1, 2)).select("PosMod").distinct().collect() ``` ::: The output tells us that `Pos` has either length 2 or 4. For the sake of simplicity we will use the first two letters and define that as the position of the player. ::: panel-tabset ## Pandas ```{python} df.loc[:, "PosMod"] = df["Pos"].str[:2] df["PosMod"].unique().tolist() ``` ## PySpark ```{python} #| eval: false from pyspark.sql.functions import col df = df.withColumn("PosMod", substring(col("Pos"), 1, 2)) df.select("PosMod").distinct().show() ``` ::: 4 positions, perfect! The final thing to do with the features is encode the positions into numerical values. ::: panel-tabset ## Scikit-learn ```{python} from sklearn.preprocessing import LabelEncoder label_encoder = LabelEncoder() df["Class"] = label_encoder.fit_transform(df["PosMod"]) label_encoder.classes_.tolist() ``` ## PySpark ```{python} #| eval: false from pyspark.ml.feature import StringIndexer indexer = StringIndexer(inputCol="PosMod", outputCol="Class") df = indexer.fit(df).transform(df) ``` ::: The encoding is as follows | Position (Class) | Encoding | |------------------|----------| | DF = Defender | 0 | | FW = Forward | 1 | | GK = Goalkeeper | 2 | | MF = Midfielder | 3 | It's extremely frustrating and it grinds my gears, but I don't think there's a way to insert a custom mapping. I would have preferred that the order would be GK - DF - MF - FW. Makes sense right? What remains is to define the columns to use and split the data for training and testing. With some trial and error I found a selection that leads to a decent model. Remember, the interface should be user-friendly which means that I do not want to many input variables. ::: panel-tabset ## Scikit-learn ```{python} cols = ["Shots", "PasMedAtt", "Pas3rd", "Clr"] X = df[cols] y = df["Class"] from sklearn.model_selection import train_test_split X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=10) ``` ## PySpark ```{python} #| eval: false cols = ["Shots", "PasMedAtt", "Pas3rd", "Clr", "Class"] train, test = df.select(cols).randomSplit([0.7, 0.3], seed=10) ``` ::: | Column | Description | |-------------------------|-----------------------------------------------| | Shots | Shots total (Does not include penalty kicks) per 90 minutes | | PassMedAtt | Passes attempted (Passes between 15 and 30 yards) per 90 minutes | | Pas3rd | Completed passes that enter the 1/3 of the pitch closest to the goal per 90 minutes | | Clr | Clearances per 90 minutes | In another post I would like to revise this selection with grid optimisation. For the current post the selection suffices. ## Step 4: Training the model Before training the model, we should scale the columns as players usually have many more passes attempted than for example shots or clearances. ::: panel-tabset ## Scikit-learn ```{python} from sklearn.preprocessing import StandardScaler scaler = StandardScaler() X_train_scaled = scaler.fit_transform(X_train) X_test_scaled = scaler.transform(X_test) ``` ## PySpark ```{python} #| eval: false from pyspark.ml.feature import StandardScaler, VectorAssembler assembler = VectorAssembler(inputCols=cols, outputCol="features_to_scale") train = assembler.transform(train) test = assembler.transform(test) scaler = StandardScaler(inputCol="features_to_scale", outputCol="scaled_features") scaler_model = scaler.fit(train) train_scaled = scaler_model.transform(train) test_scaled = scaler_model.transform(test) ``` ::: Now comes the part where we integrate MLflow. Again, although MLflow allows us to evaluate iterations (runs) of models, for now we will use it to log a `.pkl` file which we can load into our Streamlit app. We will be using a Random Forest classifier. ::: callout-note ### MLflow autologging in Databricks For PySpark in Databricks, we do not need to explicitly call the MLflow functions. Databricks has MLflow autologging enabled by default on ML clusters, which we are using. ::: ::: panel-tabset ## Scikit-learn ```{python} #| warning: false #| message: false from sklearn.ensemble import RandomForestClassifier from sklearn.metrics import accuracy_score, f1_score, recall_score import mlflow.sklearn with mlflow.start_run(): mlflow.sklearn.log_model(scaler, "rf_scaler") # Train Random Forest classifier rf = RandomForestClassifier(random_state=10) rf.fit(X_train_scaled, y_train) # Predict on test set y_pred = rf.predict(X_test_scaled) # Evaluate the model f1 = f1_score(y_test, y_pred, average="weighted") accuracy = accuracy_score(y_test, y_pred) recall = recall_score(y_test, y_pred, average="weighted") # Log metrics to MLflow mlflow.log_metric("f1_score", f1) mlflow.log_metric("accuracy", accuracy) mlflow.log_metric("recall", recall) # Log final model to MLflow mlflow.sklearn.log_model(rf, "rf_model") print(f"F1 Score: {f1}") print(f"Accuracy: {accuracy}") print(f"Recall: {recall}") ``` ## PySpark ```{python} #| eval: false from pyspark.ml.evaluation import MulticlassClassificationEvaluator from pyspark.ml.classification import RandomForestClassifier rf = RandomForestClassifier(featuresCol="scaled_features", labelCol="Class") rf_model = rf.fit(train_scaled) rf_results = rf_model.transform(test_scaled) f1 = MulticlassClassificationEvaluator(labelCol="Class") accuracy = MulticlassClassificationEvaluator( labelCol="Class", metricName="accuracy") recall = MulticlassClassificationEvaluator( labelCol="Class", metricName="recallByLabel") f1.evaluate(rf_results) accuracy.evaluate(rf_results) recall.evaluate(rf_results) ``` ::: These metrics seem fine![^1] Not something to write home about if you'd ask me, but good enough. It could definitely be improved upon, but again: not the scope of this post. The model and the corresponding scaler are saved under a local directory `/mlruns/0/<run_id>/artifacts/`. We extract the corresponding `.pkl` files and put them in the top-level, where the other `.py` files reside as well. [^1]: Results of scikit-learn and PySpark can differ, but I am talking about the scikit-learn metrics. ::: callout-note ## Continuing with scikit-learn results As I am using the Databricks Community version, I found it very hard (or perhaps it is impossible) to extract the files to a local machine. Therefore we continue with scikit-learn results and code output. ::: ## Step 5: Creating a Streamlit app In other projects I have fiddled around with Streamlit and I really enjoy its simplicity. It allows for quick deployment and has a decent default interface. Rather than building the `app.py` piece by piece, I will provide the file and explain parts of the code that are not self-explanatory. ``` {.python filename="app.py (1/2)"} import pickle import pandas as pd import streamlit as st # Load MLflow model and scaler with open("rf_model.pkl", "rb") as f: model = pickle.load(f) with open("rf_scaler.pkl", "rb") as f: scaler = pickle.load(f) # Load data df = pd.read_csv("2022_2023_Football_Player_Stats.csv", sep=";", encoding="latin1") # Positions positions = {0: "Defender", 1: "Forward", 2: "Goalkeeper", 3: "Midfielder"} # Helper function to preprocess input data def preprocess_input(data): # Scale data data = scaler.transform(data) return data # Function to predict using the model def predict(data): # Preprocess input data X = preprocess_input(data) # Make predictions predictions = model.predict(X) return predictions # Function to compare input with existing players # <1> def compare(data, n): # <1> cols = ["Shots", "PasMedAtt", "Pas3rd", "Clr"] # <1> mse = ((df[["Shots", "PasMedAtt", "Pas3rd", "Clr"]] - data.iloc[0]) ** 2).mean(axis=1) # <1> idxs = mse.nsmallest(n).index # <1> return df.loc[idxs][["Player", "Age", "Pos", "Squad", "Comp"] + cols] # <1> # Function to format a line of player information def format_player(row): return f"*{row['Player']}*, a {row['Age']} year old {row['Pos']} playing for {row['Squad']} in {row['Comp']}." # Function that provides text for players similar to input # <2> def text_similar_players(data): # <2> data = data.reset_index() # <2> text = 'The players closest to your input are:\n' # <2> for index, row in data.iterrows(): # <2> player_info = format_player(row) # <2> text += f"{index + 1}. {player_info}\n" # <2> return text # <2> # ... ``` 1. Compare user input with all players in the dataset. Use the MSE as the metric to minimise and find the *n* closest/most similar players with respect to input statistics. 2. Provide a list of *n* players that are most similar to user input. Honestly not very much to explain. ``` {.python filename="app.py (1/2)"} # Streamlit UI def main(): st.set_page_config(page_title="Classifying Footballers", layout="wide") st.markdown(""" # <1> <style> # <1> .reportview-container { # <1> margin-top: -2em; # <1> } # <1> #MainMenu {visibility: hidden;} # <1> .stDeployButton {display:none;} # <1> footer {visibility: hidden;} # <1> #stDecoration {display:none;} # <1> </style> # <1> """, unsafe_allow_html=True) # <1> st.title("Predicting Football player positions based on 2022/23 stats") st.sidebar.success(f"Visit [blog.panliyong.nl](https://blog.panliyong.nl/posts/006_football) for the post!") # Create input form # <2> st.sidebar.header("Input Features") # <2> shots = st.sidebar.number_input("Shots", min_value=0.0, max_value=10.0, step=0.1, value=1.0) # <2> pas_med_att = st.sidebar.number_input("Passes Medium Att", min_value=0.0, max_value=70.0, step=1.0, value=20.0) # <2> pas_3rd = st.sidebar.number_input("3rd Passes", min_value=0.0, max_value=20.0, step=1.0, value=3.0) # <2> clr = st.sidebar.number_input("Clearances", min_value=0.0, max_value=15.0, step=0.1, value=3.0) # <2> # Create column layout left_column, right_column = st.columns(2) with left_column: # Players compare st.markdown("### Find similar players") n_players = st.slider("How many similar players do you want to find?", min_value=1, max_value=20, step=1, value=5) # Create prediction button if st.sidebar.button("Predict"): # Create a dataframe from the input input_data = pd.DataFrame({ "Shots": [shots], "PasMedAtt": [pas_med_att], "Pas3rd": [pas_3rd], "Clr": [clr] }) # Predict prediction = predict(input_data) # Display prediction position = positions.get(prediction[0]) position_styled = f"<span style='color:purple'>{position}</span>" st.sidebar.info(f"Predicted Position: {position}") # Check which player is closest to input players = compare(input_data, n=n_players) text = text_similar_players(players) with left_column: st.info(text) with right_column: st.markdown(f"# **Predicted a** {position_styled}", unsafe_allow_html=True) st.markdown("## Data of similar players:") st.dataframe(players.reset_index(drop=True)) # Run the app if __name__ == "__main__": main() ``` 1. Remove developer settings for Streamlit. 2. Define user input interactivity. We can try and see if the Streamlit app runs with the following command ``` {.bash filename="Terminal"} streamlit run app.py ``` This is what the app looks like: ![](app.png) Now it's time to prepare it for deployment! ## Step 6: Make a Dockerfile A few things to keep in mind when writing this Dockerfile is that the `requirements.txt` and the other environment files stored by MLflow have many irrelevant dependencies. The only things we need are `pandas`, `streamlit` and `scikit-learn`. That makes building the Dockerfile quicker as there are less dependencies that have to be installed. However, another important factor is that all the time I have been developing on a Windows (64-bit) machine. The Raspberry Pi which the app is going to run on is a Linux (arm64) machine. As these platforms are not the same, we cannot simply use `docker build` commands and expect them to work on the Linux device. Let's first consider our Dockerfile: ``` dockerfile FROM python:3.10-slim WORKDIR /app COPY app.py . COPY 2022_2023_Football_Player_Stats.csv . COPY rf_model.pkl . COPY rf_scaler.pkl . RUN pip install streamlit pandas scikit-learn==1.4.0 EXPOSE 8501 CMD ["streamlit", "run", "app.py", "--server.address=0.0.0.0"] ``` The default port of Streamlit apps is 8501 and we did not change that, so we have to expose this port. Furthermore only the `.pkl` files and `app.py` and `.csv` file are all the files we need. Then we can install the libraries we need to run the Streamlit app. Now comes the part where we build the image and push it to Docker Hub under `panliyong/football-positions`. Up for grabs on the Raspberry Pi! ``` {.bash filename="Terminal"} docker buildx build --platform linux/arm64 -t panliyong/football-positions -f Dockerfile --push . ``` ## Step 7: Deploying on Raspberry Pi The image is now on Docker Hub and we can pull the image now. Let's name the container `football-positions` and map port 8501 of the Pi to the exposed port. I have also had issues with containers running out of memory, so in order to keep this container running we can update the restart policy to always restart the container.[^2] [^2]: This could also be done in the initial run command, but I forgot and was too lazy to delete it... ``` {.bash filename="Terminal"} docker pull panliyong/football-positions docker run -d --name football-positions -p 8501:8501 panliyong/football-positions docker update --restart=always football-positions ``` My domain `panliyong.nl` and subdomains are managed through [Cloudflare](https://www.cloudflare.com/) and are hosted on the Apache web server on the same Pi. We can serve this on [football-positions.panliyong.nl](https://football-positions.panliyong.nl) for the app. If we go to the DNS settings and add a `CNAME` record with name `football-positions` and target `panliyong.nl`, we enable this new subdomain. We need to add a virtual host `football-positions.conf` and enable it. ``` {.bash filename="football-positions.conf"} <VirtualHost *:80> ServerName football-positions.panliyong.nl ServerAlias www.football-positions.panliyong.nl ProxyPass / http://localhost:8501/ ProxyPassReverse / http://localhost:8501/ # This was the fix mentioned below! <Location "/_stcore/stream"> ProxyPass ws://localhost:8501/_stcore/stream ProxyPassReverse ws://localhost:8501/_stcore/stream </Location> # ... </VirtualHost> <IfModule mod_ssl.c> <VirtualHost *:443> ServerAlias www.football-positions.panliyong.nl ServerName football-positions.panliyong.nl SSLProxyEngine on ProxyPass / http://localhost:8501/ ProxyPassReverse / http://localhost:8501/ # ... # Redirect HTTP to HTTPS RewriteEngine On RewriteCond %{HTTPS} off RewriteRule ^ https://%{HTTP_HOST}%{REQUEST_URI} [L,R=301] </VirtualHost> </IfModule> ``` Enable it with these commands: ``` bash sudo a2ensite football-positions.conf sudo systemctl reload apache2 ``` If everything is correctly set up the app should be accessible through [football-positions.panliyong.nl](https://football-positions.panliyong.nl). ::: callout-caution ## Inaccessible due to Streamlit updates It appears that the app is not loading on the Apache web server due to websocket errors and some API changes. I'll be looking for a fix... **(2024-04-22):** The app is hosted on [football-positions.streamlit.app](https://football-positions.streamlit.app/) for the time being! **(2024-04-29):** I have found the fix! The app is hosted on [football-positions.panliyong.nl](https://football-positions.panliyong.nl/)! ::: ## TL;DR It's not as difficult as it sounds to prepare a model for deployment. Due to unforeseen changes in the `streamlit==1.33` package, the app did not load due to websocket erros. We will be looking back on this soon! ::: {.callout-caution appearance="simple"} **(2024-04-22):** The app is hosted on [football-positions.streamlit.app](https://football-positions.streamlit.app/) for the time being! **(2024-04-29):** I have found the fix! The app is hosted on [football-positions.panliyong.nl](https://football-positions.panliyong.nl/)! ::: That's all for today, thanks for reading!