Building a Football Data Pipeline: A Walkthrough

Bonisiwe Shabane

If you’re passionate about football and data, this Arsenal FC data pipeline project is an ideal way to practice and learn essential data engineering skills. In this Medium article, we will walk through the entire pipeline setup, from raw data extraction to insightful visualizations, using real-world technologies. This project is a comprehensive end-to-end data engineering pipeline designed to analyze historical performance data of Arsenal FC. It uses technologies and tools like Docker, Apache Spark, PostgreSQL, Apache Airflow, and Power BI for data extraction, transformation, orchestration, and visualization. Let’s dive into setting up this project on your local machine. Before getting started, you’ll need a few things installed.

Docker is essential here for setting up an isolated, consistent environment to run the necessary components like PostgreSQL, Apache Spark, and Airflow.

Do you want to dive into a unique data engineering project and get hands-on experience? I wrote a step-by-step guide for my project, an end-to-end data pipeline for Arsenal F.C. 🏟️⚽️, on how to build a complete data processing system to analyze football match data. You can read all about it in my Medium article here: https://lnkd.in/dvupR5SR. I cover everything from data ingestion to analysis, so you can learn how to create your own data pipelines.

Whether you’re a beginner or looking to expand your skills, this project is a great way to get started!

Prerequisites:
- Docker and Docker Compose
- Basic SQL
- PySpark

Tools you will learn and use:
- Infrastructure: Docker
- Data warehouse: PostgreSQL, with dimensional modelling
- Database: PostgreSQL
- Orchestration: Apache Airflow
- Data processing: Apache Spark (ETL)

Let’s learn together!

A reader asked: “Mashallah, engineer! I had a question: does it make a difference if I build the model in SQL on its own and then move the data over, or is it better to do it the way you did in this project, where the whole model is written...”
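To make the orchestration piece in the tool list concrete, here is a minimal sketch of what an Airflow DAG for a daily run could look like. The DAG id, task names, and script paths are my own illustrative assumptions, not taken from the article:

```python
# A minimal sketch of the Airflow orchestration layer. The DAG id, task names,
# and spark-submit paths are illustrative assumptions, not the project's actual code.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="arsenal_pipeline",          # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Run the PySpark transformation job
    transform = BashOperator(
        task_id="spark_transform",
        bash_command="spark-submit /opt/jobs/transform_matches.py",  # assumed path
    )

    # Load the transformed data into the PostgreSQL warehouse
    load = BashOperator(
        task_id="load_to_postgres",
        bash_command="python /opt/jobs/load_to_postgres.py",  # assumed path
    )

    transform >> load
```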

If you’re a developer or a data engineer, you’ve probably faced your fair share of technical challenges. Some are abstract puzzles, but every now and then, you get one that makes you think, “Hey, this could be a real project.” I recently came across one of those. A take-home assignment for a Data Engineer position that, instead of just being a test, became a source of inspiration. The task was clear yet ambitious: retrieve player information from the Transfermarkt website, store it efficiently, and make it accessible via an API for a football club’s daily operations.
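The assignment doesn’t prescribe a framework, but the “accessible via an API” part might look something like this minimal sketch; FastAPI, the table schema, and the connection string are all my assumptions:

```python
# A minimal sketch of the API layer: look up one player from PostgreSQL.
# The players table, its columns, and the DSN are illustrative assumptions.
import psycopg2
from fastapi import FastAPI, HTTPException

app = FastAPI()
DSN = "postgresql://user:pass@localhost:5432/football"  # assumed connection string

@app.get("/players/{player_id}")
def get_player(player_id: int):
    conn = psycopg2.connect(DSN)
    try:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT name, position, market_value FROM players WHERE id = %s",
                (player_id,),
            )
            row = cur.fetchone()
    finally:
        conn.close()
    if row is None:
        raise HTTPException(status_code=404, detail="Player not found")
    return {"name": row[0], "position": row[1], "market_value": row[2]}
```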

Instead of just ticking the boxes, I decided to embrace the challenge and build a complete, end-to-end project. Today, I want to walk you through that entire journey — the architecture, the code, the technical decisions, and most importantly, the lessons learned. (Pro-tip: If you’re ever looking for portfolio ideas, browse online for technical assignments. They are a fantastic source of real-world problems to solve.) Back in 2019, when I was about to graduate, I noticed a surge of people analyzing football statistics on Twitter and Reddit. Some insights were sharp, others way off—but it sparked my curiosity.

Either way, the field had piqued my interest back then, which eventually nudged me into working with data in general. With almost six years in the industry now, I’m channeling that curiosity into a passion project: building something as close to production-grade as possible, while keeping it local, open-source, and free. Here is an in-progress sketch of the basic architecture for the potential solution. In short, we’ll scrape football data with Python, store it in Postgres, and later use it for dashboards and chatbots; a minimal sketch of that ingestion step follows.
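The URL, table markup, and standings schema below are placeholders of my own; the real project’s sources and schema may differ:

```python
# Scrape a stats table and bulk-insert the rows into PostgreSQL.
import psycopg2
import requests
from bs4 import BeautifulSoup

URL = "https://example.com/league/standings"  # placeholder source page

resp = requests.get(URL, headers={"User-Agent": "Mozilla/5.0"}, timeout=30)
resp.raise_for_status()

# Pull (team, played, points) out of the first table on the page
soup = BeautifulSoup(resp.text, "html.parser")
rows = []
for tr in soup.select("table tbody tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if len(cells) >= 3:
        rows.append(cells[:3])

# "with conn" commits the transaction on success
conn = psycopg2.connect("postgresql://user:pass@localhost:5432/football")  # assumed DSN
with conn, conn.cursor() as cur:
    cur.executemany(
        "INSERT INTO standings (team, played, points) VALUES (%s, %s, %s)",
        rows,
    )
conn.close()
```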

For now, though, the main focus will be on fetching/scraping data from the web and pushing it into PostgreSQL.

Football analytics has evolved dramatically over the past decade. What was once a domain dominated by simple statistics like goals and assists has expanded to include sophisticated metrics that measure everything from expected goals to progressive carries. As a data scientist and football enthusiast, I wanted to leverage these advanced metrics to solve a specific challenge: identifying the ideal players for Ruben Amorim’s distinctive 5–2–3 system. In this technical deep-dive, I’ll explain the architecture of the data pipeline I built for this analysis, the algorithms behind the position-specific scoring models, and how I transformed raw statistics into intuitive visualizations. If you’re interested in the footballing insights resulting from this analysis, check out my companion article, “The Perfect XI: Using Data Science to Build Amorim’s Ideal Squad”.
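The article’s exact scoring models aren’t reproduced here, but the general shape of a position-specific score (a weighted sum of normalized metrics) can be sketched as follows; the metric names, weights, and sample numbers are invented for illustration:

```python
# Generic sketch: score players by a weighted sum of z-score-normalized metrics.
# Metric names and weights below are illustrative, not the article's actual model.
import pandas as pd

# Hypothetical weights for a wing-back profile in a back-five system
WEIGHTS = {"progressive_carries": 0.40, "xg_assisted": 0.35, "tackles_won": 0.25}

def position_score(df: pd.DataFrame, weights: dict) -> pd.Series:
    """Weighted sum of per-metric z-scores; higher means a better profile fit."""
    score = pd.Series(0.0, index=df.index)
    for metric, weight in weights.items():
        z = (df[metric] - df[metric].mean()) / df[metric].std()
        score += weight * z
    return score

players = pd.DataFrame(
    {
        "progressive_carries": [120, 85, 60],
        "xg_assisted": [4.2, 6.1, 2.3],
        "tackles_won": [55, 40, 70],
    },
    index=["Player A", "Player B", "Player C"],
)
players["wingback_score"] = position_score(players, WEIGHTS)
print(players.sort_values("wingback_score", ascending=False))
```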

The pipeline consists of several interconnected components. The first challenge was obtaining comprehensive player data. I utilized a football data API to fetch statistics for players across Europe’s top five leagues over three seasons (2021–2022, 2022–2023, and 2023–2024). The data collection follows a hierarchical pattern, sketched below.
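The article doesn’t name the API, so here is the shape of that hierarchical loop (leagues, then teams, then players, then per-season stats) with a hypothetical endpoint layout:

```python
# Hierarchical collection sketch: leagues -> teams -> players -> season stats.
# The base URL and endpoint paths are hypothetical stand-ins for the real API.
import requests

BASE = "https://api.example.com/v1"  # placeholder base URL
SEASONS = ["2021-2022", "2022-2023", "2023-2024"]

def fetch_json(path: str) -> dict:
    resp = requests.get(f"{BASE}/{path}", timeout=30)
    resp.raise_for_status()
    return resp.json()

records = []
for league in fetch_json("leagues")["leagues"]:            # the top five leagues
    for team in fetch_json(f"leagues/{league['id']}/teams")["teams"]:
        for player in fetch_json(f"teams/{team['id']}/players")["players"]:
            for season in SEASONS:
                stats = fetch_json(f"players/{player['id']}/stats?season={season}")
                records.append({"player": player["name"], "season": season, **stats})
```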

📊 Football Analytics Data Pipeline: From Raw Data to Position-Specific Insights. I’ve published a two-part series on Medium that details the development of a data pipeline for football position analysis: “Behind the Numbers: Building...” 📝 Read Part 1: https://lnkd.in/gSxnDTes 📝 Read Part 2: https://lnkd.in/gAC4MHUK #FootballAnalytics #DataScience #SportsAnalytics #Python #DataVisualization

For NFL fans and data people out there: I recently told a friend that I thought the Packers drafted better in the later rounds than in the first (relative to expectations for each round). I decided to check with some web scraping in Python.

Using data from the 2002–2021 drafts, it seems I was correct when comparing the team’s weighted Approximate Value (wAV) to the league average. That said, my theory may be a bit biased, because you get “more bites at the apple” on Day 3 of the draft than on Day 1, especially for a team like the Packers that likes to trade down to accumulate draft capital. You can check your favorite team in the dashboard below!
📊 Tableau: https://lnkd.in/e6mC7T6w
🐍 Web Scraping: https://lnkd.in/eia4biKu
🏈 Data Source: https://lnkd.in/e5R4a9Kg
🔎 About Approximate Value: https://lnkd.in/eRYXXRQH
Caveat with the data (damn, you’re still reading this?): stats cover each player’s career, not just their time... That means players like Eli Manning and Philip Rivers are listed under the Chargers and Giants respectively, despite never playing a snap for those teams.
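For anyone who wants to reproduce the comparison, the core computation is straightforward once the draft table is scraped; here is a sketch assuming a DataFrame with one row per pick and invented column names (team, round, wav):

```python
# Compare each team's average wAV per round against the league average per round.
# The CSV name, column names, and "GNB" team code are assumptions for illustration.
import pandas as pd

draft = pd.read_csv("drafts_2002_2021.csv")  # one row per pick: team, round, wav

# League-average wAV per round is the "expectation" baseline for that round
league_avg = draft.groupby("round")["wav"].mean().rename("league_avg_wav")

# Each team's per-round average, with the baseline merged in for comparison
team_avg = (
    draft.groupby(["team", "round"])["wav"]
    .mean()
    .reset_index()
    .merge(league_avg, on="round")
)
team_avg["vs_league"] = team_avg["wav"] - team_avg["league_avg_wav"]

print(team_avg[team_avg["team"] == "GNB"].sort_values("round"))
```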

💡 Turning Data into Action: The Power of Analytical Thinking 📊. As I continue learning and practicing data analytics, I’ve realized that it’s not just about mastering tools; it’s about developing the... Behind every visualization or query lies a bigger question: 👉 What story is this data trying to tell? Whether it’s analyzing trends, identifying patterns, or presenting dashboards, every small project brings me closer to understanding how data drives smart business decisions. #DataAnalytics #PowerBI #Python #SQL #DataVisualization #LearningJourney #BusinessIntelligence #DataAnalyst #AnalyticsMindset #Growth

Football enthusiasts often crave detailed insights into matches, and providing a dynamic and interactive platform for this is a game-changer.

In this project, I built a full-stack football analytics pipeline that allows users to search for specific matches by entering team names and match dates. The backend fetches data dynamically, processes it, and updates a Power BI dashboard in real time. The pipeline has been deployed on a cloud platform, GCP, which allows it to scale and makes it easy to access globally. The pipeline consists of several key components. The frontend provides users with a simple interface to enter team names and a match date; once users submit this information, it is sent to the backend via an HTTP POST request.
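The article doesn’t specify the backend framework, but the entry point described above, a POST request carrying team names and a match date, might look like this FastAPI sketch; the route, field names, and the fetch_match_data stub are my assumptions:

```python
# Minimal backend sketch: receive a match query via HTTP POST.
# Route, request fields, and fetch_match_data are illustrative assumptions.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class MatchQuery(BaseModel):
    home_team: str
    away_team: str
    match_date: str  # e.g. "2024-05-12"

def fetch_match_data(home: str, away: str, date: str) -> dict:
    # Placeholder: the real pipeline would call a football data API here,
    # process the result, and push it to the Power BI dataset.
    return {"home": home, "away": away, "date": date, "score": None}

@app.post("/matches/search")
def search_match(query: MatchQuery):
    match = fetch_match_data(query.home_team, query.away_team, query.match_date)
    return {"status": "ok", "match": match}
```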

One of the most satisfying parts of data science is turning raw web pages into structured information you can analyze. In this post, I’ll walk through my basic web scraper built in Python that pulls soccer statistics from FBref. If you’ve ever looked at a table in your browser and thought, “I wish I could get this into a DataFrame,” this is for you. FBref is one of the most comprehensive sources of soccer stats: player metrics, team stats, match logs, and more. Its tables are consistent and well-structured, which makes it a great playground for web scraping. The notebook (fbref_webscraping.ipynb) follows a simple but extensible flow, sketched below.
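The notebook itself isn’t reproduced here, but the core of that flow, fetching a page and letting pandas parse its tables, can be sketched like this; the URL is one example FBref page, and note that FBref hides some tables inside HTML comments, which plain read_html will not see:

```python
# Fetch an FBref page and parse its visible tables into DataFrames.
from io import StringIO

import pandas as pd
import requests

URL = "https://fbref.com/en/comps/9/stats/Premier-League-Stats"  # example page

html = requests.get(URL, headers={"User-Agent": "Mozilla/5.0"}, timeout=30).text
tables = pd.read_html(StringIO(html))   # every visible <table> becomes a DataFrame

stats = tables[0]
if isinstance(stats.columns, pd.MultiIndex):
    # FBref uses two-row headers; keep only the lower level
    stats.columns = [col[1] for col in stats.columns]
print(stats.head())
```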

This scraper is just the starting point; I plan to extend it from here.
