This project provides functionality to extract football data from various public sources and save it to a user's local file system. It supplies the extraction code for the Football Data Warehouse project, which aims to provide a comprehensive data warehouse for football data.
Currently, there is functionality to extract data from two sources:
- Fbref
  - Player-related stats such as shooting, passing, and defense.
  - Player wage data.
- Transfermarkt
  - Player market valuation data and general player information.
More sources will be added in the future.
If you would like to use this project independently of Football Data Warehouse, you can follow the steps below to set up the project.
The package manager used for this project is `uv`.
- Clone the repository

```bash
git clone https://github.com/chonalchendo/football-data-extractor.git
```
- Install `uv` for package management

```bash
pip install uv
```
- Create a virtual environment and specify the Python version to use

```bash
uv venv --python 3.12
```
- Install the required packages by running `uv sync`

```bash
uv sync
```
- Activate the `uv` virtual environment if it is not already activated

```bash
source .venv/bin/activate
```
To use the code in this project, you can add each directory as a dependency within your own project. To do this, you don't need to follow the virtual environment instructions above.

For example, the two main directories are `fbref` and `transfermarkt`. You can install these as dependencies using `uv` in another repo, as I have done in the Football Data Warehouse project.

```bash
uv add git+https://github.com/chonalchendo/football-data-extractor.git
```
You must also add your `USER_AGENT` to your `.env` file, otherwise the `transfermarkt` scrapy spiders will not work.

```
USER_AGENT = "INSERT YOUR USER AGENT HERE"
```

You can find your user agent by simply googling "what is my user agent?".
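As a sketch of how the value can be picked up, pydantic-settings (which this project already uses for configuration) can read the variable straight from the `.env` file. The class name and `env_file` wiring below are illustrative assumptions, not the project's actual config:

```python
from pydantic_settings import BaseSettings, SettingsConfigDict


class EnvSettings(BaseSettings):
    # Hypothetical settings class: reads USER_AGENT from the .env file.
    # The class name and env_file wiring are illustrative assumptions.
    model_config = SettingsConfigDict(env_file=".env")

    USER_AGENT: str


settings = EnvSettings()  # fails validation if USER_AGENT is not set
print(settings.USER_AGENT)
```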
This will then allow you to use the `fbref` and `transfermarkt` packages inside your own project.
Fbref is a custom extractor developed using `pandas` and `beautifulsoup` to extract data from the fbref website. You can therefore call `fbref` as a package inside your Python scripts.
```python
from fbref import fbref

from .settings import get_config


def run_stats_crawler(collector: str, season: str) -> None:
    settings = get_config().fbref_extract

    # Configure where and in what format the extracted data is written
    feed = fbref.ParquetFeed(
        output_path=settings.FEEDS.PATH,
        format=settings.FEEDS.FORMAT,
    )

    # Run the chosen collector for the given season
    runner = fbref.NavigatorRunner(feed=feed)
    runner.navigate(collector=collector, season=season)
    runner.start()
```
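A hypothetical invocation might look like the following; the collector name and season format are assumptions about what the `fbref` package accepts:

```python
# Hypothetical values: check the fbref package for the collectors and
# season formats it actually supports.
run_stats_crawler(collector="shooting", season="2023-2024")
```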
Transfermarkt is built using `scrapy` to extract data from the transfermarkt website. This means you can use `scrapy` to run the spiders defined in the `transfermarkt` package.
```python
from pathlib import Path

import polars as pl
from scrapy.crawler import CrawlerProcess
from scrapy.settings import Settings
from scrapy.utils.project import get_project_settings

from .settings import get_config


def set_scrapy_settings() -> Settings:
    # Merge the pydantic config into Scrapy's project settings
    settings = get_project_settings()
    config = get_config()
    scrapy_settings = config.transfermarkt_extract.model_dump(exclude=None)
    settings.setdict(scrapy_settings)
    return settings


def run_squads_spider(crawler: str, season: str) -> None:
    settings = set_scrapy_settings()
    process = CrawlerProcess(settings)

    # Load the clubs scraped in an earlier run; each club is passed to the spider
    clubs_path = Path(f"data/raw/transfermarkt/{season}/clubs.parquet").resolve()
    clubs_df = pl.read_parquet(clubs_path, use_pyarrow=True)
    clubs = clubs_df.to_dicts()

    process.crawl(crawler, season=season, clubs=clubs)
    process.start()
```
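A hypothetical call might then look like this; the spider name `squads` and the season format are assumptions, so use whatever names the `transfermarkt` package actually registers:

```python
# Hypothetical values: the spider name and season format are assumptions.
run_squads_spider(crawler="squads", season="2024")
```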
Because it is built using `scrapy`, you can run the spider using the `CrawlerProcess` class by passing it the spider information. In the above example, I load my spider settings from a pydantic settings model, in the same way you would in a `scrapy` project via the `settings.py` file.
```python
from typing import Any

from pydantic import BaseModel, Field
from pydantic_settings import BaseSettings, SettingsConfigDict


class TransfermarktConfig(BaseSettings):
    SPIDER_MODULES: list[str] = ["transfermarkt"]
    NEWSPIDER_MODULE: str = "transfermarkt"
    USER_AGENT: str = (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"
    )
    ROBOTSTXT_OBEY: bool = True
    DOWNLOAD_DELAY: int = 1
    COOKIES_ENABLED: bool = False
    REQUEST_FINGERPRINTER_IMPLEMENTATION: str = "2.7"
    TWISTED_REACTOR: str = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
    FEED_EXPORT_ENCODING: str = "utf-8"
    ITEM_PIPELINES: dict[str, int] = Field(
        default_factory=lambda: {
            "transfermarkt.transfermarkt.pipelines.TransfermarktParquetPipeline": 300
        }
    )
    FEEDS: dict[str, Any] = Field(
        default_factory=lambda: {
            "data/raw/transfermarkt/{season}/{name}.parquet": {"format": "parquet"}
        }
    )
```
Returning the above as a dictionary and passing it to the `CrawlerProcess` lets `scrapy` know which scraper to use and where to save the data.
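As a minimal sketch of that flow, assuming the `TransfermarktConfig` class above and a hypothetical spider name:

```python
from scrapy.crawler import CrawlerProcess

# Dump the pydantic model to a plain dict of Scrapy settings
scrapy_settings = TransfermarktConfig().model_dump()

# CrawlerProcess also accepts a plain settings dict
process = CrawlerProcess(scrapy_settings)
process.crawl("squads", season="2024")  # hypothetical spider name and arguments
process.start()
```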
Future goals for this project:

- Add more sources to extract data from.
- Add more datasets from existing sources.

If you would like to contribute to this project, please feel free to open a PR.