The establishment of FootVid F.C. was guided by a simple idea:

> Revolutionize the classification of match video frames using AI

In this repository you will find implementations of ML-based methods targeting the aforementioned task. The sample data available for the project are visualized below using the source package function `footvid.utils.images.plot_directories_representation`.
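A minimal usage sketch is shown below; the argument name and value are assumptions, so check the function's docstring in `footvid.utils.images` for the actual signature:

```python
from footvid.utils.images import plot_directories_representation

# Plot a representative grid of sample frames for each class directory
# (the positional argument is assumed; consult the function's docstring).
plot_directories_representation("data/raw")
```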
```
.
├── README.md          <- The top-level README for developers using this project.
│
├── data
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── docs
│   └── images         <- Folder for storing images used across the package documentation.
│
├── logs               <- Tensorboard model training logs.
│
├── models             <- Trained and serialized models, model predictions, or model summaries.
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering) and
│                         a short `-` delimited description, e.g. `00-initial-data-exploration`.
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── poetry.lock        <- File to resolve and install all dependencies listed in the
│                         pyproject.toml file.
├── pyproject.toml     <- File orchestrating the project and its dependencies.
│
├── footvid            <- Source code for use in this project.
```
The project separates the particular modeling steps into notebooks. Notebook list:

- `00-images-integrity` focuses on getting familiar with the data: it examines image sizes, resolution, and the ratio of positive to negative frames, and generates a presentation of an example input.
- `01-train-valid-split` is dedicated to dividing the data set into appropriately represented training and validation sets, to avoid the consequences of sampling bias exemplified by the widely known *The Literary Digest* presidential poll. The notebook also provides an implementation of an algorithm with quadratic worst-case time complexity that finds the best splits in a semi-greedy manner guided by `match_hash` (a simplified illustration of the grouping idea is sketched after this list).
- `10-resnet` provides the ResNet50 experiment setup. Three fine-tuning methodologies were tested: first, optimizing only the weights of the last, fully-connected layer; second, additionally unfreezing the 3rd and 4th ResNet layers; and last, fine-tuning the whole neural net. These operations are possible thanks to the recursive nature of the `footvid.arena.freeze_layers` function (a stand-in sketch is given after this list). The models were trained with the plain, safer SGD optimizer, avoiding the risk of unstable solutions sometimes obtained with adaptive optimizers.
- `11-model-selection` shows the method of selecting the best model with the use of the TensorBoard training logs and prepares the test dataset predictions.
- `12-gradcam` renders Gradient-weighted Class Activation Mappings (Grad-CAM) of the Champion model for 11 randomly chosen images from the validation set. It can be observed that the model tries to extract the group of players located on the football pitch, which seems to be a fairly good heuristic for assessing whether the image is a match video frame or not. Whenever there is a camera close-up or something unrelated to the football match, the model easily detects that it is not a match frame, which can be seen from the lack of "warm" areas on the heatmap. Examples below.
True label: positive, prediction: 0.95
True label: negative, prediction: 0.13
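The splitting algorithm in `01-train-valid-split` is project-specific, but its core idea, keeping all frames that share a `match_hash` on the same side of the split, can be illustrated with scikit-learn's `GroupShuffleSplit`. This is a simplified stand-in, not the notebook's semi-greedy implementation:

```python
from sklearn.model_selection import GroupShuffleSplit

def split_by_match(frame_paths, labels, match_hashes, valid_size=0.2, seed=42):
    """Split frames so that no single match contributes to both training and validation."""
    splitter = GroupShuffleSplit(n_splits=1, test_size=valid_size, random_state=seed)
    train_idx, valid_idx = next(splitter.split(frame_paths, labels, groups=match_hashes))
    return train_idx, valid_idx
```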
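The three `10-resnet` fine-tuning strategies can be pictured with the sketch below. It shows recursive layer freezing on a pre-trained ResNet50 and is only an assumption-based stand-in, not the actual `footvid.arena.freeze_layers` implementation:

```python
import torch.nn as nn
from torch.optim import SGD
from torchvision.models import resnet50

def freeze_layers(module: nn.Module) -> None:
    """Recursively disable gradient updates for a module and all of its children."""
    for param in module.parameters(recurse=False):
        param.requires_grad = False
    for child in module.children():
        freeze_layers(child)

model = resnet50(pretrained=True)              # newer torchvision uses the `weights=` argument
freeze_layers(model)                           # strategy 1: freeze the whole backbone...
model.fc = nn.Linear(model.fc.in_features, 1)  # ...and train a fresh binary head only

# Strategy 2: additionally unfreeze the 3rd and 4th ResNet blocks.
for block in (model.layer3, model.layer4):
    for param in block.parameters():
        param.requires_grad = True

# Plain SGD over the trainable parameters only.
optimizer = SGD((p for p in model.parameters() if p.requires_grad), lr=1e-3, momentum=0.9)
```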
If only the `footvid` source package functionalities are of interest, it is enough to run:

```bash
pip install git+https://github.com/mrtovsky/footvid.git
```
To interact with the notebooks, e.g. to rerun them, the full project preparation is necessary. It can be done in the following few steps. First of all, you need to clone the repository:

```bash
git clone https://github.com/mrtovsky/footvid.git
```
Then, enter this directory and create a `.env` file that stores an environment variable with the cloned repository path:

```bash
cd footvid/
touch .env
printf "REPOSITORY_PATH=\"$(pwd)\"" >> .env
```
The recommended way of installing the full project is via Poetry. If Poetry is not installed already, follow the installation instructions at the provided link. Then, assuming you have already entered the footvid directory, resolve and install the dependencies using:

```bash
poetry install
```
Furthermore, you may want to attach a kernel with the already created virtual environment to Jupyter Notebook. This can be done by calling:

```bash
poetry run python -m ipykernel install --name=footvid-venv
```

This will make `footvid-venv` available among your Jupyter Notebook kernels.
It is also possible to install the package in the traditional way; simply run:

```bash
pip install -e .
```
This will install the package in editable mode. If you installed it inside a virtual environment, attaching it to a Jupyter Notebook kernel works the same as with Poetry, except the command is stripped of its first two elements (remember that the virtualenv needs to be activated beforehand):

```bash
python -m ipykernel install --name=footvid-venv
```
| Method | Accuracy score (train / validation) | Average precision (train / validation) |
| --- | --- | --- |
| Fully-connected layer fine-tuning | 0.9676 / 0.9780 | 0.9947 / 0.9949 |
| Top 2 CNN layers and FC layer fine-tuning | 0.9826 / 0.9824 | 0.9961 / 0.9970 |
| Full network fine-tuning | 0.9758 / 0.9802 | 0.9956 / 0.9950 |
More detailed training results can be displayed by opening TensorBoard:

```bash
tensorboard --logdir ./logs/ --host localhost
```
Judging by the raw metric values, the best results are, intuitively, achieved by the model that fine-tunes the layers responsible for learning more intricate, dataset-specific features together with the classifier on top of them. Training only the classification layer may give worse results because the more complex features coming out of the convolutional layers are not tailored to the needs of the given problem. On the other hand, fine-tuning the entire network, including the layers responsible for recognizing general, universal shapes, makes the model vulnerable to learning the noise present in smaller datasets. Training only the top CNN layers seems to strike the right balance between these two risks.
It should be remembered that this inference is based on raw results from a very small validation set, where there is a high risk that the differences in model performance are not statistically significant. To actually check whether one estimator performs better than another, an appropriate test should be used, e.g. the combined 5x2cv F-test (a sketch is given below). That is, however, beyond the scope of this project, so having to decide on one of the models mentioned, we take the one with the best results as our new champion and use it to prepare the test predictions.
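As an illustration, this is roughly how such a test could be run with `mlxtend` on scikit-learn-compatible estimators; the CNNs from this project would first need to be wrapped in an estimator interface, so the models and data below are placeholders:

```python
from mlxtend.evaluate import combined_ftest_5x2cv
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Dummy stand-in data; in practice X, y would be frame features and match/non-match labels.
X, y = make_classification(n_samples=500, n_features=20, random_state=42)

f_stat, p_value = combined_ftest_5x2cv(
    estimator1=LogisticRegression(max_iter=1000),
    estimator2=RandomForestClassifier(n_estimators=100),
    X=X,
    y=y,
    random_seed=42,
)
print(f"F = {f_stat:.3f}, p = {p_value:.3f}")  # p < 0.05 would suggest a genuine difference
```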
Some ideas on how to continue the research work on this project:

- Training a shallow classification model on features extracted with the Histogram of Oriented Gradients (HOG) method (a sketch follows this list).
- Using the already developed model for active learning, i.e. to detect mislabeled frames.
- Clustering to group frames from the same match even when they have different `match_hash` values assigned to them.
- Redefining the classification problem as object detection and creating match-frame detection heuristics guided by the presence of "football pitch landmarks".
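A rough sketch of the first idea, pairing `scikit-image` HOG descriptors with a linear classifier; the directory layout, file extension, and HOG parameters are assumptions:

```python
from pathlib import Path

import numpy as np
from skimage.color import rgb2gray
from skimage.feature import hog
from skimage.io import imread
from skimage.transform import resize
from sklearn.linear_model import LogisticRegression

def hog_features(image_path):
    """Load a frame, normalize its size, and describe it with HOG features."""
    image = rgb2gray(imread(image_path))
    image = resize(image, (128, 128))
    return hog(image, orientations=9, pixels_per_cell=(16, 16), cells_per_block=(2, 2))

# Assumed layout: data/processed/{negative,positive}/*.jpg
paths, labels = [], []
for label, class_dir in enumerate(["negative", "positive"]):
    for path in Path("data/processed", class_dir).glob("*.jpg"):
        paths.append(str(path))
        labels.append(label)

features = np.stack([hog_features(p) for p in paths])
classifier = LogisticRegression(max_iter=1000).fit(features, labels)
```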
Karpathy et al., Large-scale Video Classification with Convolutional Neural Networks
Key takeaways:
- CNNs' expected behavior is to learn increasingly complex patterns with successive, deeper layers: the first layers learn more generic features, while deeper layers are responsible for learning intricate, dataset-specific features. In the case of transfer learning, this has a significant impact on the choice of "defrosted" layers to be fine-tuned.
- Most sport videos are biased towards displaying the objects of interest in the central region. Thus, the image preprocessing proposed in this article includes:
  - cropping to the center region,
  - resizing images to 200 × 200 pixels,
  - randomly sampling a 170 × 170 region,
  - randomly flipping the images horizontally with 50% probability.

  The final absolute size of the images is too small for our problem, because the pre-trained `torchvision.models` require an image height and width of at least 224 × 224 pixels, but the idea is definitely worth attention after an appropriate adjustment to the problem under consideration (see the sketch below).
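A possible adaptation of this preprocessing to the 224 × 224 requirement, sketched with standard `torchvision.transforms`; the intermediate 256-pixel size and the normalization constants are assumptions, not values taken from the notebooks:

```python
from torchvision import transforms

# Karpathy-style center bias and augmentation, rescaled so the final crop is 224 × 224.
train_transforms = transforms.Compose([
    transforms.Resize(256),                  # shorter side to 256 px, keeping aspect ratio
    transforms.CenterCrop(256),              # keep the central square region
    transforms.RandomCrop(224),              # randomly sample a 224 × 224 sub-region
    transforms.RandomHorizontalFlip(p=0.5),  # random horizontal flip with 50% probability
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
```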
Simeon Jackman, Football Shot Detection using Convolutional Neural Networks
Key takeaways:
- Confirmation of the correctness of applying transfer learning in the context of football match TV broadcast analysis, using models pre-trained on the ImageNet dataset.