Low rank field-weighted FM model in PyTorch (paper)
The code was forked and modified from https://github.com/rixwew/pytorch-fm, written by rixwew. The API Documentation of the original code is: https://rixwew.github.io/pytorch-fm
This package provides a PyTorch implementation of low-rank factorization machine models, the factorization model baselines, and the common datasets in CTR prediction. The code runs on the following datasets: Avazu, Criteo, and Movielens.
The workflow consists of three steps:
- Step 1: Preprocessing the data.
- Step 2: Running the models on the preprocessed data.
- Step 3: Analyzing the results.
Moreover, this repository contains a notebook, notebooks/inference_timing.ipynb, that demonstrates the inference speedups attained by low-rank models over pruned models for different ad auction sizes and numbers of context fields.
Example of a preprocessed dataset, as obtained after running the preprocessing scripts on the initial datasets:
label, user_id, item_id, C1, C2, …
<label>,10,2,3,17,11,15
<label>,11,2,4,16,12,14
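Factorization machine models usually need to know how many distinct values each field takes, since that determines the size of the embedding tables. As a minimal sketch, assuming the preprocessed file has a header row and integer-encoded features as in the example above, the per-field cardinalities could be counted like this. The sample data, the 0/1 labels, and the function name field_cardinalities are illustrative assumptions, not part of the repository.

```python
import csv
import io

# Hypothetical sample in the layout shown above. The 0/1 labels and the
# presence of a header row are assumptions made for illustration.
SAMPLE = """label,user_id,item_id,C1,C2,C3,C4
1,10,2,3,17,11,15
0,11,2,4,16,12,14
"""


def field_cardinalities(handle):
    """Count the distinct values in every feature column (label excluded)."""
    reader = csv.reader(handle)
    header = next(reader)
    feature_names = header[1:]
    seen = [set() for _ in feature_names]
    for row in reader:
        for values, cell in zip(seen, row[1:]):
            values.add(int(cell))
    return {name: len(values) for name, values in zip(feature_names, seen)}


cardinalities = field_cardinalities(io.StringIO(SAMPLE))
print(cardinalities)
```

To inspect a real file, pass an open file handle (for example, open("train.csv", newline="")) instead of the in-memory sample.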
- Download the initial Avazu file (train, 6.31G) from https://www.kaggle.com/datasets/atirpetkar/avazu-ctr and name it data_avazu.csv.
- Place it in pytorch-fm/torchfm/test-datasets/avazu/
- To create the train, validation, and test datasets, run from the Python shell:
    from torchfm.torch_utils.parsing_datasets.avazu.avazu_parsing import process_data
    process_data()
- Check that train.csv, test.csv, and validation.csv are now stored under /pytorch-fm/torchfm/test-datasets/avazu/
- Check that you have enough available space (5G), then proceed to run the ML models on the train, validation, and test datasets.
- Download the initial Criteo file (train.txt, 11.15G) from https://www.kaggle.com/datasets/mrkmakr/criteo-dataset and name it data_criteo.csv.
- Place it in pytorch-fm/torchfm/test-datasets/criteo/
- To create the train, validation, and test datasets, run from the Python shell:
    from torchfm.torch_utils.parsing_datasets.criteo.criteo_parsing import CriteoParsing
    CriteoParsing.do_preprocessing()
- Check that train.csv, test.csv, and validation.csv are now stored under /pytorch-fm/torchfm/test-datasets/criteo/
- Check that you have enough available space (5G), then proceed to run the ML models on the train, validation, and test datasets.
- Download the initial Movielens files users.dat, movies.dat, and ratings.dat from https://www.kaggle.com/datasets/sherinclaudia/movielens
- Place them in pytorch-fm/torchfm/test-datasets/movielens/
- To create the train, validation, and test datasets, run from the Python shell:
    from torchfm.torch_utils.parsing_datasets.movielens.movielens_parsing import MovielensParsing
    MovielensParsing.process_data()
- Check that train.csv, test.csv, and validation.csv are now stored under /pytorch-fm/torchfm/test-datasets/movielens/
- Check that you have enough available space (5G), then proceed to run the ML models on the train, validation, and test datasets.
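The two manual checks that close each dataset's preprocessing (split files present, enough free space) can also be scripted. The following is a minimal sketch using only the standard library. The helper names missing_splits and free_gib are hypothetical, and the base folder is taken from the paths quoted above, so adjust it to your checkout.

```python
import shutil
from pathlib import Path

# File names the preprocessing steps are expected to produce.
SPLITS = ("train.csv", "validation.csv", "test.csv")


def missing_splits(dataset_dir):
    """Return the split file names that are absent from dataset_dir."""
    root = Path(dataset_dir)
    return [name for name in SPLITS if not (root / name).is_file()]


def free_gib(path):
    """Free space, in GiB, on the filesystem that holds path."""
    return shutil.disk_usage(path).free / 2**30


# Assumed location, relative to the folder that contains pytorch-fm.
base = Path("pytorch-fm/torchfm/test-datasets")
for dataset in ("avazu", "criteo", "movielens"):
    absent = missing_splits(base / dataset)
    print(f"{dataset}: missing {absent if absent else 'nothing'}")
print(f"free space here: {free_gib('.'):.1f} GiB (about 5G is recommended)")
```

A folder that does not exist is reported as missing all three files, so the script is safe to run before any preprocessing has been done.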
- Copy the code to the dedicated folder and install all requirements.
- In the shell, redefine PYTHONPATH to point to your project root, e.g.:
    export PYTHONPATH=$PYTHONPATH:/home/${USER}/pytorch-fm/src:/home/${USER}/pytorch-fm/src/main_functions
- Edit the pytorch-fm/torchfm/torch_utils/constants.py file so that:
  - base_path_project points to your project root
  - dataset_name is the dataset you run on (avazu, criteo, or movielens), e.g., dataset_name = movielens
- Edit the pytorch-fm/src/main_functions/run_processes.py file in the examples folder so that it refers to the list of options to run; currently, as an example, it contains lst_tmp in "for tpl in lst_tmp:".
- Copy the split train.csv, validation.csv, and test.csv datasets to pytorch-fm/data/test-datasets// (dataset is either criteo, avazu, or movielens). Then open a Python shell by running the "python" command from the shell.
- Check that you still have enough space (at least 5G available) in your local environment after all these steps, e.g., with df -h /home/default/your_location; otherwise, remove data you do not need (e.g., datasets you don't use for the current run).
- Check that pytorch-fm/data/tmp_save_dir exists (if not, create this folder).
- If you are rerunning, check that no results of a previous run are stored (especially .log files, which lock the next run); otherwise remove them: rm pytorch-fm/data/tmp_save_dir/*
- Run: python ./pytorch-fm/src/main_functions/run_processes.py
- After the run is done, the results are saved in /pytorch-fm/data/tmp_save_dir/optuna_results.txt. Debug info is saved in /pytorch-fm/data/tmp_save_dir/debug_info.txt.
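The housekeeping around tmp_save_dir (create the folder if it is missing, clear leftovers from a previous run) can be sketched in Python as follows. The function name prepare_save_dir is hypothetical and not part of the repository. The sketch roughly mirrors mkdir -p followed by rm on the folder's contents, and the demonstration runs in a temporary folder so that nothing real is deleted.

```python
import tempfile
from pathlib import Path


def prepare_save_dir(save_dir):
    """Create save_dir if needed and delete files left by a previous run.

    Only regular, non-hidden files directly inside the folder are removed,
    which is roughly what `rm save_dir/*` does; subfolders are left alone.
    Returns the names of the removed files.
    """
    folder = Path(save_dir)
    folder.mkdir(parents=True, exist_ok=True)
    removed = []
    for entry in sorted(folder.iterdir()):
        if entry.is_file() and not entry.name.startswith("."):
            entry.unlink()
            removed.append(entry.name)
    return removed


# Demonstration in a throwaway folder, with made-up leftover files.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "tmp_save_dir"
    prepare_save_dir(target)  # first call only creates the folder
    (target / "old_run.log").write_text("stale")
    (target / "optuna_results.txt").write_text("stale")
    demo_removed = prepare_save_dir(target)
    print("removed:", demo_removed)
```

For a real run, the argument would be the pytorch-fm/data/tmp_save_dir path from the steps above; back up any results you still need first, since the files are deleted permanently.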
Open the notebook notebooks/analysis.ipynb, modify the line
    files = glob.glob('../data/optuna_results*.txt')
to point to the result files saved in the previous step, and run the notebook. It produces a pandas table with the metrics and corresponding lifts for each dataset, and a LaTeX code snippet you can put in a paper to share the results.
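Since the previous step saves its results under data/tmp_save_dir, the pattern will probably need that folder in it. Here is a small sketch of the adjusted line with a sanity check, assuming the notebook runs with notebooks/ as its working directory. The variable name RESULTS_PATTERN is illustrative and does not come from the notebook.

```python
import glob
import os

# Assumed pattern: result files from Step 2, relative to the notebooks/ folder.
RESULTS_PATTERN = os.path.join("..", "data", "tmp_save_dir", "optuna_results*.txt")

# Sorting makes the order of the files reproducible across runs.
files = sorted(glob.glob(RESULTS_PATTERN))

if files:
    print(f"Found {len(files)} result file(s):")
    for path in files:
        print("  ", path)
else:
    print(f"No files match {RESULTS_PATTERN!r}; check the path before running the notebook.")
```

An empty list here usually means a wrong relative path rather than a failed run, so it is worth checking before executing the remaining cells.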