Changed dumping process a bit
Now algorithm objects are serialized. Trainsets are not dumped anymore
as they are part of the algorithm objects anyway. Renamed the load_algo
function to simply 'load'. Added an FAQ entry on serializing algorithms,
along with an example.
NicolasHug committed May 2, 2017
1 parent d55ac43 commit 9e759a3
Showing 10 changed files with 146 additions and 76 deletions.
11 changes: 11 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,8 @@
Current
=======

* Changed the dumping process a bit (see API changes). Plus, dumps can now be
loaded.
* Added possibility to get accuracy performances on the trainset
* Added inner-to-raw id conversion in the Trainset class
* The r_ui parameter of the predict() method is now optional
@@ -10,6 +12,15 @@ Current
* Corrected factor vectors initialization of SVD algorithms. Thanks to
adideshp.

API Changes
-----------

* The dump() method now dumps a list of predictions (optional) and an
algorithm (optional as well). The algorithm is now a real algorithm object.
The trainset is not dumped anymore as it is already part of the algorithm.
* The dump() method is now part of the dump namespace, not the global
namespace (so it is accessed as surprise.dump.dump).
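Under the new scheme, what lands on disk is simply a pickled dictionary with ``'predictions'`` and ``'algo'`` keys. A minimal sketch of the round trip using plain pickle (the prediction tuples and algorithm dict below are placeholder stand-ins, not actual Surprise objects):

```python
import pickle
import tempfile

# Placeholder stand-ins for a predictions list and a trained algorithm.
predictions = [('user_1', 'item_1', 4.0)]
algo = {'name': 'SVD', 'n_factors': 100}

# Dump: a dict with 'predictions' and 'algo' keys, pickled to a file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    pickle.dump({'predictions': predictions, 'algo': algo}, tmp)

# Load: unpickle the dict and unpack the same two keys.
with open(tmp.name, 'rb') as f:
    dump_obj = pickle.load(f)
loaded_predictions, loaded_algo = dump_obj['predictions'], dump_obj['algo']
```

Because the algorithm is pickled as a real object rather than a bare ``__dict__``, loading it back needs no knowledge of the algorithm's class.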

VERSION 1.0.2
=============

3 changes: 2 additions & 1 deletion TODO.md
@@ -1,11 +1,11 @@
TODO
====

* Change the dumping machinery to be more consistent
* Complete FAQ

* Implement some recommendation strategy (like recommend the items with the 10
highest estimation)
* Rewrite stuff on prediction dumping for analysis (l.40 building_on_algo)
* Allow to discount similarities (see aggarwal)
* Support conda?
* Allow incremental updates for some algorithms
@@ -21,6 +21,7 @@ Maybe, Maybe not
Done:
-----

* Change the dumping machinery to be more consistent
* Allow to test on the trainset
* make bibtex entry
* Verbosity of gridsearch still prints stuff because of evaluate. Fix that.
30 changes: 28 additions & 2 deletions doc/source/FAQ.rst
@@ -9,21 +9,45 @@ How to get the :math:`k` nearest neighbors of a user (or item)
How to get the top-:math:`k` recommendations for a user
-------------------------------------------------------

How to save an algorithm for later use
--------------------------------------
.. _save_algorithm_for_later_use:

How to serialize an algorithm
-----------------------------

Prediction algorithms can be serialized and loaded back using the :func:`dump()
<surprise.dump.dump>` and :func:`load() <surprise.dump.load>` functions. Here
is a small example where the SVD algorithm is trained on a dataset and
serialized. It is then reloaded and can be used again for making predictions:

.. literalinclude:: ../../examples/serialize_algorithm.py
:caption: From file ``examples/serialize_algorithm.py``
:name: serialize_algorithm.py
:lines: 9-

How to build my own prediction algorithm
----------------------------------------

See the :ref:`user guide <building_custom_algo>`.

What are raw and inner ids
--------------------------

See :ref:`this note <raw_inner_note>`.

How to use my own dataset with Surprise
---------------------------------------

See the :ref:`user guide <load_custom>`.

How to tune an algorithm's parameters
-------------------------------------

To tune the parameters of your algorithm, you can use the :class:`GridSearch
<surprise.evaluate.GridSearch>` class as described :ref:`here
<tuning_algorithm_parameters>`. After the tuning, you may want to have an
:ref:`unbiased estimate of your algorithm's performance
<unbiased_estimate_after_tuning>`.
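The exhaustive search behind parameter tuning can be sketched in plain Python. The parameter names below mirror common SVD options, and ``evaluate_rmse`` is a placeholder for a real cross-validated score, not Surprise's actual GridSearch API:

```python
from itertools import product

# Hypothetical parameter grid, in the spirit of a grid search.
param_grid = {'n_epochs': [5, 10], 'lr_all': [0.002, 0.005]}

def evaluate_rmse(params):
    # Placeholder score: a real grid search would cross-validate an
    # algorithm built with these parameters and return its RMSE.
    return params['lr_all'] * 100 - params['n_epochs'] * 0.01

# Enumerate every combination of parameter values and keep the best one.
keys = sorted(param_grid)
combos = [dict(zip(keys, values))
          for values in product(*(param_grid[k] for k in keys))]
best = min(combos, key=evaluate_rmse)
```

GridSearch does essentially this enumeration for you, scoring each combination by cross-validation over the requested measures.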

How to get accuracy measures on the training set
------------------------------------------------

@@ -40,6 +64,8 @@ with the :meth:`test()

Check out the example file for more usage examples.

.. _unbiased_estimate_after_tuning:

How to save some data for unbiased accuracy estimation
------------------------------------------------------

2 changes: 1 addition & 1 deletion doc/source/building_custom_algo.rst
@@ -37,7 +37,7 @@ return a dictionary with given details: ::

This dictionary will be stored in the :class:`prediction
<surprise.prediction_algorithms.predictions.Prediction>` as the ``details``
field and can be used :ref:`for later analysis <dumping>`.
field and can be used for later analysis.



22 changes: 0 additions & 22 deletions doc/source/getting_started.rst
@@ -234,28 +234,6 @@ Obviously, it is perfectly fine to use the :meth:`predict()
during a cross-validation process. It's then up to you to ensure that the user
and item ids are present in the trainset though.

.. _dumping:

Dump the predictions for later analysis
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You may want to save your algorithm predictions along with all the useful
information about the algorithm. This way, you can run your algorithm once,
save the results, and go back to them whenever you want to inspect each
prediction in greater detail, and get a good insight into why your algorithm
performs well (or badly!). `Surprise <https://nicolashug.github.io/Surprise/>`_
provides some tools to do that.

You can dump your algorithm predictions either using the :func:`evaluate()
<surprise.evaluate.evaluate>` function, or do it manually with the :func:`dump
<surprise.dump.dump>` function. Either way, an example is worth a thousand
words, so here are a few `jupyter <http://jupyter.org/>`_ notebooks:

- `Dumping and analysis of the KNNBasic algorithm
<http://nbviewer.jupyter.org/github/NicolasHug/Surprise/tree/master/examples/notebooks/KNNBasic_analysis.ipynb/>`_.
- `Comparison of two algorithms
<http://nbviewer.jupyter.org/github/NicolasHug/Surprise/tree/master/examples/notebooks/Compare.ipynb/>`_.

Command line usage
~~~~~~~~~~~~~~~~~~

33 changes: 33 additions & 0 deletions examples/serialize_algorithm.py
@@ -0,0 +1,33 @@
"""
This module illustrates the use of the dump and load methods for serializing an
algorithm. The SVD algorithm is trained on a dataset and then serialized. It is
then reloaded and can be used again for making predictions.
"""

from __future__ import (absolute_import, division, print_function,
unicode_literals)
import os

from surprise import SVD
from surprise import Dataset
from surprise import dump


data = Dataset.load_builtin('ml-100k')
trainset = data.build_full_trainset()

algo = SVD()
algo.train(trainset)

# Compute predictions of the 'original' algorithm.
predictions = algo.test(trainset.build_testset())

# Dump algorithm and reload it.
file_name = os.path.expanduser('~/dump_file')
dump.dump(file_name, algo=algo)
_, loaded_algo = dump.load(file_name)

# We now ensure that the algo is still the same by checking the predictions.
predictions_loaded_algo = loaded_algo.test(trainset.build_testset())
assert predictions == predictions_loaded_algo
print('Predictions are the same')
2 changes: 1 addition & 1 deletion surprise/__init__.py
@@ -21,7 +21,7 @@
from .evaluate import evaluate
from .evaluate import print_perf
from .evaluate import GridSearch
from .dump import dump
from . import dump

__all__ = ['AlgoBase', 'NormalPredictor', 'BaselineOnly', 'KNNBasic',
'KNNWithMeans', 'KNNBaseline', 'SVD', 'SVDpp', 'NMF', 'SlopeOne',
66 changes: 29 additions & 37 deletions surprise/dump.py
@@ -3,63 +3,55 @@
"""

import pickle
import importlib


def dump(file_name, predictions, trainset=None, algo=None):
"""Dump a list of :obj:`predictions
<surprise.prediction_algorithms.predictions.Prediction>` for future
analysis, using Pickle.
def dump(file_name, predictions=None, algo=None, verbose=0):
"""A basic wrapper around Pickle to serialize a list of prediction and/or
an algorithm on drive.
If needed, the :class:`trainset <surprise.dataset.Trainset>` object and the
algorithm can also be dumped. What is dumped is a dictionary with keys
``'predictions'``, ``'trainset'``, and ``'algo'``.
The dumped algorithm won't be a proper :class:`algorithm
<surprise.prediction_algorithms.algo_base.AlgoBase>` object but simply a
dictionary with the algorithm attributes as keys-values (technically, the
``algo.__dict__`` attribute).
See :ref:`User Guide <dumping>` for usage.
What is dumped is a dictionary with keys ``'predictions'`` and ``'algo'``.
Args:
file_name(str): The name (with full path) specifying where to dump the
predictions.
predictions(list of :obj:`Prediction\
<surprise.prediction_algorithms.predictions.Prediction>`): The
predictions to dump.
trainset(:class:`Trainset <surprise.dataset.Trainset>`, optional): The
trainset to dump.
algo(:class:`Algorithm\
<surprise.prediction_algorithms.algo_base.AlgoBase>`, optional):
algorithm to dump.
The algorithm to dump.
verbose(int): Level of verbosity. If ``1``, a message indicates that
the dump was successful. Default is ``0``.
"""

dump_obj = dict()

dump_obj['predictions'] = predictions

if trainset is not None:
dump_obj['trainset'] = trainset

if algo is not None:
dump_obj['algo'] = algo.__dict__ # add algo attributes
dump_obj['algo']['name'] = algo.__class__.__name__

dump_obj = {'predictions': predictions,
'algo': algo
}
pickle.dump(dump_obj, open(file_name, 'wb'))
print('The dump has been saved as file', file_name)

if verbose:
print('The dump has been saved as file', file_name)


def load_algo(file_name):
"""Load a prediction algorithm from a file, using Pickle.
def load(file_name):
"""A basic wrapper around Pickle to deserialize a list of prediction and/or
an algorithm that were dumped on drive using :func:`dump()
<surprise.dump.dump>`.
Args:
file_name(str): The path of the file from which the algorithm is
to be loaded.
Returns:
A tuple ``(predictions, algo)`` where ``predictions`` is a list of
:class:`Prediction
<surprise.prediction_algorithms.predictions.Prediction>` objects and
``algo`` is an :class:`Algorithm
<surprise.prediction_algorithms.algo_base.AlgoBase>` object. Depending
on what was dumped, some of these may be ``None``.
"""

dump_obj = pickle.load(open(file_name, 'rb'))
algo_module = importlib.import_module('surprise.prediction_algorithms')
algo = getattr(algo_module, dump_obj['algo']['name'])()
algo.__dict__ = dump_obj['algo']
return algo

return dump_obj['predictions'], dump_obj['algo']
9 changes: 4 additions & 5 deletions surprise/evaluate.py
Original file line number Diff line number Diff line change
@@ -32,11 +32,10 @@ def evaluate(algo, data, measures=['rmse', 'mae'], with_dump=False,
measures(list of string): The performance measures to compute. Allowed
names are function names as defined in the :mod:`accuracy
<surprise.accuracy>` module. Default is ``['rmse', 'mae']``.
with_dump(bool): If True, the predictions, the trainsets and the
algorithm parameters will be dumped for later further analysis at
each fold (see :ref:`User Guide <dumping>`). The file names will
be set as: ``'<date>-<algorithm name>-<fold number>'``. Default is
``False``.
with_dump(bool): If True, the predictions and the algorithm will be
dumped at each fold for later analysis (see :ref:`FAQ
<save_algorithm_for_later_use>`). The file names will be set as:
``'<date>-<algorithm name>-<fold number>'``. Default is ``False``.
dump_dir(str): The directory where to dump to files. Default is
``'~/.surprise_data/dumps/'``.
verbose(int): Level of verbosity. If 0, nothing is printed. If 1
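The ``'<date>-<algorithm name>-<fold number>'`` naming scheme used by ``evaluate()`` can be sketched as follows; ``make_dump_name`` and its exact date format are illustrative assumptions, not the library's actual code:

```python
import datetime
import os

def make_dump_name(dump_dir, algo_name, fold_i):
    # Illustrative sketch of the '<date>-<algorithm name>-<fold number>'
    # scheme; the date format actually used by evaluate() may differ.
    date = datetime.datetime.now().strftime('%y%m%d-%Hh%Mm%S')
    return os.path.join(dump_dir, '{0}-{1}-fold{2}'.format(date, algo_name, fold_i))

name = make_dump_name(os.path.expanduser('~/.surprise_data/dumps'), 'SVD', 1)
```

Each fold of a cross-validation run thus gets its own timestamped dump file under ``dump_dir``.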
44 changes: 37 additions & 7 deletions tests/test_dump.py
@@ -4,18 +4,48 @@
from __future__ import (absolute_import, division, print_function,
unicode_literals)
import tempfile
import random
import os

from surprise import Prediction
from surprise import AlgoBase
from surprise import Trainset
from surprise import BaselineOnly
from surprise import Dataset
from surprise import Reader
from surprise import dump


def test_dump():
"""Train an algorithm, compute its predictions then dump them.
Ensure that the predictions that are loaded back are the correct ones, and
that the predictions of the dumped algorithm are also equal to the other
ones."""

predictions = [Prediction(None, None, None, None, None)]
algo = AlgoBase()
trainset = Trainset(*[None] * 9)
random.seed(0)

train_file = os.path.join(os.path.dirname(__file__), './u1_ml100k_train')
test_file = os.path.join(os.path.dirname(__file__), './u1_ml100k_test')
data = Dataset.load_from_folds([(train_file, test_file)],
Reader('ml-100k'))

for trainset, testset in data.folds():
pass

algo = BaselineOnly()
algo.train(trainset)
predictions = algo.test(testset)

with tempfile.NamedTemporaryFile() as tmp_file:
dump.dump(tmp_file.name, predictions, algo)
predictions_dumped, algo_dumped = dump.load(tmp_file.name)

predictions_algo_dumped = algo_dumped.test(testset)
assert predictions == predictions_dumped
assert predictions == predictions_algo_dumped


def test_dump_nothing():
"""Ensure that by default None objects are dumped."""
with tempfile.NamedTemporaryFile() as tmp_file:
dump(tmp_file.name, predictions, trainset, algo)
dump.dump(tmp_file.name)
predictions, algo = dump.load(tmp_file.name)
assert predictions is None
assert algo is None
