Changed dumping process a bit
Now algorithm objects are serialized. Trainsets are not dumped anymore
as they are part of the algorithm objects anyway. Renamed the load_algo
function to simply 'load'. Added an FAQ entry on serializing algorithms,
along with an example.
NicolasHug committed May 2, 2017
1 parent d55ac43 commit 9e759a3
Showing 10 changed files with 146 additions and 76 deletions.
11 changes: 11 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,8 @@
Current
=======

* Changed the dumping process a bit (see API changes). Plus, dumps can now be
loaded.
* Added possibility to get accuracy performances on the trainset
* Added inner-to-raw id conversion in the Trainset class
* The r_ui parameter of the predict() method is now optional
@@ -10,6 +12,15 @@ Current
* Corrected factor vectors initialization of SVD algorithms. Thanks to
adideshp.

API Changes
-----------

* The dump() method now dumps a list of predictions (optional) and an
algorithm (optional as well). The algorithm is now a real algorithm object.
The trainset is not dumped anymore as it is already part of the algorithm.
* The dump() method is now part of the dump namespace, not the global
namespace (so it is accessed as surprise.dump.dump).
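Under the new scheme, what lands on disk is simply a pickled dictionary with ``'predictions'`` and ``'algo'`` keys. A minimal sketch of the round trip using plain pickle (the prediction tuples and algorithm dict below are placeholder stand-ins, not actual Surprise objects):

```python
import pickle
import tempfile

# Placeholder stand-ins for a predictions list and a trained algorithm.
predictions = [('user_1', 'item_1', 4.0)]
algo = {'name': 'SVD', 'n_factors': 100}

# Dump: a dict with 'predictions' and 'algo' keys, pickled to a file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    pickle.dump({'predictions': predictions, 'algo': algo}, tmp)

# Load: unpickle the dict and unpack the same two keys.
with open(tmp.name, 'rb') as f:
    dump_obj = pickle.load(f)
loaded_predictions, loaded_algo = dump_obj['predictions'], dump_obj['algo']
```

Because the algorithm is pickled as a real object rather than a bare ``__dict__``, loading it back needs no knowledge of the algorithm's class.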

VERSION 1.0.2
=============

3 changes: 2 additions & 1 deletion TODO.md
@@ -1,11 +1,11 @@
TODO
====

* Change the dumping machinery to be more consistent
* Complete FAQ

* Implement some recommendation strategy (like recommend the items with the 10
highest estimation)
* Rewrite stuff on prediction dumping for analysis (l.40 building_on_algo)
* Allow to discount similarities (see aggarwal)
* Support conda?
* Allow incremental updates for some algorithms
@@ -21,6 +21,7 @@ Maybe, Maybe not
Done:
-----

* Change the dumping machinery to be more consistent
* Allow to test on the trainset
* make bibtex entry
* Verbosity of gridsearch still prints stuff because of evaluate. Fix that.
30 changes: 28 additions & 2 deletions doc/source/FAQ.rst
@@ -9,21 +9,45 @@ How to get the :math:`k` nearest neighbors of a user (or item)
How to get the top-:math:`k` recommendations for a user
-------------------------------------------------------

How to save an algorithm for later use
--------------------------------------
.. _save_algorithm_for_later_use:

How to serialize an algorithm
-----------------------------

Prediction algorithms can be serialized and loaded back using the :func:`dump()
<surprise.dump.dump>` and :func:`load() <surprise.dump.load>` functions. Here
is a small example where the SVD algorithm is trained on a dataset and
serialized. It is then reloaded and can be used again for making predictions:

.. literalinclude:: ../../examples/serialize_algorithm.py
:caption: From file ``examples/serialize_algorithm.py``
:name: serialize_algorithm.py
:lines: 9-

How to build my own prediction algorithm
----------------------------------------

See the :ref:`user guide <building_custom_algo>`.

What are raw and inner ids
--------------------------

See :ref:`this note <raw_inner_note>`.

How to use my own dataset with Surprise
---------------------------------------

See the :ref:`user guide <load_custom>`.

How to tune an algorithm's parameters
-------------------------------------

To tune the parameters of your algorithm, you can use the :class:`GridSearch
<surprise.evaluate.GridSearch>` class as described :ref:`here
<tuning_algorithm_parameters>`. After the tuning, you may want to have an
:ref:`unbiased estimate of your algorithm's performance
<unbiased_estimate_after_tuning>`.
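The exhaustive search behind parameter tuning can be sketched in plain Python. The parameter names below mirror common SVD options, and ``evaluate_rmse`` is a placeholder for a real cross-validated score, not Surprise's actual GridSearch API:

```python
from itertools import product

# Hypothetical parameter grid, in the spirit of a grid search.
param_grid = {'n_epochs': [5, 10], 'lr_all': [0.002, 0.005]}

def evaluate_rmse(params):
    # Placeholder score: a real grid search would cross-validate an
    # algorithm built with these parameters and return its RMSE.
    return params['lr_all'] * 100 - params['n_epochs'] * 0.01

# Enumerate every combination of parameter values and keep the best one.
keys = sorted(param_grid)
combos = [dict(zip(keys, values))
          for values in product(*(param_grid[k] for k in keys))]
best = min(combos, key=evaluate_rmse)
```

GridSearch does essentially this enumeration for you, scoring each combination by cross-validation over the requested measures.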

How to get accuracy measures on the training set
------------------------------------------------

@@ -40,6 +64,8 @@ with the :meth:`test()

Check out the example file for more usage examples.

.. _unbiased_estimate_after_tuning:

How to save some data for unbiased accuracy estimation
------------------------------------------------------

2 changes: 1 addition & 1 deletion doc/source/building_custom_algo.rst
@@ -37,7 +37,7 @@ return a dictionary with given details: ::

This dictionary will be stored in the :class:`prediction
<surprise.prediction_algorithms.predictions.Prediction>` as the ``details``
field and can be used :ref:`for later analysis <dumping>`.
field and can be used for later analysis.



22 changes: 0 additions & 22 deletions doc/source/getting_started.rst
@@ -234,28 +234,6 @@ Obviously, it is perfectly fine to use the :meth:`predict()
during a cross-validation process. It's then up to you to ensure that the user
and item ids are present in the trainset though.

.. _dumping:

Dump the predictions for later analysis
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You may want to save your algorithm predictions along with all the useful
information about the algorithm. This way, you can run your algorithm once,
save the results, and go back to them whenever you want to inspect each
prediction in greater detail, and get a good insight into why your algorithm
performs well (or badly!). `Surprise <https://nicolashug.github.io/Surprise/>`_
provides some tools to do that.

You can dump your algorithm predictions either using the :func:`evaluate()
<surprise.evaluate.evaluate>` function, or do it manually with the :func:`dump
<surprise.dump.dump>` function. Either way, an example is worth a thousand
words, so here are a few `jupyter <http://jupyter.org/>`_ notebooks:

- `Dumping and analysis of the KNNBasic algorithm
<http://nbviewer.jupyter.org/github/NicolasHug/Surprise/tree/master/examples/notebooks/KNNBasic_analysis.ipynb/>`_.
- `Comparison of two algorithms
<http://nbviewer.jupyter.org/github/NicolasHug/Surprise/tree/master/examples/notebooks/Compare.ipynb/>`_.

Command line usage
~~~~~~~~~~~~~~~~~~

33 changes: 33 additions & 0 deletions examples/serialize_algorithm.py
@@ -0,0 +1,33 @@
"""
This module illustrates the use of the dump and load methods for serializing an
algorithm. The SVD algorithm is trained on a dataset and then serialized. It is
then reloaded and can be used again for making predictions.
"""

from __future__ import (absolute_import, division, print_function,
unicode_literals)
import os

from surprise import SVD
from surprise import Dataset
from surprise import dump


data = Dataset.load_builtin('ml-100k')
trainset = data.build_full_trainset()

algo = SVD()
algo.train(trainset)

# Compute predictions of the 'original' algorithm.
predictions = algo.test(trainset.build_testset())

# Dump algorithm and reload it.
file_name = os.path.expanduser('~/dump_file')
dump.dump(file_name, algo=algo)
_, loaded_algo = dump.load(file_name)

# We now ensure that the algo is still the same by checking the predictions.
predictions_loaded_algo = loaded_algo.test(trainset.build_testset())
assert predictions == predictions_loaded_algo
print('Predictions are the same')
2 changes: 1 addition & 1 deletion surprise/__init__.py
@@ -21,7 +21,7 @@
from .evaluate import evaluate
from .evaluate import print_perf
from .evaluate import GridSearch
from .dump import dump
from . import dump

__all__ = ['AlgoBase', 'NormalPredictor', 'BaselineOnly', 'KNNBasic',
'KNNWithMeans', 'KNNBaseline', 'SVD', 'SVDpp', 'NMF', 'SlopeOne',
66 changes: 29 additions & 37 deletions surprise/dump.py
@@ -3,63 +3,55 @@
"""

import pickle
import importlib


def dump(file_name, predictions, trainset=None, algo=None):
"""Dump a list of :obj:`predictions
<surprise.prediction_algorithms.predictions.Prediction>` for future
analysis, using Pickle.
def dump(file_name, predictions=None, algo=None, verbose=0):
"""A basic wrapper around Pickle to serialize a list of prediction and/or
an algorithm on drive.
If needed, the :class:`trainset <surprise.dataset.Trainset>` object and the
algorithm can also be dumped. What is dumped is a dictionary with keys
``'predictions'``, ``'trainset'``, and ``'algo'``.
The dumped algorithm won't be a proper :class:`algorithm
<surprise.prediction_algorithms.algo_base.AlgoBase>` object but simply a
dictionary with the algorithm attributes as keys-values (technically, the
``algo.__dict__`` attribute).
See :ref:`User Guide <dumping>` for usage.
What is dumped is a dictionary with keys ``'predictions'`` and ``'algo'``.
Args:
file_name(str): The name (with full path) specifying where to dump the
predictions.
predictions(list of :obj:`Prediction\
<surprise.prediction_algorithms.predictions.Prediction>`): The
predictions to dump.
trainset(:class:`Trainset <surprise.dataset.Trainset>`, optional): The
trainset to dump.
algo(:class:`Algorithm\
<surprise.prediction_algorithms.algo_base.AlgoBase>`, optional):
algorithm to dump.
The algorithm to dump.
verbose(int): Level of verbosity. If ``1``, a message indicates that
the dump was successful. Default is ``0``.
"""

dump_obj = dict()

dump_obj['predictions'] = predictions

if trainset is not None:
dump_obj['trainset'] = trainset

if algo is not None:
dump_obj['algo'] = algo.__dict__ # add algo attributes
dump_obj['algo']['name'] = algo.__class__.__name__

dump_obj = {'predictions': predictions,
'algo': algo
}
pickle.dump(dump_obj, open(file_name, 'wb'))
print('The dump has been saved as file', file_name)

if verbose:
print('The dump has been saved as file', file_name)


def load_algo(file_name):
"""Load a prediction algorithm from a file, using Pickle.
def load(file_name):
"""A basic wrapper around Pickle to deserialize a list of prediction and/or
an algorithm that were dumped on drive using :func:`dump()
<surprise.dump.dump>`.
Args:
file_name(str): The path of the file from which the algorithm is
to be loaded.
Returns:
A tuple ``(predictions, algo)`` where ``predictions`` is a list of
:class:`Prediction
<surprise.prediction_algorithms.predictions.Prediction>` objects and
``algo`` is an :class:`Algorithm
<surprise.prediction_algorithms.algo_base.AlgoBase>` object. Depending
on what was dumped, some of these may be ``None``.
"""

dump_obj = pickle.load(open(file_name, 'rb'))
algo_module = importlib.import_module('surprise.prediction_algorithms')
algo = getattr(algo_module, dump_obj['algo']['name'])()
algo.__dict__ = dump_obj['algo']
return algo

return dump_obj['predictions'], dump_obj['algo']
9 changes: 4 additions & 5 deletions surprise/evaluate.py
Original file line number Diff line number Diff line change
@@ -32,11 +32,10 @@ def evaluate(algo, data, measures=['rmse', 'mae'], with_dump=False,
measures(list of string): The performance measures to compute. Allowed
names are function names as defined in the :mod:`accuracy
<surprise.accuracy>` module. Default is ``['rmse', 'mae']``.
with_dump(bool): If True, the predictions, the trainsets and the
algorithm parameters will be dumped for later further analysis at
each fold (see :ref:`User Guide <dumping>`). The file names will
be set as: ``'<date>-<algorithm name>-<fold number>'``. Default is
``False``.
with_dump(bool): If True, the predictions and the algorithm will be
dumped at each fold for later analysis (see :ref:`FAQ
<save_algorithm_for_later_use>`). The file names will be set as:
``'<date>-<algorithm name>-<fold number>'``. Default is ``False``.
dump_dir(str): The directory where to dump to files. Default is
``'~/.surprise_data/dumps/'``.
verbose(int): Level of verbosity. If 0, nothing is printed. If 1
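The ``'<date>-<algorithm name>-<fold number>'`` naming scheme used by ``evaluate()`` can be sketched as follows; ``make_dump_name`` and its exact date format are illustrative assumptions, not the library's actual code:

```python
import datetime
import os

def make_dump_name(dump_dir, algo_name, fold_i):
    # Illustrative sketch of the '<date>-<algorithm name>-<fold number>'
    # scheme; the date format actually used by evaluate() may differ.
    date = datetime.datetime.now().strftime('%y%m%d-%Hh%Mm%S')
    return os.path.join(dump_dir, '{0}-{1}-fold{2}'.format(date, algo_name, fold_i))

name = make_dump_name(os.path.expanduser('~/.surprise_data/dumps'), 'SVD', 1)
```

Each fold of a cross-validation run thus gets its own timestamped dump file under ``dump_dir``.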
44 changes: 37 additions & 7 deletions tests/test_dump.py
@@ -4,18 +4,48 @@
from __future__ import (absolute_import, division, print_function,
unicode_literals)
import tempfile
import random
import os

from surprise import Prediction
from surprise import AlgoBase
from surprise import Trainset
from surprise import BaselineOnly
from surprise import Dataset
from surprise import Reader
from surprise import dump


def test_dump():
"""Train an algorithm, compute its predictions then dump them.
Ensure that the predictions that are loaded back are the correct ones, and
that the predictions of the dumped algorithm are also equal to the other
ones."""

predictions = [Prediction(None, None, None, None, None)]
algo = AlgoBase()
trainset = Trainset(*[None] * 9)
random.seed(0)

train_file = os.path.join(os.path.dirname(__file__), './u1_ml100k_train')
test_file = os.path.join(os.path.dirname(__file__), './u1_ml100k_test')
data = Dataset.load_from_folds([(train_file, test_file)],
Reader('ml-100k'))

for trainset, testset in data.folds():
pass

algo = BaselineOnly()
algo.train(trainset)
predictions = algo.test(testset)

with tempfile.NamedTemporaryFile() as tmp_file:
dump.dump(tmp_file.name, predictions, algo)
predictions_dumped, algo_dumped = dump.load(tmp_file.name)

predictions_algo_dumped = algo_dumped.test(testset)
assert predictions == predictions_dumped
assert predictions == predictions_algo_dumped


def test_dump_nothing():
"""Ensure that by default None objects are dumped."""
with tempfile.NamedTemporaryFile() as tmp_file:
dump(tmp_file.name, predictions, trainset, algo)
dump.dump(tmp_file.name)
predictions, algo = dump.load(tmp_file.name)
assert predictions is None
assert algo is None
