Added load_from_df() to load a dataset from a pandas dataframe
NicolasHug committed Jun 5, 2017
1 parent bbc0dab commit 757c9a1
Showing 11 changed files with 192 additions and 46 deletions.
2 changes: 2 additions & 0 deletions CHANGELOG.md
@@ -1,6 +1,8 @@
Current
=======

* Added possibility to load a dataset from a pandas dataframe

VERSION 1.0.3
=============

29 changes: 25 additions & 4 deletions doc/source/FAQ.rst
@@ -91,15 +91,36 @@ How to build my own prediction algorithm

There's a whole guide :ref:`here<building_custom_algo>`.

.. _raw_inner_note:

What are raw and inner ids
--------------------------

See :ref:`this note <raw_inner_note>`.
Users and items have a raw id and an inner id. Some methods will use/return a
raw id (e.g. the :meth:`predict()
<surprise.prediction_algorithms.algo_base.AlgoBase.predict>` method), while
others will use/return an inner id.

Raw ids are ids as defined in a rating file or in a pandas dataframe. They can
be strings or numbers. Note though that if the ratings were read from a file,
which is the standard scenario, they are represented as strings (see e.g.
:ref:`here <train_on_whole_trainset>`).

On trainset creation, each raw id is mapped to a unique
integer called inner id, which is a lot more suitable for `Surprise
<https://nicolashug.github.io/Surprise/>`_ to manipulate. Conversions between
raw and inner ids can be done using the :meth:`to_inner_uid()
<surprise.dataset.Trainset.to_inner_uid>`, :meth:`to_inner_iid()
<surprise.dataset.Trainset.to_inner_iid>`, :meth:`to_raw_uid()
<surprise.dataset.Trainset.to_raw_uid>`, and :meth:`to_raw_iid()
<surprise.dataset.Trainset.to_raw_iid>` methods of the :class:`trainset
<surprise.dataset.Trainset>`.
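
For illustration, here is a minimal sketch of these conversions (an
illustration only; it assumes the built-in ml-100k dataset, in which user
'196' and item '242' both appear): ::

    from surprise import Dataset

    data = Dataset.load_builtin('ml-100k')
    trainset = data.build_full_trainset()

    inner_uid = trainset.to_inner_uid('196')  # raw (string) user id -> inner id
    inner_iid = trainset.to_inner_iid('242')  # same for items
    assert trainset.to_raw_uid(inner_uid) == '196'  # and back again
    assert trainset.to_raw_iid(inner_iid) == '242'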


Can I use my own dataset with Surprise
--------------------------------------
Can I use my own dataset with Surprise, and can it be a pandas dataframe
------------------------------------------------------------------------

Yes, you can. See the :ref:`user guide <load_custom>`.
Yes, and yes. See the :ref:`user guide <load_custom>`.

How to tune an algorithm's parameters
--------------------------------------
3 changes: 3 additions & 0 deletions doc/source/conf.py
@@ -299,3 +299,6 @@

# If true, do not generate a @detailmenu in the "Top" node's menu.
#texinfo_no_detailmenu = False

# warn about all references where the target cannot be found
#nitpicky=True
85 changes: 51 additions & 34 deletions doc/source/getting_started.rst
@@ -39,39 +39,64 @@ You can of course use a custom dataset. `Surprise
<https://nicolashug.github.io/Surprise/>`_ offers two ways of loading a custom
dataset:

- you can either specify a single file with all the ratings and
  use the :meth:`split()<surprise.dataset.DatasetAutoFolds.split>` method to
  perform cross-validation ;
- you can either specify a single file (or a pandas dataframe) with all the
  ratings and use the :meth:`split()<surprise.dataset.DatasetAutoFolds.split>`
  method to perform cross-validation, or :ref:`train on the whole dataset
  <train_on_whole_trainset>` ;
- or if your dataset is already split into predefined folds, you can specify a
  list of files for training and testing (both ways are sketched right after
  this list).
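
A rough sketch of these two entry points (the file names and separator here
are placeholders, not part of the library): ::

    from surprise import Dataset, Reader

    reader = Reader(line_format='user item rating', sep=',')

    # Single file: Surprise builds the cross-validation folds itself.
    data = Dataset.load_from_file('ratings.csv', reader=reader)
    data.split(n_folds=5)

    # Predefined folds: one (train_file, test_file) pair per fold.
    folds_files = [('train1.csv', 'test1.csv'),
                   ('train2.csv', 'test2.csv')]
    data = Dataset.load_from_folds(folds_files, reader=reader)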

Either way, you will need to define a :class:`Reader <surprise.dataset.Reader>`
object for `Surprise <https://nicolashug.github.io/Surprise/>`_ to be able to
parse the file(s).

We'll see how to handle both cases with the `movielens-100k dataset
<http://grouplens.org/datasets/movielens/>`_. Of course this is a built-in
dataset, but we will act as if it were not.
parse the file(s). We'll now see how to handle both cases.

.. _load_from_file_example:

Load an entire dataset
~~~~~~~~~~~~~~~~~~~~~~
Load an entire dataset from a file or a dataframe
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- To load a dataset from a file, you will need the :meth:`load_from_file()
  <surprise.dataset.Dataset.load_from_file>` method:

  .. literalinclude:: ../../examples/load_custom_dataset.py
     :caption: From file ``examples/load_custom_dataset.py``
     :name: load_custom_dataset.py
     :lines: 17-26

For more details about readers and how to use them, see the :class:`Reader
class <surprise.dataset.Reader>` documentation.

.. note::
    As you already know from the previous section, the Movielens-100k dataset
    is built-in, so a much quicker way to load the dataset is to do ``data =
    Dataset.load_builtin('ml-100k')``. We will of course ignore this here.
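
For context, the lines pulled in by the ``literalinclude`` above amount to
something like the following (the dataset path is an assumption about where
the ml-100k files live): ::

    import os

    from surprise import Dataset, Reader

    # Path to the ml-100k ratings file (assumed location).
    file_path = os.path.expanduser('~/.surprise_data/ml-100k/ml-100k/u.data')

    reader = Reader(line_format='user item rating timestamp', sep='\t')
    data = Dataset.load_from_file(file_path, reader=reader)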

.. literalinclude:: ../../examples/load_custom_dataset.py
   :caption: From file ``examples/load_custom_dataset.py``
   :name: load_custom_dataset.py
   :lines: 17-26
.. _load_from_df_example:

.. note::
    Actually, as the Movielens-100k dataset is built-in, `Surprise
    <https://nicolashug.github.io/Surprise/>`_ provides a proper reader, so in
    this case we could have just created the reader like this: ::
- To load a dataset from a pandas dataframe, you will need the
  :meth:`load_from_df() <surprise.dataset.Dataset.load_from_df>` method. You
  will also need a :class:`Reader<surprise.dataset.Reader>` object, but only
  the ``rating_scale`` parameter must be specified. The dataframe must have
  three columns, corresponding to the user (raw) ids, the item (raw) ids, and
  the ratings, in this order. Each row thus corresponds to a given rating.
  This is not restrictive, as you can easily reorder the columns of your
  dataframe.

        reader = Reader('ml-100k')
  .. literalinclude:: ../../examples/load_from_dataframe.py
     :caption: From file ``examples/load_from_dataframe.py``
     :name: load_from_dataframe.py
     :lines: 19-28

  The dataframe initially looks like this:

  .. parsed-literal::

            itemID  rating     userID
        0        1       3          9
        1        1       2         32
        2        1       4          2
        3        2       3         45
        4        2       1   user_foo
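
Since only the column order matters to :meth:`load_from_df()
<surprise.dataset.Dataset.load_from_df>`, a dataframe laid out as above just
needs to be reindexed before being passed in; a minimal sketch: ::

    # Reorder the columns to (user, item, rating).
    data = Dataset.load_from_df(df[['userID', 'itemID', 'rating']], reader)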
For more details about readers and how to use them, see the :class:`Reader
class <surprise.dataset.Reader>` documentation.

.. _load_from_folds_example:

@@ -209,26 +234,18 @@ is call the :meth:`predict()
   :name: query_for_predictions2.py
   :lines: 28-32

The :meth:`predict()
<surprise.prediction_algorithms.algo_base.AlgoBase.predict>` method uses
**raw** ids (read :ref:`this <raw_inner_note>`). As the dataset we have used
has been read from a file, the raw ids are strings (even if they represent
numbers).

If the :meth:`predict()
<surprise.prediction_algorithms.algo_base.AlgoBase.predict>` method is called
with user or item ids that were not part of the trainset, it's up to the
algorithm to decide whether it can still make a prediction or not. If it
can't, :meth:`predict()
<surprise.prediction_algorithms.algo_base.AlgoBase.predict>` will default to
the mean of all ratings :math:`\mu`.
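
Here is a small sketch of both behaviours (an illustration only; the raw ids
and the trained ``algo`` are assumed to come from the example above): ::

    # Raw ids are strings when the ratings were read from a file.
    pred = algo.predict('196', '302', r_ui=4, verbose=True)

    # For a user absent from the trainset, the estimate defaults to the mean
    # of all ratings.
    pred = algo.predict('unknown_user', '302', verbose=True)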

.. _raw_inner_note:
.. note::
    Raw ids are ids as defined in a rating file. They can be strings, numbers,
    or whatever (but are still represented as strings). On trainset creation,
    each raw id is mapped to a unique integer called inner id, which is a lot
    more suitable for `Surprise <https://nicolashug.github.io/Surprise/>`_ to
    manipulate. Conversions between raw and inner ids can be done using the
    :meth:`to_inner_uid() <surprise.dataset.Trainset.to_inner_uid>`,
    :meth:`to_inner_iid() <surprise.dataset.Trainset.to_inner_iid>`,
    :meth:`to_raw_uid() <surprise.dataset.Trainset.to_raw_uid>`, and
    :meth:`to_raw_iid() <surprise.dataset.Trainset.to_raw_iid>` methods of the
    :class:`trainset <surprise.dataset.Trainset>`.

Obviously, it is perfectly fine to use the :meth:`predict()
<surprise.prediction_algorithms.algo_base.AlgoBase.predict>` method directly
during a cross-validation process. It's then up to you to ensure that the user
2 changes: 2 additions & 0 deletions doc/source/spelling_wordlist.txt
@@ -16,6 +16,8 @@ slope_one
accuracies
NN
deserialize
dataframe
dataframes



2 changes: 1 addition & 1 deletion examples/load_custom_dataset.py
@@ -23,7 +23,7 @@
reader = Reader(line_format='user item rating timestamp', sep='\t')

data = Dataset.load_from_file(file_path, reader=reader)
data.split(n_folds=5)
data.split(n_folds=5) # data can now be used normally

# We'll use an algorithm that predicts baseline estimates.
algo = BaselineOnly()
32 changes: 32 additions & 0 deletions examples/load_from_dataframe.py
@@ -0,0 +1,32 @@
"""
This module describes how to load a dataset from a pandas dataframe.
"""

from __future__ import (absolute_import, division, print_function,
                        unicode_literals)

import pandas as pd

from surprise import NormalPredictor
from surprise import Dataset
from surprise import Reader


# Dummy algo
algo = NormalPredictor()

# Creation of the dataframe. Column names are irrelevant.
ratings_dict = {'itemID': [1, 1, 1, 2, 2],
                'userID': [9, 32, 2, 45, 'user_foo'],
                'rating': [3, 2, 4, 3, 1]}
df = pd.DataFrame(ratings_dict)

# A reader is still needed, but only the rating_scale param is required.
reader = Reader(rating_scale=(1, 5))
# The columns must correspond to user id, item id and ratings (in that order).
data = Dataset.load_from_df(df[['userID', 'itemID', 'rating']], reader)
data.split(2) # data can now be used normally

for trainset, testset in data.folds():
algo.train(trainset)
algo.test(testset)
1 change: 1 addition & 0 deletions requirements_dev.txt
@@ -8,3 +8,4 @@ sphinx_rtd_theme
sphinxcontrib-bibtex
sphinxcontrib-spelling
flake8>=3.2.1
pandas
4 changes: 2 additions & 2 deletions requirements_travis.txt
@@ -1,4 +1,4 @@
# Requirements file for development
# Requirements file for travis
numpy>=1.11.2
Cython>=0.24.1
six>=1.10.0
@@ -7,4 +7,4 @@ sphinx>=1.4.9
sphinx_rtd_theme
sphinxcontrib-bibtex
flake8>=3.2.1

pandas
36 changes: 31 additions & 5 deletions surprise/dataset.py
@@ -196,6 +196,24 @@ def load_from_folds(cls, folds_files, reader):

        return DatasetUserFolds(folds_files=folds_files, reader=reader)

    @classmethod
    def load_from_df(cls, df, reader):
        """Load a dataset from a pandas dataframe.

        Use this if you want to use a custom dataset that is stored in a
        pandas dataframe. See the :ref:`User Guide<load_from_df_example>` for
        an example.

        Args:
            df(`Dataframe`): The dataframe containing the ratings. It must
                have three columns, corresponding to the user (raw) ids, the
                item (raw) ids, and the ratings, in this order.
            reader(:obj:`Reader`): A reader to parse the ratings. Only the
                ``rating_scale`` field needs to be specified.
        """

        return DatasetAutoFolds(reader=reader, df=df)
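
    # Usage sketch (an illustration, assuming a dataframe ``df`` whose
    # columns are already in (user, item, rating) order):
    #
    #     reader = Reader(rating_scale=(1, 5))
    #     data = Dataset.load_from_df(df, reader)
    #     data.split(n_folds=3)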

    def read_ratings(self, file_name):
        """Return a list of ratings (user, item, rating, timestamp) read from
        file_name"""
@@ -297,13 +315,21 @@ class DatasetAutoFolds(Dataset):
    cross-validation) are not predefined. (Or for when there are no folds at
    all)."""

    def __init__(self, ratings_file=None, reader=None):
    def __init__(self, ratings_file=None, reader=None, df=None):

        Dataset.__init__(self, reader)
        self.ratings_file = ratings_file
        self.n_folds = 5
        self.shuffle = True
        self.raw_ratings = self.read_ratings(self.ratings_file)

        if ratings_file is not None:
            self.ratings_file = ratings_file
            self.raw_ratings = self.read_ratings(self.ratings_file)
        elif df is not None:
            self.df = df
            self.raw_ratings = [(uid, iid, r, None) for (uid, iid, r) in
                                self.df.itertuples(index=False)]
        else:
            raise ValueError('Must specify ratings file or dataframe.')
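
        # Illustration: with the dataframe from the docs example, the list
        # comprehension above yields raw ratings like (9, 1, 3, None), the
        # trailing None standing in for the missing timestamp.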

    def build_full_trainset(self):
        """Do not split the dataset into folds and just return a trainset as
@@ -382,7 +408,7 @@ class Reader():
            Accepted values are 'ml-100k', 'ml-1m', and 'jester'. Default
            is ``None``.
        line_format(:obj:`string`): The field names, in the order in which
            they are encountered on a line. Example: ``'item user rating'``.
            they are encountered on a line. Default is ``'user item rating'``.
        sep(char): the separator between fields. Example: ``';'``.
        rating_scale(:obj:`tuple`, optional): The rating scale used for every
            rating. Default is ``(1, 5)``.
@@ -391,7 +417,7 @@
"""

    def __init__(self, name=None, line_format=None, sep=None,
    def __init__(self, name=None, line_format='user item rating', sep=None,
                 rating_scale=(1, 5), skip_lines=0):

        if name:
42 changes: 42 additions & 0 deletions tests/test_dataset.py
@@ -7,6 +7,7 @@
import os

import pytest
import pandas as pd

from surprise import BaselineOnly
from surprise import Dataset
@@ -155,3 +156,44 @@ def test_trainset_testset():
    assert ('user3', 'item1', trainset.global_mean) not in testset
    assert ('user0', 'item1', trainset.global_mean) in testset
    assert ('user3', 'item0', trainset.global_mean) in testset


def test_load_from_df():
"""Ensure reading dataset from pandas dataframe is OK."""

# DF creation.
ratings_dict = {'itemID': [1, 1, 1, 2, 2],
'userID': [9, 32, 2, 45, 'user_foo'],
'rating': [3, 2, 4, 3, 1]}
df = pd.DataFrame(ratings_dict)

reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(df[['userID', 'itemID', 'rating']], reader)

# Assert split and folds can be used without problems
data.split(2)
assert sum(1 for _ in data.folds()) == 2

# assert users and items are correctly mapped
trainset = data.build_full_trainset()
assert trainset.knows_user(trainset.to_inner_uid(9))
assert trainset.knows_user(trainset.to_inner_uid('user_foo'))
assert trainset.knows_item(trainset.to_inner_iid(2))

# assert r(9, 1) = 3 and r(2, 1) = 4
uid9 = trainset.to_inner_uid(9)
uid2 = trainset.to_inner_uid(2)
iid1 = trainset.to_inner_iid(1)
assert trainset.ur[uid9] == [(iid1, 3)]
assert trainset.ur[uid2] == [(iid1, 4)]

# assert at least rating file or dataframe must be specified
with pytest.raises(ValueError):
data = Dataset.load_from_df(None, None)

# mess up the column ordering and assert that users are not correctly
# mapped
data = Dataset.load_from_df(df[['rating', 'itemID', 'userID']], reader)
trainset = data.build_full_trainset()
with pytest.raises(ValueError):
trainset.to_inner_uid('user_foo')
