Added load_from_df() to load a dataset from a pandas dataframe
NicolasHug committed Jun 5, 2017
1 parent bbc0dab commit 757c9a1
Showing 11 changed files with 192 additions and 46 deletions.
2 changes: 2 additions & 0 deletions CHANGELOG.md
@@ -1,6 +1,8 @@
Current
=======

* Added possibility to load a dataset from a pandas dataframe

VERSION 1.0.3
=============

29 changes: 25 additions & 4 deletions doc/source/FAQ.rst
@@ -91,15 +91,36 @@ How to build my own prediction algorithm

There's a whole guide :ref:`here<building_custom_algo>`.

.. _raw_inner_note:

What are raw and inner ids
--------------------------

See :ref:`this note <raw_inner_note>`.
Users and items have a raw id and an inner id. Some methods will use/return a
raw id (e.g. the :meth:`predict()
<surprise.prediction_algorithms.algo_base.AlgoBase.predict>` method), while
others will use/return an inner id.

Raw ids are ids as defined in a rating file or in a pandas dataframe. They can
be strings or numbers. Note though that if the ratings were read from a file,
which is the standard scenario, they are represented as strings (see e.g.
:ref:`here <train_on_whole_trainset>`).

On trainset creation, each raw id is mapped to a unique
integer called inner id, which is a lot more suitable for `Surprise
<https://nicolashug.github.io/Surprise/>`_ to manipulate. Conversions between
raw and inner ids can be done using the :meth:`to_inner_uid()
<surprise.dataset.Trainset.to_inner_uid>`, :meth:`to_inner_iid()
<surprise.dataset.Trainset.to_inner_iid>`, :meth:`to_raw_uid()
<surprise.dataset.Trainset.to_raw_uid>`, and :meth:`to_raw_iid()
<surprise.dataset.Trainset.to_raw_iid>` methods of the :class:`trainset
<surprise.dataset.Trainset>`.
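
For illustration, here is a minimal sketch of these conversions (an
illustration only; it assumes the built-in ml-100k dataset, in which user
'196' and item '242' both appear): ::

    from surprise import Dataset

    data = Dataset.load_builtin('ml-100k')
    trainset = data.build_full_trainset()

    inner_uid = trainset.to_inner_uid('196')  # raw (string) user id -> inner id
    inner_iid = trainset.to_inner_iid('242')  # same for items
    assert trainset.to_raw_uid(inner_uid) == '196'  # and back again
    assert trainset.to_raw_iid(inner_iid) == '242'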


Can I use my own dataset with Surprise
--------------------------------------
Can I use my own dataset with Surprise, and can it be a pandas dataframe
------------------------------------------------------------------------

Yes, you can. See the :ref:`user guide <load_custom>`.
Yes, and yes. See the :ref:`user guide <load_custom>`.

How to tune an algorithm's parameters
--------------------------------------
3 changes: 3 additions & 0 deletions doc/source/conf.py
@@ -299,3 +299,6 @@

# If true, do not generate a @detailmenu in the "Top" node's menu.
#texinfo_no_detailmenu = False

# warn about all references where the target cannot be found
#nitpicky=True
85 changes: 51 additions & 34 deletions doc/source/getting_started.rst
@@ -39,39 +39,64 @@ You can of course use a custom dataset. `Surprise
<https://nicolashug.github.io/Surprise/>`_ offers two ways of loading a custom
dataset:

- you can either specify a single file with all the ratings and
  use the :meth:`split()<surprise.dataset.DatasetAutoFolds.split>` method to
  perform cross-validation ;
- you can either specify a single file (or a pandas dataframe) with all the
  ratings and use the :meth:`split()<surprise.dataset.DatasetAutoFolds.split>`
  method to perform cross-validation, or :ref:`train on the whole dataset
  <train_on_whole_trainset>` ;
- or if your dataset is already split into predefined folds, you can specify a
  list of files for training and testing (both ways are sketched right after
  this list).
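
A rough sketch of these two entry points (the file names and separator here
are placeholders, not part of the library): ::

    from surprise import Dataset, Reader

    reader = Reader(line_format='user item rating', sep=',')

    # Single file: Surprise builds the cross-validation folds itself.
    data = Dataset.load_from_file('ratings.csv', reader=reader)
    data.split(n_folds=5)

    # Predefined folds: one (train_file, test_file) pair per fold.
    folds_files = [('train1.csv', 'test1.csv'),
                   ('train2.csv', 'test2.csv')]
    data = Dataset.load_from_folds(folds_files, reader=reader)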

Either way, you will need to define a :class:`Reader <surprise.dataset.Reader>`
object for `Surprise <https://nicolashug.github.io/Surprise/>`_ to be able to
parse the file(s).

We'll see how to handle both cases with the `movielens-100k dataset
<http://grouplens.org/datasets/movielens/>`_. Of course this is a built-in
dataset, but we will act as if it were not.
parse the file(s). We'll now see how to handle both cases.

.. _load_from_file_example:

Load an entire dataset
~~~~~~~~~~~~~~~~~~~~~~
Load an entire dataset from a file or a dataframe
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- To load a dataset from a file, you will need the :meth:`load_from_file()
  <surprise.dataset.Dataset.load_from_file>` method:

  .. literalinclude:: ../../examples/load_custom_dataset.py
     :caption: From file ``examples/load_custom_dataset.py``
     :name: load_custom_dataset.py
     :lines: 17-26

For more details about readers and how to use them, see the :class:`Reader
class <surprise.dataset.Reader>` documentation.

.. note::
    As you already know from the previous section, the Movielens-100k dataset
    is built-in, so a much quicker way to load the dataset is to do ``data =
    Dataset.load_builtin('ml-100k')``. We will of course ignore this here.
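
For context, the lines pulled in by the ``literalinclude`` above amount to
something like the following (the dataset path is an assumption about where
the ml-100k files live): ::

    import os

    from surprise import Dataset, Reader

    # Path to the ml-100k ratings file (assumed location).
    file_path = os.path.expanduser('~/.surprise_data/ml-100k/ml-100k/u.data')

    reader = Reader(line_format='user item rating timestamp', sep='\t')
    data = Dataset.load_from_file(file_path, reader=reader)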

.. literalinclude:: ../../examples/load_custom_dataset.py
   :caption: From file ``examples/load_custom_dataset.py``
   :name: load_custom_dataset.py
   :lines: 17-26
.. _load_from_df_example:

.. note::
    Actually, as the Movielens-100k dataset is built-in, `Surprise
    <https://nicolashug.github.io/Surprise/>`_ provides a proper reader, so in
    this case we could have just created the reader like this: ::
- To load a dataset from a pandas dataframe, you will need the
  :meth:`load_from_df() <surprise.dataset.Dataset.load_from_df>` method. You
  will also need a :class:`Reader<surprise.dataset.Reader>` object, but only
  the ``rating_scale`` parameter must be specified. The dataframe must have
  three columns, corresponding to the user (raw) ids, the item (raw) ids, and
  the ratings, in this order. Each row thus corresponds to a given rating.
  This is not restrictive, as you can easily reorder the columns of your
  dataframe.

        reader = Reader('ml-100k')
  .. literalinclude:: ../../examples/load_from_dataframe.py
     :caption: From file ``examples/load_from_dataframe.py``
     :name: load_from_dataframe.py
     :lines: 19-28

  The dataframe initially looks like this:

  .. parsed-literal::

            itemID  rating     userID
        0        1       3          9
        1        1       2         32
        2        1       4          2
        3        2       3         45
        4        2       1   user_foo
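
Since only the column order matters to :meth:`load_from_df()
<surprise.dataset.Dataset.load_from_df>`, a dataframe laid out as above just
needs to be reindexed before being passed in; a minimal sketch: ::

    # Reorder the columns to (user, item, rating).
    data = Dataset.load_from_df(df[['userID', 'itemID', 'rating']], reader)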
For more details about readers and how to use them, see the :class:`Reader
class <surprise.dataset.Reader>` documentation.

.. _load_from_folds_example:

@@ -209,26 +234,18 @@ is call the :meth:`predict()
   :name: query_for_predictions2.py
   :lines: 28-32

The :meth:`predict()
<surprise.prediction_algorithms.algo_base.AlgoBase.predict>` method uses
**raw** ids (read :ref:`this <raw_inner_note>`). As the dataset we have used
has been read from a file, the raw ids are strings (even if they represent
numbers).

If the :meth:`predict()
<surprise.prediction_algorithms.algo_base.AlgoBase.predict>` method is called
with user or item ids that were not part of the trainset, it's up to the
algorithm to decide whether it can still make a prediction or not. If it
can't, :meth:`predict()
<surprise.prediction_algorithms.algo_base.AlgoBase.predict>` will default to
the mean of all ratings :math:`\mu`.
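
Here is a small sketch of both behaviours (an illustration only; the raw ids
and the trained ``algo`` are assumed to come from the example above): ::

    # Raw ids are strings when the ratings were read from a file.
    pred = algo.predict('196', '302', r_ui=4, verbose=True)

    # For a user absent from the trainset, the estimate defaults to the mean
    # of all ratings.
    pred = algo.predict('unknown_user', '302', verbose=True)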

.. _raw_inner_note:
.. note::
    Raw ids are ids as defined in a rating file. They can be strings, numbers,
    or whatever (but are still represented as strings). On trainset creation,
    each raw id is mapped to a unique integer called inner id, which is a lot
    more suitable for `Surprise <https://nicolashug.github.io/Surprise/>`_ to
    manipulate. Conversions between raw and inner ids can be done using the
    :meth:`to_inner_uid() <surprise.dataset.Trainset.to_inner_uid>`,
    :meth:`to_inner_iid() <surprise.dataset.Trainset.to_inner_iid>`,
    :meth:`to_raw_uid() <surprise.dataset.Trainset.to_raw_uid>`, and
    :meth:`to_raw_iid() <surprise.dataset.Trainset.to_raw_iid>` methods of the
    :class:`trainset <surprise.dataset.Trainset>`.

Obviously, it is perfectly fine to use the :meth:`predict()
<surprise.prediction_algorithms.algo_base.AlgoBase.predict>` method directly
during a cross-validation process. It's then up to you to ensure that the user
2 changes: 2 additions & 0 deletions doc/source/spelling_wordlist.txt
@@ -16,6 +16,8 @@ slope_one
accuracies
NN
deserialize
dataframe
dataframes



2 changes: 1 addition & 1 deletion examples/load_custom_dataset.py
@@ -23,7 +23,7 @@
reader = Reader(line_format='user item rating timestamp', sep='\t')

data = Dataset.load_from_file(file_path, reader=reader)
data.split(n_folds=5)
data.split(n_folds=5) # data can now be used normally

# We'll use an algorithm that predicts baseline estimates.
algo = BaselineOnly()
32 changes: 32 additions & 0 deletions examples/load_from_dataframe.py
@@ -0,0 +1,32 @@
"""
This module describes how to load a dataset from a pandas dataframe.
"""

from __future__ import (absolute_import, division, print_function,
                        unicode_literals)

import pandas as pd

from surprise import NormalPredictor
from surprise import Dataset
from surprise import Reader


# Dummy algo
algo = NormalPredictor()

# Creation of the dataframe. Column names are irrelevant.
ratings_dict = {'itemID': [1, 1, 1, 2, 2],
                'userID': [9, 32, 2, 45, 'user_foo'],
                'rating': [3, 2, 4, 3, 1]}
df = pd.DataFrame(ratings_dict)

# A reader is still needed, but only the rating_scale param is required.
reader = Reader(rating_scale=(1, 5))
# The columns must correspond to user id, item id and ratings (in that order).
data = Dataset.load_from_df(df[['userID', 'itemID', 'rating']], reader)
data.split(2) # data can now be used normally

for trainset, testset in data.folds():
algo.train(trainset)
algo.test(testset)
1 change: 1 addition & 0 deletions requirements_dev.txt
@@ -8,3 +8,4 @@ sphinx_rtd_theme
sphinxcontrib-bibtex
sphinxcontrib-spelling
flake8>=3.2.1
pandas
4 changes: 2 additions & 2 deletions requirements_travis.txt
@@ -1,4 +1,4 @@
# Requirements file for development
# Requirements file for travis
numpy>=1.11.2
Cython>=0.24.1
six>=1.10.0
@@ -7,4 +7,4 @@ sphinx>=1.4.9
sphinx_rtd_theme
sphinxcontrib-bibtex
flake8>=3.2.1

pandas
36 changes: 31 additions & 5 deletions surprise/dataset.py
@@ -196,6 +196,24 @@ def load_from_folds(cls, folds_files, reader):

        return DatasetUserFolds(folds_files=folds_files, reader=reader)

    @classmethod
    def load_from_df(cls, df, reader):
        """Load a dataset from a pandas dataframe.

        Use this if you want to use a custom dataset that is stored in a
        pandas dataframe. See the :ref:`User Guide<load_from_df_example>` for
        an example.

        Args:
            df(`Dataframe`): The dataframe containing the ratings. It must
                have three columns, corresponding to the user (raw) ids, the
                item (raw) ids, and the ratings, in this order.
            reader(:obj:`Reader`): A reader to parse the ratings. Only the
                ``rating_scale`` field needs to be specified.
        """

        return DatasetAutoFolds(reader=reader, df=df)
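
    # Usage sketch (an illustration, assuming a dataframe ``df`` whose
    # columns are already in (user, item, rating) order):
    #
    #     reader = Reader(rating_scale=(1, 5))
    #     data = Dataset.load_from_df(df, reader)
    #     data.split(n_folds=3)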

    def read_ratings(self, file_name):
        """Return a list of ratings (user, item, rating, timestamp) read from
        file_name"""
@@ -297,13 +315,21 @@ class DatasetAutoFolds(Dataset):
    cross-validation) are not predefined. (Or for when there are no folds at
    all)."""

    def __init__(self, ratings_file=None, reader=None):
    def __init__(self, ratings_file=None, reader=None, df=None):

        Dataset.__init__(self, reader)
        self.ratings_file = ratings_file
        self.n_folds = 5
        self.shuffle = True
        self.raw_ratings = self.read_ratings(self.ratings_file)

        if ratings_file is not None:
            self.ratings_file = ratings_file
            self.raw_ratings = self.read_ratings(self.ratings_file)
        elif df is not None:
            self.df = df
            self.raw_ratings = [(uid, iid, r, None) for (uid, iid, r) in
                                self.df.itertuples(index=False)]
        else:
            raise ValueError('Must specify ratings file or dataframe.')
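
        # Illustration: with the dataframe from the docs example, the list
        # comprehension above yields raw ratings like (9, 1, 3, None), the
        # trailing None standing in for the missing timestamp.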

    def build_full_trainset(self):
        """Do not split the dataset into folds and just return a trainset as
@@ -382,7 +408,7 @@ class Reader():
            Accepted values are 'ml-100k', 'ml-1m', and 'jester'. Default
            is ``None``.
        line_format(:obj:`string`): The field names, in the order in which
            they are encountered on a line. Example: ``'item user rating'``.
            they are encountered on a line. Default is ``'user item rating'``.
        sep(char): the separator between fields. Example: ``';'``.
        rating_scale(:obj:`tuple`, optional): The rating scale used for every
            rating. Default is ``(1, 5)``.
@@ -391,7 +417,7 @@
"""

    def __init__(self, name=None, line_format=None, sep=None,
    def __init__(self, name=None, line_format='user item rating', sep=None,
                 rating_scale=(1, 5), skip_lines=0):

        if name:
42 changes: 42 additions & 0 deletions tests/test_dataset.py
@@ -7,6 +7,7 @@
import os

import pytest
import pandas as pd

from surprise import BaselineOnly
from surprise import Dataset
@@ -155,3 +156,44 @@ def test_trainset_testset():
    assert ('user3', 'item1', trainset.global_mean) not in testset
    assert ('user0', 'item1', trainset.global_mean) in testset
    assert ('user3', 'item0', trainset.global_mean) in testset


def test_load_from_df():
"""Ensure reading dataset from pandas dataframe is OK."""

# DF creation.
ratings_dict = {'itemID': [1, 1, 1, 2, 2],
'userID': [9, 32, 2, 45, 'user_foo'],
'rating': [3, 2, 4, 3, 1]}
df = pd.DataFrame(ratings_dict)

reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(df[['userID', 'itemID', 'rating']], reader)

# Assert split and folds can be used without problems
data.split(2)
assert sum(1 for _ in data.folds()) == 2

# assert users and items are correctly mapped
trainset = data.build_full_trainset()
assert trainset.knows_user(trainset.to_inner_uid(9))
assert trainset.knows_user(trainset.to_inner_uid('user_foo'))
assert trainset.knows_item(trainset.to_inner_iid(2))

# assert r(9, 1) = 3 and r(2, 1) = 4
uid9 = trainset.to_inner_uid(9)
uid2 = trainset.to_inner_uid(2)
iid1 = trainset.to_inner_iid(1)
assert trainset.ur[uid9] == [(iid1, 3)]
assert trainset.ur[uid2] == [(iid1, 4)]

# assert at least rating file or dataframe must be specified
with pytest.raises(ValueError):
data = Dataset.load_from_df(None, None)

# mess up the column ordering and assert that users are not correctly
# mapped
data = Dataset.load_from_df(df[['rating', 'itemID', 'userID']], reader)
trainset = data.build_full_trainset()
with pytest.raises(ValueError):
trainset.to_inner_uid('user_foo')
