Compute and use statistics on the whole dataset #231

Open
fvisin opened this issue Sep 15, 2015 · 5 comments

@fvisin

fvisin commented Sep 15, 2015

To the best of my knowledge, it is not currently possible to compute statistics on the whole dataset, e.g., the per-pixel mean and standard deviation. This wouldn't fit well in the Transformer class, as it is not an on-the-fly transformation of the data, so I think it should be a separate class.

Ideally, it should be possible to apply this class either to a Stream coming from a Dataset or to one coming from a Transformer (I hope my terminology is sound here, correct me if I am wrong), as in some cases the data needs to be preprocessed before statistics can be collected.

This issue naturally pairs with #230 , as I can imagine one would want to save the result of this computation in a (potentially new) dataset. In this case, it might be convenient to define how these statistics should be saved in an h5 file, to enforce consistency among the datasets.
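
For concreteness, here is a minimal sketch of what such a class might look like, accumulating per-pixel statistics in a single pass over a stream. The class name, its `accumulate()`/`result()` API, and the `'features'` source are all illustrative assumptions, not existing Fuel API:

```python
import numpy

class PerPixelStatistics(object):
    """Accumulate per-pixel mean and std in one pass over a data stream."""

    def __init__(self):
        self.n = 0
        self.sum = None
        self.sum_of_squares = None

    def accumulate(self, batch):
        batch = numpy.asarray(batch, dtype='float64')
        if self.sum is None:
            self.sum = numpy.zeros(batch.shape[1:])
            self.sum_of_squares = numpy.zeros(batch.shape[1:])
        self.n += batch.shape[0]
        self.sum += batch.sum(axis=0)
        self.sum_of_squares += (batch ** 2).sum(axis=0)

    def result(self):
        mean = self.sum / self.n
        std = numpy.sqrt(self.sum_of_squares / self.n - mean ** 2)
        return mean, std

# Works the same on a stream from a Dataset or from a Transformer,
# assuming `data_stream` is any Fuel stream with a 'features' source:
stats = PerPixelStatistics()
for batch in data_stream.get_epoch_iterator(as_dict=True):
    stats.accumulate(batch['features'])
mean, std = stats.result()
```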

@dwf
Contributor

dwf commented Sep 15, 2015

There was talk of using Blocks Aggregators for this.

It's a good idea, though we should think about how to store this information in a standardized way.

@rizar thoughts?


@rizar
Contributor

rizar commented Sep 17, 2015

I am not sure I understand what kind of support from Fuel you guys want. Yes, statistics can be computed by iterating over the dataset. Indeed, the DatasetEvaluator from Blocks might be a great way of doing this without worrying about memory usage. But the output can be anything...

On the other hand, I do find it unfortunate that I cannot use Fuel to create a whitened version of my dataset.
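
For example, with the DatasetEvaluator approach, a per-pixel mean could be aggregated over the whole stream along these lines (a rough sketch; the 4D `'features'` source and the variable names are assumptions):

```python
import theano.tensor as tensor
from blocks.monitoring import aggregation
from blocks.monitoring.evaluators import DatasetEvaluator

# Symbolic batch of images: (batch, channels, height, width).
features = tensor.tensor4('features')

# Aggregate sum over examples / total number of examples across batches.
pixel_mean = aggregation.mean(features.sum(axis=0), features.shape[0])
pixel_mean.name = 'pixel_mean'

evaluator = DatasetEvaluator([pixel_mean])
# statistics = evaluator.evaluate(data_stream)  # -> {'pixel_mean': array}
```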

@fvisin
Author

fvisin commented Sep 22, 2015

This is what I would like from Fuel:

  1. an offline preprocessor to iterate over the whole dataset and compute statistics. Since this operation has to be done every time a new dataset wrapper is created, it would be nice to have a utility class that standardizes the way it's done and avoids reinventing the wheel in each converter;
  2. a standardized way to store this information in the dataset file (one possible layout is sketched after this list);
  3. once you have 1) and 2), objects that use that information on the fly (e.g., normalization, whitening, etc.).
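
For point 2, one possible layout (an assumption for illustration, not an agreed convention) would be a dedicated group in the same HDF5 file the converter produces:

```python
import h5py
import numpy

# Placeholder values; in practice these come from the offline
# preprocessor of point 1.
mean = numpy.zeros((3, 32, 32))
std = numpy.ones((3, 32, 32))

with h5py.File('dataset.hdf5', 'a') as f:
    group = f.require_group('statistics/features')
    group.create_dataset('mean', data=mean)
    group.create_dataset('std', data=std)
```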

@fvisin fvisin changed the title Add class to compute statistics of the whole dataset Compute and use statistics on the whole dataset Sep 22, 2015
@dwf
Contributor

dwf commented Sep 30, 2015

@fvisin I think we should split this ticket up. We should tackle the offline preprocessor first, as it is potentially useful without the other two pieces. My next priority would be the accompanying transformers that can use these statistics to do things like #242 (your point 3), so that users can at least store this information manually while we figure out 2).

We should design the API so that it can be used as an optional step at dataset conversion time, letting you say something like "also, store a whitening matrix for this foo as foo_whitening_matrix in the HDF5 file". For large streaming datasets we should think about a generic conversion pipeline that invokes the aggregation methods of whatever statistics objects you specify as the data is being stored/converted.
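
A rough sketch of what that generic hook could look like (every name here is hypothetical):

```python
def convert_with_statistics(batches, write_batch, aggregators):
    """Write each batch out and feed it to every aggregator on the fly.

    `batches` is any iterable of numpy batches, `write_batch` is the
    converter's usual writing routine, and `aggregators` maps a name to
    an object with accumulate()/result() methods (e.g. the sketch above).
    """
    for batch in batches:
        write_batch(batch)
        for aggregator in aggregators.values():
            aggregator.accumulate(batch)
    # Statistics to be stored next to the data, e.g. under 'statistics/'.
    return {name: agg.result() for name, agg in aggregators.items()}
```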

@fvisin
Author

fvisin commented Sep 30, 2015

@dwf I agree on everything. The transformers and the offline preprocessor can indeed be tackled independently, provided we design them with the goal of making them work together at some point.

Having transformers that use global statistics would be very handy even without the offline preprocessor: it is quite common to compute and store global statistics, and I would expect people to already have them somewhere for a number of datasets. For example, I pickled the pixel mean and variance of ImageNet when I worked on it; I could plug those into Fuel easily with such a Transformer.
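
In fact, Fuel's existing ScaleAndShift transformer already covers this case. A sketch, assuming the statistics were pickled as a (mean, variance) pair into a hypothetical imagenet_stats.pkl and that `data_stream` is already defined:

```python
import pickle

from fuel.transformers import ScaleAndShift

with open('imagenet_stats.pkl', 'rb') as f:  # hypothetical pickle file
    mean, variance = pickle.load(f)

# (x - mean) / std  ==  x * (1 / std) + (-mean / std)
std = variance ** 0.5
normalized_stream = ScaleAndShift(data_stream,
                                  scale=1.0 / std,
                                  shift=-mean / std,
                                  which_sources=('features',))
```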
