Compute and use statistics on the whole dataset #231

Open
fvisin opened this issue Sep 15, 2015 · 5 comments

@fvisin

fvisin commented Sep 15, 2015

To the best of my knowledge, it is not currently possible to compute statistics on the whole dataset, e.g., the per-pixel mean and standard deviation. This wouldn't fit well in the Transformer class, as it is not an on-the-fly transformation of the data, so I think it should be a separate class.

Ideally, it should be possible to apply this class either to a Stream coming from a Dataset or to one coming from a Transformer (I hope my terminology is sound here, correct me if I am wrong), as in some cases the data needs to be preprocessed before statistics can be collected.

This issue naturally pairs with #230 , as I can imagine one would want to save the result of this computation in a (potentially new) dataset. In this case, it might be convenient to define how these statistics should be saved in an h5 file, to enforce consistency among the datasets.
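
For concreteness, here is a minimal sketch of what such a class might look like, accumulating per-pixel statistics in a single pass over a stream. The class name, its `accumulate()`/`result()` API, and the `'features'` source are all illustrative assumptions, not existing Fuel API:

```python
import numpy

class PerPixelStatistics(object):
    """Accumulate per-pixel mean and std in one pass over a data stream."""

    def __init__(self):
        self.n = 0
        self.sum = None
        self.sum_of_squares = None

    def accumulate(self, batch):
        batch = numpy.asarray(batch, dtype='float64')
        if self.sum is None:
            self.sum = numpy.zeros(batch.shape[1:])
            self.sum_of_squares = numpy.zeros(batch.shape[1:])
        self.n += batch.shape[0]
        self.sum += batch.sum(axis=0)
        self.sum_of_squares += (batch ** 2).sum(axis=0)

    def result(self):
        mean = self.sum / self.n
        std = numpy.sqrt(self.sum_of_squares / self.n - mean ** 2)
        return mean, std

# Works the same on a stream from a Dataset or from a Transformer,
# assuming `data_stream` is any Fuel stream with a 'features' source:
stats = PerPixelStatistics()
for batch in data_stream.get_epoch_iterator(as_dict=True):
    stats.accumulate(batch['features'])
mean, std = stats.result()
```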

@dwf
Contributor

dwf commented Sep 15, 2015

There was talk of using Blocks Aggregators for this.

It's a good idea, though we should think about how to store this information in a standardized way.

@rizar thoughts?


@rizar
Contributor

rizar commented Sep 17, 2015

I am not sure I understand what kind of support from Fuel you guys want. Yes, statistics can be computed by iterating over the dataset. Indeed, the DatasetEvaluator from Blocks might be a great way of doing this without worrying about memory usage. But the output can be anything...

On the other hand, I do find it unfortunate that I cannot use Fuel to create a whitened version of my dataset.
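
For example, with the DatasetEvaluator approach, a per-pixel mean could be aggregated over the whole stream along these lines (a rough sketch; the 4D `'features'` source and the variable names are assumptions):

```python
import theano.tensor as tensor
from blocks.monitoring import aggregation
from blocks.monitoring.evaluators import DatasetEvaluator

# Symbolic batch of images: (batch, channels, height, width).
features = tensor.tensor4('features')

# Aggregate sum over examples / total number of examples across batches.
pixel_mean = aggregation.mean(features.sum(axis=0), features.shape[0])
pixel_mean.name = 'pixel_mean'

evaluator = DatasetEvaluator([pixel_mean])
# statistics = evaluator.evaluate(data_stream)  # -> {'pixel_mean': array}
```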

@fvisin
Author

fvisin commented Sep 22, 2015

This is what I would like from Fuel:

  1. an offline preprocessor to iterate over the whole dataset and compute statistics. Since this operation has to be done every time a new dataset wrapper is created, it would be nice to have a utility class that standardizes the way it's done and avoids reinventing the wheel in each converter;
  2. a standardized way to store this information in the dataset file (one possible layout is sketched after this list);
  3. once you have 1) and 2), objects that use that information on the fly (e.g., normalization, whitening, etc.).
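
For point 2, one possible layout (an assumption for illustration, not an agreed convention) would be a dedicated group in the same HDF5 file the converter produces:

```python
import h5py
import numpy

# Placeholder values; in practice these come from the offline
# preprocessor of point 1.
mean = numpy.zeros((3, 32, 32))
std = numpy.ones((3, 32, 32))

with h5py.File('dataset.hdf5', 'a') as f:
    group = f.require_group('statistics/features')
    group.create_dataset('mean', data=mean)
    group.create_dataset('std', data=std)
```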

@fvisin fvisin changed the title Add class to compute statistics of the whole dataset Compute and use statistics on the whole dataset Sep 22, 2015
@dwf
Contributor

dwf commented Sep 30, 2015

@fvisin I think we should split this ticket up. We should tackle the offline preprocessor first, as it is potentially useful without the other two pieces. My next priority would be the accompanying transformers that can use these statistics to do things like #242 (your point 3), so that users can at least store this information manually while we figure out 2).

We should design the API so that it can be used as an optional step at dataset conversion time, letting you say something like "also, store a whitening matrix for this foo as foo_whitening_matrix in the HDF5 file". For large streaming datasets we should think about a generic conversion pipeline that invokes the aggregation methods of whatever statistics objects you specify as the data is being stored/converted.
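
A rough sketch of what that generic hook could look like (every name here is hypothetical):

```python
def convert_with_statistics(batches, write_batch, aggregators):
    """Write each batch out and feed it to every aggregator on the fly.

    `batches` is any iterable of numpy batches, `write_batch` is the
    converter's usual writing routine, and `aggregators` maps a name to
    an object with accumulate()/result() methods (e.g. the sketch above).
    """
    for batch in batches:
        write_batch(batch)
        for aggregator in aggregators.values():
            aggregator.accumulate(batch)
    # Statistics to be stored next to the data, e.g. under 'statistics/'.
    return {name: agg.result() for name, agg in aggregators.items()}
```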

@fvisin
Author

fvisin commented Sep 30, 2015

@dwf I agree on everything. The transformers and the offline preprocessor can indeed be tackled independently, provided we design them with the goal of making them work together at some point.

Having transformers that use global statistics would be very handy even without the offline preprocessor: it is quite common to compute and store global statistics, and I would expect people to already have them somewhere for a number of datasets. For example, I pickled the pixel mean and variance of ImageNet when I worked on it; I could plug those into Fuel easily with such a Transformer.
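
In fact, Fuel's existing ScaleAndShift transformer already covers this case. A sketch, assuming the statistics were pickled as a (mean, variance) pair into a hypothetical imagenet_stats.pkl and that `data_stream` is already defined:

```python
import pickle

from fuel.transformers import ScaleAndShift

with open('imagenet_stats.pkl', 'rb') as f:  # hypothetical pickle file
    mean, variance = pickle.load(f)

# (x - mean) / std  ==  x * (1 / std) + (-mean / std)
std = variance ** 0.5
normalized_stream = ScaleAndShift(data_stream,
                                  scale=1.0 / std,
                                  shift=-mean / std,
                                  which_sources=('features',))
```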
