
Slicing a large histogram is slow #644

Closed
henryiii opened this issue Sep 15, 2021 · 5 comments · Fixed by #648

@henryiii
Member

Hi @henryiii @HDembinski,

I assume the following is related; if not, please correct me and I'll open a fresh issue...
We noticed in the scope of our analysis that __getitem__ is a performance "hurdle" for high-dimensional histograms (imagine: a dataset axis with O(1000) datasets, a category axis with O(100) categories, and a systematics axis with O(100) shifts).

Here is a snippet that makes the performance difference visible:

import boost_histogram as bh

h = bh.Histogram(
    bh.axis.StrCategory([str(i) for i in range(100)]),  # e.g. datasets
    bh.axis.StrCategory([str(i) for i in range(100)]),  # e.g. categories
    bh.axis.StrCategory([str(i) for i in range(100)]),  # e.g. systematics
    bh.axis.Regular(100, 0, 500),
)

# let's fill a dummy value
h[...] = 1.0

# now the __getitem__ performance:
%timeit h[bh.loc("42"), bh.loc("42"), bh.loc("42"), :].view()
4.08 s ± 61.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit h.view()[h.axes[0].index("42"), h.axes[1].index("42"), h.axes[2].index("42"), :]
20.3 µs ± 669 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Currently we use the second option: at the scale of a full analysis with many of these huge histograms, the difference amounts to O(hours) versus O(seconds) for histogram manipulation such as grouping datasets into physics processes. However, the first option is obviously a lot more convenient to use.
I think speeding this up would be a major improvement, especially for the usability of hist and boost_histogram in large-scale analyses.

Best, Peter

Originally posted by @pfackeldey in #149 (comment)

@henryiii
Member Author

@pfackeldey This is a bit different: vectorizing getitem refers to something else; slices are fine. The second version will always outperform the first one above, since it's doing a lot less. The second version is not even allocating new memory - it's just a view of the memory with different offsets and strides - while a new histogram always needs new memory (with flow bins, aliasing is logically impossible; without flow bins, it hasn't been worth adding).
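To make the distinction concrete, here is a small illustrative sketch (not from the issue itself) showing that plain NumPy indexing aliases the histogram's buffer while __getitem__ allocates a fresh one:

import boost_histogram as bh

h2 = bh.Histogram(bh.axis.Regular(100, 0, 500), bh.axis.Regular(100, 0, 500))

v = h2.view()[0, :]     # NumPy indexing: a strided view, no allocation
sub = h2[0, :]          # __getitem__: a new 1D histogram with its own memory

h2.view()[0, 0] = 99.0  # mutate the original buffer
print(v[0])             # 99.0 - the view aliases h2's memory
print(sub.view()[0])    # 0.0 - the picked histogram owns a separate copy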

I think we should do a bit of profiling here to see if anything could be improved (4 seconds does seem a little overboard), and while I've been very reluctant to add too many axis types, maybe we could have a flow=False option for Str/IntCategory.

This might be something that would be solved by boostorg/histogram#275, since currently we have to work around this, which might be why it's slow.

PS: Originally I started to say "The first loop is summing all the removed bins and putting them in the overflow bins", but that only applies to slices, not "picking", so this is clearly way too slow.
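For reference, here is a tiny illustrative example (not from the issue) of the slice-versus-pick distinction: a slice sums the cropped bins into the flow bins of the result, while a pick just selects one index and sums nothing:

import boost_histogram as bh

h1 = bh.Histogram(bh.axis.Regular(10, 0, 10))
h1[...] = 1.0
print(h1[2:8].view(flow=True))  # [2. 1. 1. 1. 1. 1. 1. 2.] - cropped bins land in flow

hc = bh.Histogram(bh.axis.IntCategory([1, 2, 3]), bh.axis.Regular(10, 0, 10))
hc[...] = 1.0
print(hc[bh.loc(2), :].view())  # all ones - picking moved nothing into flow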

@henryiii
Member Author

henryiii commented Sep 15, 2021

FYI, running line_profiler:

In [4]: %load_ext line_profiler
In [5]: %lprun -f bh.Histogram.__getitem__ h[bh.loc("42"), bh.loc("42"), bh.loc("42"), :].view()

Shows 100% of the time in this line:

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
...
   803         1    4649722.0 4649722.0    100.0          reduced = self._hist.reduce(*slices)
...

So it looks like this is spending its time in Boost.Histogram. But, turning on logging, I see

DEBUG:boost_histogram._internal.hist:Reduce with []

This means that self._hist.reduce(*slices) is self._hist.reduce(), and that empty reduce is what's taking 4 seconds! Hmm, adding a special override for empty reductions to replace them with a copy ends up with:

1.31 s ± 13.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Ehh, not much better. In fact, given the size of the histogram, this is literally how much time it takes to make an h.copy()!
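The override in question might look roughly like this (a minimal sketch; hist_cpp stands in for the internal self._hist object, and the names are assumptions rather than the actual boost_histogram internals):

import copy

def reduce_or_copy(hist_cpp, slices):
    if slices:
        return hist_cpp.reduce(*slices)  # the expensive path profiled above
    # an empty reduce is semantically a copy, but copying the full
    # buffer (~10^8 bins here) still takes over a second
    return copy.copy(hist_cpp)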

So this can be fixed in this special case for now by skipping the copy if picking is performed, since picking will copy again anyway (but into a much smaller histogram!):

368 µs ± 15.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Much better for this special case - mixing slicing and picking will drop you back down to the original, slower code. @HDembinski if Boost.Histogram supported picking (ideally fused with other operations), then this would be simpler and faster. Even if it's not fused, "picking" is a much cheaper operation, so it should ideally be done first rather than last. I could eventually look into reversing the order here, but it complicates (or at least requires me to rewrite) the manipulation code. (PS: an empty reduce seems a bit slower than it should be, too.)
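The principle behind this special case can be sketched in user code (this is not the actual fix, and it ignores flow bins for simplicity): index the raw view first, so only the small surviving block is ever copied:

import boost_histogram as bh

def pick_only(h, *indices):
    """Pick one bin on each leading axis; copy only the surviving block."""
    small = h.view()[indices]  # zero-copy NumPy view of the selected block
    out = bh.Histogram(*list(h.axes)[len(indices):])
    out.view()[...] = small   # copy only the small block into the new histogram
    return out

# e.g. the equivalent of h[bh.loc("42"), bh.loc("42"), bh.loc("42"), :]:
# fast = pick_only(h, *(ax.index("42") for ax in h.axes[:3]))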

@HDembinski
Member

I don't think I can fix this by implementing picking. The core issue is that the first call does a reduce and allocates memory for another histogram, which is expensive. The second call just creates a view into the original memory, which is very fast. To make this fast, I would have to implement histogram views. It could be done with a new storage type, I think, but it is not easy.

@pfackeldey

Thanks for creating a new issue!

@henryiii
Member Author

Picking reduces the size of the histogram without extra computation, so picking first cuts down the size of, and the computation for, the reduce, which is potentially expensive (not sure why it's 3x slower than a copy when nothing is done). If Boost.Histogram had a separate pick, I'd still have to do the harder part of this reorganization anyway - a pick that comes first has to modify the axis indices of the reduce that follows. If it was all part of a reduce, then you'd be doing that for me. ;)
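The bookkeeping referred to here is axis renumbering: once some axes are picked away, every remaining reduce option that targets an axis by number must shift down past the picked axes. A minimal illustrative sketch of that mapping (the names are not boost_histogram internals):

def remap_axis(ax, picked_axes):
    """Axis number ax after removing every picked axis that precedes it."""
    return ax - sum(1 for p in picked_axes if p < ax)

# picking axes 0 and 2 of a 4D histogram: reduce options that targeted
# axes 1 and 3 must now target axes 0 and 1 of the smaller histogram
assert remap_axis(1, [0, 2]) == 0
assert remap_axis(3, [0, 2]) == 1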

Histogram views can't be done with flow bins, at least not without storing the flow bins somewhere else. The view method above will always be a bit faster - but that's okay; I'm mostly worried about reducing the 4 second time to under a millisecond, not about the final 300 µs to 20 µs gained by not making a copy. Eventually, we could possibly set up a storage with externally defined memory, but given the flow bin issue, it would not save much, if anything, in most cases.
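To spell out the flow-bin obstruction with a small illustrative check (not from the issue): the flow entries of a sliced histogram hold sums that exist nowhere in the parent's buffer, so no combination of offsets and strides over the parent's memory can represent the result:

import boost_histogram as bh
import numpy as np

h1 = bh.Histogram(bh.axis.Regular(10, 0, 10))
h1[...] = 1.0

s = h1[2:8]
# The flow entries of s are sums of the cropped bins (2.0 each); those
# values exist nowhere in h1's buffer, so s must own fresh memory.
print(np.shares_memory(s.view(flow=True), h1.view(flow=True)))  # False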
