
Slicing a large histogram is slow #644

Closed
henryiii opened this issue Sep 15, 2021 · 5 comments · Fixed by #648

@henryiii
Member

Hi @henryiii @HDembinski,

I assume the following is related; if not, please correct me and I'll open a fresh issue...
We noticed in the scope of our analysis that __getitem__ is a performance "hurdle" for high-dimensional histograms (imagine: a dataset axis with O(1000) datasets, a category axis with O(100) categories, and a systematics axis with O(100) shifts).

Here is a snippet that makes the performance difference visible:

import boost_histogram as bh

h = bh.Histogram(
    bh.axis.StrCategory([str(i) for i in range(100)]),  # e.g. datasets
    bh.axis.StrCategory([str(i) for i in range(100)]),  # e.g. categories
    bh.axis.StrCategory([str(i) for i in range(100)]),  # e.g. systematics
    bh.axis.Regular(100, 0, 500),
)

# let's fill a dummy value
h[...] = 1.0

# now the __getitem__ performance:
%timeit h[bh.loc("42"), bh.loc("42"), bh.loc("42"), :].view()
4.08 s ± 61.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit h.view()[h.axes[0].index("42"), h.axes[1].index("42"), h.axes[2].index("42"), :]
20.3 µs ± 669 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Currently we use the second option: at the scale of a full analysis with many of these huge histograms, the difference amounts to O(hours) versus O(seconds) for histogram manipulation such as grouping datasets into physics processes. However, the first option is obviously a lot more convenient to use.
I think speeding this up would be a major improvement, especially for the usability of hist and boost_histogram in large-scale analyses.

Best, Peter

Originally posted by @pfackeldey in #149 (comment)

@henryiii
Member Author

@pfackeldey This is a bit different: vectorizing getitem refers to something else; slices are fine. The second version will always outperform the first one above, since it's doing a lot less. The second version is not even allocating new memory - it's just a view of the memory with different offsets and strides - while a new histogram always needs new memory (with flow bins, aliasing is logically impossible; without flow bins, it hasn't been worth adding).
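To make the distinction concrete, here is a small illustrative sketch (not from the issue itself) showing that plain NumPy indexing aliases the histogram's buffer while __getitem__ allocates a fresh one:

import boost_histogram as bh

h2 = bh.Histogram(bh.axis.Regular(100, 0, 500), bh.axis.Regular(100, 0, 500))

v = h2.view()[0, :]     # NumPy indexing: a strided view, no allocation
sub = h2[0, :]          # __getitem__: a new 1D histogram with its own memory

h2.view()[0, 0] = 99.0  # mutate the original buffer
print(v[0])             # 99.0 - the view aliases h2's memory
print(sub.view()[0])    # 0.0 - the picked histogram owns a separate copy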

I think we should do a bit of profiling here to see if anything could be improved (4 seconds does seem a little overboard), and while I've been very reluctant to add too many axis types, maybe we could have a flow=False option for Str/IntCategory.

This might be something that would be solved by boostorg/histogram#275, since currently we have to work around this, which might be why it's slow.

PS: Originally I started to say "The first loop is summing all the removed bins and putting them in the overflow bins", but that only applies to slices, not "picking", so this is clearly way too slow.
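For reference, here is a tiny illustrative example (not from the issue) of the slice-versus-pick distinction: a slice sums the cropped bins into the flow bins of the result, while a pick just selects one index and sums nothing:

import boost_histogram as bh

h1 = bh.Histogram(bh.axis.Regular(10, 0, 10))
h1[...] = 1.0
print(h1[2:8].view(flow=True))  # [2. 1. 1. 1. 1. 1. 1. 2.] - cropped bins land in flow

hc = bh.Histogram(bh.axis.IntCategory([1, 2, 3]), bh.axis.Regular(10, 0, 10))
hc[...] = 1.0
print(hc[bh.loc(2), :].view())  # all ones - picking moved nothing into flow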

@henryiii
Member Author

henryiii commented Sep 15, 2021

FYI, running line_profiler:

In [4]: %load_ext line_profiler
In [5]: %lprun -f bh.Histogram.__getitem__ h[bh.loc("42"), bh.loc("42"), bh.loc("42"), :].view()

Shows 100% of the time in this line:

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
...
   803         1    4649722.0 4649722.0    100.0          reduced = self._hist.reduce(*slices)
...

So it looks like this is spending its time in Boost.Histogram. But, turning on logging, I see

DEBUG:boost_histogram._internal.hist:Reduce with []

This means that self._hist.reduce(*slices) is self._hist.reduce(), and that empty reduce is what's taking 4 seconds! Hmm, adding a special override for empty reductions to replace them with a copy ends up with:

1.31 s ± 13.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Ehh, not much better. In fact, given the size of the histogram, this is literally how much time it takes to make an h.copy()!
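The override in question might look roughly like this (a minimal sketch; hist_cpp stands in for the internal self._hist object, and the names are assumptions rather than the actual boost_histogram internals):

import copy

def reduce_or_copy(hist_cpp, slices):
    if slices:
        return hist_cpp.reduce(*slices)  # the expensive path profiled above
    # an empty reduce is semantically a copy, but copying the full
    # buffer (~10^8 bins here) still takes over a second
    return copy.copy(hist_cpp)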

So this can be fixed in this special case for now by skipping the copy if picking is performed, since picking will copy again anyway (but into a much smaller histogram!):

368 µs ± 15.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Much better for this special case - mixing slicing and picking will drop you back down to the original, slower code. @HDembinski if Boost.Histogram supported picking (ideally fused with other operations), then this would be simpler and faster. Even if it's not fused, "picking" is a much cheaper operation, so it should ideally be done first rather than last. I could eventually look into reversing the order here, but it complicates (or at least requires me to rewrite) the manipulation code. (PS: an empty reduce seems a bit slower than it should be, too.)
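The principle behind this special case can be sketched in user code (this is not the actual fix, and it ignores flow bins for simplicity): index the raw view first, so only the small surviving block is ever copied:

import boost_histogram as bh

def pick_only(h, *indices):
    """Pick one bin on each leading axis; copy only the surviving block."""
    small = h.view()[indices]  # zero-copy NumPy view of the selected block
    out = bh.Histogram(*list(h.axes)[len(indices):])
    out.view()[...] = small   # copy only the small block into the new histogram
    return out

# e.g. the equivalent of h[bh.loc("42"), bh.loc("42"), bh.loc("42"), :]:
# fast = pick_only(h, *(ax.index("42") for ax in h.axes[:3]))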

@HDembinski
Member

I don't think I can fix this by implementing picking. The core issue is that the first call does a reduce and allocates memory for another histogram, which is expensive. The second call just creates a view into the original memory, which is very fast. To make this fast, I would have to implement histogram views. It could be done with a new storage type, I think, but it is not easy.

@pfackeldey

Thanks for creating a new issue!

@henryiii
Member Author

Picking reduces the size of the histogram without extra computation, so picking first cuts down the size of, and the computation for, the reduce, which is potentially expensive (not sure why it's 3x slower than a copy when nothing is done). If Boost.Histogram had a separate pick, I'd still have to do the harder part of this reorganization anyway - a pick that comes first has to modify the axis indices of the reduce that follows. If it was all part of a reduce, then you'd be doing that for me. ;)
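The bookkeeping referred to here is axis renumbering: once some axes are picked away, every remaining reduce option that targets an axis by number must shift down past the picked axes. A minimal illustrative sketch of that mapping (the names are not boost_histogram internals):

def remap_axis(ax, picked_axes):
    """Axis number ax after removing every picked axis that precedes it."""
    return ax - sum(1 for p in picked_axes if p < ax)

# picking axes 0 and 2 of a 4D histogram: reduce options that targeted
# axes 1 and 3 must now target axes 0 and 1 of the smaller histogram
assert remap_axis(1, [0, 2]) == 0
assert remap_axis(3, [0, 2]) == 1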

Histogram views can't be done with flow bins, at least not without storing the flow bins somewhere else. The view method above will always be a bit faster - but that's okay; I'm mostly worried about reducing the 4 second time to under a millisecond, not about the final 300 µs to 20 µs gained by not making a copy. Eventually, we could possibly set up a storage with externally defined memory, but given the flow bin issue, it would not save much, if anything, in most cases.
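To spell out the flow-bin obstruction with a small illustrative check (not from the issue): the flow entries of a sliced histogram hold sums that exist nowhere in the parent's buffer, so no combination of offsets and strides over the parent's memory can represent the result:

import boost_histogram as bh
import numpy as np

h1 = bh.Histogram(bh.axis.Regular(10, 0, 10))
h1[...] = 1.0

s = h1[2:8]
# The flow entries of s are sums of the cropped bins (2.0 each); those
# values exist nowhere in h1's buffer, so s must own fresh memory.
print(np.shares_memory(s.view(flow=True), h1.view(flow=True)))  # False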
