Vectorize at/__getitem__ #149

Open · 3 tasks · henryiii opened this issue Oct 17, 2019 · 2 comments
Labels: project idea (Could be a fellow project)
Comments

henryiii (Member) commented Oct 17, 2019

  • Vectorize _at
  • Vectorize _at_set
  • Vectorize __getitem__ and __setitem__ (these use the above functions internally; see the sketch below)
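
For context, a minimal sketch of what the vectorized lookup would buy. The batched at/__getitem__ itself does not exist yet; the last line emulates it through the NumPy buffer:

import boost_histogram as bh
import numpy as np

h = bh.Histogram(bh.axis.Regular(10, 0, 1))
h.fill(np.random.uniform(size=1_000))

idx = np.array([0, 3, 7])

# today: one Python-level __getitem__ call per bin
scalar = [h[int(i)] for i in idx]

# the equivalent vectorized lookup, via the NumPy view of the bin buffer
vector = h.view()[idx]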
@henryiii henryiii modified the milestones: 0.6.0, 0.5.2 Oct 17, 2019
@henryiii henryiii modified the milestones: 0.6.0, 0.7.0 Nov 3, 2019
@HDembinski HDembinski self-assigned this Nov 5, 2019
@henryiii henryiii modified the milestones: 0.7.0, 0.8.0 Mar 11, 2020
@henryiii henryiii modified the milestones: 0.8.0, 1.0.0 Jul 2, 2020
@henryiii henryiii modified the milestones: 1.0.0, 1.1.0 Feb 9, 2021
henryiii (Member, Author) commented Feb 9, 2021

This is a bit tricky to implement; I've started it, but pybind11 doesn't provide runtime utilities for array access, and I don't want to generate 32 copies of this, so it will likely miss the 1.0 target. I think that's fine, as no one has been too worried about missing this so far. The easy buffer access with .view() and the like makes it a bit less important.
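
A minimal sketch of that buffer access, assuming the default Double storage (where .view() is a plain writable NumPy array sharing memory with the histogram):

import boost_histogram as bh

h = bh.Histogram(bh.axis.Regular(4, 0, 1))
v = h.view()                 # NumPy array backed by the histogram's bin storage
v[:] = [1.0, 2.0, 3.0, 4.0]  # writes go straight into the bins
print(h[1])                  # 2.0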

@henryiii henryiii modified the milestones: 1.1.0, 1.2.0 Jul 7, 2021
pfackeldey commented Sep 14, 2021

Hi @henryiii @HDembinski ,

I assume the following is related; if not, please correct me and I'll open a fresh issue.
In the scope of our analysis we noticed that __getitem__ is a performance hurdle for high-dimensional histograms (imagine a dataset axis of O(1000) datasets, a category axis of O(100) categories, and a systematics axis of O(100) shifts).

Here is a snippet that makes the performance difference visible:

import boost_histogram as bh

h = bh.Histogram(
    bh.axis.StrCategory([str(i) for i in range(100)]),  # e.g. datasets
    bh.axis.StrCategory([str(i) for i in range(100)]),  # e.g. categories
    bh.axis.StrCategory([str(i) for i in range(100)]),  # e.g. systematics
    bh.axis.Regular(100, 0, 500),
)

# let's fill a dummy value
h[...] = 1.0

# now the __getitem__ performance:
%timeit h[bh.loc("42"), bh.loc("42"), bh.loc("42"), :].view()
4.08 s ± 61.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit h.view()[h.axes[0].index("42"), h.axes[1].index("42"), h.axes[2].index("42"), :]
20.3 µs ± 669 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Currently we use the second option, since at a larger analysis scale, with many of these huge histograms, the difference amounts to O(hours) versus O(seconds) for histogram manipulation such as grouping datasets into physics processes (sketched below). However, the first option is obviously a lot more convenient to use.
I think speeding this up would be a major improvement, especially for the usability of hist and boost_histogram in large-scale analysis.
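
As referenced above, a minimal sketch of such a grouping through the raw view; the dataset-to-process mapping here is invented for the example:

import boost_histogram as bh

h = bh.Histogram(
    bh.axis.StrCategory([str(i) for i in range(100)]),  # datasets
    bh.axis.Regular(100, 0, 500),
)
h[...] = 1.0

# sum the slices of several datasets into one "process" (hypothetical mapping)
process_datasets = ["1", "7", "42"]
idx = [h.axes[0].index(d) for d in process_datasets]
process = h.view()[idx].sum(axis=0)  # shape (100,)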

Best, Peter

@henryiii henryiii removed this from the 1.3.0 milestone Apr 15, 2022
@henryiii henryiii added the label: project idea (Could be a fellow project) Apr 15, 2022