-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
Fix for issue #21150, using a simple lock to prevent an issue with multiple threads accessing an Index #22006
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Test file couldn’t import in py27. Is there a backport of ThreadpoolExecutor? For the most part pandas doesn’t promise threadsafety. Is there a reason non-safety in this spot is particularly problematic? |
@jbrockmendel thanks for reviewing it so quickly.
Oh, it's not included with py27. I believe it's included in py3, but for py27 we'd have to have something like
Only if users are using indexes in a multi threaded environment. Though, I believe we could suggest as workaround to ask users to put the lock before they call the index. I'd be happy to close the PR and comment in the #21150 and go with that approach. At least I learned a few things while reading the issue & the code (-: |
with the fix, it must always work with not issues.""" | ||
x = pd.date_range('2001', '2020') | ||
with ThreadPoolExecutor(2) as p: | ||
assert all(p.map(lambda x: x.is_unique, [x] * 2)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm...I think we can incorporate this test into one of the existing files in this directory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which one you think would accommodate this one? Happy to update the pull request with this change :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
test_base.py
should suffice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
from dummy_thread import allocate_lock as _thread_allocate_lock | ||
except: | ||
from _dummy_thread import allocate_lock as _thread_allocate_lock | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to move this into pandas/compat/__init__.py
and import from there?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not just import it from tslibs.strptime where it is already?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In that case, we should move that import code to pandas/compat
so it can be utilized by both modules (in more semantically logical arrangement IMO).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you address this? We also recently fixed all our bare excepts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@TomAugspurger fixed the bare excepts (had a look at recent commits and saw one where many bare excepts were fixed, kudos!!!).
As for the part about moving to pandas/compat
; if I recall correctly, I implemented that, and then reverted as that was asked in other comments (you can see in the comments in the issue).
Pushing a change with the fix for bare excepts. But let me know if I should update where it is imported.
Thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds fine, I didn't read through all the comments :)
@jbrockmendel : We don't promise thread-safety, but IMO we should try to improve it if possible. 🙂 |
@kinow : Going to also need a |
Hello @kinow! Thanks for updating the PR. Cheers ! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on July 27, 2018 at 11:59 Hours UTC |
Codecov Report
@@ Coverage Diff @@
## master #22006 +/- ##
=======================================
Coverage 92.2% 92.2%
=======================================
Files 169 169
Lines 50924 50924
=======================================
Hits 46952 46952
Misses 3972 3972
Continue to review full report at Codecov.
|
@gfyoung thanks for the thorough review. Here's a summary of my changes, but before, just a heads-up that I added a few more commits, but am happy to squash them all before merging.
Thanks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i will have a look
but we make no guarantee on threading at all
so not inclined to add this
(replying here) Not a problem, no hard feelings :-) as I said, it was fun and I learned while working on this, so even if this PR is closed and not merged, that's still all right (and I'd only have to thanks everybody who reviewed/replied here too). |
Sure. My point was that if we are going to start addressing thread-safety, we should triage. I really have no idea if this should be at the top of the list, hence asking the OP. Also curious how nicely locks play with cython code. |
Definitely revert. A lot of time and effort has gone into getting the dependency structure of those modules right. Adding a dependency on |
so is it possible to use he cython / c-python API calls for the locks in strptime.pyx? i agree getting this with a good foundation would be a nice start |
@jbrockmendel : Where do you want to put the import code then? It should be moved if we're going to have it in two places in the Cython code. |
@jreback : Is there a reason to not make something thread safe if possible? |
I am not against threadsafeing at all. But say pandas is thread-safe is a huge thing, and so saying 'part' is threadsafe leads one to believe 'all'. so have to be really careful about this. Further the performance might (or might not be an issue). |
The import in strptime needs to stay where it is. If another module needs it, get it from there. Or in pandas.compat import it from strptime, then in index.pyx import it from pandas.compat |
|
In order to import |
Ah...I see. Proposal: |
There may be merit to this idea, but it would be for the actual compat code that is currently in _libs/src/headers/ and src/compat_helper.h Please leave strptime alone. A lot of effort has gone into making tslibs logically independent for build/testing purposes. |
Fair enough. We can figure out where best to put this thread allocator import afterwards, though I still am of the opinion that something should be abstracted out so we don't write this code multiple times or unnecessarily make |
Updated pull request, now wait for Travis CI to confirm I didn't forget anything (test_base.py passing locally). Rebased and squashed commits too. I'd be ok with leaving this PR as it is, and instead working on a broader ticket to define a plan to support multi-thread in Pandas - assuming there's a consensus to slowly introduce it, as I don't think it can happen in one-go. Otherwise, I'm also keen to help documenting what can be used in a multi-threaded environment, and what cannot. We could simply add a few docstring comments, stating whether a class is thread-safe or not, or giving more info about how to use it in a multi-threaded environment. Normally that's a convention that I am used to in the Apache Software Foundation with Java. That way tickets like this could be closed in favour of good documentation. Happy to help working on either of these two, or even a third option. Because as Captain Kirk said, if really necessary there's always a third alternative 🤓 |
@kinow I think we would welcome better multi thread support with a few caveats
the last point is the key one |
can you merge master and update? |
cc9f49b
to
9b07f01
Compare
Sure thing @jreback. Done 🎉 |
…ue with multiple threads accessing an Index
9b07f01
to
3414adf
Compare
closing as stale. if you'd like to continue, pls ping. |
Hi,
First time working with Cython, and after uni did very little C/C++, so feel free to be a bit more serious about reviewing this pull request.
I am trying to practice more Python, and learn more about the code base of projects like Panda, and found the issue #21150, which happens a lot in Java (my main language), so decided to see how I'd go about fixing it.
Used the same approach found in
strptime.pyx
, to add a lock to prevent two threads of accessing themapping
Hashtable in the_ensure_mapping_populated
method.The problem occurred when two threads accessed the method around the same time, then one would create the
mapping
Hashtable instance, preventing the other from entering theif
statement which also sets theunique
flag totrue/1
.I didn't find any other part that could cause an obvious issue due to this lock, or another thread-safety issue. However, I can imagine that moving the logic maybe to the constructor so that we immediately know whether the index is unique or not, could fix it too, and remove the need of the lock, at the expense of changing the current design (I believe the index is not tightly coupled with the values passed... later, after certain calls, the mapping hashtable gets populated, which gives better performance I believe).
Or, if core devs prefer to ask users to use a lock in their code base instead of implementing in Pandas, I'd be keen to close this PR and open a new one for the documentation with some note about it.
Feel free to suggest any changes or even leave comments to educate me :-) The documentation for setting up the development environment and for contributing was super easy to follow, I hope I didn't forget anything.
Thank you!
Bruno
git diff upstream/master -u -- "*.py" | flake8 --diff