Skip to content

Fix for issue #21150, using a simple lock to prevent an issue with multiple threads accessing an Index #22006

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

kinow
Copy link
Contributor

@kinow kinow commented Jul 21, 2018

Hi,

First time working with Cython, and after uni did very little C/C++, so feel free to be a bit more serious about reviewing this pull request.

I am trying to practice more Python, and learn more about the code base of projects like Panda, and found the issue #21150, which happens a lot in Java (my main language), so decided to see how I'd go about fixing it.

Used the same approach found in strptime.pyx, to add a lock to prevent two threads of accessing the mapping Hashtable in the _ensure_mapping_populated method.

The problem occurred when two threads accessed the method around the same time, then one would create the mapping Hashtable instance, preventing the other from entering the if statement which also sets the unique flag to true/1.

I didn't find any other part that could cause an obvious issue due to this lock, or another thread-safety issue. However, I can imagine that moving the logic maybe to the constructor so that we immediately know whether the index is unique or not, could fix it too, and remove the need of the lock, at the expense of changing the current design (I believe the index is not tightly coupled with the values passed... later, after certain calls, the mapping hashtable gets populated, which gives better performance I believe).

Or, if core devs prefer to ask users to use a lock in their code base instead of implementing in Pandas, I'd be keen to close this PR and open a new one for the documentation with some note about it.

Feel free to suggest any changes or even leave comments to educate me :-) The documentation for setting up the development environment and for contributing was super easy to follow, I hope I didn't forget anything.

Thank you!
Bruno

@jbrockmendel
Copy link
Member

Test file couldn’t import in py27. Is there a backport of ThreadpoolExecutor?

For the most part pandas doesn’t promise threadsafety. Is there a reason non-safety in this spot is particularly problematic?

@kinow
Copy link
Contributor Author

kinow commented Jul 21, 2018

@jbrockmendel thanks for reviewing it so quickly.

Test file couldn’t import in py27. Is there a backport of ThreadpoolExecutor?

Oh, it's not included with py27. I believe it's included in py3, but for py27 we'd have to have something like pip install futures.

For the most part pandas doesn’t promise threadsafety. Is there a reason non-safety in this spot is particularly problematic?

Only if users are using indexes in a multi threaded environment. Though, I believe we could suggest as workaround to ask users to put the lock before they call the index.

I'd be happy to close the PR and comment in the #21150 and go with that approach. At least I learned a few things while reading the issue & the code (-:

@gfyoung gfyoung added Bug Multithreading Parallelism in pandas labels Jul 23, 2018
with the fix, it must always work with not issues."""
x = pd.date_range('2001', '2020')
with ThreadPoolExecutor(2) as p:
assert all(p.map(lambda x: x.is_unique, [x] * 2))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm...I think we can incorporate this test into one of the existing files in this directory.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which one you think would accommodate this one? Happy to update the pull request with this change :-)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test_base.py should suffice.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

from dummy_thread import allocate_lock as _thread_allocate_lock
except:
from _dummy_thread import allocate_lock as _thread_allocate_lock

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to move this into pandas/compat/__init__.py and import from there?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not just import it from tslibs.strptime where it is already?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case, we should move that import code to pandas/compat so it can be utilized by both modules (in more semantically logical arrangement IMO).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you address this? We also recently fixed all our bare excepts.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TomAugspurger fixed the bare excepts (had a look at recent commits and saw one where many bare excepts were fixed, kudos!!!).

As for the part about moving to pandas/compat; if I recall correctly, I implemented that, and then reverted as that was asked in other comments (you can see in the comments in the issue).

Pushing a change with the fix for bare excepts. But let me know if I should update where it is imported.

Thanks

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds fine, I didn't read through all the comments :)

@gfyoung
Copy link
Member

gfyoung commented Jul 23, 2018

For the most part pandas doesn’t promise threadsafety. Is there a reason non-safety in this spot is particularly problematic?

@jbrockmendel : We don't promise thread-safety, but IMO we should try to improve it if possible. 🙂

@gfyoung
Copy link
Member

gfyoung commented Jul 23, 2018

@kinow : Going to also need a whatsnew entry for 0.24.0.

@pep8speaks
Copy link

pep8speaks commented Jul 23, 2018

Hello @kinow! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on July 27, 2018 at 11:59 Hours UTC

@codecov
Copy link

codecov bot commented Jul 23, 2018

Codecov Report

Merging #22006 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master   #22006   +/-   ##
=======================================
  Coverage    92.2%    92.2%           
=======================================
  Files         169      169           
  Lines       50924    50924           
=======================================
  Hits        46952    46952           
  Misses       3972     3972
Flag Coverage Δ
#multiple 90.62% <ø> (ø) ⬆️
#single 42.3% <ø> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c8ce3d0...3414adf. Read the comment docs.

@kinow
Copy link
Contributor Author

kinow commented Jul 23, 2018

@gfyoung thanks for the thorough review. Here's a summary of my changes, but before, just a heads-up that I added a few more commits, but am happy to squash them all before merging.

  • moved thread/allocate_lock imports to pandas/compat/__init__.py. I had one issue later, because I was renaming the builtin module _thread (Py3 in my env) to thread. Then trying to from thread import allocate_lock. I believe it's because the way Python treats builtins in a special way? I think the current import in the compat module should work, with the cons that in case we need more classes/methods from the thread module, we'd have to keep adding them to the import list
  • took liberty of fixing strptime as well, which was mentioned earlier in this ticket, but happy to revert in case we don't want to mix the changes here, and instead submit another PR later for that?
  • added futures to a few files under ci/ folder, which I believe are used by the CI tools to test Pandas. With this, the Travis-CI error from before, related to not finding ThreadpoolExecutor should be gone (hopefully, it's still running as I write this)
  • moved the unit test from its own class, to under test_base.py, still keeping it in its isolated class. So that maybe we can add more in the future if necessary? Otherwise, happy to move it to either TestIndexUtils or TestIndex (two other classes there).

Thanks
Bruno

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i will have a look
but we make no guarantee on threading at all
so not inclined to add this

@kinow
Copy link
Contributor Author

kinow commented Jul 23, 2018

@jreback

i will have a look
but we make no guarantee on threading at all
so not inclined to add this

(replying here)

Not a problem, no hard feelings :-) as I said, it was fun and I learned while working on this, so even if this PR is closed and not merged, that's still all right (and I'd only have to thanks everybody who reviewed/replied here too).

@jbrockmendel
Copy link
Member

We don't promise thread-safety, but IMO we should try to improve it if possible.

Sure. My point was that if we are going to start addressing thread-safety, we should triage. I really have no idea if this should be at the top of the list, hence asking the OP.

Also curious how nicely locks play with cython code.

@jbrockmendel
Copy link
Member

took liberty of fixing strptime as well, which was mentioned earlier in this ticket, but happy to revert in case we don't want to mix the changes here

Definitely revert. A lot of time and effort has gone into getting the dependency structure of those modules right. Adding a dependency on pandas.compat is not part of the gameplan. (cc @gfyoung)

@jreback
Copy link
Contributor

jreback commented Jul 23, 2018

so is it possible to use he cython / c-python API calls for the locks in strptime.pyx?

i agree getting this with a good foundation would be a nice start

@gfyoung
Copy link
Member

gfyoung commented Jul 23, 2018

Adding a dependency on pandas.compat is not part of the gameplan. (cc @gfyoung)

@jbrockmendel : Where do you want to put the import code then? It should be moved if we're going to have it in two places in the Cython code.

@gfyoung
Copy link
Member

gfyoung commented Jul 23, 2018

but we make no guarantee on threading at all
so not inclined to add this

@jreback : Is there a reason to not make something thread safe if possible?

@jreback
Copy link
Contributor

jreback commented Jul 24, 2018

I am not against threadsafeing at all. But say pandas is thread-safe is a huge thing, and so saying 'part' is threadsafe leads one to believe 'all'. so have to be really careful about this. Further the performance might (or might not be an issue).

@jbrockmendel
Copy link
Member

Where do you want to put the import code then?

The import in strptime needs to stay where it is. If another module needs it, get it from there. Or in pandas.compat import it from strptime, then in index.pyx import it from pandas.compat

@gfyoung
Copy link
Member

gfyoung commented Jul 24, 2018

The import in strptime needs to stay where it is. If another module needs it, get it from there.

  • Why does it need to stay there?
  • I don't follow why we should make other code depend on strptime, such as pandas.compat. Semantically, that's confusing me a lot. I should also point out that we do use compat in other Cython files like parsers.pyx.

@jbrockmendel
Copy link
Member

Why does it need to stay there?

In order to import Timestamp, Timedelta, ... into pandas._libs.__init__, we need _libs.tslibs to have no import-time dependencies on non-_libs pandas.

@gfyoung
Copy link
Member

gfyoung commented Jul 24, 2018

In order to import Timestamp, Timedelta, ... into pandas._libs.init, we need _libs.tslibs to have no import-time dependencies on non-_libs pandas.

Ah...I see.

Proposal: _libs.compat ?

@jbrockmendel
Copy link
Member

Proposal: _libs.compat ?

There may be merit to this idea, but it would be for the actual compat code that is currently in _libs/src/headers/ and src/compat_helper.h

Please leave strptime alone. A lot of effort has gone into making tslibs logically independent for build/testing purposes.

@gfyoung
Copy link
Member

gfyoung commented Jul 24, 2018

Please leave strptime alone. A lot of effort has gone into making tslibs logically independent for build/testing purposes.

Fair enough. We can figure out where best to put this thread allocator import afterwards, though I still am of the opinion that something should be abstracted out so we don't write this code multiple times or unnecessarily make index.pyx dependent on strptime.pyx.

@kinow kinow force-pushed the fix-issue-21150 branch from 06054a2 to cc9f49b Compare July 27, 2018 11:59
@kinow
Copy link
Contributor Author

kinow commented Jul 27, 2018

Updated pull request, now wait for Travis CI to confirm I didn't forget anything (test_base.py passing locally). Rebased and squashed commits too.

I'd be ok with leaving this PR as it is, and instead working on a broader ticket to define a plan to support multi-thread in Pandas - assuming there's a consensus to slowly introduce it, as I don't think it can happen in one-go.

Otherwise, I'm also keen to help documenting what can be used in a multi-threaded environment, and what cannot. We could simply add a few docstring comments, stating whether a class is thread-safe or not, or giving more info about how to use it in a multi-threaded environment. Normally that's a convention that I am used to in the Apache Software Foundation with Java.

That way tickets like this could be closed in favour of good documentation. Happy to help working on either of these two, or even a third option. Because as Captain Kirk said, if really necessary there's always a third alternative 🤓

@jreback
Copy link
Contributor

jreback commented Jul 27, 2018

@kinow I think we would welcome better multi thread support with a few caveats

  • must have more benchmarks / in particular we don’t want to slow down single threaded too much
  • create a central place to handle the code as much as possible (eg the lock cohde)
  • most of this would be cython code - so need to support the idiomatic expressions
  • ideally identify 2 or more issues so that we can have a generalized plan rather than one offs

the last point is the key one

@jreback
Copy link
Contributor

jreback commented Oct 11, 2018

can you merge master and update?

@kinow
Copy link
Contributor Author

kinow commented Oct 11, 2018

Sure thing @jreback. Done 🎉

…ue with multiple threads accessing an Index
@jreback
Copy link
Contributor

jreback commented Nov 23, 2018

closing as stale. if you'd like to continue, pls ping.

@jreback jreback closed this Nov 23, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Multithreading Parallelism in pandas
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: race condition in Index.is_unique
6 participants