Skip to content

BUG: race condition in Index.is_unique #21150

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
adbull opened this issue May 21, 2018 · 11 comments
Closed

BUG: race condition in Index.is_unique #21150

adbull opened this issue May 21, 2018 · 11 comments
Labels
Bug Closing Candidate May be closeable, needs more eyeballs Multithreading Parallelism in pandas

Comments

@adbull
Copy link
Contributor

adbull commented May 21, 2018

Code Sample, a copy-pastable example if possible

Input:

from concurrent.futures import ThreadPoolExecutor
import pandas as pd

x = pd.date_range('2001', '2020')
with ThreadPoolExecutor(2) as p:
    assert all(p.map(lambda x: x.is_unique, [x]*2))

Output:

Traceback (most recent call last):
  File "bug.py", line 7, in <module>
    assert all(p.map(lambda x: x.is_unique, [x]*2))
AssertionError

Problem description

When calling Index.is_unique from multiple threads simultaneously, the wrong answer is returned.

Expected Output

Shouldn't raise.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Linux
OS-release: 4.14.14-200.fc26.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: C
LANG: C
LOCALE: None.None

pandas: 0.23.0
pytest: 3.5.1
pip: 10.0.1
setuptools: 39.1.0
Cython: 0.28.2
numpy: 1.13.3
scipy: 0.19.1
pyarrow: None
xarray: None
IPython: 4.2.1
sphinx: 1.7.4
patsy: 0.5.0
dateutil: 2.7.2
pytz: 2018.4
blosc: None
bottleneck: 1.2.1
tables: 3.4.3
numexpr: 2.6.2
feather: None
matplotlib: 2.2.2
openpyxl: None
xlrd: 1.1.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.7
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: 0.1.5
pandas_gbq: None
pandas_datareader: None

@jreback
Copy link
Contributor

jreback commented May 21, 2018

virtually nothing is threadsafe in pandas you have to be very careful. you are welcome to submit a patch.

@jreback
Copy link
Contributor

jreback commented May 21, 2018

see #2728

kinow added a commit to kinow/pandas that referenced this issue Jul 21, 2018
…ue with multiple threads accessing an Index
kinow added a commit to kinow/pandas that referenced this issue Jul 21, 2018
…ue with multiple threads accessing an Index
@gfyoung gfyoung added Bug Multithreading Parallelism in pandas labels Jul 23, 2018
kinow added a commit to kinow/pandas that referenced this issue Jul 27, 2018
…ue with multiple threads accessing an Index
kinow added a commit to kinow/pandas that referenced this issue Oct 11, 2018
…ue with multiple threads accessing an Index
kinow added a commit to kinow/pandas that referenced this issue Oct 12, 2018
…ue with multiple threads accessing an Index
@batterseapower
Copy link
Contributor

batterseapower commented Feb 21, 2019

It's not surprising that modifying a frame while trying to use it from multiple threads is unsafe, but it's kind of weird that "read only" operations like is_unique can break. In my experience at least, this race in Index is the only part of Pandas that seems to go wrong when using a DataFrame in a read-only fashion from several threads. I guess there might be a few other initialised caches hanging around other parts of the codebase though.

@marberi
Copy link

marberi commented Jun 4, 2019

Would this be an acceptable fix? I tested acquiring locks inside the __get__ method itself and
it did not work.

marberi@ad33858

@TomAugspurger
Copy link
Contributor

We would need to measure the cost of acquiring the lock relative to the operation.

Do you know, if I do .is_unique multiple times, would the lock once, on the first call? And subsequent calls with return the cached value, without acquiring the lock?

Also, this feels like too low of a level for acquiring the lock, though I may be wrong about that.

@marberi
Copy link

marberi commented Jun 5, 2019

Note that this will affect not only "is_unique", but all other methods which are using the same decorator. Also, it locks on all calls, which is excessive. That should be easily fixable, tho. You have a better suggestion of where to add the locking?

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Jun 5, 2019 via email

@rhshadrach
Copy link
Member

I cannot reproduce on Python 3.12.

@rhshadrach rhshadrach added the Closing Candidate May be closeable, needs more eyeballs label Nov 5, 2024
@batterseapower
Copy link
Contributor

Seems unlikely that what you are observing is reliable, right? The only reliable (ie robust to thread scheduling changes) to fix this is to add a lock on the pandas side?

@batterseapower
Copy link
Contributor

Just to be clear, the issue is being closed but the problem is not actually fixed, right?

Because _ensure_mapping_populated is still clearly not thread safe: https://github.com/pandas-dev/pandas/blob/main/pandas/_libs/index.pyx#L339

@rhshadrach
Copy link
Member

rhshadrach commented May 10, 2025

Looking over this again, as mentioned above essentially none of pandas is thread-safe. Even operations that appear to the user to be read-only can modify caches for performance. I am negative making modifications to support thread-safety that have a performance impact unless there is a PDEP that can address (a) level of effort (b) single thread performance cost and (c) benefit across the entire pandas API.

Doing this in a one-off manner that has performance costs is not the appropriate way to address this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Closing Candidate May be closeable, needs more eyeballs Multithreading Parallelism in pandas
Projects
None yet
8 participants