Skip to content

Fix get_node_from_slot to handle resharding #3182

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

johan-seesaw
Copy link

@johan-seesaw johan-seesaw commented Mar 13, 2024

Pull Request check-list

Please make sure to review and check all of these items:

  • Do tests and lints pass with this change?
  • Do the CI tests pass with this change (enable it first in your forked repo and wait for the github action build to finish)?
  • Is the new or changed code fully tested?
  • Is a documentation update included (if this change modifies existing APIs, or introduces new ones)?
  • Is there an example added to the examples folder (if applicable)?
  • Was the change added to CHANGES file?

NOTE: these things are not required to open a PR and can be done
afterwards / while the PR is open.

Description of change

Introduce a new Enum and optional flag value to allow reading only from replicas, if the command supports it. It appears the sync version of RedisCluster used to support this when server_type was passed in to get_node_from_slot, but that parameter isn't set anywhere.

This PR addresses the issue of reading from replicas while resharding, which can cause index failures.
It is an alternative, and more comprehensive solution to #2989, but in both sync and asyncio implementations. (#2988).

This implementation moves the logic to a shared location between the asyncio and the sync versions of the library. I have a follow on PR to introduce additional read-only modes that was initially part of this PR, but has been kept separate to hopefully increase the likelihood that this PR can get merged.

Before the change, we stored the next replica to read from in the primary_to_idx cache. After the change we store the last read replica/primary. This is important for the following example:

  1. Server exists with a slot with 1 primary and 2 replicas.
  2. A number of commands are executed. The last command executes on index 1, setting primary_to_idx to value 2.
  3. Before another command is executed, the number of replicas changes from 2 to 1. The total number of nodes in that slot is now 2.
  4. get_node_from_slot reads the next-node value as 2, and returns that (and sets the next-node value to 1, due to 2+1%2 = 1, skipping the primary on the next run)
  5. We try to read the Node at slot_nodes[2], which doesn't exist, and gives us an index exception.

After, we only store the last-read, not the next-value, so the situation would unfold as follows:

  1. Server exists with a slot with 1 primary and 2 replicas.
  2. A number of commands are executed. The last command executes on index 1, setting primary_name_to_last_used_index to value 1.
  3. Before another command is executed, the number of replicas changes from 2 to 1. The total number of nodes in that slot is now 2.
  4. get_node_from_slot reads the last-read-node value as 1, increments, and modulos against length of 2, resulting in the value of 0 being stored in primary_name_to_last_used_index and returned
  5. We try to read the Node at slot_nodes[0], the primary, as expected, and everything works.

Fixes #2988

@johan-seesaw johan-seesaw changed the title Add replica-only read mode to cluster and asyncio cluster Fix get_node_from_slot to handle resharding May 28, 2024
@johan-seesaw johan-seesaw force-pushed the master branch 2 times, most recently from 58254be to f9f2a39 Compare May 28, 2024 18:49
@johan-seesaw
Copy link
Author

@gerzse I was wondering if there might be any bandwidth to review this PR?

@johan-seesaw johan-seesaw reopened this Jun 21, 2024
@johan-seesaw johan-seesaw marked this pull request as ready for review June 21, 2024 00:46
@johan-seesaw
Copy link
Author

johan-seesaw commented Jun 21, 2024

Hi @gerzse I've added UTs, a change test, and fixed a bug. My initial commit was intended to solicit feedback before performing those actions, but I figured I'd go ahead and complete them in hope this change might make it in.

I've structured two commits that you can see, and eventually squash into one before merging. It demonstrates the problem area here and here (notably, two separate exception types for effectively the same issue).

The actual-fix commit eliminates the cause demonstrated by those tests.

@johan-seesaw
Copy link
Author

More related issues.

#3238
#2575

@noorul
Copy link

noorul commented Mar 26, 2025

@johan-seesaw How did you fix it in the application as this is taking time to get merged?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

SlotNotCoveredError when cluster is resharding
2 participants