Fix get_node_from_slot to handle resharding #3182

johan-seesaw · 2024-03-13T19:31:17Z

Pull Request check-list

Please make sure to review and check all of these items:

Do tests and lints pass with this change?
Do the CI tests pass with this change (enable it first in your forked repo and wait for the github action build to finish)?
Is the new or changed code fully tested?
Is a documentation update included (if this change modifies existing APIs, or introduces new ones)?
Is there an example added to the examples folder (if applicable)?
Was the change added to CHANGES file?

NOTE: these things are not required to open a PR and can be done
afterwards / while the PR is open.

Description of change

Introduce a new Enum and optional flag value to allow reading only from replicas, if the command supports it. It appears the sync version of RedisCluster used to support this when server_type was passed in to get_node_from_slot, but that parameter isn't set anywhere.

This PR addresses the issue of reading from replicas while resharding, which can cause index failures (#2988).
It is an alternative and more comprehensive solution to #2989, covering both the sync and asyncio implementations.

This implementation moves the logic to a location shared by the asyncio and sync versions of the library. I have a follow-on PR that introduces additional read-only modes; it was initially part of this PR but has been kept separate to increase the likelihood that this PR gets merged.

Before the change, we stored the next replica to read from in the primary_to_idx cache. After the change we store the last read replica/primary. This is important for the following example:

Server exists with a slot with 1 primary and 2 replicas.
A number of commands are executed. The last command executes on index 1, setting primary_to_idx to value 2.
Before another command is executed, the number of replicas changes from 2 to 1. The total number of nodes in that slot is now 2.
get_node_from_slot reads the next-node value as 2 and returns it. It also sets the next-node value to 1, since (2 + 1) % 2 = 1, so the primary is skipped on the next run.
We try to read the node at slot_nodes[2], which doesn't exist, and this raises an index exception.
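The failure sequence above can be reproduced with a minimal sketch. This is a simplified model of the old bookkeeping, not the library's actual code; the function name `next_index_old`, the `"primary-node"` key, and the node labels are hypothetical.

```python
# Simplified model of the pre-change round-robin (hypothetical names).
# The cache stores the index to use on the NEXT call.
def next_index_old(primary_to_idx, primary_name, list_length):
    index = primary_to_idx.setdefault(primary_name, 0)
    # Precompute the following index using the CURRENT list length.
    primary_to_idx[primary_name] = (index + 1) % list_length
    return index


slot_nodes = ["primary", "replica-1", "replica-2"]
primary_to_idx = {}

# Two commands: the last one runs on index 1 and leaves 2 in the cache.
for _ in range(2):
    last_index = next_index_old(primary_to_idx, "primary-node", len(slot_nodes))

# Resharding removes one replica before the next command.
slot_nodes = ["primary", "replica-1"]

# The stale cached value 2 is returned even though only 2 nodes remain.
stale_index = next_index_old(primary_to_idx, "primary-node", len(slot_nodes))

try:
    slot_nodes[stale_index]
    lookup_failed = False
except IndexError:
    lookup_failed = True
```

The stored value is only validated against the list length when it is written, not when it is read, which is why a shrinking node list breaks the lookup.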

After the change, we store only the last-read index, not the next one, so the situation unfolds as follows:

Server exists with a slot with 1 primary and 2 replicas.
A number of commands are executed. The last command executes on index 1, setting primary_name_to_last_used_index to value 1.
Before another command is executed, the number of replicas changes from 2 to 1. The total number of nodes in that slot is now 2.
get_node_from_slot reads the last-read-node value as 1, increments it, and takes the result modulo the new length of 2. The resulting value of 0 is stored in primary_name_to_last_used_index and returned.
We try to read the Node at slot_nodes[0], the primary, as expected, and everything works.
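The same scenario with the new bookkeeping can be sketched as follows. Again, this is a simplified model under assumptions, not the code in this PR: the function name `next_index_new`, the `"primary-node"` key, and the initial value of -1 (chosen so the first call returns index 0) are hypothetical.

```python
# Simplified model of the post-change round-robin (hypothetical names).
# The cache stores the index that was used LAST.
def next_index_new(primary_name_to_last_used_index, primary_name, list_length):
    last_used = primary_name_to_last_used_index.get(primary_name, -1)
    # Increment and wrap against the length seen on THIS call.
    index = (last_used + 1) % list_length
    primary_name_to_last_used_index[primary_name] = index
    return index


slot_nodes = ["primary", "replica-1", "replica-2"]
last_used_cache = {}

# Two commands: the last one runs on index 1, and 1 is what gets stored.
for _ in range(2):
    last_index = next_index_new(last_used_cache, "primary-node", len(slot_nodes))

# Resharding removes one replica before the next command.
slot_nodes = ["primary", "replica-1"]

# (1 + 1) % 2 == 0, so the returned index is valid for the shorter list.
safe_index = next_index_new(last_used_cache, "primary-node", len(slot_nodes))
node = slot_nodes[safe_index]
```

Because the modulo is applied at read time against the current length, the returned index is always within bounds for a non-empty node list.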

Fixes #2988

johan-seesaw · 2024-05-28T19:07:06Z

@gerzse I was wondering if there might be any bandwidth to review this PR?

johan-seesaw · 2024-06-21T00:48:58Z

Hi @gerzse, I've added unit tests, a change test, and fixed a bug. My initial commit was intended to solicit feedback before doing that work, but I figured I'd go ahead and complete it in the hope that this change might make it in.

I've structured this as two commits that you can review separately and that can eventually be squashed into one before merging. The first demonstrates the problem area here and here (notably, two separate exception types for effectively the same issue).

The actual-fix commit eliminates the cause demonstrated by those tests.

johan-seesaw · 2024-07-16T21:55:08Z

More related issues.

#3238
#2575

noorul · 2025-03-26T08:06:52Z

@johan-seesaw How did you fix this in your application, given that this PR is taking time to get merged?

johan-seesaw force-pushed the master branch from 1e1dcbd to 25273b6 Compare March 13, 2024 19:32

johan-seesaw changed the title ~~Add replica-only read mode to cluster and asyncio cluster~~ Fix get_node_from_slot to handle resharding May 28, 2024

johan-seesaw force-pushed the master branch 2 times, most recently from 58254be to f9f2a39 Compare May 28, 2024 18:49

johan-seesaw force-pushed the master branch from f9f2a39 to f72d5da Compare June 20, 2024 22:59

johan-seesaw marked this pull request as draft June 20, 2024 23:01

johan-seesaw closed this Jun 20, 2024

johan-seesaw force-pushed the master branch from f72d5da to 70b4f48 Compare June 20, 2024 23:02

johan-seesaw added 2 commits June 20, 2024 17:30

Add tests which I hope pass, to show broken code

03c191b

Fix get_node_from_slot index error during reshard

e20f9e5

johan-seesaw reopened this Jun 21, 2024

johan-seesaw marked this pull request as ready for review June 21, 2024 00:46

Merge branch 'master' into master

b765597
