Fix SlotNotCoveredError when cluster is resharding #2989


Conversation

dinispeixoto

Pull Request check-list

Please make sure to review and check all of these items:

  • Do tests and lints pass with this change?
  • Do the CI tests pass with this change (enable it first in your forked repo and wait for the github action build to finish)?
  • Is the new or changed code fully tested?
  • Is a documentation update included (if this change modifies existing APIs, or introduces new ones)?
  • Is there an example added to the examples folder (if applicable)?
  • Was the change added to CHANGES file?

NOTE: these things are not required to open a PR and can be done
afterwards / while the PR is open.

Description of change

Please provide a description of the change here.

Fixes #2988

@johan-seesaw commented Nov 15, 2023

Interestingly, I think I hit roughly the same bug today, but in the sync version of this code.

https://github.com/redis/redis-py/blob/master/redis/cluster.py#L1392

  File "/python311/lib64/python3.11/site-packages/redis/cluster.py", line 1115, in execute_command
    raise e
  File "/python311/lib64/python3.11/site-packages/redis/cluster.py", line 1101, in execute_command
    res[node.name] = self._execute_command(node, *args, **kwargs)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/python311/lib64/python3.11/site-packages/redis/cluster.py", line 1210, in _execute_command
    raise e
  File "/python311/lib64/python3.11/site-packages/redis/cluster.py", line 1138, in _execute_command
    target_node = self.nodes_manager.get_node_from_slot(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/python311/lib64/python3.11/site-packages/redis/cluster.py", line 1425, in get_node_from_slot
    return self.slots_cache[slot][node_idx]
           ~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^
IndexError: list index out of range

I don't really understand why these two seemingly very similar NodesManagers exist in isolation in the sync and asyncio versions of this library. The asyncio version catches the IndexError and exposes it as a SlotNotCoveredError, while the sync version passes it through unchanged. But I think the same fix would apply to both.
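For reference, here is a rough sketch of how the sync lookup could mirror that asyncio behaviour. This is not the redis-py implementation: it is written as a standalone helper, the attribute names (slots_cache, read_load_balancer, get_server_index) are taken from the snippets in this thread, and the exact import location of SlotNotCoveredError may vary by version.

    from redis.exceptions import SlotNotCoveredError

    def get_node_from_slot(nodes_manager, slot, read_from_replicas=False):
        # Sketch: resolve a node for the slot and surface a stale slots
        # cache as SlotNotCoveredError instead of a bare IndexError,
        # mirroring what the asyncio NodesManager reportedly does.
        slots_cache = nodes_manager.slots_cache
        if not slots_cache.get(slot):
            raise SlotNotCoveredError(f'Slot "{slot}" is not covered by the cluster.')

        node_idx = 0
        if read_from_replicas:
            # Round-robin across the primary and replicas serving this slot.
            node_idx = nodes_manager.read_load_balancer.get_server_index(
                slots_cache[slot][0].name, len(slots_cache[slot])
            )

        try:
            return slots_cache[slot][node_idx]
        except IndexError:
            # The slot's node list shrank (e.g. mid-resharding) after the
            # load balancer handed back an index into the old, longer list.
            raise SlotNotCoveredError(
                f'Slot "{slot}" is not covered by the cluster while resharding.'
            )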


# we use the node returned by RR in the load balancer
# if it's part of the slots cache, otherwise we use primary
node = node_idx if node_idx < len(self.slots_cache[slot]) else 0

@johan-seesaw Nov 15, 2023

The only case I can see this happening is when the LoadBalancer object has a pre-existing history with a list_size of, let's say, 3 nodes for the given primary. If an event occurs where the list is no longer 3 (say the new list size is 1) and we enter the LoadBalancer with the existing dictionary holding a value of 2 for this primary, then get_server_index will return 2; the % 1 operation is only applied to the "next value" before it is stored in the dictionary, not to the value being returned.

Perhaps a simpler rewrite would be to store the last-used value, rather than the next-to-use value, in the LoadBalancer class. Then there would be only one modulo operation, and it would always be performed with the current list size.

    def get_server_index(self, primary: str, list_size: int) -> int:
        # default to -1 if not found, so after incrementing it will be 0
        server_index = (self.primary_to_idx.get(primary, -1) + 1) % list_size
        self.primary_to_idx[primary] = server_index
        return server_index
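As a small standalone illustration of that failure mode: the first class below approximates the pre-fix behaviour of caching the next index, the second is the last-used variant suggested above. The class names and the "primary-a" key are made up for the example.

    class NextIndexBalancer:
        # Approximation of the pre-fix behaviour: the cached value is the
        # *next* index, pre-computed with the list_size of the previous call.
        def __init__(self):
            self.primary_to_idx = {}

        def get_server_index(self, primary, list_size):
            server_index = self.primary_to_idx.setdefault(primary, 0)
            self.primary_to_idx[primary] = (server_index + 1) % list_size
            # The returned value may still index into an older, longer list.
            return server_index

    class LastUsedBalancer:
        # The suggested rewrite: cache the last-used index and apply a single
        # modulo with the list_size passed on *this* call.
        def __init__(self):
            self.primary_to_idx = {}

        def get_server_index(self, primary, list_size):
            server_index = (self.primary_to_idx.get(primary, -1) + 1) % list_size
            self.primary_to_idx[primary] = server_index
            return server_index

    buggy, fixed = NextIndexBalancer(), LastUsedBalancer()

    # Two reads while the slot has 3 nodes; the buggy balancer now caches
    # a next index of 2.
    for _ in range(2):
        buggy.get_server_index("primary-a", 3)
        fixed.get_server_index("primary-a", 3)

    # After resharding, the slot is served by a single node.
    print(buggy.get_server_index("primary-a", 1))  # 2 -> IndexError upstream
    print(fixed.get_server_index("primary-a", 1))  # 0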

@petyaslavova
Collaborator

SlotNotCoveredError is now handled by the cluster's retry mechanism, so I'm closing this PR as the issue has been addressed.
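For anyone on a version where the built-in retry does not yet cover this, a rough application-level workaround might look like the sketch below. The key name, attempt count, and delay are placeholders, and it assumes SlotNotCoveredError is importable from redis.exceptions in the installed redis-py version.

    import time

    from redis.cluster import RedisCluster
    from redis.exceptions import SlotNotCoveredError

    rc = RedisCluster(host="localhost", port=7000)

    def get_with_retry(key, attempts=3, delay=0.1):
        # Retry briefly while the cluster finishes resharding and the
        # client's slots cache catches up.
        for attempt in range(attempts):
            try:
                return rc.get(key)
            except SlotNotCoveredError:
                if attempt == attempts - 1:
                    raise
                time.sleep(delay)

    get_with_retry("some-key")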
