Version:
redis-py client: 4.5.5
Cluster engine (ElastiCache): 6.2.6

Platform:
Python 3.8 on ECS, with ElastiCache

Description:
When the cluster is scaling up or down, while slots are being migrated to a new shard, the client raises the following error:
redis.exceptions.SlotNotCoveredError: Command # 9 (HGETALL ...) of pipeline caused error: ('Slot "4718" not covered by the cluster. "require_full_coverage=True"',)
Full stack trace:
File "/root/clients/redis_client.py", line 219, in run_queries
return await pipeline.execute()
File "/usr/local/lib/python3.8/dist-packages/redis/asyncio/cluster.py", line 1455, in execute
return await self._execute(
File "/usr/local/lib/python3.8/dist-packages/redis/asyncio/cluster.py", line 1536, in _execute
raise result
File "/usr/local/lib/python3.8/dist-packages/redis/asyncio/cluster.py", line 1520, in _execute
cmd.result = await client.execute_command(
File "/usr/local/lib/python3.8/dist-packages/redis/asyncio/cluster.py", line 725, in execute_command
raise e
File "/usr/local/lib/python3.8/dist-packages/redis/asyncio/cluster.py", line 696, in execute_command
ret = await self._execute_command(target_nodes[0], *args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/redis/asyncio/cluster.py", line 745, in _execute_command
target_node = self.nodes_manager.get_node_from_slot(
File "/usr/local/lib/python3.8/dist-packages/redis/asyncio/cluster.py", line 1174, in get_node_from_slot
raise SlotNotCoveredError(
redis.exceptions.SlotNotCoveredError: Command # 6 (HGET ... ...) of pipeline caused error: ('Slot "11246" not covered by the cluster. "require_full_coverage=True"',)
Reproduce locally
It's also possible to reproduce this locally quite easily:
1. Create a simple Redis app that issues pipelines of read-only commands and run a load test against it to force many reads. I also made sure a separate client was writing non-stop to the local cluster. A minimal sketch of such an app follows these steps.
2. Force a reshard of a few thousand slots (e.g. 5k) with redis-cli --cluster reshard 0.0.0.0:7000.
3. The app raises SlotNotCoveredError quite often while the slot migration is in progress; once it finishes, the errors stop.
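For step 1, this is roughly the kind of app I mean. It is a hypothetical sketch, not my actual service: the host, port, key names, and pipeline size are placeholders. It just hammers a local cluster with read-only pipelines while read_from_replicas=True, so reads go through the round-robin load balancer discussed below.

```python
import asyncio

from redis.asyncio.cluster import RedisCluster


async def main() -> None:
    # Placeholder connection details for a local cluster started on port 7000.
    client = RedisCluster(host="127.0.0.1", port=7000, read_from_replicas=True)
    await client.initialize()  # load the initial slots cache
    try:
        while True:
            # A pipeline of read-only commands; while a reshard is running,
            # execute() intermittently raises SlotNotCoveredError.
            async with client.pipeline() as pipe:
                for i in range(20):
                    pipe.hgetall(f"key:{i}")
                await pipe.execute()
    finally:
        await client.close()


if __name__ == "__main__":
    asyncio.run(main())
```

Run it in one terminal, start the reshard from step 2 in another, and the errors show up until the migration completes.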
Possible root cause
I tried debugging the library and here's what I could find so far:
The slots cache is outdated: the slot we need now lives in a different shard. On a MOVED error the client replaces self.slots_cache[slot] with just the redirected_node returned by the error, which is only the master node of the slot's new shard (https://github.com/redis/redis-py/blob/master/redis/asyncio/cluster.py#L1177):
```python
else:
    # The new slot owner is a new server, or a server from a different
    # shard. We need to remove all current nodes from the slot's list
    # (including replications) and add just the new node.
    self.slots_cache[e.slot_id] = [redirected_node]
```
However, at the same time another task may call NodesManager.get_node_from_slot. Since we only read from replicas, it starts by getting a server index from the round-robin load balancer. Say each shard has 1 master and 2 replicas: the balancer can return an index from 0 to 2. But, as shown above, the slots cache for this specific slot now holds a single entry (the master node), so if the load balancer returns 1 or 2 the lookup raises an IndexError, which is surfaced as a SlotNotCoveredError (https://github.com/redis/redis-py/blob/master/redis/asyncio/cluster.py#L1191C1-L1204C14). A toy model of the race follows the snippet below.
```python
try:
    if read_from_replicas:
        # get the server index in a Round-Robin manner
        primary_name = self.slots_cache[slot][0].name
        node_idx = self.read_load_balancer.get_server_index(
            primary_name, len(self.slots_cache[slot])
        )
        return self.slots_cache[slot][node_idx]
    return self.slots_cache[slot][0]
except (IndexError, TypeError):
    raise SlotNotCoveredError(
        f'Slot "{slot}" not covered by the cluster. '
        f'"require_full_coverage={self.require_full_coverage}"'
    )
```
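To make the race concrete, here is a small self-contained toy model (not redis-py code; the function only mimics the load balancer's per-primary counter) showing how an index handed out for a three-node slot list can overflow the cache once the MOVED handler shrinks it to a single node:

```python
# Toy model of the race (illustrative only, not the library's actual classes).
rr_state = {}  # per-primary round-robin counter


def get_server_index(primary: str, list_size: int) -> int:
    idx = rr_state.setdefault(primary, 0)
    rr_state[primary] = (idx + 1) % list_size  # advance for the next caller
    return idx


slot_nodes = ["primary", "replica-1", "replica-2"]  # slots_cache[slot] before the reshard

get_server_index("shard-a", len(slot_nodes))  # -> 0 (primary)
get_server_index("shard-a", len(slot_nodes))  # -> 1 (replica-1); next stored index is 2

# Meanwhile a MOVED redirect rewrites the cache with only the new primary:
slot_nodes = ["new-primary"]

idx = get_server_index("shard-a", len(slot_nodes))  # -> 2, stored before the shrink
slot_nodes[idx]  # IndexError, surfaced as SlotNotCoveredError
```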
To work around the issue I patched the method to check whether the index returned by the load balancer actually exists in the slots cache and, if it doesn't, to fall back to the primary node instead.
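Roughly, the patched method looks like this. It is a simplified sketch of my local workaround, not the upstream code; the signature and surrounding details are abbreviated.

```python
def get_node_from_slot(self, slot, read_from_replicas=False):
    try:
        if read_from_replicas:
            # get the server index in a Round-Robin manner
            primary_name = self.slots_cache[slot][0].name
            node_idx = self.read_load_balancer.get_server_index(
                primary_name, len(self.slots_cache[slot])
            )
            if node_idx < len(self.slots_cache[slot]):
                return self.slots_cache[slot][node_idx]
            # The index comes from stale round-robin state; the cache now only
            # holds the new primary, so fall through and read from it instead.
        return self.slots_cache[slot][0]
    except (IndexError, TypeError):
        raise SlotNotCoveredError(
            f'Slot "{slot}" not covered by the cluster. '
            f'"require_full_coverage={self.require_full_coverage}"'
        )
```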
Happy to create a PR if this makes sense. Sorry if I'm missing something.
Thanks in advance 🙌