Description
Elastic.Clients.Elasticsearch version: 9.1.4
Elasticsearch version: 9.1.0
.NET runtime version: 6
Operating system version: Windows 11
Description of the problem including expected versus actual behavior:
https://github.com/elastic/elastic-transport-net/blob/1e3422b0218d5f13b35072faa588fda46f9000c6/src/Elastic.Transport/DistributedTransport.cs#L288-L323
When using the same ElasticsearchClient to make BulkAsync indexing calls from multiple threads, I encountered a race condition that triggered an ArgumentException within the GetOrCreateBoundConfiguration local method of the lower-level Elastic.Transport library's DistributedTransport.cs class, specifically from the cache.Add call inside the lock statement. What appears to be occurring is that multiple threads have a cache miss at the same time, so each attempts to add a new BoundConfiguration to the cache. While the Add call is guarded by a lock, nothing stops the following sequence:
Thread A checks the cache for a BoundConfiguration -> cache miss
Thread B checks the cache for a BoundConfiguration -> cache miss
Thread A creates a new BoundConfiguration
Thread A acquires the lock, adds the newly created BoundConfiguration to the cache, and releases the lock
Thread B creates a new BoundConfiguration
Thread B acquires the lock and attempts to add its BoundConfiguration to the cache with the same key as Thread A, and an ArgumentException is thrown because an entry with the same key has already been added.
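One way to close this window is to re-check the cache after acquiring the lock and return the existing entry instead of adding a second one. The following is only a minimal sketch of that idea, not the library's actual code: `ConfigCache`, `GetOrCreate`, and the `Dictionary`-based cache are hypothetical stand-ins that I am using to model the check-then-add shape described above.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical stand-in for the transport's cache; all names are illustrative only.
public sealed class ConfigCache
{
    private readonly Dictionary<string, object> _cache = new Dictionary<string, object>();
    private readonly object _gate = new object();

    public object GetOrCreate(string key, Func<object> factory)
    {
        // First lookup: most callers return here once the entry exists.
        lock (_gate)
        {
            if (_cache.TryGetValue(key, out var hit)) return hit;
        }

        // Built outside the lock, so several threads may reach this point for the same key.
        var created = factory();

        lock (_gate)
        {
            // Re-check under the lock: another thread may have added the key
            // between our miss and now. Return its value instead of calling Add.
            if (_cache.TryGetValue(key, out var winner)) return winner;
            _cache.Add(key, created);
            return created;
        }
    }
}

public static class Demo
{
    public static void Main()
    {
        var cache = new ConfigCache();
        var results = new object[64];

        // Many concurrent callers request the same key.
        Parallel.For(0, results.Length, i =>
            results[i] = cache.GetOrCreate("same-key", () => new object()));

        // Every caller should observe the single instance that won the race.
        Console.WriteLine(Array.TrueForAll(results, r => ReferenceEquals(r, results[0])));
    }
}
```

The losing thread's freshly built value is simply discarded, which costs one wasted construction but never throws.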
Reproducing the sequence that triggers the Argument Exception cannot be done with 100% reliability without access to robust multithreaded debugging tools, as the window for the race condition to trigger an ArgumentException in the cache add step is exceptionally small, but I have been able to reproduce the Exception without controlling the advancement of threads (though not consistently).
A workaround is to upgrade to .NET 8: that code path bypasses the lock and has safer cache access semantics, because TryAdd does not throw.
Steps to reproduce:
- Issue multiple bulk API calls from multiple threads. You will need to set breakpoints to ensure that multiple threads enter the critical section of the GetOrCreateBoundConfiguration method at the same time
- Wait for ArgumentException to be thrown on cache add
As this is a multithreading race condition, reproducing it consistently can be challenging. In the project where I encountered it, it occurs "naturally" 30-40% of the time, but in a similar proof-of-concept project that number is less than 4%.
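For anyone who wants to see the interleaving without breakpoints, the sequence can be forced outside the client with a `Barrier`. This is a sketch under stated assumptions: it does not touch Elastic.Transport at all, and `RaceRepro`, `Run`, and the `Dictionary` cache are hypothetical names that only mimic the "miss, create, lock, Add" shape.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical model of the race; this is not the library's code.
public static class RaceRepro
{
    public static string Run()
    {
        var cache = new Dictionary<string, object>();
        var gate = new object();
        using var bothMissed = new Barrier(2);

        object GetOrCreate(string key)
        {
            object found;
            lock (gate) { cache.TryGetValue(key, out found); }
            if (found != null) return found;

            // Hold each thread here until both have observed a cache miss.
            bothMissed.SignalAndWait();

            var created = new object();
            // No re-check: whichever thread arrives second throws ArgumentException.
            lock (gate) { cache.Add(key, created); }
            return created;
        }

        var tasks = new[]
        {
            Task.Run(() => GetOrCreate("k")),
            Task.Run(() => GetOrCreate("k")),
        };

        try
        {
            Task.WaitAll(tasks);
            return "none";
        }
        catch (AggregateException ex)
        {
            // Report the type of the exception thrown by the losing thread.
            return ex.InnerException?.GetType().Name ?? "unknown";
        }
    }

    public static void Main() => Console.WriteLine(Run());
}
```

Because the barrier guarantees that both lookups miss before either `Add` runs, the duplicate-key failure happens on every run of this model rather than a fraction of the time.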
Expected behavior
No ArgumentException will be thrown before the bulk async request is sent.
Provide ConnectionSettings (if relevant): Connection settings aren't relevant, as the error occurs before the request is sent.
    // ES is behind an external load balancer now, hence the "single" node
    var primarySettings = new ElasticsearchClientSettings(
            new SingleNodePool(connectionSettings.CurrentValue.PrimaryCluster.Nodes.Select(n => new Uri(n)).First())
        ).MaximumRetries(5)
        .Authentication(new ApiKey(connectionSettings.CurrentValue.PrimaryCluster.ApiKey));
    primarySettings.CertificateFingerprint(connectionSettings.CurrentValue.CertAuthSHA256Fingerprint);
Provide DebugInformation (if relevant):