Fix realtime get of nested fields with synthetic source #119575

dnhatn · 2025-01-06T09:36:45Z

Today, for get-from-translog operations, we only reindex the root document into an in-memory Lucene index, because the stored _source lives in the root document and that is sufficient. However, synthesizing the source for nested fields requires both the root document and its child documents. As a result, realtime-get operations (as well as update and update-by-query operations) miss nested fields when synthetic source is used.

Another issue is that the translog operation is reindexed lazily during get-from-translog operations. As a result, two realtime-get operations can return slightly different outputs: one reading from the translog and the other from Lucene.

This change resolves both issues. However, addressing the second issue can degrade the performance of realtime-get and update operations. If slight inconsistencies are acceptable, the translog operation should be reindexed lazily instead.
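
To make the first issue concrete, here is a minimal, self-contained sketch that does not use Lucene or Elasticsearch classes. `NestedGetSketch`, `Doc`, `parse`, and `synthesize` are hypothetical names invented for illustration, and the "children first, root last" ordering is an assumption of this toy model. It only shows why a synthesizer that is handed the root document alone cannot rebuild a nested field.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class NestedGetSketch {

    /** Toy stand-in for an indexed document: a root flag plus flat field values. */
    static final class Doc {
        final boolean root;
        final Map<String, Object> fields;

        Doc(boolean root, Map<String, Object> fields) {
            this.root = root;
            this.fields = fields;
        }
    }

    /** Parses one logical document into a block: one child per nested comment, then the root. */
    static List<Doc> parse(String title, List<String> comments) {
        List<Doc> docs = new ArrayList<>();
        for (String comment : comments) {
            docs.add(new Doc(false, Map.of("comments.text", comment)));
        }
        docs.add(new Doc(true, Map.of("title", title)));
        return docs;
    }

    /** Rebuilds a source map from whatever documents were actually indexed. */
    static Map<String, Object> synthesize(List<Doc> indexed) {
        Map<String, Object> source = new LinkedHashMap<>();
        List<Object> children = new ArrayList<>();
        for (Doc doc : indexed) {
            if (doc.root) {
                source.putAll(doc.fields);
            } else {
                children.add(Map.of("text", doc.fields.get("comments.text")));
            }
        }
        if (!children.isEmpty()) {
            source.put("comments", children);
        }
        return source;
    }

    public static void main(String[] args) {
        List<Doc> block = parse("post", List.of("first", "second"));

        // Indexing only the root document: the nested "comments" field cannot be rebuilt.
        List<Doc> rootOnly = List.of(block.get(block.size() - 1));
        System.out.println(synthesize(rootOnly));

        // Indexing the whole block (children and root): the nested field is rebuilt.
        System.out.println(synthesize(block));
    }
}
```

In this toy model, indexing the whole block corresponds to the `writer.addDocuments(parsedDocs.docs())` call in the diff, as opposed to adding the root document alone.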

Closes #119553

elasticsearchmachine · 2025-01-06T20:23:54Z

Hi @dnhatn, I've created a changelog YAML for you.

elasticsearchmachine · 2025-01-06T20:26:07Z

Pinging @elastic/es-storage-engine (Team:StorageEngine)

henningandersen

Thanks Nhat, left a few clarifying questions.

henningandersen · 2025-01-07T08:42:05Z

server/src/main/java/org/elasticsearch/index/engine/TranslogDirectoryReader.java

+                numDocs = parsedDocs.docs().size();
+                writer.addDocuments(parsedDocs.docs());


I am not 100% sure I understand why this is necessary for synthetic source and not for regular source, perhaps you can help me out?

Synthesizing the source of nested fields requires both the root document and its child documents, whereas the stored source only requires the root document:

elasticsearch/server/src/main/java/org/elasticsearch/index/mapper/NestedObjectMapper.java

Line 466 in 12e86b1

collectChildren(parentDoc, parentDocs, childScorer.iterator());

Thanks. I wonder about the field fetch support that is in the GET API, but I suppose it is restricted to not be able to access nested objects in any way?

I don't think including stored fields of nested (child) documents works today. Looks like the root docid is used in ShardGetService#innerGetFetch(...).

henningandersen · 2025-01-07T08:43:27Z

server/src/main/java/org/elasticsearch/index/engine/TranslogDirectoryReader.java

+            // When using synthetic source, the translog operation must always be reindexed into an in-memory Lucene to ensure consistent
+            // output for realtime-get operations. However, this can degrade the performance of realtime-get and update operations.


Would the inconsistencies be purely whitespace and field order, or would they also affect actual content, such as the order of values in arrays?

I think this is OK: synthetic source combined with heavy real-time GET / update use cases is probably rare. But I'd like to understand the tradeoff more deeply.

If it is only whitespace and field order, could we sort that instead, and how difficult would that be? I'm not asking to implement that now; this is more about knowing our backdoor in case we need it.

Yes, the inconsistencies also affect content: duplicate values in an array are removed, and a single-element array is returned as a plain value instead of an array. For example:

{ "f": ["v"] }

becomes

{ "f": "v" }

Or:

{ "f": ["v1", "v1", "v2"] }

becomes

{ "f": ["v1", "v2"] }

henningandersen

LGTM.

dnhatn · 2025-01-08T16:15:21Z

@henningandersen Thank you for reviewing. I wanted to double-check if you're comfortable with the trade-off decision in this PR. I think the alternative approach - trading slight inconsistency for better performance - is also acceptable. While the implementation for that approach is slightly more complex than this PR, I'm happy to make changes if you think it's the better trade-off.

martijnvg

LGTM

elasticsearchmachine added the v9.0.0 label Jan 6, 2025

dnhatn force-pushed the realtime-get-synthetic-source branch 2 times, most recently from 10bd8c0 to 979a327 Compare January 6, 2025 17:57

Fix realtime get nested fields with synthetic source

775663e

dnhatn force-pushed the realtime-get-synthetic-source branch from 979a327 to 775663e Compare January 6, 2025 20:06

dnhatn added >bug v8.18.0 v8.17.1 :StorageEngine/Mapping The storage related side of mappings labels Jan 6, 2025

Update docs/changelog/119575.yaml

8d4625e

dnhatn requested review from henningandersen and martijnvg January 6, 2025 20:25

dnhatn added the auto-backport Automatically create backport pull requests when merged label Jan 6, 2025

dnhatn marked this pull request as ready for review January 6, 2025 20:25

elasticsearchmachine added the Team:StorageEngine label Jan 6, 2025

dnhatn added 2 commits January 6, 2025 12:59

stylecheck

b99d1e2

Merge remote-tracking branch 'elastic/main' into realtime-get-synthet…

32f9f9f

…ic-source

henningandersen reviewed Jan 7, 2025

henningandersen approved these changes Jan 8, 2025

martijnvg approved these changes Jan 8, 2025

elasticsearchmachine added v8.17.2 and removed v8.17.1 labels Jan 9, 2025
