ESQL: Compute infrastructure for LEFT JOIN #118889

Merged: 7 commits, Dec 27, 2024

Conversation

nik9000
Member

@nik9000 nik9000 commented Dec 17, 2024

This adds some infrastructure that we can use to run LOOKUP JOIN using real LEFT JOIN semantics.

Right now, if LOOKUP JOIN matches many rows in the lookup index, we merge all of the values into a multivalued field. So the number of rows emitted from LOOKUP JOIN is the same as the number of rows that come into LOOKUP JOIN.

This change builds the infrastructure to emit one row per match, mostly reusing the infrastructure from ENRICH.
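
To make the difference concrete, here is a minimal sketch in plain Java collections. It is a toy model, not the ESQL compute engine: `LeftJoinSketch`, `mergeMatches`, and `leftJoin` are hypothetical names, and rows are plain lists rather than Pages or Blocks. The first method mimics merging every match into one multivalued cell, the second emits one row per match and pads unmatched rows with null.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

/** Toy model only: rows are plain lists, not ESQL Pages or Blocks. */
final class LeftJoinSketch {
    /** Merging behavior: one output row per input row, all matches packed into one cell. */
    static List<List<Object>> mergeMatches(List<String> keys, Map<String, List<String>> lookup) {
        List<List<Object>> out = new ArrayList<>();
        for (String key : keys) {
            // A missing key yields a null cell; Arrays.asList tolerates null elements.
            out.add(Arrays.asList(key, lookup.get(key)));
        }
        return out;
    }

    /** LEFT JOIN behavior: one output row per match, or one null-padded row if nothing matches. */
    static List<List<Object>> leftJoin(List<String> keys, Map<String, List<String>> lookup) {
        List<List<Object>> out = new ArrayList<>();
        for (String key : keys) {
            List<String> matches = lookup.getOrDefault(key, List.of());
            if (matches.isEmpty()) {
                out.add(Arrays.asList(key, null));
            }
            for (String match : matches) {
                out.add(Arrays.asList(key, match));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> keys = List.of("a", "b", "c");
        Map<String, List<String>> lookup = Map.of("a", List.of("x", "y"), "c", List.of("z"));
        System.out.println(mergeMatches(keys, lookup)); // same row count as the input
        System.out.println(leftJoin(keys, lookup));     // row count grows with the matches
    }
}
```

With three input keys where one key has two matches and one has none, the merging version returns three rows and the left-join version returns four.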

@elasticsearchmachine elasticsearchmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Dec 17, 2024
@elasticsearchmachine
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

@nik9000
Member Author

nik9000 commented Dec 17, 2024

I'd like to add another randomized test to this and then get it in. Then I'd like to plug it in in a follow-up change. That change will:

  1. Change the LookupResponse to return a List
  2. Remove the MergePositionsOperator from lookup and return all of the enriched pages.
  3. Plug this into the LookupFromIndexOperator and have it consume all the pages from the response, returning the results of this thing.
  4. Fix all the tests it breaks.
  5. Party

* | l99 | null | null |
* }</pre>
*/
class RightChunkedLeftJoin implements Releasable {
Member

👍 for the asciiart
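
For readers without the diagram, a rough sketch of the idea follows. It is only an illustration under stated assumptions, not the `RightChunkedLeftJoin` code in this PR: `ChunkedLeftJoinSketch`, `join`, and `finish` are hypothetical names, rows are plain lists, and the sketch assumes each right-hand chunk carries a non-decreasing array of left-hand positions. Left rows that no chunk references come out padded with null, like the `l99 | null | null` row in the Javadoc fragment above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy chunked LEFT JOIN over strings; hypothetical, not the class in this PR. */
final class ChunkedLeftJoinSketch {
    private final List<String> left;
    private int next = 0; // first left position that has not been emitted yet

    ChunkedLeftJoinSketch(List<String> left) {
        this.left = left;
    }

    /**
     * Joins one right-hand chunk. positions[i] is the left row that values[i]
     * matched. Positions are assumed non-decreasing across all chunks.
     */
    List<List<String>> join(int[] positions, String[] values) {
        List<List<String>> out = new ArrayList<>();
        for (int i = 0; i < positions.length; i++) {
            emitUnmatchedBefore(positions[i], out);
            out.add(Arrays.asList(left.get(positions[i]), values[i]));
            next = Math.max(next, positions[i] + 1);
        }
        return out;
    }

    /** Call after the last chunk: emits the left rows that never matched. */
    List<List<String>> finish() {
        List<List<String>> out = new ArrayList<>();
        emitUnmatchedBefore(left.size(), out);
        return out;
    }

    /** Emits null-padded rows for every skipped left position below the given one. */
    private void emitUnmatchedBefore(int position, List<List<String>> out) {
        while (next < position) {
            out.add(Arrays.asList(left.get(next), null));
            next++;
        }
    }
}
```

Tracking a single `next` cursor is enough here only because of the ordering assumption; unordered positions would need a per-row "seen" bitmap instead.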

try {
int b = 0;
while (b < leftHand.getBlockCount()) {
blocks[b] = leftHand.getBlock(b).filter(filter);
Member

Nit:
To avoid allocating the array, a new method could be added to Block for consecutive integers, such as `Block#filter(int start, int stop)` or, depending on how well it gets JITed, a Range class: `Block#filter(range)`.

Member Author

Yeah. There's a lot we could do here to be honest. The range version of filter might be quite nice.

I think those might be for a follow up.
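
A sketch of what the range variant discussed above might look like, on a toy class rather than the real `Block` interface. `ToyIntBlock` and both `filter` overloads are hypothetical; the half-open `[start, stop)` convention is an assumption, since the thread does not specify it.

```java
import java.util.Arrays;
import java.util.Objects;

/** Toy stand-in for a block of int values; NOT the real ESQL Block. */
final class ToyIntBlock {
    private final int[] values;

    ToyIntBlock(int... values) {
        this.values = values.clone();
    }

    /** Position-array filter: the caller has to allocate and fill positions first. */
    ToyIntBlock filter(int[] positions) {
        int[] out = new int[positions.length];
        for (int i = 0; i < positions.length; i++) {
            out[i] = values[positions[i]];
        }
        return new ToyIntBlock(out);
    }

    /** Hypothetical range filter over [start, stop): no positions array needed. */
    ToyIntBlock filter(int start, int stop) {
        Objects.checkFromToIndex(start, stop, values.length);
        return new ToyIntBlock(Arrays.copyOfRange(values, start, stop));
    }

    int[] toArray() {
        return values.clone();
    }
}
```

For consecutive positions, `block.filter(1, 4)` selects the same values as `block.filter(new int[] {1, 2, 3})` without building the intermediate array. Whether that pays off in the real engine is exactly the open question in the thread.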

Member

@costin costin left a comment

LGTM

@nik9000
Member Author

nik9000 commented Dec 17, 2024

I've pushed a randomized test that seems to fail 0.7% of the time. Fun. I'll have a debug when I'm back.

Contributor

@ivancea ivancea left a comment

LGTM!

void populate(int docCount, List<String> expected) throws IOException;
}

private void runLookup(PopulateIndices populateIndices) throws IOException {
// TODO this should *fail* if the target index isn't a lookup type index - it doesn't now.
Contributor

Unrelated: If this comment talks about the ESQL query, it does fail now

Member Author

Yeah. That was about the query. Great news!

Contributor

We build multiple vectors/blocks here. Should we add a cranky CB test?

Member Author

Yes.

@nik9000 nik9000 enabled auto-merge (squash) December 27, 2024 18:07
@nik9000 nik9000 merged commit 8afbb52 into elastic:main Dec 27, 2024
16 checks passed
@nik9000
Member Author

nik9000 commented Jan 15, 2025

Looks like I never backported this... bleh. Incoming

nik9000 added a commit to nik9000/elasticsearch that referenced this pull request Jan 15, 2025
@nik9000
Member Author

nik9000 commented Jan 15, 2025

backport is #120232

nik9000 added a commit that referenced this pull request Jan 16, 2025
* ESQL: Compute infrastructure for LEFT JOIN (#118889)

This adds some infrastructure that we can use to run LOOKUP JOIN using
real LEFT JOIN semantics.

Right now, if LOOKUP JOIN matches many rows in the `lookup` index, we
merge all of the values into a multivalued field. So the number of rows
emitted from LOOKUP JOIN is the same as the number of rows that come
into LOOKUP JOIN.

This change builds the infrastructure to emit one row per match, mostly
reusing the infrastructure from ENRICH.

* ESQL: Make LOOKUP more left-joiny (#119475)

This makes `LOOKUP` return multiple rows if there are multiple matches. This is the way SQL works so it's *probably* what folks will expect. Even if it isn't, it allows for more optimizations. Like, this change doesn't optimize anything - it just changes the behavior. But there are optimizations you can do *later* that are transparent when we have *this* behavior, but not with the old behavior.

Example:
```
-  2  | [German, German, German] | [Austria, Germany, Switzerland]
+  2  | German                   | [Austria, Germany]
+  2  | German                   | Switzerland
+  2  | German                   | null
```

Relates: #118781
Labels
:Analytics/ES|QL AKA ESQL >non-issue Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v8.18.0 v9.0.0