ESQL: Compute infrastructure for LEFT JOIN #118889

Open
wants to merge 3 commits into
base: main

nik9000
Member

nik9000 commented Dec 17, 2024

This adds some infrastructure that we can use to run LOOKUP JOIN using real LEFT JOIN semantics.

Right now, if LOOKUP JOIN matches many rows in the lookup index, we merge all of the values into a multivalued field. So the number of rows emitted from LOOKUP JOIN is the same as the number of rows that come into LOOKUP JOIN.

This change builds the infrastructure to emit one row per match, mostly reusing the infrastructure from ENRICH.
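
To make the difference concrete, here is a minimal sketch of the two behaviours described above. It is not the compute engine code from this PR: `LeftJoinSketch`, `Row`, `mergeIntoMultivalue`, and `leftJoin` are hypothetical names, and plain Java collections stand in for pages and blocks.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; plain collections stand in for pages and blocks.
final class LeftJoinSketch {
    /** One output row: the left-hand key plus a right-hand value, or null when nothing matched. */
    record Row(String leftKey, String rightValue) {}

    /** Merge behaviour: one output entry per input row, with all matches collected into one list. */
    static List<List<String>> mergeIntoMultivalue(List<String> leftKeys, Map<String, List<String>> lookup) {
        List<List<String>> out = new ArrayList<>();
        for (String key : leftKeys) {
            out.add(lookup.getOrDefault(key, List.of()));
        }
        return out;
    }

    /** LEFT JOIN behaviour: one output row per match, and a single null-padded row when there is no match. */
    static List<Row> leftJoin(List<String> leftKeys, Map<String, List<String>> lookup) {
        List<Row> out = new ArrayList<>();
        for (String key : leftKeys) {
            List<String> matches = lookup.getOrDefault(key, List.of());
            if (matches.isEmpty()) {
                out.add(new Row(key, null));
                continue;
            }
            for (String match : matches) {
                out.add(new Row(key, match));
            }
        }
        return out;
    }
}
```

With left keys `l1, l2, l3` and a lookup where `l1` matches two rows and `l3` matches one, the merge version returns three entries while the LEFT JOIN version returns four rows, one of them null-padded for `l2`.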

elasticsearchmachine added the Team:Analytics label (meta label for the analytical engine team: ESQL/Aggs/Geo) on Dec 17, 2024
@elasticsearchmachine
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

@nik9000
Member Author

nik9000 commented Dec 17, 2024

I'd like to add another randomized test to this and then get it in. Then I'd like to plug it in with a follow-up change. That change will:

  1. Change the LookupResponse to return a List
  2. Remove the MergePositionsOperator from lookup and return all of the enriched pages.
  3. Plug this into the LookupFromIndexOperator and have it consume all the pages from the response, returning the results of this thing.
  4. Fix all the tests it breaks.
  5. Party

*/
default Block insertNulls(IntVector before) {
// TODO remove default and scatter to implementation where it can be a lot more efficient
try (Builder builder = elementType().newBlockBuilder(getPositionCount() + before.getPositionCount(), blockFactory())) {
Member

Nit:
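Since getPositionCount() and before.getPositionCount() are each called more than once, how about extracting them into local variables ahead of the try-with-resources (they are plain ints, so they cannot be declared as resources):

int positionCount = getPositionCount();
int beforePositionCount = before.getPositionCount();
try (Builder builder = elementType().newBlockBuilder(positionCount + beforePositionCount, blockFactory())) { }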

* | l99 | null | null |
* }</pre>
*/
class RightChunkedLeftJoin implements Releasable {
Member

👍 for the asciiart

}
BlockFactory factory = leftHand.getBlock(0).blockFactory();
Block[] blocks = new Block[leftHand.getBlockCount() + mergedElementCount];
int[] filter = IntStream.range(next, leftHand.getPositionCount()).toArray();
Member

int[] filter = new int[leftHand.getPositionCount() - next];
Arrays.setAll(filter, i -> next + i);
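
For reference, a small self-contained check of when an `Arrays.setAll` version matches `IntStream.range(next, positionCount).toArray()`: the array needs `positionCount - next` elements, not `positionCount`. The class and method names here are hypothetical, and the sketch assumes `0 <= next <= positionCount`.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

// Hypothetical helper names; assumes 0 <= next <= positionCount.
final class FilterPositions {
    /** Positions next, next + 1, ..., positionCount - 1 built with a stream. */
    static int[] withStream(int next, int positionCount) {
        return IntStream.range(next, positionCount).toArray();
    }

    /** The same positions built with Arrays.setAll; the length must be positionCount - next. */
    static int[] withSetAll(int next, int positionCount) {
        int[] filter = new int[positionCount - next];
        Arrays.setAll(filter, i -> next + i);
        return filter;
    }
}
```

One behavioural difference to keep in mind: if `next` were ever greater than `positionCount`, the stream version returns an empty array while the `setAll` version throws `NegativeArraySizeException`.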

try {
int b = 0;
while (b < leftHand.getBlockCount()) {
blocks[b] = leftHand.getBlock(b).filter(filter);
Member

Nit:
To avoid allocating the array, a new method for consecutive positions could be added to Block, such as Block#filter(int start, int stop) or, depending on how well it gets JITed, a Range class with Block#filter(range).
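
A rough sketch of what such a range-based filter could look like, using a toy int column rather than the real `Block` interface. Everything here (`IntColumn`, `filterRange`, `toArray`) is hypothetical and only meant to show that the range form gives the same result without building a positions array first.

```java
import java.util.Arrays;
import java.util.Objects;

// Toy stand-in for a block of ints; not the Elasticsearch Block API.
final class IntColumn {
    private final int[] values;

    IntColumn(int... values) {
        this.values = values.clone();
    }

    /** Position-array filter: the caller has to allocate and fill the positions array. */
    IntColumn filter(int[] positions) {
        int[] out = new int[positions.length];
        for (int i = 0; i < positions.length; i++) {
            out[i] = values[positions[i]];
        }
        return new IntColumn(out);
    }

    /** Hypothetical range filter: keeps positions start (inclusive) to stop (exclusive). */
    IntColumn filterRange(int start, int stop) {
        // Reject out-of-bounds ranges instead of letting copyOfRange pad with zeros.
        Objects.checkFromToIndex(start, stop, values.length);
        return new IntColumn(Arrays.copyOfRange(values, start, stop));
    }

    int[] toArray() {
        return values.clone();
    }
}
```

The range form still copies the selected values in this sketch; what it saves is the intermediate `int[]` of positions.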

Member Author

Yeah. There's a lot we could do here to be honest. The range version of filter might be quite nice.

I think those might be for a follow up.

Member

costin left a comment

LGTM

@nik9000
Member Author

nik9000 commented Dec 17, 2024

I've pushed a randomized test that seems to fail 0.7% of the time. Fun. I'll debug it when I'm back.

Labels
:Analytics/ES|QL (AKA ESQL), >non-issue, Team:Analytics (meta label for the analytical engine team: ESQL/Aggs/Geo), v8.18.0, v9.0.0
3 participants