Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ESQL: Initial support for unmapped fields #119886

Open
wants to merge 25 commits into
base: main
Choose a base branch
from

Conversation

GalLalouche
Copy link
Contributor

@GalLalouche GalLalouche commented Jan 9, 2025

This PR adds initial support for unmapped fields, using the INSIST clause. For starters, this only support a single unmapped field, without casting. Casting may appear after the INSIST clause, similar to union types.
Note that the INSIST keyword is potentially a placeholder, as the method of defining an unmapped field might change in the future, e.g., use a special magic function.

First stage of #120072.

Specifically, the following features are implemented in this PR:

  • Support for INSIST keyword and a single parameter without a cast. In particular, if the type being INSISTed upon is mapped to anything other than KEYWORD, it will result in an InvalidMappedField. There is not support for union type resolution on top of INSIST. Future PRs will handle these conflicts.
  • Enforcing that INSIST must always be on top of a FROM. While this may change in the future, e.g., in cases like FROM foo | EVAL x = 1 | INSIST bar, it will not be done in this PR, as it makes handling INSIST too complicated.

@GalLalouche GalLalouche force-pushed the feature/unmapped_fields_squashed branch 7 times, most recently from 5f04c1f to fd1abd0 Compare January 12, 2025 12:18
@GalLalouche GalLalouche added >feature Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) auto-backport Automatically create backport pull requests when merged :Analytics/ES|QL AKA ESQL v8.18.0 labels Jan 12, 2025
@elasticsearchmachine
Copy link
Collaborator

Hi @GalLalouche, I've created a changelog YAML for you.

@GalLalouche GalLalouche marked this pull request as ready for review January 12, 2025 14:04
@GalLalouche GalLalouche requested a review from a team as a code owner January 12, 2025 14:04
@GalLalouche GalLalouche requested review from craigtaverner and removed request for a team January 12, 2025 14:04
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

@GalLalouche GalLalouche requested a review from nik9000 January 12, 2025 16:33
Copy link
Contributor

@alex-spies alex-spies left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heya! I only took a quick glance at this, but I noticed the following:

  • The new syntax and command are immediately released. Can we please hide this behind a SNAPSHOT-only? Otherwise it's gonna be made available in Serverless on the next rollout. Update: sorry, it's already hidden behind SNAPSHOT in the parser, so that's good.
  • I can see some unrelated parser files being changed (SQL, EQL, KQL and Painless). I guess these were accidentally changed due to regenerating the parsers? I'd like to exclude these from the PR, please.
  • If at all possible, it'd be great to include a csv test to showcase the functionality. Or a yaml test if csv tests are too inflexible. In case that'd blow the PR's scope because not all is in place to run the feature end-to-end, let's at least add a short javadoc to the Insist class to describe its purpose. Update: Sorry, you correctly added a csv-spec test which I missed.

@alex-spies alex-spies self-requested a review January 13, 2025 11:07
@GalLalouche GalLalouche force-pushed the feature/unmapped_fields_squashed branch from 4289393 to 0ca5e79 Compare January 13, 2025 11:54
Copy link
Contributor

@astefan astefan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some quick comments, without going deep with the code:

  • disabling the _source should be covered so that whatever we do know should still happen after this PR
  • excluding certain fields from _source behavior should also be kept as is. If we already have tests for this somewhere in yaml, csv or other IT tests, that's great and apologies for mentioning it.
  • there is also the scenario where index: false and doc_values: false can be configured for a certain field. In this case, I guess this PR should address loading the value from _source as well (if _source is available, of course, considering my previous two use cases)

Copy link
Contributor

@astefan astefan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is there an error presented when using insist with an existent field? For example, FROM employees | insist languages | stats count(*) by languages

languages is a numeric field and this field exists only in this index, so insist here shouldn't do anything imo. This is not a casting operation to keyword, as the error message seems to imply. I would have expected a warning message at most saying something like "there are no unmapped fields called [languages]".

fieldIsMappedToNonKeywordWithCastSingleIndex
required_capability: unmapped_fields
FROM partial_mapping_sample_data
| INSIST client_ip
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The more I look at how this command is used (and I understand the PR is doing the bare minimum) the more I see the need of an explicit mapping ie insist client_ip::keyword (keyword is the default right now).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would really prefer to have this in a future PR, and desugar INSIST foo :: x into INSIST foo | EVAL foo = foo :: x.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From internal discussions regarding this functionality I understood that we want to start simple with FROM logs | INSIST field::LONG (the explicit casting is there) and that this command has a different equivalent than the one you specified there: FROM logs | EVAL field = CASE(IS_MAPPED("field"), field, EXTRACT_FROM_SOURCE("field"))::LONG. Meaning, if the field already exists and there is no conflict, we don't force the extraction.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my comment here regarding check if field exists or not: #119886 (comment)

@GalLalouche
Copy link
Contributor Author

GalLalouche commented Jan 17, 2025

This is not a casting operation to keyword, as the error message seems to imply. I would have expected a warning message at most saying something like "there are no unmapped fields called [languages]".

Unfortunately, we don't currently keep any information on which index has the required mappings, and which don't (we only keep it in case of mapping conflicts). I would prefer to tackle the exact behavior of this edge case in another PR, as this would require a pretty large change of our index resolution.

fieldDoesNotExistSingleIndex
required_capability: unmapped_fields
FROM partial_mapping_sample_data
| INSIST foo
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having a query like FROM partial_mapping_sample_data | eval x = 1 | INSIST client_ip and getting an error message is kind of unexpected:

  • the error message doesn't simply say what's wrong, ie insist must come after from somehow: "line 1:49: INSIST command can only be applied on top of a source". Maybe you can improve this with .... on top of a source command (from, row etc).
    • But does this work with row? I tested row a = 1 | INSIST client_ip and I get class org.elasticsearch.xpack.esql.plan.logical.Row cannot be cast to class org.elasticsearch.xpack.esql.plan.logical.EsRelation
  • that eval x = 1 has nothing to do with insist and also doesn't change the query in any way

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Sure, I'll change the message.
  • I'll fix the bug with ROW.
  • You're right that the EVAL x = 1 has nothing to do with INSIST, but it makes the code much easier if I don't have to consider all the various places INSIST can appear in :) As this syntax is a work in progress, I'd rather keep this requirement as is for now, and revisit it later if needed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On further thought, should we really allow for ROW? It's almost certainly a mistake, as the INSIST command is a noop in that case. I think it would make more sense to forbid erroneous code than to silently ignore it, or even warn about it. WDYT?

@@ -2358,4 +2358,8 @@ public void testFailingMetadataWithSquareBrackets() {
"line 1:11: mismatched input '[' expecting {<EOF>, '|', ',', 'metadata'}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be good for this class to have more tests for what insist supports and what not. In the future when the restrictions on this command will ease, it would be easier to check them. For example insist foo, foo or insist foo*.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a test for foo* and a (temporary) test for foo, bar. As for foo, foo, I'm thinking we should allow it, since we also allow KEEP foo, foo (Although I have no idea why...). WDYT?

import java.io.IOException;

// To make things simpler, we're only dealing with a single INSIST parameter for now.
public record InsistParameters(String identifier) implements NamedWriteable {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see the need for a separate class just to hold a list of field names (assuming that this will be in the future a list).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.

@@ -204,4 +204,8 @@ public static void writeIndexMode(StreamOutput out, IndexMode indexMode) throws
throw new IllegalStateException("not ready to support index mode [" + indexMode + "]");
}
}

public EsRelation withAttributes(List<Attribute> concat) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

concat? :-)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what happens when you ask Intellij to create a method 😅.

@elasticsearchmachine
Copy link
Collaborator

Hi @GalLalouche, I've updated the changelog YAML for you.

Copy link
Member

@costin costin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've left a number of comments - some low level some higher.
The main takeaways are:

  • rename the syntax to FIELD_🐔
  • use attribute (Unresolved or if needed introduce another one) and the Analyzer to perform the resolution instead of using strings in the command

Since this is a big PR, let's aim to get it to a good enough state and descope as much as possible into potential follow-ups.

P.S. Overall, try to keep the style of the existing code base; help with readability. Most of it its arbitrary and consistency is valuable - use alphabetical order where need, chronological otherwise.

Comment on lines 39 to 84
public sealed interface State {}

/**
* A field which is either unmapped in all indices, or mapped to {@link DataType#KEYWORD} in all indices where it is mapped.
* In either case, there is no type conflict.
*/
public enum KeywordResolved implements State {
INSTANCE;
}

/**
* A field which is mapped to a single type, which is not {@link DataType#KEYWORD}. This can be resolved using a cast, similar to
* union types. This is treated differently from {@link Invalid}, since resolving this conflict doesn't use
* {@link MultiTypeEsField}.
*/
public record SimpleConflict(DataType otherType) implements State {
public SimpleConflict {
if (otherType == KEYWORD) {
throw new IllegalArgumentException("Use NoConflicts.INSTANCE for KEYWORD");
}
}
}

/**
* A field which is mapped to a single type, but that type is not {@link DataType#KEYWORD}.
* This is resolved using the specified conversions.
*/
public record SimpleResolution(Expression unmappedConversion, Expression mappedConversion) implements State {}

/**
* A field which is mapped to more than one type in multiple indices, *in addition* to the unmapped case which is always treated as
* {@link DataType#KEYWORD}. This can be resolved using a cast, similar to union types.
*/
public record Invalid(InvalidMappedField invalidMappedField) implements State {}

/**
* A field which is mapped to different types in different indices, but resolved using union types. In mapped indices, we treat this
* as a union type, and use the specified conversion for unmapped indices.
*/
public record MultiType(Expression conversionFromKeyword, MultiTypeEsField multiTypeEsField) implements State {}

private final State state;

public State getState() {
return state;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been struggling with this section.
Looks like PotentiallyUnmappedEsField tries to do multiple things at once:

  1. be a container that holds the underlying data type
  2. indicate if there's a conflict
  3. specify a conversion expression
  4. be an invalid mapped field

Recommend breaking the behavior in bits and using the existing field classes instead based on the Analyzer resolution:
a. have the UnmappedEsField handle only extraction from source, potentially by supporting a conversion on it as well
b. in case of invalid mapping, use InvalidMappedField.
c. if the solution is multiType, use the MultiTpeEsField instead.

As a side note, instead of using the expression inside the field type, use the actual type you want (I believe EvaluatorMapper) or add an lambda interface to it that wraps the target action.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As the PR has been descoped, these are no longer necessary in this PR.

@@ -83,4 +85,13 @@ public static <T> List<T> nullSafeList(T... entries) {
}
return list;
}

public static <T> OptionalInt findIndex(List<T> list, Predicate<T> predicate) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current style of ESQL limits the use of functional constructs and Optional and instead uses more of a C style approach and use an out of band number for primitives (-1) or null for objects.

for (int i = 0, s = list.size(); i < s; i++) {
      if (predicate.test(list,.get(i)) {
           return i;
      }
}
return -1;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I get that this is the current style, I think the use of Optional is a lot safer and clearer. If this isn't a very strict requirement, I would much rather keep this as is. No reason to throw good code after bad :)

Comment on lines 292 to 296
if (input instanceof EsRelation || input instanceof UnresolvedRelation) {
return new Insist(source, visitIdentifier(ctx.identifier()), input);
}
throw new ParsingException(source, "INSIST command can only be applied on top of a FROM command.");
};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is there's another insist command in front?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean, INSIST on top of another INSIST? Then this would fail. While we could support this kind of superfluous syntax, I don't think it should be done in this PR.

public class Insist extends UnaryPlan {
public static final NamedWriteableRegistry.Entry ENTRY = new NamedWriteableRegistry.Entry(LogicalPlan.class, "INSIST", Insist::new);

private final String insistIdentifier;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Identifiers are rarely used since are fragile (renames, name spaces, etc). Use Attributes instead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment on lines 46 to 53
private List<Attribute> computeOutput() {
var result = new ArrayList<>(child().output());
OptionalInt index = CollectionUtils.findIndex(child().output(), c -> c.name().equals(insistIdentifier));
index.ifPresentOrElse(i -> {
var field = ((FieldAttribute) child().output().get(i)).field();
result.set(i, new FieldAttribute(source(), insistIdentifier, PotentiallyUnmappedEsField.fromField(field)));
}, () -> result.add(new FieldAttribute(source(), insistIdentifier, PotentiallyUnmappedEsField.fromStandalone(insistIdentifier))));
return result;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of doing resolution here, use an UnresolvedAttribute (or extension of it) which the Analyzer resolved to a proper field, similar to UnresolvedAttribute.
The command thus starts with a bunch of attributes which the analyzer tries to resolves and either succeeds (MultiType, etc..) or fails (Unresolved with a message).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@GalLalouche GalLalouche force-pushed the feature/unmapped_fields_squashed branch from 47f568d to 9bfb1e7 Compare January 26, 2025 13:37
@GalLalouche
Copy link
Contributor Author

I've descoped a lot of the features in the PR, specifically, the resolution of union types on top of an INSIST. If there is any type conflict, i.e., if a field is mapped to anything other than KEYWORD in any index, it is no longer resolvable from within the language in this PR.

In addition, the resolved index now maintains a set of all partially mapped fields. This does not change metadata queried via field capabilities, and does not live past the analyzer, so it doesn't affect serialization.

@GalLalouche GalLalouche force-pushed the feature/unmapped_fields_squashed branch from 9bfb1e7 to a8d9372 Compare January 26, 2025 14:07
@GalLalouche GalLalouche requested a review from costin January 26, 2025 14:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/ES|QL AKA ESQL auto-backport Automatically create backport pull requests when merged >feature Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v8.18.0 v9.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants