-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ESQL: Initial support for unmapped fields #119886
base: main
Are you sure you want to change the base?
ESQL: Initial support for unmapped fields #119886
Conversation
5f04c1f
to
fd1abd0
Compare
Hi @GalLalouche, I've created a changelog YAML for you. |
Pinging @elastic/es-analytical-engine (Team:Analytics) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Heya! I only took a quick glance at this, but I noticed the following:
The new syntax and command are immediately released. Can we please hide this behind a SNAPSHOT-only? Otherwise it's gonna be made available in Serverless on the next rollout.Update: sorry, it's already hidden behind SNAPSHOT in the parser, so that's good.- I can see some unrelated parser files being changed (SQL, EQL, KQL and Painless). I guess these were accidentally changed due to regenerating the parsers? I'd like to exclude these from the PR, please.
If at all possible, it'd be great to include a csv test to showcase the functionality. Or a yaml test if csv tests are too inflexible. In case that'd blow the PR's scope because not all is in place to run the feature end-to-end, let's at least add a short javadoc to theUpdate: Sorry, you correctly added a csv-spec test which I missed.Insist
class to describe its purpose.
x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/action/EsqlCapabilities.java
Outdated
Show resolved
Hide resolved
...k/plugin/esql-core/src/main/java/org/elasticsearch/xpack/esql/core/util/CollectionUtils.java
Outdated
Show resolved
Hide resolved
4289393
to
0ca5e79
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some quick comments, without going deep with the code:
- disabling the _source should be covered so that whatever we do know should still happen after this PR
- excluding certain fields from _source behavior should also be kept as is. If we already have tests for this somewhere in yaml, csv or other IT tests, that's great and apologies for mentioning it.
- there is also the scenario where
index: false
anddoc_values: false
can be configured for a certain field. In this case, I guess this PR should address loading the value from_source
as well (if _source is available, of course, considering my previous two use cases)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is there an error presented when using insist
with an existent field? For example, FROM employees | insist languages | stats count(*) by languages
languages
is a numeric field and this field exists only in this index, so insist
here shouldn't do anything imo. This is not a casting operation to keyword
, as the error message seems to imply. I would have expected a warning
message at most saying something like "there are no unmapped fields called [languages]".
x-pack/plugin/esql/qa/testFixtures/src/main/resources/unmapped_fields.csv-spec
Outdated
Show resolved
Hide resolved
fieldIsMappedToNonKeywordWithCastSingleIndex | ||
required_capability: unmapped_fields | ||
FROM partial_mapping_sample_data | ||
| INSIST client_ip |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The more I look at how this command is used (and I understand the PR is doing the bare minimum) the more I see the need of an explicit mapping ie insist client_ip::keyword
(keyword
is the default right now).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would really prefer to have this in a future PR, and desugar INSIST foo :: x
into INSIST foo | EVAL foo = foo :: x
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From internal discussions regarding this functionality I understood that we want to start simple with FROM logs | INSIST field::LONG
(the explicit casting is there) and that this command has a different equivalent than the one you specified there: FROM logs | EVAL field = CASE(IS_MAPPED("field"), field, EXTRACT_FROM_SOURCE("field"))::LONG
. Meaning, if the field already exists and there is no conflict, we don't force the extraction.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See my comment here regarding check if field exists or not: #119886 (comment)
Unfortunately, we don't currently keep any information on which index has the required mappings, and which don't (we only keep it in case of mapping conflicts). I would prefer to tackle the exact behavior of this edge case in another PR, as this would require a pretty large change of our index resolution. |
fieldDoesNotExistSingleIndex | ||
required_capability: unmapped_fields | ||
FROM partial_mapping_sample_data | ||
| INSIST foo |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having a query like FROM partial_mapping_sample_data | eval x = 1 | INSIST client_ip
and getting an error message is kind of unexpected:
- the error message doesn't simply say what's wrong, ie
insist
must come afterfrom
somehow:"line 1:49: INSIST command can only be applied on top of a source"
. Maybe you can improve this with.... on top of a source command (from, row etc)
.- But does this work with
row
? I testedrow a = 1 | INSIST client_ip
and I getclass org.elasticsearch.xpack.esql.plan.logical.Row cannot be cast to class org.elasticsearch.xpack.esql.plan.logical.EsRelation
- But does this work with
- that
eval x = 1
has nothing to do withinsist
and also doesn't change the query in any way
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Sure, I'll change the message.
- I'll fix the bug with
ROW
. - You're right that the
EVAL x = 1
has nothing to do withINSIST
, but it makes the code much easier if I don't have to consider all the various placesINSIST
can appear in :) As this syntax is a work in progress, I'd rather keep this requirement as is for now, and revisit it later if needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On further thought, should we really allow for ROW
? It's almost certainly a mistake, as the INSIST
command is a noop in that case. I think it would make more sense to forbid erroneous code than to silently ignore it, or even warn about it. WDYT?
@@ -2358,4 +2358,8 @@ public void testFailingMetadataWithSquareBrackets() { | |||
"line 1:11: mismatched input '[' expecting {<EOF>, '|', ',', 'metadata'}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be good for this class to have more tests for what insist
supports and what not. In the future when the restrictions on this command will ease, it would be easier to check them. For example insist foo, foo
or insist foo*
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added a test for foo*
and a (temporary) test for foo, bar
. As for foo, foo
, I'm thinking we should allow it, since we also allow KEEP foo, foo
(Although I have no idea why...). WDYT?
import java.io.IOException; | ||
|
||
// To make things simpler, we're only dealing with a single INSIST parameter for now. | ||
public record InsistParameters(String identifier) implements NamedWriteable { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see the need for a separate class just to hold a list of field names (assuming that this will be in the future a list).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed.
@@ -204,4 +204,8 @@ public static void writeIndexMode(StreamOutput out, IndexMode indexMode) throws | |||
throw new IllegalStateException("not ready to support index mode [" + indexMode + "]"); | |||
} | |||
} | |||
|
|||
public EsRelation withAttributes(List<Attribute> concat) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
concat
? :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is what happens when you ask Intellij to create a method 😅.
Hi @GalLalouche, I've updated the changelog YAML for you. |
…uche/elasticsearch into feature/unmapped_fields_squashed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've left a number of comments - some low level some higher.
The main takeaways are:
- rename the syntax to FIELD_🐔
- use attribute (Unresolved or if needed introduce another one) and the Analyzer to perform the resolution instead of using strings in the command
Since this is a big PR, let's aim to get it to a good enough state and descope as much as possible into potential follow-ups.
P.S. Overall, try to keep the style of the existing code base; help with readability. Most of it its arbitrary and consistency is valuable - use alphabetical order where need, chronological otherwise.
server/src/main/java/org/elasticsearch/index/mapper/BlockSourceReader.java
Outdated
Show resolved
Hide resolved
x-pack/plugin/esql-core/src/main/java/org/elasticsearch/xpack/esql/core/type/EsField.java
Outdated
Show resolved
Hide resolved
public sealed interface State {} | ||
|
||
/** | ||
* A field which is either unmapped in all indices, or mapped to {@link DataType#KEYWORD} in all indices where it is mapped. | ||
* In either case, there is no type conflict. | ||
*/ | ||
public enum KeywordResolved implements State { | ||
INSTANCE; | ||
} | ||
|
||
/** | ||
* A field which is mapped to a single type, which is not {@link DataType#KEYWORD}. This can be resolved using a cast, similar to | ||
* union types. This is treated differently from {@link Invalid}, since resolving this conflict doesn't use | ||
* {@link MultiTypeEsField}. | ||
*/ | ||
public record SimpleConflict(DataType otherType) implements State { | ||
public SimpleConflict { | ||
if (otherType == KEYWORD) { | ||
throw new IllegalArgumentException("Use NoConflicts.INSTANCE for KEYWORD"); | ||
} | ||
} | ||
} | ||
|
||
/** | ||
* A field which is mapped to a single type, but that type is not {@link DataType#KEYWORD}. | ||
* This is resolved using the specified conversions. | ||
*/ | ||
public record SimpleResolution(Expression unmappedConversion, Expression mappedConversion) implements State {} | ||
|
||
/** | ||
* A field which is mapped to more than one type in multiple indices, *in addition* to the unmapped case which is always treated as | ||
* {@link DataType#KEYWORD}. This can be resolved using a cast, similar to union types. | ||
*/ | ||
public record Invalid(InvalidMappedField invalidMappedField) implements State {} | ||
|
||
/** | ||
* A field which is mapped to different types in different indices, but resolved using union types. In mapped indices, we treat this | ||
* as a union type, and use the specified conversion for unmapped indices. | ||
*/ | ||
public record MultiType(Expression conversionFromKeyword, MultiTypeEsField multiTypeEsField) implements State {} | ||
|
||
private final State state; | ||
|
||
public State getState() { | ||
return state; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've been struggling with this section.
Looks like PotentiallyUnmappedEsField tries to do multiple things at once:
- be a container that holds the underlying data type
- indicate if there's a conflict
- specify a conversion expression
- be an invalid mapped field
Recommend breaking the behavior in bits and using the existing field classes instead based on the Analyzer resolution:
a. have the UnmappedEsField handle only extraction from source, potentially by supporting a conversion on it as well
b. in case of invalid mapping, use InvalidMappedField.
c. if the solution is multiType, use the MultiTpeEsField instead.
As a side note, instead of using the expression inside the field type, use the actual type you want (I believe EvaluatorMapper) or add an lambda interface to it that wraps the target action.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As the PR has been descoped, these are no longer necessary in this PR.
@@ -83,4 +85,13 @@ public static <T> List<T> nullSafeList(T... entries) { | |||
} | |||
return list; | |||
} | |||
|
|||
public static <T> OptionalInt findIndex(List<T> list, Predicate<T> predicate) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current style of ESQL limits the use of functional constructs and Optional and instead uses more of a C style approach and use an out of band number for primitives (-1) or null for objects.
for (int i = 0, s = list.size(); i < s; i++) {
if (predicate.test(list,.get(i)) {
return i;
}
}
return -1;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While I get that this is the current style, I think the use of Optional
is a lot safer and clearer. If this isn't a very strict requirement, I would much rather keep this as is. No reason to throw good code after bad :)
x-pack/plugin/esql-core/src/test/java/org/elasticsearch/xpack/esql/core/util/TestUtils.java
Outdated
Show resolved
Hide resolved
if (input instanceof EsRelation || input instanceof UnresolvedRelation) { | ||
return new Insist(source, visitIdentifier(ctx.identifier()), input); | ||
} | ||
throw new ParsingException(source, "INSIST command can only be applied on top of a FROM command."); | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is there's another insist command in front?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mean, INSIST
on top of another INSIST
? Then this would fail. While we could support this kind of superfluous syntax, I don't think it should be done in this PR.
public class Insist extends UnaryPlan { | ||
public static final NamedWriteableRegistry.Entry ENTRY = new NamedWriteableRegistry.Entry(LogicalPlan.class, "INSIST", Insist::new); | ||
|
||
private final String insistIdentifier; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Identifiers are rarely used since are fragile (renames, name spaces, etc). Use Attributes instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
private List<Attribute> computeOutput() { | ||
var result = new ArrayList<>(child().output()); | ||
OptionalInt index = CollectionUtils.findIndex(child().output(), c -> c.name().equals(insistIdentifier)); | ||
index.ifPresentOrElse(i -> { | ||
var field = ((FieldAttribute) child().output().get(i)).field(); | ||
result.set(i, new FieldAttribute(source(), insistIdentifier, PotentiallyUnmappedEsField.fromField(field))); | ||
}, () -> result.add(new FieldAttribute(source(), insistIdentifier, PotentiallyUnmappedEsField.fromStandalone(insistIdentifier)))); | ||
return result; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of doing resolution here, use an UnresolvedAttribute
(or extension of it) which the Analyzer resolved to a proper field, similar to UnresolvedAttribute.
The command thus starts with a bunch of attributes which the analyzer tries to resolves and either succeeds (MultiType, etc..) or fails (Unresolved with a message).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/plan/logical/Insist.java
Outdated
Show resolved
Hide resolved
x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/plan/logical/Insist.java
Outdated
Show resolved
Hide resolved
47f568d
to
9bfb1e7
Compare
I've descoped a lot of the features in the PR, specifically, the resolution of union types on top of an In addition, the resolved index now maintains a set of all partially mapped fields. This does not change metadata queried via field capabilities, and does not live past the analyzer, so it doesn't affect serialization. |
9bfb1e7
to
a8d9372
Compare
This PR adds initial support for unmapped fields, using the
INSIST
clause. For starters, this only support a single unmapped field, without casting. Casting may appear after theINSIST
clause, similar to union types.Note that the
INSIST
keyword is potentially a placeholder, as the method of defining an unmapped field might change in the future, e.g., use a special magic function.First stage of #120072.
Specifically, the following features are implemented in this PR:
INSIST
keyword and a single parameter without a cast. In particular, if the type beingINSIST
ed upon is mapped to anything other thanKEYWORD
, it will result in anInvalidMappedField
. There is not support for union type resolution on top ofINSIST
. Future PRs will handle these conflicts.INSIST
must always be on top of aFROM
. While this may change in the future, e.g., in cases likeFROM foo | EVAL x = 1 | INSIST bar
, it will not be done in this PR, as it makes handlingINSIST
too complicated.