Fix ProcessorFactory.withDistinguishedRootNode and isError #1122

stevedlawrence · 2023-11-28T19:27:01Z

When calling withDistinguishedRootNode, we copy the ProcessorFactory replacing the root spec with the new node. But when we do this, also copy the existing SchemaSet. However, this SchemaSet was created with the previous root spec, which means withDistinguishedRootNode doesn't actually do anything.

Instead, the only workaround is to pass in the root name/namspace to the compileSource/File function, so that the root spec is available when the SchemaSet is created.

To fix this, this modifies withDistinguishedRootNode to not copy the SchemaSet, so a new SchemaSet will be created and initialized with the correct root spec. Each ProcessorFactory now has a unique SchemaSet instance with the correct root name and namespace.

This also discovered cases in the Compiler.compileSourceInternal and JAPI/SAPI compileFile functions where they called isError on the newly created ProcessorFactory before returning it to the caller. This meant that the SchemaSet inside the ProcessorFactory would be forced to be created, even though it might never be used, like in the case of withDistinguishedRootNode being called and a different SchemaSet being created.

Really the isError function should only ever be called by the user when they are ready for the object to be evaluated, and not forced by Daffodil internals. This ensures we avoid unnecessary until the user is ready, since isError is what eventually causes lazy vals to be evaluated. However, this does mean that if a user doesn't call isError they might get exceptions/assertions, but this is well documented in the JAPI/SAPI that this function must be called first before calling other functions.

This also means that some debug logging that relies on knowing that status of isError must be moved elsewhere, since the user might not have called isError yet.

This also discovered instances in the TDMLRunner and CLI where we queried diagnostics before checking isError--isError should always be done first.

DAFFODIL-2864, DAFFODIL-697

stevedlawrence · 2023-11-28T19:30:06Z

daffodil-core/src/main/scala/org/apache/daffodil/core/compiler/Compiler.scala

@@ -92,13 +89,10 @@ final class ProcessorFactory private (
      validateDFDLSchemas,
      checkAllTopLevel,
      tunables,
-      Some(sset),


This line is the core issue, where we copy the current incorrect schema set into the new ProcessorFactory.

All the other stuff is related to isErrors, which avoids work that may not be needed. The downside is that work won't happen until a user calls isError, which is potentially confusing. It also means were are more strict about isError being called before other functions. Maybe this is controversial and should be backed out?

I think requiring that a user call isError before calling some ProcessorFactory methods is potentially confusing. I would like to know more about the use case(s) where someone gets a ProcessorFactory but abstains from calling isError on purpose. What is the justification or motivation for such use case(s)? Is the justification that a user might want to call various withXXX methods on the ProcessorFactory to fine tune the ProcessorFactory's parameters and configuration before actually using the ProcessorFactory to get a Processor?

If that is the justification and the motivation for it is compelling enough (avoid doing a considerable amount of work more than once), then this PR's changes will let the ProcessorFactory initialize only part of itself and let the user finish the initialization by calling isError only after completing all of the withXXX calls. I would say that maybe we should find a better API design, but I know it's not easy to design an API that can be used by both Java and Scala programmers. I will not argue with this ProcessorFactory withXXX / isError API design.

I would like to know more about the use case(s) where someone gets a ProcessorFactory but abstains from calling isError on purpose. What is the justification or motivation for such use case(s)?

Users can do this if they are confident they schema will compile without any errors. The ProcessorFactory.onPath function asserts that isError is false, so if a user doesn't call isError, then onPath will, and all needed work will be done then. But if isError is true, onPath will throw an exception saying this function should not be called if isError is true.

Note that this is why some tests don't call isError. The test just assumes there isn't a compile error, and if there is then an assertion exception will be thrown and the test will fail. It would probably be better to assert that isError is false, but it's not technically required for the test to pass/fail correctly.

Note that another side effect of this change is that if a user calls ProcessorFactory.getDiagnostics prior to calling isError, then no work will have been done and so there will be no diagnostics. This was actually the case with the CLI and TDML runner, hence a couple of changes in this PR. We could avoid this by modifying getDiagnostics to call isError before returning the diags.

Is the justification that a user might want to call various withXXX methods on the ProcessorFactory to fine tune the ProcessorFactory's parameters and configuration before actually using the ProcessorFactory to get a Processor?

Yeah, that's why this change was made. With only the change to create a new SchemaSet when withDistinguishedRootNode is called, it means if we have something like this:

val pf = compiler.compileSource(uri).withDistinguishedRootNode(name, namespace) pf.isError

Then we create and evaluate two SchemaSets, one when compileSource is called (because it calls isError) and the other when pf.isError is called. Note that a user could do something like this:

val pf = compiler.compileSource(uri) .withDistinguishedRootNode(name1, namespace1) .withDistinguishedRootNode(name2, namespace2) pf.isError

And that would still only evaluated two SchemaSets, because withDistinguishedRootNode does not call isError.

You could kind of think of this API as a builder API. Where compileSource creates a builder, the withXYZ functions configure that builder, and isError is the build() method, e.g.

val pf = new ProcessorFactoryBuilder(uri) .withDistinguishedRootNode(name, namespace) .build()

It wouldn't make any sense to create a ProcessorFactory when the builder is initialized, and then a new one when .build() is called. That's essentially what we were doing before, only we weren't actually creating a new one which was the bug causing withDistinguishedRootNode to be ignored.

If that is the justification and the motivation for it is compelling enough (avoid doing a considerable amount of work more than once)

When you call pf.isError, it calls the isError function on the SchemaSet. And that function has this comment:

/** * Asking if there are errors drives compilation. First the schema is validated, then * the abstract syntax tree is constructed (which could be a source of errors) * and finally the AST objects are checked for errors, which recursively * demands that all other aspects of compilation occur. */

That's pretty much everything except asking for a runtime1 parser/unparser, which is what pf.onPath() does. So it is a lot of work that really should be avoided. We don't want to do all that twice, especially if the first is just going to be thrown away.

Another option is to drop withDistinguishedRootNode entirely, and require users to provide the root name/namespace to the compileSource function. This is the only withXYZ function on the ProcessorFactory. I do kindof like the builder pattern though, makes it easy to add new configurations in the future without having to modify the new constructor.

I would say that maybe we should find a better API design

I'm not against this. It's certainly not obvious to the user that .isError is what triggers the work. It's definitely convenient, but does side effect things that isn't obvious.

There are a number of changes we've been considering that might warrant a 4.0.0 release (e.g. drop Java 8, new layer API), so if we do want to change our API, this would be a good time to do it.

Note that another side effect of this change is that if a user calls ProcessorFactory.getDiagnostics prior to calling isError, then no work will have been done and so there will be no diagnostics. This was actually the case with the CLI and TDML runner, hence a couple of changes in this PR. We could avoid this by modifying getDiagnostics to call isError before returning the diags.

Yes, this is the only thing I would suggest changing. A programmer might think pf is already initialized and call pf.getDiagnostics before calling pf.onPath or pf.isError. I'd suggest that pf.getDiagnostics throw an assertion exception to ensure such a programming error gets fixed, instead of calling isError before returning the diagnostics or simply returning no diagnostics.

I'd suggest that pf.getDiagnostics throw an assertion exception to ensure such a programming error gets fixed, instead of calling isError before returning the diagnostics or simply returning no diagnostics.

Doesn't seem unreasonable to me. We would have to add a boolean var since we can't currently tell if a user calls isError or not, but that's not too big of a deal.

There is maybe an argument that if a user calls getDiagnostics then they aren't planning on calling any more withXYZ functions and we should just call isError for them so they get the right result. This would mean calling isError or getDiagnostics will trigger work.

Regardless, we should definitely do one of these. There were two instances in our own codebase where we did this wrong, so it's not crazy think there might be other places that do it and we don't want getDiagnostics to silently return nothing. I don't feel strongly about which we choose. Calling isError in getDiagnostics is maybe less likely to break things.

Interestingly, I discovered that DataProcessor.isError always returns false:

https://github.com/apache/daffodil/blob/main/daffodil-runtime1/src/main/scala/org/apache/daffodil/runtime1/processors/DataProcessor.scala#L274

This was surprising to me, but this codecov in the TDML runner confirms that we do never get a DataProcessor with isError set to true, at least in any of our TDML tests:

https://app.codecov.io/gh/apache/daffodil/blob/main/daffodil-tdml-processor%2Fsrc%2Fmain%2Fscala%2Forg%2Fapache%2Fdaffodil%2Fprocessor%2Ftdml%2FDaffodilTDMLDFDLProcessor.scala#L107

I think this goes to show how much work is done in ProcessorFactory.isError, so much so that if we successfully create a ProcessorFactory, there is no way for Daffodil to fail to create a DataProcessor. So I think it is pretty important to avoid that extra work if a user calls withDistinguishedRootNode.

This also means isError doesn't do any work in the DataProcessor. All the work to create a DataProcessor is done in ProcessorFactory.onPath. Not sure it matters, but I didn't realize that was the case. I thought isError did all the work in both classes.

There is maybe an argument that if a user calls getDiagnostics then they aren't planning on calling any more withXYZ functions and we should just call isError for them so they get the right result. This would mean calling isError or getDiagnostics will trigger work.

I find your argument persuasive. Let's trigger the work in both isError and getDiagnostics as you suggest.

Let's trigger the work in both isError and getDiagnostics as you suggest

Done, and added some tests to check that this works.

tuxji

+1

I have some caveats, but I am okay with being more strict about calling pf.isError.

tuxji · 2023-11-28T20:55:09Z

daffodil-core/src/main/scala/org/apache/daffodil/core/compiler/Compiler.scala

@@ -92,13 +89,10 @@ final class ProcessorFactory private (
      validateDFDLSchemas,
      checkAllTopLevel,
      tunables,
-      Some(sset),


I think requiring that a user call isError before calling some ProcessorFactory methods is potentially confusing. I would like to know more about the use case(s) where someone gets a ProcessorFactory but abstains from calling isError on purpose. What is the justification or motivation for such use case(s)? Is the justification that a user might want to call various withXXX methods on the ProcessorFactory to fine tune the ProcessorFactory's parameters and configuration before actually using the ProcessorFactory to get a Processor?

If that is the justification and the motivation for it is compelling enough (avoid doing a considerable amount of work more than once), then this PR's changes will let the ProcessorFactory initialize only part of itself and let the user finish the initialization by calling isError only after completing all of the withXXX calls. I would say that maybe we should find a better API design, but I know it's not easy to design an API that can be used by both Java and Scala programmers. I will not argue with this ProcessorFactory withXXX / isError API design.

tuxji · 2023-11-28T21:19:25Z

daffodil-japi/src/main/scala/org/apache/daffodil/japi/Daffodil.scala

@@ -141,7 +141,6 @@ class Compiler private[japi] (private var sCompiler: SCompiler) {
  ): ProcessorFactory = {

    val pf = sCompiler.compileFile(schemaFile, Option(rootName), Option(rootNamespace))
-    pf.isError
    new ProcessorFactory(pf)


Lines 143 - 144 look strange at first glance. We get a pf which is a ProcessorFactory and wrap another ProcessorFactory around it. I do realize that pf is actually a core Daffodil ProcessorFactory class and the ProcessorFactory on line 144 is actually a Java API ProcessorFactory class. Nevertheless, I wish we could avoid using these Java and Scala API wrapper classes if possible.

Yeah, I really want to get rid of this wrapper stuff. It's complicated to maintain and provides very little real benefit. Pretty much the only real difference is sometimes the Scala API returns a Seq but the Java API returns a List. Getting rid of this would be a good candidate for a 4.0.0 release as well.

I think the pattern for getting rid of this wrapper stuff is to have a java-usable abstract base/interface defined in a ".java" file, that has all the methods the Java API user would need to call including static factory methods. This ".java" file also has all the javadoc.

Then the scala implementation derives its actual classes from that base/interface, and those can all use scala types and such. Then the "API" for java users is just returning the object, upcasted to the java base/interface that shows only the java-usable methods.

This doesn't prevent the Java user from just downcasting and using something internal, but I think the whole wrapper approach, which does prevent the downcasting bypass, is just not worth the maintenance headache. Plus this is open source, so what are we really preventing anyway?

Agreed, that feels like the right approach.

tuxji · 2023-11-28T21:21:01Z

daffodil-japi/src/test/java/org/apache/daffodil/example/TestJavaAPI.java

+        java.io.File schemaFile = getResource("/test/japi/mySchema3.dfdl.xsd");
+        ProcessorFactory pf = c.compileFile(schemaFile);
+        pf = pf.withDistinguishedRootNode("e4", null);
+        DataProcessor dp = pf.onPath("/");


Aren't we supposed to call pf.isError before we call pf.onPath("/")?

I mentioned this in my big comment above, but isError it's not mandatory. The pf.onPath function calls isError. The only problem is if isError is true then onPath will thrown an assertion exception. But that's not a big deal in a test because a thrown exception will still cause the test to fail. It maybe isn't as elegant as an assertion, e.g. assertFalse(pf.isError), but ultimately tests things just the same.

tuxji · 2023-11-28T21:22:53Z

daffodil-sapi/src/test/scala/org/apache/daffodil/example/TestScalaAPI.scala

+    val pf = c
+      .compileFile(schemaFile)
+      .withDistinguishedRootNode("e4", null)
+    val dp1 = pf.onPath("/")


Aren't we supposed to call pf.isError before we call pf.onPath("/")?

mbeckerle

+1

mbeckerle · 2023-12-06T22:02:51Z

daffodil-japi/src/main/scala/org/apache/daffodil/japi/Daffodil.scala

@@ -141,7 +141,6 @@ class Compiler private[japi] (private var sCompiler: SCompiler) {
  ): ProcessorFactory = {

    val pf = sCompiler.compileFile(schemaFile, Option(rootName), Option(rootNamespace))
-    pf.isError
    new ProcessorFactory(pf)


I think the pattern for getting rid of this wrapper stuff is to have a java-usable abstract base/interface defined in a ".java" file, that has all the methods the Java API user would need to call including static factory methods. This ".java" file also has all the javadoc.

Then the scala implementation derives its actual classes from that base/interface, and those can all use scala types and such. Then the "API" for java users is just returning the object, upcasted to the java base/interface that shows only the java-usable methods.

This doesn't prevent the Java user from just downcasting and using something internal, but I think the whole wrapper approach, which does prevent the downcasting bypass, is just not worth the maintenance headache. Plus this is open source, so what are we really preventing anyway?

When calling withDistinguishedRootNode, we copy the ProcessorFactory and replace the root spec with the new node. When we do this, we also copy the existing SchemaSet. But this SchemaSet was created with the previous root spec, which means withDistinguishedRootNode doesn't actually do anything. Fortunately, there is a workaround which is to pass in the root name/namspace to the compileSource/File functions, so that the root spec is available when the SchemaSet is created, and avoid withDistinguishedRootNode entirely. But to fix this, this modifies withDistinguishedRootNode to not copy the SchemaSet, so a new SchemaSet is created and initialized with the correct root spec. Each ProcessorFactory now has a unique SchemaSet instance with the correct root name and namespace. This also discovered cases in the Compiler.compileSourceInternal and JAPI/SAPI compileFile functions where they called isError on the newly created ProcessorFactory before returning it to the caller. This meant that the SchemaSet inside the ProcessorFactory would be forced to be created, even though it might never be used, like in the case of withDistinguishedRootNode being called and a different SchemaSet being created. Really the isError function should only ever be called by the user when they are ready for the object to be evaluated, and not forced by Daffodil internals. This ensures we avoid unnecessary work until the user is ready, since isError is what eventually causes lazy vals to be evaluated. However, this does mean that if a user doesn't check isError they might get exceptions/assertions, but it is well documented in the JAPI/SAPI that isError should be checked before calling other functions. This also means that some debug logging that relies on knowing that status of isError must be moved elsewhere, since the user might not have called isError yet. This also discovered instances in the TDMLRunner and CLI where we queried diagnostics before checking isError--isError should always be done first, which is fixed. That said, this also modifies ProcessorFactory.getDiagnostics so that it calls isError before returning the diagnostics. This way if user calls getDiagnostics without first calling isError (like we used to do in the TDMLRunner and CLI), the compilation work will be done and correct diagnostics generated. Without this, there would be no diagnostics because no work would be done yet. Although technically a user should call isError before asking for diagnostics, there may be cases where users don't, so this allows that to keep working as expected. This is similar to how onPath calls isError even if a user doesn't. DAFFODIL-2864, DAFFODIL-697

stevedlawrence commented Nov 28, 2023

View reviewed changes

tuxji approved these changes Nov 28, 2023

View reviewed changes

tuxji requested review from mbeckerle and mike-mcgann December 6, 2023 20:49

mbeckerle approved these changes Dec 6, 2023

View reviewed changes

mike-mcgann approved these changes Dec 6, 2023

View reviewed changes

stevedlawrence force-pushed the daffodil-2864-with-disinguished-root branch from 3a9a221 to d9aef91 Compare December 7, 2023 14:16

stevedlawrence merged commit b897694 into apache:main Dec 7, 2023
9 of 10 checks passed

stevedlawrence deleted the daffodil-2864-with-disinguished-root branch December 7, 2023 14:39

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix ProcessorFactory.withDistinguishedRootNode and isError #1122

Fix ProcessorFactory.withDistinguishedRootNode and isError #1122

stevedlawrence commented Nov 28, 2023

stevedlawrence Nov 28, 2023

tuxji Nov 28, 2023

stevedlawrence Nov 29, 2023

tuxji Nov 29, 2023

stevedlawrence Nov 29, 2023

stevedlawrence Nov 29, 2023

tuxji Nov 29, 2023

stevedlawrence Nov 29, 2023

tuxji left a comment

tuxji Nov 28, 2023

tuxji Nov 28, 2023

stevedlawrence Nov 29, 2023

mbeckerle Dec 6, 2023

stevedlawrence Dec 7, 2023

tuxji Nov 28, 2023

stevedlawrence Nov 29, 2023

tuxji Nov 28, 2023

mbeckerle left a comment

mbeckerle Dec 6, 2023

Fix ProcessorFactory.withDistinguishedRootNode and isError #1122

Fix ProcessorFactory.withDistinguishedRootNode and isError #1122

Conversation

stevedlawrence commented Nov 28, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tuxji left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mbeckerle left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment