-
Notifications
You must be signed in to change notification settings - Fork 73
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix ProcessorFactory.withDistinguishedRootNode and isError #1122
Fix ProcessorFactory.withDistinguishedRootNode and isError #1122
Conversation
@@ -92,13 +89,10 @@ final class ProcessorFactory private ( | |||
validateDFDLSchemas, | |||
checkAllTopLevel, | |||
tunables, | |||
Some(sset), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line is the core issue, where we copy the current incorrect schema set into the new ProcessorFactory.
All the other stuff is related to isErrors, which avoids work that may not be needed. The downside is that work won't happen until a user calls isError, which is potentially confusing. It also means were are more strict about isError being called before other functions. Maybe this is controversial and should be backed out?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think requiring that a user call isError before calling some ProcessorFactory methods is potentially confusing. I would like to know more about the use case(s) where someone gets a ProcessorFactory but abstains from calling isError on purpose. What is the justification or motivation for such use case(s)? Is the justification that a user might want to call various withXXX methods on the ProcessorFactory to fine tune the ProcessorFactory's parameters and configuration before actually using the ProcessorFactory to get a Processor?
If that is the justification and the motivation for it is compelling enough (avoid doing a considerable amount of work more than once), then this PR's changes will let the ProcessorFactory initialize only part of itself and let the user finish the initialization by calling isError only after completing all of the withXXX calls. I would say that maybe we should find a better API design, but I know it's not easy to design an API that can be used by both Java and Scala programmers. I will not argue with this ProcessorFactory withXXX / isError API design.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would like to know more about the use case(s) where someone gets a ProcessorFactory but abstains from calling isError on purpose. What is the justification or motivation for such use case(s)?
Users can do this if they are confident they schema will compile without any errors. The ProcessorFactory.onPath
function asserts that isError
is false, so if a user doesn't call isError
, then onPath
will, and all needed work will be done then. But if isError
is true, onPath
will throw an exception saying this function should not be called if isError
is true.
Note that this is why some tests don't call isError
. The test just assumes there isn't a compile error, and if there is then an assertion exception will be thrown and the test will fail. It would probably be better to assert that isError
is false, but it's not technically required for the test to pass/fail correctly.
Note that another side effect of this change is that if a user calls ProcessorFactory.getDiagnostics
prior to calling isError
, then no work will have been done and so there will be no diagnostics. This was actually the case with the CLI and TDML runner, hence a couple of changes in this PR. We could avoid this by modifying getDiagnostics
to call isError
before returning the diags.
Is the justification that a user might want to call various withXXX methods on the ProcessorFactory to fine tune the ProcessorFactory's parameters and configuration before actually using the ProcessorFactory to get a Processor?
Yeah, that's why this change was made. With only the change to create a new SchemaSet
when withDistinguishedRootNode
is called, it means if we have something like this:
val pf = compiler.compileSource(uri).withDistinguishedRootNode(name, namespace)
pf.isError
Then we create and evaluate two SchemaSet
s, one when compileSource
is called (because it calls isError
) and the other when pf.isError
is called. Note that a user could do something like this:
val pf = compiler.compileSource(uri)
.withDistinguishedRootNode(name1, namespace1)
.withDistinguishedRootNode(name2, namespace2)
pf.isError
And that would still only evaluated two SchemaSets
, because withDistinguishedRootNode
does not call isError
.
You could kind of think of this API as a builder API. Where compileSource
creates a builder, the withXYZ
functions configure that builder, and isError
is the build()
method, e.g.
val pf = new ProcessorFactoryBuilder(uri)
.withDistinguishedRootNode(name, namespace)
.build()
It wouldn't make any sense to create a ProcessorFactory
when the builder is initialized, and then a new one when .build()
is called. That's essentially what we were doing before, only we weren't actually creating a new one which was the bug causing withDistinguishedRootNode
to be ignored.
If that is the justification and the motivation for it is compelling enough (avoid doing a considerable amount of work more than once)
When you call pf.isError
, it calls the isError
function on the SchemaSet
. And that function has this comment:
/**
* Asking if there are errors drives compilation. First the schema is validated, then
* the abstract syntax tree is constructed (which could be a source of errors)
* and finally the AST objects are checked for errors, which recursively
* demands that all other aspects of compilation occur.
*/
That's pretty much everything except asking for a runtime1 parser/unparser, which is what pf.onPath()
does. So it is a lot of work that really should be avoided. We don't want to do all that twice, especially if the first is just going to be thrown away.
Another option is to drop withDistinguishedRootNode
entirely, and require users to provide the root name/namespace to the compileSource
function. This is the only withXYZ
function on the ProcessorFactory
. I do kindof like the builder pattern though, makes it easy to add new configurations in the future without having to modify the new constructor.
I would say that maybe we should find a better API design
I'm not against this. It's certainly not obvious to the user that .isError is what triggers the work. It's definitely convenient, but does side effect things that isn't obvious.
There are a number of changes we've been considering that might warrant a 4.0.0 release (e.g. drop Java 8, new layer API), so if we do want to change our API, this would be a good time to do it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that another side effect of this change is that if a user calls ProcessorFactory.getDiagnostics prior to calling isError, then no work will have been done and so there will be no diagnostics. This was actually the case with the CLI and TDML runner, hence a couple of changes in this PR. We could avoid this by modifying getDiagnostics to call isError before returning the diags.
Yes, this is the only thing I would suggest changing. A programmer might think pf is already initialized and call pf.getDiagnostics before calling pf.onPath or pf.isError. I'd suggest that pf.getDiagnostics throw an assertion exception to ensure such a programming error gets fixed, instead of calling isError before returning the diagnostics or simply returning no diagnostics.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd suggest that pf.getDiagnostics throw an assertion exception to ensure such a programming error gets fixed, instead of calling isError before returning the diagnostics or simply returning no diagnostics.
Doesn't seem unreasonable to me. We would have to add a boolean var since we can't currently tell if a user calls isError or not, but that's not too big of a deal.
There is maybe an argument that if a user calls getDiagnostics
then they aren't planning on calling any more withXYZ
functions and we should just call isError
for them so they get the right result. This would mean calling isError
or getDiagnostics
will trigger work.
Regardless, we should definitely do one of these. There were two instances in our own codebase where we did this wrong, so it's not crazy think there might be other places that do it and we don't want getDiagnostics to silently return nothing. I don't feel strongly about which we choose. Calling isError
in getDiagnostics
is maybe less likely to break things.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interestingly, I discovered that DataProcessor.isError
always returns false:
This was surprising to me, but this codecov in the TDML runner confirms that we do never get a DataProcessor
with isError
set to true, at least in any of our TDML tests:
I think this goes to show how much work is done in ProcessorFactory.isError
, so much so that if we successfully create a ProcessorFactory
, there is no way for Daffodil to fail to create a DataProcessor
. So I think it is pretty important to avoid that extra work if a user calls withDistinguishedRootNode
.
This also means isError
doesn't do any work in the DataProcessor
. All the work to create a DataProcessor
is done in ProcessorFactory.onPath
. Not sure it matters, but I didn't realize that was the case. I thought isError
did all the work in both classes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is maybe an argument that if a user calls getDiagnostics then they aren't planning on calling any more withXYZ functions and we should just call isError for them so they get the right result. This would mean calling isError or getDiagnostics will trigger work.
I find your argument persuasive. Let's trigger the work in both isError and getDiagnostics as you suggest.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's trigger the work in both isError and getDiagnostics as you suggest
Done, and added some tests to check that this works.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
I have some caveats, but I am okay with being more strict about calling pf.isError.
@@ -92,13 +89,10 @@ final class ProcessorFactory private ( | |||
validateDFDLSchemas, | |||
checkAllTopLevel, | |||
tunables, | |||
Some(sset), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think requiring that a user call isError before calling some ProcessorFactory methods is potentially confusing. I would like to know more about the use case(s) where someone gets a ProcessorFactory but abstains from calling isError on purpose. What is the justification or motivation for such use case(s)? Is the justification that a user might want to call various withXXX methods on the ProcessorFactory to fine tune the ProcessorFactory's parameters and configuration before actually using the ProcessorFactory to get a Processor?
If that is the justification and the motivation for it is compelling enough (avoid doing a considerable amount of work more than once), then this PR's changes will let the ProcessorFactory initialize only part of itself and let the user finish the initialization by calling isError only after completing all of the withXXX calls. I would say that maybe we should find a better API design, but I know it's not easy to design an API that can be used by both Java and Scala programmers. I will not argue with this ProcessorFactory withXXX / isError API design.
@@ -141,7 +141,6 @@ class Compiler private[japi] (private var sCompiler: SCompiler) { | |||
): ProcessorFactory = { | |||
|
|||
val pf = sCompiler.compileFile(schemaFile, Option(rootName), Option(rootNamespace)) | |||
pf.isError | |||
new ProcessorFactory(pf) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lines 143 - 144 look strange at first glance. We get a pf which is a ProcessorFactory and wrap another ProcessorFactory around it. I do realize that pf is actually a core Daffodil ProcessorFactory class and the ProcessorFactory on line 144 is actually a Java API ProcessorFactory class. Nevertheless, I wish we could avoid using these Java and Scala API wrapper classes if possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I really want to get rid of this wrapper stuff. It's complicated to maintain and provides very little real benefit. Pretty much the only real difference is sometimes the Scala API returns a Seq
but the Java API returns a List
. Getting rid of this would be a good candidate for a 4.0.0 release as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the pattern for getting rid of this wrapper stuff is to have a java-usable abstract base/interface defined in a ".java" file, that has all the methods the Java API user would need to call including static factory methods. This ".java" file also has all the javadoc.
Then the scala implementation derives its actual classes from that base/interface, and those can all use scala types and such. Then the "API" for java users is just returning the object, upcasted to the java base/interface that shows only the java-usable methods.
This doesn't prevent the Java user from just downcasting and using something internal, but I think the whole wrapper approach, which does prevent the downcasting bypass, is just not worth the maintenance headache. Plus this is open source, so what are we really preventing anyway?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, that feels like the right approach.
java.io.File schemaFile = getResource("/test/japi/mySchema3.dfdl.xsd"); | ||
ProcessorFactory pf = c.compileFile(schemaFile); | ||
pf = pf.withDistinguishedRootNode("e4", null); | ||
DataProcessor dp = pf.onPath("/"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aren't we supposed to call pf.isError before we call pf.onPath("/")?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mentioned this in my big comment above, but isError it's not mandatory. The pf.onPath
function calls isError
. The only problem is if isError
is true then onPath
will thrown an assertion exception. But that's not a big deal in a test because a thrown exception will still cause the test to fail. It maybe isn't as elegant as an assertion, e.g. assertFalse(pf.isError)
, but ultimately tests things just the same.
val pf = c | ||
.compileFile(schemaFile) | ||
.withDistinguishedRootNode("e4", null) | ||
val dp1 = pf.onPath("/") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aren't we supposed to call pf.isError before we call pf.onPath("/")?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
@@ -141,7 +141,6 @@ class Compiler private[japi] (private var sCompiler: SCompiler) { | |||
): ProcessorFactory = { | |||
|
|||
val pf = sCompiler.compileFile(schemaFile, Option(rootName), Option(rootNamespace)) | |||
pf.isError | |||
new ProcessorFactory(pf) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the pattern for getting rid of this wrapper stuff is to have a java-usable abstract base/interface defined in a ".java" file, that has all the methods the Java API user would need to call including static factory methods. This ".java" file also has all the javadoc.
Then the scala implementation derives its actual classes from that base/interface, and those can all use scala types and such. Then the "API" for java users is just returning the object, upcasted to the java base/interface that shows only the java-usable methods.
This doesn't prevent the Java user from just downcasting and using something internal, but I think the whole wrapper approach, which does prevent the downcasting bypass, is just not worth the maintenance headache. Plus this is open source, so what are we really preventing anyway?
When calling withDistinguishedRootNode, we copy the ProcessorFactory and replace the root spec with the new node. When we do this, we also copy the existing SchemaSet. But this SchemaSet was created with the previous root spec, which means withDistinguishedRootNode doesn't actually do anything. Fortunately, there is a workaround which is to pass in the root name/namspace to the compileSource/File functions, so that the root spec is available when the SchemaSet is created, and avoid withDistinguishedRootNode entirely. But to fix this, this modifies withDistinguishedRootNode to not copy the SchemaSet, so a new SchemaSet is created and initialized with the correct root spec. Each ProcessorFactory now has a unique SchemaSet instance with the correct root name and namespace. This also discovered cases in the Compiler.compileSourceInternal and JAPI/SAPI compileFile functions where they called isError on the newly created ProcessorFactory before returning it to the caller. This meant that the SchemaSet inside the ProcessorFactory would be forced to be created, even though it might never be used, like in the case of withDistinguishedRootNode being called and a different SchemaSet being created. Really the isError function should only ever be called by the user when they are ready for the object to be evaluated, and not forced by Daffodil internals. This ensures we avoid unnecessary work until the user is ready, since isError is what eventually causes lazy vals to be evaluated. However, this does mean that if a user doesn't check isError they might get exceptions/assertions, but it is well documented in the JAPI/SAPI that isError should be checked before calling other functions. This also means that some debug logging that relies on knowing that status of isError must be moved elsewhere, since the user might not have called isError yet. This also discovered instances in the TDMLRunner and CLI where we queried diagnostics before checking isError--isError should always be done first, which is fixed. That said, this also modifies ProcessorFactory.getDiagnostics so that it calls isError before returning the diagnostics. This way if user calls getDiagnostics without first calling isError (like we used to do in the TDMLRunner and CLI), the compilation work will be done and correct diagnostics generated. Without this, there would be no diagnostics because no work would be done yet. Although technically a user should call isError before asking for diagnostics, there may be cases where users don't, so this allows that to keep working as expected. This is similar to how onPath calls isError even if a user doesn't. DAFFODIL-2864, DAFFODIL-697
3a9a221
to
d9aef91
Compare
When calling withDistinguishedRootNode, we copy the ProcessorFactory replacing the root spec with the new node. But when we do this, also copy the existing SchemaSet. However, this SchemaSet was created with the previous root spec, which means withDistinguishedRootNode doesn't actually do anything.
Instead, the only workaround is to pass in the root name/namspace to the compileSource/File function, so that the root spec is available when the SchemaSet is created.
To fix this, this modifies withDistinguishedRootNode to not copy the SchemaSet, so a new SchemaSet will be created and initialized with the correct root spec. Each ProcessorFactory now has a unique SchemaSet instance with the correct root name and namespace.
This also discovered cases in the Compiler.compileSourceInternal and JAPI/SAPI compileFile functions where they called isError on the newly created ProcessorFactory before returning it to the caller. This meant that the SchemaSet inside the ProcessorFactory would be forced to be created, even though it might never be used, like in the case of withDistinguishedRootNode being called and a different SchemaSet being created.
Really the isError function should only ever be called by the user when they are ready for the object to be evaluated, and not forced by Daffodil internals. This ensures we avoid unnecessary until the user is ready, since isError is what eventually causes lazy vals to be evaluated. However, this does mean that if a user doesn't call isError they might get exceptions/assertions, but this is well documented in the JAPI/SAPI that this function must be called first before calling other functions.
This also means that some debug logging that relies on knowing that status of isError must be moved elsewhere, since the user might not have called isError yet.
This also discovered instances in the TDMLRunner and CLI where we queried diagnostics before checking isError--isError should always be done first.
DAFFODIL-2864, DAFFODIL-697