Configuration Mutation Isolation #4617
Comments
I agree and I have proposed this for some time: #1281. I think that since the DML processing is handled by SessionContext itself, this would work with your plan. Something else worth looking at is the DataFrame API, which has both the state and a logical plan: https://github.com/apache/arrow-datafusion/blob/master/datafusion/core/src/dataframe.rs I really like the idea of "TaskContext is created once the query is planned and is not mutable, and does not have shared state -- and it can perhaps make a copy of whatever parts of SessionState is needed to run the query" |
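For illustration only, the "immutable TaskContext that copies what it needs" idea could look roughly like the sketch below. `SharedSession` and `TaskCtx` are hypothetical stand-ins written against the standard library, not DataFusion's actual types, and the string-keyed options map is an assumption made to keep the example small.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// Hypothetical stand-in for session state shared across queries
/// (NOT DataFusion's SessionState).
#[derive(Default)]
pub struct SharedSession {
    options: RwLock<HashMap<String, String>>,
}

impl SharedSession {
    /// Interior mutability: options can change through a shared reference.
    pub fn set(&self, key: &str, value: &str) {
        self.options
            .write()
            .unwrap()
            .insert(key.to_string(), value.to_string());
    }

    /// Copy the options into an immutable per-query context.
    pub fn task_context(&self) -> TaskCtx {
        TaskCtx {
            options: Arc::new(self.options.read().unwrap().clone()),
        }
    }
}

/// Hypothetical immutable per-query context: no locks and no state
/// shared with the session after it has been created.
#[derive(Clone)]
pub struct TaskCtx {
    options: Arc<HashMap<String, String>>,
}

impl TaskCtx {
    pub fn get(&self, key: &str) -> Option<&str> {
        self.options.get(key).map(String::as_str)
    }
}
```

Once `task_context()` has been called, later `set` calls on the session are invisible to that context, so an in-flight query keeps the configuration it was planned with.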
In the future, I'd like to add |
Is this related to prepared statement report #4539 🤔 |
To give a concrete example of the issue: the optimizer can observe one execution start time as snapshotted here - https://github.com/apache/arrow-datafusion/blob/master/datafusion/core/src/execution/context.rs#L1001 But this value can change before the state is snapshotted again as part of running the query - https://github.com/apache/arrow-datafusion/blob/master/datafusion/core/src/execution/context.rs#L1051 |
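A toy model of that double snapshot, assuming nothing about the real code beyond what is described above: `Props` and `SharedState` are hypothetical, and the start time is a plain counter rather than a timestamp so the behaviour is deterministic.

```rust
use std::sync::Mutex;

/// Hypothetical stand-in for ExecutionProps; the start time is a counter.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Props {
    pub query_start: u64,
}

/// Hypothetical shared state that any concurrent query may update.
pub struct SharedState {
    props: Mutex<Props>,
}

impl SharedState {
    pub fn new() -> Self {
        Self { props: Mutex::new(Props { query_start: 0 }) }
    }

    /// Record the start of a query, overwriting the previous value.
    pub fn start_query(&self, now: u64) {
        self.props.lock().unwrap().query_start = now;
    }

    /// Copy the current props out of the lock.
    pub fn snapshot(&self) -> Props {
        *self.props.lock().unwrap()
    }
}

/// Racy shape: the optimizer and the executor each take their own snapshot,
/// and another query starts in between.
pub fn plan_then_execute_racy(state: &SharedState, other_start: u64) -> (Props, Props) {
    let seen_by_optimizer = state.snapshot();
    state.start_query(other_start); // simulated concurrent query
    let seen_by_executor = state.snapshot();
    (seen_by_optimizer, seen_by_executor)
}

/// Snapshot once and hand the same value to both phases.
pub fn plan_then_execute_snapshotted(state: &SharedState, other_start: u64) -> (Props, Props) {
    let props = state.snapshot();
    state.start_query(other_start); // no longer affects this query
    (props, props)
}
```

In the racy variant the two phases disagree about the start time; in the snapshotted variant they cannot, because there is only one copy.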
* DataFrame owned SessionState (#4617)
* Fix deadlock
* Fix execution time
I think the |
I think the Binder he mentioned is a separate planning phase before logical plan optimization, which binds the SQL rel nodes (tables and columns) to the Catalog system. |
The changes LGTM except for #4633. |
I plan to make ExecutionProps a trait (#4629) implemented by DataFrame, with DataFrame becoming the "snapshotted state for planning and execution" |
We should make sure this design works for Ballista. I know it manages state a little differently (for example, it creates SessionContexts on remote executors). |
I think we have completed the initial work on this item -- while there is for sure more to do here to make session config easier, it is much better than when this ticket was originally written, so closing this one for now |
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Broadly speaking:
- SessionContext / SessionState - state used to plan a query
- ExecutionProps - state used to lower a logical expression to a physical expression
- TaskContext - state used to execute a query
We then have the following:
- RuntimeEnv - "global" configuration available at plan and query time
- SessionConfig - session configuration available at plan and query time
Of these, RuntimeEnv, SessionState and SessionConfig are interior mutable, that is, they can be modified without a mutable reference.
The result is that queries can and do modify the session and runtime configuration during execution. This is important to support things like CREATE TABLE, SET, etc. This is fine; however, the use of shared mutable state means that modifications will also impact in-flight queries. This feels at best surprising, and there is a fairly high probability of there being consistency bugs already resulting from this.
Describe the solution you'd like
I would ideally like to use Rust's borrow checker to handle this for us, as this would not only eliminate a non-trivial amount of locking complexity from the DataFusion codebase, but would also more clearly communicate what state can be altered when.
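As a rough sketch of what "let the borrow checker handle it" could mean, assuming a session with no interior mutability: the `Session` and `PlannedQuery` types below are hypothetical and only model the borrowing pattern, not DataFusion's API.

```rust
use std::collections::HashMap;

/// Hypothetical session with no interior mutability.
#[derive(Default)]
pub struct Session {
    tables: HashMap<String, Vec<i64>>,
}

/// A planned query borrows the session immutably for as long as it lives.
pub struct PlannedQuery<'a> {
    rows: &'a [i64],
}

impl Session {
    /// Statements that change session state need exclusive access.
    pub fn register_table(&mut self, name: &str, rows: Vec<i64>) {
        self.tables.insert(name.to_string(), rows);
    }

    /// Planning only needs shared access.
    pub fn plan_scan(&self, name: &str) -> Option<PlannedQuery<'_>> {
        self.tables
            .get(name)
            .map(|rows| PlannedQuery { rows: rows.as_slice() })
    }
}

impl PlannedQuery<'_> {
    /// "Execute" the plan; here that is just summing the scanned rows.
    pub fn sum(&self) -> i64 {
        self.rows.iter().sum()
    }
}
```

With this shape, calling `register_table` while a `PlannedQuery` is still in use is rejected at compile time (error E0502), so a mutation cannot reach an in-flight query and no locks are involved.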
This would require separating DDL from DML, with the latter requiring mutable access to the SessionContext. I'm inclined to think this is fine for a couple of reasons:
- … SessionContext still take &mut self - refactor: relax the signature of register_* in SessionContext #4612
- … SessionContext in parallel
- … SessionContext in parallel will need async state management regardless
It isn't a fully formed thought, but something that came out of #4607 is the need to be able to pre-parse a SQL statement. Perhaps we could provide some sort of SqlStatement wrapper containing a parsed SQL statement. This would facilitate delegation of specific handling of mutating queries to the downstream system, which is far better placed to determine the desired semantics.
Describe alternatives you've considered
Additional context
#4517 #3887 #4349 track improvements to DataFusion's configuration
#3777 tracks async catalog support which introduces another dimension to the out-of-band state modification
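To make the SqlStatement wrapper idea from the proposal slightly more concrete, here is a minimal sketch. Everything in it is hypothetical: a real wrapper would presumably hold a parsed AST, whereas this toy version only classifies a statement by its leading keyword, and the set of keywords treated as mutating is an assumption.

```rust
/// Hypothetical wrapper around a pre-parsed statement.
#[derive(Debug, PartialEq)]
pub enum SqlStatement {
    /// Read-only query.
    Query(String),
    /// Statement that would modify session state (e.g. CREATE TABLE, SET).
    Mutation(String),
}

impl SqlStatement {
    /// Toy "parser": classifies by leading keyword only.
    /// The keyword list is an assumption made for illustration.
    pub fn parse(sql: &str) -> Self {
        let trimmed = sql.trim();
        let first = trimmed
            .split_whitespace()
            .next()
            .unwrap_or("")
            .to_ascii_uppercase();
        match first.as_str() {
            "CREATE" | "DROP" | "SET" | "INSERT" => SqlStatement::Mutation(trimmed.to_string()),
            _ => SqlStatement::Query(trimmed.to_string()),
        }
    }

    pub fn is_mutation(&self) -> bool {
        matches!(self, SqlStatement::Mutation(_))
    }
}

/// Example downstream policy: a read-only service rejects mutations.
pub fn run_read_only(stmt: &SqlStatement) -> Result<&str, String> {
    match stmt {
        SqlStatement::Query(sql) => Ok(sql.as_str()),
        SqlStatement::Mutation(sql) => Err(format!("rejected mutating statement: {sql}")),
    }
}
```

The point is only that the caller sees the statement kind before anything runs, so a downstream system can choose its own semantics for mutating statements (reject, serialize, apply to a private session, and so on).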