Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Configuration Mutation Isolation #4617

Closed
tustvold opened this issue Dec 14, 2022 · 10 comments
Closed

Configuration Mutation Isolation #4617

tustvold opened this issue Dec 14, 2022 · 10 comments
Assignees
Labels
enhancement New feature or request

Comments

@tustvold
Copy link
Contributor

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Broadly speaking:

  • SessionContext / SessionState - state used to plan a query
  • ExecutionProps - state used to lower a logical expression to a physical expression
  • TaskContext - state used to execute a query

We then have the following

  • RuntimeEnv - "global" configuration available at plan and query time
  • SessionConfig - session configuration available at plan and query time

Of these RuntimeEnv, SessionState and SessionConfig are interior mutable, that is they can be modified without a mutable reference.

The result is that queries can and do modify the session and runtime configuration during execution. This is important to support things like CREATE TABLE, SET, etc... This is fine, however, the use of shared mutable state means that modifications will also impact in-flight queries. This feels at best surprising, and there is a fairly high probability of their being consistency bugs already resulting from this.

Describe the solution you'd like

I would ideally like to use Rust's borrow checker to handle this for us, as this would not only eliminate a non-trivial amount of locking complexity from the DataFusion codebase, but would also more clearly communicate what state can be altered when.

This would require separating DDL from DML, with the latter requiring mutable access to the SessionContext. I'm inclined to think this is fine for a couple of reasons:

It isn't a fully formed thought, but something that came out of #4607 is the need to be able to pre-parse a SQL statement. Perhaps we could provide some sort of SqlStatement wrapper containing a parsed SQL statement. This would facilitate delegation of specific handling of mutating queries to the downstream system, which is far better placed to determine the desired semantics.

Describe alternatives you've considered

Additional context

#4517 #3887 #4349 track improvements to DataFusion's configuration

#3777 tracks async catalog support which introduces another dimension to the out-of-band state modification

@tustvold tustvold added the enhancement New feature or request label Dec 14, 2022
@alamb
Copy link
Contributor

alamb commented Dec 14, 2022

This would require separating DDL from DML, with the latter requiring mutable access to the SessionContext. I'm inclined to think this is fine for a couple of reasons:

I agree and I have proposed this for some time: #1281. I think that since the DML processing is handled by SessionContext itself, this would work with your plan

Something else that would be worth looking at to figure out would be the DataFrame API which has both the state and a logical plan: https://github.com/apache/arrow-datafusion/blob/master/datafusion/core/src/dataframe.rs

I really like the idea of "TaskContext is created once the query is planned and is not mutable, and does not have shared state -- and it can perhaps make a copy of whatever parts of SessionState is needed to run the query"

@jackwener
Copy link
Member

jackwener commented Dec 15, 2022

It isn't a fully formed thought, but something that came out of #4607 is the need to be able to pre-parse a SQL statement. Perhaps we could provide some sort of SqlStatement wrapper containing a parsed SQL statement. This would facilitate delegation of specific handling of mutating queries to the downstream system, which is far better placed to determine the desired semantics.

In the future, I'd like to add Binder for datafusion. It may provide something similar with above.

tustvold added a commit to tustvold/arrow-datafusion that referenced this issue Dec 15, 2022
@alamb
Copy link
Contributor

alamb commented Dec 15, 2022

In the future, I'd like to add Binder for datafusion. It may provide something similar with above.

Is this related to prepared statement report #4539 🤔

@tustvold
Copy link
Contributor Author

To give a concrete example of the issue

The optimizer can observe one execution start time as snapshotted here - https://github.com/apache/arrow-datafusion/blob/master/datafusion/core/src/execution/context.rs#L1001

But this value can change before the state is snapshotted again as part of running the query - https://github.com/apache/arrow-datafusion/blob/master/datafusion/core/src/execution/context.rs#L1051

tustvold added a commit that referenced this issue Dec 17, 2022
* DataFrame owned SessionState (#4617)

* Fix deadlock

* Fix execution time
@mingmwang
Copy link
Contributor

This would require separating DDL from DML, with the latter requiring mutable access to the SessionContext. I'm inclined to think this is fine for a couple of reasons:

I agree and I have proposed this for some time: #1281. I think that since the DML processing is handled by SessionContext itself, this would work with your plan

Something else that would be worth looking at to figure out would be the DataFrame API which has both the state and a logical plan: https://github.com/apache/arrow-datafusion/blob/master/datafusion/core/src/dataframe.rs

I really like the idea of "TaskContext is created once the query is planned and is not mutable, and does not have shared state -- and it can perhaps make a copy of whatever parts of SessionState is needed to run the query"

I think the TaskContext is already immutable after the creation. The mutability is within the SessionState and SessionConfig. I agree that any changes to the SessionState and SessionConfig should not impact any in-flight queries, no matter they are in the planning phase or running phase. That was something that I do not considered carefully when I do the refactoring work. I will take a closer look at the new PRs.

@mingmwang
Copy link
Contributor

In the future, I'd like to add Binder for datafusion. It may provide something similar with above.

Is this related to prepared statement report #4539 🤔

I think the Binder he mentioned is a separated planning phase before the logical plan optimization, which bind the sql rel nodes(tables and columns) to the Catalog system.

@mingmwang
Copy link
Contributor

The changes LGTM except for #4633.
And I have a feeling that ExecutionProps should be moved out from the SessionState, so that most of the time, SessionState is immutable. ExecutionProps should be related to a specific query statement/DataFrame, it lives much shorter that a Session.

@tustvold
Copy link
Contributor Author

tustvold commented Dec 20, 2022

I plan to make ExecutionProps a trait (#4629) implemented by DataFrame, with DataFrame becoming the "snapshotted state for planning and execution"

@alamb
Copy link
Contributor

alamb commented Dec 20, 2022

We should make sure this design works for Ballista. I know it manages state a little differently (like it creates sessioncontext's on remote executors)

@alamb
Copy link
Contributor

alamb commented Dec 17, 2024

I think we have completed the initial work on this item -- while there is for sure more to do here to make session config easier, it is much better than when this ticket was originally written, so closing this one for now

@alamb alamb closed this as completed Dec 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

4 participants