Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Epic] Prepared Statement Support #4539

Closed
5 tasks done
alamb opened this issue Dec 7, 2022 · 13 comments
Closed
5 tasks done

[Epic] Prepared Statement Support #4539

alamb opened this issue Dec 7, 2022 · 13 comments
Labels
enhancement New feature or request

Comments

@alamb
Copy link
Contributor

alamb commented Dec 7, 2022

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Prepared Statements are widely used by SQL clients when issuing queries to a database. Major use cases include improved transaction processing latency as well as preventing SQL injection attacks (parameterized query arguments are often implemented as a feature of prepared statements).

Supporting prepared statements will increase the number of client applications that can work with DataFusion.

Task List

Background
Here is a schematic of how prepared statements work:

                                                 (               )
╔═══════════╗      SELECT *                      │`─────────────'│
║           ║      FROM foo                      │               │
║  Client   ║      WHERE id = $1                 │    Database   │
║           ║━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━▶│     Server    │
║           ║                                    │               │
╚═══════════╝                                    │.─────────────.│
                                                 (               )
 Step 1: Client send parameterized query to       `─────────────' 
 server to "prepare"                                              
                                                                  
                                                                  
                  HANDLE: 0x.....                                 
                  SCHEMA: (VARCHAR, INT, FLOAT)                   
                  PARAMS: {$1: INT}                               
           ◀━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━                   
 Step 2: Server prepares to run query and sends back opaque       
 "handle", result schema, and needed bind parameters to client    
                                                                  
                                                                  
                                                                  
                  HANDLE: 0x.....                                 
                  PARAMS: { $1 = 12345 }                          
           ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━▶                   
                                                                  
 Step 3: Client returns the handle and values of "bind"           
 parameters to the server                                         
                                                                                                                                    
                                                                  
                                                                  
            Results: [('Hi', 12345, 5423.13)]                     
           ◀━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━                   
                                                                  
Step 4: Server fills in bind parameters, and returns results      
as if the entire query had been supplied                          

Steps:

  1. The client sends two messages to the server. One to prepare the statement leaving placeholders called bind parameters.
  2. The server responds with a handle for the client to identify the prepared query, the result schema, and needed parameters.
  3. The client sends a second message with the handle and bind parameter values.
  4. The server fills in the parameter values, executes the query and returns the results. It is typically possible to execute the same prepared statement multiple times using different bind parameters with a single additional message each.

Some protocols (like the postgres FrontEnd - BackEnd , FEBE, protocol) allow optionally sending both messages in the same transmission to avoid a network round trip.

Describe the solution you'd like
We would like:

  1. Support for PREPARE statements.
  2. Support for EXECUTE statements PreparedStatements with bind parameters.

Both PREPARE and EXECUTE should offer a basic implementation in SessionContext and the ability to extend by other systems (similar to CREATE VIEW and CREATE TABLE)

cc @NGA-TRAN

@alamb alamb added the enhancement New feature or request label Dec 7, 2022
@mingmwang
Copy link
Contributor

There are other things that I'm not sure whether they should be included in this effort or not is the Plan Cache support.
When the client asks the DBServer to PREPARE the SQL statement, the server should generate some artifacts(a partial plan). Different binding variables might generate different partial plans(despite of the CBO rules which can generate alternative physical plans), in Datafusion, the TypeCoercion rule can generate different logical plans also.

@mingmwang
Copy link
Contributor

Oracle Adaptive Cursor Sharing:
https://oracle-base.com/articles/11g/adaptive-cursor-sharing-11gr1

@alamb
Copy link
Contributor Author

alamb commented Dec 7, 2022

There are other things that I'm not sure whether they should be included in this effort or not is the Plan Cache support.
When the client asks the DBServer to PREPARE the SQL statement, the server should generate some artifacts(a partial plan). Different binding variables might generate different partial plans(despite of the CBO rules which can generate alternative physical plans), in Datafusion, the TypeCoercion rule can generate different logical plans also.

That is an interesting idea -- in general I suggest that DataFusion include very basic support for prepared statements and allow for more advanced usecases such as adaptive replanning or partial plans, to systems that use DataFusion (e.g Ballista or IOx). This is similar to how CREATE TABLE and CREATE VIEW is handled -- DataFusion provides basic support for DDL in SessionContext but users of DataFusion can choose not to use the more advanced usecases

The rationale for this separation is that the value of the various optimizations on shared cursors / replanning is likely specific for the particular usecase. For example, in IOx we don't plan to have any actual server side state for prepared statements and will basically replan the query on EXECUTE.

@NGA-TRAN
Copy link
Contributor

NGA-TRAN commented Dec 8, 2022

@alamb Before the 2 tasks in your list (PREPARE and EXECUTE statements), I have created 2 more tasks #4549 and #4550. Maybe add them in your list?

@alamb alamb changed the title Prepared Statement Support [EPIC] [Epic] Prepared Statement Support Mar 5, 2023
@alamb
Copy link
Contributor Author

alamb commented Oct 10, 2023

Update here is that most of the backend work is complete (see #7785). However the SQL support is not hooked up

@lewiszlw
Copy link
Member

What's the best way to handle ? placeholder in datafusion?
For example, select * from t where a = ?, it could be converted to a logical plan in datafusion, but the plan can't be updated with param values.

@alamb
Copy link
Contributor Author

alamb commented Jul 12, 2024

What's the best way to handle ? placeholder in datafusion? For example, select * from t where a = ?, it could be converted to a logical plan in datafusion, but the plan can't be updated with param values.

I tried this in datafusion CLI and it looks like ? is created with a logical plan with a Placeholder with id ?

> select '1' where 4 = ?;
This feature is not implemented: Physical plan does not support logical expression Placeholder(Placeholder { id: "?", data_type: Some(Int64) })

So you can probably use https://docs.rs/datafusion/latest/datafusion/dataframe/struct.DataFrame.html#method.with_param_values and use a ParamValues::Map (key name "?"

@lewiszlw
Copy link
Member

lewiszlw commented Jul 15, 2024

I tried ParamValues::Map, it needs key like $<name>.
Maybe we shouldn't replace ? in a logical plan, for example select * from t where a = ? and b = ?, the order of ? will be disrupted when converted to a logical plan.

@alamb
Copy link
Contributor Author

alamb commented Jul 15, 2024

Maybe we shouldn't replace ? in a logical plan, for example select * from t where a = ? and b = ?, the order of ? will be disrupted when converted to a logical plan.

I am not sure how MySQL works. Is select * from t where a = ? and b = ? the same as select * from t where a = $1 and b = $2? If so maybe we could translate the ? into the order in which it appears in the query and then use positional ParamValues::List

@lewiszlw
Copy link
Member

This sounds like a good solution. Thanks.

@comphead
Copy link
Contributor

comphead commented Dec 9, 2024

@alamb is this Epic ok to be closed ?

@alamb
Copy link
Contributor Author

alamb commented Dec 9, 2024

Good call -- thanks @comphead

@alamb alamb closed this as completed Dec 9, 2024
@alamb
Copy link
Contributor Author

alamb commented Dec 9, 2024

(Also thanks to @davisp and @jonahgao for finishing up the work)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

5 participants