Single merge to perform update, delete and insert #602
Comments
I'm not sure I've got the full picture, but the first thing that comes to mind is to add a new column to your insertDf and deleteDf that marks the operation, and then use that column in your whenMatched clause.
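The suggestion above can be sketched roughly as follows. This is a minimal sketch rather than a tested recipe: the helper names `tag_changes` and `apply_changes`, the marker column `op`, and the key column `id` are all hypothetical, and `target` is assumed to be a `delta.tables.DeltaTable`.

```python
def tag_changes(upsert_df, delete_df):
    """Label each change row and stack both frames into one merge source."""
    # Imported lazily so this module loads even where pyspark is not installed.
    from pyspark.sql import functions as F

    upserts = upsert_df.withColumn("op", F.lit("upsert"))
    deletes = delete_df.withColumn("op", F.lit("delete"))
    # allowMissingColumns (Spark 3.1+) lets delete_df carry only the key column.
    return upserts.unionByName(deletes, allowMissingColumns=True)


def apply_changes(target, changes, key="id"):
    """Apply tagged changes to a DeltaTable with a single MERGE.

    `key` is a column name chosen in code (an assumption of this sketch),
    so no customer-supplied value ends up inside the condition strings.
    """
    (
        target.alias("t")
        .merge(changes.alias("s"), f"t.{key} = s.{key}")
        # Matched clauses are evaluated in order: delete first, then update.
        .whenMatchedDelete(condition="s.op = 'delete'")
        .whenMatchedUpdateAll(condition="s.op = 'upsert'")
        # Only upserts may create rows; a delete for a missing key is a no-op.
        .whenNotMatchedInsertAll(condition="s.op = 'upsert'")
        .execute()
    )
```

The conditions are fixed strings that refer only to column names, so their size does not grow with the change set. Two caveats, as far as I know: each key should appear at most once in the combined source, otherwise MERGE rejects the multiple matches, and the extra `op` column is ignored by updateAll/insertAll only while automatic schema evolution is off.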
Hi all, I need your expertise here. I have a similar problem.
Hi @himanshujindal - based on your description, I think @JassAbidi has shared a good solution (it is similar to how you would consume and apply CDF changes in Delta 2.0, for example). Can you please confirm whether that works for you, whether you solved it another way, or whether you're still blocked on this? Thanks!
Hi @bennetryan, I think your goal is covered by normal MERGE semantics with a single statement.
You can use the UPDATE clause of MERGE to overwrite the entire matching row instead of explicitly deleting it and inserting a new one.
Please let me know if this doesn't address your case.
Hi @nkarpov, thank you for your reply. The problem in my case is that the entire id column value may not match, which is why I'm using a substring. Also, the table may contain more than one row for a given substring of id, which is why, on a match of the partial id value, I want to first delete the matched rows and then insert the new rows.
I think replaceWhere can get you some mileage for an atomic delete + insert, but otherwise, today, MERGE will not allow you to match multiple source (snapDF) rows to replace the matching target rows. I think it's a good use case though. Would love to work together if you're open to contributing.
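For reference, a replaceWhere overwrite looks roughly like the sketch below. The helper name `replace_rows`, the table path, and the predicate are hypothetical. It assumes a Delta version that accepts predicates on arbitrary columns (older releases only allowed partition columns) and that every row in `new_rows` satisfies the predicate, which Delta checks by default.

```python
def replace_rows(new_rows, table_path, predicate):
    """Atomically replace the target rows matching `predicate` with `new_rows`.

    `new_rows` is a Spark DataFrame; rows in the table that match `predicate`
    are dropped and `new_rows` is written in the same commit.
    """
    (
        new_rows.write.format("delta")
        .mode("overwrite")
        .option("replaceWhere", predicate)
        .save(table_path)
    )


# Hypothetical usage: swap out every row whose id starts with a given prefix.
# replace_rows(snap_df, "s3://bucket/path/to/table", "id LIKE 'abc%'")
```

Note that the predicate is an SQL string, so if the prefix comes from untrusted data it needs validating before it is interpolated.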
Hi @nkarpov, I would be happy to work on this and contribute to it.
There are many approaches... The least intrusive would be to roll your own custom transaction using Going further from there could involve going as far as modifying the existing MERGE. This would be quite complex but not impossible. There's a lot to consider, for example, the existing checks for no duplicate matches in the source table, which would have to be removed in this case. If you'd like to start down that path, please create and share a design doc similar to https://docs.google.com/document/d/1Gs4ZsTH19lMxth4BSdwlWjUNR-XhKHicDvBjd2RqNd8/edit
Context: I am performing a merge command (https://docs.delta.io/latest/delta-update.html#upsert-into-a-table-using-merge&language-java) on data stored in S3 using Delta Lake, and my changes consist of updates, inserts and deletes. Currently, I am using two merge commands to apply those changes. Here is what my merge commands look like:
Problem: This results in the system doing two merges, which reduces the efficiency of my system. I am trying to figure out a way to apply all the changes in a single merge. However, the merge command only takes one source dataset and one condition. So unless I build a condition from the values in the data, applying inserts when a row's value is in my insert data frame and deletes when it is in my delete data frame, I end up having to write two different merges. Am I missing something? Is there a feature request here that would help simplify applying changes to Delta Lake?
Note that I want to avoid building queries from the data in the rows, as the data comes from customers and could be prone to SQL injection. Also, the condition string in that case could be extremely large, since the changes I am applying could be ~1-2 GB in size.