[BUG] Merge operation retriggers table lineage #3954

Open
2 of 8 tasks
FranArenas opened this issue Dec 11, 2024 · 2 comments
Labels
bug Something isn't working

Comments

FranArenas commented Dec 11, 2024

Bug

Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel
  • Other (fill in here)

Describe the problem

When a Delta merge operation is executed on a table that has already been read into a DataFrame, the lineage of that DataFrame is affected. As a result, a DataFrame inspected before and after the merge can return different values, even if it was cached.

Steps to reproduce

Pyspark code

from pyspark.sql.types import StructType, StructField, StringType
from delta.tables import DeltaTable

# `spark` is the SparkSession provided by the notebook environment.
print("Start")
table = "MYTABLE"  # placeholder table name
path = "MYPATH"    # placeholder; presumably the storage path of the table above

schema = StructType([StructField("id", StringType(), True)])

# Initial table contents
df = spark.createDataFrame(
    [("A",), ("B",), ("C",), ("D",)],
    schema
)

# Rows to delete from the table; "OTHER" has no match in the target
df_del = spark.createDataFrame(
    [("A",), ("B",), ("OTHER",)],
    schema
)
df.write.format("delta").mode("overwrite").saveAsTable(table)
df_read = spark.read.format("delta").load(path).cache()
df_read.show() # First read


delta_table = DeltaTable.forPath(df.sparkSession, path)

# Delete every target row whose key columns match a source row
delta_table.alias("target").merge(
    source=df_del.alias("source"),
    condition=" AND ".join([f"target.{pk} = source.{pk}" for pk in df_del.columns]),
).whenMatchedDelete().execute()

df_read.show() # Second read. It changed!
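
For readers unfamiliar with the join expression passed as `condition`, the snippet below is a small sketch of how that string is assembled. The helper name `build_merge_condition` and the second column `region` are hypothetical and only serve the illustration; the reproduction above has a single key column, `id`.

```python
def build_merge_condition(key_columns, target_alias="target", source_alias="source"):
    """Join one equality predicate per key column with AND."""
    return " AND ".join(
        f"{target_alias}.{col} = {source_alias}.{col}" for col in key_columns
    )


# Single key column, as in the reproduction above
print(build_merge_condition(["id"]))
# prints: target.id = source.id

# Hypothetical two-column key, to show how predicates are combined
print(build_merge_condition(["id", "region"]))
# prints: target.id = source.id AND target.region = source.region
```

With `df_del.columns == ["id"]`, the merge therefore runs with the condition `target.id = source.id`.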

Observed results

+---+
| id|
+---+
|  A|
|  B|
|  C|
|  D|
+---+

TABLE END
+---+
| id|
+---+
|  C|
|  D|
+---+
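
The second output matches the table contents after the merge, which suggests the cached DataFrame was recomputed against the new table state. The plain-Python sketch below models only the row-level effect of `whenMatchedDelete` on a single non-null string key; it does not use Spark, and the function name `merge_when_matched_delete` is hypothetical.

```python
def merge_when_matched_delete(target_ids, source_ids):
    """Keep only the target rows whose id has no match in the source."""
    source = set(source_ids)
    return [row for row in target_ids if row not in source]


before = ["A", "B", "C", "D"]
after = merge_when_matched_delete(before, ["A", "B", "OTHER"])
print(after)
# prints: ['C', 'D']
```

"OTHER" matches nothing in the target, so it has no effect; "A" and "B" are removed, leaving the two rows shown in the second read.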

Expected results

+---+
| id|
+---+
|  A|
|  B|
|  C|
|  D|
+---+

TABLE END
+---+
| id|
+---+
|  A|
|  B|
|  C|
|  D|
+---+
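
Until the behaviour is clarified, one possible way to get the expected output is to pin the read to a fixed table version with Delta time travel, so that any recomputation of the DataFrame still reads the same snapshot. This is a sketch of a workaround and not a confirmed fix; it has not been verified on Fabric. The helper names `latest_version` and `read_pinned` are hypothetical, and `spark` is assumed to be an active SparkSession with Delta enabled.

```python
def latest_version(spark, path):
    """Return the most recent commit version of the Delta table at `path`."""
    # DESCRIBE HISTORY lists commits in reverse chronological order.
    history = spark.sql(f"DESCRIBE HISTORY delta.`{path}` LIMIT 1")
    return history.select("version").first()[0]


def read_pinned(spark, path):
    """Read the table as of its current version (hypothetical helper)."""
    version = latest_version(spark, path)
    return (
        spark.read.format("delta")
        .option("versionAsOf", version)  # Delta time-travel read option
        .load(path)
    )
```

In the reproduction, this would replace the first read with `df_read = read_pinned(spark, path).cache()`. The pinned version must still be available (for example, not removed by `VACUUM`) if the DataFrame is ever recomputed.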

Further details

I am executing this code in Fabric.

Environment information

  • Delta Lake version: 3.2
  • Spark version: 3.5
  • Scala version: 2.12.17

Willingness to contribute

The Delta Lake Community encourages bug fix contributions. Would you or another member of your organization be willing to contribute a fix for this bug to the Delta Lake code base?

  • Yes. I can contribute a fix for this bug independently.
  • Yes. I would be willing to contribute a fix for this bug with guidance from the Delta Lake community.
  • No. I cannot contribute a bug fix at this time.
FranArenas added the bug label Dec 11, 2024
allisonport-db (Collaborator) commented

There is no Delta 3.4. Can you clarify which version you used?

FranArenas (Author) commented

> There is no Delta 3.4. Can you clarify which version you used?

Sorry for the confusion. The Delta version is 3.2. I will update my initial message.
