
Doc DeltaOptions as public API or make it a delta private class #598

Open
CodingCat opened this issue Feb 16, 2021 · 15 comments

Comments

@CodingCat

I was using Delta Lake and wanted to set OVERWRITE_SCHEMA_OPTION to true, with

df.write.option(DeltaOptions.OVERWRITE_SCHEMA_OPTION, "true").format("delta")

However, this application is broken in Databricks, since the internal version differs from the open source one, and the support engineer said that DeltaOptions is a private API.

This actually confuses users a lot: I saw a public class and straightforwardly referred to it, only to find that we cannot use it on the platform of the company that created Delta Lake.

I would suggest either documenting DeltaOptions as a public API and committing to backward compatibility, or making it a delta-private class so others will not fall into the same issue.
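For reference, a minimal sketch of the string-based workaround, assuming the documented option name "overwriteSchema" (so application code does not need to reference the DeltaOptions class at all; the table path is hypothetical):

    // Sketch: pass the option by its documented string name instead of the
    // DeltaOptions constant, so nothing private is needed on the classpath.
    df.write
      .format("delta")
      .option("overwriteSchema", "true")
      .mode("overwrite")            // overwriteSchema only takes effect on overwrite
      .save("/data/events")         // hypothetical table path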

@CodingCat
Author

I am also happy to file a PR as long as we have agreement on this.

@zsxwing
Member

zsxwing commented Feb 16, 2021

Only APIs showing up in the API doc are public. Totally agreed this is confusing. But we basically follow Spark.

@CodingCat
Author

> Only APIs showing up in the API doc are public. Totally agreed this is confusing. But we basically follow Spark.

Yeah, I mean the way to eliminate the confusion might be to either 1) make this a public API, or 2) make it a package-private class so people will not refer to it in application code (or at least not that easily, and when they do, they implicitly accept the risk).
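A rough sketch of what option 2) could look like, assuming DeltaOptions stays in the open source org.apache.spark.sql.delta package (constructor parameters elided; this is only to illustrate the visibility change):

    package org.apache.spark.sql.delta

    // Sketch only: package-private visibility, so application code outside
    // the delta package can no longer compile against DeltaOptions.
    private[delta] class DeltaOptions(/* existing constructor unchanged */)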

@zsxwing
Member

zsxwing commented Feb 16, 2021

I found it's pretty useful when debugging issues in a notebook environment. I'm inclined to leave it as it is and improve the documentation.

@CodingCat
Author

How about DeltaLog, which is used even more but is still not a public API?

@tdas
Contributor

tdas commented Feb 19, 2021 via email

@CodingCat
Author

Hi @tdas. There are several methods of DeltaLog I am interested in, e.g. getChanges, and even something as simple as checkpointInterval.
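For illustration, a minimal sketch of the kind of call meant here, using the open source DeltaLog entry points (the table path is hypothetical, and spark is assumed to be an active SparkSession):

    import org.apache.hadoop.fs.Path
    import org.apache.spark.sql.delta.DeltaLog

    // Open source Delta Lake: grab the transaction log for a table and
    // iterate over the actions committed since a given version.
    val deltaLog = DeltaLog.forTable(spark, new Path("/data/events"))  // hypothetical path
    val changes = deltaLog.getChanges(0L)  // iterator of (version, actions)
    changes.foreach { case (version, actions) =>
      println(s"version $version: ${actions.size} actions")
    }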

In general, our experience with Delta Lake in DBR is not that smooth, because there are many public classes we can see in our Delta Lake dependency (which has to be an open source version), and we can easily access all public classes under the org.apache.spark.sql package... however, they will not run in DBR because all of these classes are moved elsewhere (we have to use a very hacky way to make our code runnable in both our CI and DBR).

Beyond the original purpose of filing this issue, I would like to know if there is any plan to make DBR and the open source version more compatible, which I personally feel would be beneficial in terms of both user experience and community growth.

@TJZhou

TJZhou commented Jul 23, 2021

Faced the same problem while using DeltaOptions to create a DeltaSink in DBR. Apparently both of these classes are private; did you find any workaround to your problem, @CodingCat?

@zsxwing
Member

zsxwing commented Jul 28, 2021

@TJZhou Which APIs in DeltaOptions are you using? Is it possible to avoid using DeltaOptions in your project?

@TJZhou

TJZhou commented Jul 29, 2021

We created a DeltaOptions instance and passed it into DeltaSink, which looks like

new DeltaSink(spark.sqlContext, new Path(destination), partitionColumn, OutputMode.Append(), new DeltaOptions(xxx))

It's almost impossible to avoid it, and I also checked that the DeltaSink API is private in Databricks too.

@zsxwing
Member

zsxwing commented Jul 29, 2021

@TJZhou DeltaSink is not supposed to be used directly. We expect users to drive it only through the Spark APIs, such as df.writeStream.format("delta").start(...). Could you clarify what you are doing? Are you wrapping DeltaSink to do some different work?
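A minimal sketch of that supported path, assuming a streaming DataFrame and hypothetical checkpoint/table locations:

    // Public API for streaming writes to a Delta table; the "delta" format
    // constructs the DeltaSink internally.
    val query = df.writeStream
      .format("delta")
      .outputMode("append")
      .option("checkpointLocation", "/checkpoints/events")  // hypothetical path
      .start("/data/events")                                 // hypothetical table path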

@TJZhou

TJZhou commented Jul 29, 2021

Spark doesn't have good support for writing to multiple, dynamic output locations, so we had to interact with a lower level of the Delta Lake API when writing this. We customize the sink as follows.

    // Cache the micro-batch, split it by destination, and append each split
    // through a per-destination DeltaSink instance (reused across batches).
    val enrichedTypes = df.persist(StorageLevel.MEMORY_ONLY)
    val splitedDF: ParMap[String, DataFrame] = splitDF(enrichedTypes)
    splitedDF.foreach {
      case (destination: String, data: DataFrame) =>
        val sink = outputSinks.getOrElseUpdate(destination,
          new DeltaSink(spark.sqlContext, new Path(destination), partitionColumn, OutputMode.Append(), new DeltaOptions(xxx))
        )
        sink.addBatch(batchId, data)
    }
    enrichedTypes.unpersist()

@zsxwing
Member

zsxwing commented Jul 29, 2021

Hm, so you are trying to write to multiple delta tables in foreachBatch but still require exactly-once?
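For reference, a rough sketch of the foreachBatch shape being discussed here, reusing the splitDF helper from the snippet above and writing each split through the public batch writer. Note that on its own this gives at-least-once rather than exactly-once semantics, since a retried micro-batch may be appended twice:

    df.writeStream
      .foreachBatch { (batch: DataFrame, batchId: Long) =>
        // Split the micro-batch and append each part to its own Delta table
        // using the public DataFrame writer instead of a hand-built DeltaSink.
        splitDF(batch).foreach { case (destination, data) =>
          data.write.format("delta").mode("append").save(destination)
        }
      }
      .option("checkpointLocation", "/checkpoints/multi")  // hypothetical path
      .start()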

@zsxwing
Member

zsxwing commented Jul 29, 2021

@TJZhou

TJZhou commented Jul 29, 2021

Ah, those are some new features that we haven't tried before. Yep, I shall give it a try. Thanks @zsxwing
