
Doc DeltaOptions as public API or make it a delta private class #598

Open
CodingCat opened this issue Feb 16, 2021 · 15 comments

Comments

@CodingCat

I was using Delta Lake and wanted to set OVERWRITE_SCHEMA_OPTION to true, with

df.write.option(DeltaOptions.OVERWRITE_SCHEMA_OPTION, "true").format("delta")

However, this application is broken in Databricks, since the internal version differs from the open source one, and the support engineer said that DeltaOptions is a private API.

This actually confuses users a lot: I saw a public class and straightforwardly referred to it, only to find that we cannot use it on the platform of the company that created Delta Lake.

I would suggest either documenting DeltaOptions as a public API and committing to backward compatibility, or making it a delta-private class so others will not fall into the same issue.
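For reference, a minimal sketch of the string-based workaround, assuming the documented option name "overwriteSchema" (so application code does not need to reference the DeltaOptions class at all; the table path is hypothetical):

    // Sketch: pass the option by its documented string name instead of the
    // DeltaOptions constant, so nothing private is needed on the classpath.
    df.write
      .format("delta")
      .option("overwriteSchema", "true")
      .mode("overwrite")            // overwriteSchema only takes effect on overwrite
      .save("/data/events")         // hypothetical table path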

@CodingCat
Author

I am also happy to file a PR as long as we have agreement on this.

@zsxwing
Member

zsxwing commented Feb 16, 2021

Only APIs showing up in the API doc are public. Totally agreed this is confusing. But we basically follow Spark.

@CodingCat
Author

> Only APIs showing up in the API doc are public. Totally agreed this is confusing. But we basically follow Spark.

Yeah, I mean the way to eliminate the confusion might be to either 1) make this a public API, or 2) make it a package-private class so people will not refer to it in application code (or at least not that easily, and when they do, they implicitly accept the risk).
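A rough sketch of what option 2) could look like, assuming DeltaOptions stays in the open source org.apache.spark.sql.delta package (constructor parameters elided; this is only to illustrate the visibility change):

    package org.apache.spark.sql.delta

    // Sketch only: package-private visibility, so application code outside
    // the delta package can no longer compile against DeltaOptions.
    private[delta] class DeltaOptions(/* existing constructor unchanged */)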

@zsxwing
Member

zsxwing commented Feb 16, 2021

I found it's pretty useful when debugging issues in a notebook environment. I'm inclined to leave it as it is and improve the documentation.

@CodingCat
Author

How about DeltaLog, which is used even more but is still not a public API?

@tdas
Contributor

tdas commented Feb 19, 2021 via email

@CodingCat
Author

Hi @tdas. There are several methods of DeltaLog I am interested in, e.g. getChanges, and even something as simple as checkpointInterval.
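For illustration, a minimal sketch of the kind of call meant here, using the open source DeltaLog entry points (the table path is hypothetical, and spark is assumed to be an active SparkSession):

    import org.apache.hadoop.fs.Path
    import org.apache.spark.sql.delta.DeltaLog

    // Open source Delta Lake: grab the transaction log for a table and
    // iterate over the actions committed since a given version.
    val deltaLog = DeltaLog.forTable(spark, new Path("/data/events"))  // hypothetical path
    val changes = deltaLog.getChanges(0L)  // iterator of (version, actions)
    changes.foreach { case (version, actions) =>
      println(s"version $version: ${actions.size} actions")
    }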

In general, our experience with Delta Lake in DBR is not that smooth, because there are many public classes we can see in our Delta Lake dependency (which has to be an open source version), and we can easily access all public classes under the org.apache.spark.sql package... however, they will not run in DBR because all of these classes are moved elsewhere (we have to use a very hacky way to make our code runnable in both our CI and DBR).

Beyond the original purpose of filing this issue, I would like to know if there is any plan to make DBR and the open source version more compatible, which I personally feel would be beneficial in terms of both user experience and community growth.

@TJZhou

TJZhou commented Jul 23, 2021

Faced the same problem while using DeltaOptions to create a DeltaSink in DBR. Apparently both of these classes are private; did you find any workaround to your problem, @CodingCat?

@zsxwing
Member

zsxwing commented Jul 28, 2021

@TJZhou Which APIs in DeltaOptions are you using? Is it possible to avoid using DeltaOptions in your project?

@TJZhou

TJZhou commented Jul 29, 2021

We created a DeltaOptions instance and passed it into DeltaSink, which looks like

new DeltaSink(spark.sqlContext, new Path(destination), partitionColumn, OutputMode.Append(), new DeltaOptions(xxx))

It's almost impossible to avoid it, and I also checked that the DeltaSink API is private in Databricks too.

@zsxwing
Member

zsxwing commented Jul 29, 2021

@TJZhou DeltaSink is not supposed to be used directly. We expect users to drive it only through the Spark APIs, such as df.writeStream.format("delta").start(...). Could you clarify what you are doing? Are you wrapping DeltaSink to do some different work?
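A minimal sketch of that supported path, assuming a streaming DataFrame and hypothetical checkpoint/table locations:

    // Public API for streaming writes to a Delta table; the "delta" format
    // constructs the DeltaSink internally.
    val query = df.writeStream
      .format("delta")
      .outputMode("append")
      .option("checkpointLocation", "/checkpoints/events")  // hypothetical path
      .start("/data/events")                                 // hypothetical table path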

@TJZhou

TJZhou commented Jul 29, 2021

Spark doesn't have good support for writing to multiple, dynamic output locations, so we had to interact with a lower level of the Delta Lake API when writing this. We customize the sink as follows.

    // Cache the micro-batch, split it by destination, and append each split
    // through a per-destination DeltaSink instance (reused across batches).
    val enrichedTypes = df.persist(StorageLevel.MEMORY_ONLY)
    val splitedDF: ParMap[String, DataFrame] = splitDF(enrichedTypes)
    splitedDF.foreach {
      case (destination: String, data: DataFrame) =>
        val sink = outputSinks.getOrElseUpdate(destination,
          new DeltaSink(spark.sqlContext, new Path(destination), partitionColumn, OutputMode.Append(), new DeltaOptions(xxx))
        )
        sink.addBatch(batchId, data)
    }
    enrichedTypes.unpersist()

@zsxwing
Member

zsxwing commented Jul 29, 2021

Hm, so you are trying to write to multiple delta tables in foreachBatch but still require exactly-once?
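For reference, a rough sketch of the foreachBatch shape being discussed here, reusing the splitDF helper from the snippet above and writing each split through the public batch writer. Note that on its own this gives at-least-once rather than exactly-once semantics, since a retried micro-batch may be appended twice:

    df.writeStream
      .foreachBatch { (batch: DataFrame, batchId: Long) =>
        // Split the micro-batch and append each part to its own Delta table
        // using the public DataFrame writer instead of a hand-built DeltaSink.
        splitDF(batch).foreach { case (destination, data) =>
          data.write.format("delta").mode("append").save(destination)
        }
      }
      .option("checkpointLocation", "/checkpoints/multi")  // hypothetical path
      .start()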

@zsxwing
Member

zsxwing commented Jul 29, 2021

@TJZhou

TJZhou commented Jul 29, 2021

Ah, those are some new features that we haven't tried before. Yep, I shall give it a try. Thanks @zsxwing
