DeltaSource loads all files onto the Spark driver during every batch. For a big table this is a huge issue. #580

Open
GrigorievNick opened this issue Jan 12, 2021 · 4 comments
Labels
enhancement New feature or request

Comments

@GrigorievNick
Contributor

This line of code collects every file entry from the Delta log onto the Spark driver:

Seq("add")).as[IndexedFile].collect().toIterable

  1. This creates a very large memory and GC overhead on the Spark driver for big tables.
  2. It's pretty strange to see collect inside a function that returns an iterator. Is there any reason not to use toLocalIterator?
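
To make the difference concrete, here is a minimal sketch of the two approaches. It is not the DeltaSource code: `FileEntry` is a hypothetical stand-in for `IndexedFile`, and the object and method names are made up for illustration. It assumes Spark 2.4 or 3.x on Scala 2.12, where `Dataset.toLocalIterator()` returns a `java.util.Iterator` that has to be converted with `asScala`.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}
import scala.collection.JavaConverters._

// Hypothetical stand-in for Delta's IndexedFile; the fields are illustrative only.
case class FileEntry(version: Long, index: Long, path: String)

object LocalIteratorSketch {
  // Eager: collect() materializes every row in driver memory before
  // the first element is handed to the caller.
  def eager(ds: Dataset[FileEntry]): Iterator[FileEntry] =
    ds.collect().toIterator

  // Incremental: toLocalIterator() brings rows to the driver one partition
  // at a time, so peak driver memory is bounded by the largest partition.
  def streamed(ds: Dataset[FileEntry]): Iterator[FileEntry] =
    ds.toLocalIterator().asScala

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("local-iterator-sketch")
      .getOrCreate()
    import spark.implicits._

    // Build a small dataset spread over several partitions.
    val files = (0L until 1000L)
      .map(i => FileEntry(1L, i, s"part-$i.parquet"))
      .toDS()
      .repartition(8)

    // Both variants yield the same rows; only the driver memory profile differs.
    assert(eager(files).toSet == streamed(files).toSet)

    spark.stop()
  }
}
```

The trade-off is that `toLocalIterator` may run a separate job per partition, so it can be slower than a single `collect` and may recompute the dataset unless it is cached.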
@zsxwing
Member

zsxwing commented Jan 21, 2021

Good catch. Yep, it makes sense to use toLocalIterator. Feel free to submit a PR to fix it.

@zsxwing zsxwing added the enhancement New feature or request label Jan 21, 2021
@GrigorievNick
Contributor Author

Thanks @zsxwing. Right now we use Spark 2.4.7, and the master branch of delta.io is compatible only with Spark 3.
So I have two questions. Please advise.

  1. Can I submit the fix as a patch against the 0.6.1 tag and expect it to be published in a 0.6.2 delta.io release?
  2. I will fix this for Spark 3 as well, but right now I don't have an environment to test it. Is it OK if I submit the fix without internal testing? Do the delta.io tests cover the streaming API, or would it be better to publish the MR after my team adopts Spark 3?

@fvaleye
Contributor

fvaleye commented Feb 3, 2021

Hello @GrigorievNick, I submitted a PR related to your issue (only with the latest version of delta-io, with Spark 3). I hope it helps!

@GrigorievNick
Contributor Author

Hi @fvaleye, thank you.
This does not affect my current project, because I use Spark 2 and will continue to for the next few months.
So I just have my own fork of delta.io 0.6.1.
But this saves me time I would have spent contributing to something I don't use right now at work.
It also helps improve project quality in general.
So again, thank you.
It was very helpful.

3 participants