MetadataCleanUp may make the present versions not available #606

Open
YannByron opened this issue Feb 23, 2021 · 3 comments
Labels
acknowledged This issue has been read and acknowledged by Delta admins

Comments

@YannByron
Contributor

MetadataCleanup deletes expired Delta log files (both the .json commit files and the checkpoint.parquet files). But if a version that is still present depends on files that get cleaned up, that version can no longer be reconstructed by replaying the commits. For example, suppose we have Delta logs for version 0 through version 10, as follows: 000.json ~ 009.json, 010.json, 010.checkpoint.parquet.

When commit 10 is written, MetadataCleanup runs. If we assume the logs before version 9 (exclusive) should be cleaned up, then the remaining files are: 009.json, 010.json, 010.checkpoint.parquet.
In fact, version 9 is then not available, because there is no earlier checkpoint and replaying up to it requires 000.json through 009.json. Only version 10 is shown by desc history.
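
To make the example concrete, here is a minimal Python sketch that models the log as a set of commit versions and a set of checkpoint versions. It is not Delta's implementation; the function name `reconstructible_versions` and the plain-integer model are assumptions made for illustration only. A version counts as available if it can be replayed from version 0 or from some checkpoint at or before it, with no missing commit in between.

```python
def reconstructible_versions(commits, checkpoints):
    """Return the sorted versions that can still be rebuilt from the remaining log files.

    commits:     versions that still have a .json commit file
    checkpoints: versions that still have a checkpoint.parquet file
    (Hypothetical model for illustration, not a Delta API.)
    """
    commits, checkpoints = set(commits), set(checkpoints)
    available = []
    for v in sorted(commits | checkpoints):
        # Possible starting points: an empty table (-1, replay from commit 0)
        # or any checkpoint at or before v.
        starts = [-1] + [c for c in checkpoints if c <= v]
        # v is available if every commit after some starting point, up to v, exists.
        if any(all(u in commits for u in range(s + 1, v + 1)) for s in starts):
            available.append(v)
    return available


# Before cleanup: 000.json ~ 010.json plus 010.checkpoint.parquet.
before = reconstructible_versions(range(0, 11), [10])

# After cleanup: only 009.json, 010.json and 010.checkpoint.parquet remain.
after = reconstructible_versions([9, 10], [10])
```

In this model `before` contains every version from 0 to 10, while `after` contains only version 10, even though 009.json is still on disk.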

@zsxwing
Member

zsxwing commented Feb 23, 2021

@YannByron could you explain a bit what's the behavior you expect? Making desc history show only versions you can time travel to?

@YannByron
Contributor Author

@zsxwing
Based on the above example, I expect that, to keep the remaining versions available, the files they depend on are retained even though they would otherwise have been cleaned up. That is, do not delete 000.json ~ 007.json.
In that case, desc history would still show all versions. So we could mark the versions that the user expects to be cleaned up, and adjust the desc history implementation so that those are not displayed.

IMO, MetadataCleanup should keep the preceding checkpoint file (if it exists) closest to the smallest non-deleted version, and the delta log files (json) in between.

So we should "clean up as much as possible" while "guaranteeing availability within expectations". After all, users don't intend to clean up versions 8 and 9 in the above example.
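
The retention rule proposed above could be sketched roughly as follows. This is only an illustration in Python under the same integer-version model; `files_safe_to_delete` and `min_keep` (the smallest version the user expects to remain available) are hypothetical names, not part of Delta.

```python
def files_safe_to_delete(commits, checkpoints, min_keep):
    """Pick log files that can be removed without breaking versions >= min_keep.

    Returns (commit_versions_to_delete, checkpoint_versions_to_delete).
    (Hypothetical helper for illustration, not a Delta API.)
    """
    # Checkpoints that versions >= min_keep could be replayed from.
    anchors = [c for c in checkpoints if c <= min_keep]
    if not anchors:
        # No usable checkpoint: min_keep needs every commit from version 0,
        # so nothing can be deleted safely.
        return [], []
    anchor = max(anchors)
    # Keep the anchor checkpoint and everything after it; drop what is older.
    old_commits = sorted(v for v in commits if v < anchor)
    old_checkpoints = sorted(c for c in checkpoints if c < anchor)
    return old_commits, old_checkpoints


# The example from this issue: the only checkpoint is at version 10.
no_anchor = files_safe_to_delete(range(0, 11), [10], min_keep=9)

# Same log, but with an additional checkpoint at version 5.
with_anchor = files_safe_to_delete(range(0, 11), [5, 10], min_keep=9)
```

With only the version 10 checkpoint, nothing is deleted, which matches keeping the early json files in the example. With an extra checkpoint at version 5, commits 0 through 4 can go while the checkpoint at 5 and the commits after it stay.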

@YannByron
Contributor Author

> IMO, MetadataCleanup should keep the preceding checkpoint file (if it exists) closest to the smallest non-deleted version, and the delta log files (json) in between.

Rather than keeping those files, creating a checkpoint for the smallest non-deleted version may be the better solution, so that we don't need to adjust desc history.
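
As a rough sketch of this alternative, again in Python with hypothetical names (`plan_cleanup_with_checkpoint`, `min_keep`) and not Delta's actual code: write a checkpoint at the smallest non-deleted version first, while the older commits are still readable, then delete everything older than it.

```python
def plan_cleanup_with_checkpoint(commits, checkpoints, min_keep):
    """Plan a cleanup that first checkpoints min_keep, then drops older files.

    Returns (commit_versions_kept, checkpoint_versions_kept).
    Assumes the checkpoint at min_keep is written before any file is deleted.
    (Hypothetical helper for illustration, not a Delta API.)
    """
    commits, checkpoints = set(commits), set(checkpoints)
    # Step 1: materialize a checkpoint at the smallest version to keep.
    checkpoints_after_write = checkpoints | {min_keep}
    # Step 2: everything strictly older than min_keep is no longer needed.
    kept_commits = sorted(v for v in commits if v >= min_keep)
    kept_checkpoints = sorted(c for c in checkpoints_after_write if c >= min_keep)
    return kept_commits, kept_checkpoints


# The example from this issue: keep versions 9 and 10.
kept_commits, kept_checkpoints = plan_cleanup_with_checkpoint(range(0, 11), [10], min_keep=9)
```

Here the remaining commits are 9 and 10 and the remaining checkpoints are 9 and 10, so both versions can be rebuilt and the set of commit files already matches the versions the user expects to see.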

@vkorukanti vkorukanti added the acknowledged This issue has been read and acknowledged by Delta admins label Oct 7, 2021

3 participants