MetadataCleanup deletes expired delta log files (both the json commit files and the checkpoint.parquet files). But if a retained version depends on files that get cleaned up, that version can no longer be reconstructed by replaying the commits. For example, suppose we have delta logs for versions 0 to 10 as follows: 000.json ~ 009.json, 010.json, 010.checkpoint.parquet.
MetadataCleanup runs when commit 10 is made. If we assume the logs before version 9 (exclusive) should be cleaned up, the remaining files are: 009.json, 010.json, 010.checkpoint.parquet.
As a result, version 9 is not available, because rebuilding it needs the deleted 000.json ~ 008.json, and only version 10 is shown by desc history.
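To make the failure concrete, here is a minimal sketch of the replay rule described above. It is only a model of the log directory, not Delta Lake code: `reconstructible_versions` is a hypothetical helper, and it assumes a version can be rebuilt from the newest checkpoint at or before it plus every json commit after that checkpoint.

```python
def reconstructible_versions(json_versions, checkpoint_versions):
    """Return the sorted versions that can be rebuilt from the given log files.

    Hypothetical model: both arguments are iterables of integer version numbers.
    """
    jsons = set(json_versions)
    checkpoints = set(checkpoint_versions)
    if not jsons and not checkpoints:
        return []
    latest = max(jsons | checkpoints)
    available = []
    for v in range(latest + 1):
        # Start from the newest checkpoint at or before v, else from an empty table.
        base = max((c for c in checkpoints if c <= v), default=-1)
        # Every commit after the base, up to and including v, must still exist.
        if all(n in jsons for n in range(base + 1, v + 1)):
            available.append(v)
    return available


# Before cleanup: 000.json ~ 010.json plus 010.checkpoint.parquet.
before = reconstructible_versions(range(0, 11), [10])
# After cleanup: 009.json, 010.json, 010.checkpoint.parquet.
after = reconstructible_versions([9, 10], [10])
print(before)  # every version from 0 to 10
print(after)   # [10]
```

Under this model, 009.json survives the cleanup but is useless on its own, which matches what desc history shows in the example.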
@zsxwing
Based on the above example, I expect that, to keep the remaining versions available, the information they depend on is retained even though it would otherwise have been cleaned up. That is, do not delete 000.json ~ 007.json.
In that case, desc history would still show all versions. So we could mark the versions that the user expects to be cleaned up and adjust the desc history implementation so that they are not displayed.
IMO, MetadataCleanup should keep the closest checkpoint file (if one exists) preceding the smallest non-deleted version, along with the delta log files (json) in between.
So we should 'clean up as much as possible' while 'guaranteeing availability within expectations'. After all, users don't intend to clean up versions 8 and 9 in the above example.
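The rule proposed here could be sketched roughly as follows. `files_to_delete` is a hypothetical helper, not part of MetadataCleanup, and it assumes that the json file at the checkpoint's own version is also kept so that its commit info stays visible in the history.

```python
def files_to_delete(json_versions, checkpoint_versions, min_retained):
    """Pick log files that can go while versions >= min_retained stay rebuildable.

    Hypothetical model: versions are integers; returns (jsons, checkpoints).
    """
    # Nearest checkpoint at or before the smallest version to retain. If none
    # exists, replay must start at version 0, so nothing can be deleted.
    base = max((c for c in checkpoint_versions if c <= min_retained), default=0)
    old_jsons = sorted(v for v in json_versions if v < base)
    old_checkpoints = sorted(c for c in checkpoint_versions if c < base)
    return old_jsons, old_checkpoints


# The example from this issue: the only checkpoint is at version 10, so keeping
# version 9 available means no json file can be removed.
no_checkpoint_before = files_to_delete(range(0, 11), [10], 9)
# With an additional (assumed) checkpoint at version 5, 000.json ~ 004.json can go.
with_checkpoint_5 = files_to_delete(range(0, 11), [5, 10], 9)
print(no_checkpoint_before)
print(with_checkpoint_5)
```

The trade-off is that cleanup can retain much more than the user asked for when checkpoints are sparse, which is why the versions kept only as dependencies would have to be hidden from desc history.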
> IMO, MetadataCleanup should keep the front checkpoint file (if it exists) closest to the smallest non-deleted version and the delta-log files(json) in between.

Rather than keeping those files, creating a checkpoint for the smallest non-deleted version may be the better solution, so that we don't need to adjust desc history.
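A rough sketch of this alternative, again with a hypothetical helper (`plan_cleanup`) rather than real Delta Lake code: it plans a checkpoint at the smallest retained version and then treats everything strictly older as deletable. It assumes the checkpoint is written before any file is deleted, since building it requires replaying the older commits.

```python
def plan_cleanup(json_versions, checkpoint_versions, min_retained):
    """Return (checkpoint_to_write, jsons_to_delete, checkpoints_to_delete).

    Hypothetical model: versions are integers; checkpoint_to_write is None
    when a checkpoint already exists at min_retained.
    """
    needs_checkpoint = min_retained not in set(checkpoint_versions)
    checkpoint_to_write = min_retained if needs_checkpoint else None
    # Once the checkpoint at min_retained exists, nothing older is needed.
    jsons_to_delete = sorted(v for v in json_versions if v < min_retained)
    checkpoints_to_delete = sorted(c for c in checkpoint_versions if c < min_retained)
    return checkpoint_to_write, jsons_to_delete, checkpoints_to_delete


# The example from this issue: write 009.checkpoint.parquet first,
# then delete 000.json ~ 008.json.
plan = plan_cleanup(range(0, 11), [10], 9)
print(plan)
```

With this plan the remaining files describe exactly the versions the user wanted to keep, so desc history needs no special handling.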