Feature Request - Auto Analyze Table #581
Comments
It would be very good if Delta Lake OSS had column statistics for Parquet file pruning. |
We are currently reviewing this issue and will follow up shortly. |
@scottsand-db do you have any updates on this? Is it expected for the next release? |
Hi @felipepessoto, thanks for following up. Delta Lake 1.1 included per-column file stats collection + data skipping. Does this meet your needs? |
I need to test it. In my experiments with Parquet and Delta, running ANALYZE TABLE made the queries ~40% faster than both Parquet (without ANALYZE TABLE) and Delta. |
BTW, do you mean Delta 1.2? I don't see these changes in the 1.1 changelog. |
Yup, my bad. I meant 1.2. |
@scottsand-db in my test with 1.2 it didn't improve performance. Looking at the query plans, they are the same as with 1.1, except for PreparedDeltaFileIndex in place of TahoeLogFileIndex. Are stats expected to improve performance for queries like this? https://github.com/Agirish/tpcds/blob/master/query93.sql |
@felipepessoto did you re-generate the data for your tests? Stats are only written by Delta 1.2, so you would need to re-generate them in order to leverage the data skipping improvements added by Delta 1.2. |
UPDATE: I found the stats (min, max, null count) in the Delta log, but I'm not sure why they are not being used during the query. Yes, I regenerated the data. Do you know how I can check the stats? DESCRIBE EXTENDED doesn't show any: If I do the same using Parquet, after running ANALYZE TABLE: BTW, do you know the differences between the two approaches, Delta stats and ANALYZE TABLE? Thanks |
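For anyone else wanting to check the stats mentioned above: a minimal Python sketch that reads them straight from the JSON commit files under `_delta_log`. The helper name `read_file_stats` is made up for illustration. It assumes the table has no Parquet checkpoints that would need to be read, and it does not filter out files that a later commit removed.

```python
import json
from pathlib import Path


def read_file_stats(table_path):
    """Return {data file path: parsed stats dict} from the JSON commits in _delta_log.

    Assumptions: only JSON commit files are read (checkpoints are ignored),
    and files removed by later commits are not filtered out.
    """
    stats_by_file = {}
    log_dir = Path(table_path) / "_delta_log"
    # Commit file names are zero-padded version numbers, so sorting by name
    # visits them in version order.
    for commit in sorted(log_dir.glob("*.json")):
        with commit.open() as fh:
            for line in fh:
                line = line.strip()
                if not line:
                    continue
                action = json.loads(line)
                add = action.get("add")
                if add and add.get("stats"):
                    # "stats" is stored as a JSON-encoded string inside the add action.
                    stats_by_file[add["path"]] = json.loads(add["stats"])
    return stats_by_file
```

Each parsed entry should carry `numRecords`, `minValues`, `maxValues` and `nullCount` when the writer collected stats.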
Hi @felipepessoto. The main advantage of [...] Column stats, on the other hand, are used to help skip files during scans. So, if you perform a filter and then a [...] The reason Delta Lake currently doesn't support [...] If I'm missing any details I'm sure that @zsxwing can fill them in. |
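To illustrate the file-skipping idea described above, here is a rough Python sketch for an equality filter. It is not Delta's actual implementation; `may_contain` and `files_to_scan` are hypothetical names, and the input is assumed to be a dict of per-file stats shaped like the `stats` field of the log's add actions (`minValues`, `maxValues`).

```python
def may_contain(stats, column, value):
    """Return False only when min/max stats prove no row in the file equals value."""
    lo = stats.get("minValues", {}).get(column)
    hi = stats.get("maxValues", {}).get(column)
    if lo is None or hi is None:
        # No stats for this column: the file cannot be skipped safely.
        return True
    return lo <= value <= hi


def files_to_scan(stats_by_file, column, value):
    """Keep only the files whose [min, max] range could hold the value."""
    return [
        path
        for path, stats in stats_by_file.items()
        if may_contain(stats, column, value)
    ]
```

The check is conservative on purpose: a file is dropped only when its range rules the value out, so missing stats never cause wrong results, only less skipping.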
Some more questions, please:
I'm wondering how much effort would be to change ANALYZE to work with Delta, reading from transaction log like you said |
Yep.
Yep. ANALYZE is a Spark feature and technically it only supports built-in file formats. Today Spark doesn't provide an interface to non-built-in data sources. In addition, it requires users to run ANALYZE themselves, and it's easy to return incorrect answers when users forget to run ANALYZE.
IIRC, you are basically asking whether Delta can support table-wise stats. If so, this is not on our roadmap. This won't be an easy project. Unlike Parquet, Delta Lake needs to provide ACID, and updating the table-wise stats for each write with ACID guarantees is challenging. An alternative, though not ideal, solution is computing the table-wise stats from the file-wise stats when reading a Delta table. It would take a bit of time to compute the table-wise stats and won't be as fast as Parquet, but it would provide ACID, which is critical for Delta Lake users. This is just a brainstorm. Feel free to post your suggestions! |
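The alternative mentioned above, deriving table-wise stats from file-wise stats at read time, could look roughly like the Python sketch below. The function name `merge_file_stats` is hypothetical, and it assumes flat (non-nested) columns whose values are directly comparable. Distinct counts, which ANALYZE TABLE also computes, cannot be derived from per-file min/max/null counts this way.

```python
def merge_file_stats(per_file_stats):
    """Fold per-file stats dicts into one table-level stats dict.

    Assumption: flat columns with comparable values; a column missing from
    some files is merged over only the files that report it.
    """
    table = {"numRecords": 0, "minValues": {}, "maxValues": {}, "nullCount": {}}
    for stats in per_file_stats:
        table["numRecords"] += stats.get("numRecords", 0)
        for col, value in stats.get("minValues", {}).items():
            current = table["minValues"].get(col)
            table["minValues"][col] = value if current is None else min(current, value)
        for col, value in stats.get("maxValues", {}).items():
            current = table["maxValues"].get(col)
            table["maxValues"][col] = value if current is None else max(current, value)
        for col, count in stats.get("nullCount", {}).items():
            table["nullCount"][col] = table["nullCount"].get(col, 0) + count
    return table
```

Because the inputs come from one snapshot of the log, the merged result stays consistent with that snapshot, at the cost of recomputing it on read.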
@zsxwing, I was looking at this PR: #840, which falls back to v1 to implement ANALYZE TABLE. It seems the only problem with analyzeTable is calculating the total size, because the calculateTotalSize method relies on catalogTable.storage.locationUri and scans everything inside it (or does something similar when the data is partitioned). For count it seems fine to me:
And computeColumnStats also looks good. Do you think the same? |
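On the total-size problem: a directory scan also counts files that are no longer part of the table, whereas the transaction log records a `size` for every added file. A hedged Python sketch of summing sizes from the log instead is below. `live_table_size` is a made-up helper, not part of Spark or Delta, and it assumes the full history is still available as JSON commits (no checkpoints or log cleanup).

```python
import json
from pathlib import Path


def live_table_size(table_path):
    """Replay add/remove actions and return (total bytes, live file count).

    Assumption: every commit is present as a JSON file under _delta_log.
    """
    live = {}  # data file path -> size in bytes
    log_dir = Path(table_path) / "_delta_log"
    # Zero-padded commit names sort in version order.
    for commit in sorted(log_dir.glob("*.json")):
        for line in commit.read_text().splitlines():
            if not line.strip():
                continue
            action = json.loads(line)
            if "add" in action:
                live[action["add"]["path"]] = action["add"].get("size", 0)
            elif "remove" in action:
                # A removed file no longer belongs to the current snapshot.
                live.pop(action["remove"]["path"], None)
    return sum(live.values()), len(live)
```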
@felipepessoto yep. You are right. Do you know how Spark uses these stats? Only for optimization, or does it also use them to return answers? |
From my experiment, it seems it is only for optimization. |
Is it possible to enable an option to auto-analyze a Delta table?
For example:
ANALYZE TABLE x COMPUTE STATISTICS FOR ALL COLUMNS