Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature Request - Auto Analyze Table #581

Open
devlucasc opened this issue Jan 13, 2021 · 17 comments
Open

Feature Request - Auto Analyze Table #581

devlucasc opened this issue Jan 13, 2021 · 17 comments
Labels
acknowledged This issue has been read and acknowledged by Delta admins enhancement New feature or request

Comments

@devlucasc
Copy link

devlucasc commented Jan 13, 2021

It is possible to enable an option to auto analyze a delta table?

For example: ANALYZE TABLE x COMPUTE STATISTICS FOR ALL COLUMNS

@rubenssoto
Copy link

It would be very good if delta lake OSS have columns statistics for parquet file prunning.

@scottsand-db scottsand-db added the enhancement New feature or request label Oct 7, 2021
@scottsand-db
Copy link
Collaborator

We are currently reviewing this issue and will follow up shortly.

@scottsand-db scottsand-db added the acknowledged This issue has been read and acknowledged by Delta admins label Oct 7, 2021
@felipepessoto
Copy link
Contributor

@scottsand-db do you have any updates on this? Is it expected for the next release?
Thanks

@scottsand-db
Copy link
Collaborator

Hi @felipepessoto, thanks for following up. Delta Lake 1.1 included per-column file stats collection + data skipping. Does this meet your needs?

@felipepessoto
Copy link
Contributor

I need to test it. In my experiments with Parquet and Delta, the ANALYZE TABLE made the queries ~40% faster than both Parquet (without ANALYZE TABLE) and Delta.

@felipepessoto
Copy link
Contributor

BTW, you mean Delta 1.2? In 1.1 changelog I don't see these changes

@scottsand-db
Copy link
Collaborator

Yup, my bad. I meant 1.2.

@felipepessoto
Copy link
Contributor

@scottsand-db in my test with 1.2 it didn't improve performance. Looking the query plan, they are the same as 1.1, except by PreparedDeltaFileIndex instead of TahoeLogFileIndex.

Stats are expected to improve performance for queries like this? https://github.com/Agirish/tpcds/blob/master/query93.sql
(with inner queries, joins)

@zsxwing
Copy link
Member

zsxwing commented May 27, 2022

@felipepessoto did you re-generate the data for your tests? Stats are only written by Delta 1.2, so you would need to re-generate them in order to leverage the data skipping improvements added by Delta 1.2.

@felipepessoto
Copy link
Contributor

felipepessoto commented May 27, 2022

UPDATE: I found the stats (min, max, null count) in delta log, but not sure why they are not being used during query

Yes, I regenerate it. Do you know how I can check the stats?

DESCRIBE EXTENDED doesn't show any:

image

If I do the same using parquet, after running the ANALYZE Table:

image

BTW, do you know the differences between both approaches? Delta stats and ANALYZE TABLE?

Thanks

@scottsand-db
Copy link
Collaborator

Hi @felipepessoto. The main advantage of ANALYZE is that it computes different stats than "column stats + data skipping". It computes stats on unique/distinct values in a table. These stats are useful in improving the performance of JOINs.

Column stats on the other hand are used to help skip files during scans. So, if you perform a filter and then a JOIN, that is when column stats will help.

The reason Delta Lake currently doesn't support ANALYZE is that it currently looks directly at parquet files. We would instead need to read the transaction log, and use that metadata to tell us which parquet files to read.

If I'm missing any details I'm sure that @zsxwing can fill them in.

@felipepessoto
Copy link
Contributor

Some more questions, please:

  1. Would be correct to say that Delta stats are file-wise while ANALYZE are table-wise?
  2. ANALYZE is a Spark feature, while Delta stats is part of Delta spec?
  3. There is any work in progress to support Delta ANALYZE?

I'm wondering how much effort would be to change ANALYZE to work with Delta, reading from transaction log like you said
Or alternatively, change Delta specs to add the other stats (if it makes sense).

@zsxwing
Copy link
Member

zsxwing commented May 31, 2022

  1. Would be correct to say that Delta stats are file-wise while ANALYZE are table-wise?

Yep.

2. ANALYZE is a Spark feature, while Delta stats is part of Delta spec?

Yep. ANALYZE is a Spark feature and technically it only supports built-in file formats. Today Spark doesn't provide an interface to non built-in data sources. In addition, it requires users to run ANALYZE by themselves, and it's easy to return incorrect answers when users forgot to run ANALYZE.

3. There is any work in progress to support Delta ANALYZE?

IIRC, you are basically asking whether Delta can support table-wise stats. If so, this is not on our roadmap. This won't be an easy project. Unlike parquet, Delta Lake needs to provide ACID, and updating the table-wise stats for each write with ACID guarantee is challenging.

An alternative solution but no ideal is computing the table-wise stats based on the file-wise stats when reading a Delta table. It would take a bit time to compute the table-wise stats and won't be as fast as parquet. But it would provide ACID which is critical for Delta Lake users. This is just a brainstorm. Feel free to post your suggestions!

@felipepessoto
Copy link
Contributor

felipepessoto commented Jun 23, 2022

@zsxwing, I was looking at this PR: #840 which fallback to v1 to implement ANALYZE TABLE.

It seems the problem with analyzeTable is only to calculate the total size, because the calculateTotalSize method relies on catalogTable.storage.locationUri and scan everything inside it (or similarly when data is partitioned).

For count it seems fine to me:

sparkSession.table(tableIdentWithDB).count()

And computeColumnStats also looks good.

Do you think the same?

@zsxwing
Copy link
Member

zsxwing commented Jun 24, 2022

@felipepessoto yep. You are right. Do you know how Spark uses these stats? Only for optimization, or also use stats to return answers?

@felipepessoto
Copy link
Contributor

From my experiment, it seems it is only for optimization.
I executed a SELECT COUNT(*) FROM TABLE query, Spark still scan the parquet files, instead of using the data available in the stats to return the results.
The same for MAX/MIN

@felipepessoto
Copy link
Contributor

apache/spark#41111

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
acknowledged This issue has been read and acknowledged by Delta admins enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

5 participants