-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: Limit together with pushdown_filters #13788
Conversation
Thank you @zhuqi-lucas. Could you, please, add a test which could serve as a reproducer for this issue? Ideally if it would be an sqllogictest simpler than the example from the description, something like statement ok
set datafusion.execution.parquet.pushdown_filters = true;
# this one is also required to make DF skip second file due to "sufficient" amount of rows
statement ok
set datafusion.execution.collect_statistics = true;
statement ok
CREATE EXTERNAL TABLE test_filter_with_limit
(
part_key INT,
value INT,
)
STORED AS PARQUET
LOCATION 'test_files/scratch/parquet/test_filter_with_limit/'
PARTITIONED BY (part_key);
# There will be 2 records filtered from the table to check that `limit 1` actually applied
statement ok
insert into test_filter_with_limit (part_key, value) values (1, 0), (1, 1), (1, 100), (2, 0), (2, 2), (2, 2), (2, 100);
query II
select * from test_filter_with_limit where value = 2 limit 1;
----
2 2 |
let mut statistic_file_limit = limit; | ||
if !filters.is_empty() { | ||
statistic_file_limit = None; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let mut statistic_file_limit = limit; | |
if !filters.is_empty() { | |
statistic_file_limit = None; | |
} | |
let statistic_file_limit = if filters.is_empty() { limit } else { None }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good suggestion @korowa !
Good suggestion! Thanks @korowa , i will address it soon. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
I'm going to merge it tomorrow, in case anyone else would like to review it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @zhuqi-lucas and @korowa
I tried to verify the test covers the code by reverting the code change:
andrewlamb@Andrews-MacBook-Pro-2:~/Software/datafusion2$ git diff
diff --git a/datafusion/core/src/datasource/listing/table.rs b/datafusion/core/src/datasource/listing/table.rs
index b12f37ed7..ffe49dd2b 100644
--- a/datafusion/core/src/datasource/listing/table.rs
+++ b/datafusion/core/src/datasource/listing/table.rs
@@ -843,13 +843,8 @@ impl TableProvider for ListingTable {
});
// TODO (https://github.com/apache/datafusion/issues/11600) remove downcast_ref from here?
let session_state = state.as_any().downcast_ref::<SessionState>().unwrap();
-
- // We should not limit the number of partitioned files to scan if there are filters and limit
- // at the same time. This is because the limit should be applied after the filters are applied.
- let statistic_file_limit = if filters.is_empty() { limit } else { None };
-
let (mut partitioned_file_lists, statistics) = self
- .list_files_for_scan(session_state, &partition_filters, statistic_file_limit)
+ .list_files_for_scan(session_state, &partition_filters, limit)
.await?;
// if no files need to be read, return an `EmptyExec`
And running the test:
andrewlamb@Andrews-MacBook-Pro-2:~/Software/datafusion2$ cargo test --test sqllogictests -- pushdown
Finished `test` profile [unoptimized + debuginfo] target(s) in 0.15s
Running bin/sqllogictests.rs (target/debug/deps/sqllogictests-b3c91103e28d27e7)
Executed "parquet_filter_pushdown.slt". Took 24.915792ms
andrewlamb@Andrews-MacBook-Pro-2:~/Software/datafusion2$
However, the test seems to pass without the code changes 🤔
Thank you @alamb for review, i will try to change the testing, it may not hit the test case. |
@alamb @korowa Before the PR, the testing result: query II
select * from test_filter_with_limit where value = 2 limit 1;
---- After the PR, the testing result: query II
select * from test_filter_with_limit where value = 2 limit 1;
----
2 2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I verified that the test does indeed fail now without the change. Thank you very much @zhuqi-lucas and @korowa
[SQL] select * from test_filter_with_limit where value = 2 limit 1;
[Diff] (-expected|+actual)
- 2 2
at test_files/push_down_filter.slt:186
Error: Execution("1 failures")
error: test failed, to rerun pass `-p datafusion-sqllogictest --test sqllogictests`
Thank you!
It looks like there is another .slt in the logs:
while the new test was added into |
Which issue does this PR close?
Closes #13724
Rationale for this change
We always limit the partition file even with filter before limit, so it will return the wrong file group numbers for us to filter the data before the final limit.
Fix way, we should not limit the partition files when we have filter already with limit.
What changes are included in this PR?
Are these changes tested?
Testing code with generating files.
After this PR, the testing will print the right value instead of the empty value.
println!("{:?}", batch);
Are there any user-facing changes?
No