SELECT * FROM subquery ignores ordering #13904

Closed
TheBuilderJR opened this issue Dec 25, 2024 · 4 comments
Labels
bug Something isn't working

Comments

@TheBuilderJR
Contributor

TheBuilderJR commented Dec 25, 2024

Describe the bug

In almost every other SQL query engine, I would expect

SELECT * FROM (
  SOME_ORDERED_SUB_QUERY_HERE
) AS query LIMIT 10000

to preserve the order of the subquery, e.g.

SELECT * FROM (
SELECT
  session.access_token,
  COUNT(1) as c
FROM quizchat_chats
GROUP BY session.access_token
ORDER BY c
) AS query LIMIT 10000

But in DataFusion it doesn't! It seemingly returns the rows in an arbitrary order.

To Reproduce

Run a SELECT * FROM over an ordered subquery

Expected behavior

The ordering of the subquery is preserved in the outer query's result.

Additional context

No response

@TheBuilderJR TheBuilderJR added the bug Something isn't working label Dec 25, 2024
@2010YOUY01
Contributor

I think the SQL standard doesn't require this ordering to be preserved, though it's possible that most engines happen to implement it that way.

@jonahgao
Member

A subquery is a derived table/relation. A relation is unordered, and you can't rely on its tuples being ordered; different database implementations or queries have different behaviors. In #12003, DataFusion intentionally optimized away unnecessary ordering in subqueries.

Some discussions of other databases:

@ozankabak
Contributor

@jonahgao is correct -- this is not a bug. If the top-level query doesn't specify an ordering, the engine is free to optimize the subquery ordering away.
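
One practical consequence, as a minimal sketch: when the ORDER BY sits only inside the subquery, the set of rows is well defined but their order is not, so any check on the result should compare rows as a multiset. The snippet below uses Python's built-in `sqlite3` purely so that it runs anywhere; it does not involve DataFusion, and the flat `access_token` column and the sample rows are assumptions standing in for the `session.access_token` field from the issue.

```python
import sqlite3
from collections import Counter

# Assumption: a flat access_token column and made-up rows stand in for the
# issue's quizchat_chats table and its session.access_token field.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE quizchat_chats (access_token TEXT)")
conn.executemany(
    "INSERT INTO quizchat_chats VALUES (?)",
    [("a",), ("b",), ("b",), ("c",), ("c",), ("c",)],
)

# Same shape as the query in the issue: ORDER BY only inside the subquery.
inner_ordered = conn.execute(
    """
    SELECT * FROM (
      SELECT access_token, COUNT(1) AS c
      FROM quizchat_chats
      GROUP BY access_token
      ORDER BY c
    ) AS query LIMIT 10000
    """
).fetchall()

# Compare as a multiset of rows: the contents are guaranteed, the order is not.
expected = Counter([("a", 1), ("b", 2), ("c", 3)])
rows_match = Counter(inner_ordered) == expected
```

Whatever order a given engine happens to return here, the multiset comparison holds; a list comparison would be relying on behavior the query does not ask for.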

@comphead
Contributor

comphead commented Dec 28, 2024

@TheBuilderJR thanks for opening the issue, and please let us know if anything else is still needed. As @ozankabak correctly mentioned, if there is no top-level ordering, the engine cannot guarantee the final order.

The LIMIT clause is evaluated after the ORDER BY clause, so if you rewrite your query as

SELECT * FROM (
SELECT
  session.access_token,
  COUNT(1) as c
FROM quizchat_chats
GROUP BY session.access_token
) AS query ORDER BY c LIMIT 10000

it should return the rows in the expected order.
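
The rewritten query above can be tried out with a small sketch. This again uses Python's built-in `sqlite3` as a stand-in engine rather than DataFusion, with a hypothetical flat `access_token` column and made-up rows, so it only illustrates the shape of the fix: the ORDER BY moves to the top level, next to the LIMIT.

```python
import sqlite3

# Assumption: sample data inserted in scrambled order; the flat access_token
# column replaces the issue's session.access_token field.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE quizchat_chats (access_token TEXT)")
conn.executemany(
    "INSERT INTO quizchat_chats VALUES (?)",
    [("c",), ("a",), ("c",), ("b",), ("c",), ("b",)],
)

# ORDER BY is on the outer query, so the final order is part of the request
# and LIMIT is applied to the sorted rows.
rows = conn.execute(
    """
    SELECT * FROM (
      SELECT access_token, COUNT(1) AS c
      FROM quizchat_chats
      GROUP BY access_token
    ) AS query ORDER BY c LIMIT 10000
    """
).fetchall()

# Extract the counts and check that they come back in ascending order.
counts = [c for _, c in rows]
is_sorted = counts == sorted(counts)
```

Because the sort is now requested by the top-level query, the engine cannot drop it as unnecessary, and checking the counts in list order is a valid test.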
