
1.21.0 is 5-9 times slower than 1.17.0 on collect on concatenated Azure blob parquet files. #20959

Open
2 tasks done
astrowonk opened this issue Jan 28, 2025 · 1 comment
Labels
bug (Something isn't working) · needs triage (Awaiting prioritization by a maintainer) · python (Related to Python Polars)

Comments


Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

Per this comment on #13381, I have a large collection of parquet blobs and I can't be certain they all have the same column order, so I create lazy frames like this, sometimes concatenating hundreds of parquet files:

df_test_lazy = pl.concat(
    [
        pl.scan_parquet(x, storage_options=storage_options)
        for x in [
            "az://some-blobs/col_order_test_1.parquet",
            "az://some-blobs/col_order_test_2.parquet",
        ]
    ],
    how="diagonal",
)

Since 1.20.0, operations on these lazy frames are 5-9 times slower than on 1.17.0. Without concat the speed is fine: when the columns are consistent and I can avoid pl.concat, performance is unchanged.

Log output

Slow run on 1.21.0; unfortunately the log looks identical to the run on 1.17.0:

UNION: union is run in parallel
POLARS PREFETCH_SIZE: 32
querying metadata of 1/1 files...
POLARS PREFETCH_SIZE: 32
POLARS PREFETCH_SIZE: 32
POLARS PREFETCH_SIZE: 32
querying metadata of 1/1 files...
POLARS PREFETCH_SIZE: 32
querying metadata of 1/1 files...
POLARS PREFETCH_SIZE: 32
querying metadata of 1/1 files...
POLARS PREFETCH_SIZE: 32
querying metadata of 1/1 files...
POLARS PREFETCH_SIZE: 32
querying metadata of 1/1 files...
POLARS PREFETCH_SIZE: 32
querying metadata of 1/1 files...
POLARS PREFETCH_SIZE: 32
POLARS PREFETCH_SIZE: 32
querying metadata of 1/1 files...
querying metadata of 1/1 files...
POLARS PREFETCH_SIZE: 32
querying metadata of 1/1 files...
POLARS PREFETCH_SIZE: 32
querying metadata of 1/1 files...
POLARS PREFETCH_SIZE: 32
querying metadata of 1/1 files...
POLARS PREFETCH_SIZE: 32
querying metadata of 1/1 files...
POLARS PREFETCH_SIZE: 32
querying metadata of 1/1 files...
querying metadata of 1/1 files...
[CloudOptions::build_azure]: Using credential provider Python(PythonCredentialProvider(PythonFunction(Py(0x7f5b0ac6b550))))
querying metadata of 1/1 files...
reading of 1/1 file...
[CloudOptions::build_azure]: Using credential provider Python(PythonCredentialProvider(PythonFunction(Py(0x7f5abb4c7210))))
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet row group must be read, statistics not sufficient for predicate.
POLARS ROW_GROUP PREFETCH_SIZE: 128
[FetchedCredentialsCache]: Call update_func: current_time = 1738077332, last_fetched_expiry = 0
reading of 1/1 file...
[CloudOptions::build_azure]: Using credential provider Python(PythonCredentialProvider(PythonFunction(Py(0x7f5ab81a9250))))
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet row group must be read, statistics not sufficient for predicate.
POLARS ROW_GROUP PREFETCH_SIZE: 128
[FetchedCredentialsCache]: Call update_func: current_time = 1738077332, last_fetched_expiry = 0
reading of 1/1 file...
[CloudOptions::build_azure]: Using credential provider Python(PythonCredentialProvider(PythonFunction(Py(0x7f5ab81a7850))))
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet row group must be read, statistics not sufficient for predicate.
POLARS ROW_GROUP PREFETCH_SIZE: 128
[FetchedCredentialsCache]: Call update_func: current_time = 1738077332, last_fetched_expiry = 0
build_object_store: clearing store cache (cache.len(): 8)
reading of 1/1 file...
[CloudOptions::build_azure]: Using credential provider Python(PythonCredentialProvider(PythonFunction(Py(0x7f5abb40f2d0))))
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet row group must be read, statistics not sufficient for predicate.
POLARS ROW_GROUP PREFETCH_SIZE: 128
[FetchedCredentialsCache]: Call update_func: current_time = 1738077332, last_fetched_expiry = 0
reading of 1/1 file...
[CloudOptions::build_azure]: Using credential provider Python(PythonCredentialProvider(PythonFunction(Py(0x7f5ab81a6c50))))
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet row group must be read, statistics not sufficient for predicate.
POLARS ROW_GROUP PREFETCH_SIZE: 128
reading of 1/1 file...
[CloudOptions::build_azure]: Using credential provider Python(PythonCredentialProvider(PythonFunction(Py(0x7f5abb34a390))))
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet row group must be read, statistics not sufficient for predicate.
POLARS ROW_GROUP PREFETCH_SIZE: 128
reading of 1/1 file...
[CloudOptions::build_azure]: Using credential provider Python(PythonCredentialProvider(PythonFunction(Py(0x7f5a124c4750))))
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet row group must be read, statistics not sufficient for predicate.
POLARS ROW_GROUP PREFETCH_SIZE: 128
reading of 1/1 file...
[CloudOptions::build_azure]: Using credential provider Python(PythonCredentialProvider(PythonFunction(Py(0x7f5a124c6310))))
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet row group must be read, statistics not sufficient for predicate.
POLARS ROW_GROUP PREFETCH_SIZE: 128
reading of 1/1 file...
[CloudOptions::build_azure]: Using credential provider Python(PythonCredentialProvider(PythonFunction(Py(0x7f5a124c6f50))))
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet row group must be read, statistics not sufficient for predicate.
POLARS ROW_GROUP PREFETCH_SIZE: 128
[FetchedCredentialsCache]: Finish update_func: new expiry = 1738163308 (in 85976 seconds)
reading of 1/1 file...
[CloudOptions::build_azure]: Using credential provider Python(PythonCredentialProvider(PythonFunction(Py(0x7f5ab81ec5d0))))
[FetchedCredentialsCache]: Using cached credentials: current_time = 1738077332, expiry = 1738163308 (in 85976 seconds)
[FetchedCredentialsCache]: Using cached credentials: current_time = 1738077332, expiry = 1738163308 (in 85976 seconds)
[FetchedCredentialsCache]: Call update_func: current_time = 1738077332, last_fetched_expiry = 0
[FetchedCredentialsCache]: Finish update_func: new expiry = 1738163308 (in 85976 seconds)
[FetchedCredentialsCache]: Using cached credentials: current_time = 1738077332, expiry = 1738163308 (in 85976 seconds)
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet row group must be read, statistics not sufficient for predicate.
[FetchedCredentialsCache]: Using cached credentials: current_time = 1738077332, expiry = 1738163308 (in 85976 seconds)
POLARS ROW_GROUP PREFETCH_SIZE: 128
[FetchedCredentialsCache]: Call update_func: current_time = 1738077332, last_fetched_expiry = 0
reading of 1/1 file...
[CloudOptions::build_azure]: Using credential provider Python(PythonCredentialProvider(PythonFunction(Py(0x7f5ab81cca90))))
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet row group must be read, statistics not sufficient for predicate.
POLARS ROW_GROUP PREFETCH_SIZE: 128
[FetchedCredentialsCache]: Finish update_func: new expiry = 1738163308 (in 85976 seconds)
[FetchedCredentialsCache]: Using cached credentials: current_time = 1738077332, expiry = 1738163308 (in 85976 seconds)
[FetchedCredentialsCache]: Using cached credentials: current_time = 1738077332, expiry = 1738163308 (in 85976 seconds)
[FetchedCredentialsCache]: Call update_func: current_time = 1738077332, last_fetched_expiry = 0
build_object_store: clearing store cache (cache.len(): 8)
reading of 1/1 file...
[CloudOptions::build_azure]: Using credential provider Python(PythonCredentialProvider(PythonFunction(Py(0x7f5abb35b710))))
[FetchedCredentialsCache]: Finish update_func: new expiry = 1738163308 (in 85976 seconds)
[FetchedCredentialsCache]: Using cached credentials: current_time = 1738077332, expiry = 1738163308 (in 85976 seconds)
[FetchedCredentialsCache]: Using cached credentials: current_time = 1738077332, expiry = 1738163308 (in 85976 seconds)
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet row group must be read, statistics not sufficient for predicate.
POLARS ROW_GROUP PREFETCH_SIZE: 128
reading of 1/1 file...
[CloudOptions::build_azure]: Using credential provider Python(PythonCredentialProvider(PythonFunction(Py(0x7f5abb4ab750))))
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet row group must be read, statistics not sufficient for predicate.
POLARS ROW_GROUP PREFETCH_SIZE: 128
reading of 1/1 file...
[CloudOptions::build_azure]: Using credential provider Python(PythonCredentialProvider(PythonFunction(Py(0x7f5a124c7190))))
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet row group must be read, statistics not sufficient for predicate.
POLARS ROW_GROUP PREFETCH_SIZE: 128
[FetchedCredentialsCache]: Finish update_func: new expiry = 1738163308 (in 85976 seconds)
[FetchedCredentialsCache]: Using cached credentials: current_time = 1738077332, expiry = 1738163308 (in 85976 seconds)
[FetchedCredentialsCache]: Using cached credentials: current_time = 1738077332, expiry = 1738163308 (in 85976 seconds)
reading of 1/1 file...
[CloudOptions::build_azure]: Using credential provider Python(PythonCredentialProvider(PythonFunction(Py(0x7f5ab81a8bd0))))
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet row group must be read, statistics not sufficient for predicate.
POLARS ROW_GROUP PREFETCH_SIZE: 128
[FetchedCredentialsCache]: Finish update_func: new expiry = 1738163308 (in 85976 seconds)
[FetchedCredentialsCache]: Using cached credentials: current_time = 1738077332, expiry = 1738163308 (in 85976 seconds)
[FetchedCredentialsCache]: Using cached credentials: current_time = 1738077332, expiry = 1738163308 (in 85976 seconds)
reading of 1/1 file...
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet row group must be read, statistics not sufficient for predicate.
POLARS ROW_GROUP PREFETCH_SIZE: 128
[FetchedCredentialsCache]: Finish update_func: new expiry = 1738163308 (in 85976 seconds)
[FetchedCredentialsCache]: Using cached credentials: current_time = 1738077332, expiry = 1738163308 (in 85976 seconds)
[FetchedCredentialsCache]: Using cached credentials: current_time = 1738077332, expiry = 1738163308 (in 85976 seconds)
[FetchedCredentialsCache]: Call update_func: current_time = 1738077332, last_fetched_expiry = 0
parquet scan with parallel = Columns
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet row group must be read, statistics not sufficient for predicate.
[FetchedCredentialsCache]: Call update_func: current_time = 1738077332, last_fetched_expiry = 0
parquet scan with parallel = Columns
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet row group must be read, statistics not sufficient for predicate.
[FetchedCredentialsCache]: Call update_func: current_time = 1738077332, last_fetched_expiry = 0
parquet scan with parallel = Columns
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
POLARS PREFETCH_SIZE: 32
querying metadata of 1/1 files...
parquet row group must be read, statistics not sufficient for predicate.
[CloudOptions::build_azure]: Using credential provider Python(PythonCredentialProvider(PythonFunction(Py(0x7f5a94261ad0))))
parquet scan with parallel = Columns
[FetchedCredentialsCache]: Call update_func: current_time = 1738077332, last_fetched_expiry = 0
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet row group must be read, statistics not sufficient for predicate.
POLARS PREFETCH_SIZE: 32
querying metadata of 1/1 files...
reading of 1/1 file...
[CloudOptions::build_azure]: Using credential provider Python(PythonCredentialProvider(PythonFunction(Py(0x7f5ab81a17d0))))
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet row group must be read, statistics not sufficient for predicate.
POLARS ROW_GROUP PREFETCH_SIZE: 128
POLARS PREFETCH_SIZE: 32
querying metadata of 1/1 files...
POLARS PREFETCH_SIZE: 32
reading of 1/1 file...
querying metadata of 1/1 files...
[CloudOptions::build_azure]: Using credential provider Python(PythonCredentialProvider(PythonFunction(Py(0x7f5ab81b9bd0))))
[FetchedCredentialsCache]: Finish update_func: new expiry = 1738163308 (in 85976 seconds)
[FetchedCredentialsCache]: Using cached credentials: current_time = 1738077332, expiry = 1738163308 (in 85976 seconds)
[FetchedCredentialsCache]: Using cached credentials: current_time = 1738077332, expiry = 1738163308 (in 85976 seconds)
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet row group must be read, statistics not sufficient for predicate.
POLARS ROW_GROUP PREFETCH_SIZE: 128
reading of 1/1 file...
[CloudOptions::build_azure]: Using credential provider Python(PythonCredentialProvider(PythonFunction(Py(0x7f5abb4d1110))))
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet row group must be read, statistics not sufficient for predicate.
POLARS ROW_GROUP PREFETCH_SIZE: 128
parquet scan with parallel = Columns
[FetchedCredentialsCache]: Call update_func: current_time = 1738077333, last_fetched_expiry = 0
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet row group must be read, statistics not sufficient for predicate.
build_object_store: clearing store cache (cache.len(): 8)
reading of 1/1 file...
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet row group must be read, statistics not sufficient for predicate.
POLARS ROW_GROUP PREFETCH_SIZE: 128
[FetchedCredentialsCache]: Finish update_func: new expiry = 1738163309 (in 85976 seconds)
[FetchedCredentialsCache]: Using cached credentials: current_time = 1738077333, expiry = 1738163309 (in 85976 seconds)
[FetchedCredentialsCache]: Using cached credentials: current_time = 1738077333, expiry = 1738163309 (in 85976 seconds)
parquet scan with parallel = Columns
[FetchedCredentialsCache]: Call update_func: current_time = 1738077333, last_fetched_expiry = 0
[FetchedCredentialsCache]: Finish update_func: new expiry = 1738163309 (in 85976 seconds)
[FetchedCredentialsCache]: Using cached credentials: current_time = 1738077333, expiry = 1738163309 (in 85976 seconds)
[FetchedCredentialsCache]: Using cached credentials: current_time = 1738077333, expiry = 1738163309 (in 85976 seconds)
parquet scan with parallel = Columns
parquet file must be read, statistics not sufficient for predicate.
[FetchedCredentialsCache]: Call update_func: current_time = 1738077333, last_fetched_expiry = 0
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet row group must be read, statistics not sufficient for predicate.
POLARS PREFETCH_SIZE: 32
querying metadata of 1/1 files...
[CloudOptions::build_azure]: Using credential provider Python(PythonCredentialProvider(PythonFunction(Py(0x7f5abb35c350))))
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet row group must be read, statistics not sufficient for predicate.
reading of 1/1 file...
[FetchedCredentialsCache]: Finish update_func: new expiry = 1738163308 (in 85975 seconds)
[FetchedCredentialsCache]: Using cached credentials: current_time = 1738077333, expiry = 1738163308 (in 85975 seconds)
[FetchedCredentialsCache]: Using cached credentials: current_time = 1738077333, expiry = 1738163308 (in 85975 seconds)
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet row group must be read, statistics not sufficient for predicate.
POLARS ROW_GROUP PREFETCH_SIZE: 128
POLARS PREFETCH_SIZE: 32
querying metadata of 1/1 files...
[CloudOptions::build_azure]: Using credential provider Python(PythonCredentialProvider(PythonFunction(Py(0x7f5ab819c910))))
POLARS PREFETCH_SIZE: 32
querying metadata of 1/1 files...
reading of 1/1 file...
[CloudOptions::build_azure]: Using credential provider Python(PythonCredentialProvider(PythonFunction(Py(0x7f5ab81914d0))))
[FetchedCredentialsCache]: Finish update_func: new expiry = 1738163308 (in 85975 seconds)
[FetchedCredentialsCache]: Using cached credentials: current_time = 1738077333, expiry = 1738163308 (in 85975 seconds)
[FetchedCredentialsCache]: Using cached credentials: current_time = 1738077333, expiry = 1738163308 (in 85975 seconds)
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet row group must be read, statistics not sufficient for predicate.
POLARS ROW_GROUP PREFETCH_SIZE: 128
reading of 1/1 file...
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet row group must be read, statistics not sufficient for predicate.
POLARS ROW_GROUP PREFETCH_SIZE: 128
[FetchedCredentialsCache]: Finish update_func: new expiry = 1738163308 (in 85975 seconds)
[FetchedCredentialsCache]: Using cached credentials: current_time = 1738077333, expiry = 1738163308 (in 85975 seconds)
[FetchedCredentialsCache]: Using cached credentials: current_time = 1738077333, expiry = 1738163308 (in 85975 seconds)
[FetchedCredentialsCache]: Finish update_func: new expiry = 1738163308 (in 85975 seconds)
[FetchedCredentialsCache]: Using cached credentials: current_time = 1738077333, expiry = 1738163308 (in 85975 seconds)
[FetchedCredentialsCache]: Using cached credentials: current_time = 1738077333, expiry = 1738163308 (in 85975 seconds)
parquet scan with parallel = Columns
[FetchedCredentialsCache]: Call update_func: current_time = 1738077333, last_fetched_expiry = 0
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet row group must be read, statistics not sufficient for predicate.
[FetchedCredentialsCache]: Call update_func: current_time = 1738077333, last_fetched_expiry = 0
parquet scan with parallel = Columns
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet row group must be read, statistics not sufficient for predicate.
[FetchedCredentialsCache]: Call update_func: current_time = 1738077333, last_fetched_expiry = 0
parquet scan with parallel = Columns
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet row group must be read, statistics not sufficient for predicate.
POLARS PREFETCH_SIZE: 32
querying metadata of 1/1 files...
[CloudOptions::build_azure]: Using credential provider Python(PythonCredentialProvider(PythonFunction(Py(0x7f5a9427a490))))
POLARS PREFETCH_SIZE: 32
querying metadata of 1/1 files...
POLARS PREFETCH_SIZE: 32
querying metadata of 1/1 files...
parquet scan with parallel = Columns
[FetchedCredentialsCache]: Call update_func: current_time = 1738077333, last_fetched_expiry = 0
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
reading of 1/1 file...
[CloudOptions::build_azure]: Using credential provider Python(PythonCredentialProvider(PythonFunction(Py(0x7f5ab8614590))))
parquet row group must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet row group must be read, statistics not sufficient for predicate.
POLARS ROW_GROUP PREFETCH_SIZE: 128
reading of 1/1 file...
[CloudOptions::build_azure]: Using credential provider Python(PythonCredentialProvider(PythonFunction(Py(0x7f5ab81ed5d0))))
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet row group must be read, statistics not sufficient for predicate.
POLARS ROW_GROUP PREFETCH_SIZE: 128
reading of 1/1 file...
[FetchedCredentialsCache]: Finish update_func: new expiry = 1738163308 (in 85975 seconds)
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
[FetchedCredentialsCache]: Using cached credentials: current_time = 1738077333, expiry = 1738163308 (in 85975 seconds)
parquet file must be read, statistics not sufficient for predicate.
[FetchedCredentialsCache]: Using cached credentials: current_time = 1738077333, expiry = 1738163308 (in 85975 seconds)
parquet row group must be read, statistics not sufficient for predicate.
POLARS ROW_GROUP PREFETCH_SIZE: 128
[FetchedCredentialsCache]: Finish update_func: new expiry = 1738163308 (in 85975 seconds)
[FetchedCredentialsCache]: Using cached credentials: current_time = 1738077333, expiry = 1738163308 (in 85975 seconds)
POLARS PREFETCH_SIZE: 32
querying metadata of 1/1 files...
[FetchedCredentialsCache]: Using cached credentials: current_time = 1738077333, expiry = 1738163308 (in 85975 seconds)
[CloudOptions::build_azure]: Using credential provider Python(PythonCredentialProvider(PythonFunction(Py(0x7f5ab81d3010))))
parquet scan with parallel = Columns
[FetchedCredentialsCache]: Call update_func: current_time = 1738077333, last_fetched_expiry = 0
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet scan with parallel = Columns
parquet row group must be read, statistics not sufficient for predicate.
[FetchedCredentialsCache]: Call update_func: current_time = 1738077333, last_fetched_expiry = 0
reading of 1/1 file...
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet row group must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet row group must be read, statistics not sufficient for predicate.
POLARS ROW_GROUP PREFETCH_SIZE: 128
POLARS PREFETCH_SIZE: 32
querying metadata of 1/1 files...
[CloudOptions::build_azure]: Using credential provider Python(PythonCredentialProvider(PythonFunction(Py(0x7f5a94248c10))))
[FetchedCredentialsCache]: Finish update_func: new expiry = 1738163308 (in 85975 seconds)
[FetchedCredentialsCache]: Using cached credentials: current_time = 1738077333, expiry = 1738163308 (in 85975 seconds)
[FetchedCredentialsCache]: Using cached credentials: current_time = 1738077333, expiry = 1738163308 (in 85975 seconds)
[FetchedCredentialsCache]: Finish update_func: new expiry = 1738163308 (in 85975 seconds)
[FetchedCredentialsCache]: Using cached credentials: current_time = 1738077333, expiry = 1738163308 (in 85975 seconds)
[FetchedCredentialsCache]: Using cached credentials: current_time = 1738077333, expiry = 1738163308 (in 85975 seconds)
build_object_store: clearing store cache (cache.len(): 8)
reading of 1/1 file...
POLARS PREFETCH_SIZE: 32
querying metadata of 1/1 files...
[CloudOptions::build_azure]: Using credential provider Python(PythonCredentialProvider(PythonFunction(Py(0x7f5ab81a03d0))))
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet row group must be read, statistics not sufficient for predicate.
POLARS ROW_GROUP PREFETCH_SIZE: 128
reading of 1/1 file...
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet row group must be read, statistics not sufficient for predicate.
POLARS ROW_GROUP PREFETCH_SIZE: 128
[FetchedCredentialsCache]: Call update_func: current_time = 1738077333, last_fetched_expiry = 0
parquet scan with parallel = Columns
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet row group must be read, statistics not sufficient for predicate.
[FetchedCredentialsCache]: Finish update_func: new expiry = 1738163308 (in 85975 seconds)
[FetchedCredentialsCache]: Using cached credentials: current_time = 1738077333, expiry = 1738163308 (in 85975 seconds)
[FetchedCredentialsCache]: Using cached credentials: current_time = 1738077333, expiry = 1738163308 (in 85975 seconds)
[FetchedCredentialsCache]: Finish update_func: new expiry = 1738163308 (in 85975 seconds)
[FetchedCredentialsCache]: Using cached credentials: current_time = 1738077333, expiry = 1738163308 (in 85975 seconds)
[FetchedCredentialsCache]: Using cached credentials: current_time = 1738077333, expiry = 1738163308 (in 85975 seconds)
POLARS PREFETCH_SIZE: 32
querying metadata of 1/1 files...
[CloudOptions::build_azure]: Using credential provider Python(PythonCredentialProvider(PythonFunction(Py(0x7f5abb44dcd0))))
reading of 1/1 file...
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet row group must be read, statistics not sufficient for predicate.
POLARS ROW_GROUP PREFETCH_SIZE: 128
[FetchedCredentialsCache]: Call update_func: current_time = 1738077333, last_fetched_expiry = 0
parquet scan with parallel = Columns
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet row group must be read, statistics not sufficient for predicate.
parquet scan with parallel = Columns
[FetchedCredentialsCache]: Call update_func: current_time = 1738077333, last_fetched_expiry = 0
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet row group must be read, statistics not sufficient for predicate.
POLARS PREFETCH_SIZE: 32
querying metadata of 1/1 files...
[CloudOptions::build_azure]: Using credential provider Python(PythonCredentialProvider(PythonFunction(Py(0x7f5a9426ea10))))
parquet scan with parallel = Columns
[FetchedCredentialsCache]: Call update_func: current_time = 1738077333, last_fetched_expiry = 0
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet row group must be read, statistics not sufficient for predicate.
reading of 1/1 file...
POLARS PREFETCH_SIZE: 32
querying metadata of 1/1 files...
[CloudOptions::build_azure]: Using credential provider Python(PythonCredentialProvider(PythonFunction(Py(0x7f5a94307850))))
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet row group must be read, statistics not sufficient for predicate.
POLARS ROW_GROUP PREFETCH_SIZE: 128
reading of 1/1 file...
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet row group must be read, statistics not sufficient for predicate.
POLARS ROW_GROUP PREFETCH_SIZE: 128
POLARS PREFETCH_SIZE: 32
querying metadata of 1/1 files...
[CloudOptions::build_azure]: Using credential provider Python(PythonCredentialProvider(PythonFunction(Py(0x7f5abb498590))))
reading of 1/1 file...
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet row group must be read, statistics not sufficient for predicate.
POLARS ROW_GROUP PREFETCH_SIZE: 128
[FetchedCredentialsCache]: Finish update_func: new expiry = 1738163308 (in 85974 seconds)
[FetchedCredentialsCache]: Using cached credentials: current_time = 1738077334, expiry = 1738163308 (in 85974 seconds)
[FetchedCredentialsCache]: Using cached credentials: current_time = 1738077334, expiry = 1738163308 (in 85974 seconds)
parquet scan with parallel = Columns
[FetchedCredentialsCache]: Call update_func: current_time = 1738077334, last_fetched_expiry = 0
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet row group must be read, statistics not sufficient for predicate.
POLARS PREFETCH_SIZE: 32
querying metadata of 1/1 files...
[CloudOptions::build_azure]: Using credential provider Python(PythonCredentialProvider(PythonFunction(Py(0x7f5ab81b9550))))
reading of 1/1 file...
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet row group must be read, statistics not sufficient for predicate.
POLARS ROW_GROUP PREFETCH_SIZE: 128
[FetchedCredentialsCache]: Finish update_func: new expiry = 1738163308 (in 85974 seconds)
[FetchedCredentialsCache]: Using cached credentials: current_time = 1738077334, expiry = 1738163308 (in 85974 seconds)
[FetchedCredentialsCache]: Using cached credentials: current_time = 1738077334, expiry = 1738163308 (in 85974 seconds)
parquet scan with parallel = Columns
[FetchedCredentialsCache]: Call update_func: current_time = 1738077334, last_fetched_expiry = 0
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet row group must be read, statistics not sufficient for predicate.
POLARS PREFETCH_SIZE: 32
querying metadata of 1/1 files...
[CloudOptions::build_azure]: Using credential provider Python(PythonCredentialProvider(PythonFunction(Py(0x7f5a94274ad0))))
[FetchedCredentialsCache]: Finish update_func: new expiry = 1738163308 (in 85974 seconds)
[FetchedCredentialsCache]: Using cached credentials: current_time = 1738077334, expiry = 1738163308 (in 85974 seconds)
[FetchedCredentialsCache]: Using cached credentials: current_time = 1738077334, expiry = 1738163308 (in 85974 seconds)
parquet scan with parallel = Columns
[FetchedCredentialsCache]: Call update_func: current_time = 1738077334, last_fetched_expiry = 0
reading of 1/1 file...
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet row group must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet row group must be read, statistics not sufficient for predicate.
POLARS ROW_GROUP PREFETCH_SIZE: 128
[FetchedCredentialsCache]: Finish update_func: new expiry = 1738163308 (in 85974 seconds)
[FetchedCredentialsCache]: Using cached credentials: current_time = 1738077334, expiry = 1738163308 (in 85974 seconds)
[FetchedCredentialsCache]: Using cached credentials: current_time = 1738077334, expiry = 1738163308 (in 85974 seconds)
[FetchedCredentialsCache]: Finish update_func: new expiry = 1738163308 (in 85974 seconds)
[FetchedCredentialsCache]: Using cached credentials: current_time = 1738077334, expiry = 1738163308 (in 85974 seconds)
[FetchedCredentialsCache]: Using cached credentials: current_time = 1738077334, expiry = 1738163308 (in 85974 seconds)
[FetchedCredentialsCache]: Finish update_func: new expiry = 1738163308 (in 85974 seconds)
[FetchedCredentialsCache]: Using cached credentials: current_time = 1738077334, expiry = 1738163308 (in 85974 seconds)
[FetchedCredentialsCache]: Using cached credentials: current_time = 1738077334, expiry = 1738163308 (in 85974 seconds)
POLARS PREFETCH_SIZE: 32
querying metadata of 1/1 files...
[CloudOptions::build_azure]: Using credential provider Python(PythonCredentialProvider(PythonFunction(Py(0x7f5abb4b1e50))))
build_object_store: clearing store cache (cache.len(): 8)
reading of 1/1 file...
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet row group must be read, statistics not sufficient for predicate.
POLARS ROW_GROUP PREFETCH_SIZE: 128
[FetchedCredentialsCache]: Finish update_func: new expiry = 1738163308 (in 85974 seconds)
[FetchedCredentialsCache]: Using cached credentials: current_time = 1738077334, expiry = 1738163308 (in 85974 seconds)
[FetchedCredentialsCache]: Using cached credentials: current_time = 1738077334, expiry = 1738163308 (in 85974 seconds)
parquet scan with parallel = Columns
[FetchedCredentialsCache]: Call update_func: current_time = 1738077334, last_fetched_expiry = 0
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet row group must be read, statistics not sufficient for predicate.
POLARS PREFETCH_SIZE: 32
querying metadata of 1/1 files...
[CloudOptions::build_azure]: Using credential provider Python(PythonCredentialProvider(PythonFunction(Py(0x7f5ab81bee90))))
parquet scan with parallel = Columns
[FetchedCredentialsCache]: Call update_func: current_time = 1738077334, last_fetched_expiry = 0
reading of 1/1 file...
parquet file must be read, statistics not sufficient for predicate.
[FetchedCredentialsCache]: Call update_func: current_time = 1738077334, last_fetched_expiry = 0
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet scan with parallel = Columns
parquet row group must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet row group must be read, statistics not sufficient for predicate.
POLARS ROW_GROUP PREFETCH_SIZE: 128
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet row group must be read, statistics not sufficient for predicate.
POLARS PREFETCH_SIZE: 32
querying metadata of 1/1 files...
[CloudOptions::build_azure]: Using credential provider Python(PythonCredentialProvider(PythonFunction(Py(0x7f5a9424f190))))
POLARS PREFETCH_SIZE: 32
querying metadata of 1/1 files...
[FetchedCredentialsCache]: Finish update_func: new expiry = 1738163308 (in 85974 seconds)
[FetchedCredentialsCache]: Using cached credentials: current_time = 1738077334, expiry = 1738163308 (in 85974 seconds)
[FetchedCredentialsCache]: Using cached credentials: current_time = 1738077334, expiry = 1738163308 (in 85974 seconds)
reading of 1/1 file...
[CloudOptions::build_azure]: Using credential provider Python(PythonCredentialProvider(PythonFunction(Py(0x7f5ab8192950))))
parquet scan with parallel = Columns
[FetchedCredentialsCache]: Call update_func: current_time = 1738077334, last_fetched_expiry = 0
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet row group must be read, statistics not sufficient for predicate.
POLARS ROW_GROUP PREFETCH_SIZE: 128
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet row group must be read, statistics not sufficient for predicate.
reading of 1/1 file...
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet row group must be read, statistics not sufficient for predicate.
POLARS ROW_GROUP PREFETCH_SIZE: 128
POLARS PREFETCH_SIZE: 32
querying metadata of 1/1 files...
[CloudOptions::build_azure]: Using credential provider Python(PythonCredentialProvider(PythonFunction(Py(0x7f5ab81d3b90))))
[FetchedCredentialsCache]: Finish update_func: new expiry = 1738163308 (in 85974 seconds)
[FetchedCredentialsCache]: Using cached credentials: current_time = 1738077334, expiry = 1738163308 (in 85974 seconds)
[FetchedCredentialsCache]: Using cached credentials: current_time = 1738077334, expiry = 1738163308 (in 85974 seconds)
reading of 1/1 file...
[FetchedCredentialsCache]: Finish update_func: new expiry = 1738163308 (in 85974 seconds)
[FetchedCredentialsCache]: Using cached credentials: current_time = 1738077334, expiry = 1738163308 (in 85974 seconds)
[FetchedCredentialsCache]: Using cached credentials: current_time = 1738077334, expiry = 1738163308 (in 85974 seconds)
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet row group must be read, statistics not sufficient for predicate.
POLARS ROW_GROUP PREFETCH_SIZE: 128
[FetchedCredentialsCache]: Finish update_func: new expiry = 1738163308 (in 85974 seconds)
[FetchedCredentialsCache]: Using cached credentials: current_time = 1738077334, expiry = 1738163308 (in 85974 seconds)
[FetchedCredentialsCache]: Using cached credentials: current_time = 1738077334, expiry = 1738163308 (in 85974 seconds)
parquet scan with parallel = Columns
[FetchedCredentialsCache]: Call update_func: current_time = 1738077334, last_fetched_expiry = 0
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet row group must be read, statistics not sufficient for predicate.
POLARS PREFETCH_SIZE: 32
querying metadata of 1/1 files...
[CloudOptions::build_azure]: Using credential provider Python(PythonCredentialProvider(PythonFunction(Py(0x7f5abb4dabd0))))
reading of 1/1 file...
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet row group must be read, statistics not sufficient for predicate.
POLARS ROW_GROUP PREFETCH_SIZE: 128
[FetchedCredentialsCache]: Finish update_func: new expiry = 1738163308 (in 85974 seconds)
[FetchedCredentialsCache]: Using cached credentials: current_time = 1738077334, expiry = 1738163308 (in 85974 seconds)
[FetchedCredentialsCache]: Using cached credentials: current_time = 1738077334, expiry = 1738163308 (in 85974 seconds)
[FetchedCredentialsCache]: Call update_func: current_time = 1738077334, last_fetched_expiry = 0
parquet scan with parallel = Columns
parquet file must be read, statistics not sufficient for predicate.


(truncated)

Issue description

Collect operations on diagonally concatenated parquet blobs from Azure storage are 5-9 times slower than on 1.17.0.

Expected behavior

The speed should be the same as in 1.17.0, before #20610 fixed the scan_parquet filter crash.

Installed versions

--------Version info---------
Polars:              1.21.0
Index type:          UInt32
Platform:            Linux-4.18.0-553.27.1.el8_10.x86_64-x86_64-with-glibc2.28
Python:              3.11.10 | packaged by conda-forge | (main, Oct 16 2024, 01:27:36) [GCC 13.3.0]
LTS CPU:             False

----Optional dependencies----
Azure CLI            2.66.0
adbc_driver_manager  <not installed>
altair               5.5.0
azure.identity       1.19.0
boto3                <not installed>
cloudpickle          3.1.0
connectorx           <not installed>
deltalake            <not installed>
fastexcel            <not installed>
fsspec               2024.10.0
gevent               <not installed>
google.auth          2.36.0
great_tables         <not installed>
matplotlib           3.9.2
numpy                1.26.4
openpyxl             3.1.5
pandas               2.1.4
pyarrow              16.1.0
pydantic             <not installed>
pyiceberg            <not installed>
sqlalchemy           1.4.53
torch                <not installed>
xlsx2csv             <not installed>
xlsxwriter           3.1.1

@astrowonk astrowonk added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Jan 28, 2025
@deanm0000 (Collaborator)

What if instead of doing a diagonal concat you prespecify the columns like:

df_test_lazy = pl.concat(
    [
        pl.scan_parquet(x, storage_options=storage_options).select(
            "apple", "banana", "carrot"
        )
        for x in [
            "az://some-blobs/col_order_test_1.parquet",
            "az://some-blobs/col_order_test_2.parquet",
        ]
    ]
)
