
iter_batches support for read_database_uri with connectorx #21041

Open
GatienDoesStuff opened this issue Feb 1, 2025 · 3 comments
Labels
A-io-database (Area: reading/writing to databases), enhancement (New feature or an improvement of an existing feature)

Comments

@GatienDoesStuff

Description

Currently, streaming results in batches is only available through read_database(), which makes it possible to process data that would otherwise be too big to fit in system memory.
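
For context, the existing batched interface looks roughly like this (conn and the query are placeholders; with iter_batches=True, read_database() returns an iterator of DataFrames):

import polars as pl

# conn is assumed to be an open DB-API/SQLAlchemy connection
for batch in pl.read_database(
    query="SELECT * FROM big_table",
    connection=conn,
    iter_batches=True,
    batch_size=50_000,
):
    ...  # process each 50k-row DataFrame without materializing the full resultset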

connectorx supports streaming batches, though it doesn't seem to expose the feature in its Python API yet.

Polars implements read_database_uri() via connectorx's Python module. Would it be within scope for the project to implement it on top of connectorx's Rust API instead? This would allow users in memory-constrained scenarios to benefit from the speedups and better type inference that connectorx offers.

I'm currently using a PyO3 module that exposes the bare minimum I need for streaming; perhaps Polars could benefit from having the feature upstreamed?

GatienDoesStuff added the enhancement label Feb 1, 2025
@alexander-beedie (Collaborator) commented Feb 4, 2025

Would it be within scope for the project to implement it on top of connectorx's Rust API instead?

If they expose it up to the Python API, we can certainly take advantage (and I'd be happy to integrate it), but I don't see us taking on their Rust API (it's a large non-trivial dependency).

alexander-beedie added the A-io-database label Feb 4, 2025
@deanm0000 (Collaborator) commented

Is it much different than doing something like:

import polars as pl

def streaming_db(query, limit, uri):
    """Yield successive pages of `limit` rows by paginating with LIMIT/OFFSET."""
    offset = 0
    while True:
        batch = pl.read_database_uri(f"{query} LIMIT {limit} OFFSET {offset}", uri)
        if batch.is_empty():
            return  # end the generator (raising StopIteration is an error under PEP 479)
        yield batch
        offset += limit  # advance to the next page

Obviously, if you were passing a query that already used LIMIT/OFFSET, this would break.
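
For illustration, a hypothetical call (URI and query are placeholders):

for batch in streaming_db("SELECT * FROM big_table", limit=100_000, uri="postgresql://user:pass@host:5432/db"):
    ...  # each batch is an independent LIMIT/OFFSET page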

@alexander-beedie (Collaborator) commented

Is it much different than doing something like:

Yes; use of repeated limit/offset queries for pagination through a resultset is usually A Really Bad Idea™ :)

  • The resultset may no longer be internally consistent if a write occurs between consecutive queries.
  • Each advance further into the offset usually comes with a performance penalty, since all rows before the offset still have to be scanned on every query.

A "real" batched query can use a server-side cursor to mitigate both of these points; the query resultset will be internally consistent, and the query isn't repeatedly rerun.
