
iter_batches support for read_database_uri with connectorx #21041

Open
GatienDoesStuff opened this issue Feb 1, 2025 · 3 comments
Labels
A-io-database (Area: reading/writing to databases), enhancement (New feature or an improvement of an existing feature)

Comments

@GatienDoesStuff

Description

Currently, streaming results in batches is only available through read_database(), which makes it possible to process data that would otherwise be too big to fit in system memory.
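
For context, the existing batched interface looks roughly like this (conn and the query are placeholders; with iter_batches=True, read_database() returns an iterator of DataFrames):

import polars as pl

# conn is assumed to be an open DB-API/SQLAlchemy connection
for batch in pl.read_database(
    query="SELECT * FROM big_table",
    connection=conn,
    iter_batches=True,
    batch_size=50_000,
):
    ...  # process each 50k-row DataFrame without materializing the full resultset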

connectorx supports streaming batches, though it doesn't seem to expose the feature in its Python API yet.

Polars implements read_database_uri() via connectorx's Python module. Would it be within scope for the project to implement it on top of connectorx's Rust API instead? This would allow users in memory-constrained scenarios to benefit from the speedups and better type inference that connectorx offers.

I'm currently using a PyO3 module that exposes the bare minimum I need for streaming; perhaps Polars could benefit from having the feature upstreamed?

GatienDoesStuff added the enhancement label Feb 1, 2025
@alexander-beedie (Collaborator) commented Feb 4, 2025

Would it be within scope for the project to implement it on top of connectorx's Rust API instead?

If they expose it up to the Python API, we can certainly take advantage (and I'd be happy to integrate it), but I don't see us taking on their Rust API (it's a large non-trivial dependency).

alexander-beedie added the A-io-database label Feb 4, 2025
@deanm0000 (Collaborator) commented

Is it much different than doing something like:

import polars as pl

def streaming_db(query, limit, uri):
    """Yield successive pages of `limit` rows by paginating with LIMIT/OFFSET."""
    offset = 0
    while True:
        batch = pl.read_database_uri(f"{query} LIMIT {limit} OFFSET {offset}", uri)
        if batch.is_empty():
            return  # end the generator (raising StopIteration is an error under PEP 479)
        yield batch
        offset += limit  # advance to the next page

Obviously, if you were passing a query that already used LIMIT/OFFSET, this would break.
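
For illustration, a hypothetical call (URI and query are placeholders):

for batch in streaming_db("SELECT * FROM big_table", limit=100_000, uri="postgresql://user:pass@host:5432/db"):
    ...  # each batch is an independent LIMIT/OFFSET page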

@alexander-beedie (Collaborator) commented

Is it much different than doing something like:

Yes; use of repeated limit/offset queries for pagination through a resultset is usually A Really Bad Idea™ :)

  • The resultset may no longer be internally consistent if a write occurs between consecutive queries.
  • Each advance further into the offset usually comes with a performance penalty, since all rows before the offset still have to be scanned on every query.

A "real" batched query can use a server-side cursor to mitigate both of these points; the query resultset will be internally consistent, and the query isn't repeatedly rerun.
