Allow to explicitly specify Azure TokenCredential in storage_options #20635

Closed
sugibuchi opened this issue Jan 9, 2025 · 3 comments · Fixed by #21047
Labels
A-api Area: changes to the public API A-io-cloud Area: reading/writing to cloud storage accepted Ready for implementation enhancement New feature or an improvement of an existing feature

Comments

@sugibuchi

Description

CredentialProviderAzure, introduced by #20384, uses DefaultAzureCredential by default.

DefaultAzureCredential works well in most cases. However, in some corner cases, we must use a more specialised TokenCredential or a custom TokenCredential.

To make the Entra ID-based authentication in Polars more flexible, I propose to allow users to optionally specify an Azure TokenCredential object in storage_options.

import polars as pl

df = pl.read_parquet(source_abfs_url, storage_options={"credential": MyCustomCredential()})

The specified TokenCredential should be propagated to CredentialProviderAzure.

@sugibuchi sugibuchi added the enhancement New feature or an improvement of an existing feature label Jan 9, 2025
@nameexhaustion
Collaborator

Could you check whether you can read using a custom credential_provider? [1] Something like:

def provider():
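    # get_token() needs at least one scope; this is the Azure Storage scope.
    token = MyCustomCredential().get_token("https://storage.azure.com/.default")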

    return {
        "bearer_token": token.token,
    }, token.expires_on

pl.read_parquet(..., credential_provider=provider)

Footnotes

  1. https://docs.pola.rs/user-guide/io/cloud-storage/#using-one-of-the-available-credentialprovider-utility-classes

@nameexhaustion nameexhaustion added the A-io-cloud Area: reading/writing to cloud storage label Jan 10, 2025
@nameexhaustion nameexhaustion self-assigned this Jan 10, 2025
@sugibuchi
Author

Yes, it works. However, it would be better to let CredentialProviderAzure optionally accept a TokenCredential object.

class CredentialProviderAzure(CredentialProvider):
    def __init__(
        self,
        *,
        scopes: list[str] | None = None,
        tenant_id: str | None = None,
        credential: TokenCredential | None = None,
        _verbose: bool = False,
    ) -> None:
        ...
        self.credential = credential or importlib.import_module("azure.identity").__dict__[
            "DefaultAzureCredential"
        ]()

credential_provider = CredentialProviderAzure(credential=MyCustomCredential())

df1 = pl.read_parquet(src1, credential_provider=credential_provider)
df2 = pl.read_parquet(src2, credential_provider=credential_provider)
...

Easy to use

The suggested approach requires understanding both the Polars CredentialProviderFunctionReturn API and the Azure TokenCredential API. We can hide this complexity in the CredentialProviderAzure class.
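
To make that complexity concrete, here is a minimal sketch of the adapter a user currently has to write by hand. It is not Polars' implementation: TokenCredentialAdapter, FakeCredential and FakeToken are hypothetical names, FakeCredential only stands in for an Azure TokenCredential so the snippet runs without azure-identity installed, and the default scope string is an assumption.

```python
from __future__ import annotations

import time
from dataclasses import dataclass


@dataclass
class FakeToken:
    """Mimics the two access-token fields used here: token and expires_on."""

    token: str
    expires_on: int  # Unix timestamp in seconds


class FakeCredential:
    """Hypothetical stand-in for an Azure TokenCredential."""

    def get_token(self, *scopes: str) -> FakeToken:
        if not scopes:
            raise ValueError("at least one scope is required")
        return FakeToken(token="fake-token", expires_on=int(time.time()) + 3600)


class TokenCredentialAdapter:
    """Callable adapting a TokenCredential-like object to the
    (credentials dict, expiry) tuple a credential provider returns."""

    def __init__(self, credential, scopes=("https://storage.azure.com/.default",)):
        self.credential = credential  # kept alive so its internal state is reused
        self.scopes = tuple(scopes)

    def __call__(self) -> tuple[dict[str, str], int | None]:
        token = self.credential.get_token(*self.scopes)
        return {"bearer_token": token.token}, token.expires_on


provider = TokenCredentialAdapter(FakeCredential())
creds, expiry = provider()
print(creds, expiry)
```

With a real credential in place of FakeCredential, an object like this would presumably be passed as credential_provider=provider, the same way the function-based provider above is. The proposal amounts to moving this adapter into CredentialProviderAzure so users never write it.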

TokenCredential is a stateful object

TokenCredential is not a stateless function. It is an object with internal state, such as refresh tokens. Therefore, wrapping it in an object like CredentialProviderAzure makes more sense.

from azure.identity import InteractiveBrowserCredential

def provider():
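    # A new credential is created on every call, so no cached state survives.
    token = InteractiveBrowserCredential().get_token("https://storage.azure.com/.default")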

    return {
        "bearer_token": token.token,
    }, token.expires_on

df1 = pl.read_parquet(src1, credential_provider=provider)
df2 = pl.read_parquet(src2, credential_provider=provider)
...

This code opens a web browser for authentication every time Polars reads a Parquet file, because each call to provider() creates a new InteractiveBrowserCredential.

credential_provider = CredentialProviderAzure(credential=InteractiveBrowserCredential())
df1 = pl.read_parquet(src1, credential_provider=credential_provider)
df2 = pl.read_parquet(src2, credential_provider=credential_provider)
...

This code opens a web browser only once. An InteractiveBrowserCredential wrapped in CredentialProviderAzure caches tokens and refreshes them automatically, without re-authenticating in the browser.
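
The difference between the two patterns can be reproduced without a browser. The sketch below uses a hypothetical CountingCredential as a stand-in for InteractiveBrowserCredential: each instance counts one simulated interactive login on first use and serves later requests from its own in-memory cache. Nothing in it is Azure or Polars API, and the scope string is an assumption.

```python
import time
from types import SimpleNamespace

SCOPE = "https://storage.azure.com/.default"  # assumed storage scope


class CountingCredential:
    """Hypothetical stand-in for a stateful, interactive credential."""

    logins = 0  # class-wide count of simulated interactive logins

    def __init__(self):
        self._cached = None

    def get_token(self, *scopes):
        if self._cached is None:
            # First use of this instance: simulate the interactive login.
            CountingCredential.logins += 1
            self._cached = SimpleNamespace(
                token="fake-token", expires_on=int(time.time()) + 3600
            )
        return self._cached


def fresh_provider():
    # New credential per call: its cached state is discarded each time.
    token = CountingCredential().get_token(SCOPE)
    return {"bearer_token": token.token}, token.expires_on


shared = CountingCredential()


def shared_provider():
    # One long-lived credential: later calls reuse its cached state.
    token = shared.get_token(SCOPE)
    return {"bearer_token": token.token}, token.expires_on


for _ in range(3):
    fresh_provider()
logins_fresh = CountingCredential.logins

CountingCredential.logins = 0
for _ in range(3):
    shared_provider()
logins_shared = CountingCredential.logins

print(logins_fresh, logins_shared)  # prints: 3 1
```

Holding the credential in a module-level variable or closure works with the existing function-based API, but it leaves lifetime management to the caller, which is the argument for keeping the credential inside a provider object.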

@daviewales

I've been bumping into this here:
#18931 (comment)

Pandas (and fsspec) solve this by allowing one to directly pass the Azure Identity credential object into the storage_options dict as follows:

from azure.identity import DefaultAzureCredential

credential = DefaultAzureCredential()
storage_options = {"credential": credential}

However, Polars requires the storage_options dict to contain only string values.
As the credential object interface is likely platform-specific, I'm guessing this is why Polars requires a credential_provider function to adapt various providers to a common spec.

A more ergonomic approach might be to allow passing the credential object directly, then automatically wrap it in the appropriate credential_provider based on what kind of credential it is. (If Azure credential, wrap with azure_credential_provider, if AWS credential, wrap with aws_credential_provider, etc.)
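
A rough sketch of that dispatch idea, under stated assumptions: wrap_credential is a hypothetical helper (not part of Polars), the Azure case is detected by duck typing on get_token, FakeAzureCredential is a stand-in so the snippet runs without azure-identity, and the scope string is assumed. Other platforms would need their own branches, which are left out here.

```python
import time
from types import SimpleNamespace

AZURE_STORAGE_SCOPE = "https://storage.azure.com/.default"  # assumed scope


def wrap_credential(obj):
    """Hypothetical helper: turn a credential object into a provider callable."""
    if hasattr(obj, "get_token"):
        # Azure-style TokenCredential (duck-typed): adapt it to the
        # (credentials dict, expiry) shape. The object is captured by the
        # closure, so its cached state is reused across calls.
        def azure_provider():
            token = obj.get_token(AZURE_STORAGE_SCOPE)
            return {"bearer_token": token.token}, token.expires_on

        return azure_provider
    if callable(obj):
        # Already a provider function: pass it through unchanged.
        return obj
    raise TypeError(f"unsupported credential type: {type(obj).__name__}")


class FakeAzureCredential:
    """Stand-in for an Azure TokenCredential (illustration only)."""

    def get_token(self, *scopes):
        return SimpleNamespace(token="fake-token", expires_on=int(time.time()) + 3600)


provider = wrap_credential(FakeAzureCredential())
creds, expiry = provider()
print(creds, expiry)
```

Duck typing keeps the helper free of a hard dependency on any cloud SDK; a real implementation might prefer explicit isinstance checks against lazily imported SDK types.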
