Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: read cloud creds for obstore from env #556

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

alekzvik
Copy link
Contributor

@alekzvik alekzvik commented Dec 5, 2024

Closes

Description

This is POC.

Right now we init object_store in format.rs.
We use the object_store::parse_url_opts function to parse the URL, extract store kind, bucket, etc from it and build a store from it.

I yanked this function to change one line in a macro from builder::new() to buidler::from_env()
Pros:

  • do not need to think about the env reusing the builder logic there
  • do not need to explicitly make a store object we interact with

Cons:

  • still create a new ObjectStore for each format.get_opts() call
  • reads env a lot

After that, I created a per-bucket cache for object stores. The idea is not to expose the ObjectStore object anywhere but to keep using a combination of item URL & options to get the one you need. A similar approach to dealing with object stores is used in polars. I tried to re-use any URL parsing from the object_store crate to reduce the maintenance overhead.

Alternatives I would be happy to discuss:

  • making object_store more of an explicit item inside the code base, separating it from format and maybe starting to pass it around (e.g. init in cli entrypoint, pass down to others)
  • adding some long-living pool of per-bucket/creds object stores that are instantiated and utilized explicitly.

Overall, I'd like to discuss if it is a good approach to the problem or if you have other possible directions to explore.

Checklist

Delete any checklist items that do not apply (e.g. if your change is minor, it may not require documentation updates).

  • Unit tests
  • Documentation, including doctests
  • Git history is linear
  • Commit messages are descriptive
  • (optional) Git commit messages follow conventional commits
  • Code is formatted (cargo fmt)
  • cargo test
  • Changes are added to the CHANGELOG

@alekzvik alekzvik marked this pull request as ready for review December 10, 2024 12:01
@alekzvik alekzvik requested a review from gadomski as a code owner December 10, 2024 12:01
@alekzvik
Copy link
Contributor Author

alekzvik commented Dec 10, 2024

The first commit adds just reading from the env, the second adds a cache for object_store objects for more efficient re-use.

I have not finished dealing with errors in this, but I want your initial opinion to see if I should proceed in this direction.

The test failures are unrelated and due to 403 HTTP error in validation (should we mock the schema query there to be less dependent on the internet?)

Copy link
Member

@gadomski gadomski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally looks good! I think I'd prefer a struct-specific cache instead of a global one. So maybe a ObjectStore structure that holds the cache?

@gadomski
Copy link
Member

The test failures are unrelated and due to 403 HTTP error in validation (should we mock the schema query there to be less dependent on the internet?)

Yeah, I've run into this before, I forget how I solved it initially (it's something to do with cloudflare being cranky about our requests for something that's a redirect). While I do want to test network-based validation (the STAC ecosystem does it a lot) I agree that network-based tests are fragile and bad. I'd be down for even just ignoring them for now and adding a tracking issue to fix.

@alekzvik
Copy link
Contributor Author

Generally looks good! I think I'd prefer a struct-specific cache instead of a global one. So maybe a ObjectStore structure that holds the cache?

I am a bit unclear on this.

The current idea is that the ObjectStore client remains "hidden" from everyone and the only way to create it is to call get_opts or put_opts. Those functions will parse the URL and options and either make one (and put it in the cache) or get one from the static cache. The main goal here is if, for example, I use a CLI translate function to read and then write a STAC object, and both input & output paths use the same bucket, I will reuse the same client for both IO operations.

If we put the cache in a struct, who will be responsible for creating it? How do I make it outlive the get_opts and put_opts functions so that they can both reuse the same store?

Note

just a reminder, that I started learning rust a few weeks ago, and might miss very basic things.

@gadomski
Copy link
Member

If we put the cache in a struct, who will be responsible for creating it? How do I make it outlive the get_opts and put_opts functions so that they can both reuse the same store?

Yup, it's a good call-out. I'd like to expand our API to include stac::ObjectStore or some-such, a structure that folks can use to get and put STAC things. get_opts would create a oneshot instation of ObjectStore to do its request and then destroy it.

This mirrors the pattern of stuff like reqwests, which has reqwests::get for a single-use, and a Client for multi-use.

@alekzvik
Copy link
Contributor Author

alekzvik commented Dec 12, 2024

Intro

ok, now I understand what you want.

Let me explain why I think global cache is a better approach.

Before implementing this, I looked at who uses the object_store crate and how. Two examples jumped out: datafusion & polars. Both have the same concept of a cache for object_store clients but approach it differently.

Datafusion

The main datafusion interaction pattern is to create a SessionContext and interact with it (example from docs). This is very similar to the pattern with a session that reqwest uses. The context holds a RuntimeEnv (source) that has a pool for object stores (source). Stores are identified by a base URL and you can and must register them in advance. By default. only file-based object_store is initialized. If you want to use cloud storage, you need to register the store explicitly.

Polars

Polars' main interaction pattern is different, you deal with data frames (example from docs). Dataframes can be read from wherever. So, they build object stores based on a URL (source) and use global cache (source).

stac-rs

I would argue that our use cases and interaction pattern are much closer to polars than datafusion. I would even go one step further, and say that it is more dynamic.

Note

Important things to understand is that object_store operates on a per-bucket level

Our main "object" that we interact with is STAC. There is a recommendation that STACs should be colocated, but no guarantees. Consider the following example. I am working with a public STAC of raster imagery, I apply a set of filters, produce a new collection, and save it to my cloud storage. My collection is located in one bucket, my items are in a different bucket. This is a valid STAC, that we must deal with. When I receive a collection, there is no way for me to construct an object_store in advance, as I do not know where all the links would lead.

Because of the dynamic nature of the STAC, it can potentially be located in more than one bucket (or even cloud provider). At the same time, STAC as a format can have huge amounts of files. So, I would argue we need to focus on both the dynamic aspect of reacting to new locations and optimization to be able to fetch multiple files effectively. I guess the main point is, that we never have a guarantee that one object store is enough, so we need to be prepared to work with multiple. And exposing this outside seems cumbersome.

Alternative API

Let's try to explore the alternative, where we expose the registry that lives in its own trait and has a more limited lifetime

let url = URL::parse("<some stac url>");
let registry = ObjectStoreRegistry::new();
registry.register_store(url);
let value = stac::io::get(url, registry);
for link in value.links() {
    ???
}

In this example, if we use the datafusion approach, we need to somewhere decide if we need to create a new store. It seems that exposing the ObjectStoreRegistry does not give any benefits but rather more cumbersome interactions.

Conclusion

Given all of the above, I chose the global cache approach, as it is flexible enough, the implementation is straightforward, and user interaction is streamlined.

Pros

  • You can always use options to adjust the underlying store.
let value = stac::io::get_opts(stac_url, vec![("aws_secret", "****")]);
  • If you do not need to think about object stores, you do not have to
  • You get benefits of reusing one client

Cons

  • This cache trickery is implicit. Implicit things are usually bad. But I would argue, that in this case, as a user of a library, I do not want to deal with multiple object stores and rather focus on STACs
  • Typing and explicit/implicit choice. While you still have a way to configure your object store, it is all based on a "vec of tuples of strings", and you do not benefit from type checks. The alternative, where you can construct your own object store seems more robust. We can consider adding another function that accepts an object store and does IO with it or adds it to the cache.

@gadomski
Copy link
Member

I don't hate that, and I agree that we're more likely to be used in a polars-style use case. Thanks for writing it out.

Since it is so tricky, my initial instinct is to make a new crate for it, so we don't have to break the core crate's semver when we (inevitably) change everything.

@alekzvik alekzvik force-pushed the feat-read-cloud-creds-from-env branch from 128c144 to bf909cd Compare December 21, 2024 12:47
@alekzvik alekzvik force-pushed the feat-read-cloud-creds-from-env branch from bf909cd to 3e2f3e7 Compare December 21, 2024 12:47
@alekzvik
Copy link
Contributor Author

Changes since last time:

  • Move the code to a separate crate
  • Add new crate to CI
  • use the crate in core under already used object-store feature

Please let me know if anything can be improved.

Oh, and also, at some point I merged the main in my branch, then I saw that merge commits are not allowed, so I rebased and force-pushed. At this point, the history is linear, but merging is still blocked. Is there anything I can do?

Copy link
Member

@gadomski gadomski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems reasonable to me, requested a couple of changes. However, doesn't seem to work:

cargo run --all-features -- translate https://raw.githubusercontent.com/radiantearth/stac-spec/refs/heads/master/examples/simple-item.json

This command works on main.

use url::Url;

// To avoid memory leaks, we clear the cache when it grows too big.
// The value does not have any meaning, other than polars use the same.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you link to the polars code that uses it? Credit where credit is due.

/// Parameter set to identify and cache an object store
#[derive(PartialEq, Eq, Hash, Debug)]
struct ObjectStoreIdentifier {
/// A base url to the bucket.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not always a bucket, right? I.e. Azure doesn't call them buckets.

{
let cache = OBJECT_STORE_CACHE.read().await;
if let Some(store) = (*cache).get(&object_store_id) {
return Ok((store.clone(), path));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add some tracing::debug statements here and Below when we write to (and sometimes clear) the cache?

Comment on lines +179 to +183
{
let cache = OBJECT_STORE_CACHE.read().await;
println!("{cache:#?}")
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
{
let cache = OBJECT_STORE_CACHE.read().await;
println!("{cache:#?}")
}

let (store, _path) = parse_url_opts(&url, options.clone()).await.unwrap();

let url2 = Url::parse("s3://other-bucket/item").unwrap();
// let options2: Vec<(String, String)> = vec![(String::from("some"), String::from("option"))];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// let options2: Vec<(String, String)> = vec![(String::from("some"), String::from("option"))];

//!
//! Features:
//! - cache used objects_stores based on url and options
//! - read cloud creadentials from env
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
//! - read cloud creadentials from env
//! - Read cloud credentials from your environment

//! Work with [ObjectStore](object_store::ObjectStore) in STAC.
//!
//! Features:
//! - cache used objects_stores based on url and options
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
//! - cache used objects_stores based on url and options
//! - Cache used object stores based on url and options

@gadomski
Copy link
Member

gadomski commented Dec 23, 2024

Also, as a note, nothing about this is specific to STAC, so we should keep an eye on either object-store or another package to provide this for us ... e.g. if someone could rip out the polars stuff to its own crate.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants