feat: read cloud creds for obstore from env #556
The first commit just adds reading from the env; the second adds a cache of object_store objects for more efficient reuse. I have not finished handling errors in this, but I want your initial opinion on whether I should proceed in this direction. The test failures are unrelated and are due to a 403 HTTP error in validation (should we mock the schema query there to be less dependent on the internet?).
Generally looks good! I think I'd prefer a struct-specific cache instead of a global one. So maybe an ObjectStore structure that holds the cache?
Yeah, I've run into this before, I forget how I solved it initially (it's something to do with Cloudflare being cranky about our requests for something that's a redirect). While I do want to test network-based validation (the STAC ecosystem does it a lot) I agree that network-based tests are fragile and bad. I'd be down for even just ignoring them for now and adding a tracking issue to fix.
I am a bit unclear on this. The current idea is that the
If we put the cache in a struct, who will be responsible for creating it? How do I make it outlive the
Note: just a reminder that I started learning Rust a few weeks ago and might miss very basic things.
Yup, it's a good call-out. I'd like to expand our API to include
This mirrors the pattern of stuff like reqwest, which has
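For readers following the discussion, here is a minimal sketch of what a struct-owned cache could look like. It is not the API proposed in this PR: `StoreRegistry` and `FakeStore` are hypothetical names, and `FakeStore` stands in for a real `Arc<dyn object_store::ObjectStore>` so the example compiles with the standard library only.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for a real object store handle.
#[derive(Debug)]
pub struct FakeStore {
    pub base_url: String,
}

/// Hypothetical registry that owns its own cache of stores.
/// Whoever creates the registry decides how long the cache lives.
#[derive(Debug, Default)]
pub struct StoreRegistry {
    stores: Mutex<HashMap<String, Arc<FakeStore>>>,
}

impl StoreRegistry {
    pub fn new() -> Self {
        Self::default()
    }

    /// Returns the cached store for `base_url`, building it on first use.
    pub fn get_or_create(&self, base_url: &str) -> Arc<FakeStore> {
        let mut stores = self.stores.lock().unwrap();
        let store = stores
            .entry(base_url.to_string())
            .or_insert_with(|| {
                Arc::new(FakeStore {
                    base_url: base_url.to_string(),
                })
            })
            .clone();
        store
    }
}
```

In this shape the caller creates the registry (for example once at the CLI entrypoint) and passes a reference down, so the cache lives exactly as long as the registry does. Two registries never share stores.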
Intro
OK, now I understand what you want. Let me explain why I think a global cache is a better approach. Before implementing this, I looked at who uses the
Datafusion
The main
Polars
Polars' main interaction pattern is different: you deal with data frames (example from docs). Dataframes can be read from wherever. So, they build object stores based on a URL (source) and use a global cache (source).
stac-rs
I would argue that our use cases and interaction pattern are much closer to
Note: An important thing to understand is that object_store operates on a per-bucket level.
Our main "object" that we interact with is STAC. There is a recommendation that STACs should be colocated, but no guarantees. Consider the following example. I am working with a public STAC of raster imagery, I apply a set of filters, produce a new collection, and save it to my cloud storage. My collection is located in one bucket, my items are in a different bucket. This is a valid STAC that we must deal with. When I receive a collection, there is no way for me to construct an
Because of the dynamic nature of STAC, it can potentially be located in more than one bucket (or even cloud provider). At the same time, STAC as a format can have huge numbers of files. So, I would argue we need to focus on both the dynamic aspect of reacting to new locations and the optimization needed to fetch multiple files effectively. I guess the main point is that we never have a guarantee that one object store is enough, so we need to be prepared to work with multiple. And exposing this outside seems cumbersome.
Alternative API
Let's try to explore the alternative, where we expose the registry that lives in its own trait and has a more limited lifetime
In this example, if we use the
Conclusion
Given all of the above, I chose the global cache approach, as it is flexible enough, the implementation is straightforward, and user interaction is streamlined.
Pros
Cons
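The global-cache approach argued for above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: it uses a synchronous `std::sync::RwLock` where the PR awaits an async lock, `FakeStore` is a hypothetical stand-in for a real object store, and `CACHE_LIMIT` is an arbitrary number chosen for the example.

```rust
use std::collections::HashMap;
use std::sync::{Arc, OnceLock, RwLock};

/// Hypothetical stand-in for `Arc<dyn object_store::ObjectStore>`.
#[derive(Debug)]
pub struct FakeStore {
    pub base_url: String,
}

/// Cache key: the store's base URL plus the options used to build it.
#[derive(PartialEq, Eq, Hash, Debug, Clone)]
struct StoreKey {
    base_url: String,
    options: Vec<(String, String)>,
}

/// Arbitrary limit for this sketch (an assumption, not the PR's value).
const CACHE_LIMIT: usize = 8;

/// Process-wide cache, created lazily on first access.
fn cache() -> &'static RwLock<HashMap<StoreKey, Arc<FakeStore>>> {
    static CACHE: OnceLock<RwLock<HashMap<StoreKey, Arc<FakeStore>>>> = OnceLock::new();
    CACHE.get_or_init(|| RwLock::new(HashMap::new()))
}

/// Returns a cached store for (base_url, options), building one on a miss.
pub fn get_or_create_store(base_url: &str, options: &[(String, String)]) -> Arc<FakeStore> {
    let key = StoreKey {
        base_url: base_url.to_string(),
        options: options.to_vec(),
    };
    // Fast path: a shared read lock is enough for a cache hit.
    {
        let guard = cache().read().unwrap();
        if let Some(store) = guard.get(&key) {
            return Arc::clone(store);
        }
    }
    let mut guard = cache().write().unwrap();
    // Another thread may have inserted the store between the two locks.
    if let Some(store) = guard.get(&key) {
        return Arc::clone(store);
    }
    // Crude eviction: drop everything once the cache grows too big.
    if guard.len() >= CACHE_LIMIT {
        guard.clear();
    }
    let store = Arc::new(FakeStore {
        base_url: base_url.to_string(),
    });
    guard.insert(key, Arc::clone(&store));
    store
}
```

Callers only ever pass a URL and options, so a STAC whose collection and items live in different buckets simply ends up with two cache entries, and no store object has to be threaded through the public API.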
I don't hate that, and I agree that we're more likely to be used in a polars-style use case. Thanks for writing it out. Since it is so tricky, my initial instinct is to make a new crate for it, so we don't have to break the core crate's semver when we (inevitably) change everything.
Changes since last time:
Please let me know if anything can be improved. Oh, and also, at some point I merged main into my branch, then saw that merge commits are not allowed, so I rebased and force-pushed. The history is now linear, but merging is still blocked. Is there anything I can do?
Seems reasonable to me, requested a couple of changes. However, doesn't seem to work:
cargo run --all-features -- translate https://raw.githubusercontent.com/radiantearth/stac-spec/refs/heads/master/examples/simple-item.json
This command works on main.
use url::Url;

// To avoid memory leaks, we clear the cache when it grows too big.
// The value does not have any meaning, other than polars use the same.
Can you link to the polars code that uses it? Credit where credit is due.
/// Parameter set to identify and cache an object store
#[derive(PartialEq, Eq, Hash, Debug)]
struct ObjectStoreIdentifier {
    /// A base url to the bucket.
It's not always a bucket, right? I.e. Azure doesn't call them buckets.
{
    let cache = OBJECT_STORE_CACHE.read().await;
    if let Some(store) = (*cache).get(&object_store_id) {
        return Ok((store.clone(), path));
Can you add some tracing::debug statements here and below when we write to (and sometimes clear) the cache?
{
    let cache = OBJECT_STORE_CACHE.read().await;
    println!("{cache:#?}")
}
- {
-     let cache = OBJECT_STORE_CACHE.read().await;
-     println!("{cache:#?}")
- }
let (store, _path) = parse_url_opts(&url, options.clone()).await.unwrap();

let url2 = Url::parse("s3://other-bucket/item").unwrap();
// let options2: Vec<(String, String)> = vec![(String::from("some"), String::from("option"))];
- // let options2: Vec<(String, String)> = vec![(String::from("some"), String::from("option"))];
//!
//! Features:
//! - cache used objects_stores based on url and options
//! - read cloud creadentials from env
- //! - read cloud creadentials from env
+ //! - Read cloud credentials from your environment
//! Work with [ObjectStore](object_store::ObjectStore) in STAC.
//!
//! Features:
//! - cache used objects_stores based on url and options
- //! - cache used objects_stores based on url and options
+ //! - Cache used object stores based on url and options
Also, as a note, nothing about this is specific to STAC, so we should keep an eye on either object-store or another package to provide this for us ... e.g. if someone could rip out the polars stuff to its own crate.
Closes
Description
This is a POC.
Right now we init object_store in format.rs. We use the object_store::parse_url_opts function to parse the URL, extract the store kind, bucket, etc. from it, and build a store from it. I yanked this function to change one line in a macro from builder::new() to builder::from_env().
Pros:
Cons:
ObjectStore for each format.get_opts() call
After that, I created a per-bucket cache for object stores. The idea is not to expose the ObjectStore object anywhere but to keep using a combination of item URL and options to get the one you need. A similar approach to dealing with object stores is used in polars. I tried to reuse the URL parsing from the object_store crate to reduce the maintenance overhead.
Alternatives I would be happy to discuss:
object_store more of an explicit item inside the code base, separating it from format and maybe starting to pass it around (e.g. init in the CLI entrypoint, pass down to others)
Overall, I'd like to discuss whether this is a good approach to the problem or if you have other possible directions to explore.
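
To illustrate the idea of seeding store configuration from the environment, here is a standard-library-only sketch. It does not reproduce what the `object_store` builders do internally. The function names `options_from_env` and `merge_options` are hypothetical, and both the key lowercasing and the rule that explicit options override environment values are assumptions made for the example.

```rust
use std::collections::BTreeMap;

/// Hypothetical helper: keep only variables that start with `prefix`
/// (for example "AWS_") and lowercase their names to use as option keys.
pub fn options_from_env<I>(vars: I, prefix: &str) -> Vec<(String, String)>
where
    I: IntoIterator<Item = (String, String)>,
{
    let mut options: Vec<(String, String)> = vars
        .into_iter()
        .filter(|(key, _)| key.starts_with(prefix))
        .map(|(key, value)| (key.to_ascii_lowercase(), value))
        .collect();
    // Sort so the result is stable and usable as part of a cache key.
    options.sort();
    options
}

/// Hypothetical helper: explicit options win over environment values.
pub fn merge_options(
    env: Vec<(String, String)>,
    explicit: &[(String, String)],
) -> Vec<(String, String)> {
    let mut merged: BTreeMap<String, String> = env.into_iter().collect();
    for (key, value) in explicit {
        merged.insert(key.clone(), value.clone());
    }
    merged.into_iter().collect()
}
```

In a real program the first argument would typically be `std::env::vars()`; taking an iterator instead keeps the helper easy to test without mutating the process environment.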
Checklist
Delete any checklist items that do not apply (e.g. if your change is minor, it may not require documentation updates).
cargo fmt
cargo test