Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
AssetGraphView, TemporalContext, and AssetSlice
### Summary & Motivation With the additional of AMP, asset partitions, dynamic partitioning, and other related features, the complexity of our system has outstripped the ability of our abstractions to model it. A shallow indication of this our repeated threading of current time/evaluation time, storage_id/cursors, and dynamic_partitions_store up and down our stack. Another is that also have one off "caching" classes like `CachingStaleStatusResolver`, `CachingInstanceQueryer`, `CachingDataTimeResolver` and perhaps others I do not know about that present a wide range of capabilities inconsistently. Superficially it is annoying to have to thread time, storage_id, and dynamic_partitions_store around everywhere and hard to understand what class to use when interrogating the asset graph. This belies a more profound problem: Some of our capabilities respect current time; some do not. Some of our capabilities respect storage_id; some do not. That means the engineers do not know what reads are volatile with respect to time and underlying storage. It is also difficult to know what is to safe to cache or not. This means also that as an engineer navigates the state of the asset graph, it is difficult to know what operations are cheap to compute, versus potentially extremely expensive to compute. This PR introduces `AssetGraphView`, `AssetSlice` and `TemporalContext` to address this issue. Temporal Context: TemporalContext represents an effective time, used for business logic, and last_event_id which is used to identify that state of the event log at some point in time. Put another way, the value of a TemporalContext represents a point in time and a snapshot of the event log. Asset Graph View. It is a view of the asset graph and its state from the perspective of a specific temporal context. From the asset graph you can access asset slices, the main API for navigating the view of an asset graph. Asset Slice: Represents a set of partitions attached to a particular asset. By having `AssetSlice` contain a reference to `AssetGraphView` this enables a more elegant traversal of an asset graph's partition space than before. AssetSlice strives to be "partition-native". Very specifically, it deliberately does not have properties like `is_partitioned`. Instead they are just represented a slice with a single asset partition. Right now we do an inordinate (and unnecessary) special casing for unpartitioned assets. This will take some adjustment but will result in much cleaner code and and mental model. e.g. Before: ```python def parent_slice( context, asset_graph: AssetGraph, child_asset_subset: AssetSubset, parent_key: AssetKey) -> AssetSubset: return asset_graph.get_parent_asset_subset( dynamic_partitions_store=context.instance_queryer, parent_asset_key=parent_asset_key, child_asset_subset=child_asset_subset, current_time=context.evaluation_dt, ) ``` After: ```python def parent_slice(child_slice: AssetSlice, parent_key: AssetKey) -> AssetSlice: return child_slice.compute_parent_slice(parent_key) ``` We also have naming conventions to indicate the performance characteristics of methods: Naming conventions * Properties guaranteed to be fast. * Methods prefixed with `get_` do some work in-memory but not hugely expensive * Methods prefixed with `compute_` do potentially expensive work, like compute * partition mappings and query the instance. * Methods using "materialize" indicate that they fully materialize partition sets These can potentially be very expensive if the underlying partition set has an in-memory representation that involves large time windows. I.e. if the underlying PartitionsSubset in the ValidAssetSubset is a TimeWindowPartitionsSubset Usage of these methods should be avoided if possible if you are potentially dealing with slices with large time-based partition windows. This PR also addes two type aliases: ```python PartitionKey: TypeAlias = Optional[str] AssetPartition: TypeAlias = AssetKeyPartitionKey ``` I consider the rename of `AssetKeyPartitionKey` to `AssetPartition` merely an acknowledgement of the current ground truth in the code base, where nearly all local variables and method names refer to `asset_partition` because `asset_key_partition_key` is self evidently gross. The `PartitionKey` alias is perhaps more controversial. FAQ: * Why add `AssetSlice` and not reuse `ValidAssetSubset`? * `AssetSlice` is different as it contains a reference to the asset graph view. This makes it fundamentally different. It also allow for elegant traversal of the graph without having to thread a datetime and a reference to an instance or instance queryer everywhere, or having to convert between `ValidAssetSubset` and `AssetSubset`. * Second I think the term "Subset" is extremely confusing, given that the "Subset" can actually refer to a _complete_ set of partitions to an asset. * Introducing a new name allows for disambiguation via a single word. `slice` and `subset` as local variables is very clear. * Slice also seeks to be "partition-native" and treat partitioned and unpartitioned assets uniformly. I consider properties like `bool_value` and `subset_value` on `AssetSubset` to be quite gross, so this abstractions seeks to encapsulate that. The plan here is to introduce this abstraction then use it instead of direct use of various "resolver" and "queryer" classes throughout the code base. This will allow capabilities such as AMP, the backfill daemon, and state/outdated calculations int the web server to be more consistent and done with less code. Two concrete items this will help with immediately: * Consistent treatment of last_storage_id in AMP * Canonicalization of "outdated" logic in AMP and "stale" logic in dagster-webserver under a new single concept, likely named "unsynced" (but subject to discussion). Another objective is to allow user-defined AMP rules against a higher level API that is either this or something directly backed by this. ## How I Tested These Changes Included unit tests
- Loading branch information