Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Destination S3V2: Skip full metadata search when sync mode is append #52094

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

johnny-schmidt
Copy link
Contributor

@johnny-schmidt johnny-schmidt commented Jan 22, 2025

What

High level: skip metadata search when we're not truncating

Definition of "not truncating":

  • import type = OVERWRITE (legacy def)
  • OR minGenerationId = generationId (actual def)

How

The big complication is that we use the initial search both to

  • get a list of stuff to delete after the sync
  • build a cache of key -> count we can use to make -1, -2 suffixes to avoid overwrite

So if we skip the search we still have to provide uniqueness guards:

If append-only:

  • paths aren't listed, metadata isn't searched, AND unique key guard isn't initialized
  • SO when we check for key uniqueness, we fall back to making a single list call and caching the results

If truncate:

  • we keep the existing behavior

Also

I dropped some unused constants and one unused method & promoted the unique key count to a mutable map (it should be even if behind a lock, because it's a suspend function)

@johnny-schmidt johnny-schmidt requested a review from a team as a code owner January 22, 2025 21:41
Copy link

vercel bot commented Jan 22, 2025

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
airbyte-docs ✅ Ready (Inspect) Visit Preview 💬 Add feedback Jan 23, 2025 11:07pm

@octavia-squidington-iii octavia-squidington-iii added the CDK Connector Development Kit label Jan 22, 2025
@johnny-schmidt johnny-schmidt requested review from a team and removed request for a team January 22, 2025 22:15
@johnny-schmidt johnny-schmidt force-pushed the jschmidt/s3v2/skip-metadata-on-append branch from 6fc0d9b to 56b88b1 Compare January 23, 2025 19:02
@@ -183,6 +219,15 @@ class ObjectStorageFallbackPersister(
// Add a suffix matching an OPTIONAL -[0-9]+ ordinal
val matcher =
pathFactory.getPathMatcher(stream, suffixPattern = OPTIONAL_ORDINAL_SUFFIX_PATTERN)

if (!(stream.importType == Overwrite || stream.minimumGenerationId > 0)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a comment explaining why we we can do this optimization in each of these cases?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should definitely include something about these calls being particularly pathological for the append case due to the number of objects we page through increasing over time.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also we should consider making this boolean logic an extension function to encapsulate the goofy booleaness.

@@ -126,18 +133,47 @@ class ObjectStorageDestinationState(
}

/** Used to guarantee the uniqueness of a key */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add comment describing how we ensure uniqueness by looking up the ordinal postfix by searching s3 and incrementing here.

@johnny-schmidt johnny-schmidt force-pushed the jschmidt/s3v2/skip-metadata-on-append branch from 56b88b1 to c2cb1b0 Compare January 23, 2025 23:01
runTest {
every { stream.importType } returns Overwrite
every { stream.generationId } returns 1L
every { stream.minimumGenerationId } returns 1L
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably test the two conditions (minimumGenerationId > 0 and importType == Overwrite) separately — OR hide them behind a boolean method of some sort which we unit test separately.

@@ -32,7 +32,8 @@ class ObjectStorageDestinationStateTest {
)

companion object {
val stream1 = MockDestinationCatalogFactory.stream1
private val streamSource = MockDestinationCatalogFactory.stream1
val stream1 = streamSource.copy(minimumGenerationId = streamSource.generationId)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why was this necessary?

assertEquals("cat-1", state.ensureUnique("cat"))
assertEquals("turtle-101", state.ensureUnique("turtle"))
assertEquals("turtle-102", state.ensureUnique("turtle"))
assertEquals("spider", state.ensureUnique("spider"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should add some test cases to validate keys that contain numbers—e.g. the cat-1-1 scenario.

As an aside, I really like parameterized tests for testing over a large corpus of inputs. Could be useful here.

Copy link
Contributor

@tryangul tryangul left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚢 once confident

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CDK Connector Development Kit
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants