Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Combine non-atree domain payloads into atree payloads #3584

Open
Tracked by #6577
fxamacker opened this issue Sep 13, 2024 · 7 comments · Fixed by #3664
Open
Tracked by #6577

Combine non-atree domain payloads into atree payloads #3584

fxamacker opened this issue Sep 13, 2024 · 7 comments · Fixed by #3664

Comments

@fxamacker
Copy link
Member

fxamacker commented Sep 13, 2024

Why (Objective)

We can reduce memory, storage use, etc. by combining non-atree domain registers into 1 atree payload per account. We can further reduce execution state size by using atree register inlining feature added in Atree v0.8.0.

This improvement can be deployed via HCU and implements first on-the-fly migration to avoid causing further downtime.

The full impact on memory reduction (TBD) will be realized after all the accounts eventually migrate all their non-atree domain registers to a single combined atree payload per account.

NOTE: Memory reduction is only 1 benefit from reducing payload count.

How will we measure success (Key Results) ?

After all the non-atree domain registers are migrated to 1 atree map per account we expect:

  • reduce execution state size by 10-12% (~40GB as of Sept. 2024 mainnet state)
  • Less network bandwidth needed for chunk data packs (TBD)
  • reduce payload count by 20-30% (by combining payloads)
  • reduce mtrie nodes by 20-30% (by eliminating mtrie nodes)
  • reduce EN memory use by TBD (usually better than state size reduction)

Requires HCU.
If we use on-the-fly migration, then the full impact won't be realized until all accounts are migrated.

Requires "cleanup" migration for accounts that were not migrated on-the-fly.

Estimate

~3dev/weeks to get ready for deployment.

DACI

Role Assigned
Driver Technical: @fxamacker , EM: @j1010001
Approver @dete
Consulted @ramtinms, @zhangchiqing, @turbolent
Informed Flow engineering team, Product team: @vishalchangrani, Marketing: @SeanRobb

Details

New domains were added in Cadence 1.0 and domain register count increased to 150 million (was 80 million on Sept. 4 pre-spork). Each 8-byte domain register also requires ~2 mtrie nodes (~2x96 byte overhead).

Nearly 25% of total payloads are 8-byte domain registers.

These 8-byte "domain registers" refers to 8-byte account domain payloads. They are not atree payloads.

As of Sept 10, 2024, mainnet state had:

  • total payload count: 610 million
  • atree payload count: 324,800,993
  • domain register (aka payload) count: 150,127,970
  • etc.

Suggestion

Currently, all 150+ million account domain registers are not using atree.

I think we should replace non-atree domain registers with 1 atree register (atree map) per account. Currently, there are 35 million accounts.

Also, we can use atree inlining to inline small domain data. So instead of reducing payload count from 150 million to 35 million, we can reduce payload count by 174 million.

UPDATED Estimates (Nov. 2024)

Results based on full migration test run using Nov 1, 2024 mainnet state:

  • eliminate mtrie nodes: -440 million (-28.8%)
  • reduce payload count: -180 million (-28.8%)
  • reduce register reads (inlined domain data)
  • reduce state size by -43.6GB (-11.4% of state size)
  • reduce peak RAM use by TBD (usually better than state size reduction)
  • provide single point of entry for account data

We can reduce payload count by ~28% (instead of being limited to ~24%) due to onflow/atree#429.

Full migration test run passed storage health check, diff-states, and used these:

DRAFT Estimates (Sept. 2024)

🔍 Initial estimates

Based on preliminary estimates using Sept. 17, 2024 mainnet state, I think we can:

  • eliminate mtrie nodes: -425 million (-28.5%)
  • reduce payload count: -174 million (-28.5%)
  • reduce register reads (inlined domain data)
  • reduce state size by ~40GB (roughly 10-12% of state size)
  • reduce peak RAM use by TBD (usually better than state size)
  • provide single point of entry for account data

We can reduce payload count by 28% (instead of being limited to 24%) due to atree inlining.

Misc

This issue replaces my older issue #2867.

EDIT: add DRAFT estimates and mention why we can reduce 174 million payload count if there are only 150 million domain registers.

Effort Estimation

2dev/weeks to wrap-up testing and open PR for review. Will need @turbolent to review.

Request from marketing:

please write a small Flow forum post (author still TBD, this shouldn't necessarily default onto Faye - though Faye could be an author)

@fxamacker
Copy link
Member Author

I was asked about impact of this, so I used mainnet state (Sept. 17 checkpoint) to create DRAFT estimates:

  • eliminate mtrie nodes: -425 million (-28.5%)
  • reduce payload count: -174 million (-28.5%)
  • reduce register reads (inlined domain data)
  • reduce state size -40GB and peak RAM by more
  • provide single point of entry for account data

We can reduce payload count by 28% (instead of being limited to 24%) due to atree inlining.

Reduction in payload count and mtrie nodes isn't as big as atree v.0.8.0 (Sept. 4, 2024) but goals are similar.

In general, benefits from reducing payload counts and eliminating mtrie nodes include:

  • Slow down growth rate of execution state.
  • Reduce peak RAM use on execution nodes.
  • Reduce SSD storage use (e.g. checkpoint file sizes).
  • Speedup network upgrades (e.g. future migrations processing fewer payloads).
  • Future efficiency of other servers, components, databases (files/cache/index), etc. that benefit from fewer payloads.

Each coding change can't always eliminate 1 billion mtrie nodes, but smaller reductions can eventually become greater.

@fxamacker fxamacker changed the title Reduce register reads and domain registers, etc. (currently 150+ million domain registers and 300+ million mtrie nodes) Reduce register reads and domain registers (-28% payload count and -425 million mtrie nodes on mainnet) Sep 27, 2024
@bluesign
Copy link
Contributor

it seems like there is also 140M registers ( ~ 4 per account ) used by flow-go, ( account status, keys etc ) maybe it can be good to consider them also. Would be nice to migrate everything to atree. ( as flow-go now uses atree anyway for EVM stuff )

@fxamacker
Copy link
Member Author

it seems like there is also 140M registers ( ~ 4 per account ) used by flow-go, ( account status, keys etc ) maybe it can be good to consider them also.

Yes, definitely worth considering! 👍

@fxamacker
Copy link
Member Author

I replaced text promising 25% reduction of EN memory (that estimate isn't from me).

Payload count can be reduced by 20-30% by combining their content, so their content is not lost. However, we can safely eliminate 20-30% of mtrie nodes (96 bytes each).

Note

If we use on-the-fly migration, then the full impact won't be realized until all accounts are eventually migrated.

Unless we do a full migration, we won't immediately see the full benefits.

After all the non-atree domain registers are migrated, we should get:

  • reduced payload count by 20-30%
  • reduced mtrie count by 20-30%
  • reduced execution state size by roughly 40GB (given Sept. 2024 mainnet state size)
  • reduced EN memory use by TBD (usually more than state size reduction)

While at it, I listed other benefits (memory reduction is only 1 benefit). Also, more than EN can benefit when we reduce state size by reducing payload counts, etc.

@fxamacker fxamacker changed the title Reduce register reads and domain registers (-28% payload count and -425 million mtrie nodes on mainnet) Combine non-atree domain payloads into atree payloads Oct 9, 2024
@j1010001 j1010001 self-assigned this Oct 10, 2024
@fxamacker
Copy link
Member Author

On mainnet, average number of domains per account is just over 4.

For each new account having 4 domains, I think we would save around 1152 bytes.

@fxamacker
Copy link
Member Author

To test the impact of PR #3664, I created a full migration program and ran it on mainnet data (Nov 1, 2024 state).

The full migration program on a test vm produced expected results for reducing number of payloads and mtrie nodes by 28.7% each. 🎉

However, the plan is to deploy PR #3664 as HCU and use its on-the-fly migration feature (no downtime), so the full impact won't be seen until all accounts (including read-only and idle accounts) are eventually migrated.

@fxamacker
Copy link
Member Author

Progress Update (Nov 4-15)

On Nov 4 and Nov 13, we ran full migrations with mainnet data and test results were better than estimated. They also satisfy the "How do we measure success" criteria. All full migration test runs detected no problems 🎉 and Nov 13 results show:

  • -11.4% state size reduction
  • -28.8% reduction in payload count
  • -28.8% reduction in mtrie node count

On Nov 5, PR reviews started for this OKR.

On Nov 12, we merged all PRs 🎉 (to feature branch) for the approved work for using HCU to migrate accounts having write ops.

  • We merged PR 3664 (+5026 lines) to feature branch. This PR was opened on Oct 31, requires HCU, and migrates accounts with write ops as discussed in Sept. It includes MigrateAccount() and MigrateAccounts(). The caller can provide any criteria for migrating each account by passing a pred function to MigrateAccounts().

  • We replaced domain name string with integer (Bastian's suggestion) reduced state size by extra -1.2GB (-0.3%). Old tech debt was a blocker to this extra optimization, so it was fixed first (23 files changed in master branch).

The remaining work of refactoring to make code easier to understand was combined with new requirements, since replacing HCU with feature flag would also require refactoring work that was done.

New Requirements (HCU vs Feature Flag)

On Nov 5, we started discussing and exploring possibility of using feature flag to replace HCU requirement.

On Nov 12, Bastian opened a draft PR 3678 to explore replacing HCU requirement with feature flag, as well as other worthwhile changes discussed since Nov 5.

By Nov 15, most of the feedback for the new PR 3678 were already incorporated:

Many thanks to Cadence team for reviews and amazing work by Bastian 🙌 on these new PRs. Extra work and findings in this project should benefit other projects in the future that involve feature flags or on-the-fly migrations.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants