Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Try including the generated grammar in-tree #2141

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

jneem
Copy link
Member

@jneem jneem commented Jan 22, 2025

This is an attempt at #2123, by checking in a version of the generated grammar. The checked in file is 22MB (!), but I checked that the compressed cargo package size is more reasonable (1.2 MB). crates.io has a limit of 10MB per crate.

Stepping back, the main goal here is to avoid making users of the published nickel-lang-core generate the grammar when building. But we also need to make sure that it isn't too painful to modify the grammar. I think there are two basic approaches

  1. check in the generated grammar, and require some workarounds when modifying it
  2. generate the grammar at publish time, so it's in the published crate but not in the git repo.

Option 2 seems nicer in principle, but I'm worried that because publishing is infrequent and mostly automated, it will be prone to undetected breakage. So this PR does option 1, but I think the mechanism in the build script would work with either one; option 2 would just need some more work on the release script.

@jneem jneem requested a review from yannham January 22, 2025 16:16
@yannham
Copy link
Member

yannham commented Jan 22, 2025

I haven't reviewed, but I want to point out one of LALRPOP co-maintainer's interesting remark: https://github.com/orgs/lalrpop/discussions/1031#discussioncomment-11908105. There's a checksum of the lalrpop grammar file included somewhere, so maybe we can just include a check in the build script so that if the file has changed we rebuild it automatically?

@jneem jneem force-pushed the check-in-the-grammar branch from f4cd082 to 36ce7a7 Compare January 23, 2025 06:50
@jneem jneem force-pushed the check-in-the-grammar branch from 36ce7a7 to c1a867e Compare January 23, 2025 07:29
@jneem
Copy link
Member Author

jneem commented Jan 23, 2025

Ah, neat! I beefed up the build script a little so that it checks the hash before deciding what to do, and I also added a pre-commit check that will fail if you update the grammar and forget to copy the new generated file into the source tree.

enable = true;
description = "Check that the lalrpop output is up-to-date.";
files = "\\.lalrpop";
entry = "${check-fresh-grammar}/bin/check-fresh-grammar";
Copy link
Member

@yannham yannham Jan 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a detail, but would a lib.getExe or the like work here, so that we don't even have to rely on the actual file name Nix generates?

Copy link
Member

@aspiwack aspiwack left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before we merge.

Is checking in a 22MB file really something we want to do? This is irreversible. Every checkout of Nix from then on will have an extra 22MB to download. Ever time the grammar is change will add another 20+MB?

My intuition is that no, we absolutely don't want to do that, this is awful. Of course, I'm no longer the boss here, the person you have to convince that this is reasonable is @aherrmann . But let's be quite prudent about it. In fact, let me give you a little red cross just as a way to draw attention to the issue. This shouldn't be done lightly.

@aherrmann
Copy link
Member

Is checking in a 22MB file really something we want to do? This is irreversible. Every checkout of Nix from then on will have an extra 22MB to download.

That sounds like a valid concern to raise. A fresh checkout of Nickel today fetches about 30 MiB. So, this is almost doubling the repo size. Completely anecdotal, but I observed 15.00 MiB/s for fetching objects, so naively this adds >1s fresh checkout time for every full replacement of the grammar.

Ever time the grammar is change will add another 20+MB?

If changes are incremental then the impact of changes could be much smaller. But, unless LALRPOP makes some guarantees around that, its behavior could change any time and we could end up with 20+MiB per grammar change.

generate the grammar at publish time, so it's in the published crate but not in the git repo.

This sounds like a reasonable middle ground.

Option 2 seems nicer in principle, but I'm worried that because publishing is infrequent and mostly automated, it will be prone to undetected breakage.

Could the relevant step be tested on CI?

@yannham
Copy link
Member

yannham commented Jan 24, 2025

Is checking in a 22MB file really something we want to do? This is irreversible. Every checkout of Nix from then on will have an extra 22MB to download. Ever time the grammar is change will add another 20+MB?

It's a good question, but I'm not entirely convinced by the argument:

My intuition is that no, we absolutely don't want to do that, this is awful

Currently anything that uses Nickel as a Rust dependency is spending much more time compiling the grammar than it would take to download those extra 22MB (by the way, it's probably less traffic, since git does apply some compression, right?). The grammar needs to be generated anyway for a build, it has to land somewhere, so we don't save much space either. Maybe it can get collected currently though, while not so much if it has to be there in the source derivation?

To give some concrete numbers around #2123, simply using the same settings on Topiary than for Nickel shaved off 33%, that is 3min, of build time for Topiary on my machine. I suspect that the LALRPOP grammar generation is still taking a non-trivial amount of the rest of the build, so who knows how much we would win with the pre-checked file. As we advertise at least in part Nickel as an easy to embed configuration language into other Rust projects, adding a 4 to 5 minutes penalty on a fresh build isn't so great.

Though you have a good point that this is irreversible, for anyone that clones the repo entirely. I'm not arguing for or against any choice at this point, but I think that "checking a 22MB file is bad" is a bit abstract for me at the moment, while the impact of the full unoptimized LALRPOP build is concrete. How much do you think it's going to impact things in practice? @aherrmann gives a bit of perspective here, as this is almost doubling the size of the repo, which is indeed substantial (although it's not the same for a small repo with an almost instantaneous check-out versus doubling the size of, say, Nixpkgs).

I know it probably wasn't designed for text file, but could Git LFS be of some help here? Letting users actually downloading the grammar if they want, and the build script would act accordingly (if the file is downloaded, use it, otherwise scrap it).

Otherwise maybe option 2 is the way to go. As Andreas mentions, it shouldn't be too hard to check at release time that we ship something with a pre-generated grammar file.

@jneem
Copy link
Member Author

jneem commented Jan 24, 2025

I have an implementation of approach 2 up at #2145, just for comparison.

Could the relevant step be tested on CI?

Certain things could, but the main release script is pretty hard to test because it does things like push to the stable branch here on github.

it shouldn't be too hard to check at release time that we ship something with a pre-generated grammar file.

It's a little bit more complicated than that, because the build script will have different code paths for the published cargo package containing the pre-generated grammar vs. building from git. But I think it's manageable. I'm not super confident that #2145 will work out of the box, but probably after a couple of releases it will be ok...

@aherrmann
Copy link
Member

aherrmann commented Jan 24, 2025

Currently anything that uses Nickel as a Rust dependency is spending much more time compiling the grammar than it would take to download those extra 22MB

My concern is less around the build side and more about the contributing and source control side. With the compression mentioned above, the impact on the release tarball doesn't sound too bad. And as I understand it approach 2 would look the same to library users, correct?

it's probably less traffic, since git does apply some compression

True, if that's similar to what cargo reportedly achieves, then it's much less

How much do you think it's going to impact things in practice?

Answering these questions should help to understand that:

  • Will the generated grammar file change?
  • How frequently will it change?
  • How large will the diff be?

GitHub documents recommendations for file and repository sizes. How many grammar or LALRPOP changes are we away from hitting those limits?

A question for the library use-case: What happens if the user's build resolves a different LALRPOP version? Could that change the validity of the checked in file? (Apologies if that's a stupid question, but I'm not terribly familiar with LALRPOP, and I didn't see the question addressed in the discussion.)

@yannham
Copy link
Member

yannham commented Jan 24, 2025

My concern is less around the build side and more about the contributing and source control side. With the compression mentioned above, the impact on the release tarball doesn't sound too bad. And as I understand it approach 2 would look the same to library users, correct?

I think the point stands that, even for a contributor, this should speeds up the initial build quite a lot, which I suppose you need to do in any situation (if only to get the rust-analyzer LSP working, you need to run the build script). It's a classical space vs time trade-off.

Answering these questions should help to understand that:

Will the generated grammar file change?

Yes, it can change, when we modify the original LALRPOP grammar definition. We'll need to regenerate the grammar file and check it in.

How frequently will it change?

It's supposed to change less and less, as the language matures. Although there are always bug fixes and the possibility of new constructs, touching the grammar is by nature often a noticeable change, as we change the surface syntax of Nickel. That's the theory, but if we look at the numbers (commits that touched the grammar in Mar-Dec 2024), there's still quite a bit of activity:

  • March, 29
  • May 2, 6 and 13
  • June 7, 10, 13, 24
  • July 3, 8, 17, 25
  • Sept 12, 18
  • Oct 16
  • Nov 1, 18, 27
  • Dec 6

So, based on this year, it did change rather often, for a grammar. Often dates that are close are related to one concern that was implemented in a series of PRs. But it's still as many different changes to the grammar.

How large will the diff be?

Hard to know without trying. But it might be large, it's a generated state machine so I don't think we have any guarantee that a small change in the grammar entails a small change in the generated output.

A question for the library use-case: What happens if the user's build resolves a different LALRPOP version? Could that change the validity of the checked in file? (Apologies if that's a stupid question, but I'm not terribly familiar with LALRPOP, and I didn't see the question addressed in the discussion.)

On paper, I don't think the LALRPOP version makes a difference, since we only use the hash of the original source of the grammar definition as a way to know if we should use LALRPOP to regenerate the parser or not. This should be independent from the precise version of LALRPOP.

@aspiwack
Copy link
Member

Remember, checking the grammar in the Git repository is an irreversible choice. Once done, it's forever, you cannot take it back in the future, even if you find a better solution. It has a very high bar of evidence to clear.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants