Try including the generated grammar in-tree #2141
base: master
Conversation
Bencher Report
I haven't reviewed yet, but I want to point out an interesting remark from one of LALRPOP's co-maintainers: https://github.com/orgs/lalrpop/discussions/1031#discussioncomment-11908105. A checksum of the lalrpop grammar file is already included somewhere, so maybe we can just add a check to the build script so that, if the file has changed, we rebuild it automatically?
Ah, neat! I beefed up the build script a little so that it checks the hash before deciding what to do, and I also added a pre-commit check that fails if you update the grammar and forget to copy the new generated file into the source tree.
```nix
enable = true;
description = "Check that the lalrpop output is up-to-date.";
files = "\\.lalrpop";
entry = "${check-fresh-grammar}/bin/check-fresh-grammar";
```
It's a detail, but would `lib.getExe` or the like work here, so that we don't even have to rely on the actual file name Nix generates?
Before we merge.
Is checking in a 22MB file really something we want to do? This is irreversible: every checkout of Nickel from then on will have an extra 22MB to download, and every time the grammar changes it will add another 20+MB.
My intuition is that no, we absolutely don't want to do that, this is awful. Of course, I'm no longer the boss here, the person you have to convince that this is reasonable is @aherrmann . But let's be quite prudent about it. In fact, let me give you a little red cross just as a way to draw attention to the issue. This shouldn't be done lightly.
That sounds like a valid concern to raise. A fresh checkout of Nickel today fetches about 30 MiB. So, this is almost doubling the repo size. Completely anecdotal, but I observed 15.00 MiB/s for fetching objects, so naively this adds >1s fresh checkout time for every full replacement of the grammar.
If changes are incremental, then the impact of changes could be much smaller. But unless LALRPOP makes some guarantees around that, its behavior could change at any time, and we could end up with 20+MiB per grammar change.
This sounds like a reasonable middle ground.
Could the relevant step be tested on CI?
It's a good question, but I'm not entirely convinced by the argument:
Currently, anything that uses Nickel as a Rust dependency spends much more time compiling the grammar than it would take to download those extra 22MB (and it's probably less traffic anyway, since git applies some compression, right?). The grammar needs to be generated for a build in any case; it has to land somewhere, so we don't save much space either. Maybe it can get collected currently, though, while not so much if it has to be there in the source derivation.

To give some concrete numbers around #2123: simply using the same settings for Topiary as for Nickel shaved off 33%, that is 3min, of Topiary's build time on my machine. I suspect that the LALRPOP grammar generation still takes a non-trivial amount of the rest of the build, so who knows how much we would win with the pre-checked file. As we advertise Nickel, at least in part, as a configuration language that is easy to embed into other Rust projects, adding a 4 to 5 minute penalty on a fresh build isn't so great.

Though you have a good point that this is irreversible for anyone who clones the repo entirely. I'm not arguing for or against any choice at this point, but "checking in a 22MB file is bad" is a bit abstract for me at the moment, while the impact of the full unoptimized LALRPOP build is concrete. How much do you think it's going to impact things in practice? @aherrmann gives a bit of perspective here, as this almost doubles the size of the repo, which is indeed substantial (although doubling the size of a small repo with an almost instantaneous checkout is not the same as doubling the size of, say, Nixpkgs).

I know it probably wasn't designed for text files, but could Git LFS be of some help here? Users could actually download the grammar if they want, and the build script would act accordingly (if the file is downloaded, use it; otherwise, generate it from scratch). Otherwise, maybe option 2 is the way to go.
As Andreas mentions, it shouldn't be too hard to check at release time that we ship something with a pre-generated grammar file.
I have an implementation of approach 2 up at #2145, just for comparison.
Certain things could, but the main release script is pretty hard to test because it does things like push to the
It's a little bit more complicated than that, because the build script will have different code paths for the published cargo package containing the pre-generated grammar vs. building from git. But I think it's manageable. I'm not super confident that #2145 will work out of the box, but probably after a couple of releases it will be ok...
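To make the two code paths concrete, here is a minimal sketch of how a build script might dispatch between a published package (which ships the pre-generated parser) and a git checkout (which must run LALRPOP). All names here are illustrative assumptions, not the actual Nickel build script:

```rust
use std::path::Path;

/// What the build script should do, depending on what is present
/// in the source tree. (Hypothetical names, for illustration only.)
#[derive(Debug, PartialEq)]
enum BuildPlan {
    /// A pre-generated parser exists and matches the grammar: reuse it.
    UsePregenerated,
    /// No usable pre-generated parser: run LALRPOP to generate one.
    RunLalrpop,
}

/// `pregenerated` is the path where a checked-in or packaged parser
/// would live; `grammar_changed` is the result of the hash comparison
/// against the grammar source.
fn plan(pregenerated: &Path, grammar_changed: bool) -> BuildPlan {
    if pregenerated.exists() && !grammar_changed {
        BuildPlan::UsePregenerated
    } else {
        BuildPlan::RunLalrpop
    }
}
```

The published-package path and the git path then share the same logic: the published package simply always satisfies the "pre-generated parser present and fresh" condition.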
My concern is less around the build side and more about the contributing and source control side. With the compression mentioned above, the impact on the release tarball doesn't sound too bad. And as I understand it approach 2 would look the same to library users, correct?
True, if that's similar to what cargo reportedly achieves, then it's much less of an issue.
Answering these questions should help us understand that:

- GitHub documents recommendations for file and repository sizes. How many grammar or LALRPOP changes are we away from hitting those limits?
- A question for the library use-case: what happens if the user's build resolves a different LALRPOP version? Could that change the validity of the checked-in file? (Apologies if that's a stupid question, but I'm not terribly familiar with LALRPOP, and I didn't see the question addressed in the discussion.)
I think the point stands that, even for a contributor, this should speed up the initial build quite a lot, which I suppose you need to do in any situation (if only to get the rust-analyzer LSP working, you need to run the build script). It's a classic space vs. time trade-off.
Yes, it can change, when we modify the original LALRPOP grammar definition. We'll need to regenerate the grammar file and check it in.
It's supposed to change less and less, as the language matures. Although there are always bug fixes and the possibility of new constructs, touching the grammar is by nature often a noticeable change, as we change the surface syntax of Nickel. That's the theory, but if we look at the numbers (commits that touched the grammar in Mar-Dec 2024), there's still quite a bit of activity:
So, based on this year, it did change rather often, for a grammar. Often, dates that are close correspond to one concern that was implemented in a series of PRs. But it's still that many distinct changes to the grammar.
Hard to know without trying. But it might be large: it's a generated state machine, so I don't think we have any guarantee that a small change in the grammar entails a small change in the generated output.
On paper, I don't think the LALRPOP version makes a difference, since we only use the hash of the original source of the grammar definition to know whether we should use LALRPOP to regenerate the parser. This should be independent of the precise version of LALRPOP.
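A minimal sketch of that staleness check, using only the standard library. Note the assumptions: the file names are hypothetical, and a real build script would reuse the digest that LALRPOP itself embeds rather than `std`'s `DefaultHasher` (which is not stable across Rust releases, so it is only suitable here as an illustration):

```rust
use std::collections::hash_map::DefaultHasher;
use std::fs;
use std::hash::{Hash, Hasher};
use std::path::Path;

/// Hash the grammar source file. Stand-in for whatever digest the
/// real build script records when the parser is generated.
fn grammar_hash(src: &Path) -> std::io::Result<u64> {
    let contents = fs::read(src)?;
    let mut hasher = DefaultHasher::new();
    contents.hash(&mut hasher);
    Ok(hasher.finish())
}

/// Compare the current grammar hash against the one recorded the last
/// time the parser was generated. Missing or unparsable recorded hash
/// means we must regenerate to be safe.
fn needs_regeneration(src: &Path, recorded: &Path) -> std::io::Result<bool> {
    let current = grammar_hash(src)?;
    match fs::read_to_string(recorded) {
        Ok(s) => Ok(s
            .trim()
            .parse::<u64>()
            .map(|h| h != current)
            .unwrap_or(true)),
        Err(_) => Ok(true),
    }
}
```

Because the recorded value depends only on the grammar source text, the decision is indeed independent of which LALRPOP version the user's build happens to resolve.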
Remember, checking the grammar into the Git repository is an irreversible choice. Once done, it's forever: you cannot take it back in the future, even if you find a better solution. It has a very high bar of evidence to clear.
This is an attempt at #2123, checking in a version of the generated grammar. The checked-in file is 22MB (!), but I checked that the compressed `cargo package` size is more reasonable (1.2 MB); crates.io has a limit of 10MB per crate.

Stepping back, the main goal here is to avoid making users of the published `nickel-lang-core` generate the grammar when building. But we also need to make sure that it isn't too painful to modify the grammar. I think there are two basic approaches: check the generated file into the repository, or generate it at release time and ship it only in the published package. Option 2 seems nicer in principle, but I'm worried that because publishing is infrequent and mostly automated, it will be prone to undetected breakage. So this PR does option 1, but I think the mechanism in the build script would work with either one; option 2 would just need some more work on the release script.