Try including the generated grammar in-tree #2141

jneem · 2025-01-22T16:15:54Z

This is an attempt at #2123, by checking in a version of the generated grammar. The checked in file is 22MB (!), but I checked that the compressed cargo package size is more reasonable (1.2 MB). crates.io has a limit of 10MB per crate.

Stepping back, the main goal here is to avoid making users of the published nickel-lang-core generate the grammar when building. But we also need to make sure that it isn't too painful to modify the grammar. I think there are two basic approaches

check in the generated grammar, and require some workarounds when modifying it
generate the grammar at publish time, so it's in the published crate but not in the git repo.

Option 2 seems nicer in principle, but I'm worried that because publishing is infrequent and mostly automated, it will be prone to undetected breakage. So this PR does option 1, but I think the mechanism in the build script would work with either one; option 2 would just need some more work on the release script.

github-actions · 2025-01-22T16:31:37Z

Bencher Report

Branch	check-in-the-grammar
Testbed	ubuntu-latest

Click to view all benchmark results

Benchmark	Latency	microseconds (µs)
fibonacci 10	📈 view plot 🚷 view threshold	499.54
foldl arrays 50	📈 view plot 🚷 view threshold	1,807.70
foldl arrays 500	📈 view plot 🚷 view threshold	6,864.10
foldr strings 50	📈 view plot 🚷 view threshold	7,256.20
foldr strings 500	📈 view plot 🚷 view threshold	63,292.00
generate normal 250	📈 view plot 🚷 view threshold	48,957.00
generate normal 50	📈 view plot 🚷 view threshold	2,143.70
generate normal unchecked 1000	📈 view plot 🚷 view threshold	3,297.00
generate normal unchecked 200	📈 view plot 🚷 view threshold	790.71
pidigits 100	📈 view plot 🚷 view threshold	3,230.20
pipe normal 20	📈 view plot 🚷 view threshold	1,530.90
pipe normal 200	📈 view plot 🚷 view threshold	10,448.00
product 30	📈 view plot 🚷 view threshold	869.94
scalar 10	📈 view plot 🚷 view threshold	1,563.40
sum 30	📈 view plot 🚷 view threshold	877.08

🐰 View full continuous benchmarking report in Bencher

yannham · 2025-01-22T17:03:13Z

I haven't reviewed, but I want to point out one of LALRPOP co-maintainer's interesting remark: https://github.com/orgs/lalrpop/discussions/1031#discussioncomment-11908105. There's a checksum of the lalrpop grammar file included somewhere, so maybe we can just include a check in the build script so that if the file has changed we rebuild it automatically?

jneem · 2025-01-23T07:56:26Z

Ah, neat! I beefed up the build script a little so that it checks the hash before deciding what to do, and I also added a pre-commit check that will fail if you update the grammar and forget to copy the new generated file into the source tree.

yannham · 2025-01-23T11:03:54Z

flake.nix

+            enable = true;
+            description = "Check that the lalrpop output is up-to-date.";
+            files = "\\.lalrpop";
+            entry = "${check-fresh-grammar}/bin/check-fresh-grammar";


It's a detail, but would a lib.getExe or the like work here, so that we don't even have to rely on the actual file name Nix generates?

aspiwack

Before we merge.

Is checking in a 22MB file really something we want to do? This is irreversible. Every checkout of Nix from then on will have an extra 22MB to download. Ever time the grammar is change will add another 20+MB?

My intuition is that no, we absolutely don't want to do that, this is awful. Of course, I'm no longer the boss here, the person you have to convince that this is reasonable is @aherrmann . But let's be quite prudent about it. In fact, let me give you a little red cross just as a way to draw attention to the issue. This shouldn't be done lightly.

aherrmann · 2025-01-24T09:48:20Z

Is checking in a 22MB file really something we want to do? This is irreversible. Every checkout of Nix from then on will have an extra 22MB to download.

That sounds like a valid concern to raise. A fresh checkout of Nickel today fetches about 30 MiB. So, this is almost doubling the repo size. Completely anecdotal, but I observed 15.00 MiB/s for fetching objects, so naively this adds >1s fresh checkout time for every full replacement of the grammar.

Ever time the grammar is change will add another 20+MB?

If changes are incremental then the impact of changes could be much smaller. But, unless LALRPOP makes some guarantees around that, its behavior could change any time and we could end up with 20+MiB per grammar change.

generate the grammar at publish time, so it's in the published crate but not in the git repo.

This sounds like a reasonable middle ground.

Option 2 seems nicer in principle, but I'm worried that because publishing is infrequent and mostly automated, it will be prone to undetected breakage.

Could the relevant step be tested on CI?

yannham · 2025-01-24T10:40:17Z

Is checking in a 22MB file really something we want to do? This is irreversible. Every checkout of Nix from then on will have an extra 22MB to download. Ever time the grammar is change will add another 20+MB?

It's a good question, but I'm not entirely convinced by the argument:

My intuition is that no, we absolutely don't want to do that, this is awful

Currently anything that uses Nickel as a Rust dependency is spending much more time compiling the grammar than it would take to download those extra 22MB (by the way, it's probably less traffic, since git does apply some compression, right?). The grammar needs to be generated anyway for a build, it has to land somewhere, so we don't save much space either. Maybe it can get collected currently though, while not so much if it has to be there in the source derivation?

To give some concrete numbers around #2123, simply using the same settings on Topiary than for Nickel shaved off 33%, that is 3min, of build time for Topiary on my machine. I suspect that the LALRPOP grammar generation is still taking a non-trivial amount of the rest of the build, so who knows how much we would win with the pre-checked file. As we advertise at least in part Nickel as an easy to embed configuration language into other Rust projects, adding a 4 to 5 minutes penalty on a fresh build isn't so great.

Though you have a good point that this is irreversible, for anyone that clones the repo entirely. I'm not arguing for or against any choice at this point, but I think that "checking a 22MB file is bad" is a bit abstract for me at the moment, while the impact of the full unoptimized LALRPOP build is concrete. How much do you think it's going to impact things in practice? @aherrmann gives a bit of perspective here, as this is almost doubling the size of the repo, which is indeed substantial (although it's not the same for a small repo with an almost instantaneous check-out versus doubling the size of, say, Nixpkgs).

I know it probably wasn't designed for text file, but could Git LFS be of some help here? Letting users actually downloading the grammar if they want, and the build script would act accordingly (if the file is downloaded, use it, otherwise scrap it).

Otherwise maybe option 2 is the way to go. As Andreas mentions, it shouldn't be too hard to check at release time that we ship something with a pre-generated grammar file.

jneem · 2025-01-24T13:02:31Z

I have an implementation of approach 2 up at #2145, just for comparison.

Could the relevant step be tested on CI?

Certain things could, but the main release script is pretty hard to test because it does things like push to the stable branch here on github.

it shouldn't be too hard to check at release time that we ship something with a pre-generated grammar file.

It's a little bit more complicated than that, because the build script will have different code paths for the published cargo package containing the pre-generated grammar vs. building from git. But I think it's manageable. I'm not super confident that #2145 will work out of the box, but probably after a couple of releases it will be ok...

aherrmann · 2025-01-24T13:32:10Z

Currently anything that uses Nickel as a Rust dependency is spending much more time compiling the grammar than it would take to download those extra 22MB

My concern is less around the build side and more about the contributing and source control side. With the compression mentioned above, the impact on the release tarball doesn't sound too bad. And as I understand it approach 2 would look the same to library users, correct?

it's probably less traffic, since git does apply some compression

True, if that's similar to what cargo reportedly achieves, then it's much less

How much do you think it's going to impact things in practice?

Answering these questions should help to understand that:

Will the generated grammar file change?
How frequently will it change?
How large will the diff be?

GitHub documents recommendations for file and repository sizes. How many grammar or LALRPOP changes are we away from hitting those limits?

A question for the library use-case: What happens if the user's build resolves a different LALRPOP version? Could that change the validity of the checked in file? (Apologies if that's a stupid question, but I'm not terribly familiar with LALRPOP, and I didn't see the question addressed in the discussion.)

yannham · 2025-01-24T14:22:06Z

My concern is less around the build side and more about the contributing and source control side. With the compression mentioned above, the impact on the release tarball doesn't sound too bad. And as I understand it approach 2 would look the same to library users, correct?

I think the point stands that, even for a contributor, this should speeds up the initial build quite a lot, which I suppose you need to do in any situation (if only to get the rust-analyzer LSP working, you need to run the build script). It's a classical space vs time trade-off.

Answering these questions should help to understand that:

Will the generated grammar file change?

Yes, it can change, when we modify the original LALRPOP grammar definition. We'll need to regenerate the grammar file and check it in.

How frequently will it change?

It's supposed to change less and less, as the language matures. Although there are always bug fixes and the possibility of new constructs, touching the grammar is by nature often a noticeable change, as we change the surface syntax of Nickel. That's the theory, but if we look at the numbers (commits that touched the grammar in Mar-Dec 2024), there's still quite a bit of activity:

March, 29
May 2, 6 and 13
June 7, 10, 13, 24
July 3, 8, 17, 25
Sept 12, 18
Oct 16
Nov 1, 18, 27
Dec 6

So, based on this year, it did change rather often, for a grammar. Often dates that are close are related to one concern that was implemented in a series of PRs. But it's still as many different changes to the grammar.

How large will the diff be?

Hard to know without trying. But it might be large, it's a generated state machine so I don't think we have any guarantee that a small change in the grammar entails a small change in the generated output.

A question for the library use-case: What happens if the user's build resolves a different LALRPOP version? Could that change the validity of the checked in file? (Apologies if that's a stupid question, but I'm not terribly familiar with LALRPOP, and I didn't see the question addressed in the discussion.)

On paper, I don't think the LALRPOP version makes a difference, since we only use the hash of the original source of the grammar definition as a way to know if we should use LALRPOP to regenerate the parser or not. This should be independent from the precise version of LALRPOP.

aspiwack · 2025-01-27T01:41:23Z

Remember, checking the grammar in the Git repository is an irreversible choice. Once done, it's forever, you cannot take it back in the future, even if you find a better solution. It has a very high bar of evidence to clear.

jneem requested a review from yannham January 22, 2025 16:16

Try including the generated grammar in-tree

bbe9c3c

jneem force-pushed the check-in-the-grammar branch from f4cd082 to 36ce7a7 Compare January 23, 2025 06:50

Use the lalrpop-generated hash to control rebuilding

c1a867e

jneem force-pushed the check-in-the-grammar branch from 36ce7a7 to c1a867e Compare January 23, 2025 07:29

Maybe toybox is available on darwin

69a266e

yannham reviewed Jan 23, 2025

View reviewed changes

yannham approved these changes Jan 23, 2025

View reviewed changes

aspiwack requested changes Jan 23, 2025

View reviewed changes

jneem mentioned this pull request Jan 24, 2025

Dont check in the grammar, but generate it at release time #2145

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Try including the generated grammar in-tree #2141

Try including the generated grammar in-tree #2141

jneem commented Jan 22, 2025

github-actions bot commented Jan 22, 2025 •

edited

Loading

yannham commented Jan 22, 2025

jneem commented Jan 23, 2025

yannham Jan 23, 2025 •

edited

Loading

aspiwack left a comment

aherrmann commented Jan 24, 2025

yannham commented Jan 24, 2025

jneem commented Jan 24, 2025

aherrmann commented Jan 24, 2025 •

edited

Loading

yannham commented Jan 24, 2025 •

edited

Loading

aspiwack commented Jan 27, 2025

Try including the generated grammar in-tree #2141

Are you sure you want to change the base?

Try including the generated grammar in-tree #2141

Conversation

jneem commented Jan 22, 2025

github-actions bot commented Jan 22, 2025 • edited Loading

Bencher Report

yannham commented Jan 22, 2025

jneem commented Jan 23, 2025

yannham Jan 23, 2025 • edited Loading

Choose a reason for hiding this comment

aspiwack left a comment

Choose a reason for hiding this comment

aherrmann commented Jan 24, 2025

yannham commented Jan 24, 2025

jneem commented Jan 24, 2025

aherrmann commented Jan 24, 2025 • edited Loading

yannham commented Jan 24, 2025 • edited Loading

aspiwack commented Jan 27, 2025

github-actions bot commented Jan 22, 2025 •

edited

Loading

yannham Jan 23, 2025 •

edited

Loading

aherrmann commented Jan 24, 2025 •

edited

Loading

yannham commented Jan 24, 2025 •

edited

Loading