
ca: unsplit issuance flow #8014

Open · wants to merge 9 commits into main
Conversation

jsha
Contributor

@jsha commented Feb 14, 2025

Add a new RPC to the CA: IssueCertificate covers issuance of both the precertificate and the final certificate. In between, it calls out to the RA's new method GetSCTs.

The RA calls the new CA.IssueCertificate if the UnsplitIssuance feature flag is true.

The RA had a metric that counted certificates by profile name and hash. Since the RA doesn't receive a profile hash in the new flow, it now simply records the total number of issuances.

Fixes #7983
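For readers skimming this PR, here's a rough sketch of the shape of the new flow described above. This is not the actual Boulder code: the interface, type, and helper names (sctProvider, issuanceRequest, signPrecert, signFinal) are illustrative stand-ins for the real generated gRPC types and CA internals.

```go
package ca

import "context"

// sctProvider is a stand-in for the client of the RA's new GetSCTs method.
type sctProvider interface {
	GetSCTs(ctx context.Context, precertDER []byte) ([][]byte, error)
}

// issuanceRequest is a simplified stand-in for the IssueCertificate request message.
type issuanceRequest struct {
	CSR            []byte
	RegistrationID int64
	OrderID        int64
}

type certificateAuthority struct {
	scts sctProvider
}

// IssueCertificate covers both halves of issuance: it signs the
// precertificate, calls back out to the RA for SCTs, then signs and
// returns the final certificate.
func (ca *certificateAuthority) IssueCertificate(ctx context.Context, req *issuanceRequest) ([]byte, error) {
	precertDER, err := ca.signPrecert(ctx, req)
	if err != nil {
		return nil, err
	}
	scts, err := ca.scts.GetSCTs(ctx, precertDER)
	if err != nil {
		return nil, err
	}
	return ca.signFinal(ctx, req, precertDER, scts)
}

// signPrecert and signFinal elide the actual signing logic.
func (ca *certificateAuthority) signPrecert(ctx context.Context, req *issuanceRequest) ([]byte, error) {
	return nil, nil
}

func (ca *certificateAuthority) signFinal(ctx context.Context, req *issuanceRequest, precertDER []byte, scts [][]byte) ([]byte, error) {
	return nil, nil
}
```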

@jsha marked this pull request as ready for review February 15, 2025 04:49
@jsha requested a review from a team as a code owner February 15, 2025 04:49
Contributor

@jsha, this PR appears to contain configuration and/or SQL schema changes. Please ensure that a corresponding deployment ticket has been filed with the new values.

Contributor

@jsha, this PR adds one or more new feature flags: UnsplitIssuance. As such, this PR must be accompanied by a review of the Let's Encrypt CP/CPS to ensure that our behavior both before and after this flag is flipped is compliant with that document.

Please conduct such a review, then add your findings to the PR description in a paragraph beginning with "CPS Compliance Review:".

@jsha requested review from aarongable and jprenken and removed request for beautifulentropy February 18, 2025 18:33
@aarongable
Contributor

aarongable commented Feb 18, 2025

Note from alert triage: after this lands and the flag is flipped, we should have SRE alert on total latency of the new SctProvider.GetSCTs method, rather than alerting on the latency of individual logs as they do today. (We could have done this ages ago with a manual timing metric wrapping the whole log-racing section, but I didn't think about it until now.)

edit: I see now that we actually do have this manual timing metric already, so ignore me.
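(For context on the metric mentioned above: the pattern is a single histogram observation wrapping the whole log-racing section, rather than per-log timers. Below is a minimal sketch of that pattern using the Prometheus Go client; the metric name and the getSCTsFromAllLogs helper are illustrative, not Boulder's actual identifiers.)

```go
package ra

import (
	"context"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// sctFetchLatency measures the total time spent gathering SCTs across all
// logs, which is what an SRE alert on overall GetSCTs latency would key off.
// The metric name is illustrative.
var sctFetchLatency = promauto.NewHistogram(prometheus.HistogramOpts{
	Name: "sct_fetch_latency_seconds",
	Help: "Total time spent gathering SCTs across all CT logs.",
})

// getSCTs times the whole log-racing section with one observation.
func getSCTs(ctx context.Context, precertDER []byte) ([][]byte, error) {
	start := time.Now()
	defer func() { sctFetchLatency.Observe(time.Since(start).Seconds()) }()
	return getSCTsFromAllLogs(ctx, precertDER)
}

// getSCTsFromAllLogs stands in for the code that races submissions to the CT logs.
func getSCTsFromAllLogs(ctx context.Context, precertDER []byte) ([][]byte, error) {
	return nil, nil
}
```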

ca/ca.go Outdated
DER: resp.DER,
SCTs: scts.SctDER,
RegistrationID: issueReq.RegistrationID,
OrderID: issueReq.RegistrationID,
Contributor


Shouldn't this be issueReq.OrderID?

mocks/ca.go Outdated
DER: resp.DER,
SCTs: nil,
RegistrationID: req.RegistrationID,
OrderID: req.RegistrationID,
Contributor


Same comment as above: this should probably be req.OrderID.

@jsha
Contributor Author

jsha commented Feb 18, 2025

There's a CI flake here:

22:24:21.714821 3 boulder-ra u4D09g0 [AUDIT] Asynchronous finalization failed: rpc error: code = Unavailable desc = last connection error: connection error: desc = "transport: Error while dialing: dial tcp 10.77.77.77:9394: connect: connection refused"

This is the CA reporting that it got a connection refused from an RA during GetSCTs (note: the CA returns that error to the RA, so this is logged by the RA).

However, I'm a bit confused by this because startservers.py waits until all services are up. It's true that there's now a dependency cycle between the CA and the RA that isn't expressed in startservers.py, but in practical terms that shouldn't generate this error. The boulder-wfe2 service depends on boulder-ra-1 and boulder-ra-2, so it shouldn't come up until those are ready, and thus nothing should be requesting issuance until the RAs are ready.

@jsha
Contributor Author

jsha commented Feb 18, 2025

Also interesting that the RA reports coming up more than a second before the connection refused error:

22:24:20.585406 6 boulder-ra 2InI3wk grpc listening on :9394
22:24:21.714821 3 boulder-ra u4D09g0 [AUDIT] Asynchronous finalization failed: rpc error: code = Unavailable desc = last connection error: connection error: desc = "transport: Error while dialing: dial tcp 10.77.77.77:9394: connect: connection refused"

So maybe the problem is not "service not yet listening," but something else. Maybe "service crashed silently?" 🤔

jprenken previously approved these changes Feb 18, 2025
Contributor

@aarongable left a comment


I'm not sure what's up with the integration test. The circular dependency does make me a bit nervous...

ca/proto/ca.proto (outdated)
ca/ca.go (outdated)
ca/ca.go (outdated)
mocks/ca.go (outdated)
ca/ca.go
ra/ra.go (outdated)
ra/ra.go
test/config-next/ra.json (outdated)
@jsha
Contributor Author

jsha commented Feb 19, 2025

FYI: I pushed updates addressing feedback but still consider merging blocked on figuring out what's up with the integration test. Fortunately I can occasionally reproduce locally, so I'm off to the races.

Contributor

@aarongable left a comment


LGTM % flaky integration test

@jsha
Contributor Author

jsha commented Feb 19, 2025

Here's what I think is going on:

  • The CA starts up, and fires up its health checking loop to talk to the SCTProvider (boulder-ra).
  • That causes an immediate connection attempt, which gets "connection refused". Based on the current service topology, the RA depends on the CA, so it's started after the CA comes up.
  • The gRPC code remembers this connection refused and goes into exponential backoff (I haven't been able to confirm where this happens; could be wrong).
  • The CA finishes coming up and the rest of the services are started.
  • A test case tries to issue a certificate.
  • The CA, in trying to call out to its SCTProvider, short-circuits the connection because it's in exponential backoff mode. It reports connection refused again based on the cached status (again, need to confirm where this happens).

If I set "noWaitForReady": false for the "sctService" in config-next/ca.json, in a handful of local runs this doesn't recur. That makes sense, because instead of immediately returning connection refused, the CA waits for the connection to be ready and then sends the request, getting a success.

Still need to think about the best way to solve the problem, though. I'd prefer for all our production RPC calls to consistently use NoWaitForReady (in fact I was hoping to deprecate that feature flag), but a similar problem might turn up with health checks in prod, depending on how exactly we use them.

One possibility is to run the SCT Provider service from a different set of RAs. Or run a totally different binary. That would nicely solve the topology issue in both test and prod.
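(Background on the knob discussed above, for anyone who hasn't dug into it: with gRPC's default "fail fast" behavior, an RPC issued while the channel is sitting in its post-connection-refused backoff state fails immediately with Unavailable, whereas the WaitForReady call option queues the RPC until the channel is ready or the deadline expires. Boulder's noWaitForReady setting presumably toggles that option. Here's a minimal standalone sketch of the difference; the target address and health-check service are placeholders, not Boulder's actual wiring.)

```go
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

func main() {
	// Placeholder target; imagine this is the CA's sctService backend.
	conn, err := grpc.NewClient("10.77.77.77:9394",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	client := healthpb.NewHealthClient(conn)
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Default ("fail fast"): if the channel is in exponential backoff after a
	// connection refused, this returns Unavailable immediately, surfacing the
	// cached connection error.
	_, err = client.Check(ctx, &healthpb.HealthCheckRequest{})
	log.Printf("fail-fast check: %v", err)

	// WaitForReady: the RPC is held until the channel transitions to READY
	// (or the context deadline expires) before being sent.
	_, err = client.Check(ctx, &healthpb.HealthCheckRequest{}, grpc.WaitForReady(true))
	log.Printf("wait-for-ready check: %v", err)
}
```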

Successfully merging this pull request may close these issues.

ca: unsplit precert/CT/cert issuance flow