
ca: unsplit issuance flow #8014

Open · wants to merge 9 commits into main
Conversation

jsha
Contributor

@jsha commented Feb 14, 2025

Add a new RPC to the CA: IssueCertificate covers issuance of both the precertificate and the final certificate. In between, it calls out to the RA's new method GetSCTs.

The RA calls the new CA.IssueCertificate if the UnsplitIssuance feature flag is true.

The RA had a metric that counted certificates by profile name and hash. Since the RA doesn't receive a profile hash in the new flow, it now simply records the total number of issuances.

Fixes #7983
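For readers skimming this PR, here's a rough sketch of the shape of the new flow described above. This is not the actual Boulder code: the interface, type, and helper names (sctProvider, issuanceRequest, signPrecert, signFinal) are illustrative stand-ins for the real generated gRPC types and CA internals.

```go
package ca

import "context"

// sctProvider is a stand-in for the client of the RA's new GetSCTs method.
type sctProvider interface {
	GetSCTs(ctx context.Context, precertDER []byte) ([][]byte, error)
}

// issuanceRequest is a simplified stand-in for the IssueCertificate request message.
type issuanceRequest struct {
	CSR            []byte
	RegistrationID int64
	OrderID        int64
}

type certificateAuthority struct {
	scts sctProvider
}

// IssueCertificate covers both halves of issuance: it signs the
// precertificate, calls back out to the RA for SCTs, then signs and
// returns the final certificate.
func (ca *certificateAuthority) IssueCertificate(ctx context.Context, req *issuanceRequest) ([]byte, error) {
	precertDER, err := ca.signPrecert(ctx, req)
	if err != nil {
		return nil, err
	}
	scts, err := ca.scts.GetSCTs(ctx, precertDER)
	if err != nil {
		return nil, err
	}
	return ca.signFinal(ctx, req, precertDER, scts)
}

// signPrecert and signFinal elide the actual signing logic.
func (ca *certificateAuthority) signPrecert(ctx context.Context, req *issuanceRequest) ([]byte, error) {
	return nil, nil
}

func (ca *certificateAuthority) signFinal(ctx context.Context, req *issuanceRequest, precertDER []byte, scts [][]byte) ([]byte, error) {
	return nil, nil
}
```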

@jsha marked this pull request as ready for review February 15, 2025 04:49
@jsha requested a review from a team as a code owner February 15, 2025 04:49
Contributor

@jsha, this PR appears to contain configuration and/or SQL schema changes. Please ensure that a corresponding deployment ticket has been filed with the new values.

Contributor

@jsha, this PR adds one or more new feature flags: UnsplitIssuance. As such, this PR must be accompanied by a review of the Let's Encrypt CP/CPS to ensure that our behavior both before and after this flag is flipped is compliant with that document.

Please conduct such a review, then add your findings to the PR description in a paragraph beginning with "CPS Compliance Review:".

@jsha requested review from aarongable and jprenken and removed request for beautifulentropy February 18, 2025 18:33
@aarongable
Contributor

aarongable commented Feb 18, 2025

Note from alert triage: after this lands and the flag is flipped, we should have SRE alert on total latency of the new SctProvider.GetSCTs method, rather than alerting on the latency of individual logs as they do today. (We could have done this ages ago with a manual timing metric wrapping the whole log-racing section, but I didn't think about it until now.)

edit: I see now that we actually do have this manual timing metric already, so ignore me.
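(For context on the metric mentioned above: the pattern is a single histogram observation wrapping the whole log-racing section, rather than per-log timers. Below is a minimal sketch of that pattern using the Prometheus Go client; the metric name and the getSCTsFromAllLogs helper are illustrative, not Boulder's actual identifiers.)

```go
package ra

import (
	"context"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// sctFetchLatency measures the total time spent gathering SCTs across all
// logs, which is what an SRE alert on overall GetSCTs latency would key off.
// The metric name is illustrative.
var sctFetchLatency = promauto.NewHistogram(prometheus.HistogramOpts{
	Name: "sct_fetch_latency_seconds",
	Help: "Total time spent gathering SCTs across all CT logs.",
})

// getSCTs times the whole log-racing section with one observation.
func getSCTs(ctx context.Context, precertDER []byte) ([][]byte, error) {
	start := time.Now()
	defer func() { sctFetchLatency.Observe(time.Since(start).Seconds()) }()
	return getSCTsFromAllLogs(ctx, precertDER)
}

// getSCTsFromAllLogs stands in for the code that races submissions to the CT logs.
func getSCTsFromAllLogs(ctx context.Context, precertDER []byte) ([][]byte, error) {
	return nil, nil
}
```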

ca/ca.go Outdated
DER: resp.DER,
SCTs: scts.SctDER,
RegistrationID: issueReq.RegistrationID,
OrderID: issueReq.RegistrationID,
Contributor


Shouldn't this be issueReq.OrderID?

mocks/ca.go Outdated
DER: resp.DER,
SCTs: nil,
RegistrationID: req.RegistrationID,
OrderID: req.RegistrationID,
Contributor


Same comment as above: this should probably be req.OrderID.

@jsha
Contributor Author

jsha commented Feb 18, 2025

There's a CI flake here:

22:24:21.714821 3 boulder-ra u4D09g0 [AUDIT] Asynchronous finalization failed: rpc error: code = Unavailable desc = last connection error: connection error: desc = "transport: Error while dialing: dial tcp 10.77.77.77:9394: connect: connection refused"

This is the CA reporting that it got a connection refused from an RA during GetSCTs (note: the CA returns that error to the RA, so this is logged by the RA).

However, I'm a bit confused by this because startservers.py waits until all services are up. It's true that there's now a dependency cycle between the CA and the RA that isn't expressed in startservers.py, but in practical terms that shouldn't generate this error. The boulder-wfe2 service depends on boulder-ra-1 and boulder-ra-2, so it shouldn't come up until those are ready, and thus nothing should be requesting issuance until the RAs are ready.

@jsha
Contributor Author

jsha commented Feb 18, 2025

Also interesting that the RA reports coming up more than a second before the connection refused error:

22:24:20.585406 6 boulder-ra 2InI3wk grpc listening on :9394
22:24:21.714821 3 boulder-ra u4D09g0 [AUDIT] Asynchronous finalization failed: rpc error: code = Unavailable desc = last connection error: connection error: desc = "transport: Error while dialing: dial tcp 10.77.77.77:9394: connect: connection refused"

So maybe the problem is not "service not yet listening," but something else. Maybe "service crashed silently?" 🤔

jprenken previously approved these changes Feb 18, 2025
Contributor

@aarongable left a comment


I'm not sure what's up with the integration test. The circular dependency does make me a bit nervous...

ca/proto/ca.proto (outdated)
ca/ca.go (outdated)
ca/ca.go (outdated)
mocks/ca.go (outdated)
ca/ca.go
ra/ra.go (outdated)
ra/ra.go
test/config-next/ra.json (outdated)
@jsha
Contributor Author

jsha commented Feb 19, 2025

FYI: I pushed updates addressing feedback but still consider merging blocked on figuring out what's up with the integration test. Fortunately I can occasionally reproduce locally, so I'm off to the races.

Contributor

@aarongable left a comment


LGTM % flaky integration test

@jsha
Contributor Author

jsha commented Feb 19, 2025

Here's what I think is going on:

  • The CA starts up, and fires up its health checking loop to talk to the SCTProvider (boulder-ra).
  • That causes an immediate connection attempt, which gets "connection refused". Based on the current service topology, the RA depends on the CA, so it's started after the CA comes up.
  • The gRPC code remembers this connection refused and goes into exponential backoff (I haven't been able to confirm where this happens; could be wrong).
  • The CA finishes coming up and the rest of the services are started.
  • A test case tries to issue a certificate.
  • The CA, in trying to call out to its SCTProvider, short-circuits the connection because it's in exponential backoff mode. It reports connection refused again based on the cached status (again, need to confirm where this happens).

If I set "noWaitForReady": false for the "sctService" in config-next/ca.json, in a handful of local runs this doesn't recur. That makes sense, because instead of immediately returning connection refused, the CA waits for the connection to be ready and then sends the request, getting a success.

Still need to think about the best way to solve the problem, though. I'd prefer for all our production RPC calls to consistently use NoWaitForReady (in fact I was hoping to deprecate that feature flag), but a similar problem might turn up with health checks in prod, depending on how exactly we use them.

One possibility is to run the SCT Provider service from a different set of RAs. Or run a totally different binary. That would nicely solve the topology issue in both test and prod.
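(Background on the knob discussed above, for anyone who hasn't dug into it: with gRPC's default "fail fast" behavior, an RPC issued while the channel is sitting in its post-connection-refused backoff state fails immediately with Unavailable, whereas the WaitForReady call option queues the RPC until the channel is ready or the deadline expires. Boulder's noWaitForReady setting presumably toggles that option. Here's a minimal standalone sketch of the difference; the target address and health-check service are placeholders, not Boulder's actual wiring.)

```go
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

func main() {
	// Placeholder target; imagine this is the CA's sctService backend.
	conn, err := grpc.NewClient("10.77.77.77:9394",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	client := healthpb.NewHealthClient(conn)
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Default ("fail fast"): if the channel is in exponential backoff after a
	// connection refused, this returns Unavailable immediately, surfacing the
	// cached connection error.
	_, err = client.Check(ctx, &healthpb.HealthCheckRequest{})
	log.Printf("fail-fast check: %v", err)

	// WaitForReady: the RPC is held until the channel transitions to READY
	// (or the context deadline expires) before being sent.
	_, err = client.Check(ctx, &healthpb.HealthCheckRequest{}, grpc.WaitForReady(true))
	log.Printf("wait-for-ready check: %v", err)
}
```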

Successfully merging this pull request may close these issues.

ca: unsplit precert/CT/cert issuance flow