The flip has completed as of July 24, 2020! That is, k8s.gcr.io now refers to
{asia,eu,us}.gcr.io/k8s-artifacts-prod, not
{asia,eu,us}.gcr.io/google-containers.

Currently, k8s.gcr.io is a vanity domain that points to
gcr.io/google-containers (Google-owned and managed). This is a problem because
k8s.gcr.io is an alias used throughout the Kubernetes codebase. As Kubernetes
is a community-owned project, k8s.gcr.io should instead point to a
community-controlled repo.

The community has created a new repo called
{asia,eu,us}.gcr.io/k8s-artifacts-prod, and it has been agreed that the
community should use it as the new place to push production images (instead of
gcr.io/google-containers). We can solve the above problem by flipping the
vanity domain (k8s.gcr.io) from gcr.io/google-containers to
{asia,eu,us}.gcr.io/k8s-artifacts-prod. This way, no change needs to be made in
the Kubernetes codebase.

The minimum prerequisite is that the existing images in google-containers must
be copied into k8s-artifacts-prod, so that the domain flip happens
transparently and without interruptions. However, there are other
infrastructural improvements that the community has designed, such as explicit
backups, disaster recovery, auditing, and alerting.

The rest of this document explains the infrastructural improvements surrounding
{asia,eu,us}.gcr.io/k8s-artifacts-prod.

To get new images into the old gcr.io/google-containers, a Googler must approve
a change in Google's private repository.

On the other hand, the new {asia,eu,us}.gcr.io/k8s-artifacts-prod is integrated
with a publicly visible GitHub repository named k8s.io. The promoter watches
this repository for changes and promotes images. In addition, a system for
setting up staging repos and promoting from them into
{asia,eu,us}.gcr.io/k8s-artifacts-prod has been created, so that owners of
subprojects in the community can take control of how their images are released.

The Container Image Promoter (henceforth "the promoter") is the OSS rewrite of
the promoter used internally within Google. It reads a set of promoter
manifests (YAML files) that describe the desired state of a Docker registry's
image contents, and then copies in any missing images. Currently, the
k8s.gcr.io directory of this repo defines such a set of promoter manifests.
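
The reconciliation idea described above (compare the desired state with what the registry holds, then copy whatever is missing) can be sketched in shell. This is only an illustration, not the promoter's actual implementation: the function name missing_images and the flat, one-reference-per-line input files are assumptions made for the example, whereas real promoter manifests are YAML.

```shell
# missing_images DESIRED_FILE CURRENT_FILE
# Hypothetical helper: prints every image reference that appears in
# DESIRED_FILE but not in CURRENT_FILE (one reference per line in each file).
missing_images() {
  desired_sorted=$(mktemp)
  current_sorted=$(mktemp)
  # comm requires sorted input; a fixed locale keeps sort and comm consistent.
  LC_ALL=C sort -u "$1" > "$desired_sorted"
  LC_ALL=C sort -u "$2" > "$current_sorted"
  # -2 drops lines unique to the second file and -3 drops lines common to
  # both, leaving only the lines unique to the desired state.
  LC_ALL=C comm -23 "$desired_sorted" "$current_sorted"
  rm -f "$desired_sorted" "$current_sorted"
}
```

Each line printed would correspond to one image that a promotion run still has to copy into the destination registry.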

Prow invokes the promoter as a postsubmit against the k8s.io repo, in the
post-k8sio-image-promo Prow job. Other Prow jobs also integrate with the
promoter; the ones relevant to this doc are listed below:
- pull-k8sio-image-promo (logs): Dry-run version of post-k8sio-image-promo. It runs as a presubmit check on any PR against the k8s.io GitHub repo. In particular, it catches things like tag moves (which are disallowed). Unlike post-k8sio-image-promo, it does not run in the trusted cluster, because it does not need prod credentials (in fact, it doesn't use any creds).
- post-k8sio-image-promo (logs): Postsubmit job against the k8s.io repo holding the promoter manifests. The promoter manifests here are those that promote from the various staging subproject repos to {asia,eu,us}.gcr.io/k8s-artifacts-prod/<subproject>/<image>. It uses the k8s-infra-gcr-promoter@k8s-artifacts-prod.iam.gserviceaccount.com service account to write to {asia,eu,us}.gcr.io/k8s-artifacts-prod. For all intents and purposes, this is the gatekeeper for new images going into k8s-artifacts-prod.
- ci-k8sio-image-promo (logs): Like post-k8sio-image-promo, but runs periodically. This ensures that even if images are accidentally deleted from {asia,eu,us}.gcr.io/k8s-artifacts-prod, they are automatically copied back. It also acts as a sanity check that the promoter can run at all.
- pull-cip-e2e (logs): Runs an E2E test for changes to the promoter source code. This test checks that the promoter can promote images (its main purpose). It uses the [email protected] service account to use the k8s-cip-test-prod GCP project resources for its test cases (creation/deletion of GCR images, etc.).

In addition, some jobs act solely as a sanity check on the promoter's own codebase:
- pull-cip-unit-tests (logs): Runs unit tests for the promoter codebase, and is part of the PR presubmit checks.
- pull-cip-lint (logs): Runs golangci-lint for the promoter codebase (which is primarily written in Go).

In order for a user to push to k8s-artifacts-prod, they must:
- Ensure that they have a subproject staging repo (e.g., gcr.io/k8s-staging-foo for the foo subproject).
- Add the promotion metadata in the manifests subdirectory in the k8s.io repo.

Images in production are subject to the following rules:
- Write-once: Images promoted to production will NOT be deleted, except in extreme emergencies that require human supervision (see "Breakglass" section below).
- Immutable tags: New images added to the promoter manifests cannot use an existing tag for the same image. In other words, tags (once created for an image) cannot be moved or deleted.
- Mandatory subproject prefix: Images must be prefixed in production by the name of the subproject. For example, the subproject named foo must only push images to {asia,eu,us}.gcr.io/k8s-artifacts-prod/foo/...
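
Two of the rules above, the mandatory subproject prefix and immutable tags, lend themselves to a mechanical check. The sketch below is an illustration only and is not taken from the promoter: the function names, and the flat "image:tag digest" file used to represent already-promoted images, are assumptions made for the example.

```shell
# has_subproject_prefix SUBPROJECT IMAGE_PATH
# Hypothetical check: succeeds only if IMAGE_PATH (relative to
# k8s-artifacts-prod) starts with "SUBPROJECT/" followed by an image name.
has_subproject_prefix() {
  case "$2" in
    "$1"/?*) return 0 ;;
    *) return 1 ;;
  esac
}

# is_tag_move PROMOTED_FILE IMAGE_TAG DIGEST
# Hypothetical check: succeeds if IMAGE_TAG already exists in PROMOTED_FILE
# (lines of "image:tag digest") with a digest other than DIGEST, i.e. the
# change would move an existing tag.
is_tag_move() {
  awk -v ref="$2" -v digest="$3" \
    '$1 == ref && $2 != digest { moved = 1 } END { exit moved ? 0 : 1 }' "$1"
}
```

For example, has_subproject_prefix foo foo/bar succeeds, while has_subproject_prefix foo foobar/baz fails because the path does not begin with "foo/".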

Images in k8s-artifacts-prod are not normally deletable. For emergencies,
however, you can reach the GCR admins, listed in the [email protected] group
here, who have write access to GCR.

The GCR images in k8s-artifacts-prod are backed up every 12 hours, by region.
This is done with the ci-k8sio-backup Prow job. All images are backed up,
including legacy images that predate the promoter, were never tagged, and can
only be referenced by their digest.

The backup GCR locations are:
- https://asia.gcr.io/k8s-artifacts-prod-bak
- https://eu.gcr.io/k8s-artifacts-prod-bak
- https://us.gcr.io/k8s-artifacts-prod-bak
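
A per-region backup into these locations could look roughly like the sketch below. It is not the actual ci-k8sio-backup script: the function names are hypothetical, and the YYYY/MM/DD/HH snapshot layout is an assumption inferred from the snapshot path used in the restore example in this document. The only external command used is gcrane cp -r, which that restore example also uses.

```shell
# snapshot_path REGION [STAMP]
# Hypothetical helper: prints the backup destination for one region.
# STAMP defaults to the current UTC hour in the assumed YYYY/MM/DD/HH layout.
snapshot_path() {
  stamp="${2:-$(date -u +%Y/%m/%d/%H)}"
  printf '%s.gcr.io/k8s-artifacts-prod-bak/%s\n' "$1" "$stamp"
}

# backup_all_regions
# Hypothetical helper: copies each regional prod registry into a snapshot
# directory that shares one timestamp across all three regions.
backup_all_regions() {
  backup_stamp=$(date -u +%Y/%m/%d/%H)
  for region in asia eu us; do
    gcrane cp -r "${region}.gcr.io/k8s-artifacts-prod" \
      "$(snapshot_path "$region" "$backup_stamp")"
  done
}
```

Computing the timestamp once, before the loop, keeps the three regional snapshots under the same path even if the copy runs across an hour boundary.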

The Prow jobs related to backups are:
- ci-k8sio-backup (logs): Runs a backup of all GCR images in {asia,eu,us}.gcr.io/k8s-artifacts-prod to {asia,eu,us}.gcr.io/k8s-artifacts-prod-bak/...
- pull-k8sio-backup (logs): Checks that changes to the backup scripts are valid. Like the pull-cip-e2e and pull-cip-auditor-e2e jobs, this job uses GCP resources to check that the backup scripts work as intended in ci-k8sio-backup.

In the event that the k8s-artifacts-prod GCR is compromised, a human from the
[email protected] group must restore from a known-good backup snapshot. An
example might be:

    # Copy one known-good snapshot back into prod, region by region.
    for region in asia eu us; do
        gcrane cp -r "${region}.gcr.io/k8s-artifacts-prod-bak/2020/01/01/00" "${region}.gcr.io/k8s-artifacts-prod"
    done

All GCR stateful changes to {asia,eu,us}.gcr.io/k8s-artifacts-prod are detected
by the auditor, which runs as a service in Cloud Run in the k8s-artifacts-prod
project. If the change fits with the intent of the promoter manifests, nothing
happens. However, if there is a disagreement, then the GCR transaction is
marked as "REJECTED" and an alert is sent to Stackdriver Error Reporting, which
by default currently notifies the project owner via email.

The step-by-step process is:
- An image is created (new tag), deleted, etc. on the k8s-artifacts-prod GCR.
- A Cloud Pub/Sub message with the stateful change contents is sent over HTTPS to the cip-auditor service in Cloud Run.
- cip-auditor clones a fresh copy of the promoter manifests (https://git.k8s.io/k8s.io).
- cip-auditor checks the Pub/Sub message contents against the promoter manifests.
- If the message agrees with the promoter manifests, nothing happens. Otherwise, a call is made to the Stackdriver Error Reporting API with a stack trace and a log of the message contents.
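
The decision in the last two steps can be sketched as a lookup against the manifests. This is a simplified illustration rather than the cip-auditor code: the function name audit_change and the flat "image:tag digest" manifest file are assumptions, the "OK" verdict string is invented for the example, and only the "REJECTED" marking comes from this document. The real auditor also has to handle deletions and other change types.

```shell
# audit_change MANIFEST_FILE IMAGE_TAG DIGEST
# Hypothetical sketch: prints "OK" if the reported change matches a line in
# MANIFEST_FILE exactly, and "REJECTED" otherwise.
audit_change() {
  # -F: fixed string, -x: match the whole line, -q: no output, status only.
  if grep -Fqx -- "$2 $3" "$1"; then
    echo "OK"
  else
    echo "REJECTED"
  fi
}
```

In a real deployment, the "REJECTED" branch is where the call to the Stackdriver Error Reporting API would be made.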

The logs of the auditor are available by using gcloud:

    # The filter is quoted so that it reaches gcloud as a single argument.
    gcloud \
        logging \
        read \
        --format='value(textPayload)' \
        "$(printf "resource.type=project logName=%s resource.labels.project_id=%s" cip-audit-log k8s-artifacts-prod)"

The configuration for deploying the prod Cloud Run instance is here.
- pull-cip-auditor-e2e (logs): Like pull-cip-e2e, but runs E2E tests for the auditing mechanism built into the promoter. While the actual auditing mechanism (known as "cip-auditor") runs in production in the k8s-artifacts-prod project, the E2E tests here run in the test-only project named k8s-gcr-audit-test-prod, which is dedicated solely to this purpose. The auditor code lives here, but the E2E tests for it live here. The E2E tests use the k8s-infra-gcr-promoter@k8s-gcr-audit-test-prod.iam.gserviceaccount.com service account to create and delete Cloud Run services in k8s-gcr-audit-test-prod, as well as to clear Pub/Sub messages and Stackdriver logs. Note that this job uses a different GCP project from pull-cip-e2e, so that the two tests are isolated from each other.

The Stackdriver Error Reporting leg of the auditing process is responsible for
sending alerts to humans about the rejected GCR change. Currently, an email is
sent to the project owner(s) of the k8s-artifacts-prod GCP project.

The auditing mechanism uses 3 service accounts:
- The Pub/Sub service account, which GCR will use to emit Pub/Sub messages for all changes detected in the project's GCR. This is named service-[PROJECT_NUMBER]@gcp-sa-pubsub.iam.gserviceaccount.com. No permission changes are necessary for this service account (which should already exist in the project by default).
- The Pub/Sub Push service account, which the Pub/Sub subscription will use to push messages to the Cloud Run push endpoint. This service account needs the roles/run.invoker role (Cloud Run checks for this authorization) in order to push messages securely to the Cloud Run endpoint.
- The Cloud Run service account, which the Cloud Run instance will run as. This service account only needs those permissions that the auditing mechanism needs, which are:
  - roles/logging.logWriter
  - roles/errorreporting.writer
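
The role grants listed above could be applied with gcloud along the following lines. This is a hedged sketch, not the project's actual deployment script: the function name and all of its parameters (project, the two service account emails, the Cloud Run service name, and the region) are placeholders to be supplied by the caller, and the real setup may pass additional flags.

```shell
# grant_auditor_roles PROJECT PUSH_SA RUN_SA SERVICE REGION
# Hypothetical helper. PUSH_SA and RUN_SA are service account emails,
# SERVICE is the Cloud Run service name, REGION is its Cloud Run region.
grant_auditor_roles() {
  project="$1"; push_sa="$2"; run_sa="$3"; service="$4"; run_region="$5"

  # Let the Pub/Sub Push service account invoke the Cloud Run service.
  gcloud run services add-iam-policy-binding "$service" \
    --project="$project" --region="$run_region" \
    --member="serviceAccount:${push_sa}" --role="roles/run.invoker"

  # Give the Cloud Run service account only the two roles the auditor needs.
  for role in roles/logging.logWriter roles/errorreporting.writer; do
    gcloud projects add-iam-policy-binding "$project" \
      --member="serviceAccount:${run_sa}" --role="$role"
  done
}
```

The Pub/Sub service account is deliberately absent here, since it needs no permission changes.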

The [email protected] Google Group manages the auditor service. Its members are
listed here.

- GCR: Google Container Registry
- GCS: Google Cloud Storage