# OpenShift Regional Disaster Recovery
These instructions show how to use OpenShift GitOps to deploy two OpenShift clusters paired in disaster-recovery mode.

The disaster recovery is rooted in two components: RHACM, which orchestrates the cluster pairing, and OpenShift Data Foundation (ODF), which replicates data between the clusters. The idea is to create two OpenShift clusters in two different cloud regions and pair them via RHACM.
- An OpenShift cluster at version 4.11.x or higher to host the RHACM hub cluster
- From the Administrator's perspective, navigate to the OperatorHub page.
- Search for "Red Hat OpenShift GitOps." Click on the tile and then click on "Install."
- Keep the defaults in the wizard and click on "Install."
- Wait for it to appear in the "Installed Operators" list. If it doesn't install correctly, you can check its status on the "Installed Operators" page in the `openshift-operators` namespace.
- Open a terminal and ensure you have the OpenShift CLI installed:

  ```sh
  oc version --client # Client Version: 4.10.42
  ```

  Ideally, the client's minor version should be at most one iteration behind the server version. Most commands here are pretty basic and will work with more significant differences, but keep that in mind if you see errors about unrecognized commands and parameters.

  If you do not have the CLI installed, follow these instructions.
- Create the `Subscription` resource for the operator:

  ```sh
  cat << EOF | oc apply -f -
  ---
  apiVersion: operators.coreos.com/v1alpha1
  kind: Subscription
  metadata:
    name: openshift-gitops-operator
    namespace: openshift-operators
  spec:
    channel: stable
    installPlanApproval: Automatic
    name: openshift-gitops-operator
    source: redhat-operators
    sourceNamespace: openshift-marketplace
  EOF
  ```
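Before moving on, it is worth confirming the operator finished installing. This quick check is not part of the original instructions, just one way to verify that the ClusterServiceVersion reached the "Succeeded" phase:

```sh
# Optional verification: inspect the Subscription and its ClusterServiceVersion.
oc get subscription openshift-gitops-operator -n openshift-operators
oc get csv -n openshift-operators | grep gitops
```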
Follow these instructions for a GitOps-based approach to the installation or follow the RHACM product documentation for the official installation procedure.
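The `argocd` commands below assume you are already logged in to the Argo CD instance created by the OpenShift GitOps operator. A minimal login sketch, assuming the default instance in the `openshift-gitops` namespace (the route and secret names may differ in your environment):

```sh
# Assumption: the default Argo CD instance created by OpenShift GitOps, exposed
# through the "openshift-gitops-server" route, with the admin password stored
# in the "openshift-gitops-cluster" secret.
argocd_host=$(oc get route openshift-gitops-server -n openshift-gitops -o jsonpath='{.spec.host}')
argocd_pwd=$(oc get secret openshift-gitops-cluster -n openshift-gitops -o jsonpath='{.data.admin\.password}' | base64 -d)
argocd login "${argocd_host}" --username admin --password "${argocd_pwd}" --grpc-web
```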
```sh
ocp_dr_gitops_url=https://github.com/nastacio/ocp-dr
ocp_dr_gitops_branch=main
image_set_ref=img4.11.20-x86-64-appsub

argocd app create ocp-dr-app \
    --project default \
    --dest-server https://kubernetes.default.svc \
    --repo ${ocp_dr_gitops_url:?} \
    --path config/app/ \
    --helm-set repoURL=${ocp_dr_gitops_url:?} \
    --helm-set targetRevision=${ocp_dr_gitops_branch:?} \
    --helm-set metadata.cluster.image_set_ref=${image_set_ref:-img4.11.20-x86-64-appsub} \
    --sync-policy automated \
    --revision ${ocp_dr_gitops_branch:?} \
    --upsert \
&& argocd app wait -l app.kubernetes.io/instance=ocp-dr-app \
    --sync \
    --health \
    --operation
```
These are the ArgoCD application folders in this repository.
| Folder | Description |
|--------|-------------|
| `config/app` | Top-level App-of-Apps for all the other GitOps `Application` resources. |
| `config/rhacm` | All additions required to enable a RHACM cluster for orchestrating disaster recovery between peered clusters. |
| `config/odf` | OpenShift Data Foundation cluster (only tested with AWS). |
| `config/clusters` | Pair of managed clusters (created via the RHACM API) with non-overlapping networks. |
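Once the top-level application syncs, the child applications generated from these folders should show up in Argo CD. One way to list them, reusing the label selector from the `argocd app wait` command above:

```sh
# List the child Application resources generated by the ocp-dr-app App-of-Apps.
argocd app list -l app.kubernetes.io/instance=ocp-dr-app
```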
This step is simple in concept - create a generic secret in the cluster - but the contents of the secret vary with each target cloud provider.

The "Managing Credentials" section of the RHACM documentation contains the detailed process for adding credentials to the cluster, including links to the respective content sources.

This repository was only tested on AWS, but you should be able to modify it to use other providers by indicating the name of the secret and its namespace in the `.metadata.rhacm.secret` and `.metadata.rhacm.secret_namespace` parameters of the `ocp-dr-app` Argo application, as sketched below.
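As an illustration only - the secret name, namespace, and set of keys below are placeholders, and the authoritative format for your provider is in the RHACM "Managing credentials" documentation - an AWS-style credential secret and the corresponding application parameters might look like this:

```sh
# Hypothetical names: "aws-dr-credentials" secret in the "ocp-dr-credentials"
# namespace. The exact keys and labels RHACM expects depend on the provider;
# see the "Managing credentials" documentation.
oc create namespace ocp-dr-credentials
oc create secret generic aws-dr-credentials \
    -n ocp-dr-credentials \
    --from-literal=aws_access_key_id="${AWS_ACCESS_KEY_ID:?}" \
    --from-literal=aws_secret_access_key="${AWS_SECRET_ACCESS_KEY:?}"
oc label secret aws-dr-credentials -n ocp-dr-credentials \
    cluster.open-cluster-management.io/type=aws \
    cluster.open-cluster-management.io/credentials=""

# Point the ocp-dr-app application at the secret.
argocd app set ocp-dr-app \
    --helm-set metadata.rhacm.secret=aws-dr-credentials \
    --helm-set metadata.rhacm.secret_namespace=ocp-dr-credentials
```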
The ODF 4.11 instructions, unlike the ODF 4.9 instructions, do not tell the user to modify the Ramen configuration.
I can see that the ODF Multicluster Orchestrator operator eventually updates the `ConfigMap` named `ramen-hub-operator-config` in the `openshift-operators` namespace (the ODF 4.11 instructions are very clear about using that namespace).

However, when I create a `DRPolicy` resource, its status complains about the lack of a profile in the Ramen configuration. That same message shows up in the status of the `DRCluster` resources:
```sh
oc get drcluster ocpdr2 -o yaml
```

```yaml
apiVersion: ramendr.openshift.io/v1alpha1
kind: DRCluster
...
spec:
  region: 3ce47eb5-4815-4933-b776-3c74fcc6709f
  s3ProfileName: s3profile-ocpdr2-ocs-storagecluster
status:
  conditions:
  - lastTransitionTime: "2023-01-05T20:51:15Z"
    message: 's3profile-ocpdr2-ocs-storagecluster: failed to get profile s3profile-ocpdr2-ocs-storagecluster
      for caller drpolicy validation, s3 profile s3profile-ocpdr2-ocs-storagecluster
      not found in RamenConfig'
    observedGeneration: 1
    reason: s3ConnectionFailed
    status: "False"
    type: Validated
  phase: Available
```
When I tried to inspect this "RamenConfig" (which I somehow inferred to be the same as a `ConfigMap` named `ramen-hub-operator-config`), I realized there are two of them: one in the `openshift-operators` namespace, the other in the `openshift-dr-system` namespace:
```sh
oc get ConfigMap -A | grep ramen-hub-operator-config

openshift-dr-system    ramen-hub-operator-config    1
openshift-operators    ramen-hub-operator-config    1
```
The `ConfigMap` in the `openshift-operators` namespace was patched. The one in the `openshift-dr-system` namespace was not.
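One way to confirm the drift is to diff the two copies. This is a sketch; it assumes the Ramen configuration is stored under the `ramen_manager_config.yaml` key of each `ConfigMap`:

```sh
# Compare the Ramen configuration in the two namespaces. The data key
# "ramen_manager_config.yaml" is an assumption; adjust if your ConfigMap
# stores the configuration under a different key.
diff \
    <(oc get configmap ramen-hub-operator-config -n openshift-operators \
        -o jsonpath='{.data.ramen_manager_config\.yaml}') \
    <(oc get configmap ramen-hub-operator-config -n openshift-dr-system \
        -o jsonpath='{.data.ramen_manager_config\.yaml}')
```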
So I had to re-add that configuration modification to the GitOps folder in this repo: `config/cluster-pairng/0300-sync-s3-config.yaml`. That entire block of code recreates the appropriate configuration under the `openshift-dr-system` namespace.
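For reference, the missing piece is the S3 profile section of the Ramen configuration. The sketch below shows roughly what such an entry looks like, with field names taken from the ODF Regional DR documentation; the bucket, endpoint, region, and secret names are placeholders, not values from this repository:

```yaml
# Sketch of an s3StoreProfiles entry inside ramen_manager_config.yaml, which
# needs to exist in the ramen-hub-operator-config copy in the
# openshift-dr-system namespace. All values below are placeholders.
s3StoreProfiles:
- s3ProfileName: s3profile-ocpdr2-ocs-storagecluster
  s3Bucket: odrbucket-example
  s3CompatibleEndpoint: https://s3-openshift-storage.apps.ocpdr2.example.com
  s3Region: us-east-1
  s3SecretRef:
    name: example-odr-s3-secret
    namespace: openshift-dr-system
```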
In conclusion, it looks like the ODF Multicluster Orchestrator operator knows what to do in terms of updating the Ramen configuration - it makes those changes in the `ConfigMap` `ramen-hub-operator-config` in the `openshift-operators` namespace - but it does not make the same modifications in the `openshift-dr-system` namespace.
I am using OCP 4.11.20 on both the hub and managed clusters.
Testing this setup takes a while, and running all these clusters is relatively expensive.

Since I am using OCP clusters created from RHACM, hibernating the clusters when not in use is an option.

For that reason, I always hibernated the managed clusters and then the hub cluster before ending the day, then restarted them in the reverse order the next day: first the hub cluster, then the managed clusters from the hub cluster console.
The hub cluster is created from another RHACM instance, so I can hibernate the hub cluster from that instance's console. Once the hub cluster is running, I can go to its console and resume the managed clusters.
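For completeness, hibernating the managed clusters from the hub can be done with the mirror image of the resume command shown later in this page, assuming the same layout where each cluster's `ClusterDeployment` lives in a namespace of the same name:

```sh
# Hibernate the managed clusters created by this repository by setting the
# Hive ClusterDeployment power state. This mirrors the "Running" patch used
# later to resume them.
oc get ManagedCluster \
    -l app.kubernetes.io/instance=ocp-dr-clusters \
    -o go-template='{{range .items}}{{.metadata.name}}{{"\n"}}{{end}}' \
| xargs -I {} \
    oc patch ClusterDeployment {} \
        -n {} \
        --type merge \
        --patch '{"spec":{"powerState": "Hibernating"}}'
```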
This arrangement worked well late in December (unclear which OCP versions I was using at the time). A few weeks later (today), this restart sequence did not work cleanly anymore. Once I brought the hub cluster back from hibernation, the console never came back.
Inspecting the logs for the console pods showed the following messages in a loop:
```sh
oc logs console-5698b44df6-nmfl4

I0106 13:42:09.203384 1 config.go:378] Successfully parsed configs for 2 managed cluster(s).
W0106 13:42:09.203536 1 main.go:220] Flag inactivity-timeout is set to less then 300 seconds and will be ignored!
I0106 13:42:09.203546 1 main.go:230] The following console plugins are enabled:
I0106 13:42:09.203555 1 main.go:232] - odf-multicluster-console
I0106 13:42:09.203562 1 main.go:232] - acm
I0106 13:42:09.203569 1 main.go:232] - mce
I0106 13:42:09.203576 1 main.go:232] - odf-console
I0106 13:42:09.203639 1 main.go:295] Configuring managed cluster ocpdr1
I0106 13:42:09.203903 1 main.go:295] Configuring managed cluster ocpdr2
I0106 13:42:09.204119 1 main.go:364] cookies are secure!
E0106 13:42:14.739239 1 auth.go:232] error contacting auth provider (retrying in 10s): Get "https://api.ocpdr1.cloudpak-bringup.com:6443/.well-known/oauth-authorization-server": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
E0106 13:42:24.798209 1 auth.go:232] error contacting auth provider (retrying in 10s): request to OAuth issuer endpoint https://oauth-openshift.apps.ocpdr1.cloudpak-bringup.com/oauth/token failed: Head "https://oauth-openshift.apps.ocpdr1.cloudpak-bringup.com": EOF
E0106 13:42:34.816804 1 auth.go:232] error contacting auth provider (retrying in 10s): request to OAuth issuer endpoint https://oauth-openshift.apps.ocpdr1.cloudpak-bringup.com/oauth/token failed: Head "https://oauth-openshift.apps.ocpdr1.cloudpak-bringup.com": EOF
E0106 13:42:49.826961 1 auth.go:232] error contacting auth provider (retrying in 10s): request to OAuth issuer endpoint https://oauth-openshift.apps.ocpdr1.cloudpak-bringup.com/oauth/token failed: Head "https://oauth-openshift.apps.ocpdr1.cloudpak-bringup.com": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
E0106 13:43:06.753144 1 auth.go:232] error contacting auth provider (retrying in 10s): request to OAuth issuer endpoint https://oauth-openshift.apps.ocpdr2.cloudpak-bringup.com/oauth/token failed: Head "https://oauth-openshift.apps.ocpdr2.cloudpak-bringup.com": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
E0106 13:43:21.840267 1 auth.go:232] error contacting auth provider (retrying in 10s): request to OAuth issuer endpoint https://oauth-openshift.apps.ocpdr2.cloudpak-bringup.com/oauth/token failed: Head "https://oauth-openshift.apps.ocpdr2.cloudpak-bringup.com": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
E0106 13:43:36.917351 1 auth.go:232] error contacting auth provider (retrying in 10s): request to OAuth issuer endpoint https://oauth-openshift.apps.ocpdr2.cloudpak-bringup.com/oauth/token failed: Head "https://oauth-openshift.apps.ocpdr2.cloudpak-bringup.com": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
E0106 13:43:48.455609 1 auth.go:232] error contacting auth provider (retrying in 10s): request to OAuth issuer endpoint https://oauth-openshift.apps.ocpdr2.cloudpak-bringup.com/oauth/token failed: Head "https://oauth-openshift.apps.ocpdr2.cloudpak-bringup.com": EOF
```
To be clear, these managed clusters are not the typical managed clusters created from RHACM: they are part of a `DRPolicy`, and a lot of these DR functions are clearly marked as "Development Preview," which also entails patching the hub cluster to enable the "multicluster web console". (That integration is what I suspect is causing the problems during the cluster restart.)
With the console pods unable to come back on their own, I resorted to logging into the hub cluster via the `oc` CLI and then resumed the managed clusters - only the ones created through the instructions in this README page.
```sh
oc get ManagedCluster \
    -l app.kubernetes.io/instance=ocp-dr-clusters \
    -o go-template='{{range .items}}{{.metadata.name}}{{"\n"}}{{end}}' \
| xargs -I {} \
    oc patch ClusterDeployment {} \
        -n {} \
        --type merge \
        --patch '{"spec":{"powerState": "Running"}}'
```
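To know when the clusters are back, one option (not part of the original write-up) is to watch the `ManagedCluster` resources until their AVAILABLE column flips back to True:

```sh
# Watch the managed clusters created by this repository come back online.
oc get ManagedCluster -l app.kubernetes.io/instance=ocp-dr-clusters -w
```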
Once the managed clusters resumed from hibernation, I restarted the hub cluster console pods (maybe they would have restarted on their own, but I didn't want to wait):
```sh
oc rollout restart Deployment/console -n openshift-console \
&& oc rollout status Deployment/console -n openshift-console
```