
OpenShift Regional Disaster Recovery

Overview

These instructions show how to use OpenShift GitOps to deploy two OpenShift clusters paired in disaster-recovery mode.

The disaster-recovery setup is rooted in two components: Red Hat Advanced Cluster Management (RHACM), which pairs the clusters and orchestrates disaster recovery between them, and OpenShift Data Foundation (ODF), which provides the storage layer for replication.

The idea is to create two OpenShift clusters in two different cloud regions and pair them via RHACM.

Prerequisites

  • OpenShift Cluster 4.11.x or higher to host the RHACM hub cluster
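
You can confirm the hub cluster meets this requirement with a quick check (a sketch; run it while logged into the hub cluster, and expect output along these lines):

oc get clusterversion version

# NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
# version   4.11.20   True        False         3d      Cluster version is 4.11.20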

Install the OpenShift GitOps operator in the Hub cluster

Using the OCP console

  1. From the Administrator's perspective, navigate to the OperatorHub page.

  2. Search for "Red Hat OpenShift GitOps." Click on the tile and then click on "Install."

  3. Keep the defaults in the wizard and click on "Install."

  4. Wait for it to appear in the "Installed Operators" list. If it doesn't install correctly, you can check its status on the "Installed Operators" page in the openshift-operators namespace.

Using a terminal

  1. Open a terminal and ensure you have the OpenShift CLI installed:

    oc version --client
    
    # Client Version: 4.10.42

    Ideally, the client's minor version should be at most one minor release behind the server version. Most commands here are pretty basic and will work with a larger version skew, but keep that in mind if you see errors about unrecognized commands or parameters.

    If you do not have the CLI installed, follow the OpenShift CLI installation instructions in the OpenShift documentation.

  2. Log in to the OpenShift CLI.
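
    For example (a sketch; the API server URL and token are placeholders for your hub cluster's values):

    oc login --server=https://api.<hub-cluster-domain>:6443 --token=<api-token>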

  3. Create the Subscription resource for the operator:

    cat << EOF | oc apply -f -
    ---
    apiVersion: operators.coreos.com/v1alpha1
    kind: Subscription
    metadata:
       name: openshift-gitops-operator
       namespace: openshift-operators
    spec:
       channel: stable
       installPlanApproval: Automatic
       name: openshift-gitops-operator
       source: redhat-operators
       sourceNamespace: openshift-marketplace
    EOF
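
  4. Optionally, verify that the operator installed and that the default Argo CD instance is up (a sketch; the resource names assume the default installation, which places the Argo CD instance in the openshift-gitops namespace):

    # The ClusterServiceVersion should reach the "Succeeded" phase
    oc get csv -n openshift-operators | grep openshift-gitops

    # The default Argo CD instance pods should be Running
    oc get pods -n openshift-gitops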

Install RHACM

Follow these instructions for a GitOps-based approach to the installation or follow the RHACM product documentation for the official installation procedure.
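
Either way, once the installation completes, the hub should report a healthy MultiClusterHub instance. A quick check (a sketch; the resource name and namespace assume the default RHACM installation):

oc get multiclusterhub -n open-cluster-management

# NAME              STATUS    AGE
# multiclusterhub   Running   15m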

Add this repo to the Hub cluster

Assuming the argocd CLI is logged into the hub cluster's OpenShift GitOps instance, create the top-level application for this repository:

ocp_dr_gitops_url=https://github.com/nastacio/ocp-dr
ocp_dr_gitops_branch=main
image_set_ref=img4.11.20-x86-64-appsub
argocd app create ocp-dr-app \
      --project default \
      --dest-server https://kubernetes.default.svc \
      --repo ${ocp_dr_gitops_url:?} \
      --path config/app/ \
      --helm-set repoURL=${ocp_dr_gitops_url:?} \
      --helm-set targetRevision=${ocp_dr_gitops_branch:?} \
      --helm-set metadata.cluster.image_set_ref=${image_set_ref:-img4.11.20-x86-64-appsub} \
      --sync-policy automated \
      --revision ${ocp_dr_gitops_branch:?}  \
      --upsert \
&& argocd app wait -l app.kubernetes.io/instance=ocp-dr-app \
      --sync \
      --health \
      --operation
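
Once the command returns, argocd app get shows the health and sync status of every resource created by the App-of-Apps, which is handy while the child applications are still reconciling:

argocd app get ocp-dr-app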

Folders

These are the ArgoCD application folders in this repository.

Folder           Description
config/app       Top-level App-of-Apps for all the other GitOps Application resources.
config/rhacm     All additions required to enable a RHACM cluster for orchestrating disaster recovery between peered clusters.
config/odf       OpenShift Data Foundation cluster (only tested with AWS).
config/clusters  Pair of managed clusters (created via RHACM API) with non-overlapping networks.

Create credential on RHACM server

This step is simple in concept - create a generic secret in the cluster - but the contents of the secret vary with each target cloud provider.

The "Managing Credentials" section of the RHACM documentation contains the detailed process for adding credentials to the cluster, including links to the respective content sources.

This repository was only tested with AWS, but you should be able to modify it for other providers by indicating the name of the secret and its namespace under the .metadata.rhacm.secret and .metadata.rhacm.secret_namespace parameters of the ocp-dr-app Argo application.
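
For example, a hypothetical override for another provider would look like this (both the secret name and namespace below are placeholders):

argocd app set ocp-dr-app \
      --helm-set metadata.rhacm.secret=my-cloud-credentials \
      --helm-set metadata.rhacm.secret_namespace=my-credentials-namespace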

Issues

Ramen configuration not updated in the openshift-dr-system namespace

The ODF 4.11 instructions, unlike the ODF 4.9 instructions, do not tell the user to modify the Ramen configuration.

I can see that the ODF Multicluster Orchestrator operator eventually updates the ConfigMap named ramen-hub-operator-config in the openshift-operators namespace (the ODF 4.11 instructions are very clear about using that namespace).

However, when I create a DRPolicy resource, its status complains about the lack of a profile in the Ramen configuration. The same message shows up in the status of the DRCluster resources:

oc get drcluster ocpdr2 -o yaml

apiVersion: ramendr.openshift.io/v1alpha1
kind: DRCluster
...
spec:
  region: 3ce47eb5-4815-4933-b776-3c74fcc6709f
  s3ProfileName: s3profile-ocpdr2-ocs-storagecluster
status:
  conditions:
  - lastTransitionTime: "2023-01-05T20:51:15Z"
    message: 's3profile-ocpdr2-ocs-storagecluster: failed to get profile s3profile-ocpdr2-ocs-storagecluster
      for caller drpolicy validation, s3 profile s3profile-ocpdr2-ocs-storagecluster
      not found in RamenConfig'
    observedGeneration: 1
    reason: s3ConnectionFailed
    status: "False"
    type: Validated
  phase: Available

When I tried to inspect this "RamenConfig" (which I somehow inferred to be the same as a ConfigMap named ramen-hub-operator-config), I realized there are two of them: one in the openshift-operators namespace, the other in the openshift-dr-system namespace:

oc get ConfigMap -A | grep ramen-hub-operator-config
openshift-dr-system               ramen-hub-operator-config              1
openshift-operators               ramen-hub-operator-config              1
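
A quick way to confirm the two copies have diverged (a sketch; in my case the s3 profile entries appeared only in the openshift-operators copy):

diff \
  <(oc get configmap ramen-hub-operator-config -n openshift-operators -o jsonpath='{.data}') \
  <(oc get configmap ramen-hub-operator-config -n openshift-dr-system -o jsonpath='{.data}')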

The ConfigMap in the openshift-operators namespace was patched; the one in the openshift-dr-system namespace was not. So I had to re-add that configuration modification to the GitOps folder in this repo (config/cluster-pairng/0300-sync-s3-config.yaml), which recreates the appropriate configuration under the openshift-dr-system namespace.

In conclusion, it looks like the ODF Multicluster Orchestrator operator knows what to do in terms of updating the Ramen configuration - because it makes those changes in the ConfigMap ramen-hub-operator-config in the openshift-operators namespace - but it does not make the same modifications in the openshift-dr-system namespace.

Hub cluster console not restarting when managed clusters are hibernating

Using OCP 4.11.20 on both Hub and Managed clusters.

Testing this setup takes a while and running all these clusters is relatively expensive.

Since I am using OCP clusters created from RHACM, hibernation of clusters when not in use is an option.

For that reason, I always hibernated the managed clusters and then the hub cluster before ending the day, then restarted them in reverse order the next day - first the hub cluster, then the managed clusters from the hub cluster console.

The hub cluster is created from another RHACM instance, so I can hibernate the hub cluster from that instance's console. Once the hub cluster is running, I can go to its console and resume the managed clusters.

This arrangement worked well late in December (unclear which OCP versions I was using at the time).

A few weeks later (today), this restart sequence no longer worked cleanly. Once I brought the hub cluster back from hibernation, the console never came back.

Inspecting the logs for the console pods showed the following messages in a loop:

oc logs console-5698b44df6-nmfl4
I0106 13:42:09.203384       1 config.go:378] Successfully parsed configs for 2 managed cluster(s).
W0106 13:42:09.203536       1 main.go:220] Flag inactivity-timeout is set to less then 300 seconds and will be ignored!
I0106 13:42:09.203546       1 main.go:230] The following console plugins are enabled:
I0106 13:42:09.203555       1 main.go:232]  - odf-multicluster-console
I0106 13:42:09.203562       1 main.go:232]  - acm
I0106 13:42:09.203569       1 main.go:232]  - mce
I0106 13:42:09.203576       1 main.go:232]  - odf-console
I0106 13:42:09.203639       1 main.go:295] Configuring managed cluster ocpdr1
I0106 13:42:09.203903       1 main.go:295] Configuring managed cluster ocpdr2
I0106 13:42:09.204119       1 main.go:364] cookies are secure!
E0106 13:42:14.739239       1 auth.go:232] error contacting auth provider (retrying in 10s): Get "https://api.ocpdr1.cloudpak-bringup.com:6443/.well-known/oauth-authorization-server": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
E0106 13:42:24.798209       1 auth.go:232] error contacting auth provider (retrying in 10s): request to OAuth issuer endpoint https://oauth-openshift.apps.ocpdr1.cloudpak-bringup.com/oauth/token failed: Head "https://oauth-openshift.apps.ocpdr1.cloudpak-bringup.com": EOF
E0106 13:42:34.816804       1 auth.go:232] error contacting auth provider (retrying in 10s): request to OAuth issuer endpoint https://oauth-openshift.apps.ocpdr1.cloudpak-bringup.com/oauth/token failed: Head "https://oauth-openshift.apps.ocpdr1.cloudpak-bringup.com": EOF
E0106 13:42:49.826961       1 auth.go:232] error contacting auth provider (retrying in 10s): request to OAuth issuer endpoint https://oauth-openshift.apps.ocpdr1.cloudpak-bringup.com/oauth/token failed: Head "https://oauth-openshift.apps.ocpdr1.cloudpak-bringup.com": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
E0106 13:43:06.753144       1 auth.go:232] error contacting auth provider (retrying in 10s): request to OAuth issuer endpoint https://oauth-openshift.apps.ocpdr2.cloudpak-bringup.com/oauth/token failed: Head "https://oauth-openshift.apps.ocpdr2.cloudpak-bringup.com": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
E0106 13:43:21.840267       1 auth.go:232] error contacting auth provider (retrying in 10s): request to OAuth issuer endpoint https://oauth-openshift.apps.ocpdr2.cloudpak-bringup.com/oauth/token failed: Head "https://oauth-openshift.apps.ocpdr2.cloudpak-bringup.com": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
E0106 13:43:36.917351       1 auth.go:232] error contacting auth provider (retrying in 10s): request to OAuth issuer endpoint https://oauth-openshift.apps.ocpdr2.cloudpak-bringup.com/oauth/token failed: Head "https://oauth-openshift.apps.ocpdr2.cloudpak-bringup.com": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
E0106 13:43:48.455609       1 auth.go:232] error contacting auth provider (retrying in 10s): request to OAuth issuer endpoint https://oauth-openshift.apps.ocpdr2.cloudpak-bringup.com/oauth/token failed: Head "https://oauth-openshift.apps.ocpdr2.cloudpak-bringup.com": EOF

To be clear, these managed clusters are not typical managed clusters created from RHACM: they are part of a DRPolicy, and many of these DR functions are clearly marked as "Development Preview," which also entails patching the hub cluster to enable the "multicluster web console".

(That multicluster console integration is what I suspect is causing the problems during the cluster restart.)

Since the console pods would not restart, I resorted to logging into the hub cluster via the oc CLI and resuming the managed clusters - only the ones created through the instructions in this README page:

# Resume (un-hibernate) the managed clusters created by this repo
# by patching their Hive ClusterDeployment resources on the hub cluster
oc get ManagedCluster \
   -l app.kubernetes.io/instance=ocp-dr-clusters \
   -o go-template='{{range .items}}{{.metadata.name}}{{"\n"}}{{end}}' \
   | xargs -I {} \
        oc patch ClusterDeployment {} \
          -n {} \
          --type merge \
          --patch '{"spec":{"powerState": "Running"}}'

Once the managed clusters resumed from hibernation, I restarted the hub cluster console pods (maybe they would have restarted on their own, but I didn't want to wait):

oc rollout restart Deployment/console -n openshift-console \
&& oc rollout status Deployment/console -n openshift-console
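
After the console came back, a quick sanity check that the managed clusters report as available again (same label selector as before):

oc get ManagedCluster -l app.kubernetes.io/instance=ocp-dr-clusters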

