
Shen Wei f09834b5a5 Update nexus: fix conflicts and sync local changes

2026-04-26 12:06:50 +08:00


ESM-Cloud-Disaster-and-Recovery-Guide_686087723

Introduction

This guide is based on the latest ESM disaster recovery solution: back up the data from the source farm and restore it to a new target farm.

This means you discard the current farm and restore it on the new farm (cross AWS account, cross region).

Back up all the data from the source farm

Backup Data
- Back up the EFS servers for cms, smax, oomt, prometheus
- Back up the RDS servers for cms, smax, oomt, audit service
- Back up the Vertica DB if CGRO is enabled in the source farm (optional)
- Back up all the k8s configuration files using velero
- Back up all cert files in the target farm (smax, cdf, cms, oomt, audit) - /mnt/efs/var/vols/itom/itsma/global-volume/certificate/
Transfer Data
- Transfer all the snapshots to the target farm (this may take time, depending on the size of the data)
Push all images
- Push all images to the target farm
- To keep the data consistent, the creation times of all the backups should be close together, ideally within 2 hours of each other.
Tips
- Use a backup vault to transfer the EFS backups.
- Copy and share the RDS snapshots with a customer key
- Refer to the link: How to share an RDS snapshot
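The 2-hour consistency window mentioned above can be checked mechanically before starting the transfer. The following is a minimal sketch, not part of the original procedure: `backups_within_window` is a hypothetical helper name, and it assumes you have already converted each backup's creation time to epoch seconds.

```shell
# Hypothetical helper (not from the original guide): succeeds only if all
# given epoch timestamps (seconds) lie within max_seconds of each other.
backups_within_window() {
  local max_seconds="$1"; shift
  local min="" max="" ts
  for ts in "$@"; do
    if [ -z "$min" ] || [ "$ts" -lt "$min" ]; then min="$ts"; fi
    if [ -z "$max" ] || [ "$ts" -gt "$max" ]; then max="$ts"; fi
  done
  # Fail when no timestamps were given, or when the spread is too large.
  [ -n "$min" ] && [ $((max - min)) -le "$max_seconds" ]
}
```

For example, `backups_within_window 7200 "$smax_rds_epoch" "$cms_rds_epoch" "$smax_efs_epoch"` returns success only when the three creation times are at most 2 hours apart (the variable names are placeholders).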

Prepare new EKS cluster in the new target farm

Shut down the farm that is running in the target farm (optional; if enough IPs are available you can skip it)
Build a new VPC and subnets from CloudFormation (make sure you are not logged in to the AWS console via SAML; log in with your own or a service account instead) - in this case we skip this step and reuse the existing resources
Build a new EKS cluster from CloudFormation (add or update the tags on the 3 private subnets: kubernetes.io/cluster/<cluster-name>=shared; kubernetes.io/role/internal-elb=1)
Build new EKS worker nodes: smax, cms, oomt, prometheus (NodeInstanceRole: value from the Outputs tab shown when you create the EKS cluster)
Check that the node groups are exactly the same as in the source farm (instance type, instance number, kubernetes labels)
Build a new EKS bastion (kubectl get nodes should return the correct output)
Check the security group inbound rules (add the SG of the bastion server to the inbound rules of the EKS control plane SG; add the EKS control plane SG to the inbound rules of the new EFS SG)

Refer to the link: https://docs.microfocus.com/doc/SMAX/23.4/TasksOnAWS
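The subnet tagging step can also be done from the command line instead of the console. The sketch below only prints the `aws ec2 create-tags` call so it can be reviewed first; `print_subnet_tag_cmd` is a hypothetical helper, and the cluster name and subnet IDs are placeholders you must supply.

```shell
# Hypothetical helper: print (do not run) the aws CLI call that applies the
# two tags from the step above to the given private subnets.
# $1 = EKS cluster name, remaining arguments = subnet IDs.
print_subnet_tag_cmd() {
  local cluster="$1"; shift
  echo "aws ec2 create-tags --resources $* --tags" \
       "Key=kubernetes.io/cluster/${cluster},Value=shared" \
       "Key=kubernetes.io/role/internal-elb,Value=1"
}
```

Calling `print_subnet_tag_cmd my-eks subnet-a subnet-b subnet-c` prints a single command line; run it once you have confirmed the subnet IDs.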

Setting up velero

Download velero binary and copy into $PATH(wget https://github.com/vmware-tanzu/velero/releases/download/v1.4.2/velero-v1.4.2-linux-amd64.tar.gz && tar -zxvf velero-v1.4.2-linux-amd64.tar.gz && cd velero-* && chmod a+x velero && mv velero /usr/local/bin/)
Create bucket in S3 for velero
Set up the velero deployment (velero install --provider aws --plugins velero/velero-plugin-for-aws:v1.1.0 --bucket $BUCKET --backup-location-config region=$REGION --snapshot-location-config region=$REGION --secret-file ./credentials-velero)
Check velero functions by running: velero backup create test1

Refer to the link to install velero: https://github.com/vmware-tanzu/velero-plugin-for-aws

You can also refer to the link to config velero backups automatically: https://github.houston.softwaregrp.net/smax-saas-ops/saas-devops-tools/blob/master/velero_backup.sh

Velero should be installed on both the source farm and the target farm. In the saas farms we have set up one user for DR in each farm; please install velero using that account.
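The `--secret-file ./credentials-velero` argument in the install command expects an AWS shared-credentials style file. A minimal sketch for creating it is shown below; `write_velero_credentials` is a hypothetical helper, and the key ID and secret are placeholders for the DR user's access key.

```shell
# Hypothetical helper: write an AWS credentials file with a [default]
# profile, readable by the owner only.
# $1 = output file, $2 = access key id, $3 = secret access key.
write_velero_credentials() {
  local file="$1" key_id="$2" secret="$3"
  (
    umask 077   # restrict permissions on the newly created file
    printf '[default]\naws_access_key_id=%s\naws_secret_access_key=%s\n' \
      "$key_id" "$secret" > "$file"
  )
}
```

Usage: `write_velero_credentials ./credentials-velero "$DR_KEY_ID" "$DR_SECRET"`, then run the velero install command from the list above.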

Restore infra in target farm

Restore the new smax RDS server from snapshot - pay attention to the RDS type, storage type and size
Restore the new cms RDS server from snapshot
Restore the new oomt RDS server from snapshot (if present)
Restore the new audit RDS server from snapshot (if present)
Restore the Vertica DB for CGRO (optional, if CGRO is enabled in the farm)
Restore the new smax EFS server from snapshot (add mount targets after the restore so that IPs are assigned; the same applies to the cms and prometheus EFS servers) - time-consuming task
Restore the new cms EFS server from snapshot
Restore the new oomt EFS server from snapshot
Restore the new prometheus EFS server from snapshot (optional, if you care about the Prometheus data)

To save time, the tasks in this section can be done in parallel

Update K8S resources

Download the current CDF installation bundle to the new bastion and run: ./install --capabilities Tools=true,Monitoring=false,LogCollection=false,DeploymentManagement=false,ClusterManagement=false
Download the velero backups and the shell script that batch-updates the parameters in the K8S resources (put the shell script in the backup directory so that there are 9 files in total)
Replace all images in the velero backups, e.g.: sh replaceVeleroConf.sh "551360491748.dkr.ecr.us-west-2.amazonaws.com\/hpeswitom" "551360491750.dkr.ecr.us-west-1.amazonaws.com\/hpeswitomsandbox" false
Replace the AWS account (if changed): sh replaceVeleroConf.sh source_aws_account target_aws_account false
Replace the region (if changed): sh replaceVeleroConf.sh us-west-2 us-west-1 false
Replace the org name (if changed): sh replaceVeleroConf.sh "\"hpeswitom\"" "\"hpeswitomsandbox\"" false
Replace the fqdn (if changed): sh replaceVeleroConf.sh "us2-smax.saas.microfocus.com" "us2-smax-testing.saas.microfocus.com" false - using the change-fqdn script is another approach; take care of the certificate and SAML
Replace the EFS servers in the velero backups (if you are restoring on the same farm and to the same EFS, you can skip this step since the EFS server endpoints do not change)

sh replaceVeleroConf.sh source_smax_efs target_smax_efs false
sh replaceVeleroConf.sh source_cms_efs target_cms_efs false
sh replaceVeleroConf.sh source_oomt_efs target_oomt_efs false
sh replaceVeleroConf.sh source_prometheus_efs target_prometheus_efs false

Replace the Vertica server in the velero backups (optional) - sh replaceVeleroConf.sh source_vertica_ip target_vertica_ip false (if you are restoring on the same farm and the Vertica IP has not changed, you can skip this step)
Replace the RDS servers in the velero backups (if you are restoring on the same farm and the RDS endpoints have not changed, you can skip this step)

sh replaceVeleroConf.sh source_smax_rds target_smax_rds false
sh replaceVeleroConf.sh source_cms_rds target_cms_rds false
sh replaceVeleroConf.sh source_oomt_rds target_oomt_rds false
sh replaceVeleroConf.sh source_audit_rds target_audit_rds true

Upload the updated backup files to the target S3 bucket (rm -rf replaceVeleroConf.sh && cd .. && aws s3 cp --recursive backup_Name/ s3://target_bucket/backups/backup_Name/)
Check that you have the correct backups (velero backup get - should now return the backup from the source farm)
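The replaceVeleroConf.sh script itself is not reproduced in this guide. To illustrate the kind of batch substitution the steps above rely on, here is a hypothetical stand-in that replaces a literal string in every file under a directory. It is only a sketch: it assumes GNU sed (`sed -i`), assumes the backup contents are already extracted as plain text, and does not model the third `true`/`false` argument of the real script.

```shell
# Hypothetical stand-in, NOT the real replaceVeleroConf.sh:
# replace every occurrence of the literal string $1 with $2 in all files under $3.
replace_literal() {
  local from="$1" to="$2" dir="$3"
  local from_esc to_esc
  # Escape regex metacharacters so that e.g. the dots in a host name match literally.
  from_esc=$(printf '%s' "$from" | sed 's/[][\.*^$|]/\\&/g')
  # Escape characters that are special on the replacement side.
  to_esc=$(printf '%s' "$to" | sed 's/[\&|]/\\&/g')
  # Only touch files that actually contain the source string.
  grep -rlF -- "$from" "$dir" | while IFS= read -r f; do
    sed -i "s|${from_esc}|${to_esc}|g" "$f"
  done
}
```

Because `|` is used as the sed delimiter, slashes in registry paths need no escaping here, e.g. `replace_literal "us-west-2" "us-west-1" ./backup_Name`. Escaping the dots matters for host names: without it, `a.b.c` would also match `axbxc`.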

Perform restore in target farm

Disable smtp in the target farm (optional) - do this by adding an outbound rule to the Network ACLs (id: 99, port 25, deny all)
Mount the new EFS servers for smax, cms, oomt, prometheus on the new bastion (mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2 <efs-endpoint>:/ /mnt/efs) - you can skip this step if the EFS endpoints have not changed
Also add the EFS mounts to /etc/fstab, otherwise the mount points are lost after a VM restart - you can skip this step if the EFS endpoints have not changed
Delete the pv if it has already been created
Delete the namespaces itsma-xxxxx, cms, core, oomt, audit, prometheus if they still exist
Perform a full restore: velero restore create --from-backup <backup.all.example1> --wait
Restore only one namespace (optional): velero restore create --from-backup <backup.all.example1> --wait --include-namespaces=cms
Update the credentials of the itom-vault container if the pod cannot start

PASSPHRASE=$(kubectl get secret vault-passphrase -n core -o json 2>/dev/null | jq -r '.data.passphrase')
VAULT_CREDENTIAL_SECRET=$(kubectl get secret vault-credential -n core -o json 2>/dev/null )
ENCRYPTED_ROOT_TOKEN=$(echo ${VAULT_CREDENTIAL_SECRET} | jq -r '.data."root.token"')
VAULT_TOKEN=$(echo ${ENCRYPTED_ROOT_TOKEN} | openssl aes-256-cbc -md sha256 -a -d -pass pass:"${PASSPHRASE}")
echo ${VAULT_TOKEN}

kubectl exec -it $(kubectl get pod -ncore -ocustom-columns=NAME:.metadata.name |grep itom-vault| head -1) -ncore -- bash
export VAULT_ADDR=https://itom-vault.core:8200
export VAULT_TOKEN=<VAULT_TOKEN>
vault write -tls-skip-verify auth/kubernetes/config kubernetes_host="https://kubernetes.default" kubernetes_ca_cert=@/var/run/secrets/kubernetes.io/serviceaccount/ca.crt

Helm upgrade apphub for cdf - All helm releases should update, this includes core, cms and maybe oomt in the future

/root/cdf/bin/helm get values apphub -n core > apphub.yaml
update apphub.yaml with new values(dburl, host,registry,orgName,externalAccessHost)
/root/cdf/bin/helm upgrade apphub /root/cdf/charts/apphub-1.20.0+20211100.219.tgz -f apphub.yaml -n core

Helm upgrade cms releases - update smax.crt,database.host,smax.host,orgName,registry,externalAccessHost,idmAuthUrl,idmServiceUrl(pay attention to host and idmServiceUrl, we have different values between saas farms)
Helm upgrade apphub for prometheus - update orgName,registry,externalAccessHost
Helm upgrade oomt releases(optional, if you have enabled oomt)
Helm upgrade audit service releases(optional, if you have enabled audit service)
Wait until all the pods are up (kubectl get pod --all-namespaces|grep -vE '1/1|2/2|3/3|4/4|Completed')
Known issue: if smax has been converted to helm, you have to run the helm upgrade for itsma as well, since most DND pods are waiting for the jobs
Sometimes dnd-upgrade-jobs fail; in that case delete those pods and any related pods that are stuck in Init state
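The grep filter used in the wait step only recognises pods with up to 4 containers, so a fully ready 5/5 pod would still be listed. As an alternative sketch, the READY column can be compared numerically with awk. `not_ready_pods` is a hypothetical helper that reads the output of `kubectl get pod --all-namespaces --no-headers` on stdin; it assumes the default column order (NAMESPACE, NAME, READY, STATUS, ...).

```shell
# Hypothetical helper: print namespace/name of every pod whose READY count
# is incomplete, ignoring pods in Completed status.
not_ready_pods() {
  awk '$4 != "Completed" {
         split($3, r, "/")            # READY column looks like "1/2"
         if (r[1] != r[2]) print $1 "/" $2
       }'
}
```

Usage: `kubectl get pod --all-namespaces --no-headers | not_ready_pods`. An empty result means every pod is either fully ready or completed.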

Certificates

Update the SMAX cert by: ./replaceExternalAccessHost.sh -c <certificate_path> -k <key_path> -t <cacert_path> -n
Update CMS and SAM cert by:

Get current cms cert from source farm:

helm ls -n cms && helm get values cms-release -n cms > /tmp/cms.yaml

Put the cms cert files under the directory /mnt/efs/var/vols/itom/itsma/global-volume/certificate/source/; the cert files will be imported automatically (make sure the ownership is set to 1999:1999)
Restart the platform and platform-offline pods

You can also update cert files for DND and OO: https://docs.microfocus.com/doc/SMAX/23.4/SMAXChangeFQDN
Update cert files for OOMT if OOMT is enabled
Update cert files for Audit service if audit is enabled

Application load balancer

Configure Load balancer for smax - refer to: https://docs.microfocus.com/doc/SMAX/23.4/EKSDeploySuite
Configure Load balancer for management portal: 5443
Configure Load balancer for prometheus(optional)
Rebuild the ALB controller in kube-system (delete the aws-load-balancer-controller deployment in the kube-system namespace and recreate it; pay attention to the values of cluster-name and region)
Delete and rebuild the 3 ingresses for cms - note that the ALB name will change
Delete and rebuild 3 ingress for oomt(optional)
Delete and rebuild 3 ingress for audit(optional)
Bind DNS records in Route53 for smax, cms, oomt and audit service

You can config ALB controller following the guide: https://docs.aws.amazon.com/eks/latest/userguide/aws-load-balancer-controller.html

Also note that all nodePort values will change; register the ALB with the new nodePort values so that the health status becomes healthy

The ALB for cms also changes (a random string is appended to its DNS name), so you have to update Route53 with the correct values
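Since the ALB DNS names change after the rebuild, the Route53 records have to be repointed. The sketch below generates the change batch for one record; `route53_cname_change_batch` is a hypothetical helper, and it assumes plain CNAME records with a 300 second TTL by default. If your farm uses alias records instead, the change batch looks different.

```shell
# Hypothetical helper: print a Route53 change batch (JSON) that creates or
# updates a CNAME record. $1 = record name, $2 = new ALB DNS name, $3 = TTL.
route53_cname_change_batch() {
  local name="$1" target="$2" ttl="${3:-300}"
  cat <<EOF
{"Changes":[{"Action":"UPSERT","ResourceRecordSet":{"Name":"${name}","Type":"CNAME","TTL":${ttl},"ResourceRecords":[{"Value":"${target}"}]}}]}
EOF
}
```

The output can be passed to the CLI, for example `aws route53 change-resource-record-sets --hosted-zone-id "$ZONE_ID" --change-batch "$(route53_cname_change_batch "$CMS_FQDN" "$NEW_ALB_DNS")"`, where the three variables are placeholders.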

Manual updates after restore

Sensitive data update, depends on your business(update smtp server, update integration password, delete some customer tenants, update bo password for some tenant)
Update CMS integration url in BO page(Select tenant → Application settings → Configuration Management settings, update CMS gateway service)
Update SAM integration url in BO page(Select tenant → Capability settings → Software Asset Management, update CMS gateway url)
Update DND integration url in Agent portal(Open tenant agent page → Administration → Providers → Aggregation providers)
Update csa_access_point in DB(e.g. update dnd_339803511.csa_access_point set uri='https://us2-smax-testing.saas.microfocus.com/339803511/oo' where uuid='8a50b56d7406291f01740629c9f9013a';)
Update OOMT Integration URL in BO page(Select tenant → Capability settings, update OO integration URL and OO login URL)
Update OPB agent and endpoints in Agent portal
Update topology in OO Deployment Operations(ras server has to be reconfigured)
Update settings in Prometheus and Grafana - Optional

Update the cm of itom-grafana (append the values below to data.grafana.ini.root_url)
root_url = https://us2-smax-testing.saas.microfocus.com/grafana
[smtp]
enabled = true
host = email-smtp.us-west-2.amazonaws.com:25
user = aws_access_key_id
password = aws_secret_access_key
skip_verify = true
from_address = sma_noreply@microfocus.com
from_name = US2Dev_Grafana
[rendering]
server_url = http://bitnami-grafana-image-renderer:8080/render
callback_url = https://itom-grafana:80/

Restart pod of itom-grafana-xxxxx
Open Grafana and update the user of the datasource; make sure you are using the correct key for the right farm

Update yamls_outputs in SMAX efs server(better to change all yaml files to readonly)
Please note we have different cms integration url for different farm: e.g. https://int.cms.fqdn:445/cms-gateway in us2-dev and https://int.fqdn:445/cms-gateway in us2-prod

We have updated the yamls for the current deployments, but the values in /mnt/efs/var/vols/itom/itsma/global-volume/yamls_output/ are still unchanged, so if we execute the command:

kubectl delete -f xxxx.deployment.yaml && kubectl create -f xxxx.deployment.yaml

the pods cannot start
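To find out which files under yamls_output still carry source-farm values, a literal search is usually enough. This is a small sketch; `stale_yaml_files` is a hypothetical helper, and the value to search for (source fqdn, source RDS endpoint, source registry, and so on) is whatever you replaced in the velero backups.

```shell
# Hypothetical helper: list, in sorted order, the files under a directory
# that still contain a given literal value from the source farm.
# $1 = literal value to look for, $2 = directory to search.
stale_yaml_files() {
  local value="$1" dir="$2"
  grep -rlF -- "$value" "$dir" | sort
}
```

Usage: `stale_yaml_files "us2-smax.saas.microfocus.com" /mnt/efs/var/vols/itom/itsma/global-volume/yamls_output/`. Every file it prints needs a manual update before it is safe to use with kubectl create.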

Validation

Source farm not impacted
Disable or enable smtp in restored farm(optional, depends on your business)
Check the status of all the pods
Smax testing in restored farm(bo, ess page, agent page, idol search)
DND integration testing, try to execute one OO flow
CMS integration testing, try to open jmx-console, ucmdb-browser, and CI sync with smax
CGRO integration testing - Optional
Audit Service testing - Optional
Prometheus testing - all data is shown correctly in Grafana, alertmanager works

Issues you may meet

kubectl get svc returns nothing due to federated login
pv not bound while restoring the EKS farm (add the SG of the EKS control plane to the EFS inbound rules)
cms cannot start (restoring only cms from the velero backups solves it)
smartA pods fail to start because some files were not copied from the source farm (smarta-saw-con is used as the example here, but you may hit others)

kubectl scale sts smarta-saw-con --replicas=0 -n itsma-xxxxx
delete all files under the directory of: /mnt/efs/var/vols/itom/itsma/itsma-smarta-saw-con-0/smarta-saw-con-0/data
kubectl scale sts smarta-saw-con --replicas=2 -n itsma-xxxxx

pods not up because the image was not pushed to ECR (minor version difference between source and target farm)
ingress not created (reconfigure the ALB controller in kube-system)
Integration between cms and smax does not work (manually update the integration url in bo)
Integration between oo and smax does not work (manually update the integration url in the agent portal and the DB)
Grafana alerts are sent as us2-prod but actually come from us2-dev (reconfigure Grafana)
SAML login does not work (unresolved so far)
CMS integration does not work due to the different gateway url format (int.cms.fqdn in us2-dev but int.fqdn in us2-prod)
Not all rabbitmq nodes have joined the cluster; in my case only infra-rabbitmq-0 is there

kubectl scale sts infra-rabbitmq -n itsma-ohs8f --replicas=1
delete all files under the directory of /mnt/efs/var/vols/itom/itsma/rabbitmq-infra-rabbitmq-1(2)/data/xservices/rabbitmq/3.7.1.14/mnesia
kubectl scale sts infra-rabbitmq -n itsma-ohs8f --replicas=3
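The three rabbitmq steps above can be written out as a reviewable plan before anything is deleted. The sketch below only prints the commands; `rabbitmq_rejoin_plan` is a hypothetical helper, the namespace is a placeholder, and the directory layout and the 3.7.1.14 version segment are taken from the example above and may differ in your farm.

```shell
# Hypothetical helper: print (do not run) the commands that reset rabbitmq
# nodes 1 and 2 so they rejoin the cluster formed by infra-rabbitmq-0.
# $1 = itsma namespace, $2 = EFS base directory (optional).
rabbitmq_rejoin_plan() {
  local ns="$1" base="${2:-/mnt/efs/var/vols/itom/itsma}"
  local i
  echo "kubectl scale sts infra-rabbitmq -n ${ns} --replicas=1"
  for i in 1 2; do
    # The glob is printed literally; it is expanded only when the line is run.
    echo "rm -rf ${base}/rabbitmq-infra-rabbitmq-${i}/data/xservices/rabbitmq/3.7.1.14/mnesia/*"
  done
  echo "kubectl scale sts infra-rabbitmq -n ${ns} --replicas=3"
}
```

Review the output of `rabbitmq_rejoin_plan itsma-xxxxx` first, then run the printed lines one by one on the bastion where the EFS is mounted.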

Leftover

Not switched to spot instances yet
The old EFS server is still there (other resources in AWS should have been deleted)
The backup plan should be changed to save cost
The contents of yaml_outputs on the EFS server are from the source farm and should be changed manually
Some records in the parameter store are not updated and there are many invalid records
API calls are not stable according to wenjun (solved by fixing infra rabbitmq)
SAML login still fails
