# ESM Cloud Disaster and Recovery Guide
## Introduction

This guide is based on the latest ESM disaster and recovery solution: back up the data from the source farm and restore it to a new target farm.

This means you discard the current farm and restore it on the new farm (cross AWS account, cross region).

## Backup all the data from the source farm

- Backup Data
	- Back up the EFS servers for cms, smax, oomt, prometheus
	- Back up the RDS servers for cms, smax, oomt, audit service
	- Back up the Vertica DB if CGRO is enabled in the source farm (optional)
	- Back up all the k8s configuration files using velero
	- Back up all cert files in the **target** farm (smax, cdf, cms, oomt, audit) - /mnt/efs/var/vols/itom/itsma/global-volume/certificate/
- Transfer Data
	- Transfer all the snapshots to the target farm (this may take time, depending on the size of the data)
- Push all images
	- Push all images to the target farm
	- To keep the data consistent, the creation times of all the backups should not be too far apart, ideally within 2 hours of each other.
- Tips
	- Use a backup vault to transfer EFS backups.
	- Copy and share RDS snapshots with a **customer key**
	- Refer to the link: How to share an RDS snapshot
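
The 2-hour consistency window mentioned above can be checked mechanically before starting a restore. The following is a minimal sketch, not part of the original procedure: `backups_within_window` is a hypothetical helper, and it assumes you have already collected the creation time of each backup as epoch seconds.

```shell
# Hypothetical helper: succeeds when all given epoch timestamps lie
# within max_gap seconds of each other.
# Usage: backups_within_window <max_gap_seconds> <epoch> [<epoch> ...]
backups_within_window() (
  max_gap=$1; shift
  [ "$#" -ge 1 ] || exit 2          # need at least one timestamp
  min=$1; max=$1
  for ts in "$@"; do
    [ "$ts" -lt "$min" ] && min=$ts
    [ "$ts" -gt "$max" ] && max=$ts
  done
  # Status of this test is the status of the function.
  [ $((max - min)) -le "$max_gap" ]
)

# Example with made-up timestamps (EFS, RDS and velero backup times):
if backups_within_window 7200 1700000000 1700003600 1700007000; then
  echo "backups are within 2 hours of each other"
else
  echo "backups are too far apart"
fi
```

On systems with GNU date, an ISO timestamp can usually be converted with `date -d "<timestamp>" +%s` before passing it in.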

## Prepare new EKS cluster in the new target farm

- Shut down the farm that is running in the target farm (optional; skip this if enough IPs are available)
- Build a new VPC & subnets from CloudFormation (make sure you are **not** logged in to the AWS console with **SAML**; log in with your own or a service account instead) - in this case we skip this and reuse the existing resources
- Build a new EKS cluster from CloudFormation (**add or update the tags on the 3 private subnets**: `kubernetes.io/cluster/<cluster-name>=shared`; `kubernetes.io/role/internal-elb=1`)
- Build new EKS worker nodes: smax, cms, oomt, prometheus (NodeInstanceRole: the value from the Outputs tab shown when you create the EKS cluster)
- Check that the node groups are exactly the same as in the source farm (instance type, instance number, kubernetes labels)
- Build a new EKS bastion (`kubectl get nodes` should return the correct output)
- Check the security group inbound rules (add the SG of the bastion server to the EKS control plane SG inbound rules; add the EKS control plane SG to the new EFS SG inbound rules)

Refer to the link: [https://docs.microfocus.com/doc/SMAX/23.4/TasksOnAWS](https://docs.microfocus.com/doc/SMAX/23.4/TasksOnAWS)
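
The subnet tagging step can also be done from the AWS CLI instead of the console. The sketch below only prints the `aws ec2 create-tags` command so it can be reviewed first; `subnet_tag_cmd`, the cluster name and the subnet IDs are hypothetical placeholders.

```shell
# Hypothetical helper: print the AWS CLI command that applies both
# required tags to the given private subnets (dry run, nothing is executed).
# Usage: subnet_tag_cmd <cluster-name> <subnet-id> [<subnet-id> ...]
subnet_tag_cmd() (
  cluster=$1; shift
  printf 'aws ec2 create-tags --resources %s --tags Key=kubernetes.io/cluster/%s,Value=shared Key=kubernetes.io/role/internal-elb,Value=1\n' \
    "$*" "$cluster"
)

# Example with placeholder IDs; review the output before running it.
subnet_tag_cmd my-eks subnet-a subnet-b subnet-c
```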

## Setting up velero

- Download the velero binary and copy it into $PATH (`wget https://github.com/vmware-tanzu/velero/releases/download/v1.4.2/velero-v1.4.2-linux-amd64.tar.gz && tar -zxvf velero-v1.4.2-linux-amd64.tar.gz && cd velero-* && chmod a+x velero && mv velero /usr/local/bin/`)
- Create a bucket in S3 for velero
- Set up the velero deployment (`velero install --provider aws --plugins velero/velero-plugin-for-aws:v1.1.0 --bucket $BUCKET --backup-location-config region=$REGION --snapshot-location-config region=$REGION --secret-file ./credentials-velero`)
- Check that velero works by running: `velero backup create test1`

Refer to the link to install velero: [https://github.com/vmware-tanzu/velero-plugin-for-aws](https://github.com/vmware-tanzu/velero-plugin-for-aws)

You can also refer to the link to config velero backups automatically: [https://github.houston.softwaregrp.net/smax-saas-ops/saas-devops-tools/blob/master/velero\_backup.sh](https://github.houston.softwaregrp.net/smax-saas-ops/saas-devops-tools/blob/master/velero_backup.sh)

Velero should be installed on both the source farm and the target farm. In the SaaS farms we have set up one user for DR in each farm; please install velero using that account.
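
The `velero install` command reads the AWS keys of the DR user from the `credentials-velero` file. A minimal sketch of creating that file, assuming the profile layout described in the velero-plugin-for-aws README; `write_velero_credentials` and the key values are hypothetical placeholders.

```shell
# Hypothetical helper: write an AWS credentials file for
# `velero install --secret-file`.
# Usage: write_velero_credentials <file> <access_key_id> <secret_access_key>
write_velero_credentials() (
  file=$1 key_id=$2 secret=$3
  umask 077   # a newly created file is readable by the current user only
  printf '[default]\naws_access_key_id=%s\naws_secret_access_key=%s\n' \
    "$key_id" "$secret" > "$file"
)

# Example with placeholder values:
# write_velero_credentials ./credentials-velero "<access key id>" "<secret access key>"
```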

## Restore infra in target farm

- Restore the new smax RDS server from snapshot - pay attention to the RDS type, storage type & size
- Restore the new cms RDS server from snapshot
- Restore the new oomt RDS server from snapshot (if present)
- Restore the new audit RDS server from snapshot (if present)
- Restore the Vertica DB for CGRO (optional, if CGRO is enabled in the farm)
- Restore the new smax EFS server from snapshot (you should **Add mount target** after the restore so that IPs are assigned; same for the cms & prometheus EFS servers) - **Time-consuming task**
- Restore the new cms EFS server from snapshot
- Restore the new oomt EFS server from snapshot
- Restore the new prometheus EFS server from snapshot (optional, if you care about the prometheus data)

To save time, the tasks in this section can be done **in parallel**.
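
If the restores are scripted, running them in parallel can be sketched with background jobs and `wait`. `run_parallel` is a hypothetical helper, and the instance and snapshot identifiers in the commented example are placeholders, not values from this guide.

```shell
# Hypothetical helper: run every command string in the background,
# wait for all of them, and fail if any one of them failed.
# Usage: run_parallel "<command 1>" "<command 2>" ...
run_parallel() (
  pids=""
  for cmd in "$@"; do
    sh -c "$cmd" &
    pids="$pids $!"
  done
  failed=0
  for pid in $pids; do
    wait "$pid" || failed=1
  done
  exit "$failed"
)

# Example (placeholder identifiers; adjust instance class, storage, etc.):
# run_parallel \
#   "aws rds restore-db-instance-from-db-snapshot --db-instance-identifier target-smax --db-snapshot-identifier smax-snapshot" \
#   "aws rds restore-db-instance-from-db-snapshot --db-instance-identifier target-cms --db-snapshot-identifier cms-snapshot"
```

These commands only start the restores; the instances still need time to become available afterwards.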

## Update K8S resources

- Download the current CDF installation bundle to the new bastion and run: `./install --capabilities Tools=true,Monitoring=false,LogCollection=false,DeploymentManagement=false,ClusterManagement=false`
- Download the velero backups and the shell script used to batch update the parameters in the K8S resources (put the shell script in the **backups** directory so that there are **9** files in total)
- Replace all images in the velero backups, e.g.: `sh replaceVeleroConf.sh "551360491748.dkr.ecr.us-west-2.amazonaws.com\/hpeswitom" "551360491750.dkr.ecr.us-west-1.amazonaws.com\/hpeswitomsandbox" false`
- Replace the AWS account (if changed): `sh replaceVeleroConf.sh source_aws_account target_aws_account false`
- Replace the region (if changed): `sh replaceVeleroConf.sh us-west-2 us-west-1 false`
- Replace the org name (if changed): `sh replaceVeleroConf.sh "\"hpeswitom\"" "\"hpeswitomsandbox\"" false`
- Replace the FQDN (if changed): `sh replaceVeleroConf.sh "us2-smax.saas.microfocus.com" "us2-smax-testing.saas.microfocus.com" false` - using the change FQDN script is another approach; take care of the certificate and SAML
- Replace the EFS servers in the velero backups (if you are restoring on the same farm and to the same EFS, you can skip this step because the EFS server endpoints have not changed)

```
sh replaceVeleroConf.sh source_smax_efs target_smax_efs false
sh replaceVeleroConf.sh source_cms_efs target_cms_efs false
sh replaceVeleroConf.sh source_oomt_efs target_oomt_efs false
sh replaceVeleroConf.sh source_prometheus_efs target_prometheus_efs false
```

- Replace the Vertica server in the velero backups (optional) - `sh replaceVeleroConf.sh source_vertica_ip target_vertica_ip false` (if you are restoring on the same farm and the Vertica IP has not changed, you can skip this step)
- Replace the RDS servers in the velero backups (if you are restoring on the same farm and the RDS endpoints have not changed, you can skip this step)

```
sh replaceVeleroConf.sh source_smax_rds target_smax_rds false
sh replaceVeleroConf.sh source_cms_rds target_cms_rds false
sh replaceVeleroConf.sh source_oomt_rds target_oomt_rds false
sh replaceVeleroConf.sh source_audit_rds target_audit_rds true
```

- Upload the updated backup files to the target S3 bucket (`rm -rf replaceVeleroConf.sh && cd .. && aws s3 cp --recursive backup_Name/ s3://target_bucket/backups/backup_Name/`)
- Check that you now have the correct backups (`velero backup get` should now return the backup from the source farm)
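
The contents of `replaceVeleroConf.sh` are not shown in this guide. To illustrate the kind of substitution the steps above rely on, here is a minimal sketch of a literal find-and-replace over a directory of plain-text files. It is not the actual script: `replace_in_tree` is a hypothetical helper, it does not model the script's third argument, and it assumes the backup files have already been unpacked to plain text.

```shell
# Hypothetical helper: replace every literal occurrence of <old> with <new>
# in all files under <dir> that contain <old>.
# Usage: replace_in_tree <dir> <old> <new>
replace_in_tree() (
  dir=$1 old=$2 new=$3
  # Escape regex metacharacters so <old> is matched literally by sed.
  old_re=$(printf '%s' "$old" | sed 's/[][\.*^$|]/\\&/g')
  # Escape characters that are special in the sed replacement text.
  new_re=$(printf '%s' "$new" | sed 's/[\&|]/\\&/g')
  grep -rlF -- "$old" "$dir" | while IFS= read -r f; do
    # Write through a temp file, then copy back to keep the file mode.
    sed "s|$old_re|$new_re|g" "$f" > "$f.tmp" && cat "$f.tmp" > "$f" && rm -f "$f.tmp"
    echo "updated $f"
  done
)

# Example with placeholder endpoints:
# replace_in_tree ./backup_Name source_smax_efs target_smax_efs
```

Because the helper escapes both strings itself, the slashes in a registry path do not need to be pre-escaped as they are in the `replaceVeleroConf.sh` calls.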

## Perform restore in target farm

- Disable SMTP in the target farm (optional) - do this by adding an outbound rule to the Network ACLs (id: 99, Port 25, Deny all)
- Mount the new EFS servers for smax, cms, oomt, prometheus on the new bastion (`mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2 <SMAX EFS endpoint>:/ /mnt/efs`) - you can skip this step if the EFS endpoints have not changed
- Also add the EFS servers to /etc/fstab, otherwise the mount points are lost after a VM restart - you can skip this step if the EFS endpoints have not changed
- Delete the PVs if they have already been created
- Delete the namespaces itsma-*xxxxx*, cms, core, oomt, audit, prometheus if they still exist
- Perform a full restore: `velero restore create --from-backup <backup.all.example1> --wait`
- Restore only one namespace (optional): `velero restore create --from-backup <backup.all.example1> --wait --include-namespaces=cms`
- Update the credentials of the itom-vault container if the pod cannot start

`PASSPHRASE=$(kubectl get secret vault-passphrase -n core -o json 2>/dev/null | jq -r '.data.passphrase')`
`VAULT_CREDENTIAL_SECRET=$(kubectl get secret vault-credential -n core -o json 2>/dev/null )`
`ENCRYPTED_ROOT_TOKEN=$(echo ${VAULT_CREDENTIAL_SECRET} | jq -r '.data."root.token"')`
`VAULT_TOKEN=$(echo ${ENCRYPTED_ROOT_TOKEN} | openssl aes-256-cbc -md sha256 -a -d -pass pass:"${PASSPHRASE}")`
`echo ${VAULT_TOKEN}`

`kubectl exec -it $(kubectl get pod -ncore -ocustom-columns=NAME:.metadata.name |grep itom-vault| head -1) -ncore -- bash`
`export VAULT_ADDR=https://itom-vault.core:8200`
`export VAULT_TOKEN=<VAULT_TOKEN>`
`vault write -tls-skip-verify auth/kubernetes/config kubernetes_host="https://kubernetes.default" kubernetes_ca_cert=@/var/run/secrets/kubernetes.io/serviceaccount/ca.crt`

- Helm upgrade apphub for cdf - all Helm releases should be updated; this includes core, cms and maybe oomt in the future

`/root/cdf/bin/helm get values apphub -n core > apphub.yaml`
Update apphub.yaml with the new values (dburl, host, registry, orgName, externalAccessHost)
`/root/cdf/bin/helm upgrade apphub /root/cdf/charts/apphub-1.20.0+20211100.219.tgz -f apphub.yaml -n core`

- Helm upgrade the cms releases - update smax.crt, database.host, smax.host, orgName, registry, externalAccessHost, idmAuthUrl, idmServiceUrl (pay attention to host and idmServiceUrl; the values differ between SaaS farms)
- Helm upgrade apphub for prometheus - update orgName, registry, externalAccessHost
- Helm upgrade the oomt releases (optional, if oomt is enabled)
- Helm upgrade the audit service releases (optional, if the audit service is enabled)
- Wait until all the pods are up (`kubectl get pod --all-namespaces|grep -vE '1/1|2/2|3/3|4/4|Completed'`)
- Known issue: if SMAX has been converted to Helm, you have to run **helm upgrade** for itsma, because most DND pods are waiting for the jobs
- Sometimes the dnd-upgrade jobs fail; in that case delete those pods and any related pods that are stuck in Init state
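
The `grep -vE '1/1|2/2|3/3|4/4|Completed'` filter above only recognises pods with up to four containers. As an alternative sketch, the READY column can be compared numerically with awk; `not_ready_pods` is a hypothetical helper that assumes the default column layout of `kubectl get pod --all-namespaces` (NAMESPACE, NAME, READY, STATUS, ...).

```shell
# Hypothetical helper: read `kubectl get pod --all-namespaces` output on
# stdin and print the pods whose ready count differs from their container
# count, ignoring the header line and Completed pods.
not_ready_pods() {
  awk 'NR > 1 && $4 != "Completed" {
         split($3, r, "/")
         if (r[1] != r[2]) print $1 "/" $2 " " $3 " " $4
       }'
}

# Example:
# kubectl get pod --all-namespaces | not_ready_pods
```

An empty result means every pod that has not completed reports all of its containers as ready.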

## Certificates

- Update the SMAX cert by: `./replaceExternalAccessHost.sh -c <certificate_path> -k <key_path> -t <cacert_path> -n <new FQDN>`
- Update CMS and SAM cert by:

Get current cms cert from **source** farm:

```
helm ls -n cms && helm get values cms-release -n cms > /tmp/cms.yaml
```


Put the cms cert files under the directory /mnt/efs/var/vols/itom/itsma/global-volume/certificate/source/; the cert files are imported automatically (make sure the owner is set to 1999:1999).
Restart the platform and platform-offline pods.

- You can also update cert files for DND and OO: [https://docs.microfocus.com/doc/SMAX/23.4/SMAXChangeFQDN](https://docs.microfocus.com/doc/SMAX/23.4/SMAXChangeFQDN)
- Update cert files for OOMT if OOMT is enabled
- Update cert files for Audit service if audit is enabled

## Application load balancer

- Configure the load balancer for smax - refer to: [https://docs.microfocus.com/doc/SMAX/23.4/EKSDeploySuite](https://docs.microfocus.com/doc/SMAX/23.4/EKSDeploySuite)
- Configure the load balancer for the management portal: 5443
- Configure the load balancer for prometheus (optional)
- Rebuild the ALB controller in kube-system (delete the aws-load-balancer-controller deployment in the kube-system namespace and recreate it; pay attention to the values of cluster-name and region)
- Delete and rebuild the 3 ingresses for cms - note that the ALB name will change
- Delete and rebuild the 3 ingresses for oomt (optional)
- Delete and rebuild the 3 ingresses for audit (optional)
- Bind the DNS records in Route53 for smax, cms, oomt and the audit service

You can configure the ALB controller by following the guide: [https://docs.aws.amazon.com/eks/latest/userguide/aws-load-balancer-controller.html](https://docs.aws.amazon.com/eks/latest/userguide/aws-load-balancer-controller.html)

Also note that all nodePorts will change; register the ALB with the new nodePorts so that the targets report a healthy status.

The ALB for cms also changes (a random string is appended to the DNS name), so you have to update Route53 with the correct values.

## Manual updates after restore

- Update sensitive data, depending on your business needs (update the SMTP server, update integration passwords, delete some customer tenants, update the BO password for some tenants)
- Update the CMS integration URL in the BO page (Select tenant → Application settings → Configuration Management settings, update CMS gateway service)
- Update the SAM integration URL in the BO page (Select tenant → Capability settings → Software Asset Management, update CMS gateway url)
- Update the DND integration URL in the Agent portal (Open tenant agent page → Administration → Providers → Aggregation providers)
- Update csa_access_point in the DB (e.g. `update dnd_339803511.csa_access_point set uri='https://us2-smax-testing.saas.microfocus.com/339803511/oo' where uuid='8a50b56d7406291f01740629c9f9013a';`)
- Update the OOMT integration URL in the BO page (Select tenant → Capability settings, update OO integration URL and OO login URL)
- Update the OPB agent and endpoints in the Agent portal
- Update the topology in OO Deployment Operations (the RAS server has to be reconfigured)
- Update the settings in prometheus and Grafana - Optional

Update the cm of itom-grafana (append the values below to data.grafana.ini.root_url)
`root_url = https://us2-smax-testing.saas.microfocus.com/grafana`
`[smtp]`
`enabled = true`
`host = email-smtp.us-west-2.amazonaws.com:25`
`user = aws_access_key_id`
`password = aws_secret_access_key`
`skip_verify = true`
`from_address = sma_noreply@microfocus.com`
`from_name = US2Dev_Grafana`
`[rendering]`
`server_url = http://bitnami-grafana-image-renderer:8080/render`
`callback_url = https://itom-grafana:80/`


**Restart the itom-grafana-xxxxx pod**
Open Grafana and update the user of the datasource; make sure you are using the correct key for the right farm

- Update yamls\_outputs on the SMAX EFS server (it is better to change all yaml files to read-only)
- Please note that the **cms integration url differs between farms**: e.g. [https://int.cms.fqdn:445/cms-gateway](https://int.cms.fqdn:445/cms-gateway) in us2-dev and [https://int.fqdn:445/cms-gateway](https://int.fqdn:445/cms-gateway) in us2-prod

We have updated the yamls for the current deployments, but the values in /mnt/efs/var/vols/itom/itsma/global-volume/yamls\_output/ are still unchanged, so if we execute the command:

```
kubectl delete -f xxxx.deployment.yaml && kubectl create -f xxxx.deployment.yaml
```

the pods cannot start.
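
The read-only suggestion for the yaml files could be applied in one pass. A minimal sketch, assuming the files are regular files with a `.yaml` or `.yml` extension; `make_yamls_readonly` is a hypothetical helper and the path in the example is the directory named above.

```shell
# Hypothetical helper: remove write permission from every yaml file
# under the given directory; other files are left untouched.
# Usage: make_yamls_readonly <dir>
make_yamls_readonly() (
  dir=$1
  find "$dir" -type f \( -name '*.yaml' -o -name '*.yml' \) -exec chmod a-w {} +
)

# Example:
# make_yamls_readonly /mnt/efs/var/vols/itom/itsma/global-volume/yamls_output
```

Read-only files do not fix the stale source-farm values; they only guard against accidental edits once the files have been corrected.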

## Validation

- The source farm is not impacted
- Disable or enable SMTP in the restored farm (optional, depending on your business needs)
- Check the status of all the pods
- SMAX testing in the restored farm (BO, ESS page, agent page, IDOL search)
- DND integration testing: try to execute one OO flow
- CMS integration testing: try to open jmx-console and ucmdb-browser, and run a CI sync with smax
- CGRO integration testing - Optional
- Audit Service testing - Optional
- Prometheus testing - all data is shown correctly in Grafana, alertmanager works

## Issues you may meet

- `kubectl get svc` returns nothing because of federated login
- PVs are not bound while restoring the EKS farm (add the SG of the EKS control plane to the EFS inbound rules)
- cms cannot start (restoring only cms from the velero backups solves it)
- smartA pods fail to start because some files were not copied from the source farm (smarta-saw-con is used as an example here, but you may hit others)

```
kubectl scale sts smarta-saw-con --replicas=0 -n itsma-xxxxx
# delete all files under the data directory of the pod
rm -rf /mnt/efs/var/vols/itom/itsma/itsma-smarta-saw-con-0/smarta-saw-con-0/data/*
kubectl scale sts smarta-saw-con --replicas=2 -n itsma-xxxxx
```

- Pods are not up because images were not pushed to ECR (minor version difference between the source & target farms)
- Ingress is not created (reconfigure the ALB controller in kube-system)
- Integration does not work between cms & smax (manually update the integration URL in BO)
- Integration does not work between oo & smax (manually update the integration URL in the agent page & DB)
- Grafana alerts are sent as us2-prod but actually come from us2-dev (reconfigure Grafana)
- SAML login does not work (so far)
- CMS integration does not work because of the different gateway URL format (int.**cms**.fqdn in us2-dev but int.fqdn in us2-prod)
- Not all rabbitmq nodes are added to the cluster; in my case only infra-rabbitmq-0 was there

`kubectl scale sts infra-rabbitmq -n itsma-ohs8f --replicas=1`
Delete all files under the directory `/mnt/efs/var/vols/itom/itsma/rabbitmq-infra-rabbitmq-1(2)/data/xservices/rabbitmq/3.7.1.14/mnesia` (for both rabbitmq-1 and rabbitmq-2)
`kubectl scale sts infra-rabbitmq -n itsma-ohs8f --replicas=3`

## Leftover

- Not switched to spot instances yet
- The old EFS server is still there (the other resources in AWS should have been deleted)
- The backup plan should be changed to save cost
- The contents of yaml\_outputs on the EFS server are from the source farm and should be changed manually
- Some records in the parameter store are not updated and there are many invalid records
- API calls are not stable according to wenjun (solved by infra rabbitmq)
- SAML login still fails