# ESM-Cloud-Disaster-and-Recovery-Guide_686087723

## Introduction

This guide is based on the latest ESM disaster recovery solution: back up the data from the source farm and restore it to a new target farm.

This means you discard the current farm and restore it on the new farm (cross AWS account, cross region).

## Back up all the data from the source farm

- Back up data
  - Back up the EFS servers for cms, smax, oomt, prometheus
  - Back up the RDS servers for cms, smax, oomt, audit service
  - Back up the Vertica DB if CGRO is enabled in the source farm (optional)
  - Back up all the k8s configuration files using velero
  - Back up all cert files in the **target** farm (smax, cdf, cms, oomt, audit) - /mnt/efs/var/vols/itom/itsma/global-volume/certificate/
- Transfer data
  - Transfer all the snapshots to the target farm (this may take a while, depending on the size of the data)
- Push all images
  - Push all images to the target farm
- To keep the data consistent, the creation times of all the backups should be close together, ideally within 2 hours of each other.
- Tips
  - Use a backup vault to transfer the EFS backups.
  - Copy and share RDS snapshots with a **customer key**
  - Refer to the link: How to share an RDS snapshot

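
The two-hour consistency window mentioned above can be checked mechanically once the creation times are known. The following is a minimal sketch, assuming each backup's creation time has already been collected as a Unix epoch timestamp; the function name `backups_within_window` and the `MAX_SPREAD` variable are illustrative and not part of any ESM tooling.

```shell
# Hypothetical helper: succeeds when all given epoch timestamps lie within
# MAX_SPREAD seconds of each other (default: 7200 seconds = 2 hours).
backups_within_window() {
    max_spread="${MAX_SPREAD:-7200}"
    min=""
    max=""
    for ts in "$@"; do
        if [ -z "$min" ] || [ "$ts" -lt "$min" ]; then min="$ts"; fi
        if [ -z "$max" ] || [ "$ts" -gt "$max" ]; then max="$ts"; fi
    done
    # Calling the helper without any timestamp counts as a failure.
    [ -n "$min" ] || return 1
    [ $((max - min)) -le "$max_spread" ]
}

# Example with made-up creation times for the EFS, RDS and velero backups.
if backups_within_window 1700000000 1700003600 1700005400; then
    echo "backups are within the window"
else
    echo "backups are too far apart"
fi
```

If the check fails, retaking the oldest backup is usually simpler than trying to reconcile data taken hours apart.
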
## Prepare a new EKS cluster in the new target farm

- Shut down the farm that is running in the target farm (optional; you can skip this if enough IP addresses are available)
- Build a new VPC and subnets from CloudFormation (make sure you are **not** logged in to the AWS console through **SAML**; log in with your own account or a service account instead) - in this case, we skip this step and reuse the existing resources
- Build a new EKS cluster from CloudFormation (**add or update tags on the 3 private subnets**: kubernetes.io/cluster/*<cluster-name>*=shared; kubernetes.io/role/internal-elb=1)
- Build new EKS worker nodes: smax, cms, oomt, prometheus (NodeInstanceRole: use the value from the Outputs tab shown when you create the EKS cluster)
- Check that the node groups are exactly the same as in the source farm (instance type, number of instances, Kubernetes labels)
- Build a new EKS bastion (verify that kubectl get nodes returns the expected output)
- Check the security group inbound rules (add the SG of the bastion server to the inbound rules of the EKS control plane SG; add the EKS control plane SG to the inbound rules of the new EFS SG)

Refer to the link: [https://docs.microfocus.com/doc/SMAX/23.4/TasksOnAWS](https://docs.microfocus.com/doc/SMAX/23.4/TasksOnAWS)

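
The subnet tags required above can also be applied from the command line instead of the console. The sketch below only prints the `aws ec2 create-tags` call so that it can be reviewed before it is run; the function name `subnet_tag_command`, the cluster name and the subnet IDs are placeholders, not values from any real farm.

```shell
# Hypothetical helper: prints (does not execute) the AWS CLI call that adds
# the two tags EKS expects on the private subnets.
# Usage: subnet_tag_command <cluster-name> <subnet-id>...
subnet_tag_command() {
    cluster="$1"
    shift
    printf 'aws ec2 create-tags --resources %s --tags Key=kubernetes.io/cluster/%s,Value=shared Key=kubernetes.io/role/internal-elb,Value=1\n' "$*" "$cluster"
}

# Example with placeholder identifiers.
subnet_tag_command my-cluster subnet-aaa subnet-bbb subnet-ccc
```

Once the printed command looks right, it can be pasted into a shell that has credentials for the target account.
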
## Setting up velero

- Download the velero binary and copy it into $PATH (wget https://github.com/vmware-tanzu/velero/releases/download/v1.4.2/velero-v1.4.2-linux-amd64.tar.gz && tar -zxvf velero-v1.4.2-linux-amd64.tar.gz && cd velero-v1.4.2-linux-amd64 && chmod a+x velero && mv velero /usr/local/bin/)
- Create a bucket in S3 for velero
- Set up the velero deployment (velero install --provider aws --plugins velero/velero-plugin-for-aws:v1.1.0 --bucket $BUCKET --backup-location-config region=$REGION --snapshot-location-config region=$REGION --secret-file ./credentials-velero)
- Check that velero works by running: velero backup create test1

Refer to the link to install velero: [https://github.com/vmware-tanzu/velero-plugin-for-aws](https://github.com/vmware-tanzu/velero-plugin-for-aws)

You can also refer to this link to configure automatic velero backups: [https://github.houston.softwaregrp.net/smax-saas-ops/saas-devops-tools/blob/master/velero\_backup.sh](https://github.houston.softwaregrp.net/smax-saas-ops/saas-devops-tools/blob/master/velero_backup.sh)

Velero should be installed on both the source farm and the target farm. In the SaaS farms we have set up one user for DR in each farm; please install velero using that account.

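
The `--secret-file ./credentials-velero` argument above expects an AWS-profile-style credentials file, as described in the velero-plugin-for-aws README. A minimal sketch for generating it follows, assuming the DR user's keys are available in the `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` environment variables; the function name `write_velero_credentials` is illustrative.

```shell
# Hypothetical helper: writes the credentials file consumed by
# `velero install --secret-file`. The key values are read from the
# environment and must belong to the DR user of the farm.
write_velero_credentials() {
    target="${1:-./credentials-velero}"
    (
        # Restrict permissions, since the file contains a secret key.
        umask 077
        cat > "$target" <<EOF
[default]
aws_access_key_id=${AWS_ACCESS_KEY_ID}
aws_secret_access_key=${AWS_SECRET_ACCESS_KEY}
EOF
    )
}

# Usage: write_velero_credentials ./credentials-velero
```

Delete the file once velero is installed, because the same keys are then stored in a secret inside the cluster.
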
## Restore infra in target farm

- Restore the new smax RDS server from snapshot - pay attention to the RDS type, storage type and size
- Restore the new cms RDS server from snapshot
- Restore the new oomt RDS server from snapshot (if present)
- Restore the new audit RDS server from snapshot (if present)
- Restore the Vertica DB for CGRO (optional, if CGRO is enabled in the farm)
- Restore the new smax EFS server from snapshot (you should **Add mount target** after the restore so that IPs are assigned; the same applies to the cms and prometheus EFS servers) - **Time-consuming task**
- Restore the new cms EFS server from snapshot
- Restore the new oomt EFS server from snapshot
- Restore the new prometheus EFS server from snapshot (optional, if you care about the prometheus data)

To save time, the tasks in this section can be done **in parallel**.

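
If the restores are scripted, running them in parallel comes down to starting each one in the background and waiting for all of them. The sketch below shows only that pattern; the step functions `restore_smax_rds` and `restore_cms_rds` are placeholders that would have to wrap the real restore calls, and `run_in_parallel` is an illustrative name.

```shell
# Hypothetical wrapper: runs every given command or function name in the
# background, waits for all of them, and fails if any one of them failed.
run_in_parallel() {
    pids=""
    for step in "$@"; do
        "$step" &
        pids="$pids $!"
    done
    status=0
    for pid in $pids; do
        wait "$pid" || status=1
    done
    return "$status"
}

# Placeholder steps; in practice each would wrap one restore task.
restore_smax_rds() { echo "restoring smax rds"; }
restore_cms_rds()  { echo "restoring cms rds"; }

run_in_parallel restore_smax_rds restore_cms_rds && echo "all restores finished"
```

Collecting the exit status per step matters here: a failed EFS or RDS restore should stop the procedure before the velero restore starts.
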
## Update K8S resources

- Download the current CDF installation bundle to the new bastion and run: ./install --capabilities Tools=true,Monitoring=false,LogCollection=false,DeploymentManagement=false,ClusterManagement=false
- Download the velero backups and the shell script that batch-updates the parameters in the K8S resources (put the shell script in the **backups** directory so that it contains **9** files in total)
- Replace all images in the velero backups, e.g.: sh replaceVeleroConf.sh "*551360491748*.dkr.ecr.*us-west-2*.amazonaws.com\\/*hpeswitom*" "*551360491750*.dkr.ecr.*us-west-1*.amazonaws.com\\/*hpeswitomsandbox*" false
- Replace the AWS account (if changed): sh replaceVeleroConf.sh *source\_aws\_account* *target\_aws\_account* false
- Replace the region (if changed): sh replaceVeleroConf.sh *us-west-2* *us-west-1* false
- Replace the org name (if changed): sh replaceVeleroConf.sh "\\"*hpeswitom*\\"" "\\"*hpeswitomsandbox*\\"" false
- Replace the FQDN (if changed): sh replaceVeleroConf.sh "*us2-smax.saas.microfocus.com*" "*us2-smax-testing.saas.microfocus.com*" false - using the change-FQDN script is another approach. Take care of the certificate and SAML.
- Replace the EFS servers in the velero backups (if you are restoring on the same farm and to the same EFS, you can skip this step since the EFS server endpoints have not changed)

```
sh replaceVeleroConf.sh source_smax_efs target_smax_efs false
sh replaceVeleroConf.sh source_cms_efs target_cms_efs false
sh replaceVeleroConf.sh source_oomt_efs target_oomt_efs false
sh replaceVeleroConf.sh source_prometheus_efs target_prometheus_efs false
```

- Replace the Vertica server in the velero backups (optional) - sh replaceVeleroConf.sh *source\_vertica\_ip* *target\_vertica\_ip* false (if you are restoring on the same farm and the Vertica IP has not changed, you can skip this step)
- Replace the RDS servers in the velero backups (if you are restoring on the same farm and the RDS endpoints have not changed, you can skip this step)

```
sh replaceVeleroConf.sh source_smax_rds target_smax_rds false
sh replaceVeleroConf.sh source_cms_rds target_cms_rds false
sh replaceVeleroConf.sh source_oomt_rds target_oomt_rds false
sh replaceVeleroConf.sh source_audit_rds target_audit_rds true
```

- Upload the updated backup files to the target S3 bucket (rm -rf replaceVeleroConf.sh && cd .. && aws s3 cp --recursive *backup\_Name*/ s3://target\_bucket/backups/backup\_Name/)
- Check that you now have the correct backups (velero backup get should now return the backup from the source farm)

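
The contents of replaceVeleroConf.sh are not shown in this guide. To illustrate the kind of bulk replacement it is used for, here is a minimal sketch that is **not** the real script: it assumes the backup contents have already been unpacked into plain-text files, relies on GNU sed for `-i`, and ignores the third (true/false) argument, whose meaning is not documented here. The function name `replace_in_backups` is made up.

```shell
# Hypothetical stand-in for a bulk replace; NOT the real replaceVeleroConf.sh.
# Replaces every literal occurrence of OLD with NEW in the files under DIR.
# Usage: replace_in_backups OLD NEW [DIR]
replace_in_backups() {
    old="$1"
    new="$2"
    dir="${3:-.}"
    [ -n "$old" ] || return 1
    # Escape characters that are special to sed so both strings are literal.
    old_esc=$(printf '%s' "$old" | sed 's|[][\\.*^$/]|\\&|g')
    new_esc=$(printf '%s' "$new" | sed 's|[\\&/]|\\&|g')
    # Only touch files that actually contain the old value.
    grep -rlF -- "$old" "$dir" | while IFS= read -r f; do
        sed -i "s/$old_esc/$new_esc/g" "$f"
    done
}

# Usage: replace_in_backups us-west-2 us-west-1 ./backups
```

Because this sketch escapes both strings itself, values containing slashes can be passed as they are, without the manual `\/` escaping used in the replaceVeleroConf.sh examples above.
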
## Perform restore in target farm

- Disable SMTP in the target farm (optional) - do this by adding an outbound rule to the Network ACLs (id: 99, Port 25, Deny all)
- Mount the new EFS servers for smax, cms, oomt, prometheus on the new bastion (mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2 *<SMAX EFS endpoint>:/* */mnt/efs*) - you can skip this step if the EFS endpoints have not changed
- Also add the EFS mounts to /etc/fstab, otherwise the mount points are lost after a VM restart - you can skip this step if the EFS endpoints have not changed
- Delete the PVs in case they have already been created
- Delete the namespaces itsma-*xxxxx*, cms, core, oomt, audit, prometheus if they still exist
- Perform a full restore: velero restore create --from-backup <backup.all.example1> --wait
- Restore only one namespace (optional): velero restore create --from-backup <backup.all.example1> --wait --include-namespaces=*cms*
- Update the credentials of the itom-vault container in case the pod cannot come up

PASSPHRASE=$(kubectl get secret vault-passphrase -n core -o json 2>/dev/null | jq -r '.data.passphrase')

VAULT\_CREDENTIAL\_SECRET=$(kubectl get secret vault-credential -n core -o json 2>/dev/null)

ENCRYPTED\_ROOT\_TOKEN=$(echo ${VAULT\_CREDENTIAL\_SECRET} | jq -r '.data."root.token"')

VAULT\_TOKEN=$(echo ${ENCRYPTED\_ROOT\_TOKEN} | openssl aes-256-cbc -md sha256 -a -d -pass pass:"${PASSPHRASE}")

echo ${VAULT\_TOKEN}

kubectl exec -it $(kubectl get pod -ncore -ocustom-columns=NAME:.metadata.name |grep itom-vault| head -1) -ncore -- bash

export VAULT\_ADDR=https://itom-vault.core:8200

export VAULT\_TOKEN=<VAULT\_TOKEN>

vault write -tls-skip-verify auth/kubernetes/config kubernetes\_host="https://kubernetes.default" kubernetes\_ca\_cert=@/var/run/secrets/kubernetes.io/serviceaccount/ca.crt

- Helm upgrade apphub for cdf - all helm releases should be updated; this includes core, cms and maybe oomt in the future

/root/cdf/bin/helm get values apphub -n core > apphub.yaml

Update apphub.yaml with the new values (dburl, host, registry, orgName, externalAccessHost)

/root/cdf/bin/helm upgrade apphub /root/cdf/charts/apphub-1.20.0+20211100.219.tgz -f apphub.yaml -n core

- Helm upgrade the cms releases - update smax.crt, database.host, smax.host, orgName, registry, externalAccessHost, idmAuthUrl, idmServiceUrl (pay attention to host and idmServiceUrl; their values differ between SaaS farms)
- Helm upgrade apphub for prometheus - update orgName, registry, externalAccessHost
- Helm upgrade the oomt releases (optional, if you have enabled oomt)
- Helm upgrade the audit service releases (optional, if you have enabled the audit service)
- Wait until all the pods are up (kubectl get pod --all-namespaces|grep -vE '1/1|2/2|3/3|4/4|Completed')
- There is a known issue if smax has been transformed to helm: you have to do the **helm upgrade** for itsma, since most DND pods are waiting for the jobs
- Sometimes the dnd-upgrade jobs fail; just delete those pods and the related pods that are stuck in Init states

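
The grep pattern used above to wait for the pods only recognises ready counts from 1/1 to 4/4, so a healthy pod with five or more containers would still be listed. A minimal alternative sketch follows; it compares the two halves of the READY column instead. The function name `not_ready_pods` is illustrative, and the filter assumes the default column layout of `kubectl get pod --all-namespaces` with its header line.

```shell
# Hypothetical filter: reads `kubectl get pod --all-namespaces` output on
# stdin and prints the rows whose READY count is incomplete.
# Assumed columns: NAMESPACE NAME READY STATUS RESTARTS AGE.
# The header line is skipped and Completed pods are ignored.
not_ready_pods() {
    awk 'NR > 1 && $4 != "Completed" {
        split($3, ready, "/")
        if (ready[1] != ready[2]) print
    }'
}

# Usage: kubectl get pod --all-namespaces | not_ready_pods
```

An empty result means every pod is either fully ready or completed.
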
## Certificates

- Update the SMAX cert by running: ./replaceExternalAccessHost.sh -c *<certificate\_path>* -k *<key\_path>* -t *<cacert\_path>* -n *<new FQDN>*
- Update the CMS and SAM certs by:

Get the current cms cert from the **source** farm:

```
helm ls -n cms && helm get values cms-release -n cms > /tmp/cms.yaml
```

Put the cms cert files under the directory /mnt/efs/var/vols/itom/itsma/global-volume/certificate/source/; the cert files will be imported automatically (make sure the ownership is set to 1999:1999)

Restart the platform and platform-offline pods

- You can also update the cert files for DND and OO: [https://docs.microfocus.com/doc/SMAX/23.4/SMAXChangeFQDN](https://docs.microfocus.com/doc/SMAX/23.4/SMAXChangeFQDN)
- Update the cert files for OOMT if OOMT is enabled
- Update the cert files for the Audit service if audit is enabled

## Application load balancer

- Configure the load balancer for smax - refer to: [https://docs.microfocus.com/doc/SMAX/23.4/EKSDeploySuite](https://docs.microfocus.com/doc/SMAX/23.4/EKSDeploySuite)
- Configure the load balancer for the management portal: 5443
- Configure the load balancer for prometheus (optional)
- Rebuild the ALB controller in kube-system (delete the aws-load-balancer-controller deployment in the kube-system namespace and recreate it; pay attention to the values of cluster-name and region)
- Delete and rebuild the 3 ingresses for cms - please note that the ALB name will change
- Delete and rebuild the 3 ingresses for oomt (optional)
- Delete and rebuild the 3 ingresses for audit (optional)
- Bind the DNS records in Route53 for smax, cms, oomt and the audit service

You can configure the ALB controller by following this guide: [https://docs.aws.amazon.com/eks/latest/userguide/aws-load-balancer-controller.html](https://docs.aws.amazon.com/eks/latest/userguide/aws-load-balancer-controller.html)

Also note that all nodePorts will change; register the ALB with the new nodePorts so that the targets report a healthy status.

The ALB for cms also changes (a random string is appended to the DNS name), so you have to update Route53 with the correct values.

## Manual updates after restore

- Update sensitive data, depending on your business (update the SMTP server, update integration passwords, delete some customer tenants, update the BO password for some tenants)
- Update the CMS integration URL in the BO page (Select tenant → Application settings → Configuration Management settings, update CMS gateway service)
- Update the SAM integration URL in the BO page (Select tenant → Capability settings → Software Asset Management, update CMS gateway url)
- Update the DND integration URL in the Agent portal (Open tenant agent page → Administration → Providers → Aggregation providers)
- Update csa\_access\_point in the DB (e.g. update dnd\_*339803511*.csa\_access\_point set uri='*https://us2-smax-testing.saas.microfocus.com/339803511/oo*' where uuid='*8a50b56d7406291f01740629c9f9013a*';)
- Update the OOMT integration URL in the BO page (Select tenant → Capability settings, update OO integration URL and OO login URL)
- Update the OPB agent and endpoints in the Agent portal
- Update the topology in OO Deployment Operations (the RAS server has to be reconfigured)
- Update the settings in prometheus and grafana - Optional

Update the cm of itom-grafana (append the values below to data.grafana.ini.root\_url)

root\_url = https://us2-smax-testing.saas.microfocus.com/grafana

\[smtp\]

enabled = true

host = *email-smtp.us-west-2.amazonaws.com*:25

user = *aws\_access\_key\_id*

password = *aws\_secret\_access\_key*

skip\_verify = true

from\_address = *sma\_noreply@microfocus.com*

from\_name = *US2Dev\_Grafana*

\[rendering\]

server\_url = http://bitnami-grafana-image-renderer:8080/render

callback\_url = https://itom-grafana:80/

**Restart the itom-grafana-xxxxx pod**

Open grafana and update the user of the datasource; make sure you are using the correct key for the right farm

- Update yamls\_output in the SMAX EFS server (it is better to change all yaml files to read-only)
- Please note that we have a different **cms integration url for different farms**: e.g. https://int.cms.fqdn:445/cms-gateway in us2-dev and https://int.fqdn:445/cms-gateway in us2-prod

We have updated the yamls for the current deployments, but the values are still unchanged in /mnt/efs/var/vols/itom/itsma/global-volume/yamls\_output/, so if we execute the command:

```
kubectl delete -f xxxx.deployment.yaml && kubectl create -f xxxx.deployment.yaml
```

the pods cannot come up.

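
Before recreating a deployment from yamls\_output, it helps to know which files there still carry source-farm values. A minimal sketch, assuming the source value (for example the old FQDN) is known; the function name `stale_yaml_files` is illustrative.

```shell
# Hypothetical check: lists the files under DIR that still contain the
# source-farm value NEEDLE (matched as a fixed string, not a regex).
# Usage: stale_yaml_files NEEDLE DIR
stale_yaml_files() {
    needle="$1"
    dir="$2"
    grep -rlF -- "$needle" "$dir" | sort
}

# Usage:
# stale_yaml_files us2-smax.saas.microfocus.com /mnt/efs/var/vols/itom/itsma/global-volume/yamls_output
```

An empty result for every source value (FQDN, registry, EFS and RDS endpoints) suggests the directory is safe to use with kubectl create.
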
## Validation

- Source farm is not impacted
- Disable or enable SMTP in the restored farm (optional, depends on your business)
- Check the status of all the pods
- SMAX testing in the restored farm (BO, ESS page, agent page, IDOL search)
- DND integration testing: try to execute one OO flow
- CMS integration testing: try to open jmx-console and ucmdb-browser, and run a CI sync with smax
- CGRO integration testing - Optional
- Audit Service testing - Optional
- Prometheus testing - all data is shown correctly in grafana, alertmanager works

## Issues you may meet

- kubectl get svc returns nothing due to federated login
- PVs are not bound while restoring the EKS farm (add the SG of the EKS control plane to the EFS inbound rules)
- cms cannot come up (restoring only cms from the velero backups solves it)
- smartA pods fail to start because some files were not copied from the source farm (smarta-saw-con is used as the example here, but you may hit others)

```
kubectl scale sts smarta-saw-con --replicas=0 -n itsma-xxxxx
# Delete all files under the directory: /mnt/efs/var/vols/itom/itsma/itsma-smarta-saw-con-0/smarta-saw-con-0/data
kubectl scale sts smarta-saw-con --replicas=2 -n itsma-xxxxx
```

- Pods are not up because an image was not pushed to ECR (minor version difference between the source and target farms)
- Ingress is not created (reconfigure the ALB controller in kube-system)
- Integration between cms and smax does not work (manually update the integration URL in BO)
- Integration between oo and smax does not work (manually update the integration URL in the agent portal and the DB)
- Grafana alerts are sent as us2-prod but actually come from us2-dev (reconfigure grafana)
- SAML login does not work (unresolved so far)
- CMS integration does not work due to a different gateway URL format (int.**cms**.fqdn in us2-dev but int.fqdn in us2-prod)
- Not all rabbitmq nodes are added to the cluster; in my case only infra-rabbitmq-0 is there

kubectl scale sts infra-rabbitmq -n *itsma-ohs8f* --replicas=1

Delete all files under the directory /mnt/efs/var/vols/itom/itsma/*rabbitmq-infra-rabbitmq-1(2)*/data/xservices/rabbitmq/*3.7.1.14*/mnesia

kubectl scale sts infra-rabbitmq -n *itsma-ohs8f* --replicas=3

## Leftover

- Not switched to spot instances yet
- The old EFS server is still there (other resources in AWS should have been deleted)
- The backup plan should be changed to save cost
- The contents of yamls\_output in the EFS server are from the source farm and should be changed manually
- Some records in Parameter Store are not updated, and there are many invalid records
- API calls are not stable according to wenjun (solved by infra rabbitmq)
- SAML login still fails