ESM Cloud Disaster and Recovery Guide
Introduction
This guide is based on the latest ESM disaster and recovery solution: back up data from the source farm and restore it to a new target farm.
This means you discard the current farm and restore it on the new farm (cross AWS account, cross region).
Back up all the data from the source farm
- Back up data
- Back up the EFS servers for cms, smax, oomt, prometheus
- Back up the RDS servers for cms, smax, oomt, audit service
- Back up the Vertica DB if CGRO is enabled in the source farm (optional)
- Back up all the k8s configuration files using velero
- Back up all cert files in target farm (smax, cdf, cms, oomt, audit) - /mnt/efs/var/vols/itom/itsma/global-volume/certificate/
- Transfer Data
- Transfer all the snapshots to the target farm (this may take a while, depending on the size of the data)
- Push all images
- Push all images to target farm
- To keep the data consistent, the creation times of all the backups should not be too far apart, ideally within 2 hours of each other.
- Tips
- Use a backup vault to transfer EFS backups.
- Copy and share rds snapshots with customer key
- Refer to the link: How to share an RDS snapshot
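The 2-hour consistency window mentioned above can be checked mechanically once the creation times of all backups are known. This is a minimal sketch, not part of the original procedure: `backups_within_window` is a hypothetical helper, and it assumes you have already collected the creation times as Unix epoch seconds (how you read them from AWS Backup, RDS or velero is left out).

```shell
# Hypothetical helper: succeed if all creation times (epoch seconds)
# lie within WINDOW seconds of each other.
# Usage: backups_within_window WINDOW TIME1 TIME2 ...
backups_within_window() (
    window=$1
    shift
    [ "$#" -gt 0 ] || exit 1      # no timestamps given
    min=$1
    max=$1
    for t in "$@"; do
        [ "$t" -lt "$min" ] && min=$t
        [ "$t" -gt "$max" ] && max=$t
    done
    # Compare the spread between oldest and newest backup to the window.
    [ $((max - min)) -le "$window" ]
)
```

For example, `backups_within_window 7200 <efs_time> <rds_time> <velero_time>` succeeds only if the oldest and newest backups are at most 2 hours apart.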
Prepare new EKS cluster in the new target farm
- Shut down the farm that is running in the target farm (optional; skip this if there are enough available IPs)
- Build new vpc & subnet from CloudFormation (make sure you are not logged in to the AWS console via SAML; log in with your own or a service account instead) - in this case, we don't do this but just reuse the existing resources
- Build new EKS cluster from CloudFormation (add or update tags for the 3 private subnets: kubernetes.io/cluster/ =shared; kubernetes.io/role/internal-elb=1)
- Build new EKS worker nodes: smax, cms, oomt, prometheus (NodeInstanceRole: value from the Outputs tab shown when you create the EKS cluster)
- Check that the node groups are exactly the same as in the source farm (instance type, instance count, kubernetes labels)
- Build new EKS bastion (verify that kubectl get nodes returns the correct output)
- Security group inbound rule check (add the SG of the bastion server to the EKS control plane SG inbound rule; add the EKS control plane SG to the new EFS SG inbound rule)
Refer to the link: https://docs.microfocus.com/doc/SMAX/23.4/TasksOnAWS
Setting up velero
- Download velero binary and copy into $PATH(wget https://github.com/vmware-tanzu/velero/releases/download/v1.4.2/velero-v1.4.2-linux-amd64.tar.gz && tar -zxvf velero-v1.4.2-linux-amd64.tar.gz && cd velero-* && chmod a+x velero && mv velero /usr/local/bin/)
- Create bucket in S3 for velero
- Set up the velero deployment (velero install --provider aws --plugins velero/velero-plugin-for-aws:v1.1.0 --bucket $BUCKET --backup-location-config region=$REGION --snapshot-location-config region=$REGION --secret-file ./credentials-velero)
- Check velero functions by running: velero backup create test1
Refer to the link to install velero: https://github.com/vmware-tanzu/velero-plugin-for-aws
You can also refer to the link to config velero backups automatically: https://github.houston.softwaregrp.net/smax-saas-ops/saas-devops-tools/blob/master/velero_backup.sh
Velero should be installed on both the source farm and the target farm. In the saas farms we have set up one user for DR in each farm; please install velero using that account.
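The install command above reads AWS credentials from ./credentials-velero. According to the velero-plugin-for-aws README this is a file in AWS shared-credentials format with a [default] profile. A minimal sketch for generating it follows; `write_velero_credentials` is a hypothetical helper name, and the key values you pass are assumed to be those of the DR user.

```shell
# Hypothetical helper: write the credentials file that
# `velero install --secret-file` expects (AWS shared-credentials format).
# Usage: write_velero_credentials FILE ACCESS_KEY_ID SECRET_ACCESS_KEY
write_velero_credentials() (
    umask 077    # a newly created file is readable by the owner only
    printf '[default]\naws_access_key_id=%s\naws_secret_access_key=%s\n' \
        "$2" "$3" > "$1"
)
```

Call it as `write_velero_credentials ./credentials-velero <access_key_id> <secret_access_key>` before running velero install, and delete the file once the deployment is up.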
Restore infra in target farm
- Restore new smax rds server from snapshot - pay attention to the RDS type, storage type & size
- Restore new cms rds server from snapshot
- Restore new oomt rds server from snapshot(if has)
- Restore new audit rds server from snapshot(if has)
- Restore Vertica DB for CGRO (optional, if CGRO is enabled in the farm)
- Restore new smax efs server from snapshot (add a mount target after the restore so that IPs are assigned; same for the cms & prometheus efs servers) - time-consuming task
- Restore new cms efs server from snapshot
- Restore new oomt efs server from snapshot
- Restore new prometheus efs server from snapshot (optional, if you care about prometheus data)
To save time, the tasks in this section can be done in parallel
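If the individual restore tasks are scripted, running them in parallel can be as simple as starting each one in the background and waiting for all of them. The sketch below is only an illustration: `run_parallel` is a hypothetical helper, and the commands you pass to it (one per restore task) are your own.

```shell
# Hypothetical helper: run each argument as a shell command in the
# background, wait for all of them, and fail if any of them failed.
# Usage: run_parallel 'cmd1 args' 'cmd2 args' ...
run_parallel() (
    pids=""
    for cmd in "$@"; do
        sh -c "$cmd" &
        pids="$pids $!"
    done
    status=0
    for pid in $pids; do
        # wait returns the exit status of that background job
        wait "$pid" || status=1
    done
    exit "$status"
)
```

A non-zero return tells you at least one task failed, so check the output of each task before moving on to the K8S updates.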
Update K8S resources
- Download the current CDF installation bundle to the new bastion and run: ./install --capabilities Tools=true,Monitoring=false,LogCollection=false,DeploymentManagement=false,ClusterManagement=false
- Download the velero backups and the shell script that batch-updates the parameters in the K8S resources (put the shell script in the backup directory so that there are 9 files in total)
- Replace all images from velero backups, e.g.: sh replaceVeleroConf.sh "551360491748.dkr.ecr.us-west-2.amazonaws.com\/hpeswitom" "551360491750.dkr.ecr.us-west-1.amazonaws.com\/hpeswitomsandbox" false
- Replace aws account(if changed): sh replaceVeleroConf.sh source_aws_account target_aws_account false
- Replace region(if changed): sh replaceVeleroConf.sh us-west-2 us-west-1 false
- Replace org name(if changed): sh replaceVeleroConf.sh "\"hpeswitom\"" "\"hpeswitomsandbox\"" false
- Replace fqdn(if changed): sh replaceVeleroConf.sh "us2-smax.saas.microfocus.com" "us2-smax-testing.saas.microfocus.com" false - using the change-FQDN script is another approach; take care of the certificate and SAML
- Replace efs server from velero backups(if you are restoring on the same farm and restoring to the same efs, you can skip this step since efs server endpoints never changed)
sh replaceVeleroConf.sh source_smax_efs target_smax_efs false
sh replaceVeleroConf.sh source_cms_efs target_cms_efs false
sh replaceVeleroConf.sh source_oomt_efs target_oomt_efs false
sh replaceVeleroConf.sh source_prometheus_efs target_prometheus_efs false
- Replace vertica server from velero backups(optional) - sh replaceVeleroConf.sh source_vertica_ip target_vertica_ip false (if you are restoring on the same farm, you can skip this step if vertica ip not changed)
- Replace rds server from velero backups(if you are restoring on the same farm, you can skip this step if rds endpoints not changed)
sh replaceVeleroConf.sh source_smax_rds target_smax_rds false
sh replaceVeleroConf.sh source_cms_rds target_cms_rds false
sh replaceVeleroConf.sh source_oomt_rds target_oomt_rds false
sh replaceVeleroConf.sh source_audit_rds target_audit_rds true
- Upload the updated backup files to the target S3 bucket (rm -rf replaceVeleroConf.sh && cd .. && aws s3 cp --recursive backup_Name/ s3://target_bucket/backups/backup_Name/)
- Check that you got the correct backups (velero backup get - should now return the backup from the source farm)
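Before uploading, it can be worth confirming that no source value (old registry, RDS or EFS endpoint, FQDN) is left in the backup. The sketch below assumes the resource manifests sit in a gzipped tarball inside the backup directory, which is how velero normally stores them; `tarball_contains` is a hypothetical helper and is unrelated to replaceVeleroConf.sh, whose internals are not shown in this guide.

```shell
# Hypothetical helper: succeed if the gzipped tarball TARBALL still
# contains the literal string OLD in any of its files.
# Exit status: 0 = found, 1 = not found, 2 = could not extract.
# Usage: tarball_contains TARBALL OLD
tarball_contains() (
    tmp=$(mktemp -d) || exit 2
    trap 'rm -rf "$tmp"' EXIT          # clean up the scratch directory
    tar -xzf "$1" -C "$tmp" || exit 2
    grep -rlF -- "$2" "$tmp" > /dev/null
)
```

For example, if `tarball_contains backup_Name/backup_Name.tar.gz source_smax_rds` succeeds, the old RDS endpoint is still referenced somewhere and the replacement should be rerun.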
Perform restore in target farm
- Disable smtp in target farm(optional) - doing this by adding outbound rule for Network ACLs(id:99 Port 25, Deny all)
- Mount the new efs servers for smax, cms, oomt, prometheus on the new bastion (mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2 :/ /mnt/efs) - you can skip this step if the efs endpoints did not change
- Also add the efs entries to /etc/fstab, otherwise the mount points are lost after a VM restart - you can skip this step if the efs endpoints did not change
- Delete pv in case pv is already created
- Delete the namespaces itsma-xxxxx, cms, core, oomt, audit, prometheus if they still exist
- Perform full restore: velero restore create --from-backup <backup.all.example1> --wait
- Only restore one namespace (optional): velero restore create --from-backup <backup.all.example1> --wait --include-namespaces=cms
- Update the credentials of the itom-vault container in case the pod cannot come up
PASSPHRASE=$(kubectl get secret vault-passphrase -n core -o json 2>/dev/null | jq -r '.data.passphrase')
VAULT_CREDENTIAL_SECRET=$(kubectl get secret vault-credential -n core -o json 2>/dev/null )
ENCRYPTED_ROOT_TOKEN=$(echo ${VAULT_CREDENTIAL_SECRET} | jq -r '.data."root.token"')
VAULT_TOKEN=$(echo ${ENCRYPTED_ROOT_TOKEN} | openssl aes-256-cbc -md sha256 -a -d -pass pass:"${PASSPHRASE}")
echo ${VAULT_TOKEN}
kubectl exec -it $(kubectl get pod -ncore -ocustom-columns=NAME:.metadata.name |grep itom-vault| head -1) -ncore -- bash
export VAULT_ADDR=https://itom-vault.core:8200
export VAULT_TOKEN=<VAULT_TOKEN>
vault write -tls-skip-verify auth/kubernetes/config kubernetes_host="https://kubernetes.default" kubernetes_ca_cert=@/var/run/secrets/kubernetes.io/serviceaccount/ca.crt
- Helm upgrade apphub for cdf - All helm releases should update, this includes core, cms and maybe oomt in the future
/root/cdf/bin/helm get values apphub -n core > apphub.yaml
update apphub.yaml with new values(dburl, host,registry,orgName,externalAccessHost)
/root/cdf/bin/helm upgrade apphub /root/cdf/charts/apphub-1.20.0+20211100.219.tgz -f apphub.yaml -n core
- Helm upgrade cms releases - update smax.crt,database.host,smax.host,orgName,registry,externalAccessHost,idmAuthUrl,idmServiceUrl(pay attention to host and idmServiceUrl, we have different values between saas farms)
- Helm upgrade apphub for prometheus - update orgName,registry,externalAccessHost
- Helm upgrade oomt releases(optional, if you have enabled oomt)
- Helm upgrade audit service releases(optional, if you have enabled audit service)
- Wait until all the pods are up (kubectl get pod --all-namespaces|grep -vE '1/1|2/2|3/3|4/4|Completed')
- Known issue: if smax has been transformed to helm, you have to run the helm upgrade for itsma, since most DND pods are waiting for the jobs
- Sometimes the dnd-upgrade-jobs fail; just delete those pods and the related pods that are stuck in Init state
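The grep filter used for waiting on pods only recognises pods with up to 4 containers. As an alternative, the READY column can be compared numerically. This is a sketch under the assumption that the input is the default `kubectl get pod --all-namespaces` table; `not_ready_pods` is a hypothetical helper name.

```shell
# Hypothetical helper: read `kubectl get pod --all-namespaces` output on
# stdin and print pods that are neither fully ready nor Completed.
# Columns assumed: NAMESPACE NAME READY STATUS RESTARTS AGE
not_ready_pods() {
    awk 'NR > 1 && $4 != "Completed" {
        split($3, r, "/")                 # READY looks like "1/2"
        if (r[1] != r[2]) print $1 "/" $2 " " $3 " " $4
    }'
}
```

Usage: `kubectl get pod --all-namespaces | not_ready_pods`. Empty output means every pod is either fully ready or Completed.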
Certificates
- Update SMAX cert by: ./replaceExternalAccessHost.sh -c <certificate_path> -k <key_path> -t <cacert_path> -n
- Update CMS and SAM cert by:
Get current cms cert from source farm:
helm ls -n cms && helm get values cms-release -n cms > /tmp/cms.yaml
Put the cms cert files in the directory /mnt/efs/var/vols/itom/itsma/global-volume/certificate/source/; the cert files are imported automatically (make sure the owner is set to 1999:1999)
Restart the platform and platform-offline pods
- You can also update cert files for DND and OO: https://docs.microfocus.com/doc/SMAX/23.4/SMAXChangeFQDN
- Update cert files for OOMT if OOMT is enabled
- Update cert files for Audit service if audit is enabled
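The 1999:1999 ownership requirement on the cert files is easy to miss after copying them from another farm. A minimal sketch for spotting files with the wrong numeric owner or group follows; `wrong_owner_files` is a hypothetical helper, and fixing what it reports with chown -R 1999:1999 on the directory is an assumption about how you would correct it.

```shell
# Hypothetical helper: print regular files under DIR whose owner or group
# differs from EXPECTED, given as UID:GID (for example 1999:1999).
# Usage: wrong_owner_files DIR UID:GID
wrong_owner_files() {
    # ${2%%:*} is the part before the colon, ${2##*:} the part after it.
    find "$1" -type f ! \( -user "${2%%:*}" -group "${2##*:}" \) -print
}
```

For example, `wrong_owner_files /mnt/efs/var/vols/itom/itsma/global-volume/certificate/source 1999:1999` should print nothing once the ownership is correct.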
Application load balancer
- Configure Load balancer for smax - refer to: https://docs.microfocus.com/doc/SMAX/23.4/EKSDeploySuite
- Configure Load balancer for management portal: 5443
- Configure Load balancer for prometheus(optional)
- Rebuild the ALB controller in kube-system (delete the aws-load-balancer-controller deployment in the kube-system namespace and recreate it; pay attention to the values of cluster-name and region)
- Delete and rebuild 3 ingress for cms - note that the ALB name will change
- Delete and rebuild 3 ingress for oomt(optional)
- Delete and rebuild 3 ingress for audit(optional)
- Bind DNS records in Route53 for smax, cms, oomt and audit service
You can config ALB controller following the guide: https://docs.aws.amazon.com/eks/latest/userguide/aws-load-balancer-controller.html
Also note that all nodePorts will change; register the ALB with the new nodePorts so that the targets report a healthy status
The ALB for cms also changes (a random string is appended to the DNS name, so you have to update Route53 with the correct values)
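Since the ALB DNS names change, the Route53 records have to be repointed. The sketch below builds the change batch JSON accepted by aws route53 change-resource-record-sets. It assumes plain CNAME records; if the farm uses alias records, the record set looks different. `cname_upsert_batch` is a hypothetical helper name.

```shell
# Hypothetical helper: print a Route 53 change batch that upserts a CNAME
# record NAME pointing at TARGET (for example the new ALB DNS name).
# Usage: cname_upsert_batch NAME TARGET [TTL]   (TTL defaults to 300)
cname_upsert_batch() {
    printf '{"Changes":[{"Action":"UPSERT","ResourceRecordSet":{"Name":"%s","Type":"CNAME","TTL":%s,"ResourceRecords":[{"Value":"%s"}]}}]}\n' \
        "$1" "${3:-300}" "$2"
}
```

Usage: aws route53 change-resource-record-sets --hosted-zone-id <zone-id> --change-batch "$(cname_upsert_batch <record_name> <new_alb_dns_name>)"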
Manual updates after restore
- Sensitive data update, depends on your business(update smtp server, update integration password, delete some customer tenants, update bo password for some tenant)
- Update CMS integration url in BO page(Select tenant → Application settings → Configuration Management settings, update CMS gateway service)
- Update SAM integration url in BO page(Select tenant → Capability settings → Software Asset Management, update CMS gateway url)
- Update DND integration url in Agent portal(Open tenant agent page → Administration → Providers → Aggregation providers)
- Update csa_access_point in DB(e.g. update dnd_339803511.csa_access_point set uri='https://us2-smax-testing.saas.microfocus.com/339803511/oo' where uuid='8a50b56d7406291f01740629c9f9013a';)
- Update OOMT Integration URL in BO page(Select tenant → Capability settings, update OO integration URL and OO login URL)
- Update OPB agent and endpoints in Agent portal
- Update topology in OO Deployment Operations(ras server has to be reconfigured)
- Update settings in prometheus and grafana - Optional
Update the cm of itom-grafana (append the values below to data.grafana.ini.root_url)
root_url = https://us2-smax-testing.saas.microfocus.com/grafana
[smtp]
enabled = true
host = email-smtp.us-west-2.amazonaws.com:25
user = aws_access_key_id
password = aws_secret_access_key
skip_verify = true
from_address = sma_noreply@microfocus.com
from_name = US2Dev_Grafana
[rendering]
server_url = http://bitnami-grafana-image-renderer:8080/render
callback_url = https://itom-grafana:80/
Restart pod of itom-grafana-xxxxx
Open grafana and update the user of the datasource; make sure you are using the correct key for the right farm
- Update yamls_output on the SMAX efs server (better to change all yaml files to read-only)
- Please note that the cms integration url differs between farms: e.g. https://int.cms.fqdn:445/cms-gateway in us2-dev and https://int.fqdn:445/cms-gateway in us2-prod
We have updated the yamls for the current deployments, but the values are still unchanged in /mnt/efs/var/vols/itom/itsma/global-volume/yamls_output/, so if we execute the command:
kubectl delete -f xxxx.deployment.yaml && kubectl create -f xxxx.deployment.yaml
the pods cannot come up
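One way to bring the files under yamls_output in line with the running deployments is a literal search-and-replace over the yaml files. This is only a sketch and not the procedure used in this restore: `replace_in_yamls` is a hypothetical helper, it keeps a .bak copy of every file it changes, and you should review the result before relying on it.

```shell
# Hypothetical helper: replace the literal string OLD with NEW in every
# *.yaml file under DIR, keeping a .bak copy of each changed file.
# Usage: replace_in_yamls DIR OLD NEW
replace_in_yamls() (
    # Escape characters that are special to sed so both values are literal.
    old=$(printf '%s' "$2" | sed 's/[][\.*^$|]/\\&/g')
    new=$(printf '%s' "$3" | sed 's/[\&|]/\\&/g')
    # Only touch files that actually contain OLD.
    find "$1" -type f -name '*.yaml' -exec grep -lF -- "$2" {} + |
    while IFS= read -r f; do
        sed -i.bak "s|$old|$new|g" "$f"
    done
)
```

For example: replace_in_yamls /mnt/efs/var/vols/itom/itsma/global-volume/yamls_output us2-smax.saas.microfocus.com us2-smax-testing.saas.microfocus.com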
Validation
- Source farm not impacted
- Disable or enable smtp in restored farm(optional, depends on your business)
- Check the status of all the pods
- Smax testing in restored farm(bo, ess page, agent page, idol search)
- DND integration testing, try to execute one OO flow
- CMS integration testing, try to open jmx-console, ucmdb-browser, and CI sync with smax
- CGRO integration testing - Optional
- Audit Service testing - Optional
- Prometheus testing - all data is shown correctly in grafana, alertmanager works
Issues you may meet
- kubectl get svc returns nothing due to federated login
- pv not bound while restoring the EKS farm (add the SG of the EKS control plane to the efs inbound rule)
- cms cannot come up (restoring only cms from the velero backups solves it)
- smartA pods fail to start because some files were not copied from the source farm (smarta-saw-con is taken as an example, but you may hit others)
kubectl scale sts smarta-saw-con --replicas=0 -n itsma-xxxxx
delete all files under the directory of: /mnt/efs/var/vols/itom/itsma/itsma-smarta-saw-con-0/smarta-saw-con-0/data
kubectl scale sts smarta-saw-con --replicas=2 -n itsma-xxxxx
- pods not up because images were not pushed to ECR (minor version difference between source & target farm)
- ingress not created (reconfigure the ALB controller in kube-system)
- Integration between cms & smax does not work (manually update the integration url in bo)
- Integration between oo & smax does not work (manually update the integration url in agent & db)
- Grafana alerts are sent as us2-prod but are actually from us2-dev (reconfigure grafana)
- SAML login does not work (unresolved so far)
- CMS integration does not work due to different gateway url formats (int.cms.fqdn in us2-dev but int.fqdn in us2-prod)
- Not all rabbitmq nodes are added to the cluster; in my case only infra-rabbitmq-0 was there
kubectl scale sts infra-rabbitmq -n itsma-ohs8f --replicas=1
delete all files under the directory of /mnt/efs/var/vols/itom/itsma/rabbitmq-infra-rabbitmq-1(2)/data/xservices/rabbitmq/3.7.1.14/mnesia
kubectl scale sts infra-rabbitmq -n itsma-ohs8f --replicas=3
Leftover
- Not switched to spot instance yet
- Old EFS server still there(other resources in AWS should have been deleted)
- Backup plan should be changed to save cost
- Contents of yamls_output on the EFS server are from the source farm and should be changed manually
- Some records in parameter store are not updated and there are many invalid records
- API calls not stable according to wenjun (solved by infra rabbitmq)
- SAML login still fails