# EKS upgrade from version 1.30 to 1.31

## Introduction
This page describes the steps for upgrading the ESM EKS cluster in the SaaS environment from version 1.30 to 1.31.

Reference: [https://rndwiki.houston.softwaregrp.net/confluence/pages/viewpage.action?spaceKey=SMA&title=How%20to%20upgrade%20EKS%20in%20SaaS](https://rndwiki.houston.softwaregrp.net/confluence/pages/viewpage.action?spaceKey=SMA&title=How%20to%20upgrade%20EKS%20in%20SaaS)

The process has three main parts: (1) upgrading the add-ons, (2) upgrading the EKS cluster, and (3) upgrading the EKS worker node groups.

## 1\. Upgrading the add-ons
The add-ons **coredns**, **vpc-cni** and **kube-proxy** must be upgraded before the EKS cluster itself. Reference instructions:

[https://docs.aws.amazon.com/eks/latest/userguide/managing-coredns.html](https://docs.aws.amazon.com/eks/latest/userguide/managing-coredns.html)

[https://docs.aws.amazon.com/eks/latest/userguide/managing-vpc-cni.html](https://docs.aws.amazon.com/eks/latest/userguide/managing-vpc-cni.html)

[https://docs.aws.amazon.com/eks/latest/userguide/managing-kube-proxy.html](https://docs.aws.amazon.com/eks/latest/userguide/managing-kube-proxy.html)

**1.1. Upgrading the *coredns* add-on**

Open the referenced AWS page: [https://docs.aws.amazon.com/eks/latest/userguide/coredns-add-on-self-managed-update.html](https://docs.aws.amazon.com/eks/latest/userguide/coredns-add-on-self-managed-update.html).

1.1.1. **Confirm**, in the bastion's CLI, that the self-managed type of the add-on is installed on the cluster. Replace my-cluster with the name of your cluster.

aws eks describe-addon --cluster-name my-cluster --addon-name coredns --query addon.addonVersion --output text

e.g. aws eks describe-addon --cluster-name us2-dev-eks-cluster --addon-name coredns --query addon.addonVersion --output text

If an error message is returned, the self-managed type of the add-on is installed on the cluster. If a version number is returned instead, the Amazon EKS type of the add-on is installed.

1.1.2. **Check** the version (image tag) of the container image that is currently installed on the cluster.

kubectl describe deployment coredns -n kube-system | grep Image | cut -d ":" -f 3

1.1.3. **Check** the full CoreDNS image reference (registry, repository and tag), which is needed when setting the new image in step 1.1.5:

kubectl describe deployment coredns -n kube-system | grep Image

1.1.4. Since the upgrade is made to CoreDNS v1.11.4-eksbuild.14, **add** the endpointslices permission to the system:coredns Kubernetes clusterrole.

kubectl edit clusterrole system:coredns -n kube-system

Add the following lines under the existing permissions lines in the rules section of the file.

    [...]
    - apiGroups:
      - discovery.k8s.io
      resources:
      - endpointslices
      verbs:
      - list
      - watch
    [...]

1.1.5. **Update** the CoreDNS image, replacing only the region and the image version as needed:

kubectl set image deployment.apps/coredns -n kube-system coredns=602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/coredns:v1.11.4-eksbuild.14

1.1.6. **Check** the pods in the kube-system namespace and the add-on version now installed:

kubectl get pods -n kube-system

kubectl describe deployment coredns -n kube-system | grep Image | cut -d ":" -f 3

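The manual clusterrole edit in step 1.1.4 can also be done non-interactively. The sketch below is one possible approach and is not taken from the referenced AWS procedure: it checks whether the endpointslices rule is already present and, only if it is missing, appends it with a JSON patch. The names `ENDPOINTSLICES_PATCH`, `has_endpointslices` and `ensure_endpointslices` are illustrative, and the sketch assumes kubectl on the bastion is already configured for the target cluster.

```shell
#!/bin/sh
# JSON patch that appends the endpointslices rule to the clusterrole's rules list.
ENDPOINTSLICES_PATCH='[{"op":"add","path":"/rules/-","value":{"apiGroups":["discovery.k8s.io"],"resources":["endpointslices"],"verbs":["list","watch"]}}]'

# has_endpointslices (hypothetical helper): reads a clusterrole manifest on
# stdin and succeeds if it already mentions the endpointslices resource.
has_endpointslices() {
  grep -q 'endpointslices'
}

# ensure_endpointslices (hypothetical helper): patch only when the rule is
# missing, so that re-running it does not add a duplicate rule.
ensure_endpointslices() {
  if kubectl get clusterrole system:coredns -o yaml | has_endpointslices; then
    echo "endpointslices rule already present"
  else
    kubectl patch clusterrole system:coredns --type=json -p "$ENDPOINTSLICES_PATCH"
  fi
}
```

Sourcing the file only defines the functions; calling `ensure_endpointslices` applies the change. The text match is deliberately simple, so it is worth reviewing the resulting clusterrole with `kubectl get clusterrole system:coredns -o yaml` afterwards.
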
**1.2. Upgrading the *vpc-cni* add-on**

Open the referenced AWS page: [https://docs.aws.amazon.com/eks/latest/userguide/vpc-add-on-self-managed-update.html](https://docs.aws.amazon.com/eks/latest/userguide/vpc-add-on-self-managed-update.html)

1.2.1. **Confirm** that the Amazon EKS type of the add-on is not installed on the cluster. Replace my-cluster with the name of your cluster.

aws eks describe-addon --cluster-name my-cluster --addon-name vpc-cni --query addon.addonVersion --output text

e.g. aws eks describe-addon --cluster-name us2-dev-eks-cluster --addon-name vpc-cni --query addon.addonVersion --output text

If an error message is returned, the Amazon EKS type of the add-on is not installed on the cluster.

1.2.2. **Check** the version of the container image that is currently installed on the cluster.

kubectl describe daemonset aws-node --namespace kube-system | grep amazon-k8s-cni: | cut -d: -f 3

1.2.3. Navigate to /opt/25.2 and **back up** the current settings so that the same settings can be reapplied once the version is updated:

cd /opt/25.2/

kubectl get daemonset aws-node -n kube-system -o yaml > aws-k8s-cni-old.yaml

cat aws-k8s-cni-old.yaml

1.2.4. **Check** the latest available version table on the page: [https://docs.aws.amazon.com/eks/latest/userguide/managing-vpc-cni.html#vpc-cni-latest-available-version](https://docs.aws.amazon.com/eks/latest/userguide/managing-vpc-cni.html#vpc-cni-latest-available-version). For this upgrade the version was v1.19.5-eksbuild.3.

1.2.5. Create a folder for the EKS upgrade and **download** the vpc-cni manifest file into it:

mkdir eks\_upgrade\_1.31

cd eks\_upgrade\_1.31/

curl -O https://raw.githubusercontent.com/aws/amazon-vpc-cni-k8s/v1.19.5/config/master/aws-k8s-cni.yaml

1.2.6. If needed, edit the downloaded manifest so that it carries over the custom settings saved in aws-k8s-cni-old.yaml (step 1.2.3), then **apply** it to the cluster:

kubectl apply -f aws-k8s-cni.yaml

1.2.7. **Check** the pods in the kube-system namespace and the add-on version now installed:

watch 'kubectl get pods -n kube-system'

kubectl describe daemonset aws-node --namespace kube-system | grep amazon-k8s-cni: | cut -d: -f 3

1.2.8. Because custom networking (non-routable CIDR) is enabled on this farm, **re-enable** it after updating the VPC CNI plugin:

kubectl set env daemonset aws-node -n kube-system AWS\_VPC\_K8S\_CNI\_CUSTOM\_NETWORK\_CFG=true

Then **check** the pods again:

watch 'kubectl get pods -n kube-system'

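Steps 1.2.3 and 1.2.8 depend on spotting by eye which aws-node settings were lost when the new manifest was applied. The following is a minimal sketch of one way to make that comparison explicit. The helper names `dump_cni_env` and `missing_env` and the file names in the usage note are illustrative, and the sketch assumes that aws-node is the first container in the daemonset.

```shell
#!/bin/sh
# dump_cni_env (hypothetical helper): print NAME=VALUE for every environment
# variable of the first container of the aws-node daemonset, sorted.
# Assumption: aws-node is the first container; variables defined through
# valueFrom are printed with an empty value.
dump_cni_env() {
  kubectl get daemonset aws-node -n kube-system \
    -o jsonpath='{range .spec.template.spec.containers[0].env[*]}{.name}={.value}{"\n"}{end}' \
    | LC_ALL=C sort
}

# missing_env (hypothetical helper): print the lines of the old dump ($1) that
# are absent from, or different in, the new dump ($2).
# Both files must be sorted the same way, as dump_cni_env does.
missing_env() {
  LC_ALL=C comm -23 "$1" "$2"
}
```

For example, run `dump_cni_env > cni-env-old.txt` before step 1.2.6 and `dump_cni_env > cni-env-new.txt` after step 1.2.7. `missing_env cni-env-old.txt cni-env-new.txt` then lists the settings that still need to be re-applied with `kubectl set env`, as done for the custom networking flag in step 1.2.8.
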
**1.3. Upgrading the *kube-proxy* add-on**

Open the following AWS page: [https://docs.aws.amazon.com/eks/latest/userguide/kube-proxy-add-on-self-managed-update.html](https://docs.aws.amazon.com/eks/latest/userguide/kube-proxy-add-on-self-managed-update.html)

1.3.1. **Check** that the self-managed type of the add-on is installed on the cluster. Replace my-cluster with the name of your cluster.

aws eks describe-addon --cluster-name my-cluster --addon-name kube-proxy --query addon.addonVersion --output text

e.g. aws eks describe-addon --cluster-name us2-dev-eks-cluster --addon-name kube-proxy --query addon.addonVersion --output text

If an error message is returned, then the self-managed type of the add-on is installed on your cluster.

1.3.2. **Check** the version of the container image that is currently installed on the cluster.

kubectl describe daemonset kube-proxy -n kube-system | grep Image

1.3.3. **Update** the kube-proxy add-on, using the minimal image type:

kubectl set image daemonset.apps/kube-proxy -n kube-system kube-proxy=602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/kube-proxy:v1.31.9-minimal-eksbuild.2

1.3.4. **Check** that the new version is now installed on the cluster.

watch 'kubectl get pods -n kube-system'

kubectl get pods -n kube-system | grep kube-proxy

kubectl describe daemonset kube-proxy -n kube-system | grep Image | cut -d ":" -f 3

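The "confirm the add-on type" checks in steps 1.1.1, 1.2.1 and 1.3.1 all use the same `aws eks describe-addon` call, so they can be run in one pass. This is a sketch only: `addon_type` and `report_addons` are hypothetical helper names, and the interpretation of the exit status follows the rule stated in those steps (an error means the add-on is self-managed).

```shell
#!/bin/sh
# addon_type CLUSTER ADDON (hypothetical helper): print "eks-managed" when
# describe-addon succeeds, otherwise "self-managed".
# Caveat: any other failure (expired credentials, wrong cluster name) is also
# reported as "self-managed", so check the raw error when in doubt.
addon_type() {
  if aws eks describe-addon --cluster-name "$1" --addon-name "$2" \
       --query addon.addonVersion --output text >/dev/null 2>&1; then
    echo "eks-managed"
  else
    echo "self-managed"
  fi
}

# report_addons CLUSTER (hypothetical helper): one line per add-on from section 1.
report_addons() {
  for addon in coredns vpc-cni kube-proxy; do
    printf '%s: %s\n' "$addon" "$(addon_type "$1" "$addon")"
  done
}
```

For example, `report_addons us2-dev-eks-cluster` prints one line for each of the three add-ons. The self-managed update procedures in this section apply only to the add-ons reported as self-managed.
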
## 2\. Upgrading the EKS cluster

Log in to the AWS console, go to the EKS service, click "Update now" and choose the target version (1.31 in this case). Click "Update" and wait until the upgrade completes, which takes roughly 15 to 45 minutes.

Once the EKS cluster is upgraded to the new version, upgrade the worker nodes to the new version accordingly.

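This page performs the control plane upgrade through the console. For reference, a CLI equivalent could look like the sketch below. It is an alternative, not the procedure described above. The helper names `is_next_minor` and `upgrade_cluster` are hypothetical. The guard reflects the fact that EKS upgrades move one minor version at a time, and the final wait may need a short delay or a retry if the cluster status has not yet switched to updating.

```shell
#!/bin/sh
# is_next_minor CURRENT TARGET (hypothetical helper): succeed only if TARGET is
# exactly one minor version above CURRENT, e.g. 1.30 -> 1.31.
is_next_minor() {
  cur_major=${1%%.*}; cur_minor=${1#*.}
  tgt_major=${2%%.*}; tgt_minor=${2#*.}
  [ "$cur_major" = "$tgt_major" ] && [ "$tgt_minor" -eq $((cur_minor + 1)) ]
}

# upgrade_cluster CLUSTER TARGET (hypothetical helper): start the control plane
# upgrade and block until the cluster reports ACTIVE again.
upgrade_cluster() {
  current=$(aws eks describe-cluster --name "$1" --query cluster.version --output text) || return 1
  if ! is_next_minor "$current" "$2"; then
    echo "refusing upgrade: $current -> $2 is not a single minor version step" >&2
    return 1
  fi
  aws eks update-cluster-version --name "$1" --kubernetes-version "$2" || return 1
  aws eks wait cluster-active --name "$1"
}
```

For example, `upgrade_cluster us2-dev-eks-cluster 1.31` would start the upgrade only if the cluster is currently on 1.30.
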
## 3\. Upgrading the EKS worker node groups

Open the referenced AWS page: [https://docs.aws.amazon.com/eks/latest/userguide/update-workers.html](https://docs.aws.amazon.com/eks/latest/userguide/update-workers.html)

3.1. **Create** a dedicated location on the Linux bastion for the EKS node groups upgrade.

3.2. **Download** the scripts from this location: [https://rndwiki.houston.softwaregrp.net/confluence/pages/viewpageattachments.action?pageId=1309586390&metadataLink=true](https://rndwiki.houston.softwaregrp.net/confluence/pages/viewpageattachments.action?pageId=1309586390&metadataLink=true)

3.3. If the new node groups are prepared on a different day from the actual node group upgrade, make sure they are created with a desired size of 0 by **commenting out** the last line in the script:

\# aws eks update-nodegroup-config --cluster-name $eks\_name --nodegroup-name $old\_nodegroup\_name-workernodes-1-$eks\_version --scaling-config minSize=$min\_size,maxSize=$max\_size,desiredSize=$desired\_size 2>&1 >/dev/null

3.4. **Run** the node group creation script [create-eks-worker.sh](attachments/706832607/709421232.sh):

sh ./create-eks-worker.sh

If the script fails because of Windows (CRLF) line endings, **convert** it with the command below and re-run it:

dos2unix create-eks-worker.sh

3.5. If not all the labels were created on each node group, use the script [tag\_ASG.sh](attachments/706832607/709421233.sh) to **tag** them:

sh ./tag\_ASG.sh

3.6. If one node is overloaded with pods, **evict** the pods from that node:

kubectl taint nodes ${currentNodeName} podReScheduler=value:NoExecute

3.7. **Scale** up the new node groups to the desired size:

AWS UI > EKS > <the cluster name> > Compute > <each worker node group> > Edit >

3.8. **Taint** the old worker nodes by running the following shell loop:

    nodes=$(kubectl get nodes | grep -i v1.30 | awk '{print $1}')
    for node in $nodes
    do
      kubectl taint nodes ${node} podReScheduler=value:NoSchedule
    done

3.9. **Check** whether any pods are still running on worker nodes of the previous version (e.g. 1.30) by running the following shell loop:

    nodes=$(kubectl get nodes | grep -i v1.30 | awk '{print $1}')
    for node in $nodes
    do
      kubectl get po -o wide -A | grep -i $node | grep -v 'aws-node-\|kube-proxy-\|ebs-csi-node\|twistlock-defender\|itom-prometheus-node-exporter-\|itom-throttling-controller\|Completed' | awk '{print $1,$2}'
    done

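The loops in steps 3.8 and 3.9 select old nodes by grepping the whole `kubectl get nodes` line, which can also match the pattern elsewhere in the line because the dot in v1.30 matches any character. A stricter variant, sketched below under the assumption that the kubelet version reported by each node identifies its Kubernetes version, matches only the version field. The helper names `filter_nodes_by_minor`, `old_nodes` and `pods_on_old_nodes` are hypothetical.

```shell
#!/bin/sh
# filter_nodes_by_minor MINOR (hypothetical helper): read "name kubeletVersion"
# pairs on stdin and print the names whose version starts with vMINOR.
filter_nodes_by_minor() {
  awk -v prefix="v$1." 'index($2, prefix) == 1 { print $1 }'
}

# old_nodes MINOR (hypothetical helper): list nodes whose kubelet is on MINOR.
old_nodes() {
  kubectl get nodes \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.nodeInfo.kubeletVersion}{"\n"}{end}' \
    | filter_nodes_by_minor "$1"
}

# pods_on_old_nodes MINOR (hypothetical helper): print the pods scheduled on
# each such node, using a field selector instead of grepping for the node name.
pods_on_old_nodes() {
  for node in $(old_nodes "$1"); do
    kubectl get pods -A -o wide --field-selector "spec.nodeName=$node" --no-headers
  done
}
```

`pods_on_old_nodes 1.30` does not exclude DaemonSet or completed pods; the `grep -v` filter from step 3.9 can be appended to its output for that.
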
3.10. If pods are still running on 1.30 nodes, and only in small namespaces such as audit, core, kube-system, cert-manager or velero, manually **restart** them with the script [rollingMigratePodsByNamespace.sh](attachments/706832607/709421199.sh):

./rollingMigratePodsByNamespace.sh <namespace1> <namespace2>..

e.g.

nohup sh rollingMigratePodsByNamespace.sh audit core kube-system &

./rollingMigratePodsByNamespace.sh cert-manager kube-system monitoring velero

**Note:** It is not safe to run the script on big namespaces like itsma, core or monitoring.

3.11. Manually **restart** the pods in the itsma, core and monitoring namespaces by deleting them, e.g.:

kubectl delete pod itom-toolkit-6c5f5745b-cfzqx -n itsma-ohs8f

kubectl delete pod filebeat-drxl5 -n logging

kubectl delete pod suite-conf-pod-itsma-6854dd8f74-5c9dm -n core

3.12. **Check** again as in step 3.9 above.

3.13. Terminate and **delete** the old-version (e.g. 1.30) worker nodes.

AWS UI > EKS > <the cluster name> > Compute > <old node groups> > Delete.

3.14. Once all the old worker nodes are terminated, **install** the Qualys agents on the new worker nodes, except for US24-PROD, by using the install\_qualys\_agent.sh script:

sh install\_qualys\_agent.sh <farmName>

e.g. sh install\_qualys\_agent.sh us6-prod

3.15. **SSH** to one of the new worker nodes and check that Qualys is installed by running service qualys-cloud-agent status:

ssh -i worknodes.pem ec2-user@ip-10-210-96-76.us-west-2.compute.internal

service qualys-cloud-agent status

exit