# Runbooks based on monitoring
## Alerts, Description and Summary

Alerts come from monitoring and experience.

Here is a reference list of items to be sent as alerts. The [Grafana monitoring dashboards](https://github.houston.softwaregrp.net/smax-saas-ops/ESM-Saas-Monitoring) are developed based on the list below.

1. Infrastructure
	1. Compute
	2. Network
		1. ALB 5xx (more than 50 within 2 mins)
			1. Summary:
				More than 50 5xx errors were returned on the frontend. Multiple end users may be experiencing a production issue.
				1. Check whether any other time-correlated alerts are being reported.
		2. S2 NEW ALB target 5xx (TBD)
	3. Storage
		1. S3 EBS (EBS disk queue depth more than 5 for more than 10 mins)
			1. Summary:
				Tasks on the storage are being queued.
				1. Check
					1. whether EBS is running out of credits, via the EBS burst balance dashboard (the same dashboard on the infrastructure page).
					2. whether there is a heavy load on EBS storage.
				2. Todo
					1. No action is required. Usually, if it is a node-level issue, the AWS Auto Scaling group replaces the node after a while.
		2. S2 EBS (EBS burst balance below 40% for more than 30 mins)
			1. Summary:
				The load on EBS is high and the burst balance may not be able to serve requests in the following quarter-hour to hour.
				1. Check
					1. keep monitoring, via the EBS burst balance dashboard (the same dashboard on the infrastructure page), whether EBS is about to run out of credits.
					2. whether there is a heavy load on EBS storage.
				2. Todo
					1. Usually no action is required. If the alert persists, it is a critical issue; follow the Todo for the alert "EBS burst balance is 0".
		3. EBS (EBS burst balance is 0)
			1. Summary:
				Tasks on the storage are being queued. Everything that goes through EBS IO will slow down.
				1. Check
					1. whether EBS is running out of credits, via the EBS burst balance dashboard (the same dashboard on the infrastructure page).
					2. whether there is a heavy load on EBS storage.
				2. Todo
					1. Log in to the system manually to check whether it is slowing down. If it has slowed down dramatically, choose one of the options below to fix it:
						1. Switch the EBS volume to gp3 with a specified IOPS (in general the default of 12000 should be enough; if not, you may raise it to 18000, and switch back to 12000 once the issue is fixed).
						2. Add more storage to the EBS volume.
		4. S2 EFS (burst credit below 40% for more than 30 mins)
			1. Summary:
				Tasks on the storage will be queued soon.
				1. Check
					1. whether EFS is running out of credits, via the EFS burst credit dashboard (the same dashboard on the infrastructure page).
					2. whether there is a heavy load on EFS storage.
				2. Todo
					1. Log in to the system manually to check whether it is slowing down. If it has slowed down dramatically, choose one of the options below to fix it:
						1. Usually no action is required. If the alert persists, it is a critical issue.
		5. EFS (burst credit is 0)
			1. Summary:
				Tasks on the storage are being queued. Everything that goes through EFS IO will slow down.
				1. Check
					1. whether EFS is running out of credits, via the EFS burst credit dashboard (the same dashboard on the infrastructure page).
					2. whether there is a heavy load on EFS storage.
				2. Todo
					1. Log in to the system manually to check whether it is slowing down. If it has slowed down dramatically, choose one of the options below to fix it:
						1. Switch the EFS to provisioned throughput mode (for example, 60-100 MB/s; switch back once the issue is fixed).
	4. Virtualization
	5. Database
		1. S2 CPU (CPU more than 97% for more than 60 mins)
			1. Summary:
				The overall CPU usage has been above 97% for more than one hour.
				1. Check
					1. Performance Insights, for any top queries that are consuming CPU.
				2. Todo
					1. Keep monitoring and check whether other database metrics are abnormal.
					2. Get the top 10 query information.
		2. S2 CPU (sy: system > 70% for more than 60 mins)
			1. Summary:
				The CPU is spending more time on system-level processing than on handling the business flow.
				1. Check
					1. Performance Insights, for any top queries that are consuming CPU.
				2. Todo
					1. Keep monitoring and check whether other database metrics are abnormal.
		3. S2 CPU (si: soft interrupts > 15% for more than 60 mins)
			1. Summary:
				The CPU is spending more time on system-level processing than on handling the business flow.
				1. Check
					1. Performance Insights, for any top queries that are consuming CPU.
				2. Todo
					1. Keep monitoring and check whether other database metrics are abnormal.
		4. S3 Disk queue depth (EBS disk queue depth more than 5 for more than 10 mins)
			1. Summary:
				Tasks on the storage are being queued.
				1. Check
					1. whether EBS is running out of credits, via the EBS burst balance dashboard (the same dashboard on the infrastructure page).
					2. whether there is a heavy load on EBS storage.
		5. S2 Disk (free storage space is below 500 MB)
			1. Summary:
				The instance is running out of storage.
				1. Todo
					1. Add more storage to EBS.
					2. Enable storage auto-scaling.
		6. S2 Disk (storage auto-scaling headroom: (Free Space + Max Autoscaling Storage - Allocated Storage) / Allocated Storage < 0.2)
			1. Summary:
				The remaining auto-scaling quota of the instance is not enough.
				1. Todo
					1. Increase the max auto-scaling storage size.
		7. S2 Memory (free memory less than 5% for more than 5 mins)
			1. Summary:
				The instance will run out of memory soon.
				1. Check
					1. Log in to AWS console → RDS → Monitoring to check whether swap usage is increasing.
				2. Todo
					1. Keep monitoring.
					2. Consider a rolling restart of the current deployments, for example gateway/platform/serviceportal.
		8. Memory (free memory less than 2% for more than 5 mins)
			1. Summary:
				The instance will run out of memory soon.
				1. Check
					1. Log in to AWS console → RDS → Monitoring to check whether swap usage is increasing.
				2. Todo
					1. Consider a rolling restart of the current deployments, for example gateway/platform/serviceportal.
					2. If it happens 2-3 times a day and swap usage keeps rising:
						1. Consider scaling up RDS, usually by doubling the memory size.
						2. Do DB tuning based on the queries identified as memory-consuming.
		9. S2 Storage (Burst Balance below 40% for more than 30 mins)
			1. Summary:
				The load on EBS is high and the burst balance may not be able to serve requests in the following quarter-hour to hour.
				1. Check
					1. keep monitoring, via the EBS burst balance dashboard (the same dashboard on the infrastructure page), whether EBS is about to run out of credits.
					2. whether there is a heavy load on EBS storage.
				2. Todo
					1. Usually no action is required. If the alert persists, it is a critical issue; follow the Todo for the alert "Burst Balance is 0".
		10. Storage (Burst Balance is 0)
			1. Summary:
				Tasks on the storage are being queued. Everything that goes through EBS IO will slow down.
				1. Check
					1. whether EBS is running out of credits, via the EBS burst balance dashboard (the same dashboard on the infrastructure page).
					2. whether there is a heavy load on EBS storage.
				2. Todo
					1. Log in to the system manually to check whether it is slowing down. If it has slowed down dramatically, choose one of the options below to fix it:
						1. Switch the EBS volume to gp3 with a specified IOPS (in general the default of 12000 should be enough; if not, you may raise it to 18000, and switch back to 12000 once the issue is fixed).
						2. Add more storage to the EBS volume.
		11. S2 Storage (EBSByteBalance% or EBSIOBalance% below 40% for more than 30 mins)
			1. Summary:
				The load on RDS is high and the burst balance may not be able to serve requests in the following quarter-hour to hour.
				1. Check
					1. keep monitoring, via the RDS dashboard (the same dashboard on the infrastructure page), whether RDS is about to run out of credits.
					2. whether there is a heavy load on RDS storage.
				2. Todo
					1. Usually no action is required. If the alert persists, it is a critical issue; fix the top SQL.
					2. Upsize the RDS instance type.
		12. Storage (EBSByteBalance% or EBSIOBalance% is 0)
			1. Summary:
				Tasks on the storage are being queued. Everything that goes through RDS IO will slow down.
				1. Check
					1. whether RDS is running out of credits, via the RDS monitoring dashboard.
					2. whether there is a heavy load on RDS storage.
				2. Todo
					1. Log in to the system manually to check whether it is slowing down. If it has slowed down dramatically, choose one of the options below to fix it:
						1. Fix the top SQL.
						2. Upsize the RDS instance.
		13. S2 DBLoad (AWS-specific, via Performance Insights, more than 2 times the number of CPUs for more than one hour)
			1. Summary:
				The database is overloaded.
				1. Check
					1. AWS console → RDS → Performance Insights, to see which kind of operation is taking the most time.
		14. DBLoad (AWS-specific, via Performance Insights, more than 4 times the number of CPUs for more than one hour)
			1. Summary:
				The database is overloaded, mostly on CPU.
				1. Check
					1. AWS console → RDS → Performance Insights, to see which kind of operation is taking the most time.
		15. S3 DBLoadNonCPU (AWS-specific, via Performance Insights, more than 1 times the number of CPUs for more than one hour)
			1. Summary:
				The database is blocked on something other than CPU. It can be blocked by DB locks, read/write IO, or other causes.
				1. Check
					1. AWS console → RDS → Performance Insights, to see which operation is taking the most time.
		16. Dead tuple (TBD)
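
The storage auto-scaling alert in the Database list above is defined by a formula, `(Free Space + Max Autoscaling Storage - Allocated Storage) / Allocated Storage < 0.2`. The following is a minimal sketch of that arithmetic only, not the rule as deployed in the dashboards. The function names and the GiB unit are assumptions made for illustration; any unit works as long as all three inputs share it.

```python
def autoscale_headroom(free: float, max_autoscale: float, allocated: float) -> float:
    """Return (free + max_autoscale - allocated) / allocated, as in the runbook formula."""
    if allocated <= 0:
        raise ValueError("allocated storage must be positive")
    return (free + max_autoscale - allocated) / allocated


def headroom_alert(free: float, max_autoscale: float, allocated: float,
                   threshold: float = 0.2) -> bool:
    """True when the remaining headroom is below the threshold (0.2 in the runbook)."""
    return autoscale_headroom(free, max_autoscale, allocated) < threshold


if __name__ == "__main__":
    # Hypothetical instance: 1000 GiB allocated, 10 GiB free, auto-scaling capped at 1100 GiB.
    # Headroom = (10 + 1100 - 1000) / 1000 = 0.11, which is below 0.2, so the alert fires.
    print(headroom_alert(free=10, max_autoscale=1100, allocated=1000))
```

With these hypothetical numbers, raising the max auto-scaling storage size (the Todo for this alert) increases the numerator and clears the condition.
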
2. OS (Node level)
	1. CPU
		1. S2 CPU more than 97% for more than 60 mins
			1. Summary:
				The instance has been almost out of CPU for more than 60 mins.
				1. Todo
					1. Keep monitoring
		2. S2 CPU (sy: system > 70% for more than 60 mins) (mark for review)
			1. Summary:
				The instance is too busy with its own system operations to handle normal business tasks.
				1. Todo
					1. Keep monitoring
		3. S2 CPU (si: soft interrupts > 15% for more than 60 mins) (mark for review)
			1. Summary:
				The instance has been almost out of CPU for more than 60 mins.
				1. Todo
					1. Keep monitoring
	2. Memory
		1. S3 Memory more than 95% for more than 10 mins
			1. Summary:
				The instance has been almost out of memory for more than 10 mins.
				1. Todo
					1. Keep monitoring
	3. Disk
		1. S3 Disk usage more than 95%
			1. Summary:
				The instance is almost out of disk space.
				1. Todo
					1. Add more storage to the disk.
		2. [Disk read/write latency](https://devconnected.com/monitoring-disk-i-o-on-linux-with-the-node-exporter/#b_Read_Write_Latencies) (TBD)
		3. S3 [Inode usage](https://devconnected.com/monitoring-disk-i-o-on-linux-with-the-node-exporter/#c_Number_of_inodes_on_our_system) > 97%
			1. Summary:
				The instance will very soon be blocked by the OS-level soft limit (inodes).
				1. Todo
					1. Restart the pods on the instance to release inodes.
					2. If the step above does not help, open an incident for further analysis.
		4. [Node disk IO load](https://devconnected.com/monitoring-disk-i-o-on-linux-with-the-node-exporter/#d_Overall_IO_load_on_your_instance) (TBD)
	4. Network
		1. network operation latency (TBD)
		2. network transit error rate (TBD)
		3. network transit drop rate (TBD)
		4. network transit queue length (TBD)
		5. Throughput / bandwidth (TBD)
	5. S3 Load (Load Avg 15m / core number > 200% for 35 mins)
		1. Summary:
			The instance has been overloaded for more than 35 mins.
			1. Todo
				1. Keep monitoring
				2. If it happens multiple times in a day, run the rebalancing pod script.
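
Most alerts in this list share one shape: a value must stay above a threshold for a minimum duration, for example "Load Avg 15m / core number > 200% for 35 mins". The sketch below shows one way to evaluate that condition over a series of samples. It is an illustration under stated assumptions, not the rule used by the dashboards: the helper names, the `(epoch_seconds, value)` sample format, and the 5-minute scrape interval in the usage example are all hypothetical.

```python
from typing import Iterable, Optional, Tuple


def load_per_core(load_avg_15m: float, cores: int) -> float:
    """Normalize the 15-minute load average by core count (2.0 means 200%)."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return load_avg_15m / cores


def sustained_breach(samples: Iterable[Tuple[float, float]],
                     threshold: float,
                     min_duration_s: float) -> bool:
    """True if consecutive samples stay above threshold for at least min_duration_s.

    samples are (epoch_seconds, value) pairs in ascending time order.
    A single sample at or below the threshold resets the run.
    """
    run_start: Optional[float] = None
    for ts, value in samples:
        if value > threshold:
            if run_start is None:
                run_start = ts
            if ts - run_start >= min_duration_s:
                return True
        else:
            run_start = None
    return False


if __name__ == "__main__":
    # Hypothetical 8-core node with a load average of 18.0, sampled every 5 minutes for 40 minutes.
    ratios = [(i * 300, load_per_core(18.0, cores=8)) for i in range(9)]
    print(sustained_breach(ratios, threshold=2.0, min_duration_s=35 * 60))
```

The same `sustained_breach` helper would fit the other "for more than N mins" conditions in this list by changing the threshold and duration.
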
3. Container
	1. CPU
		1. S2 CPU (CPU more than 97% for more than 60 mins)
			1. Summary:
				The instance has been almost out of CPU for more than 60 mins.
				1. Todo
					1. Keep monitoring
	2. Memory
		1. swap usage
	3. Disk
		1. [Disk read/write latency](https://devconnected.com/monitoring-disk-i-o-on-linux-with-the-node-exporter/#b_Read_Write_Latencies) (TBD)
		2. S3 Inode usage (free/total) > 97%
			1. Summary:
				The instance will very soon be blocked by the OS-level soft limit (inodes).
				1. Todo
					1. Restart the pods on the instance to release inodes.
					2. If the step above does not help, open an incident for further analysis.
	4. Network
		1. network transit error rate (TBD)
		2. network transit drop rate (TBD)
	5. Unavailable service (send the alert directly; TBD, because different services have different severities and further drill-down is required)
		1. SMA
			1. critical path unavailable: svc portal / runtime ui / gateway / platform / redis / rabbitmq / bo-login / idm / bo-ats / ingress-nginx / sma-ui / bo-farcade
				1. Summary (same for all the availability alerts):
					The service is not available now.
					1. Todo
						1. Run `kubectl describe pod <pod name> -n <namespace>` and `kubectl logs <pod name> -n <namespace>` to understand the reason for the failure.
						2. Try to fix it based on the results from step 1.
			2. S2 partial business impact: others not in S0, search-related (content, DIH, DAH, search, proxy) / auto pass / bo-ui / bo-user
			3. S3 no obvious business impact: XMPP / XIE / Smart Ticket / stx / virtual agent / ppo / web socket gateway / smart-ui / ocr / smarta-installer
			4. S4 services outside of ESM / toolkit
		2. CMS
			1. critical path unavailable: itom-cms-gateway, itom-idm, itom-ingress-controller, itom-ucmdb-browser, tom-ucmdb-solr, itom-ucmdb
			2. S2 partial business impact: itom-autopass-lms, itom-vault
			3. S3 no obvious business impact:
			4. S4 services outside of ESM / toolkit: itom-ucmdb-probe, itom-ucmdb-dfp-lunux-installer, itom-ucmdb-dfp-windows-installer, itom-ucmdb-localclient-installers
	6. Load
		1. S3 Load Avg 15m / core number > 200% for 35 mins (TBD, because it is not observable via current metrics)
			1. Summary:
				The instance has been overloaded for more than 35 mins.
				1. Todo
					1. Keep monitoring
					2. If it happens multiple times in a day, run the rebalancing pod script.
	7. Threads
		1. container\_threads on process (TBD)
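
For the "Unavailable service" Todo in the Container list above, the two diagnostic commands can be wrapped in a small helper so that they are always run together. This is only a sketch: the function names, the `--tail=200` limit, and the pod and namespace values in the usage example are assumptions for illustration. Note that `kubectl describe` takes a resource type, so the pod form is `kubectl describe pod <pod name> -n <namespace>`.

```python
import subprocess
from typing import List


def triage_commands(pod: str, namespace: str, tail: int = 200) -> List[List[str]]:
    """Build the describe and logs commands for one pod as argv lists."""
    return [
        ["kubectl", "describe", "pod", pod, "-n", namespace],
        ["kubectl", "logs", pod, "-n", namespace, f"--tail={tail}"],
    ]


def run_triage(pod: str, namespace: str) -> None:
    """Run both commands and print their output; requires kubectl and cluster access."""
    for argv in triage_commands(pod, namespace):
        result = subprocess.run(argv, capture_output=True, text=True, check=False)
        print("$ " + " ".join(argv))
        # Fall back to stderr so a failing command still shows why it failed.
        print(result.stdout or result.stderr)


if __name__ == "__main__":
    # Hypothetical pod and namespace; this only prints the commands, it does not run them.
    for argv in triage_commands("gateway-abc", "my-namespace"):
        print(" ".join(argv))
```

Building the commands as argv lists rather than one shell string avoids quoting problems when a pod name is pasted in from an alert.
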
4. App metrics
	1. Thread
	2. Connections
	3. Limits
	4. Smart Analytics
		1. S3 Content data ratio (total doc / committed doc) > 1.20
			1. Summary:
				All queries against IDOL will take more time and slow down.
				1. Todo
					1. Run the Jenkins job for IDOL compact.
					2. Or follow the steps in the guide below:
						[https://docs.microfocus.com/doc/SMAX/2022.05/Searchslow](https://docs.microfocus.com/doc/SMAX/2022.05/Searchslow)
		2. S3 Documents per Content > 3M (ignore the archive content)
			1. Summary:
				All queries against IDOL can be impacted.
				1. Todo
					1. Scale content groups: [https://docs.microfocus.com/doc/SMAX/24.4/SmartAAdmin](https://docs.microfocus.com/doc/SMAX/24.4/SmartAAdmin)
	5. Rabbitmq (each node)
		1. S3 queue > 200 / 250 for more than 30 mins (200 for medium profile or lower, 250 for large profile)
			1. Summary:
				The RabbitMQ queues are higher than normal.
				1. Todo
					1. Keep monitoring
					2. If the number keeps rising, consider performing the steps described here:
						[https://docs.microfocus.com/doc/SMAX/2022.05/RabbitMQNotStart#Solution](https://docs.microfocus.com/doc/SMAX/2022.05/RabbitMQNotStart#Solution)
		2. S3 Pending Messages/Minute > 500 for more than 30 mins (mark for review)
			1. Summary:
				Pending messages are accumulating in RabbitMQ.
				1. Todo
					1. Keep monitoring
					2. If the number keeps rising, consider performing the steps described here:
						[https://docs.microfocus.com/doc/SMAX/2022.05/RabbitMQNotStart#Solution](https://docs.microfocus.com/doc/SMAX/2022.05/RabbitMQNotStart#Solution)
		3. Message queue not equally distributed to different cluster nodes (TBD)
			1. Summary:
				The RabbitMQ nodes are not working as a cluster, which can make RabbitMQ unstable.
				1. Todo
					1. Scale down the RabbitMQ node that is not in the cluster.
					2. Remove the `<rabbitmq-infra-rabbitmq-n>/data/xservices/rabbitmq/x.x.x.xx/mnesia` folders on the NFS server or the bastion node.
					3. Wait until the RabbitMQ nodes are ready.
	6. IDM
		1. S4 Active user (per profile: medium profile > 1100 for more than 30 mins, large profile > 3000 for more than 30 mins)
			1. Summary:
				The number of active users is higher than the target size.
				1. Todo
					1. Keep monitoring
	7. Gateway
		1. S2 Tomcat https connector currentThreadsBusy > 30 for 30 mins
			1. Summary:
				The number of active users is higher than the target size.
				1. Todo
					1. If the number does not drop, consider a rolling restart of the current deployments, for example gateway/platform/serviceportal.
					2. If the number still does not drop after the step above, do a rolling restart of xmpp.
					3. If the number still does not drop after the steps above, take a thread dump of the affected pod.
						[How to generate thread dump and memory dumps for java applications](https://rndwiki.houston.softwaregrp.net/confluence/display/SMA/How+to+generate+thread+dump+and+memory+dumps+for+java+applications)
		2. S2 Httpclient InUse > 20 for 30 mins
			1. Summary:
				The number of active users is higher than the target size.
				1. Todo
					1. If the number does not drop, consider a rolling restart of the current deployments, for example gateway/platform/serviceportal.
					2. If the number still does not drop after the step above, do a rolling restart of xmpp.
					3. If the number still does not drop after the steps above, take a thread dump of the affected pod.
						[How to generate thread dump and memory dumps for java applications](https://rndwiki.houston.softwaregrp.net/confluence/display/SMA/How+to+generate+thread+dump+and+memory+dumps+for+java+applications)
	8. Platform
		1. S2 Tomcat https connector currentThreadsBusy > 30 for 30 mins
			1. Summary:
				The number of active users is higher than the target size.
				1. Todo
					1. If the number does not drop, consider a rolling restart of the current deployments, for example gateway/platform/serviceportal.
					2. If the number still does not drop after the step above, do a rolling restart of xmpp.
					3. If the number still does not drop after the steps above, take a thread dump of the affected pod.
						[How to generate thread dump and memory dumps for java applications](https://rndwiki.houston.softwaregrp.net/confluence/display/SMA/How+to+generate+thread+dump+and+memory+dumps+for+java+applications)
		2. S2 Httpclient InUse > 20 for 30 mins
			1. Summary:
				The number of active users is higher than the target size.
				1. Todo
					1. If the number does not drop, consider a rolling restart of the current deployments, for example gateway/platform/serviceportal.
					2. If the number still does not drop after the step above, do a rolling restart of xmpp.
					3. If the number still does not drop after the steps above, take a thread dump of the affected pod.
						[How to generate thread dump and memory dumps for java applications](https://rndwiki.houston.softwaregrp.net/confluence/display/SMA/How+to+generate+thread+dump+and+memory+dumps+for+java+applications)
	9. Serviceportal
		1. S2 Tomcat https connector currentThreadsBusy > 30 for 30 mins
			1. Summary:
				The number of active users is higher than the target size.
				1. Todo
					1. If the number does not drop, consider a rolling restart of the current deployments, for example gateway/platform/serviceportal.
					2. If the number still does not drop after the step above, do a rolling restart of xmpp.
					3. If the number still does not drop after the steps above, take a thread dump of the affected pod.
						[How to generate thread dump and memory dumps for java applications](https://rndwiki.houston.softwaregrp.net/confluence/display/SMA/How+to+generate+thread+dump+and+memory+dumps+for+java+applications)
		2. S2 Httpclient InUse > 20 for 30 mins
			1. Summary:
				The number of active users is higher than the target size.
				1. Todo
					1. If the number does not drop, consider a rolling restart of the current deployments, for example gateway/platform/serviceportal.
					2. If the number still does not drop after the step above, do a rolling restart of xmpp.
					3. If the number still does not drop after the steps above, take a thread dump of the affected pod.
						[How to generate thread dump and memory dumps for java applications](https://rndwiki.houston.softwaregrp.net/confluence/display/SMA/How+to+generate+thread+dump+and+memory+dumps+for+java+applications)
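
The two Smart Analytics alerts in the App metrics list above reduce to simple arithmetic: a content data ratio (total doc / committed doc) above 1.20, and more than 3M documents per content. The sketch below illustrates both checks. The function names and the sample document counts are hypothetical, and how the counts are obtained from IDOL is outside its scope.

```python
CONTENT_RATIO_LIMIT = 1.20        # threshold from the runbook
DOCS_PER_CONTENT_LIMIT = 3_000_000  # "3M" from the runbook


def content_data_ratio(total_docs: int, committed_docs: int) -> float:
    """Ratio of total documents to committed documents for one content engine."""
    if committed_docs <= 0:
        raise ValueError("committed_docs must be positive")
    return total_docs / committed_docs


def needs_compaction(total_docs: int, committed_docs: int) -> bool:
    """True when the ratio exceeds 1.20, i.e. the IDOL compact job should be considered."""
    return content_data_ratio(total_docs, committed_docs) > CONTENT_RATIO_LIMIT


def needs_scaling(docs_in_content: int) -> bool:
    """True when one (non-archive) content holds more than 3M documents."""
    return docs_in_content > DOCS_PER_CONTENT_LIMIT


if __name__ == "__main__":
    # Hypothetical counts: 1.3M total documents against 1.0M committed documents.
    print(needs_compaction(1_300_000, 1_000_000))
    print(needs_scaling(3_500_000))
```
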
5. Instrumental