# Runbooks-based-on-monitoring_686083879
|
|
## Alerts, Description and Summary
|
|
|
|
Alerts come from monitoring and experience.
|
|
|
|
Here is a reference list of items to be sent as alerts. The [Grafana monitoring dashboards](https://github.houston.softwaregrp.net/smax-saas-ops/ESM-Saas-Monitoring) are developed based on the list below.
|
|
|
|
1. Infrastructure
|
|
1. Compute
|
|
2. Network
|
|
1. ALB 5xx (more than 50 in a 2-minute window)
|
|
1. Summary:
|
|
More than 50 5xx errors have been triggered on the frontend. Multiple end users may be experiencing a production issue on their side.
|
|
1. Check whether any other alerts were reported around the same time.
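The "more than 50 in a 2-minute window" condition can be illustrated with a sliding-window counter. This is only a minimal sketch of the counting logic, not the rule used by the dashboards; the class name `Alb5xxWindow` and its parameters are hypothetical.

```python
from collections import deque

WINDOW_SECONDS = 120  # the 2-minute window from the alert definition
THRESHOLD = 50        # fire when MORE than 50 5xx responses fall in the window


class Alb5xxWindow:
    """Hypothetical helper that counts 5xx responses in a sliding time window."""

    def __init__(self, window_seconds=WINDOW_SECONDS, threshold=THRESHOLD):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._events = deque()  # timestamps (in seconds) of observed 5xx responses

    def record(self, timestamp):
        """Record one 5xx response; return True if the alert condition is met."""
        self._events.append(timestamp)
        # Drop events that have fallen out of the window.
        while self._events and self._events[0] <= timestamp - self.window_seconds:
            self._events.popleft()
        return len(self._events) > self.threshold
```

Feeding one error per second into `record()` makes the condition true on the 51st error; a single error arriving long after the burst resets the count to one.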
|
|
2. S2 NEW ALB target 5xx (TBD)
|
|
3. Storage
|
|
1. S3 EBS (EBS disk queue depth more than 5 for more than 10 mins)
|
|
1. Summary:
|
|
Tasks on the storage are being queued.
|
|
1. Check
|
|
1. whether EBS is running out of credits via EBS burst balance dashboard (Same Dashboard in the infrastructure page).
|
|
2. whether there is a big load against EBS storage.
|
|
2. Todo
|
|
1. No action is required. If it is a node-level issue, the AWS Auto Scaling group usually replaces the node after a while.
|
|
2. S2 EBS (EBS burst balance below 40% for more than 30 mins)
|
|
1. Summary:
|
|
The load on EBS is high, and the burst balance may not be able to serve requests over the next quarter of an hour to an hour.
|
|
1. Check
|
|
1. keep monitoring the EBS burst balance dashboard to see whether EBS will run out of credits soon (same dashboard in the infrastructure page).
|
|
2. whether there is a big load against EBS storage.
|
|
2. Todo
|
|
1. Usually no action is required. If the alert persists, it is a critical issue; follow the Todo for the case where Burst Balance is 0.
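Several alerts in this list use the pattern "below a threshold for more than N minutes". As an illustration only, the sketch below checks that pattern over a list of samples; the function name and the `(timestamp, value)` sample format are assumptions, and in practice the monitoring stack evaluates the rule.

```python
def sustained_below(samples, threshold, min_duration_s):
    """Return True if the value has stayed below `threshold` for more than
    `min_duration_s` seconds, measured up to the latest sample.

    `samples` is a list of (timestamp_seconds, value) pairs sorted by time.
    """
    breach_start = None
    for timestamp, value in samples:
        if value < threshold:
            if breach_start is None:
                breach_start = timestamp  # start of the current breach
        else:
            breach_start = None  # value recovered, reset the breach
    if breach_start is None:
        return False
    return samples[-1][0] - breach_start > min_duration_s
```

For the burst balance alert, this would be called with `threshold=40` and `min_duration_s=30 * 60`. A single sample at or above the threshold resets the timer.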
|
|
3. EBS (EBS burst balance is 0)
|
|
1. Summary:
|
|
Tasks on the storage are being queued. Everything that goes through EBS IO will slow down.
|
|
1. Check
|
|
1. whether EBS is running out of credits via EBS burst balance dashboard (Same Dashboard in the infrastructure page).
|
|
2. whether there is a big load against EBS storage.
|
|
2. Todo
|
|
1. Log in to the system manually to check whether it has slowed down. If it has slowed down dramatically, choose one of the options below to fix it:
|
|
1. Switch the EBS volume to GP3 with a specified IOPS (the default of 12000 is usually enough; if not, raise it to 18000, and switch back to 12000 once the issue is fixed)
|
|
2. Add more storage to the EBS
|
|
4. S2 EFS (Burst credit below 40% for more than 30 mins)
|
|
1. Summary:
|
|
The tasks on the storage will be queued soon.
|
|
1. Check
|
|
1. whether EFS is running out of credits via EFS burst credit dashboard (Same Dashboard in the infrastructure page).
|
|
2. whether there is a big load against EFS storage.
|
|
2. Todo
|
|
1. Log in to the system manually to check whether it has slowed down. If it has slowed down dramatically, follow the guidance below:
|
|
1. Usually no action is required. If the alert persists, it is a critical issue.
|
|
5. EFS (Burst credit is 0)
|
|
1. Summary:
|
|
Tasks on the storage are being queued. Everything that goes through EFS IO will slow down.
|
|
1. Check
|
|
1. whether EFS is running out of credits via EFS burst credit dashboard (Same Dashboard in the infrastructure page).
|
|
2. whether there is a big load against EFS storage.
|
|
2. Todo
|
|
1. Log in to the system manually to check whether it has slowed down. If it has slowed down dramatically, choose one of the options below to fix it:
|
|
1. Switch the EFS to provisioned throughput mode (for example, 60 - 100 MB/s; switch back once the issue is fixed)
|
|
4. Virtualization
|
|
5. Database
|
|
1. S2 CPU (CPU more than 97% for more than 60 mins)
|
|
1. Summary:
|
|
The overall CPU usage has been above 97% for more than one hour.
|
|
1. Check
|
|
1. Performance Insights, for the top queries that consume the most CPU
|
|
2. Todo
|
|
1. Keep monitoring and check whether other database metrics are abnormal.
|
|
2. Get top 10 query information.
|
|
2. S2 CPU (sy: system > 70% for more than 60 mins)
|
|
1. Summary:
|
|
The CPU is spending more time on system-level processing than on handling the business flow.
|
|
1. Check
|
|
1. Performance Insights, for the top queries that consume the most CPU
|
|
2. Todo
|
|
1. Keep monitoring and check whether other database metrics are abnormal.
|
|
3. S2 CPU (si: soft interrupts > 15% for more than 60 mins)
|
|
1. Summary:
|
|
The CPU is spending more time on system-level processing than on handling the business flow.
|
|
1. Check
|
|
1. Performance Insights, for the top queries that consume the most CPU
|
|
2. Todo
|
|
1. Keep monitoring and check whether other database metrics are abnormal.
|
|
4. S3 Disk queue depth (EBS disk queue depth more than 5 for more than 10 mins)
|
|
1. Summary:
|
|
Tasks on the storage are being queued.
|
|
1. Check
|
|
1. whether EBS is running out of credits via EBS burst balance dashboard (Same Dashboard in the infrastructure page).
|
|
2. whether there is a big load against EBS storage.
|
|
5. S2 Disk (Free Storage Space is below 500 MB)
|
|
1. Summary:
|
|
The instance is running out of storage.
|
|
1. Todo
|
|
1. Add more storage to EBS
|
|
2. Enable storage auto-scaling
|
|
6. S2 Disk (storage does not have enough headroom to auto-scale: (Free Space + Max Autoscaling Storage - Allocated Storage) / Allocated Storage < 0.2)
|
|
1. Summary:
|
|
The remaining storage auto-scaling quota of the instance is not enough.
|
|
1. Todo
|
|
1. Increase the max auto-scaling storage size.
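The auto-scaling headroom formula in this alert can be worked through with a small calculation. The sketch below is illustrative only; the function names and the GiB unit are assumptions, and any consistent unit works.

```python
def autoscale_headroom_ratio(free_gib, max_autoscaling_gib, allocated_gib):
    """(Free Space + Max Autoscaling Storage - Allocated Storage) / Allocated Storage."""
    if allocated_gib <= 0:
        raise ValueError("allocated storage must be positive")
    return (free_gib + max_autoscaling_gib - allocated_gib) / allocated_gib


def headroom_alert(free_gib, max_autoscaling_gib, allocated_gib, limit=0.2):
    """True when the headroom ratio is below the 0.2 limit from the alert definition."""
    return autoscale_headroom_ratio(free_gib, max_autoscaling_gib, allocated_gib) < limit
```

For example, with 1000 GiB allocated, a 1100 GiB auto-scaling maximum, and 50 GiB free, the ratio is (50 + 1100 - 1000) / 1000 = 0.15, which is below 0.2, so the alert fires. Raising the maximum to 2000 GiB gives 1.05 and clears it.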
|
|
7. S2 Memory (Free memory less than 5% for more than 5 mins)
|
|
1. Summary:
|
|
The instance will run out of memory soon.
|
|
1. Check
|
|
1. Log in to AWS console → RDS → Monitoring to check whether swap usage is increasing
|
|
2. Todo
|
|
1. Keep monitoring
|
|
2. Consider a rolling restart of the current deployments, for example gateway/platform/serviceportal
|
|
8. Memory (Free memory less than 2% for more than 5 mins)
|
|
1. Summary:
|
|
The instance will run out of memory soon.
|
|
1. Check
|
|
1. Log in to AWS console → RDS → Monitoring to check whether swap usage is increasing
|
|
2. Todo
|
|
1. Consider a rolling restart of the current deployments, for example gateway/platform/serviceportal
|
|
2. If it happens 2-3 times a day and the swap usage is higher:
|
|
1. Consider scaling up RDS, usually by doubling the memory size.
|
|
2. Tune the DB based on the queries identified as memory consuming
|
|
9. S2 Storage (Burst Balance below 40% for more than 30 mins)
|
|
1. Summary:
|
|
The load on EBS is high, and the burst balance may not be able to serve requests over the next quarter of an hour to an hour.
|
|
1. Check
|
|
1. keep monitoring the EBS burst balance dashboard to see whether EBS will run out of credits soon (same dashboard in the infrastructure page).
|
|
2. whether there is a big load against EBS storage.
|
|
2. Todo
|
|
1. Usually no action is required. If the alert persists, it is a critical issue; follow the Todo for the case where Burst Balance is 0.
|
|
10. Storage (Burst Balance is 0)
|
|
1. Summary:
|
|
Tasks on the storage are being queued. Everything that goes through EBS IO will slow down.
|
|
1. Check
|
|
1. whether EBS is running out of credits via EBS burst balance dashboard (Same Dashboard in the infrastructure page).
|
|
2. whether there is a big load against EBS storage.
|
|
2. Todo
|
|
1. Log in to the system manually to check whether it has slowed down. If it has slowed down dramatically, choose one of the options below to fix it:
|
|
1. Switch the EBS volume to GP3 with a specified IOPS (the default of 12000 is usually enough; if not, raise it to 18000, and switch back to 12000 once the issue is fixed)
|
|
2. Add more storage to the EBS
|
|
11. S2 Storage (EBSByteBalance% or EBSIOBalance% below 40% for more than 30 mins)
|
|
1. Summary:
|
|
The load on RDS is high, and the burst balance may not be able to serve requests over the next quarter of an hour to an hour.
|
|
1. Check
|
|
1. keep monitoring the RDS dashboard to see whether RDS will run out of credits soon (same dashboard in the infrastructure page).
|
|
2. whether there is a big load against RDS storage.
|
|
2. Todo
|
|
1. Usually no action is required. If the alert persists, it is a critical issue; fix the top SQL.
|
|
2. Upsize the RDS instance type.
|
|
12. Storage (EBSByteBalance% or EBSIOBalance% is 0)
|
|
1. Summary:
|
|
Tasks on the storage are being queued. Everything that goes through RDS IO will slow down.
|
|
1. Check
|
|
1. whether RDS is running out of credits via RDS monitoring dashboard.
|
|
2. whether there is a big load against RDS storage.
|
|
2. Todo
|
|
1. Log in to the system manually to check whether it has slowed down. If it has slowed down dramatically, choose one of the options below to fix it:
|
|
1. Fix the top SQL
|
|
2. Upsize the RDS instance
|
|
13. S2 DBLoad (AWS specific, via Performance Insights: more than 2 times the number of CPUs for more than one hour)
|
|
1. Summary:
|
|
The database is overloaded.
|
|
1. Check
|
|
1. AWS console → RDS → Performance Insights to check which kind of operation is taking the most time
|
|
14. DBLoad (AWS specific, via Performance Insights: more than 4 times the number of CPUs for more than one hour)
|
|
1. Summary:
|
|
The database is mostly overloaded on CPU.
|
|
1. Check
|
|
1. AWS console → RDS → Performance Insights to check which kind of operation is taking the most time
|
|
15. S3 DBLoadNonCPU (AWS specific, via Performance Insights: more than 1 times the number of CPUs for more than one hour)
|
|
1. Summary:
|
|
The database is blocked on something other than CPU; it can be blocked by DB locks, read/write IO, or other reasons.
|
|
1. Check
|
|
1. AWS console → RDS → Performance Insights to check which operation is taking the most time
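The three DB load alerts compare the load reported by Performance Insights against multiples of the CPU count. The sketch below only illustrates that comparison; the function name and the returned labels are hypothetical, and it assumes the inputs are values that have already held for more than one hour.

```python
def classify_db_load(db_load, db_load_non_cpu, cpu_count):
    """Return the list of DB load alerts that match the given sustained values.

    db_load:          total database load (average active sessions)
    db_load_non_cpu:  the part of the load that is not waiting on CPU
    cpu_count:        number of CPUs of the database instance
    """
    if cpu_count <= 0:
        raise ValueError("cpu_count must be positive")
    alerts = []
    if db_load > 4 * cpu_count:
        alerts.append("DBLoad > 4x CPUs")
    elif db_load > 2 * cpu_count:
        alerts.append("S2 DBLoad > 2x CPUs")
    if db_load_non_cpu > 1 * cpu_count:
        alerts.append("S3 DBLoadNonCPU > 1x CPUs")
    return alerts
```

On a 4-CPU instance, a sustained load of 9 matches only the S2 alert, while a load of 17 with a non-CPU share of 5 matches both the 4x alert and the non-CPU alert.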
|
|
16. Dead tuple (TBD)
|
|
2. OS (Node level)
|
|
1. CPU
|
|
1. S2 CPU more than 97% for more than 60 mins
|
|
1. Summary:
|
|
The instance has been almost out of CPU for more than 60 mins.
|
|
1. Todo
|
|
1. Keep monitoring
|
|
2. S2 CPU (sy: system > 70% for more than 60 mins) (mark for review)
|
|
1. Summary:
|
|
The instance is too busy with its own system operations to handle normal business tasks.
|
|
1. Todo
|
|
1. Keep monitoring
|
|
3. S2 CPU (si: soft interrupts > 15% for more than 60 mins) (mark for review)
|
|
1. Summary:
|
|
The instance has been almost out of CPU for more than 60 mins.
|
|
1. Todo
|
|
1. Keep monitoring
|
|
2. Memory
|
|
1. S3 Memory more than 95% for more than 10 mins
|
|
1. Summary:
|
|
The instance has been almost out of memory for more than 10 mins.
|
|
1. Todo
|
|
1. Keep monitoring
|
|
3. Disk
|
|
1. S3 Disk usage more than 95%
|
|
1. Summary:
|
|
The instance is almost running out of disk.
|
|
1. Todo
|
|
1. Add more storage to the disk
|
|
2. [Disk read/write latency](https://devconnected.com/monitoring-disk-i-o-on-linux-with-the-node-exporter/#b_Read_Write_Latencies) (TBD)
|
|
3. S3 [Inode usage](https://devconnected.com/monitoring-disk-i-o-on-linux-with-the-node-exporter/#c_Number_of_inodes_on_our_system) > 97%
|
|
1. Summary:
|
|
The instance will very soon be blocked by the OS-level soft limit (inode).
|
|
1. Todo
|
|
1. Restart pods on the instance to release inodes
|
|
2. If the step above does not help, open an incident for further analysis.
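To confirm inode usage directly on a node, the percentage can be computed from the filesystem statistics. This is a minimal sketch, assuming a Unix node with Python available; the function names are hypothetical, and some filesystems report zero total inodes, in which case the calculation does not apply.

```python
import os


def inode_usage_percent(total_inodes, free_inodes):
    """Percentage of inodes in use: (total - free) / total * 100."""
    if total_inodes <= 0:
        raise ValueError("filesystem does not report a usable inode total")
    return 100.0 * (total_inodes - free_inodes) / total_inodes


def inode_usage_for_path(path="/"):
    """Read the inode counters for the filesystem that contains `path` (Unix only)."""
    stats = os.statvfs(path)
    return inode_usage_percent(stats.f_files, stats.f_ffree)
```

A filesystem with 1000 inodes and 20 free is at 98% usage, which is above the 97% threshold.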
|
|
4. [Node disk IO load](https://devconnected.com/monitoring-disk-i-o-on-linux-with-the-node-exporter/#d_Overall_IO_load_on_your_instance) (TBD)
|
|
4. Network
|
|
1. network operation latency (TBD)
|
|
2. network transit error rate (TBD)
|
|
3. network transit drop rate (TBD)
|
|
4. network transit queue length (TBD)
|
|
5. Throughput / bandwidth (TBD)
|
|
5. S3 Load (Load Avg 15m / core number > 200% for 35 mins)
|
|
1. Summary:
|
|
The instance is overloaded for more than 35 mins.
|
|
1. Todo
|
|
1. Keep monitoring
|
|
2. If it happens multiple times in a day, run the rebalancing pod script.
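The "Load Avg 15m / core number > 200%" condition is the 15-minute load average divided by the number of cores, expressed as a percentage. The sketch below is illustrative only, with hypothetical function names; `current_load_per_core_percent` assumes a Unix node where `os.getloadavg()` is available.

```python
import os


def load_per_core_percent(load_avg_15m, core_count):
    """15-minute load average per core, as a percentage."""
    if core_count <= 0:
        raise ValueError("core_count must be positive")
    return 100.0 * load_avg_15m / core_count


def is_overloaded(load_avg_15m, core_count, threshold_percent=200.0):
    """True when the per-core load exceeds the 200% threshold from the alert."""
    return load_per_core_percent(load_avg_15m, core_count) > threshold_percent


def current_load_per_core_percent():
    """Read the current values from the local node (Unix only)."""
    _, _, load_15m = os.getloadavg()
    return load_per_core_percent(load_15m, os.cpu_count() or 1)
```

A 4-core node with a 15-minute load average of 10 is at 250% and matches the condition; a load average of exactly 8 is at 200% and does not.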
|
|
3. Container
|
|
1. CPU
|
|
1. S2 CPU (CPU more than 97% for more than 60 mins)
|
|
1. Summary:
|
|
The instance has been almost out of CPU for more than 60 mins.
|
|
1. Todo
|
|
1. Keep monitoring
|
|
2. Memory
|
|
1. swap usage
|
|
3. Disk
|
|
1. [Disk read/write latency](https://devconnected.com/monitoring-disk-i-o-on-linux-with-the-node-exporter/#b_Read_Write_Latencies) (TBD)
|
|
2. S3 Inode usage (1 - free/total) > 97%
|
|
1. Summary:
|
|
The instance will very soon be blocked by the OS-level soft limit (inode).
|
|
1. Todo
|
|
1. Restart pods on the instance to release inodes
|
|
2. If the step above does not help, open an incident for further analysis.
|
|
4. Network
|
|
1. network transit error rate (TBD)
|
|
2. network transit drop rate (TBD)
|
|
5. Unavailable service (Send alert directly, TBD, because different services have different severities. Further drill-down is required.)
|
|
1. SMA
|
|
1. critical path unavailable: svc portal / runtime ui / gateway / platform / redis / rabbitmq / bo-login / idm / bo-ats / ingress-nginx / sma-ui / bo-farcade
|
|
1. Summary (same for all the availability alerts):
|
|
The service is not available now.
|
|
1. Todo
|
|
1. Run 'kubectl describe pod <pod name> -n <namespace>' and 'kubectl logs <pod name> -n <namespace>' to understand the reason for the failure
|
|
2. Try to fix based on the results from step 1.
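The two diagnostic commands above can be assembled for a given pod as follows. This is a small sketch that only builds and prints the commands and executes nothing; the function name, pod name, and namespace are placeholders, not values from this environment.

```python
import shlex


def diagnose_commands(pod_name, namespace):
    """Return the describe and logs commands for one pod as argument lists."""
    return [
        ["kubectl", "describe", "pod", pod_name, "-n", namespace],
        ["kubectl", "logs", pod_name, "-n", namespace],
    ]


if __name__ == "__main__":
    # Placeholder pod and namespace; replace with the failing pod's values.
    for argv in diagnose_commands("example-pod", "example-namespace"):
        print(shlex.join(argv))
```

Building the commands as argument lists keeps them safe to pass to `subprocess.run` later without shell quoting issues.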
|
|
2. S2 partial business impact: others not in S0, search-related (content, DIH, DAH, search, proxy) / auto pass / bo-ui / bo-user
|
|
3. S3 no obvious impact on business: XMPP / XIE / Smart Ticket / stx / virtual agent / ppo / web socket gateway / smart-ui / ocr / smarta-installer
|
|
4. S4 services outside of ESM / toolkit
|
|
2. CMS
|
|
1. critical path unavailable: itom-cms-gateway, itom-idm, itom-ingress-controller, itom-ucmdb-browser, tom-ucmdb-solr, itom-ucmdb
|
|
2. S2 partial business impact: itom-autopass-lms, itom-vault
|
|
3. S3 no obvious impact on business:
|
|
4. S4 services outside of ESM / toolkit: itom-ucmdb-probe, itom-ucmdb-dfp-lunux-installer, itom-ucmdb-dfp-windows-installer, itom-ucmdb-localclient-installers
|
|
6. Load
|
|
1. S3 Load Avg 15m / core number > 200% for 35 mins (TBD, because it is not observable via current metrics)
|
|
1. Summary:
|
|
The instance is overloaded for more than 35 mins.
|
|
1. Todo
|
|
1. Keep monitoring
|
|
2. If it happens multiple times in a day, run the rebalancing pod script.
|
|
7. Threads
|
|
1. container\_threads on process (TBD)
|
|
4. App metrics
|
|
1. Thread
|
|
2. Connections
|
|
3. Limits
|
|
4. Smart Analytics
|
|
1. S3 Content data ratio (total docs / committed docs) > 1.20
|
|
1. Summary:
|
|
All queries against IDOL will take more time and slow down.
|
|
1. Todo
|
|
1. Run the IDOL compact Jenkins job.
|
|
2. Or follow the steps in the guide below
|
|
[https://docs.microfocus.com/doc/SMAX/2022.05/Searchslow](https://docs.microfocus.com/doc/SMAX/2022.05/Searchslow)
|
|
2. S3 Documents per Content > 3M (ignore the archive content)
|
|
1. Summary:
|
|
|
All queries against IDOL can be impacted.
|
|
1. Todo
|
|
1. Scale content groups: [https://docs.microfocus.com/doc/SMAX/24.4/SmartAAdmin](https://docs.microfocus.com/doc/SMAX/24.4/SmartAAdmin)
|
|
5. Rabbitmq (each node)
|
|
1. S3 queue > 200 / 250 for more than 30 mins (200 for medium profile or lower, 250 for large profile)
|
|
1. Summary:
|
|
The RabbitMQ queues are higher than normal.
|
|
1. Todo
|
|
1. Keep monitoring
|
|
2. If it keeps rising, consider performing the steps described here:
|
|
[https://docs.microfocus.com/doc/SMAX/2022.05/RabbitMQNotStart#Solution](https://docs.microfocus.com/doc/SMAX/2022.05/RabbitMQNotStart#Solution)
|
|
2. S3 Pending Messages/Minute > 500 for more than 30 mins (mark for review)
|
|
1. Summary:
|
|
Pending messages in RabbitMQ are accumulating.
|
|
1. Todo
|
|
1. Keep monitoring
|
|
2. If it keeps rising, consider performing the steps described here:
|
|
[https://docs.microfocus.com/doc/SMAX/2022.05/RabbitMQNotStart#Solution](https://docs.microfocus.com/doc/SMAX/2022.05/RabbitMQNotStart#Solution)
|
|
3. Message queue not equally distributed to different cluster nodes(TBD)
|
|
1. Summary:
|
|
The RabbitMQ nodes are not working as a cluster. This can make RabbitMQ unstable.
|
|
1. Todo
|
|
1. Scale down the RabbitMQ node which is not in the cluster.
|
|
2. Remove the `<rabbitmq-infra-rabbitmq-n>/data/xservices/rabbitmq/x.x.x.xx/mnesia` folders on the NFS server or the bastion node
|
|
3. Wait until the RabbitMQ nodes are ready
|
|
6. IDM
|
|
1. S4 Active users (per profile: medium profile > 1100 for more than 30 mins, large profile > 3000 for more than 30 mins)
|
|
1. Summary:
|
|
The number of active users exceeds the target size.
|
|
1. Todo
|
|
1. Keep monitoring
|
|
7. Gateway
|
|
1. S2 Tomcat https connector currentThreadsBusy > 30 for 30 mins
|
|
1. Summary:
|
|
The number of active users exceeds the target size.
|
|
1. Todo
|
|
1. If the number does not drop, consider a rolling restart of the current deployments, for example gateway/platform/serviceportal
|
|
2. If the number still does not drop after the step above, do a rolling restart of xmpp.
|
|
3. If the number still does not drop after the steps above, take a thread dump of the affected pod.
|
|
[How to generate thread dump and memory dumps for java applications](https://rndwiki.houston.softwaregrp.net/confluence/display/SMA/How+to+generate+thread+dump+and+memory+dumps+for+java+applications)
|
|
2. S2 Httpclient InUse > 20 for 30 mins
|
|
1. Summary:
|
|
The number of active users exceeds the target size.
|
|
1. Todo
|
|
1. If the number does not drop, consider a rolling restart of the current deployments, for example gateway/platform/serviceportal
|
|
2. If the number still does not drop after the step above, do a rolling restart of xmpp.
|
|
3. If the number still does not drop after the steps above, take a thread dump of the affected pod.
|
|
[How to generate thread dump and memory dumps for java applications](https://rndwiki.houston.softwaregrp.net/confluence/display/SMA/How+to+generate+thread+dump+and+memory+dumps+for+java+applications)
|
|
8. Platform
|
|
1. S2 Tomcat https connector currentThreadsBusy > 30 for 30 mins
|
|
1. Summary:
|
|
The number of active users exceeds the target size.
|
|
1. Todo
|
|
1. If the number does not drop, consider a rolling restart of the current deployments, for example gateway/platform/serviceportal
|
|
2. If the number still does not drop after the step above, do a rolling restart of xmpp.
|
|
3. If the number still does not drop after the steps above, take a thread dump of the affected pod.
|
|
[How to generate thread dump and memory dumps for java applications](https://rndwiki.houston.softwaregrp.net/confluence/display/SMA/How+to+generate+thread+dump+and+memory+dumps+for+java+applications)
|
|
2. S2 Httpclient InUse > 20 for 30 mins
|
|
1. Summary:
|
|
The number of active users exceeds the target size.
|
|
1. Todo
|
|
1. If the number does not drop, consider a rolling restart of the current deployments, for example gateway/platform/serviceportal
|
|
2. If the number still does not drop after the step above, do a rolling restart of xmpp.
|
|
3. If the number still does not drop after the steps above, take a thread dump of the affected pod.
|
|
[How to generate thread dump and memory dumps for java applications](https://rndwiki.houston.softwaregrp.net/confluence/display/SMA/How+to+generate+thread+dump+and+memory+dumps+for+java+applications)
|
|
9. Serviceportal
|
|
1. S2 Tomcat https connector currentThreadsBusy > 30 for 30 mins
|
|
1. Summary:
|
|
The number of active users exceeds the target size.
|
|
1. Todo
|
|
1. If the number does not drop, consider a rolling restart of the current deployments, for example gateway/platform/serviceportal
|
|
2. If the number still does not drop after the step above, do a rolling restart of xmpp.
|
|
3. If the number still does not drop after the steps above, take a thread dump of the affected pod.
|
|
[How to generate thread dump and memory dumps for java applications](https://rndwiki.houston.softwaregrp.net/confluence/display/SMA/How+to+generate+thread+dump+and+memory+dumps+for+java+applications)
|
|
2. S2 Httpclient InUse > 20 for 30 mins
|
|
1. Summary:
|
|
The number of active users exceeds the target size.
|
|
1. Todo
|
|
1. If the number does not drop, consider a rolling restart of the current deployments, for example gateway/platform/serviceportal
|
|
2. If the number still does not drop after the step above, do a rolling restart of xmpp.
|
|
3. If the number still does not drop after the steps above, take a thread dump of the affected pod.
|
|
[How to generate thread dump and memory dumps for java applications](https://rndwiki.houston.softwaregrp.net/confluence/display/SMA/How+to+generate+thread+dump+and+memory+dumps+for+java+applications)
|
|
5. Instrumental
|