# Alert Runbooks Based on Monitoring

## Alerts, Description and Actions

Alerts come from monitoring and experience.

The following is a reference list of conditions that should raise alerts. The [Grafana monitoring dashboards](https://github.houston.softwaregrp.net/smax-saas-ops/ESM-Saas-Monitoring) are built from this list.

### Alert Runbook: \[ S0 - Urgent \] \[ farm-name \] ALB HTTP 5XX Count alert

**Alert Description:** This alert is triggered when the frontend returns more than 34 5xx errors within 3 minutes. Multiple end users may be experiencing a production issue.

**Alert Severity:** S0 - Urgent

**Alert Trigger Conditions:**

- Metric: ALB HTTP 5XX Count
- Threshold: 34
- Duration: 3 minutes

**Actions:**

1. Check whether any other time-correlated alerts are being reported.

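
To make the trigger condition concrete, here is a minimal Python sketch of a rolling 3-minute count check. It assumes one 5xx count sample per minute; the function name and the sampling interval are illustrative assumptions, not the actual alert rule used in the dashboards.

```python
from collections import deque


def breaches_5xx_threshold(per_minute_counts, threshold=34, window_minutes=3):
    """Return True if any rolling window of up to `window_minutes`
    consecutive per-minute samples sums to more than `threshold`."""
    window = deque(maxlen=window_minutes)  # keeps only the newest samples
    for count in per_minute_counts:
        window.append(count)
        if sum(window) > threshold:
            return True
    return False
```

Note that exactly 34 errors in the window does not fire, because the condition is "more than 34".
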
### Alert Runbook: \[ S3 - Warning \] \[ farm-name \] EBS Disk Queue Depth alert

**Alert Description:** This alert is triggered when the EBS disk queue depth is above 5 for more than 10 minutes. Tasks on the storage are being queued.

**Alert Severity:** S3 - Warning

**Alert Trigger Conditions:**

- Metric: EBS disk queue depth
- Threshold: 5
- Duration: 10 minutes

**Actions:**

1. Check
   1. whether EBS is running out of credits, using the EBS burst balance dashboard (same dashboard on the infrastructure page).
   2. whether there is a heavy load on the EBS storage.
2. Todo
   1. No action is required. If it is a node-level issue, the AWS Auto Scaling group will usually replace the node after a while.

### Alert Runbook: \[ S2 - Error \] \[ farm-name \] EBS Burst Balance Average alert

**Alert Description:** This alert is triggered when the EBS burst balance is below 40% for more than 30 minutes. The load on EBS is high, and the burst balance may not be able to serve requests over the next quarter-hour to hour.

**Alert Severity:** S2 - Error

**Alert Trigger Conditions:**

- Metric: EBS burst balance
- Threshold: 40%
- Duration: 30 minutes

**Actions:**

1. Check
   1. keep monitoring the EBS burst balance dashboard to see whether EBS will run out of credits soon (same dashboard on the infrastructure page).
   2. whether there is a heavy load on the EBS storage.
2. Todo
   1. Usually no action is required. If the alert persists, it is a critical issue; follow the Todo for the alert raised when Burst Balance is 0.

### Alert Runbook: \[ S0 - Urgent \] \[ farm-name \] EBS Burst Balance Average alert

**Alert Description:** This alert is triggered when the EBS burst balance is 0. Tasks on the storage are being queued, and everything that goes through EBS I/O will slow down.

**Alert Severity:** S0 - Urgent

**Alert Trigger Conditions:**

- Metric: EBS burst balance
- Threshold: 0
- Duration: immediately

**Actions:**

1. Check
   1. whether EBS is running out of credits, using the EBS burst balance dashboard (same dashboard on the infrastructure page).
   2. whether there is a heavy load on the EBS storage.
2. Todo
   1. Log in to the system manually to check whether it has slowed down. If it has slowed down dramatically, choose one of the options below to fix it:
      1. Switch the EBS volume to gp3 with a specified IOPS value (the default of 3000/12000 is generally enough; if not, you may raise it to 18000, and switch back to 3000/12000 once the issue is fixed).

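
The two EBS burst balance alerts (S2 below 40% for 30 minutes, S0 at 0) can be read as a single classification over a series of readings. The sketch below assumes one burst balance reading per minute, oldest first; the function name and sampling interval are assumptions for illustration, not the dashboard's actual rule.

```python
def burst_balance_severity(samples_pct, warn_below=40.0, warn_minutes=30):
    """Classify a series of burst balance readings (percent, one per minute).

    Returns "S0 - Urgent" when the latest reading is 0, "S2 - Error" when
    the last `warn_minutes` readings are all below `warn_below`, else None.
    """
    if not samples_pct:
        return None
    if samples_pct[-1] <= 0:
        return "S0 - Urgent"  # fires immediately, no duration requirement
    recent = samples_pct[-warn_minutes:]
    if len(recent) >= warn_minutes and all(v < warn_below for v in recent):
        return "S2 - Error"
    return None
```
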
### Alert Runbook: \[ S2 - Error \] \[ farm-name \] EFS Burst Credit Balance alert

**Alert Description:** This alert is triggered when the EFS burst credit balance is below 40% for more than 30 minutes. Tasks on the storage will be queued soon.

**Alert Severity:** S2 - Error

**Alert Trigger Conditions:**

- Metric: EFS Burst Credit Balance
- Threshold: 40%
- Duration: 30 minutes

**Actions:**

1. Check
   1. whether EFS is running out of credits, using the EFS burst credit dashboard (same dashboard on the infrastructure page).
   2. whether there is a heavy load on the EFS storage.
2. Todo
   1. Log in to the system manually to check whether it has slowed down. If it has slowed down dramatically, choose one of the options below to fix it:
      1. Usually no action is required. If the alert persists, it is a critical issue.

### Alert Runbook: \[ S0 - Urgent \] \[ farm-name \] EFS Burst Credit Balance alert

**Alert Description:** This alert is triggered when the EFS burst credit balance is 0. Tasks on the storage are being queued, and everything that goes through EFS I/O will slow down.

**Alert Severity:** S0 - Urgent

**Alert Trigger Conditions:**

- Metric: EFS Burst Credit Balance
- Threshold: 0
- Duration: immediately

**Actions:**

1. Check
   1. whether EFS is running out of credits, using the EFS burst credit dashboard (same dashboard on the infrastructure page).
   2. whether there is a heavy load on the EFS storage.
2. Todo
   1. Log in to the system manually to check whether it has slowed down. If it has slowed down dramatically, choose one of the options below to fix it:
      1. Switch the EFS file system to provisioned throughput mode (for example, 60 - 100 MB/s; switch back once the issue is fixed).

### Alert Runbook: \[ S2 - Error \] \[ farm-name \] RDS CPU Utilization alert

**Alert Description:** This alert is triggered when RDS CPU utilization is above 97% for more than 60 minutes. Overall CPU usage has stayed above 97% for more than an hour.

**Alert Severity:** S2 - Error

**Alert Trigger Conditions:**

- Metric: RDS CPU Utilization
- Threshold: 97%
- Duration: 60 minutes

**Actions:**

1. Check
   1. Performance Insights for the top queries, looking for anything consuming a lot of CPU.
2. Todo
   1. Keep monitoring and check whether other database metrics are abnormal.
   2. Collect information on the top 10 queries.

### Alert Runbook: \[ S2 - Error \] \[ farm-name \] RDS cpuUtilization System alert

**Alert Description:** This alert is triggered when the RDS system CPU share (`sy`: system) is above 70% for more than 60 minutes. The CPU is spending more time on system-level processing than on handling the business flow.

**Alert Severity:** S2 - Error

**Alert Trigger Conditions:**

- Metric: RDS cpuUtilization System
- Threshold: 70%
- Duration: 60 minutes

**Actions:**

1. Check
   1. Performance Insights for the top queries, looking for anything consuming a lot of CPU.
2. Todo
   1. Keep monitoring and check whether other database metrics are abnormal.

### Alert Runbook: \[ S2 - Error \] \[ farm-name \] RDS CPU Soft Interrupts alert

**Alert Description:** This alert is triggered when the RDS soft interrupt CPU share (`si`: soft interrupts) is above 15% for more than 60 minutes. The CPU is spending more time on system-level processing than on handling the business flow.

**Alert Severity:** S2 - Error

**Alert Trigger Conditions:**

- Metric: RDS CPU Soft Interrupts
- Threshold: 15%
- Duration: 60 minutes

**Actions:**

1. Check
   1. Performance Insights for the top queries, looking for anything consuming a lot of CPU.
2. Todo
   1. Keep monitoring and check whether other database metrics are abnormal.

### Alert Runbook: \[ S3 - Warning \] \[ farm-name \] RDS Disk queue depth alert

**Alert Description:** This alert is triggered when the RDS EBS disk queue depth is above 5 for more than 10 minutes. Tasks on the storage are being queued.

**Alert Severity:** S3 - Warning

**Alert Trigger Conditions:**

- Metric: RDS Disk queue depth
- Threshold: 5
- Duration: 10 minutes

**Actions:**

1. Check
   1. whether EBS is running out of credits, using the EBS burst balance dashboard (same dashboard on the infrastructure page).
   2. whether there is a heavy load on the EBS storage.

### Alert Runbook: \[ S2 - Error \] \[ farm-name \] RDS Disk Free Storage Space alert

**Alert Description:** This alert is triggered when the RDS disk Free Storage Space is below 500 MB. The instance is running out of storage.

**Alert Severity:** S2 - Error

**Alert Trigger Conditions:**

- Metric: RDS Disk Free Storage Space
- Threshold: 500 MB
- Duration: immediately

**Actions:**

1. Todo
   1. Add more storage to EBS.
   2. Enable storage auto-scaling.

### Alert Runbook: RDS storage auto-scaling quota is not enough

**Alert Description:** This alert is triggered when the storage does not have enough headroom to auto-scale, that is, when (Free Space + Max Autoscaling Storage - Allocated Storage) / Allocated Storage < 0.2. The instance is running out of storage.

**Alert Severity:** S2 - Error

**Alert Trigger Conditions:**

- Metric: (Free Space + Max Autoscaling Storage - Allocated Storage) / Allocated Storage
- Threshold: 0.2
- Duration: TBD

**Actions:**

1. Todo
   1. Increase the max auto-scaling storage size.

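
As a worked example of the headroom formula, the Python sketch below evaluates it for illustrative values. The function names and the sample sizes (in GiB) are assumptions made for this example only.

```python
def autoscaling_headroom_ratio(free_gib, max_autoscaling_gib, allocated_gib):
    """(Free Space + Max Autoscaling Storage - Allocated Storage) / Allocated Storage."""
    return (free_gib + max_autoscaling_gib - allocated_gib) / allocated_gib


def needs_more_headroom(ratio, threshold=0.2):
    """True when the ratio is below the alert threshold."""
    return ratio < threshold


# Hypothetical instance: 900 GiB allocated, 20 GiB free, autoscaling cap of 1000 GiB.
# (20 + 1000 - 900) / 900 = 120 / 900, which is below 0.2, so the alert would fire.
ratio = autoscaling_headroom_ratio(free_gib=20, max_autoscaling_gib=1000, allocated_gib=900)
alert = needs_more_headroom(ratio)
```

Raising the autoscaling cap is what moves the ratio back above the threshold, which matches the Todo for this alert.
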
### Alert Runbook: \[ S2 - Error \] \[ farm-name \] RDS Free Memory Percentage alert

**Alert Description:** This alert is triggered when RDS free memory is below 5% for more than 5 minutes. The instance will run out of memory soon.

**Alert Severity:** S2 - Error

**Alert Trigger Conditions:**

- Metric: RDS Free Memory Percentage
- Threshold: 5%
- Duration: 5 minutes

**Actions:**

1. Check
   1. Log in to the AWS console → RDS → Monitoring to check whether swap usage is increasing.
2. Todo
   1. Keep monitoring.
   2. Consider a rolling restart of the current deployments, for example gateway/platform/serviceportal.

### Alert Runbook: \[ S0 - Urgent \] \[ farm-name \] RDS Free Memory Percentage alert

**Alert Description:** This alert is triggered when RDS free memory is below 2% for more than 5 minutes. The instance will run out of memory soon.

**Alert Severity:** S0 - Urgent

**Alert Trigger Conditions:**

- Metric: RDS Free Memory Percentage
- Threshold: 2%
- Duration: 5 minutes

**Actions:**

1. Check
   1. Log in to the AWS console → RDS → Monitoring to check whether swap usage is increasing.
2. Todo
   1. Consider a rolling restart of the current deployments, for example gateway/platform/serviceportal.
   2. If it happens 2-3 times a day and swap usage is elevated:
      1. Consider scaling up RDS, usually by doubling the memory size.
      2. Tune the database based on the queries identified as memory-consuming.

### Alert Runbook: \[ S2 - Error \] \[ farm-name \] RDS Burst Balance alert

**Alert Description:** This alert is triggered when the RDS Burst Balance is below 40% for more than 30 minutes. The load on EBS is high, and the burst balance may not be able to serve requests over the next quarter-hour to hour.

**Alert Severity:** S2 - Error

**Alert Trigger Conditions:**

- Metric: RDS Burst Balance
- Threshold: 40%
- Duration: 30 minutes

**Actions:**

1. Check
   1. keep monitoring the EBS burst balance dashboard to see whether EBS will run out of credits soon (same dashboard on the infrastructure page).
   2. whether there is a heavy load on the EBS storage.
2. Todo
   1. Usually no action is required. If the alert persists, it is a critical issue; follow the Todo for the alert raised when Burst Balance is 0.

### Alert Runbook: \[ S0 - Urgent \] \[ farm-name \] RDS Burst Balance alert

**Alert Description:** This alert is triggered when the RDS Burst Balance is 0. Tasks on the storage are being queued, and everything that goes through EBS I/O will slow down.

**Alert Severity:** S0 - Urgent

**Alert Trigger Conditions:**

- Metric: RDS Burst Balance
- Threshold: 0
- Duration: immediately

**Actions:**

1. Check
   1. whether EBS is running out of credits, using the EBS burst balance dashboard (same dashboard on the infrastructure page).
   2. whether there is a heavy load on the EBS storage.
2. Todo
   1. Log in to the system manually to check whether it has slowed down. If it has slowed down dramatically, choose one of the options below to fix it:
      1. Switch the EBS volume to gp3 with a specified IOPS value (the default of 12000 is generally enough; if not, you may raise it to 18000, and switch back to 12000 once the issue is fixed).
      2. Add more storage to the EBS volume.

### Alert Runbook: \[ S2 - Error \] \[ farm-name \] SMA/CMS RDS DBLoad alert

**Alert Description:** This alert is triggered when DBLoad is more than 2 times the number of CPUs for more than one hour (AWS-specific, via Performance Insights). The database is overloaded.

**Alert Severity:** S2 - Error

**Alert Trigger Conditions:**

- Metric: RDS DBLoad
- Threshold: 2 times the number of CPUs
- Duration: 1 hour

**Actions:**

1. Check
   1. AWS console → RDS → Performance Insights to see which kind of operation is taking the most time.

### Alert Runbook: \[ S1 - Critical \] \[ farm-name \] SMA/CMS RDS DBLoad alert

**Alert Description:** This alert is triggered when DBLoad is more than 4 times the number of CPUs for more than one hour. The database is overloaded, mostly on CPU.

**Alert Severity:** S1 - Critical

**Alert Trigger Conditions:**

- Metric: RDS DBLoad
- Threshold: 4 times the number of CPUs
- Duration: 1 hour

**Actions:**

1. Check
   1. AWS console → RDS → Performance Insights to see which kind of operation is taking the most time.

### Alert Runbook: \[ S3 - Warning \] \[ farm-name \] SMA/CMS RDS DBLoadNonCPU alert

**Alert Description:** This alert is triggered when DBLoadNonCPU exceeds the number of CPUs for more than one hour. The database is blocked on something other than CPU, such as DB locks, read/write I/O, or other causes.

**Alert Severity:** S3 - Warning

**Alert Trigger Conditions:**

- Metric: RDS DBLoadNonCPU
- Threshold: 1 times the number of CPUs
- Duration: 1 hour

**Actions:**

1. Check
   1. AWS console → RDS → Performance Insights to see which operation is taking the most time.

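
The three DBLoad alerts differ only in which metric is compared against which multiple of the CPU count. The Python sketch below shows that comparison, assuming the inputs are already averaged over the one-hour duration; the function names are illustrative and not part of any AWS API.

```python
def dbload_severity(db_load, vcpus):
    """Map a sustained DBLoad value to the severity defined in the runbooks."""
    if db_load > 4 * vcpus:
        return "S1 - Critical"
    if db_load > 2 * vcpus:
        return "S2 - Error"
    return None


def dbload_non_cpu_severity(db_load_non_cpu, vcpus):
    """DBLoadNonCPU raises a warning once it exceeds the CPU count."""
    return "S3 - Warning" if db_load_non_cpu > vcpus else None
```

For example, on a hypothetical 4-vCPU instance a sustained DBLoad of 9 maps to S2 and a sustained DBLoad of 17 maps to S1.
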
### Alert Runbook: \[ S2 - Error \] \[ farm-name \] Node CPU Usage alert

**Alert Description:** This alert is triggered when node CPU usage is above 97% for more than 60 minutes. The instance has been almost out of CPU for more than 60 minutes.

**Alert Severity:** S2 - Error

**Alert Trigger Conditions:**

- Metric: Node CPU Usage
- Threshold: 97%
- Duration: 60 minutes

**Actions:**

1. Todo
   1. Keep monitoring.

### Alert Runbook: \[ S2 - Error \] \[ farm-name \] Node CPU System alert

**Alert Description:** This alert is triggered when the node system CPU share (`sy`: system) is above 70% for more than 60 minutes. The instance is too busy with its own system operations to handle normal business tasks.

**Alert Severity:** S2 - Error

**Alert Trigger Conditions:**

- Metric: Node CPU System
- Threshold: 70%
- Duration: 60 minutes

**Actions:**

1. Todo
   1. Keep monitoring.

### Alert Runbook: \[ S2 - Error \] \[ farm-name \] Node CPU Soft Interrupts alert

**Alert Description:** This alert is triggered when the node soft interrupt CPU share (`si`: soft interrupts) is above 15% for more than 60 minutes. The instance has been almost out of CPU for more than 60 minutes.

**Alert Severity:** S2 - Error

**Alert Trigger Conditions:**

- Metric: Node CPU Soft Interrupts
- Threshold: 15%
- Duration: 60 minutes

**Actions:**

1. Todo
   1. Keep monitoring.

### Alert Runbook: \[ S3 - Warning \] \[ farm-name \] Node Mem Usage alert

**Alert Description:** This alert is triggered when node memory usage is above 95% for more than 10 minutes. The instance has been almost out of memory for more than 10 minutes.

**Alert Severity:** S3 - Warning

**Alert Trigger Conditions:**

- Metric: Node Mem Usage
- Threshold: 95%
- Duration: 10 minutes

**Actions:**

1. Todo
   1. Keep monitoring.

### Alert Runbook: \[ S3 - Warning \] \[ farm-name \] Node Disk Usage alert

**Alert Description:** This alert is triggered when node disk usage is above 95%. The instance is almost out of disk space.

**Alert Severity:** S3 - Warning

**Alert Trigger Conditions:**

- Metric: Node Disk Usage
- Threshold: 95%
- Duration: immediately

**Actions:**

1. Todo
   1. Add more storage to the disk.

### Alert Runbook: \[ S3 - Warning \] \[ farm-name \] Disk Inode Usage alert

**Alert Description:** This alert is triggered when disk inode usage is above 97%. The instance will very soon be blocked by the OS-level inode limit.

**Alert Severity:** S3 - Warning

**Alert Trigger Conditions:**

- Metric: Disk Inode Usage
- Threshold: 97%
- Duration: immediately

**Actions:**

1. Todo
   1. Restart pods on the instance to release inodes.
   2. If the step above does not help, open an incident for further analysis.

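
When confirming this alert on a node, inode usage can be computed directly from file system statistics. The sketch below uses Python's standard `os.statvfs` (Unix only); the helper names and the default path are assumptions for illustration, and the dashboard may compute the metric differently.

```python
import os


def inode_usage_percent(total_inodes, free_inodes):
    """Percentage of inodes in use; 0.0 if the file system reports no inode count."""
    if total_inodes == 0:
        return 0.0
    return 100.0 * (total_inodes - free_inodes) / total_inodes


def inode_usage_for_path(path="/"):
    """Inode usage of the file system that contains `path`."""
    stats = os.statvfs(path)
    return inode_usage_percent(stats.f_files, stats.f_ffree)
```

A result above 97 for the affected mount point corresponds to the alert condition.
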
### Alert Runbook: \[ S3 - Warning \] \[ farm-name \] Node Load Avg 15m/core

**Alert Description:** This alert is triggered when the node's 15-minute load average per core is above 200% for 35 minutes. The instance has been overloaded for more than 35 minutes.

**Alert Severity:** S3 - Warning

**Alert Trigger Conditions:**

- Metric: Node Load Avg 15m/core
- Threshold: 2 (200%)
- Duration: 35 minutes

**Actions:**

1. Todo
   1. Keep monitoring.
   2. If it happens multiple times in a day, run the pod rebalancing script.

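
The metric is the 15-minute load average divided by the number of cores, so a value of 2 means 200%. A minimal Python sketch of that calculation follows; it relies on the standard `os.getloadavg` and `os.cpu_count` calls, and the helper names are illustrative.

```python
import os


def load_per_core(load_avg_15m, cores):
    """15-minute load average normalised by core count (2.0 means 200%)."""
    return load_avg_15m / cores


def current_load_per_core():
    """Read the value for the local machine (Unix only)."""
    _, _, load15 = os.getloadavg()
    return load_per_core(load15, os.cpu_count() or 1)
```

For example, a 15-minute load average of 16 on an 8-core node gives exactly 2.0, which sits on the threshold and does not yet exceed it.
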
### Alert Runbook: \[ S2 - Error \] \[ farm-name \] Pod CPU usage alert

**Alert Description:** This alert is triggered when pod CPU usage is above 97% for more than 60 minutes. The pod has been almost out of CPU for more than 60 minutes.

**Alert Severity:** S2 - Error

**Alert Trigger Conditions:**

- Metric: Pod CPU usage
- Threshold: 97%
- Duration: 60 minutes

**Actions:**

1. Todo
   1. Keep monitoring.

### Alert Runbook: \[ S3 - Warning \] \[ farm-name \] Pod Inode Usage alert

**Alert Description:** This alert is triggered when pod inode usage (free/total) is above 97%. The instance will very soon be blocked by the OS-level inode limit.

**Alert Severity:** S3 - Warning

**Alert Trigger Conditions:**

- Metric: Pod Inode Usage
- Threshold: 97%
- Duration: immediately

**Actions:**

1. Todo
   1. Restart pods on the instance to release inodes.
   2. If the step above does not help, open an incident for further analysis.

### Alert Runbook: \[ S0 - Urgent \] \[ farm-name \] SMA Unavailable k8s resource alert

**Alert Description:** This alert is triggered when any of these services is unavailable: portal / runtime ui / gateway / platform / redis / rabbitmq / bo-login / idm / bo-ats / ingress-nginx / sma-ui / bo-farcade.

**Alert Severity:** S0 - Urgent

**Alert Trigger Conditions:**

- Metric: services not available
- Threshold: 0
- Duration: immediately

**Actions:**

1. Todo
   1. Run `kubectl describe pod <pod name> -n <namespace>` and `kubectl logs <pod name> -n <namespace>` to understand the reason for the failure.
   2. Try to fix the issue based on the results from step 1.

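
The triage step used by all of the "Unavailable k8s resource" runbooks can be scripted. The Python sketch below only builds the `kubectl` argument lists and optionally runs them; the pod and namespace values in the usage lines are hypothetical placeholders. Note that `kubectl describe` needs a resource type (`pod`), and `kubectl logs --previous` shows the log of the last terminated container, which is often useful for a crash-looping pod.

```python
import subprocess


def triage_commands(pod, namespace):
    """Build the kubectl commands used to inspect a failing pod."""
    return [
        ["kubectl", "describe", "pod", pod, "-n", namespace],
        ["kubectl", "logs", pod, "-n", namespace],
        # Log of the previously terminated container, if there was one.
        ["kubectl", "logs", pod, "-n", namespace, "--previous"],
    ]


def run_triage(pod, namespace):
    """Run each command and return (command, exit code, output) tuples."""
    results = []
    for cmd in triage_commands(pod, namespace):
        proc = subprocess.run(cmd, capture_output=True, text=True, check=False)
        results.append((cmd, proc.returncode, proc.stdout + proc.stderr))
    return results


# Hypothetical pod and namespace, for illustration only.
commands = triage_commands("example-pod-0", "example-namespace")
```
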
### Alert Runbook: \[ S2 - Error \] \[ farm-name \] SMA Unavailable k8s resource alert

**Alert Description:** This alert is triggered when any of these services is unavailable: services not covered by S0, search-related services (content, DIH, DAH, search, proxy) / auto pass / bo-ui / bo-user.

**Alert Severity:** S2 - Error

**Alert Trigger Conditions:**

- Metric: services not available
- Threshold: 0
- Duration: immediately

**Actions:**

1. Todo
   1. Run `kubectl describe pod <pod name> -n <namespace>` and `kubectl logs <pod name> -n <namespace>` to understand the reason for the failure.
   2. Try to fix the issue based on the results from step 1.

### Alert Runbook: \[ S3 - Warning \] \[ farm-name \] SMA Unavailable k8s resource alert

**Alert Description:** This alert is triggered when any of these services is unavailable: XMPP / XIE / Smart Ticket / stx / virtual agent / ppo / web socket gateway / smart-ui / ocr / smarta-installer.

**Alert Severity:** S3 - Warning

**Alert Trigger Conditions:**

- Metric: services not available
- Threshold: 0
- Duration: immediately

**Actions:**

1. Todo
   1. Run `kubectl describe pod <pod name> -n <namespace>` and `kubectl logs <pod name> -n <namespace>` to understand the reason for the failure.
   2. Try to fix the issue based on the results from step 1.

### Alert Runbook: \[ S4 - Info \] \[ farm-name \] SMA Unavailable k8s resource alert

**Alert Description:** This alert is triggered when services outside of ESM / toolkit are unavailable.

**Alert Severity:** S4 - Info

**Alert Trigger Conditions:**

- Metric: services not available
- Threshold: 0
- Duration: immediately

**Actions:**

1. Todo
   1. Run `kubectl describe pod <pod name> -n <namespace>` and `kubectl logs <pod name> -n <namespace>` to understand the reason for the failure.
   2. Try to fix the issue based on the results from step 1.

### Alert Runbook: \[ S0 - Urgent \] \[ farm-name \] CMS Unavailable k8s resource alert

**Alert Description:** This alert is triggered when any of these services is unavailable: itom-cms-gateway, itom-idm, itom-ingress-controller, itom-ucmdb-browser, tom-ucmdb-solr, itom-ucmdb.

**Alert Severity:** S0 - Urgent

**Alert Trigger Conditions:**

- Metric: services not available
- Threshold: 0
- Duration: immediately

**Actions:**

1. Todo
   1. Run `kubectl describe pod <pod name> -n <namespace>` and `kubectl logs <pod name> -n <namespace>` to understand the reason for the failure.
   2. Try to fix the issue based on the results from step 1.

### Alert Runbook: \[ S2 - Error \] \[ farm-name \] CMS Unavailable k8s resource alert

**Alert Description:** This alert is triggered when any of these services is unavailable: itom-autopass-lms, itom-vault.

**Alert Severity:** S2 - Error

**Alert Trigger Conditions:**

- Metric: services not available
- Threshold: 0
- Duration: immediately

**Actions:**

1. Todo
   1. Run `kubectl describe pod <pod name> -n <namespace>` and `kubectl logs <pod name> -n <namespace>` to understand the reason for the failure.
   2. Try to fix the issue based on the results from step 1.

### Alert Runbook: \[ S4 - Info \] \[ farm-name \] CMS Unavailable k8s resource alert

**Alert Description:** This alert is triggered when any of these services is unavailable: itom-ucmdb-probe, itom-ucmdb-dfp-lunux-installer, itom-ucmdb-dfp-windows-installer, itom-ucmdb-localclient-installers.

**Alert Severity:** S4 - Info

**Alert Trigger Conditions:**

- Metric: services not available
- Threshold: 0
- Duration: immediately

**Actions:**

1. Todo
   1. Run `kubectl describe pod <pod name> -n <namespace>` and `kubectl logs <pod name> -n <namespace>` to understand the reason for the failure.
   2. Try to fix the issue based on the results from step 1.

### Alert Runbook: Pod Load Avg 10s

**Alert Description:** This alert is triggered when Pod Load Avg 10s is above 200% for 35 minutes.

**Alert Severity:** S3 - Warning

**Alert Trigger Conditions:**

- Metric: Pod Load Avg 10s
- Threshold: 200%
- Duration: 35 minutes

**Actions:**

1. Todo
   1. Keep monitoring.
   2. If it happens multiple times in a day, run the pod rebalancing script.

### Alert Runbook: \[ S3 - Warning \] \[ farm-name \] SmartA Data Compact Ration alert

**Alert Description:** This alert is triggered when the content data ratio (total docs / committed docs) is above 1.20. All queries against IDOL will take longer and slow down.

**Alert Severity:** S3 - Warning

**Alert Trigger Conditions:**

- Metric: SmartA Data Compact Ration
- Threshold: 1.20
- Duration: immediately

**Actions:**

1. Todo
   1. Run the Jenkins job for IDOL compaction.
   2. Or follow the steps in the guide below:

      [https://docs.microfocus.com/doc/SMAX/23.4/Searchslow](https://docs.microfocus.com/doc/SMAX/23.4/Searchslow)

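
As a worked example of the ratio, the Python sketch below divides total documents by committed documents and compares the result with 1.20. The function name and the document counts are made up for illustration.

```python
def compact_ratio(total_docs, committed_docs):
    """Content data ratio: total documents divided by committed documents."""
    if committed_docs <= 0:
        raise ValueError("committed_docs must be positive")
    return total_docs / committed_docs


# Hypothetical index: 1,250,000 total documents, 1,000,000 committed.
ratio = compact_ratio(1_250_000, 1_000_000)
needs_compaction = ratio > 1.20
```
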
### Alert Runbook: \[ S3 - Warning \] \[ farm-name \] Rabbitmq Queue alert

**Alert Description:** This alert is triggered when the queue count on each RabbitMQ node is above 200 / 250 for more than 30 minutes (200 for the medium profile or lower, 250 for the large profile). The RabbitMQ queues are higher than normal.

**Alert Severity:** S3 - Warning

**Alert Trigger Conditions:**

- Metric: Rabbitmq Queue
- Threshold: 200/250
- Duration: 30 minutes

**Actions:**

1. Todo
   1. Keep monitoring.
   2. If it keeps rising, consider performing the steps described here:

      [https://docs.microfocus.com/doc/SMAX/23.4/RabbitMQNotStart#Solution](https://docs.microfocus.com/doc/SMAX/23.4/RabbitMQNotStart#Solution)

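
The per-profile threshold and the 30-minute duration can be combined into one check. The Python sketch below assumes one queue reading per minute per node, oldest first; the profile keys and the function name are assumptions for illustration.

```python
# 200 for the medium profile or lower, 250 for the large profile.
QUEUE_THRESHOLDS = {"medium": 200, "large": 250}


def queue_alert(samples, profile, minutes=30):
    """True when the last `minutes` readings are all above the profile threshold."""
    threshold = QUEUE_THRESHOLDS.get(profile, 200)  # smaller profiles use 200
    recent = samples[-minutes:]
    return len(recent) >= minutes and all(s > threshold for s in recent)
```
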
### Alert Runbook: \[ S3 - Warning \] \[ farm-name \] Rabbitmq Messages/Minute alert

**Alert Description:** This alert is triggered when Pending Messages/Minute is above 500 for more than 30 minutes. Pending messages are accumulating in RabbitMQ.

**Alert Severity:** S3 - Warning

**Alert Trigger Conditions:**

- Metric: Rabbitmq Messages/Minute
- Threshold: 500
- Duration: 30 minutes

**Actions:**

1. Todo
   1. Keep monitoring.
   2. If it keeps rising, consider performing the steps described here:

      [https://docs.microfocus.com/doc/SMAX/23.4/RabbitMQNotStart#Solution](https://docs.microfocus.com/doc/SMAX/23.4/RabbitMQNotStart#Solution)

### Alert Runbook: Message queue not equally distributed to different cluster nodes

**Alert Description:** This alert is triggered when message queues are not equally distributed across the cluster nodes. The RabbitMQ nodes are not working as a cluster, which can make RabbitMQ unstable.

**Alert Severity:** S1 - Critical

**Alert Trigger Conditions:**

- Metric: Rabbitmq Message queue
- Threshold: TBD
- Duration: TBD

**Actions:**

1. Todo
   1. Scale down the RabbitMQ node that is not in the cluster.
   2. Remove the `<rabbitmq-infra-rabbitmq-n>/data/xservices/rabbitmq/x.x.x.xx/mnesia` folders on the NFS server or the bastion node.
   3. Wait until the RabbitMQ nodes are ready.

### Alert Runbook: \[ S4 - Info \] \[ farm-name \] IDM active users alert

**Alert Description:** This alert is triggered when the number of active users exceeds the per-profile limit for more than 30 minutes: more than 1100 for the medium profile, or more than 3000 for the large profile. The number of active users is above the target size.

**Alert Severity:** S4 - Info

**Alert Trigger Conditions:**

- Metric: IDM active users
- Threshold: 1100/3000
- Duration: 30 minutes

**Actions:**

1. Todo
   1. Keep monitoring.

### Alert Runbook: \[ S2 - Error \] \[ farm-name \] Gateway Tomcat https connector currentThreadsBusy alert

**Alert Description:** This alert is triggered when the Tomcat HTTPS connector `currentThreadsBusy` value is above 30 for 30 minutes. The number of active users is above the target size.

**Alert Severity:** S2 - Error

**Alert Trigger Conditions:**

- Metric: Gateway Tomcat https connector currentThreadsBusy
- Threshold: 30
- Duration: 30 minutes

**Actions:**

1. Todo
   1. If the number does not drop, consider a rolling restart of the current deployments, for example gateway/platform/serviceportal.
   2. If the number still does not drop after the step above, do a rolling restart of xmpp.
   3. If the number still does not drop after the steps above, take a thread dump of the affected pod:

      [How to generate thread dump and memory dumps for java applications](https://rndwiki.houston.softwaregrp.net/confluence/display/SMA/How+to+generate+thread+dump+and+memory+dumps+for+java+applications)

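
The rolling restart step shared by the Tomcat and Httpclient runbooks can be sketched with `kubectl rollout restart`, followed by `kubectl rollout status` to wait for completion. The Python helper below only builds the commands. It assumes the services run as Kubernetes Deployments, and the deployment and namespace names in the usage lines are hypothetical; check the real resource kinds and names in the farm before running anything.

```python
def rolling_restart_commands(deployments, namespace):
    """Build kubectl commands that restart each deployment and wait for rollout."""
    commands = []
    for name in deployments:
        target = f"deployment/{name}"
        commands.append(["kubectl", "rollout", "restart", target, "-n", namespace])
        commands.append(["kubectl", "rollout", "status", target, "-n", namespace])
    return commands


# Hypothetical names, for illustration only.
restart_plan = rolling_restart_commands(["example-gateway"], "example-namespace")
```

Restarting one deployment at a time and waiting for its rollout status keeps the escalation order from the steps above easy to follow.
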
### Alert Runbook: \[ S2 - Error \] \[ farm-name \] Gateway Httpclient InUse alert

**Alert Description:** This alert is triggered when Httpclient InUse is above 20 for 30 minutes. The number of active users is above the target size.

**Alert Severity:** S2 - Error

**Alert Trigger Conditions:**

- Metric: Gateway Httpclient InUse
- Threshold: 20
- Duration: 30 minutes

**Actions:**

1. Todo
   1. If the number does not drop, consider a rolling restart of the current deployments, for example gateway/platform/serviceportal.
   2. If the number still does not drop after the step above, do a rolling restart of xmpp.
   3. If the number still does not drop after the steps above, take a thread dump of the affected pod:

      [How to generate thread dump and memory dumps for java applications](https://rndwiki.houston.softwaregrp.net/confluence/display/SMA/How+to+generate+thread+dump+and+memory+dumps+for+java+applications)

### Alert Runbook: \[ S2 - Error \] \[ farm-name \] Platform Tomcat https connector currentThreadsBusy alert

**Alert Description:** This alert is triggered when the Tomcat HTTPS connector `currentThreadsBusy` value is above 30 for 30 minutes. The number of active users is above the target size.

**Alert Severity:** S2 - Error

**Alert Trigger Conditions:**

- Metric: Platform Tomcat https connector currentThreadsBusy
- Threshold: 30
- Duration: 30 minutes

**Actions:**

1. Todo
   1. If the number does not drop, consider a rolling restart of the current deployments, for example gateway/platform/serviceportal.
   2. If the number still does not drop after the step above, do a rolling restart of xmpp.
   3. If the number still does not drop after the steps above, take a thread dump of the affected pod:

      [How to generate thread dump and memory dumps for java applications](https://rndwiki.houston.softwaregrp.net/confluence/display/SMA/How+to+generate+thread+dump+and+memory+dumps+for+java+applications)

### Alert Runbook: \[ S2 - Error \] \[ farm-name \] Platform Httpclient InUse alert

**Alert Description:** This alert is triggered when Httpclient InUse is above 20 for 30 minutes. The number of active users is above the target size.

**Alert Severity:** S2 - Error

**Alert Trigger Conditions:**

- Metric: Platform Httpclient InUse
- Threshold: 20
- Duration: 30 minutes

**Actions:**

1. Todo
   1. If the number does not drop, consider a rolling restart of the current deployments, for example gateway/platform/serviceportal.
   2. If the number still does not drop after the step above, do a rolling restart of xmpp.
   3. If the number still does not drop after the steps above, take a thread dump of the affected pod:

      [How to generate thread dump and memory dumps for java applications](https://rndwiki.houston.softwaregrp.net/confluence/display/SMA/How+to+generate+thread+dump+and+memory+dumps+for+java+applications)

### Alert Runbook: \[ S2 - Error \] \[ farm-name \] Serviceportal Tomcat https connector currentThreadsBusy alert

**Alert Description:** This alert is triggered when the Tomcat HTTPS connector `currentThreadsBusy` value is above 30 for 30 minutes. The number of active users is above the target size.

**Alert Severity:** S2 - Error

**Alert Trigger Conditions:**

- Metric: Serviceportal Tomcat https connector currentThreadsBusy
- Threshold: 30
- Duration: 30 minutes

**Actions:**

1. Todo
   1. If the number does not drop, consider a rolling restart of the current deployments, for example gateway/platform/serviceportal.
   2. If the number still does not drop after the step above, do a rolling restart of xmpp.
   3. If the number still does not drop after the steps above, take a thread dump of the affected pod:

      [How to generate thread dump and memory dumps for java applications](https://rndwiki.houston.softwaregrp.net/confluence/display/SMA/How+to+generate+thread+dump+and+memory+dumps+for+java+applications)

### Alert Runbook: \[ S2 - Error \] \[ farm-name \] Serviceportal Httpclient InUse alert

**Alert Description:** This alert is triggered when Httpclient InUse is above 20 for 30 minutes. The number of active users is above the target size.

**Alert Severity:** S2 - Error

**Alert Trigger Conditions:**

- Metric: Serviceportal Httpclient InUse
- Threshold: 20
- Duration: 30 minutes

**Actions:**

1. Todo
   1. If the number does not drop, consider a rolling restart of the current deployments, for example gateway/platform/serviceportal.
   2. If the number still does not drop after the step above, do a rolling restart of xmpp.
   3. If the number still does not drop after the steps above, take a thread dump of the affected pod:

      [How to generate thread dump and memory dumps for java applications](https://rndwiki.houston.softwaregrp.net/confluence/display/SMA/How+to+generate+thread+dump+and+memory+dumps+for+java+applications)