# Alert-Runbooks-based-on-monitoring_686083866
## Alerts, Description and Actions
Alerts come from monitoring and experience.
The following is a reference list of conditions that should be sent as alerts. The [Grafana monitoring dashboards](https://github.houston.softwaregrp.net/smax-saas-ops/ESM-Saas-Monitoring) are built on this list.
### Alert Runbook: \[ S0 - Urgent \] \[ farm-name \] ALB HTTP 5XX Count alert
**Alert Description:** This alert is triggered when the frontend returns more than 34 5xx errors within 3 minutes. Multiple end users may be experiencing a production issue.
**Alert Severity:** S0 - Urgent
**Alert Trigger Conditions:**
- Metric: ALB HTTP 5XX Count
- Threshold: 34
- Duration: 3 minutes
**Actions:**
1. Check whether any other alerts were reported around the same time.
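
The trigger condition above can be read as a count over a sliding 3-minute window. The following is a minimal sketch of that reading, assuming per-minute 5xx counts are already available as a list; the function name and the input shape are hypothetical and are not part of the monitoring stack.

```python
from typing import Sequence

THRESHOLD = 34       # from the trigger conditions above
WINDOW_MINUTES = 3   # from the trigger conditions above


def alb_5xx_alert(per_minute_counts: Sequence[int],
                  threshold: int = THRESHOLD,
                  window: int = WINDOW_MINUTES) -> bool:
    """Return True if any `window` consecutive minutes total more than `threshold` errors."""
    # With fewer samples than the window size the range is empty and no alert fires.
    for start in range(0, len(per_minute_counts) - window + 1):
        if sum(per_minute_counts[start:start + window]) > threshold:
            return True
    return False
```

For example, `alb_5xx_alert([10, 15, 10])` is true because the window totals 35, while `alb_5xx_alert([10, 10, 10])` is not.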
### Alert Runbook: \[ S3 - Warning \] \[ farm-name \] EBS Disk Queue Depth alert
**Alert Description:** This alert is triggered when the EBS disk queue depth is more than 5 for more than 10 minutes. Tasks on the storage are being queued.
**Alert Severity:** S3 - Warning
**Alert Trigger Conditions:**
- Metric: EBS disk queue depth
- Threshold: 5
- Duration: 10 minutes
**Actions:**
1. Check
   1. Whether EBS is running out of credits, using the EBS burst balance dashboard (same dashboard in the infrastructure page).
   2. Whether there is a heavy load on the EBS storage.
2. Todo
   1. No action is required. Usually, if it is a node-level issue, the AWS autoscaling group replaces the node after a while.
### Alert Runbook: \[ S2 - Error \] \[ farm-name \] EBS Burst Balance Average alert
**Alert Description:** This alert is triggered when the EBS burst balance is below 40% for more than 30 minutes. The load on EBS is high, and the burst balance may not be able to serve requests within the following quarter-hour to hour.
**Alert Severity:** S2 - Error
**Alert Trigger Conditions:**
- Metric: EBS burst balance
- Threshold: 40%
- Duration: 30 minutes
**Actions:**
1. Check
   1. Keep monitoring whether EBS is about to run out of credits, using the EBS burst balance dashboard (same dashboard in the infrastructure page).
   2. Whether there is a heavy load on the EBS storage.
2. Todo
   1. Usually no action is required. If the alert persists, it is a critical issue; follow the Todo for the alert raised when the burst balance is 0.
### Alert Runbook: \[ S0 - Urgent \] \[ farm-name \] EBS Burst Balance Average alert
**Alert Description:** This alert is triggered when the EBS burst balance is 0. Tasks on the storage are being queued, and everything that goes through EBS I/O slows down.
**Alert Severity:** S0 - Urgent
**Alert Trigger Conditions:**
- Metric: EBS burst balance
- Threshold: 0
- Duration: immediately
**Actions:**
1. Check
   1. Whether EBS is running out of credits, using the EBS burst balance dashboard (same dashboard in the infrastructure page).
   2. Whether there is a heavy load on the EBS storage.
2. Todo
   1. Log in to the system manually and check whether it has slowed down. If it has slowed down dramatically, choose one of the options below to fix it:
      1. Switch the EBS to GP3 with a specified IOPS (in general the default 3000/12000 should be enough; if not, you may raise it to 18000, and switch back to 3000/12000 once the issue is fixed).
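
The two EBS burst balance alerts differ only in threshold and duration. The sketch below shows one way to express both rules in a single check. It assumes one sample per minute, in percent, with the newest sample last; the function name and the sampling interval are assumptions, not how the dashboards are implemented.

```python
from typing import Optional, Sequence


def burst_balance_severity(samples_pct: Sequence[float],
                           warn_below: float = 40.0,
                           warn_minutes: int = 30) -> Optional[str]:
    """Classify one-per-minute burst balance samples (percent, newest last)."""
    if not samples_pct:
        return None
    if samples_pct[-1] <= 0:
        # Balance exhausted: the S0 alert fires immediately.
        return "S0 - Urgent"
    recent = samples_pct[-warn_minutes:]
    if len(recent) >= warn_minutes and all(s < warn_below for s in recent):
        # Below 40% for the whole 30-minute window.
        return "S2 - Error"
    return None
```

A short dip below 40% returns `None`; only a full window below the threshold, or a balance of 0, produces a severity.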
### Alert Runbook: \[ S2 - Error \] \[ farm-name \] EFS Burst Credit Balance alert
**Alert Description:** This alert is triggered when the EFS burst credit balance is below 40% for more than 30 minutes. Tasks on the storage will be queued soon.
**Alert Severity:** S2 - Error
**Alert Trigger Conditions:**
- Metric: EFS Burst Credit Balance
- Threshold: 40%
- Duration: 30 minutes
**Actions:**
1. Check
   1. Whether EFS is running out of credits, using the EFS burst credit dashboard (same dashboard in the infrastructure page).
   2. Whether there is a heavy load on the EFS storage.
2. Todo
   1. Log in to the system manually and check whether it has slowed down.
   2. Usually no action is required. If the alert persists, it is a critical issue.
### Alert Runbook: \[ S0 - Urgent \] \[ farm-name \] EFS Burst Credit Balance alert
**Alert Description:** This alert is triggered when the EFS burst credit balance is 0. Tasks on the storage are being queued, and everything that goes through EFS I/O slows down.
**Alert Severity:** S0 - Urgent
**Alert Trigger Conditions:**
- Metric: EFS Burst credit
- Threshold: 0
- Duration: immediately
**Actions:**
1. Check
   1. Whether EFS is running out of credits, using the EFS burst credit dashboard (same dashboard in the infrastructure page).
   2. Whether there is a heavy load on the EFS storage.
2. Todo
   1. Log in to the system manually and check whether it has slowed down. If it has slowed down dramatically, choose one of the options below to fix it:
      1. Switch the EFS to throughput mode (for example 60 - 100 MB/s, and switch back once the issue is fixed).
### Alert Runbook: \[ S2 - Error \] \[ farm-name \] RDS CPU Utilization alert
**Alert Description:** This alert is triggered when the overall RDS CPU utilization is more than 97% for more than 60 minutes.
**Alert Severity:** S2 - Error
**Alert Trigger Conditions:**
- Metric: RDS CPU Utilization
- Threshold: 97%
- Duration: 60 minutes
**Actions:**
1. Check
   1. Performance Insights for the top queries, looking for anything that consumes a lot of CPU.
2. Todo
   1. Keep monitoring, and check whether other database metrics are abnormal.
   2. Collect information about the top 10 queries.
### Alert Runbook: \[ S2 - Error \] \[ farm-name \] RDS cpuUtilization System alert
**Alert Description:** This alert is triggered when RDS sy (system CPU) is above 70% for more than 60 minutes. The CPU is spending more time on system-level processing than on handling the business flow.
**Alert Severity:** S2 - Error
**Alert Trigger Conditions:**
- Metric: RDS cpuUtilization System
- Threshold: 70%
- Duration: 60 minutes
**Actions:**
1. Check
   1. Performance Insights for the top queries, looking for anything that consumes a lot of CPU.
2. Todo
   1. Keep monitoring, and check whether other database metrics are abnormal.
### Alert Runbook: \[ S2 - Error \] \[ farm-name \] RDS CPU Soft Interrupts alert
**Alert Description:** This alert is triggered when RDS si (soft interrupts) is above 15% for more than 60 minutes. The CPU is spending more time on system-level processing than on handling the business flow.
**Alert Severity:** S2 - Error
**Alert Trigger Conditions:**
- Metric: RDS CPU Soft Interrupts
- Threshold: 15%
- Duration: 60 minutes
**Actions:**
1. Check
   1. Performance Insights for the top queries, looking for anything that consumes a lot of CPU.
2. Todo
   1. Keep monitoring, and check whether other database metrics are abnormal.
### Alert Runbook: \[ S3 - Warning \] \[ farm-name \] RDS Disk queue depth alert
**Alert Description:** This alert is triggered when the RDS EBS disk queue depth is more than 5 for more than 10 minutes. Tasks on the storage are being queued.
**Alert Severity:** S3 - Warning
**Alert Trigger Conditions:**
- Metric: RDS Disk queue depth
- Threshold: 5
- Duration: 10 minutes
**Actions:**
1. Check
   1. Whether EBS is running out of credits, using the EBS burst balance dashboard (same dashboard in the infrastructure page).
   2. Whether there is a heavy load on the EBS storage.
### Alert Runbook: \[ S2 - Error \] \[ farm-name \] RDS Disk Free Storage Space alert
**Alert Description:** This alert is triggered when the RDS free storage space is below 500 MB. The instance is running out of storage.
**Alert Severity:** S2 - Error
**Alert Trigger Conditions:**
- Metric: RDS Disk Free Storage Space
- Threshold: 500 MB
- Duration: immediately
**Actions:**
1. Todo
   1. Add more storage to EBS.
   2. Enable storage auto-scaling.
### Alert Runbook: RDS storage auto-scaling quota is not enough
**Alert Description:** This alert is triggered when the storage does not have enough room to auto-scale, that is, when (Free Space + Max Autoscaling Storage - Allocated Storage) / Allocated Storage < 0.2. The instance is running out of storage.
**Alert Severity:** S2 - Error
**Alert Trigger Conditions:**
- Metric: (Free Space + Max Autoscaling Storage - Allocated Storage) / Allocated Storage
- Threshold: 0.2
- Duration: TBD
**Actions:**
1. Todo
   1. Increase the max auto-scaling storage size.
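
The headroom formula in this runbook is easy to misread, so here is a small worked sketch. The function and parameter names are hypothetical, and the unit (GiB) is an assumption; any unit works as long as all three inputs share it.

```python
def autoscaling_headroom(free_gib: float,
                         max_autoscaling_gib: float,
                         allocated_gib: float) -> float:
    """(Free Space + Max Autoscaling Storage - Allocated Storage) / Allocated Storage."""
    if allocated_gib <= 0:
        raise ValueError("allocated storage must be positive")
    return (free_gib + max_autoscaling_gib - allocated_gib) / allocated_gib


def headroom_alert(free_gib: float,
                   max_autoscaling_gib: float,
                   allocated_gib: float,
                   threshold: float = 0.2) -> bool:
    """Return True when the remaining room to auto-scale is below the threshold."""
    return autoscaling_headroom(free_gib, max_autoscaling_gib, allocated_gib) < threshold
```

With 20 GiB free, a 500 GiB auto-scaling maximum and 450 GiB allocated, the ratio is 70 / 450, roughly 0.16, so the alert would fire. With 100 GiB free, a 1000 GiB maximum and 500 GiB allocated, the ratio is 1.2 and it would not.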
### Alert Runbook: \[ S2 - Error \] \[ farm-name \] RDS Free Memory Percentage alert
**Alert Description:** This alert is triggered when RDS free memory is less than 5% for more than 5 minutes. The instance will run out of memory soon.
**Alert Severity:** S2 - Error
**Alert Trigger Conditions:**
- Metric: RDS Free Memory Percentage
- Threshold: 5%
- Duration: 5 minutes
**Actions:**
1. Check
   1. Log in to the AWS console → RDS → Monitoring and check whether swap usage is increasing.
2. Todo
   1. Keep monitoring.
   2. Consider a rolling restart of the current deployments, for example gateway/platform/serviceportal.
### Alert Runbook: \[ S0 - Urgent \] \[ farm-name \] RDS Free Memory Percentage alert
**Alert Description:** This alert is triggered when RDS free memory is less than 2% for more than 5 minutes. The instance will run out of memory soon.
**Alert Severity:** S0 - Urgent
**Alert Trigger Conditions:**
- Metric: RDS Free Memory Percentage
- Threshold: 2%
- Duration: 5 minutes
**Actions:**
1. Check
   1. Log in to the AWS console → RDS → Monitoring and check whether swap usage is increasing.
2. Todo
   1. Consider a rolling restart of the current deployments, for example gateway/platform/serviceportal.
   2. If this happens 2-3 times a day and the swap usage is higher:
      1. Consider scaling up RDS, usually by doubling the memory size.
      2. Tune the database based on the queries identified as memory-consuming.
### Alert Runbook: \[ S2 - Error \] \[ farm-name \] RDS Burst Balance alert
**Alert Description:** This alert is triggered when the RDS burst balance is below 40% for more than 30 minutes. The load on EBS is high, and the burst balance may not be able to serve requests within the following quarter-hour to hour.
**Alert Severity:** S2 - Error
**Alert Trigger Conditions:**
- Metric: RDS Burst Balance
- Threshold: 40%
- Duration: 30 minutes
**Actions:**
1. Check
   1. Keep monitoring whether EBS is about to run out of credits, using the EBS burst balance dashboard (same dashboard in the infrastructure page).
   2. Whether there is a heavy load on the EBS storage.
2. Todo
   1. Usually no action is required. If the alert persists, it is a critical issue; follow the Todo for the alert raised when the burst balance is 0.
### Alert Runbook: \[ S0 - Urgent \] \[ farm-name \] RDS Burst Balance alert
**Alert Description:** This alert is triggered when the RDS burst balance is 0. Tasks on the storage are being queued, and everything that goes through EBS I/O slows down.
**Alert Severity:** S0 - Urgent
**Alert Trigger Conditions:**
- Metric: RDS Burst Balance
- Threshold: 0
- Duration: immediately
**Actions:**
1. Check
   1. Whether EBS is running out of credits, using the EBS burst balance dashboard (same dashboard in the infrastructure page).
   2. Whether there is a heavy load on the EBS storage.
2. Todo
   1. Log in to the system manually and check whether it has slowed down. If it has slowed down dramatically, choose one of the options below to fix it:
      1. Switch the EBS to GP3 with a specified IOPS (in general the default 12000 should be enough; if not, you may raise it to 18000, and switch back to 12000 once the issue is fixed).
      2. Add more storage to the EBS.
### Alert Runbook: \[ S2 - Error \] \[ farm-name \] SMA/CMS RDS DBLoad alert
**Alert Description:** This alert is triggered when DBLoad is more than 2 times the number of CPUs for more than one hour (AWS specific, via Performance Insights). The database is overloaded.
**Alert Severity:** S2 - Error
**Alert Trigger Conditions:**
- Metric: RDS DBLoad
- Threshold: 2 times the number of CPUs
- Duration: 1 hour
**Actions:**
1. Check
   1. AWS console → RDS → Performance Insights, to see which kind of operation takes the most time.
### Alert Runbook: \[ S1 - Critical \] \[ farm-name \] SMA/CMS RDS DBLoad alert
**Alert Description:** This alert is triggered when DBLoad is more than 4 times the number of CPUs for more than one hour. The database is overloaded, mostly on CPU.
**Alert Severity:** S1 - Critical
**Alert Trigger Conditions:**
- Metric: RDS DBLoad
- Threshold: 4 times the number of CPUs
- Duration: 1 hour
**Actions:**
1. Check
   1. AWS console → RDS → Performance Insights, to see which kind of operation takes the most time.
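
The two DBLoad alerts above compare the same metric against multiples of the CPU count. A minimal sketch of that comparison follows; the function name is hypothetical, and the one-hour duration is deliberately not modelled, so this only classifies a single averaged reading.

```python
from typing import Optional


def dbload_severity(db_load: float, cpu_count: int) -> Optional[str]:
    """Map a DBLoad reading to a severity based on its ratio to the CPU count."""
    if cpu_count <= 0:
        raise ValueError("cpu_count must be positive")
    ratio = db_load / cpu_count
    if ratio > 4:
        return "S1 - Critical"   # more than 4 times the number of CPUs
    if ratio > 2:
        return "S2 - Error"      # more than 2 times the number of CPUs
    return None
```

On a 4-CPU instance, a DBLoad of 9 maps to S2 and a DBLoad of 17 maps to S1, while a DBLoad of exactly 8 stays below both thresholds.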
### Alert Runbook: \[ S3 - Warning \] \[ farm-name \] SMA/CMS RDS DBLoadNonCPU alert
**Alert Description:** This alert is triggered when DBLoadNonCPU is more than 1 times the number of CPUs for more than one hour. The database is blocked on something other than CPU, such as DB locks, read/write I/O, or other causes.
**Alert Severity:** S3 - Warning
**Alert Trigger Conditions:**
- Metric: RDS DBLoadNonCPU
- Threshold: 1 times the number of CPUs
- Duration: 1 hour
**Actions:**
1. Check
   1. AWS console → RDS → Performance Insights, to see which operation takes the most time.
### Alert Runbook: \[ S2 - Error \] \[ farm-name \] Node CPU Usage alert
**Alert Description:** This alert is triggered when node CPU usage is more than 97% for more than 60 minutes. The instance has been almost out of CPU for more than 60 minutes.
**Alert Severity:** S2 - Error
**Alert Trigger Conditions:**
- Metric: Node CPU Usage
- Threshold: 97%
- Duration: 60 minutes
**Actions:**
1. Todo
   1. Keep monitoring.
### Alert Runbook: \[ S2 - Error \] \[ farm-name \] Node CPU System alert
**Alert Description:** This alert is triggered when node sy (system CPU) is above 70% for more than 60 minutes. The instance is too busy with its own system operations to handle normal business tasks.
**Alert Severity:** S2 - Error
**Alert Trigger Conditions:**
- Metric: Node CPU System
- Threshold: 70%
- Duration: 60 minutes
**Actions:**
1. Todo
   1. Keep monitoring.
### Alert Runbook: \[ S2 - Error \] \[ farm-name \] Node CPU Soft Interrupts alert
**Alert Description:** This alert is triggered when node si (soft interrupts) is above 15% for more than 60 minutes. The instance has been almost out of CPU for more than 60 minutes.
**Alert Severity:** S2 - Error
**Alert Trigger Conditions:**
- Metric: Node CPU Soft Interrupts
- Threshold: 15%
- Duration: 60 minutes
**Actions:**
1. Todo
   1. Keep monitoring.
### Alert Runbook: \[ S3 - Warning \] \[ farm-name \] Node Mem Usage alert
**Alert Description:** This alert is triggered when node memory usage is more than 95% for more than 10 minutes. The instance has been almost out of memory for more than 10 minutes.
**Alert Severity:** S3 - Warning
**Alert Trigger Conditions:**
- Metric: Node Mem Usage
- Threshold: 95%
- Duration: 10 minutes
**Actions:**
1. Todo
   1. Keep monitoring.
### Alert Runbook: \[ S3 - Warning \] \[ farm-name \] Node Disk Usage alert
**Alert Description:** This alert is triggered when node disk usage is more than 95%. The instance is almost out of disk space.
**Alert Severity:** S3 - Warning
**Alert Trigger Conditions:**
- Metric: Node Disk Usage
- Threshold: 95%
- Duration: immediately
**Actions:**
1. Todo
   1. Add more storage to the disk.
### Alert Runbook: \[ S3 - Warning \] \[ farm-name \] Disk Inode Usage alert
**Alert Description:** This alert is triggered when disk inode usage is more than 97%. The instance will be blocked very soon by the OS-level soft limit (inode).
**Alert Severity:** S3 - Warning
**Alert Trigger Conditions:**
- Metric: Disk Inode Usage
- Threshold: 97%
- Duration: immediately
**Actions:**
1. Todo
   1. Restart pods on the instance to release inodes.
   2. If the step above does not help, open an incident for further analysis.
### Alert Runbook: \[ S3 - Warning \] \[ farm-name \] Node Load Avg 15m/core
**Alert Description:** This alert is triggered when node Load Avg 15m/core is above 200% for 35 minutes. The instance has been overloaded for more than 35 minutes.
**Alert Severity:** S3 - Warning
**Alert Trigger Conditions:**
- Metric: Node Load Avg 15m/core
- Threshold: 2 (200%)
- Duration: 35 minutes
**Actions:**
1. Todo
   1. Keep monitoring.
   2. If it happens multiple times in a day, run the pod rebalancing script.
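
The metric in this runbook is the 15-minute load average divided by the number of cores. The sketch below shows that arithmetic and how a single node could report its own value. The function names are hypothetical, `os.getloadavg()` is only available on Unix-like systems, and the 35-minute duration is not modelled here.

```python
import os


def load_per_core(load_avg_15m: float, cores: int) -> float:
    """15-minute load average normalised by the number of CPU cores."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return load_avg_15m / cores


def node_overloaded(load_avg_15m: float, cores: int, threshold: float = 2.0) -> bool:
    """Return True when the normalised load is above the threshold (2 means 200%)."""
    return load_per_core(load_avg_15m, cores) > threshold


def current_node_load_per_core() -> float:
    """Read the live value on the local machine (Unix only)."""
    _, _, load15 = os.getloadavg()
    return load_per_core(load15, os.cpu_count() or 1)
```

On an 8-core node, a 15-minute load average of 17 gives about 2.1 per core, which is above the threshold; a load average of 16 gives exactly 2.0, which is not.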
### Alert Runbook: \[ S2 - Error \] \[ farm-name \] Pod CPU usage alert
**Alert Description:** This alert is triggered when pod CPU usage is more than 97% for more than 60 minutes. The pod has been almost out of CPU for more than 60 minutes.
**Alert Severity:** S2 - Error
**Alert Trigger Conditions:**
- Metric: Pod CPU usage
- Threshold: 97%
- Duration: 60 minutes
**Actions:**
1. Todo
   1. Keep monitoring.
### Alert Runbook: \[ S3 - Warning \] \[ farm-name \] Pod Inode Usage alert
**Alert Description:** This alert is triggered when pod inode usage (free/total) is more than 97%. The instance will be blocked very soon by the OS-level soft limit (inode).
**Alert Severity:** S3 - Warning
**Alert Trigger Conditions:**
- Metric: Pod Inode Usage
- Threshold: 97%
- Duration: immediately
**Actions:**
1. Todo
   1. Restart pods on the instance to release inodes.
   2. If the step above does not help, open an incident for further analysis.
### Alert Runbook: \[ S0 - Urgent \] \[ farm-name \] SMA Unavailable k8s resource alert
**Alert Description:** This alert is triggered when these services (portal / runtime ui / gateway / platform / redis / rabbitmq / bo-login / idm / bo-ats / ingress-nginx / sma-ui / bo-farcade) are not available.
**Alert Severity:** S0 - Urgent
**Alert Trigger Conditions:**
- Metric: services not available
- Threshold: 0
- Duration: immediately
**Actions:**
1. Todo
   1. Run `kubectl describe pod <pod-name> -n <namespace>` and `kubectl logs <pod-name> -n <namespace>` to understand the reason for the failure.
   2. Try to fix the issue based on the results from step 1.
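
The diagnostic step shared by the "Unavailable k8s resource" runbooks can be scripted. The sketch below builds and runs the two `kubectl` commands for one pod; note that `kubectl describe` takes a resource type, so `pod` is passed explicitly. The helper names are hypothetical, and `kubectl` is assumed to be on the `PATH` with access to the cluster.

```python
import subprocess
from typing import List


def diagnostic_commands(pod: str, namespace: str) -> List[List[str]]:
    """Build the describe and logs commands for a single pod."""
    return [
        ["kubectl", "describe", "pod", pod, "-n", namespace],
        ["kubectl", "logs", pod, "-n", namespace],
    ]


def run_diagnostics(pod: str, namespace: str) -> None:
    """Run both commands and print their output; failures are printed, not raised."""
    for cmd in diagnostic_commands(pod, namespace):
        print("$ " + " ".join(cmd))
        result = subprocess.run(cmd, capture_output=True, text=True, check=False)
        print(result.stdout or result.stderr)
```

Calling `run_diagnostics("<pod-name>", "<namespace>")` with real values prints the pod events followed by its logs, which is usually enough to decide on the fix in step 2.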
### Alert Runbook: \[ S2 - Error \] \[ farm-name \] SMA Unavailable k8s resource alert
**Alert Description:** This alert is triggered when these services (others not in the S0 list / search-related (content, DIH, DAH, search, proxy) / auto pass / bo-ui / bo-user) are not available.
**Alert Severity:** S2 - Error
**Alert Trigger Conditions:**
- Metric: services not available
- Threshold: 0
- Duration: immediately
**Actions:**
1. Todo
   1. Run `kubectl describe pod <pod-name> -n <namespace>` and `kubectl logs <pod-name> -n <namespace>` to understand the reason for the failure.
   2. Try to fix the issue based on the results from step 1.
### Alert Runbook: \[ S3 - Warning \] \[ farm-name \] SMA Unavailable k8s resource alert
**Alert Description:** This alert is triggered when these services (XMPP / XIE / Smart Ticket / stx / virtual agent / ppo / web socket gateway / smart-ui / ocr / smarta-installer) are not available.
**Alert Severity:** S3 - Warning
**Alert Trigger Conditions:**
- Metric: services not available
- Threshold: 0
- Duration: immediately
**Actions:**
1. Todo
   1. Run `kubectl describe pod <pod-name> -n <namespace>` and `kubectl logs <pod-name> -n <namespace>` to understand the reason for the failure.
   2. Try to fix the issue based on the results from step 1.
### Alert Runbook: \[ S4 - Info \] \[ farm-name \] SMA Unavailable k8s resource alert
**Alert Description:** This alert is triggered when services outside of ESM / toolkit are not available.
**Alert Severity:** S4 - Info
**Alert Trigger Conditions:**
- Metric: services not available
- Threshold: 0
- Duration: immediately
**Actions:**
1. Todo
   1. Run `kubectl describe pod <pod-name> -n <namespace>` and `kubectl logs <pod-name> -n <namespace>` to understand the reason for the failure.
   2. Try to fix the issue based on the results from step 1.
### Alert Runbook: \[ S0 - Urgent \] \[ farm-name \] CMS Unavailable k8s resource alert
**Alert Description:** This alert is triggered when these services (itom-cms-gateway, itom-idm, itom-ingress-controller, itom-ucmdb-browser, tom-ucmdb-solr, itom-ucmdb) are not available.
**Alert Severity:** S0 - Urgent
**Alert Trigger Conditions:**
- Metric: services not available
- Threshold: 0
- Duration: immediately
**Actions:**
1. Todo
   1. Run `kubectl describe pod <pod-name> -n <namespace>` and `kubectl logs <pod-name> -n <namespace>` to understand the reason for the failure.
   2. Try to fix the issue based on the results from step 1.
### Alert Runbook: \[ S2 - Error \] \[ farm-name \] CMS Unavailable k8s resource alert
**Alert Description:** This alert is triggered when these services (itom-autopass-lms, itom-vault) are not available.
**Alert Severity:** S2 - Error
**Alert Trigger Conditions:**
- Metric: services not available
- Threshold: 0
- Duration: immediately
**Actions:**
1. Todo
   1. Run `kubectl describe pod <pod-name> -n <namespace>` and `kubectl logs <pod-name> -n <namespace>` to understand the reason for the failure.
   2. Try to fix the issue based on the results from step 1.
### Alert Runbook: \[ S4 - Info \] \[ farm-name \] CMS Unavailable k8s resource alert
**Alert Description:** This alert is triggered when these services (itom-ucmdb-probe, itom-ucmdb-dfp-lunux-installer, itom-ucmdb-dfp-windows-installer, itom-ucmdb-localclient-installers) are not available.
**Alert Severity:** S4 - Info
**Alert Trigger Conditions:**
- Metric: services not available
- Threshold: 0
- Duration: immediately
**Actions:**
1. Todo
   1. Run `kubectl describe pod <pod-name> -n <namespace>` and `kubectl logs <pod-name> -n <namespace>` to understand the reason for the failure.
   2. Try to fix the issue based on the results from step 1.
### Alert Runbook: Pod Load Avg 10s
**Alert Description:** This alert is triggered when Pod Load Avg 10s is more than 200% for 35 minutes.
**Alert Severity:** S3 - Warning
**Alert Trigger Conditions:**
- Metric: Pod Load Avg 10s
- Threshold: 200%
- Duration: 35 minutes
**Actions:**
1. Todo
   1. Keep monitoring.
   2. If it happens multiple times in a day, run the pod rebalancing script.
### Alert Runbook: \[ S3 - Warning \] \[ farm-name \] SmartA Data Compact Ration alert
**Alert Description:** This alert is triggered when the content data ratio (total doc / committed doc) is more than 1.20. All queries against IDOL take more time and slow down.
**Alert Severity:** S3 - Warning
**Alert Trigger Conditions:**
- Metric: SmartA Data Compact Ration
- Threshold: 1.20
- Duration: immediately
**Actions:**
1. Todo
   1. Run the Jenkins job for IDOL compact.
   2. Or follow the steps in the guide below:
      [https://docs.microfocus.com/doc/SMAX/23.4/Searchslow](https://docs.microfocus.com/doc/SMAX/23.4/Searchslow)
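
The ratio in this runbook is the total document count divided by the committed document count. A minimal sketch of the check follows; the function names are hypothetical, and how the two counts are obtained from IDOL is not covered here.

```python
def compact_ratio(total_docs: int, committed_docs: int) -> float:
    """Content data ratio: total documents divided by committed documents."""
    if committed_docs <= 0:
        raise ValueError("committed_docs must be positive")
    return total_docs / committed_docs


def needs_compaction(total_docs: int, committed_docs: int,
                     threshold: float = 1.20) -> bool:
    """Return True when the ratio is above the alert threshold."""
    return compact_ratio(total_docs, committed_docs) > threshold
```

For example, 1,300 total documents against 1,000 committed documents gives a ratio of 1.3, which is above the threshold; 1,100 against 1,000 gives 1.1, which is not.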
### Alert Runbook: \[ S3 - Warning \] \[ farm-name \] Rabbitmq Queue alert
**Alert Description:** This alert is triggered when the queue on each RabbitMQ node is > 200 / 250 for more than 30 minutes (200 for medium profile or lower, 250 for large profile). The RabbitMQ queues are higher than normal.
**Alert Severity:** S3 - Warning
**Alert Trigger Conditions:**
- Metric: Rabbitmq Queue
- Threshold: 200/250
- Duration: 30 minutes
**Actions:**
1. Todo
   1. Keep monitoring.
   2. If it keeps rising, consider performing the steps described here:
      [https://docs.microfocus.com/doc/SMAX/23.4/RabbitMQNotStart#Solution](https://docs.microfocus.com/doc/SMAX/23.4/RabbitMQNotStart#Solution)
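
The queue threshold in this runbook depends on the deployment profile. The sketch below shows one way to encode that rule and list the nodes that exceed it. The function names, the node names and the dictionary input are hypothetical; any profile other than "large" is treated as "medium or lower".

```python
from typing import Dict, List


def queue_threshold(profile: str) -> int:
    """250 for the large profile, 200 for medium profile or lower."""
    return 250 if profile.strip().lower() == "large" else 200


def nodes_over_threshold(per_node_queues: Dict[str, int], profile: str) -> List[str]:
    """Return the nodes whose queue value is above the profile threshold."""
    limit = queue_threshold(profile)
    return sorted(node for node, value in per_node_queues.items() if value > limit)
```

With `{"rabbit-0": 210, "rabbit-1": 190}`, the medium profile flags `rabbit-0` only, and the large profile flags nothing. The 30-minute duration is not modelled in this sketch.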
### Alert Runbook: \[ S3 - Warning \] \[ farm-name \] Rabbitmq Messages/Minute alert
**Alert Description:** This alert is triggered when Pending Messages/Minute is > 500 for more than 30 minutes. Pending messages in RabbitMQ are accumulating.
**Alert Severity:** S3 - Warning
**Alert Trigger Conditions:**
- Metric: Rabbitmq Messages/Minute
- Threshold: 500
- Duration: 30 minutes
**Actions:**
1. Todo
   1. Keep monitoring.
   2. If it keeps rising, consider performing the steps described here:
      [https://docs.microfocus.com/doc/SMAX/23.4/RabbitMQNotStart#Solution](https://docs.microfocus.com/doc/SMAX/23.4/RabbitMQNotStart#Solution)
### Alert Runbook: Message queue not equally distributed to different cluster nodes
**Alert Description:** This alert is triggered when the message queue is not distributed equally across the cluster nodes. The RabbitMQ nodes are not working as a cluster, which can make RabbitMQ unstable.
**Alert Severity:** S1 - Critical
**Alert Trigger Conditions:**
- Metric: Rabbitmq Message queue
- Threshold: TBD
- Duration: TBD
**Actions:**
1. Todo
   1. Scale down the RabbitMQ node that is not in the cluster.
   2. Remove the `<rabbitmq-infra-rabbitmq-n>/data/xservices/rabbitmq/x.x.x.xx/mnesia` folders on the NFS server or the bastion node.
   3. Wait until the RabbitMQ nodes are ready.
### Alert Runbook: \[ S4 - Info \] \[ farm-name \] IDM active users alert
**Alert Description:** This alert is triggered when the number of active users exceeds the per-profile limit for more than 30 minutes: > 1100 for medium profile, > 3000 for large profile. The number of active users is more than the target size.
**Alert Severity:** S4 - Info
**Alert Trigger Conditions:**
- Metric: IDM active users
- Threshold: 1100/3000
- Duration: 30 minutes
**Actions:**
1. Todo
   1. Keep monitoring.
### Alert Runbook: \[ S2 - Error \] \[ farm-name \] Gateway Tomcat https connector currentThreadsBusy alert
**Alert Description:** This alert is triggered when Tomcat https connector currentThreadsBusy is > 30 for 30 minutes. The number of active users is more than the target size.
**Alert Severity:** S2 - Error
**Alert Trigger Conditions:**
- Metric: Gateway Tomcat https connector currentThreadsBusy
- Threshold: 30
- Duration: 30 minutes
**Actions:**
1. Todo
   1. If the number does not drop, consider a rolling restart of the current deployments, for example gateway/platform/serviceportal.
   2. If the number still does not drop after the step above, do a rolling restart of xmpp.
   3. If the number still does not drop after the steps above, take a thread dump of the affected pod.
      [How to generate thread dump and memory dumps for java applications](https://rndwiki.houston.softwaregrp.net/confluence/display/SMA/How+to+generate+thread+dump+and+memory+dumps+for+java+applications)
### Alert Runbook: \[ S2 - Error \] \[ farm-name \] Gateway Httpclient InUse alert
**Alert Description:** This alert is triggered when Httpclient InUse is > 20 for 30 minutes. The number of active users is more than the target size.
**Alert Severity:** S2 - Error
**Alert Trigger Conditions:**
- Metric: Gateway Httpclient InUse
- Threshold: 20
- Duration: 30 minutes
**Actions:**
1. Todo
   1. If the number does not drop, consider a rolling restart of the current deployments, for example gateway/platform/serviceportal.
   2. If the number still does not drop after the step above, do a rolling restart of xmpp.
   3. If the number still does not drop after the steps above, take a thread dump of the affected pod.
      [How to generate thread dump and memory dumps for java applications](https://rndwiki.houston.softwaregrp.net/confluence/display/SMA/How+to+generate+thread+dump+and+memory+dumps+for+java+applications)
### Alert Runbook: \[ S2 - Error \] \[ farm-name \] Platform Tomcat https connector currentThreadsBusy alert
**Alert Description:** This alert is triggered when Tomcat https connector currentThreadsBusy is > 30 for 30 minutes. The number of active users is more than the target size.
**Alert Severity:** S2 - Error
**Alert Trigger Conditions:**
- Metric: Platform Tomcat https connector currentThreadsBusy
- Threshold: 30
- Duration: 30 minutes
**Actions:**
1. Todo
   1. If the number does not drop, consider a rolling restart of the current deployments, for example gateway/platform/serviceportal.
   2. If the number still does not drop after the step above, do a rolling restart of xmpp.
   3. If the number still does not drop after the steps above, take a thread dump of the affected pod.
      [How to generate thread dump and memory dumps for java applications](https://rndwiki.houston.softwaregrp.net/confluence/display/SMA/How+to+generate+thread+dump+and+memory+dumps+for+java+applications)
### Alert Runbook: \[ S2 - Error \] \[ farm-name \] Platform Httpclient InUse alert
**Alert Description:** This alert is triggered when Httpclient InUse is > 20 for 30 minutes. The number of active users is more than the target size.
**Alert Severity:** S2 - Error
**Alert Trigger Conditions:**
- Metric: Platform Httpclient InUse
- Threshold: 20
- Duration: 30 minutes
**Actions:**
1. Todo
   1. If the number does not drop, consider a rolling restart of the current deployments, for example gateway/platform/serviceportal.
   2. If the number still does not drop after the step above, do a rolling restart of xmpp.
   3. If the number still does not drop after the steps above, take a thread dump of the affected pod.
      [How to generate thread dump and memory dumps for java applications](https://rndwiki.houston.softwaregrp.net/confluence/display/SMA/How+to+generate+thread+dump+and+memory+dumps+for+java+applications)
### Alert Runbook: \[ S2 - Error \] \[ farm-name \] Serviceportal Tomcat https connector currentThreadsBusy alert
**Alert Description:** This alert is triggered when Tomcat https connector currentThreadsBusy is > 30 for 30 minutes. The number of active users is more than the target size.
**Alert Severity:** S2 - Error
**Alert Trigger Conditions:**
- Metric: Serviceportal Tomcat https connector currentThreadsBusy
- Threshold: 30
- Duration: 30 minutes
**Actions:**
1. Todo
   1. If the number does not drop, consider a rolling restart of the current deployments, for example gateway/platform/serviceportal.
   2. If the number still does not drop after the step above, do a rolling restart of xmpp.
   3. If the number still does not drop after the steps above, take a thread dump of the affected pod.
      [How to generate thread dump and memory dumps for java applications](https://rndwiki.houston.softwaregrp.net/confluence/display/SMA/How+to+generate+thread+dump+and+memory+dumps+for+java+applications)
### Alert Runbook: \[ S2 - Error \] \[ farm-name \] Serviceportal Httpclient InUse alert
**Alert Description:** This alert is triggered when Httpclient InUse is > 20 for 30 minutes. The number of active users is more than the target size.
**Alert Severity:** S2 - Error
**Alert Trigger Conditions:**
- Metric: Serviceportal Httpclient InUse
- Threshold: 20
- Duration: 30 minutes
**Actions:**
1. Todo
   1. If the number does not drop, consider a rolling restart of the current deployments, for example gateway/platform/serviceportal.
   2. If the number still does not drop after the step above, do a rolling restart of xmpp.
   3. If the number still does not drop after the steps above, take a thread dump of the affected pod.
      [How to generate thread dump and memory dumps for java applications](https://rndwiki.houston.softwaregrp.net/confluence/display/SMA/How+to+generate+thread+dump+and+memory+dumps+for+java+applications)