ishenwei/nexus

weishen 3f2e1765d8 Auto-sync: 2026-04-18 17:09

2026-04-18 17:09:43 +08:00

Alert-Runbooks-based-on-monitoring_686083866

Alerts, Description and Actions

Alerts come from monitoring and experience.

Here is a reference list of items to be sent as alerts. The Grafana monitoring dashboards are developed based on the list below.

Alert Runbook: [ S0 - Urgent ] [ farm-name ] ALB HTTP 5XX Count alert

Alert Description: This alert is triggered when more than 34 HTTP 5xx errors occur on the frontend within 3 minutes. Multiple end users may be experiencing a production issue on their side.

Alert Severity: S0 - Urgent

Alert Trigger Conditions:

Metric: ALB HTTP 5XX Count
Threshold: 34
Duration: 3 minutes

Actions:

Check whether any other time-correlated alerts are being reported.
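The trigger condition above can be sketched as a count over a sliding window. This is only an illustration: the function name and the input format (a list of timestamps, one per 5xx response) are assumptions, not the way the Grafana alert is actually defined.

```python
from datetime import datetime, timedelta


def alb_5xx_alert(error_times, now, threshold=34, window=timedelta(minutes=3)):
    """Return True when more than `threshold` 5xx errors fall in the window ending at `now`."""
    window_start = now - window
    # Keep only the errors that happened inside the last `window`.
    recent = [t for t in error_times if window_start < t <= now]
    return len(recent) > threshold
```

With the default values, 35 errors inside the last 3 minutes fire the alert, while 34 do not.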

Alert Runbook: [ S3 - Warning ] [ farm-name ] EBS Disk Queue Depth alert

Alert Description: This alert is triggered when the EBS disk queue depth is more than 5 for more than 10 minutes. Tasks on the storage are being queued.

Alert Severity: S3 - Warning

Alert Trigger Conditions:

Metric: EBS disk queue depth
Threshold: 5
Duration: 10 minutes

Actions:

Check
1. Whether EBS is running out of credits, via the EBS burst balance dashboard (same dashboard in the infrastructure page).
2. Whether there is a heavy load on the EBS storage.
Todo
1. No action is required. Usually, if it is a node-level issue, the AWS Auto Scaling group will replace the node after a while.
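Many alerts in this list share the same shape: a value stays above a threshold for a given duration. A minimal sketch of that check follows, assuming samples arrive as time-ordered `(timestamp_seconds, value)` pairs; the function name and input format are hypothetical.

```python
def sustained_breach(samples, threshold, duration_s):
    """Return True when the value has stayed above `threshold` for at least
    `duration_s` seconds, measured up to the latest sample."""
    breach_start = None
    for ts, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = ts  # first sample of the current breach
        else:
            breach_start = None  # any dip resets the timer
    if breach_start is None:
        return False
    return samples[-1][0] - breach_start >= duration_s
```

For this alert the call would be `sustained_breach(samples, threshold=5, duration_s=600)`.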

Alert Runbook: [ S2 - Error ] [ farm-name ] EBS Burst Balance Average alert

Alert Description: This alert is triggered when the EBS burst balance is below 40% for more than 30 minutes. The load on EBS is high and the burst balance may not be enough to serve requests in the following quarter/hour.

Alert Severity: S2 - Error

Alert Trigger Conditions:

Metric: EBS burst balance
Threshold: 40%
Duration: 30 minutes

Actions:

Check
1. Keep monitoring whether EBS will soon run out of credits, via the EBS burst balance dashboard (same dashboard in the infrastructure page).
2. Whether there is a heavy load on the EBS storage.
Todo
1. Usually no action is required. If the alert persists, it is a critical issue; follow the Todo for the case where Burst Balance is 0.

Alert Runbook: [ S0 - Urgent ] [ farm-name ] EBS Burst Balance Average alert

Alert Description: This alert is triggered when the EBS burst balance is 0. Tasks on the storage are being queued. Everything that goes through EBS IO will be slowed down.

Alert Severity: S0 - Urgent

Alert Trigger Conditions:

Metric: EBS burst balance
Threshold: 0
Duration: immediately

Actions:

Check
1. Whether EBS is running out of credits, via the EBS burst balance dashboard (same dashboard in the infrastructure page).
2. Whether there is a heavy load on the EBS storage.
Todo
1. Manually log in to the system to check whether it has slowed down. If it has slowed down dramatically, choose one of the options below to fix it:
  1. Switch the EBS to GP3 with a specified IOPS (in general the default 3000/12000 should be enough; if not, you may raise it to 18000, and you need to switch back to 3000/12000 once the issue is fixed).
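The switch to GP3 can also be scripted. The sketch below is a hedged example that assumes the boto3 SDK is installed and AWS credentials are configured; the function names and the volume ID passed in are hypothetical. The builder is kept separate so the request can be reviewed before anything is sent.

```python
def gp3_switch_request(volume_id, iops):
    """Build the arguments for an EC2 ModifyVolume call that moves a volume to gp3."""
    if iops <= 0:
        raise ValueError("iops must be positive")
    return {"VolumeId": volume_id, "VolumeType": "gp3", "Iops": iops}


def apply_gp3_switch(volume_id, iops):
    """Send the request with boto3 (imported lazily so the builder works without it)."""
    import boto3  # assumption: boto3 is installed and credentials are available

    ec2 = boto3.client("ec2")
    return ec2.modify_volume(**gp3_switch_request(volume_id, iops))
```

The same call with the original IOPS value can be used to switch back once the issue is fixed.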

Alert Runbook: [ S2 - Error ] [ farm-name ] EFS Burst Credit Balance alert

Alert Description: This alert is triggered when the EFS burst credit balance is below 40% for more than 30 minutes. Tasks on the storage will be queued soon.

Alert Severity: S2 - Error

Alert Trigger Conditions:

Metric: EFS Burst Credit Balance
Threshold: 40%
Duration: 30 minutes

Actions:

Check
1. Whether EFS is running out of credits, via the EFS burst credit dashboard (same dashboard in the infrastructure page).
2. Whether there is a heavy load on the EFS storage.
Todo
1. Manually log in to the system to check whether it has slowed down. If it has slowed down dramatically, choose one of the options below to fix it:
  1. Usually no action is required. If the alert persists, it is a critical issue.

Alert Runbook: [ S0 - Urgent ] [ farm-name ] EFS Burst Credit Balance alert

Alert Description: This alert is triggered when the EFS burst credit balance is 0. Tasks on the storage are being queued. Everything that goes through EFS IO will be slowed down.

Alert Severity: S0 - Urgent

Alert Trigger Conditions:

Metric: EFS Burst credit
Threshold: 0
Duration: immediately

Actions:

Check
1. Whether EFS is running out of credits, via the EFS burst credit dashboard (same dashboard in the infrastructure page).
2. Whether there is a heavy load on the EFS storage.
Todo
1. Manually log in to the system to check whether it has slowed down. If it has slowed down dramatically, choose one of the options below to fix it:
  1. Switch the EFS to throughput mode (for example 60 - 100 MB/s; you need to switch back once the issue is fixed).
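A hedged sketch of the EFS switch follows. It assumes "throughput mode" above means the EFS provisioned throughput mode, that boto3 is installed with credentials configured, and that the file system ID is supplied by the operator; the function names are hypothetical. Note that the API takes the value in MiB/s.

```python
def efs_provisioned_request(file_system_id, throughput_mibps):
    """Build an UpdateFileSystem request that switches to provisioned throughput."""
    if throughput_mibps <= 0:
        raise ValueError("throughput must be positive")
    return {
        "FileSystemId": file_system_id,
        "ThroughputMode": "provisioned",
        "ProvisionedThroughputInMibps": float(throughput_mibps),
    }


def efs_bursting_request(file_system_id):
    """Build the request that switches back to bursting mode afterwards."""
    return {"FileSystemId": file_system_id, "ThroughputMode": "bursting"}


def apply_efs_update(request):
    """Send the request with boto3 (imported lazily so the builders work without it)."""
    import boto3  # assumption: boto3 is installed and credentials are available

    return boto3.client("efs").update_file_system(**request)
```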

Alert Runbook: [ S2 - Error ] [ farm-name ] RDS CPU Utilization alert

Alert Description: This alert is triggered when RDS CPU utilization is more than 97% for more than 60 minutes, that is, overall CPU usage has stayed above 97% for more than one hour.

Alert Severity: S2 - Error

Alert Trigger Conditions:

Metric: RDS CPU Utilization
Threshold: 97%
Duration: 60mins

Actions:

Check
1. Performance Insights for the top queries, to find anything consuming more CPU.
Todo
1. Keep monitoring and check whether other database metrics are abnormal.
2. Get the top 10 query information.

Alert Runbook: [ S2 - Error ] [ farm-name ] RDS cpuUtilization System alert

Alert Description: This alert is triggered when RDS system CPU (sy: system) is more than 70% for more than 60 minutes. The CPU is spending more time on system-level processing than on handling the business flow.

Alert Severity: S2 - Error

Alert Trigger Conditions:

Metric: RDS cpuUtilization System
Threshold: 70%
Duration: 60mins

Actions:

Check
1. Performance Insights for the top queries, to find anything consuming more CPU.
Todo
1. Keep monitoring and check whether other database metrics are abnormal.

Alert Runbook: [ S2 - Error ] [ farm-name ] RDS CPU Soft Interrupts alert

Alert Description: This alert is triggered when RDS soft-interrupt CPU (si: soft interrupts) is more than 15% for more than 60 minutes. The CPU is spending more time on system-level processing than on handling the business flow.

Alert Severity: S2 - Error

Alert Trigger Conditions:

Metric: RDS CPU Soft Interrupts
Threshold: 15%
Duration: 60mins

Actions:

Check
1. Performance Insights for the top queries, to find anything consuming more CPU.
Todo
1. Keep monitoring and check whether other database metrics are abnormal.

Alert Runbook: [ S3 - Warning ] [ farm-name ] RDS Disk queue depth alert

Alert Description: This alert is triggered when the RDS EBS disk queue depth is more than 5 for more than 10 minutes. Tasks on the storage are being queued.

Alert Severity: S3 - Warning

Alert Trigger Conditions:

Metric: RDS Disk queue depth
Threshold: 5
Duration: 10mins

Actions:

Check
1. Whether EBS is running out of credits, via the EBS burst balance dashboard (same dashboard in the infrastructure page).
2. Whether there is a heavy load on the EBS storage.

Alert Runbook: [ S2 - Error ] [ farm-name ] RDS Disk Free Storage Space alert

Alert Description: This alert is triggered when the RDS disk Free Storage Space is below 500 MB. The instance is running out of storage.

Alert Severity: S2 - Error

Alert Trigger Conditions:

Metric: RDS Disk Free Storage Space
Threshold: 500 MB
Duration: immediately

Actions:

Todo
a. Add more storage to EBS
b. Enable storage auto-scaling

Alert Runbook: RDS storage auto-scaling quota is not enough

Alert Description: This alert is triggered when the storage does not have enough room left to auto-scale, that is, (Free Space + Max Autoscaling Storage - Allocated Storage) / Allocated Storage < 0.2. The instance is running out of storage.

Alert Severity: S2 - Error

Alert Trigger Conditions:

Metric: (Free Space + Max Autoscaling Storage - Allocated Storage) / Allocated Storage
Threshold: 0.2
Duration: TBD

Actions:

Todo
1. Increase the max auto-scaling storage size.
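The ratio in the trigger condition can be worked through with a short sketch. The function names are hypothetical, and all three inputs are assumed to be in the same unit (for example GiB).

```python
def autoscaling_headroom(free_space, max_autoscaling_storage, allocated_storage):
    """(Free Space + Max Autoscaling Storage - Allocated Storage) / Allocated Storage."""
    if allocated_storage <= 0:
        raise ValueError("allocated storage must be positive")
    return (free_space + max_autoscaling_storage - allocated_storage) / allocated_storage


def storage_quota_alert(free_space, max_autoscaling_storage, allocated_storage, threshold=0.2):
    """Return True when the remaining headroom is below the threshold."""
    return autoscaling_headroom(free_space, max_autoscaling_storage, allocated_storage) < threshold
```

For example, with 5 free, a maximum of 110, and 100 allocated, the ratio is (5 + 110 - 100) / 100 = 0.15, which is below 0.2 and fires the alert.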

Alert Runbook: [ S2 - Error ] [ farm-name ] RDS Free Memory Percentage alert

Alert Description: This alert is triggered when RDS free memory is less than 5% for more than 5 minutes. The instance will run out of memory soon.

Alert Severity: S2 - Error

Alert Trigger Conditions:

Metric: RDS Free Memory Percentage
Threshold: 5%
Duration: 5mins

Actions:

Check
1. Log in to the AWS console → RDS → Monitoring to check whether swap usage is increasing.
Todo
a. Keep monitoring.
b. Consider a rolling restart of the current deployments, for example gateway/platform/serviceportal.

Alert Runbook: [ S0 - Urgent ] [ farm-name ] RDS Free Memory Percentage alert

Alert Description: This alert is triggered when RDS free memory is less than 2% for more than 5 minutes. The instance will run out of memory soon.

Alert Severity: S0 - Urgent

Alert Trigger Conditions:

Metric: RDS Free Memory Percentage
Threshold: 2%
Duration: 5mins

Actions:

Check
1. Log in to the AWS console → RDS → Monitoring to check whether swap usage is increasing.
Todo
1. Consider a rolling restart of the current deployments, for example gateway/platform/serviceportal.
2. If it happens 2-3 times a day and the swap usage is higher, you need to:
  1. Consider scaling up RDS, usually by doubling the memory size.
  2. Tune the database based on the queries identified as memory-consuming.

Alert Runbook: [ S2 - Error ] [ farm-name ] RDS Burst Balance alert

Alert Description: This alert is triggered when the RDS Burst Balance is below 40% for more than 30 minutes. The load on EBS is high and the burst balance may not be enough to serve requests in the following quarter/hour.

Alert Severity: S2 - Error

Alert Trigger Conditions:

Metric: RDS Burst Balance
Threshold: 40%
Duration: 30mins

Actions:

Check
1. Keep monitoring whether EBS will soon run out of credits, via the EBS burst balance dashboard (same dashboard in the infrastructure page).
2. Whether there is a heavy load on the EBS storage.
Todo
1. Usually no action is required. If the alert persists, it is a critical issue; follow the Todo for the case where Burst Balance is 0.

Alert Runbook: [ S0 - Urgent ] [ farm-name ] RDS Burst Balance alert

Alert Description: This alert is triggered when the RDS Burst Balance is 0. Tasks on the storage are being queued. Everything that goes through EBS IO will be slowed down.

Alert Severity: S0 - Urgent

Alert Trigger Conditions:

Metric: RDS Burst credit
Threshold: 0
Duration: immediately

Actions:

Check
1. Whether EBS is running out of credits, via the EBS burst balance dashboard (same dashboard in the infrastructure page).
2. Whether there is a heavy load on the EBS storage.
Todo
1. Manually log in to the system to check whether it has slowed down. If it has slowed down dramatically, choose one of the options below to fix it:
  1. Switch the EBS to GP3 with a specified IOPS (in general the default 12000 should be enough; if not, you may raise it to 18000, and you need to switch back to 12000 once the issue is fixed).
  2. Add more storage to the EBS.

Alert Runbook: [ S2 - Error ] [ farm-name ] SMA/CMS RDS DBLoad alert

Alert Description: This alert is triggered when DBLoad is more than 2 times the number of CPUs for more than one hour (AWS specific, via Performance Insights). The database is overloaded.

Alert Severity: S2 - Error

Alert Trigger Conditions:

Metric: RDS DBLoad
Threshold: 2 times the number of CPUs
Duration: 1 hour

Actions:

Check
1. AWS console → RDS → Performance Insights to check which kind of operation is taking most of the time.

Alert Runbook: [ S1 - Critical ] [ farm-name ] SMA/CMS RDS DBLoad alert

Alert Description: This alert is triggered when DBLoad is more than 4 times the number of CPUs for more than one hour. The database is heavily overloaded, mostly on CPU.

Alert Severity: S1 - Critical

Alert Trigger Conditions:

Metric: RDS DBLoad
Threshold: 4 times the number of CPUs
Duration: 1 hour

Actions:

Check
1. AWS console → RDS → Performance Insights to check which kind of operation is taking most of the time.

Alert Runbook: [ S3 - Warning ] [ farm-name ] SMA/CMS RDS DBLoadNonCPU alert

Alert Description: This alert is triggered when DBLoadNonCPU is more than 1 times the number of CPUs for more than one hour. The database is blocked on something other than CPU; it can be blocked by DB locks, read/write IO, or other causes.

Alert Severity: S3 - Warning

Alert Trigger Conditions:

Metric: RDS DBLoadNonCPU
Threshold: 1 times the number of CPUs
Duration: 1 hour

Actions:

Check
1. AWS console → RDS → Performance Insights to check which operation is taking most of the time.
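The three DBLoad alerts scale their thresholds with the CPU count of the instance. A minimal sketch of that classification is below; it assumes the input values have already been sustained for one hour, and it reports only the higher of the two DBLoad severities. The function name is hypothetical.

```python
def dbload_alerts(db_load, db_load_non_cpu, cpu_count):
    """Return the severities that apply for the given load values and CPU count."""
    alerts = []
    if db_load > 4 * cpu_count:
        alerts.append("S1 - Critical")
    elif db_load > 2 * cpu_count:
        alerts.append("S2 - Error")
    if db_load_non_cpu > 1 * cpu_count:
        alerts.append("S3 - Warning")
    return alerts
```

On a 4-CPU instance, a DBLoad of 9 gives S2 (above 8), and a DBLoad of 17 gives S1 (above 16).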

Alert Runbook: [ S2 - Error ] [ farm-name ] Node CPU Usage alert

Alert Description: This alert is triggered when node CPU usage is more than 97% for more than 60 minutes. The instance has been almost out of CPU for more than 60 minutes.

Alert Severity: S2 - Error

Alert Trigger Conditions:

Metric: Node CPU Usage
Threshold: 97%
Duration: 60mins

Actions:

Todo
1. Keep monitoring

Alert Runbook: [ S2 - Error ] [ farm-name ] Node CPU System alert

Alert Description: This alert is triggered when node system CPU (sy: system) is more than 70% for more than 60 minutes. The instance is too busy with its own system operations to handle normal business tasks.

Alert Severity: S2 - Error

Alert Trigger Conditions:

Metric: Node CPU System
Threshold: 70%
Duration: 60mins

Actions:

Todo
1. Keep monitoring

Alert Runbook: [ S2 - Error ] [ farm-name ] Node CPU Soft Interrupts alert

Alert Description: This alert is triggered when node soft-interrupt CPU (si: soft interrupts) is more than 15% for more than 60 minutes. The instance has been almost out of CPU for more than 60 minutes.

Alert Severity: S2 - Error

Alert Trigger Conditions:

Metric: Node CPU Soft Interrupts
Threshold: 15%
Duration: 60mins

Actions:

Todo
1. Keep monitoring

Alert Runbook: [ S3 - Warning ] [ farm-name ] Node Mem Usage alert

Alert Description: This alert is triggered when node memory usage is more than 95% for more than 10 minutes. The instance has been almost out of memory for more than 10 minutes.

Alert Severity: S3 - Warning

Alert Trigger Conditions:

Metric: Node Mem Usage
Threshold: 95%
Duration: 10mins

Actions:

Todo
1. Keep monitoring

Alert Runbook: [ S3 - Warning ] [ farm-name ] Node Disk Usage alert

Alert Description: This alert is triggered when node disk usage is more than 95%. The instance is almost out of disk space.

Alert Severity: S3 - Warning

Alert Trigger Conditions:

Metric: Node Disk Usage
Threshold: 95%
Duration: immediately

Actions:

Todo
1. Add more storage to the disk

Alert Runbook: [ S3 - Warning ] [ farm-name ] Disk Inode Usage alert

Alert Description: This alert is triggered when disk inode usage is more than 97%. The instance will very soon be blocked by the OS-level soft limit on inodes.

Alert Severity: S3 - Warning

Alert Trigger Conditions:

Metric: Disk Inode Usage
Threshold: 97%
Duration: immediately

Actions:

Todo
1. Restart pods on the instance to release inode usage.
2. If the step above does not help, open an incident for further analysis.
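To confirm the inode usage on a node by hand, a small sketch using the standard library is shown below. It assumes a Unix-like host (`os.statvfs` is not available on Windows), and the function names are hypothetical.

```python
import os


def inode_usage_percent(total_inodes, free_inodes):
    """Percentage of inodes in use; 0.0 for filesystems that report no inodes."""
    if total_inodes == 0:
        return 0.0
    return 100.0 * (total_inodes - free_inodes) / total_inodes


def path_inode_usage(path="/"):
    """Read inode counters for the filesystem that holds `path`."""
    st = os.statvfs(path)
    return inode_usage_percent(st.f_files, st.f_ffree)
```

A result above 97.0 from `path_inode_usage` corresponds to the alert threshold.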

Alert Runbook: [ S3 - Warning ] [ farm-name ] Node Load Avg 15m/core

Alert Description: This alert is triggered when the node 15-minute load average per core is more than 200% for 35 minutes. The instance has been overloaded for more than 35 minutes.

Alert Severity: S3 - Warning

Alert Trigger Conditions:

Metric: Node Load Avg 15m/core
Threshold: 2
Duration: 35mins

Actions:

Todo
1. Keep monitoring.
2. If it happens multiple times in a day, run the rebalancing pod script.
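The metric divides the 15-minute load average by the number of cores, so a threshold of 2 means twice as many runnable tasks as cores. A minimal sketch, assuming a Unix-like node (where `os.getloadavg` is available); the function names are hypothetical.

```python
import os


def load_per_core(load_15m, cores):
    """15-minute load average normalised by core count."""
    return load_15m / cores


def node_overloaded(load_15m, cores, threshold=2.0):
    """Return True when the per-core load is above the threshold."""
    return load_per_core(load_15m, cores) > threshold


def current_load_per_core():
    """Read the live values from the host this script runs on."""
    _, _, load_15m = os.getloadavg()
    return load_per_core(load_15m, os.cpu_count() or 1)
```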

Alert Runbook: [ S2 - Error ] [ farm-name ] Pod CPU usage alert

Alert Description: This alert is triggered when pod CPU usage is more than 97% for more than 60 minutes. The pod has been almost out of CPU for more than 60 minutes.

Alert Severity: S2 - Error

Alert Trigger Conditions:

Metric: Pod CPU usage
Threshold: 97%
Duration: 60mins

Actions:

Todo
1. Keep monitoring

Alert Runbook: [ S3 - Warning ] [ farm-name ] Pod Inode Usage alert

Alert Description: This alert is triggered when pod inode usage (free/total) is more than 97%. The instance will very soon be blocked by the OS-level soft limit on inodes.

Alert Severity: S3 - Warning

Alert Trigger Conditions:

Metric: Pod Inode Usage
Threshold: 97%
Duration: immediately

Actions:

Todo
1. Restart pods on the instance to release inode usage.
2. If the step above does not help, open an incident for further analysis.

Alert Runbook: [ S0 - Urgent ] [ farm-name ] SMA Unavailable k8s resource alert

Alert Description: This alert is triggered when these services (portal / runtime ui / gateway / platform / redis / rabbitmq / bo-login / idm / bo-ats / ingress-nginx / sma-ui / bo-farcade) are not available.

Alert Severity: S0 - Urgent

Alert Trigger Conditions:

Metric: services not available
Threshold: 0
Duration: immediately

Actions:

Todo
1. Run 'kubectl describe -n ' and 'kubectl logs -n ' to understand the reason for the failure.
2. Try to fix the issue based on the results from step 1.
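The two kubectl commands above are shown without their arguments. The sketch below is one way to fill them in for a single pod; the namespace and pod name are hypothetical placeholders supplied by the operator, and `--tail=200` is an added assumption to keep the log output short.

```python
import subprocess


def kubectl_diagnostics(namespace, pod):
    """Build the describe and logs commands for one pod."""
    return [
        ["kubectl", "describe", "pod", pod, "-n", namespace],
        ["kubectl", "logs", pod, "-n", namespace, "--tail=200"],
    ]


def run_diagnostics(namespace, pod):
    """Run both commands and print whatever kubectl returns."""
    for cmd in kubectl_diagnostics(namespace, pod):
        result = subprocess.run(cmd, capture_output=True, text=True, check=False)
        print(result.stdout or result.stderr)
```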

Alert Runbook: [ S2 - Error ] [ farm-name ] SMA Unavailable k8s resource alert

Alert Description: This alert is triggered when these services (others not in S0, search related (content, DIH, DAH, search, proxy) / auto pass / bo-ui / bo-user) are not available now.

Alert Severity: S2 - Error

Alert Trigger Conditions:

Metric: services not available
Threshold: 0
Duration: immediately

Actions:

Todo
1. Run 'kubectl describe -n ' and 'kubectl logs -n ' to understand the reason for the failure.
2. Try to fix the issue based on the results from step 1.

Alert Runbook: [ S3 - Warning ] [ farm-name ] SMA Unavailable k8s resource alert

Alert Description: This alert is triggered when these services (XMPP / XIE / Smart Ticket / stx / virtual agent / ppo / web socket gateway / smart-ui / ocr / smarta-installer ) are not available now.

Alert Severity: S3 - Warning

Alert Trigger Conditions:

Metric: services not available
Threshold: 0
Duration: immediately

Actions:

Todo
1. Run 'kubectl describe -n ' and 'kubectl logs -n ' to understand the reason for the failure.
2. Try to fix the issue based on the results from step 1.

Alert Runbook: [ S4 - Info ] [ farm-name ] SMA Unavailable k8s resource alert

Alert Description: This alert is triggered when services outside of ESM / toolkit are not available.

Alert Severity: S4 - Info

Alert Trigger Conditions:

Metric: services not available
Threshold: 0
Duration: immediately

Actions:

Todo
1. Run 'kubectl describe -n ' and 'kubectl logs -n ' to understand the reason for the failure.
2. Try to fix the issue based on the results from step 1.

Alert Runbook: [ S0 - Urgent ] [ farm-name ] CMS Unavailable k8s resource alert

Alert Description: This alert is triggered when these services (itom-cms-gateway, itom-idm, itom-ingress-controller, itom-ucmdb-browser, tom-ucmdb-solr, itom-ucmdb) are not available now.

Alert Severity: S0 - Urgent

Alert Trigger Conditions:

Metric: services not available
Threshold: 0
Duration: immediately

Actions:

Todo
1. Run 'kubectl describe -n ' and 'kubectl logs -n ' to understand the reason for the failure.
2. Try to fix the issue based on the results from step 1.

Alert Runbook: [ S2 - Error ] [ farm-name ] CMS Unavailable k8s resource alert

Alert Description: This alert is triggered when these services ( itom-autopass-lms, itom-vault) are not available now.

Alert Severity: S2 - Error

Alert Trigger Conditions:

Metric: services not available
Threshold: 0
Duration: immediately

Actions:

Todo
1. Run 'kubectl describe -n ' and 'kubectl logs -n ' to understand the reason for the failure.
2. Try to fix the issue based on the results from step 1.

Alert Runbook: [ S4 - Info ] [ farm-name ] CMS Unavailable k8s resource alert

Alert Description: This alert is triggered when these services ( itom-ucmdb-probe, itom-ucmdb-dfp-lunux-installer, itom-ucmdb-dfp-windows-installer, itom-ucmdb-localclient-installers ) are not available now.

Alert Severity: S4 - Info

Alert Trigger Conditions:

Metric: services not available
Threshold: 0
Duration: immediately

Actions:

Todo
1. Run 'kubectl describe -n ' and 'kubectl logs -n ' to understand the reason for the failure.
2. Try to fix the issue based on the results from step 1.
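The three CMS alerts differ only in which services they cover, so the severity can be looked up from the service name. The sketch below copies the service names exactly as they appear in the descriptions above; the dictionary and function names are hypothetical.

```python
# Service names are copied verbatim from the CMS alert descriptions.
CMS_SEVERITY = {
    "S0 - Urgent": [
        "itom-cms-gateway", "itom-idm", "itom-ingress-controller",
        "itom-ucmdb-browser", "tom-ucmdb-solr", "itom-ucmdb",
    ],
    "S2 - Error": ["itom-autopass-lms", "itom-vault"],
    "S4 - Info": [
        "itom-ucmdb-probe", "itom-ucmdb-dfp-lunux-installer",
        "itom-ucmdb-dfp-windows-installer", "itom-ucmdb-localclient-installers",
    ],
}


def cms_severity(service):
    """Return the alert severity for a CMS service, or None if it is not listed."""
    for severity, services in CMS_SEVERITY.items():
        if service in services:
            return severity
    return None
```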

Alert Runbook: Pod Load Avg 10s

Alert Description: This alert is triggered when Pod Load Avg 10s is more than 200% for 35mins.

Alert Severity: S3 - Warning

Alert Trigger Conditions:

Metric: Pod Load Avg 10s
Threshold: 200%
Duration: 35mins

Actions:

Todo
1. Keep monitoring.
2. If it happens multiple times in a day, run the rebalancing pod script.

Alert Runbook: [ S3 - Warning ] [ farm-name ] SmartA Data Compact Ration alert

Alert Description: This alert is triggered when the content data ratio (total doc / committed doc) is more than 1.20. All queries against IDOL will take more time and slow down.

Alert Severity: S3 - Warning

Alert Trigger Conditions:

Metric: SmartA Data Compact Ration
Threshold: 1.20
Duration: immediately

Actions:

Todo
1. Run the Jenkins job for IDOL compact.
2. Or follow the steps in the guide below:
  https://docs.microfocus.com/doc/SMAX/23.4/Searchslow
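The compact ratio is a plain division of the two document counts. A small sketch of the check follows; the function names are hypothetical, and how the two counts are read from IDOL is left out.

```python
def compact_ratio(total_docs, committed_docs):
    """Content data ratio: total documents divided by committed documents."""
    if committed_docs <= 0:
        raise ValueError("committed_docs must be positive")
    return total_docs / committed_docs


def needs_compaction(total_docs, committed_docs, threshold=1.20):
    """Return True when the ratio is above the alert threshold."""
    return compact_ratio(total_docs, committed_docs) > threshold
```

For example, 1300 total documents against 1000 committed documents gives 1.3, which is above 1.20.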

Alert Runbook: [ S3 - Warning ] [ farm-name ] Rabbitmq Queue alert

Alert Description: This alert is triggered when the queue on each rabbitmq node is more than 200 / 250 for more than 30 minutes (200 for medium profile or lower, 250 for large profile). The rabbitmq queues are higher than normal.

Alert Severity: S3 - Warning

Alert Trigger Conditions:

Metric: Rabbitmq Queue
Threshold: 200/250
Duration: 30mins

Actions:

Todo
1. Keep monitoring.
2. If it keeps getting higher, consider performing the steps mentioned here:
  https://docs.microfocus.com/doc/SMAX/23.4/RabbitMQNotStart#Solution
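Because the threshold depends on the deployment profile, a per-profile lookup keeps the check in one place. This is a minimal sketch: the profile strings, node names, and function names are hypothetical, and the duration condition (30 minutes) is assumed to be handled elsewhere.

```python
def rabbitmq_queue_threshold(profile):
    """250 for the large profile, 200 for medium or lower."""
    return 250 if profile == "large" else 200


def nodes_over_threshold(queue_by_node, profile):
    """Return the names of the nodes whose queue value is above the threshold."""
    limit = rabbitmq_queue_threshold(profile)
    return sorted(node for node, value in queue_by_node.items() if value > limit)
```

Usage: `nodes_over_threshold({"rabbit-0": 210, "rabbit-1": 190}, "medium")` lists only `rabbit-0`, and the same values on a large profile list nothing.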

Alert Runbook: [ S3 - Warning ] [ farm-name ] Rabbitmq Messages/Minute alert

Alert Description: This alert is triggered when Pending Messages/Minute > 500 for more than 30 mins. The pending messages in rabbitmq are getting accumulated.

Alert Severity: S3 - Warning

Alert Trigger Conditions:

Metric: Rabbitmq Messages/Minute
Threshold: 500
Duration: 30mins

Actions:

Todo
1. Keep monitoring.
2. If it keeps getting higher, consider performing the steps mentioned here:
  https://docs.microfocus.com/doc/SMAX/23.4/RabbitMQNotStart#Solution

Alert Runbook: Message queue not equally distributed to different cluster nodes

Alert Description: This alert is triggered when the message queues are not equally distributed across the cluster nodes. The rabbitmq nodes are not working as a cluster, which can make rabbitmq unstable.

Alert Severity: S1 - Critical

Alert Trigger Conditions:

Metric: Rabbitmq Message queue
Threshold: TBD
Duration: TBD

Actions:

Todo
1. Scale down the rabbitmq node which is not in the cluster.
2. Remove the <rabbitmq-infra-rabbitmq-n>/data/xservices/rabbitmq/x.x.x.xx/mnesia folders on the NFS server or the bastion node.
3. Wait until the rabbitmq nodes are ready.

Alert Runbook: [ S4 - Info ] [ farm-name ] IDM active users alert

Alert Description: This alert is triggered when the number of IDM active users exceeds the per-profile limit for more than 30 minutes: more than 1100 for the medium profile, or more than 3000 for the large profile. The number of active users is larger than the target size.

Alert Severity: S4 - Info

Alert Trigger Conditions:

Metric: IDM active users
Threshold: 1100/3000
Duration: 30mins

Actions:

Todo
1. Keep monitoring

Alert Runbook: [ S2 - Error ] [ farm-name ] Gateway Tomcat https connector currentThreadsBusy alert

Alert Description: This alert is triggered when Tomcat https connector currentThreadsBusy > 30 for 30 mins. The active user number is more than the target size.

Alert Severity: S2 - Error

Alert Trigger Conditions:

Metric: Gateway Tomcat https connector currentThreadsBusy
Threshold: 30
Duration: 30mins

Actions:

Todo
1. If the number does not drop, consider a rolling restart of the current deployments, for example gateway/platform/serviceportal.
2. If the number does not drop after the step above, do a rolling restart of xmpp.
3. If the number still does not drop, take a thread dump of the affected pod.
  How to generate thread dump and memory dumps for java applications
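The escalation order above can be written down as data so the same steps are followed every time. The sketch below only builds the `kubectl rollout restart` commands; it assumes the services run as Kubernetes Deployments named exactly gateway, platform, serviceportal, and xmpp, and the namespace is a placeholder. Step 3 (the thread dump) has no restart command and is left to the linked guide.

```python
ESCALATION_STEPS = [
    ["gateway", "platform", "serviceportal"],  # step 1: restart the main deployments
    ["xmpp"],                                  # step 2: restart xmpp
]


def rollout_restart_commands(step, namespace):
    """Return the kubectl commands for escalation step 1 or 2; step 3 returns none."""
    if step not in (1, 2, 3):
        raise ValueError("step must be 1, 2 or 3")
    if step == 3:
        return []  # thread dump step: follow the guide instead of restarting
    return [
        ["kubectl", "rollout", "restart", f"deployment/{name}", "-n", namespace]
        for name in ESCALATION_STEPS[step - 1]
    ]
```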

Alert Runbook: [ S2 - Error ] [ farm-name ] Gateway Httpclient InUse alert

Alert Description: This alert is triggered when Httpclient InUse > 20 for 30 mins. The active user number is more than the target size.

Alert Severity: S2 - Error

Alert Trigger Conditions:

Metric: Gateway Httpclient InUse
Threshold: 20
Duration: 30mins

Actions:

Todo
1. If the number does not drop, consider a rolling restart of the current deployments, for example gateway/platform/serviceportal.
2. If the number does not drop after the step above, do a rolling restart of xmpp.
3. If the number still does not drop, take a thread dump of the affected pod.
  How to generate thread dump and memory dumps for java applications

Alert Runbook: [ S2 - Error ] [ farm-name ] Platform Tomcat https connector currentThreadsBusy alert

Alert Description: This alert is triggered when Tomcat https connector currentThreadsBusy > 30 for 30 mins. The active user number is more than the target size.

Alert Severity: S2 - Error

Alert Trigger Conditions:

Metric: Platform Tomcat https connector currentThreadsBusy
Threshold: 30
Duration: 30mins

Actions:

Todo
1. If the number does not drop, consider a rolling restart of the current deployments, for example gateway/platform/serviceportal.
2. If the number does not drop after the step above, do a rolling restart of xmpp.
3. If the number still does not drop, take a thread dump of the affected pod.
  How to generate thread dump and memory dumps for java applications

Alert Runbook: [ S2 - Error ] [ farm-name ] Platform Httpclient InUse alert

Alert Description: This alert is triggered when Httpclient InUse > 20 for 30 mins. The active user number is more than the target size.

Alert Severity: S2 - Error

Alert Trigger Conditions:

Metric: Platform Httpclient InUse
Threshold: 20
Duration: 30mins

Actions:

Todo
1. If the number does not drop, consider a rolling restart of the current deployments, for example gateway/platform/serviceportal.
2. If the number does not drop after the step above, do a rolling restart of xmpp.
3. If the number still does not drop, take a thread dump of the affected pod.
  How to generate thread dump and memory dumps for java applications

Alert Runbook: [ S2 - Error ] [ farm-name ] Serviceportal Tomcat https connector currentThreadsBusy alert

Alert Description: This alert is triggered when Tomcat https connector currentThreadsBusy > 30 for 30 mins. The active user number is more than the target size.

Alert Severity: S2 - Error

Alert Trigger Conditions:

Metric: Serviceportal Tomcat https connector currentThreadsBusy
Threshold: 30
Duration: 30mins

Actions:

Todo
1. If the number does not drop, consider a rolling restart of the current deployments, for example gateway/platform/serviceportal.
2. If the number does not drop after the step above, do a rolling restart of xmpp.
3. If the number still does not drop, take a thread dump of the affected pod.
  How to generate thread dump and memory dumps for java applications

Alert Runbook: [ S2 - Error ] [ farm-name ] Serviceportal Httpclient InUse alert

Alert Description: This alert is triggered when Httpclient InUse > 20 for 30 mins. The active user number is more than the target size.

Alert Severity: S2 - Error

Alert Trigger Conditions:

Metric: Serviceportal Httpclient InUse
Threshold: 20
Duration: 30mins

Actions:

Todo
1. If the number does not drop, consider a rolling restart of the current deployments, for example gateway/platform/serviceportal.
2. If the number does not drop after the step above, do a rolling restart of xmpp.
3. If the number still does not drop, take a thread dump of the affected pod.
  How to generate thread dump and memory dumps for java applications
