Alert Runbooks Based on Monitoring
Alerts, Description and Actions
Alerts come with monitoring and experience.
Here is a reference list of items to be sent as alerts. The Grafana monitoring dashboards are developed based on the list below.
Alert Runbook: [ S0 - Urgent ] [ farm-name ] ALB HTTP 5XX Count alert
Alert Description: This alert is triggered when more than 34 5xx errors occur on the frontend within 3 minutes. Multiple end users may be experiencing a production issue.
Alert Severity: S0 - Urgent
Alert Trigger Conditions:
- Metric: ALB HTTP 5XX Count
- Threshold: 34
- Duration: 3 minutes
Actions:
- Check whether any other time-correlated alerts are reporting.
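The trigger condition above can be read as a windowed count. The following is a minimal sketch, assuming a hypothetical list of per-minute 5xx counts (oldest first); the function name and input shape are illustrative only and are not taken from the monitoring stack.

```python
def alb_5xx_alert(per_minute_counts, threshold=34, window_minutes=3):
    """Return True if the 5xx total over the most recent window exceeds the threshold.

    per_minute_counts: hypothetical list of ALB 5xx counts, one per minute, oldest first.
    """
    recent = per_minute_counts[-window_minutes:]
    return sum(recent) > threshold
```

Only the last three samples are summed, so an older spike outside the window does not keep the alert firing.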
Alert Runbook: [ S3 - Warning ] [ farm-name ] EBS Disk Queue Depth alert
Alert Description: This alert is triggered when the EBS disk queue depth is more than 5 for more than 10 minutes. Tasks on the storage are being queued.
Alert Severity: S3 - Warning
Alert Trigger Conditions:
- Metric: EBS disk queue depth
- Threshold: 5
- Duration: 10 minutes
Actions:
- Check
  1. Whether EBS is running out of credits, using the EBS burst balance dashboard (the same dashboard on the infrastructure page).
  2. Whether there is a heavy load against the EBS storage.
- Todo
  - No action is required. Usually, if it is a node-level issue, the AWS Auto Scaling group will replace the node after a while.
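Most alerts in this list combine a threshold with a duration. A minimal sketch of that "above the threshold for the whole duration" check follows, assuming one sample per minute (oldest first); the function name and sampling interval are assumptions for illustration.

```python
def breached_for(values, threshold, duration_minutes):
    """Return True if every one of the last `duration_minutes` samples exceeds the threshold.

    values: hypothetical per-minute metric samples, oldest first.
    """
    if len(values) < duration_minutes:
        # Not enough history yet to say the breach was sustained.
        return False
    return all(v > threshold for v in values[-duration_minutes:])
```

For the queue depth alert above this would be called as `breached_for(samples, 5, 10)`.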
Alert Runbook: [ S2 - Error ] [ farm-name ] EBS Burst Balance Average alert
Alert Description: This alert is triggered when the EBS burst balance is below 40% for more than 30 minutes. The load on EBS is high, and the burst balance may not be enough to serve requests over the following quarter of an hour or hour.
Alert Severity: S2 - Error
Alert Trigger Conditions:
- Metric: EBS burst balance
- Threshold: 40%
- Duration: 30 minutes
Actions:
- Check
  1. Keep monitoring whether EBS will soon run out of credits, using the EBS burst balance dashboard (the same dashboard on the infrastructure page).
  2. Whether there is a heavy load against the EBS storage.
- Todo
  - Usually no action is required. If the alert persists, it is a critical issue; follow the Todo for the alert raised when Burst Balance is 0.
Alert Runbook: [ S0 - Urgent ] [ farm-name ] EBS Burst Balance Average alert
Alert Description: This alert is triggered when the EBS burst balance is 0. Tasks on the storage are being queued. Everything that goes through EBS I/O will be slowed down.
Alert Severity: S0 - Urgent
Alert Trigger Conditions:
- Metric: EBS burst balance
- Threshold: 0
- Duration: immediately
Actions:
- Check
  1. Whether EBS is running out of credits, using the EBS burst balance dashboard (the same dashboard on the infrastructure page).
  2. Whether there is a heavy load against the EBS storage.
- Todo
  - Manually log in to the system to check whether it has been slowed down. If it has slowed down dramatically, choose one of the options below to fix it:
    - Switch the EBS volume to gp3 with a specified IOPS (in general, the default 3000/12000 should be enough; if not, you may raise it to 18000, and switch back to 3000/12000 once the issue is fixed).
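The two EBS burst balance alerts differ only in threshold and duration. The sketch below shows how the two severities relate, assuming per-minute burst balance percentages (oldest first); the function name and input shape are hypothetical.

```python
def burst_balance_severity(balances):
    """Map recent burst balance samples to the severity used in this runbook.

    balances: hypothetical per-minute burst balance percentages, oldest first.
    Returns "S0 - Urgent", "S2 - Error", or None when neither alert applies.
    """
    if not balances:
        return None
    if balances[-1] <= 0:
        # Balance exhausted: the S0 alert fires immediately.
        return "S0 - Urgent"
    window = balances[-30:]
    if len(window) == 30 and all(b < 40 for b in window):
        # Below 40% for a full 30 minutes: the S2 alert.
        return "S2 - Error"
    return None
```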
Alert Runbook: [ S2 - Error ] [ farm-name ] EFS Burst Credit Balance alert
Alert Description: This alert is triggered when the EFS burst credit balance is below 40% for more than 30 minutes. Tasks on the storage will be queued soon.
Alert Severity: S2 - Error
Alert Trigger Conditions:
- Metric: EFS Burst Credit Balance
- Threshold: 40%
- Duration: 30 minutes
Actions:
- Check
  1. Whether EFS is running out of credits, using the EFS burst credit dashboard (the same dashboard on the infrastructure page).
  2. Whether there is a heavy load against the EFS storage.
- Todo
  - Manually log in to the system to check whether it has been slowed down. If it has slowed down dramatically, choose one of the options below to fix it:
    - Usually no action is required. If the alert persists, it is a critical issue.
Alert Runbook: [ S0 - Urgent ] [ farm-name ] EFS Burst Credit Balance alert
Alert Description: This alert is triggered when the EFS burst credit balance is 0. Tasks on the storage are being queued. Everything that goes through EFS I/O will be slowed down.
Alert Severity: S0 - Urgent
Alert Trigger Conditions:
- Metric: EFS Burst credit
- Threshold: 0
- Duration: immediately
Actions:
- Check
  1. Whether EFS is running out of credits, using the EFS burst credit dashboard (the same dashboard on the infrastructure page).
  2. Whether there is a heavy load against the EFS storage.
- Todo
  - Manually log in to the system to check whether it has been slowed down. If it has slowed down dramatically, choose one of the options below to fix it:
    - Switch the EFS to provisioned throughput mode (for example, 60 - 100 MB/s; switch back once the issue is fixed).
Alert Runbook: [ S2 - Error ] [ farm-name ] RDS CPU Utilization alert
Alert Description: This alert is triggered when RDS CPU utilization is more than 97% for more than 60 minutes. The overall CPU of the database instance has been nearly saturated for over an hour.
Alert Severity: S2 - Error
Alert Trigger Conditions:
- Metric: RDS CPU Utilization
- Threshold: 97%
- Duration: 60 minutes
Actions:
- Check
  - Performance Insights for the top queries, looking for anything consuming excessive CPU.
- Todo
  1. Keep monitoring and check whether other database metrics are abnormal.
  2. Get the top 10 query information.
Alert Runbook: [ S2 - Error ] [ farm-name ] RDS cpuUtilization System alert
Alert Description: This alert is triggered when RDS system CPU (sy) is more than 70% for more than 60 minutes. The CPU is spending more time on system-level processing than on handling the business flow.
Alert Severity: S2 - Error
Alert Trigger Conditions:
- Metric: RDS cpuUtilization System
- Threshold: 70%
- Duration: 60 minutes
Actions:
- Check
  - Performance Insights for the top queries, looking for anything consuming excessive CPU.
- Todo
  - Keep monitoring and check whether other database metrics are abnormal.
Alert Runbook: [ S2 - Error ] [ farm-name ] RDS CPU Soft Interrupts alert
Alert Description: This alert is triggered when RDS soft interrupts (si) are more than 15% for more than 60 minutes. The CPU is spending more time on system-level processing than on handling the business flow.
Alert Severity: S2 - Error
Alert Trigger Conditions:
- Metric: RDS CPU Soft Interrupts
- Threshold: 15%
- Duration: 60 minutes
Actions:
- Check
  - Performance Insights for the top queries, looking for anything consuming excessive CPU.
- Todo
  - Keep monitoring and check whether other database metrics are abnormal.
Alert Runbook: [ S3 - Warning ] [ farm-name ] RDS Disk queue depth alert
Alert Description: This alert is triggered when the RDS EBS disk queue depth is more than 5 for more than 10 minutes. Tasks on the storage are being queued.
Alert Severity: S3 - Warning
Alert Trigger Conditions:
- Metric: RDS Disk queue depth
- Threshold: 5
- Duration: 10 minutes
Actions:
- Check
  1. Whether EBS is running out of credits, using the EBS burst balance dashboard (the same dashboard on the infrastructure page).
  2. Whether there is a heavy load against the EBS storage.
Alert Runbook: [ S2 - Error ] [ farm-name ] RDS Disk Free Storage Space alert
Alert Description: This alert is triggered when the RDS disk free storage space is below 500 MB. The instance is running out of storage.
Alert Severity: S2 - Error
Alert Trigger Conditions:
- Metric: RDS Disk Free Storage Space
- Threshold: 500 MB
- Duration: immediately
Actions:
- Todo
  a. Add more storage to EBS.
  b. Enable storage auto-scaling.
Alert Runbook: RDS storage auto-scaling quota is not enough
Alert Description: This alert is triggered when the storage does not have enough room left to auto-scale, that is, when (Free Space + Max Autoscaling Storage - Allocated Storage) / Allocated Storage < 0.2. The instance is running out of storage.
Alert Severity: S2 - Error
Alert Trigger Conditions:
- Metric: (Free Space + Max Autoscaling Storage - Allocated Storage) / Allocated Storage
- Threshold: 0.2
- Duration: TBD
Actions:
- Todo
- Increase the max auto-scaling storage size.
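The metric for this alert is a small piece of arithmetic, sketched below. The function and parameter names are hypothetical; all three inputs are assumed to be in the same unit (for example GiB).

```python
def autoscaling_headroom_ratio(free_space, max_autoscaling_storage, allocated_storage):
    """(Free Space + Max Autoscaling Storage - Allocated Storage) / Allocated Storage."""
    return (free_space + max_autoscaling_storage - allocated_storage) / allocated_storage


def autoscaling_quota_alert(free_space, max_autoscaling_storage, allocated_storage, threshold=0.2):
    """Return True when the remaining headroom ratio is below the threshold."""
    ratio = autoscaling_headroom_ratio(free_space, max_autoscaling_storage, allocated_storage)
    return ratio < threshold
```

For example, with 5 GiB free, a 110 GiB auto-scaling maximum and 100 GiB allocated, the ratio is (5 + 110 - 100) / 100 = 0.15, which is below 0.2, so the alert would fire.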
Alert Runbook: [ S2 - Error ] [ farm-name ] RDS Free Memory Percentage alert
Alert Description: This alert is triggered when RDS free memory is less than 5% for more than 5 minutes. The instance will run out of memory soon.
Alert Severity: S2 - Error
Alert Trigger Conditions:
- Metric: RDS Free Memory Percentage
- Threshold: 5%
- Duration: 5 minutes
Actions:
- Check
  - Log in to the AWS console → RDS → Monitoring to check whether swap usage is increasing.
- Todo
  a. Keep monitoring.
  b. Consider a rolling restart of the current deployment, for example, gateway/platform/serviceportal.
Alert Runbook: [ S0 - Urgent ] [ farm-name ] RDS Free Memory Percentage alert
Alert Description: This alert is triggered when RDS free memory is less than 2% for more than 5 minutes. The instance will run out of memory soon.
Alert Severity: S0 - Urgent
Alert Trigger Conditions:
- Metric: RDS Free Memory Percentage
- Threshold: 2%
- Duration: 5 minutes
Actions:
- Check
  - Log in to the AWS console → RDS → Monitoring to check whether swap usage is increasing.
- Todo
  1. Consider a rolling restart of the current deployment, for example, gateway/platform/serviceportal.
  2. If it happens 2-3 times a day and swap usage is getting higher:
     1. Consider scaling up RDS, usually by doubling the memory size.
     2. Tune the database based on the queries identified as memory-consuming.
Alert Runbook: [ S2 - Error ] [ farm-name ] RDS Burst Balance alert
Alert Description: This alert is triggered when the RDS Burst Balance is below 40% for more than 30 minutes. The load on EBS is high, and the burst balance may not be enough to serve requests over the following quarter of an hour or hour.
Alert Severity: S2 - Error
Alert Trigger Conditions:
- Metric: RDS Burst Balance
- Threshold: 40%
- Duration: 30 minutes
Actions:
- Check
  1. Keep monitoring whether EBS will soon run out of credits, using the EBS burst balance dashboard (the same dashboard on the infrastructure page).
  2. Whether there is a heavy load against the EBS storage.
- Todo
  - Usually no action is required. If the alert persists, it is a critical issue; follow the Todo for the alert raised when Burst Balance is 0.
Alert Runbook: [ S0 - Urgent ] [ farm-name ] RDS Burst Balance alert
Alert Description: This alert is triggered when the RDS Burst Balance is 0. Tasks on the storage are being queued. Everything that goes through EBS I/O will be slowed down.
Alert Severity: S0 - Urgent
Alert Trigger Conditions:
- Metric: RDS Burst Balance
- Threshold: 0
- Duration: immediately
Actions:
- Check
  1. Whether EBS is running out of credits, using the EBS burst balance dashboard (the same dashboard on the infrastructure page).
  2. Whether there is a heavy load against the EBS storage.
- Todo
  - Manually log in to the system to check whether it has been slowed down. If it has slowed down dramatically, choose one of the options below to fix it:
    1. Switch the EBS volume to gp3 with a specified IOPS (in general, the default 12000 should be enough; if not, you may raise it to 18000, and switch back to 12000 once the issue is fixed).
    2. Add more storage to the EBS volume.
Alert Runbook: [ S2 - Error ] [ farm-name ] SMA/CMS RDS DBLoad alert
Alert Description: This alert is triggered when DBLoad is more than 2 times the number of CPUs for more than one hour (AWS-specific, via Performance Insights). The database is overloaded.
Alert Severity: S2 - Error
Alert Trigger Conditions:
- Metric: RDS DBLoad
- Threshold: 2 times the number of CPUs
- Duration: 1 hour
Actions:
- Check
  - AWS console → RDS → Performance Insights, to see which kind of operation is taking the most time.
Alert Runbook: [ S1 - Critical ] [ farm-name ] SMA/CMS RDS DBLoad alert
Alert Description: This alert is triggered when DBLoad is more than 4 times the number of CPUs for more than one hour. The database is overloaded, mostly on CPU.
Alert Severity: S1 - Critical
Alert Trigger Conditions:
- Metric: RDS DBLoad
- Threshold: 4 times the number of CPUs
- Duration: 1 hour
Actions:
- Check
  - AWS console → RDS → Performance Insights, to see which kind of operation is taking the most time.
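The two DBLoad alerts scale their thresholds with the number of CPUs on the instance. A minimal sketch of that relationship follows, assuming the DBLoad value passed in is already the figure sustained over the one-hour duration; the function name is hypothetical.

```python
def dbload_severity(sustained_dbload, cpu_count):
    """Classify a sustained DBLoad value relative to the CPU count.

    Returns "S1 - Critical" above 4x the CPU count, "S2 - Error" above 2x, else None.
    """
    ratio = sustained_dbload / cpu_count
    if ratio > 4:
        return "S1 - Critical"
    if ratio > 2:
        return "S2 - Error"
    return None
```

On a 2-CPU instance, for instance, a sustained DBLoad of 5 falls in the S2 band and a sustained DBLoad of 9 falls in the S1 band.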
Alert Runbook: [ S3 - Warning ] [ farm-name ] SMA/CMS RDS DBLoadNonCPU alert
Alert Description: This alert is triggered when DBLoadNonCPU is more than 1 times the number of CPUs for more than one hour. The database is blocked on something other than CPU, such as DB locks, read/write I/O, or other causes.
Alert Severity: S3 - Warning
Alert Trigger Conditions:
- Metric: RDS DBLoadNonCPU
- Threshold: 1 times the number of CPUs
- Duration: 1 hour
Actions:
- Check
  - AWS console → RDS → Performance Insights, to see which operation is taking the most time.
Alert Runbook: [ S2 - Error ] [ farm-name ] Node CPU Usage alert
Alert Description: This alert is triggered when node CPU usage is more than 97% for more than 60 minutes. The instance has been almost out of CPU for more than 60 minutes.
Alert Severity: S2 - Error
Alert Trigger Conditions:
- Metric: Node CPU Usage
- Threshold: 97%
- Duration: 60 minutes
Actions:
- Todo
- Keep monitoring
Alert Runbook: [ S2 - Error ] [ farm-name ] Node CPU System alert
Alert Description: This alert is triggered when node system CPU (sy) is more than 70% for more than 60 minutes. The instance is too busy with its own system operations to handle normal business tasks.
Alert Severity: S2 - Error
Alert Trigger Conditions:
- Metric: Node CPU System
- Threshold: 70%
- Duration: 60 minutes
Actions:
- Todo
- Keep monitoring
Alert Runbook: [ S2 - Error ] [ farm-name ] Node CPU Soft Interrupts alert
Alert Description: This alert is triggered when node soft interrupts (si) are more than 15% for more than 60 minutes. The instance has been almost out of CPU for more than 60 minutes.
Alert Severity: S2 - Error
Alert Trigger Conditions:
- Metric: Node CPU Soft Interrupts
- Threshold: 15%
- Duration: 60 minutes
Actions:
- Todo
- Keep monitoring
Alert Runbook: [ S3 - Warning ] [ farm-name ] Node Mem Usage alert
Alert Description: This alert is triggered when node memory usage is more than 95% for more than 10 minutes. The instance has been almost out of memory for more than 10 minutes.
Alert Severity: S3 - Warning
Alert Trigger Conditions:
- Metric: Node Mem Usage
- Threshold: 95%
- Duration: 10 minutes
Actions:
- Todo
- Keep monitoring
Alert Runbook: [ S3 - Warning ] [ farm-name ] Node Disk Usage alert
Alert Description: This alert is triggered when node disk usage is more than 95%. The instance is almost out of disk space.
Alert Severity: S3 - Warning
Alert Trigger Conditions:
- Metric: Node Disk Usage
- Threshold: 95%
- Duration: immediately
Actions:
- Todo
- Add more storage to the disk
Alert Runbook: [ S3 - Warning ] [ farm-name ] Disk Inode Usage alert
Alert Description: This alert is triggered when disk inode usage is more than 97%. The instance will very soon be blocked by the OS-level soft limit on inodes.
Alert Severity: S3 - Warning
Alert Trigger Conditions:
- Metric: Disk Inode Usage
- Threshold: 97%
- Duration: immediately
Actions:
- Todo
  1. Restart pods on the instance to release inodes.
  2. If the above step does not help, open an incident for further analysis.
Alert Runbook: [ S3 - Warning ] [ farm-name ] Node Load Avg 15m/core
Alert Description: This alert is triggered when the node 15-minute load average per core is more than 200% for 35 minutes. The instance has been overloaded for more than 35 minutes.
Alert Severity: S3 - Warning
Alert Trigger Conditions:
- Metric: Node Load Avg 15m/core
- Threshold: 2
- Duration: 35 minutes
Actions:
- Todo
  1. Keep monitoring.
  2. If it happens multiple times in a day, run the pod rebalancing script.
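The threshold of 2 is the 15-minute load average divided by the number of cores, which is the same as 200%. A minimal sketch of the check follows, assuming per-minute samples of the 15-minute load average (oldest first); the function name and input shape are hypothetical.

```python
def load_per_core_alert(load15_samples, cores, threshold=2.0, duration_minutes=35):
    """Return True if load15 / cores stayed above the threshold for the whole duration.

    load15_samples: hypothetical per-minute readings of the 15-minute load average.
    """
    if cores <= 0 or len(load15_samples) < duration_minutes:
        return False
    recent = load15_samples[-duration_minutes:]
    return all(load / cores > threshold for load in recent)
```

On a 4-core node, a load average of 9.0 gives 2.25 per core and breaches the threshold, while 8.0 gives exactly 2.0 and does not.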
Alert Runbook: [ S2 - Error ] [ farm-name ] Pod CPU usage alert
Alert Description: This alert is triggered when pod CPU usage is more than 97% for more than 60 minutes. The instance has been almost out of CPU for more than 60 minutes.
Alert Severity: S2 - Error
Alert Trigger Conditions:
- Metric: Pod CPU usage
- Threshold: 97%
- Duration: 60 minutes
Actions:
- Todo
- Keep monitoring
Alert Runbook: [ S3 - Warning ] [ farm-name ] Pod Inode Usage alert
Alert Description: This alert is triggered when pod inode usage (free/total) is more than 97%. The instance will very soon be blocked by the OS-level soft limit on inodes.
Alert Severity: S3 - Warning
Alert Trigger Conditions:
- Metric: Pod Inode Usage
- Threshold: 97%
- Duration: immediately
Actions:
- Todo
  1. Restart pods on the instance to release inodes.
  2. If the above step does not help, open an incident for further analysis.
Alert Runbook: [ S0 - Urgent ] [ farm-name ] SMA Unavailable k8s resource alert
Alert Description: This alert is triggered when any of these services (portal / runtime ui / gateway / platform / redis / rabbitmq / bo-login / idm / bo-ats / ingress-nginx / sma-ui / bo-farcade) is unavailable.
Alert Severity: S0 - Urgent
Alert Trigger Conditions:
- Metric: services not available
- Threshold: 0
- Duration: immediately
Actions:
- Todo
  1. Run 'kubectl describe -n ' and 'kubectl logs -n ' to understand the reason for the failure.
  2. Try to fix the issue based on the results from step 1.
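The two kubectl commands in step 1 take a namespace and a resource name. The sketch below only builds the argument lists, assuming the failing resource is a pod; the namespace and pod values are hypothetical placeholders and must be replaced with the real ones from the alert.

```python
def kubectl_triage_commands(namespace, pod):
    """Build the describe and logs commands for one pod as argument lists.

    namespace and pod are placeholders supplied by the operator; nothing is executed here.
    """
    return [
        ["kubectl", "describe", "pod", pod, "-n", namespace],
        ["kubectl", "logs", pod, "-n", namespace],
    ]
```

Each list can be passed to `subprocess.run` or joined with spaces and pasted into a shell on a host that has cluster access.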
Alert Runbook: [ S2 - Error ] [ farm-name ] SMA Unavailable k8s resource alert
Alert Description: This alert is triggered when any of these services (others not in S0: search-related (content, DIH, DAH, search, proxy) / auto pass / bo-ui / bo-user) is unavailable.
Alert Severity: S2 - Error
Alert Trigger Conditions:
- Metric: services not available
- Threshold: 0
- Duration: immediately
Actions:
- Todo
  1. Run 'kubectl describe -n ' and 'kubectl logs -n ' to understand the reason for the failure.
  2. Try to fix the issue based on the results from step 1.
Alert Runbook: [ S3 - Warning ] [ farm-name ] SMA Unavailable k8s resource alert
Alert Description: This alert is triggered when any of these services (XMPP / XIE / Smart Ticket / stx / virtual agent / ppo / web socket gateway / smart-ui / ocr / smarta-installer) is unavailable.
Alert Severity: S3 - Warning
Alert Trigger Conditions:
- Metric: services not available
- Threshold: 0
- Duration: immediately
Actions:
- Todo
  1. Run 'kubectl describe -n ' and 'kubectl logs -n ' to understand the reason for the failure.
  2. Try to fix the issue based on the results from step 1.
Alert Runbook: [ S4 - Info ] [ farm-name ] SMA Unavailable k8s resource alert
Alert Description: This alert is triggered when services outside of ESM / toolkit are unavailable.
Alert Severity: S4 - Info
Alert Trigger Conditions:
- Metric: services not available
- Threshold: 0
- Duration: immediately
Actions:
- Todo
  1. Run 'kubectl describe -n ' and 'kubectl logs -n ' to understand the reason for the failure.
  2. Try to fix the issue based on the results from step 1.
Alert Runbook: [ S0 - Urgent ] [ farm-name ] CMS Unavailable k8s resource alert
Alert Description: This alert is triggered when any of these services (itom-cms-gateway, itom-idm, itom-ingress-controller, itom-ucmdb-browser, tom-ucmdb-solr, itom-ucmdb) is unavailable.
Alert Severity: S0 - Urgent
Alert Trigger Conditions:
- Metric: services not available
- Threshold: 0
- Duration: immediately
Actions:
- Todo
  1. Run 'kubectl describe -n ' and 'kubectl logs -n ' to understand the reason for the failure.
  2. Try to fix the issue based on the results from step 1.
Alert Runbook: [ S2 - Error ] [ farm-name ] CMS Unavailable k8s resource alert
Alert Description: This alert is triggered when any of these services (itom-autopass-lms, itom-vault) is unavailable.
Alert Severity: S2 - Error
Alert Trigger Conditions:
- Metric: services not available
- Threshold: 0
- Duration: immediately
Actions:
- Todo
  1. Run 'kubectl describe -n ' and 'kubectl logs -n ' to understand the reason for the failure.
  2. Try to fix the issue based on the results from step 1.
Alert Runbook: [ S4 - Info ] [ farm-name ] CMS Unavailable k8s resource alert
Alert Description: This alert is triggered when any of these services (itom-ucmdb-probe, itom-ucmdb-dfp-lunux-installer, itom-ucmdb-dfp-windows-installer, itom-ucmdb-localclient-installers) is unavailable.
Alert Severity: S4 - Info
Alert Trigger Conditions:
- Metric: services not available
- Threshold: 0
- Duration: immediately
Actions:
- Todo
  1. Run 'kubectl describe -n ' and 'kubectl logs -n ' to understand the reason for the failure.
  2. Try to fix the issue based on the results from step 1.
Alert Runbook: Pod Load Avg 10s
Alert Description: This alert is triggered when Pod Load Avg 10s is more than 200% for 35 minutes.
Alert Severity: S3 - Warning
Alert Trigger Conditions:
- Metric: Pod Load Avg 10s
- Threshold: 200%
- Duration: 35 minutes
Actions:
- Todo
  1. Keep monitoring.
  2. If it happens multiple times in a day, run the pod rebalancing script.
Alert Runbook: [ S3 - Warning ] [ farm-name ] SmartA Data Compact Ration alert
Alert Description: This alert is triggered when the content data ratio (total doc / committed doc) is more than 1.20. All queries against IDOL will take more time and slow down.
Alert Severity: S3 - Warning
Alert Trigger Conditions:
- Metric: SmartA Data Compact Ration
- Threshold: 1.20
- Duration: immediately
Actions:
- Todo
  1. Run the Jenkins job for IDOL compaction.
  2. Or follow the steps in the guide below:
     https://docs.microfocus.com/doc/SMAX/23.4/Searchslow
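The compact ratio is the total document count divided by the committed document count. A minimal sketch of the check follows; the function name is hypothetical, and returning False when the committed count is zero is an assumption, not documented behavior.

```python
def compact_ratio_alert(total_docs, committed_docs, threshold=1.20):
    """Return True when total_docs / committed_docs exceeds the threshold."""
    if committed_docs <= 0:
        # Assumption: with no committed documents the ratio is undefined, so do not alert.
        return False
    return total_docs / committed_docs > threshold
```

For example, 130 total documents against 100 committed documents gives a ratio of 1.3, which is above 1.20.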
Alert Runbook: [ S3 - Warning ] [ farm-name ] Rabbitmq Queue alert
Alert Description: This alert is triggered when the queue count on any RabbitMQ node is more than 200 / 250 for more than 30 minutes (200 for the medium profile or lower, 250 for the large profile). The RabbitMQ queues are higher than normal.
Alert Severity: S3 - Warning
Alert Trigger Conditions:
- Metric: Rabbitmq Queue
- Threshold: 200/250
- Duration: 30 minutes
Actions:
- Todo
  1. Keep monitoring.
  2. If it keeps getting higher, consider performing the steps described here:
     https://docs.microfocus.com/doc/SMAX/23.4/RabbitMQNotStart#Solution
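The threshold depends on the deployment profile and is evaluated per node. The sketch below illustrates that, assuming per-minute queue counts for each node (oldest first); the function names, the profile strings, and the node names in the usage note are hypothetical.

```python
def queue_threshold(profile):
    """250 for the large profile, 200 for the medium profile or lower."""
    return 250 if profile == "large" else 200


def nodes_over_queue_threshold(queues_by_node, profile, duration_minutes=30):
    """Return the sorted names of nodes whose queue count stayed above the threshold.

    queues_by_node: hypothetical dict of node name -> per-minute queue counts.
    """
    limit = queue_threshold(profile)
    breaching = []
    for node, samples in queues_by_node.items():
        recent = samples[-duration_minutes:]
        if len(recent) == duration_minutes and all(q > limit for q in recent):
            breaching.append(node)
    return sorted(breaching)
```

With `{"rabbit-0": [220] * 30, "rabbit-1": [150] * 30}`, the medium profile reports `rabbit-0`, while the large profile reports no node.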
Alert Runbook: [ S3 - Warning ] [ farm-name ] Rabbitmq Messages/Minute alert
Alert Description: This alert is triggered when Pending Messages/Minute is more than 500 for more than 30 minutes. Pending messages are accumulating in RabbitMQ.
Alert Severity: S3 - Warning
Alert Trigger Conditions:
- Metric: Rabbitmq Messages/Minute
- Threshold: 500
- Duration: 30 minutes
Actions:
- Todo
  1. Keep monitoring.
  2. If it keeps getting higher, consider performing the steps described here:
     https://docs.microfocus.com/doc/SMAX/23.4/RabbitMQNotStart#Solution
Alert Runbook: Message queue not equally distributed to different cluster nodes
Alert Description: This alert is triggered when the message queues are not equally distributed across the cluster nodes. The RabbitMQ nodes are not working as a cluster, which can make RabbitMQ unstable.
Alert Severity: S1 - Critical
Alert Trigger Conditions:
- Metric: Rabbitmq Message queue
- Threshold: TBD
- Duration: TBD
Actions:
- Todo
  1. Scale down the RabbitMQ node that is not in the cluster.
  2. Remove the <rabbitmq-infra-rabbitmq-n>/data/xservices/rabbitmq/x.x.x.xx/mnesia folders on the NFS server or the bastion node.
  3. Wait until the RabbitMQ nodes are ready.
Alert Runbook: [ S4 - Info ] [ farm-name ] IDM active users alert
Alert Description: This alert is triggered when the number of active users exceeds the per-profile limit for more than 30 minutes: more than 1100 for the medium profile, more than 3000 for the large profile. The number of active users is higher than the target size.
Alert Severity: S4 - Info
Alert Trigger Conditions:
- Metric: IDM active users
- Threshold: 1100/3000
- Duration: 30 minutes
Actions:
- Todo
- Keep monitoring
Alert Runbook: [ S2 - Error ] [ farm-name ] Gateway Tomcat https connector currentThreadsBusy alert
Alert Description: This alert is triggered when the Tomcat https connector currentThreadsBusy is more than 30 for 30 minutes. The number of active users is higher than the target size.
Alert Severity: S2 - Error
Alert Trigger Conditions:
- Metric: Gateway Tomcat https connector currentThreadsBusy
- Threshold: 30
- Duration: 30 minutes
Actions:
- Todo
  1. If the number does not drop, consider a rolling restart of the current deployment, for example, gateway/platform/serviceportal.
  2. If the number does not drop after the above step, do a rolling restart of xmpp.
  3. If the number does not drop after the above steps, take a thread dump of the affected pod. See: How to generate thread dump and memory dumps for java applications
Alert Runbook: [ S2 - Error ] [ farm-name ] Gateway Httpclient InUse alert
Alert Description: This alert is triggered when Httpclient InUse is more than 20 for 30 minutes. The number of active users is higher than the target size.
Alert Severity: S2 - Error
Alert Trigger Conditions:
- Metric: Gateway Httpclient InUse
- Threshold: 20
- Duration: 30 minutes
Actions:
- Todo
  1. If the number does not drop, consider a rolling restart of the current deployment, for example, gateway/platform/serviceportal.
  2. If the number does not drop after the above step, do a rolling restart of xmpp.
  3. If the number does not drop after the above steps, take a thread dump of the affected pod. See: How to generate thread dump and memory dumps for java applications
Alert Runbook: [ S2 - Error ] [ farm-name ] Platform Tomcat https connector currentThreadsBusy alert
Alert Description: This alert is triggered when the Tomcat https connector currentThreadsBusy is more than 30 for 30 minutes. The number of active users is higher than the target size.
Alert Severity: S2 - Error
Alert Trigger Conditions:
- Metric: Platform Tomcat https connector currentThreadsBusy
- Threshold: 30
- Duration: 30 minutes
Actions:
- Todo
  1. If the number does not drop, consider a rolling restart of the current deployment, for example, gateway/platform/serviceportal.
  2. If the number does not drop after the above step, do a rolling restart of xmpp.
  3. If the number does not drop after the above steps, take a thread dump of the affected pod. See: How to generate thread dump and memory dumps for java applications
Alert Runbook: [ S2 - Error ] [ farm-name ] Platform Httpclient InUse alert
Alert Description: This alert is triggered when Httpclient InUse is more than 20 for 30 minutes. The number of active users is higher than the target size.
Alert Severity: S2 - Error
Alert Trigger Conditions:
- Metric: Platform Httpclient InUse
- Threshold: 20
- Duration: 30 minutes
Actions:
- Todo
  1. If the number does not drop, consider a rolling restart of the current deployment, for example, gateway/platform/serviceportal.
  2. If the number does not drop after the above step, do a rolling restart of xmpp.
  3. If the number does not drop after the above steps, take a thread dump of the affected pod. See: How to generate thread dump and memory dumps for java applications
Alert Runbook: [ S2 - Error ] [ farm-name ] Serviceportal Tomcat https connector currentThreadsBusy alert
Alert Description: This alert is triggered when the Tomcat https connector currentThreadsBusy is more than 30 for 30 minutes. The number of active users is higher than the target size.
Alert Severity: S2 - Error
Alert Trigger Conditions:
- Metric: Serviceportal Tomcat https connector currentThreadsBusy
- Threshold: 30
- Duration: 30 minutes
Actions:
- Todo
  1. If the number does not drop, consider a rolling restart of the current deployment, for example, gateway/platform/serviceportal.
  2. If the number does not drop after the above step, do a rolling restart of xmpp.
  3. If the number does not drop after the above steps, take a thread dump of the affected pod. See: How to generate thread dump and memory dumps for java applications
Alert Runbook: [ S2 - Error ] [ farm-name ] Serviceportal Httpclient InUse alert
Alert Description: This alert is triggered when Httpclient InUse is more than 20 for 30 minutes. The number of active users is higher than the target size.
Alert Severity: S2 - Error
Alert Trigger Conditions:
- Metric: Serviceportal Httpclient InUse
- Threshold: 20
- Duration: 30 minutes
Actions:
- Todo
  1. If the number does not drop, consider a rolling restart of the current deployment, for example, gateway/platform/serviceportal.
  2. If the number does not drop after the above step, do a rolling restart of xmpp.
  3. If the number does not drop after the above steps, take a thread dump of the affected pod. See: How to generate thread dump and memory dumps for java applications