Runbooks-based-on-monitoring_686083879

Alerts, Description and Summary

Alerts come with monitoring and experience.

Here is a reference list of items to be sent as alerts. The Grafana monitoring dashboards are developed based on the list below.

Infrastructure
1. Compute
2. Network
  1. ALB 5xx (more than 50 in a 2-minute time frame)
    1. Summary:
      More than 50 5xx errors were triggered on the frontend. Multiple end users may experience a production issue on their side.
    2. Check whether any other time-correlated alerts are being reported.
  2. S2 NEW ALB target 5xx (TBD)
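The "more than 50 in a 2-minute time frame" condition for ALB 5xx can be read as a sliding-window count. The following is only a minimal sketch of that idea; the class name `ErrorWindow` and its parameters are hypothetical, and the real alert is presumably evaluated by the monitoring stack rather than by custom code.

```python
from collections import deque


class ErrorWindow:
    """Counts events inside a sliding time window (hypothetical helper)."""

    def __init__(self, window_seconds=120, threshold=50):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._timestamps = deque()

    def record(self, timestamp):
        """Record one 5xx response seen at `timestamp` (seconds).

        Timestamps are assumed to arrive in non-decreasing order.
        Returns True when the count inside the window exceeds the threshold.
        """
        self._timestamps.append(timestamp)
        # Drop events that are no longer inside the window.
        cutoff = timestamp - self.window_seconds
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps) > self.threshold


if __name__ == "__main__":
    window = ErrorWindow()
    # 51 errors within one minute: only the last one crosses the threshold.
    results = [window.record(second) for second in range(51)]
    print(results[-1])
```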
3. Storage
  1. S3 EBS (EBS disk queue depth more than 5 for more than 10 mins)
    1. Summary:
      The tasks on the storage are being queued.
    2. Check
      1. Whether EBS is running out of credits, via the EBS burst balance dashboard (same dashboard in the infrastructure page).
      2. Whether there is a big load against EBS storage.
    3. Todo
      1. No action is required. Usually, if it is a node-level issue, the AWS autoscaling group will replace the node after a while.
  2. S2 EBS (EBS burst balance below 40% for more than 30 mins)
    1. Summary:
      The load on EBS is high and the burst balance may not be able to fulfill requests in the following quarter-hour or hour.
    2. Check
      1. Keep monitoring whether EBS is about to run out of credits, via the EBS burst balance dashboard (same dashboard in the infrastructure page).
      2. Whether there is a big load against EBS storage.
    3. Todo
      1. Usually no action is required. If the alert persists, it is a critical issue. Follow the Todo for the alert "EBS burst balance is 0".
  3. EBS (EBS burst balance is 0)
    1. Summary:
      The tasks on the storage are being queued. Everything that goes through EBS IO will be slowed down.
    2. Check
      1. Whether EBS is running out of credits, via the EBS burst balance dashboard (same dashboard in the infrastructure page).
      2. Whether there is a big load against EBS storage.
    3. Todo
      1. Manually log in to the system to check whether the system is slowing down. If it has slowed down dramatically, choose one of the options below to fix it:
        1. Switch the EBS to GP3 with a specified IOPS (in general the default of 12000 should be enough; if not, you may raise it to 18000, and switch back to 12000 once the issue is fixed).
        2. Add more storage to the EBS.
  4. S2 EFS (burst credit below 40% for more than 30 mins)
    1. Summary:
      The tasks on the storage will be queued soon.
    2. Check
      1. Whether EFS is running out of credits, via the EFS burst credit dashboard (same dashboard in the infrastructure page).
      2. Whether there is a big load against EFS storage.
    3. Todo
      1. Manually log in to the system to check whether the system is slowing down.
      2. Usually no action is required. If the alert persists, it is a critical issue.
  5. EFS (burst credit is 0)
    1. Summary:
      The tasks on the storage are being queued. Everything that goes through EFS IO will be slowed down.
    2. Check
      1. Whether EFS is running out of credits, via the EFS burst credit dashboard (same dashboard in the infrastructure page).
      2. Whether there is a big load against EFS storage.
    3. Todo
      1. Manually log in to the system to check whether the system is slowing down. If it has slowed down dramatically, fix it as follows:
        1. Switch the EFS to throughput mode (for example, 60 - 100 MB/s; switch back once the issue is fixed).
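Many alerts in this list share the shape "value below (or above) a threshold for more than N minutes". As an illustration only, the sketch below evaluates the two burst balance conditions (below 40% for more than 30 mins, and 0) over a list of samples. The function names and the `(minute, value)` sample format are assumptions made for the example, not part of any monitoring tool.

```python
def sustained_below(samples, threshold, duration_minutes):
    """Return True if the value has stayed below `threshold` for more than
    `duration_minutes`, measured up to the latest sample.

    `samples` is a list of (minute, value) pairs sorted by minute.
    """
    if not samples:
        return False
    breach_start = None
    for minute, value in samples:
        if value < threshold:
            if breach_start is None:
                breach_start = minute  # start of the current breach
        else:
            breach_start = None  # the value recovered, reset the breach
    latest_minute = samples[-1][0]
    return breach_start is not None and latest_minute - breach_start > duration_minutes


def burst_balance_state(samples):
    """Map burst balance samples (percent) to 'exhausted', 'low' or 'ok'."""
    if samples and samples[-1][1] <= 0:
        return "exhausted"  # corresponds to the "burst balance is 0" alert
    if sustained_below(samples, 40.0, 30):
        return "low"  # corresponds to "below 40% for more than 30 mins"
    return "ok"


if __name__ == "__main__":
    low_for_35_minutes = [(minute, 35.0) for minute in range(0, 40, 5)]
    print(burst_balance_state(low_for_35_minutes))
```

The same `sustained_below` check, with the comparison inverted, would cover the "more than X% for more than N mins" alerts in the later sections.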
4. Virtualization
5. Database
  1. S2 CPU (CPU more than 97% for more than 60 mins)
    1. Summary:
      The overall CPU usage has been above 97% for more than one hour.
    2. Check
      1. Performance Insights, for any top queries that are taking more CPU.
    3. Todo
      1. Keep monitoring and check whether other metrics on the database are abnormal.
      2. Get the top 10 query information.
  2. S2 CPU (sy: system > 70% for more than 60 mins)
    1. Summary:
      The CPU is spending more time on system-level processing instead of handling the business flow.
    2. Check
      1. Performance Insights, for any top queries that are taking more CPU.
    3. Todo
      1. Keep monitoring and check whether other metrics on the database are abnormal.
  3. S2 CPU (si: soft interrupts > 15% for more than 60 mins)
    1. Summary:
      The CPU is spending more time on system-level processing instead of handling the business flow.
    2. Check
      1. Performance Insights, for any top queries that are taking more CPU.
    3. Todo
      1. Keep monitoring and check whether other metrics on the database are abnormal.
  4. S3 Disk queue depth (EBS disk queue depth more than 5 for more than 10 mins)
    1. Summary:
      The tasks on the storage are being queued.
    2. Check
      1. Whether EBS is running out of credits, via the EBS burst balance dashboard (same dashboard in the infrastructure page).
      2. Whether there is a big load against EBS storage.
  5. S2 Disk (free storage space is below 500 MB)
    1. Summary:
      The instance is running out of storage.
    2. Todo
      1. Add more storage to EBS.
      2. Enable storage auto-scaling.
  6. S2 Disk (whether storage has enough space to auto-scale: (Free Space + Max Autoscaling Storage - Allocated Storage) / Allocated Storage < 0.2)
    1. Summary:
      The remaining auto-scaling quota of the instance is not enough.
    2. Todo
      1. Increase the max auto-scaling storage size.
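The auto-scaling headroom formula above can be checked with a small worked example. This is a sketch of the arithmetic only; the function names and the GiB unit are assumptions, and any consistent unit works.

```python
def autoscale_headroom(free_space, max_autoscaling_storage, allocated_storage):
    """(Free Space + Max Autoscaling Storage - Allocated Storage) / Allocated Storage."""
    if allocated_storage <= 0:
        raise ValueError("allocated_storage must be positive")
    return (free_space + max_autoscaling_storage - allocated_storage) / allocated_storage


def autoscale_quota_alert(free_space, max_autoscaling_storage, allocated_storage):
    """True when the headroom ratio drops below 0.2, as in the alert condition."""
    return autoscale_headroom(free_space, max_autoscaling_storage, allocated_storage) < 0.2


if __name__ == "__main__":
    # Example values in GiB: 5 free, max auto-scaling 110, 100 allocated.
    # Headroom = (5 + 110 - 100) / 100 = 0.15, which is below 0.2.
    print(autoscale_headroom(5, 110, 100))
    print(autoscale_quota_alert(5, 110, 100))
```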
  7. S2 Memory (free memory less than 5% for more than 5 mins)
    1. Summary:
      The instance will run out of memory soon.
    2. Check
      1. Log in to the AWS console → RDS → Monitoring to check whether swap usage is increasing.
    3. Todo
      1. Keep monitoring.
      2. Consider a rolling restart of the current deployment, for example gateway/platform/serviceportal.
  8. Memory (free memory less than 2% for more than 5 mins)
    1. Summary:
      The instance will run out of memory soon.
    2. Check
      1. Log in to the AWS console → RDS → Monitoring to check whether swap usage is increasing.
    3. Todo
      1. Consider a rolling restart of the current deployment, for example gateway/platform/serviceportal.
      2. If it happens 2-3 times a day and the swap usage is higher, you need to:
        1. Consider scaling up RDS, usually by doubling the memory size.
        2. Do DB tuning based on the queries identified as memory consuming.
  9. S2 Storage (burst balance below 40% for more than 30 mins)
    1. Summary:
      The load on EBS is high and the burst balance may not be able to fulfill requests in the following quarter-hour or hour.
    2. Check
      1. Keep monitoring whether EBS is about to run out of credits, via the EBS burst balance dashboard (same dashboard in the infrastructure page).
      2. Whether there is a big load against EBS storage.
    3. Todo
      1. Usually no action is required. If the alert persists, it is a critical issue. Follow the Todo for the alert "Burst Balance is 0".
  10. Storage (burst balance is 0)
    1. Summary:
      The tasks on the storage are being queued. Everything that goes through EBS IO will be slowed down.
    2. Check
      1. Whether EBS is running out of credits, via the EBS burst balance dashboard (same dashboard in the infrastructure page).
      2. Whether there is a big load against EBS storage.
    3. Todo
      1. Manually log in to the system to check whether the system is slowing down. If it has slowed down dramatically, choose one of the options below to fix it:
        1. Switch the EBS to GP3 with a specified IOPS (in general the default of 12000 should be enough; if not, you may raise it to 18000, and switch back to 12000 once the issue is fixed).
        2. Add more storage to the EBS.
  11. S2 Storage (EBSByteBalance% or EBSIOBalance% below 40% for more than 30 mins)
    1. Summary:
      The load on RDS is high and the burst balance may not be able to fulfill requests in the following quarter-hour or hour.
    2. Check
      1. Keep monitoring whether RDS is about to run out of credits, via the RDS dashboard (same dashboard in the infrastructure page).
      2. Whether there is a big load against RDS storage.
    3. Todo
      1. Usually no action is required. If the alert persists, it is a critical issue; in that case:
        1. Fix the top SQL.
        2. Upsize the RDS instance type.
  12. Storage (EBSByteBalance% or EBSIOBalance% is 0)
    1. Summary:
      The tasks on the storage are being queued. Everything that goes through RDS IO will be slowed down.
    2. Check
      1. Whether RDS is running out of credits, via the RDS monitoring dashboard.
      2. Whether there is a big load against RDS storage.
    3. Todo
      1. Manually log in to the system to check whether the system is slowing down. If it has slowed down dramatically, choose one of the options below to fix it:
        1. Fix the top SQL.
        2. Upsize the RDS instance.
  13. S2 DBLoad (AWS specific, via Performance Insights: more than 2 times the number of CPUs for more than one hour)
    1. Summary:
      The database is overloaded.
    2. Check
      1. AWS console → RDS → Performance Insights, to check which kind of operation is taking the most time.
  14. DBLoad (AWS specific, via Performance Insights: more than 4 times the number of CPUs for more than one hour)
    1. Summary:
      The database is mostly overloaded on CPU.
    2. Check
      1. AWS console → RDS → Performance Insights, to check which kind of operation is taking the most time.
  15. S3 DBLoadNonCPU (AWS specific, via Performance Insights: more than 1 times the number of CPUs for more than one hour)
    1. Summary:
      The database is blocked on something other than CPU. It can be blocked by DB locks, read/write IO, or other reasons.
    2. Check
      1. AWS console → RDS → Performance Insights, to check which operation is taking the most time.
  16. Dead tuple (TBD)
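The three DBLoad alerts compare the database load reported by Performance Insights with the number of CPUs of the instance. The sketch below shows how the thresholds relate to each other; the function name and the returned labels are hypothetical, the "for more than one hour" part is assumed to be handled elsewhere, and the 4x alert is returned without a severity because the list above does not give one.

```python
def classify_db_load(db_load, non_cpu_load, vcpus):
    """Return the DBLoad alerts that match the given load values.

    db_load      -- total database load (average active sessions)
    non_cpu_load -- the part of the load that is not waiting on CPU
    vcpus        -- number of CPUs of the database instance
    """
    alerts = []
    if db_load > 4 * vcpus:
        alerts.append("DBLoad > 4x CPU")  # no severity label in the list above
    elif db_load > 2 * vcpus:
        alerts.append("S2 DBLoad > 2x CPU")
    if non_cpu_load > 1 * vcpus:
        alerts.append("S3 DBLoadNonCPU > 1x CPU")
    return alerts


if __name__ == "__main__":
    # An instance with 4 CPUs, total load 9, of which 5 is non-CPU wait.
    print(classify_db_load(db_load=9, non_cpu_load=5, vcpus=4))
```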
OS (Node level)
1. CPU
  1. S2 CPU more than 97% for more than 60 mins
    1. Summary:
      The instance has been almost out of CPU for more than 60 mins.
    2. Todo
      1. Keep monitoring
  2. S2 CPU (sy: system > 70% for more than 60 mins) (mark for review)
    1. Summary:
      The instance is too busy with its own system operations to handle the tasks for normal business.
    2. Todo
      1. Keep monitoring
  3. S2 CPU (si: soft interrupts > 15% for more than 60 mins) (mark for review)
    1. Summary:
      The instance has been almost out of CPU for more than 60 mins.
    2. Todo
      1. Keep monitoring
2. Memory
  1. S3 Memory more than 95% for more than 10 mins
    1. Summary:
      The instance has been almost out of memory for more than 10 mins.
    2. Todo
      1. Keep monitoring
3. Disk
  1. S3 Disk usage more than 95%
    1. Summary:
      The instance is almost running out of disk.
    2. Todo
      1. Add more storage to the disk.
  2. Disk read/write latency (TBD)
  3. S3 Inode usage > 97%
    1. Summary:
      The instance will be blocked by the soft limit on OS level (inode) very soon.
    2. Todo
      1. Restart pods on the instance to release inode usage.
      2. If the step above does not help, open an incident for further analysis.
  4. Node disk IO load (TBD)
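To confirm an inode alert directly on a node, the inode usage of a mount point can be read with `os.statvfs` from the Python standard library (available on Unix). This is a minimal sketch; the function names are hypothetical, and the usage is computed as used inodes over total inodes, which is an assumption about how the alert metric is defined.

```python
import os


def inode_usage_percent(total_inodes, free_inodes):
    """Percentage of inodes in use. Returns 0.0 when the filesystem
    reports no inode total (some filesystems do)."""
    if total_inodes <= 0:
        return 0.0
    return 100.0 * (total_inodes - free_inodes) / total_inodes


def path_inode_usage_percent(path="/"):
    """Inode usage of the filesystem that contains `path`."""
    stats = os.statvfs(path)
    # f_files is the total number of inodes, f_ffree the number of free inodes.
    return inode_usage_percent(stats.f_files, stats.f_ffree)


if __name__ == "__main__":
    usage = path_inode_usage_percent("/")
    print(f"inode usage on /: {usage:.1f}%")
    if usage > 97.0:
        print("inode usage is above the 97% alert threshold")
```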
4. Network
  1. Network operation latency (TBD)
  2. Network transit error rate (TBD)
  3. Network transit drop rate (TBD)
  4. Network transit queue length (TBD)
  5. Throughput / bandwidth (TBD)
5. S3 Load (Load Avg 15m / core number > 200% for 35 mins)
  1. Summary:
    The instance has been overloaded for more than 35 mins.
  2. Todo
    1. Keep monitoring.
    2. If it happens multiple times in a day, run the rebalancing pod script.
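The load condition above divides the 15-minute load average by the number of cores. A minimal sketch of that calculation on a single node, using only the Python standard library, is shown below. The function names are hypothetical, and the "for 35 mins" duration is not evaluated here.

```python
import os


def load_per_core_percent(load_avg_15m, core_count):
    """15-minute load average relative to the core count, in percent."""
    if core_count <= 0:
        raise ValueError("core_count must be positive")
    return 100.0 * load_avg_15m / core_count


def current_load_per_core_percent():
    """Read the values for the local machine (Unix only)."""
    _, _, load_15m = os.getloadavg()
    cores = os.cpu_count() or 1  # cpu_count() may return None
    return load_per_core_percent(load_15m, cores)


if __name__ == "__main__":
    percent = current_load_per_core_percent()
    print(f"load avg 15m per core: {percent:.0f}%")
    if percent > 200.0:
        print("above the 200% threshold of the S3 Load alert")
```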
Container
1. CPU
  1. S2 CPU (CPU more than 97% for more than 60 mins)
    1. Summary:
      The instance has been almost out of CPU for more than 60 mins.
    2. Todo
      1. Keep monitoring
2. Memory
  1. Swap usage
3. Disk
  1. Disk read/write latency (TBD)
  2. S3 Inode usage (free/total) > 97%
    1. Summary:
      The instance will be blocked by the soft limit on OS level (inode) very soon.
    2. Todo
      1. Restart pods on the instance to release inode usage.
      2. If the step above does not help, open an incident for further analysis.
4. Network
  1. Network transit error rate (TBD)
  2. Network transit drop rate (TBD)
5. Unavailable service (send alert directly; TBD, because different services have different severities. Further drill-down is required.)
  1. SMA
    1. Critical path unavailable: svc portal / runtime ui / gateway / platform / redis / rabbitmq / bo-login / idm / bo-ats / ingress-nginx / sma-ui / bo-farcade
      1. Summary (same for all the availability alerts):
        The service is not available now.
      2. Todo
        1. Run 'kubectl describe -n ' and 'kubectl logs -n ' to understand the reason for the failure.
        2. Try to fix it based on the results from step 1.
    2. S2 impact on part of the business: others not in S0, search related (content, DIH, DAH, search, proxy) / auto pass / bo-ui / bo-user
    3. S3 no obvious impact on business: XMPP / XIE / Smart Ticket / stx / virtual agent / ppo / web socket gateway / smart-ui / ocr / smarta-installer
    4. S4 services outside of ESM / toolkit
  2. CMS
    1. Critical path unavailable: itom-cms-gateway, itom-idm, itom-ingress-controller, itom-ucmdb-browser, tom-ucmdb-solr, itom-ucmdb
    2. S2 impact on part of the business: itom-autopass-lms, itom-vault
    3. S3 no obvious impact on business:
    4. S4 services outside of ESM / toolkit: itom-ucmdb-probe, itom-ucmdb-dfp-lunux-installer, itom-ucmdb-dfp-windows-installer, itom-ucmdb-localclient-installers
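For the availability alerts above, the first Todo step is to run 'kubectl describe' and 'kubectl logs' against the failing workload. The sketch below wraps both commands so their output can be collected in one place. The namespace and pod name are left as parameters because the runbook does not specify them; the function names and the choice of `pod` as the resource type are assumptions for the example.

```python
import subprocess


def diagnose_commands(namespace, pod):
    """Build the two kubectl command lines for one pod.

    `namespace` and `pod` are placeholders supplied by the operator.
    """
    return [
        ["kubectl", "describe", "pod", pod, "-n", namespace],
        ["kubectl", "logs", pod, "-n", namespace],
    ]


def run_diagnosis(namespace, pod):
    """Run both commands and return their combined output, one entry per command.

    Requires kubectl on the PATH and access to the cluster.
    """
    outputs = []
    for argv in diagnose_commands(namespace, pod):
        result = subprocess.run(argv, capture_output=True, text=True, check=False)
        outputs.append(result.stdout + result.stderr)
    return outputs


if __name__ == "__main__":
    # "demo-namespace" and "demo-pod" are made-up names for illustration.
    for argv in diagnose_commands("demo-namespace", "demo-pod"):
        print(" ".join(argv))
```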
6. Load
  1. S3 Load Avg 15m / core number > 200% for 35 mins (TBD, because it's not observable via current metrics)
    1. Summary:
      The instance has been overloaded for more than 35 mins.
    2. Todo
      1. Keep monitoring.
      2. If it happens multiple times in a day, run the rebalancing pod script.
7. Threads
  1. container_threads on process (TBD)
App metrics
1. Thread
2. Connections
3. Limits
4. Smart Analytics
  1. S3 Content data ratio (total doc / committed doc) > 1.20
    1. Summary:
      All queries against IDOL will take more time and slow down.
    2. Todo
      1. Run the Jenkins job for IDOL compact.
      2. Or follow the steps in the guide below:
        https://docs.microfocus.com/doc/SMAX/2022.05/Searchslow
  2. S3 Documents per Content > 3M (ignore the archive content)
    1. Summary:
      All queries against IDOL can be impacted.
    2. Todo
      1. Scale content groups: https://docs.microfocus.com/doc/SMAX/24.4/SmartAAdmin
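The two Smart Analytics conditions are plain arithmetic on document counts. The sketch below restates them as functions; the names are hypothetical, and how the document counts are obtained from IDOL is outside the scope of the example.

```python
def content_data_ratio(total_docs, committed_docs):
    """Ratio of total documents to committed documents for one content."""
    if committed_docs <= 0:
        raise ValueError("committed_docs must be positive")
    return total_docs / committed_docs


def needs_compact(total_docs, committed_docs):
    """True when the ratio is above 1.20 (the compact job is the suggested action)."""
    return content_data_ratio(total_docs, committed_docs) > 1.20


def needs_content_scaling(documents_in_content):
    """True when one non-archive content holds more than 3M documents."""
    return documents_in_content > 3_000_000


if __name__ == "__main__":
    # 1.3M total documents against 1.0M committed gives a ratio of 1.3.
    print(needs_compact(1_300_000, 1_000_000))
    print(needs_content_scaling(3_500_000))
```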
5. Rabbitmq (each node)
  1. S3 queue > 200 / 250 for more than 30 mins (200 for medium profile or lower, 250 for large profile)
    1. Summary:
      The number of RabbitMQ queues is higher than normal.
    2. Todo
      1. Keep monitoring.
      2. If it keeps getting higher, consider performing the steps described here:
        https://docs.microfocus.com/doc/SMAX/2022.05/RabbitMQNotStart#Solution
  2. S3 Pending Messages/Minute > 500 for more than 30 mins (mark for review)
    1. Summary:
      The pending messages in RabbitMQ are accumulating.
    2. Todo
      1. Keep monitoring.
      2. If it keeps getting higher, consider performing the steps described here:
        https://docs.microfocus.com/doc/SMAX/2022.05/RabbitMQNotStart#Solution
  3. Message queue not equally distributed to different cluster nodes (TBD)
    1. Summary:
      The RabbitMQ nodes are not working as a cluster. This can make RabbitMQ unstable.
    2. Todo
      1. Scale down the RabbitMQ node that is not in the cluster.
      2. Remove the <rabbitmq-infra-rabbitmq-n>/data/xservices/rabbitmq/x.x.x.xx/mnesia folders on the NFS server or the bastion node.
      3. Wait until the RabbitMQ nodes are ready.
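The RabbitMQ queue alert uses a different threshold per deployment profile, and its Todo depends on whether the value keeps rising. The sketch below combines both ideas. The profile names as strings, the function names, and the "strictly increasing samples" definition of "keeps getting higher" are assumptions for illustration.

```python
def queue_threshold(profile):
    """250 for the large profile, 200 for medium or lower."""
    return 250 if profile == "large" else 200


def is_rising(samples):
    """True when every sample is higher than the one before it."""
    return len(samples) >= 2 and all(b > a for a, b in zip(samples, samples[1:]))


def queue_alert_action(profile, samples):
    """Suggest an action for a series of queue counts (oldest first)."""
    if not samples or samples[-1] <= queue_threshold(profile):
        return "no alert"
    if is_rising(samples):
        return "follow the linked RabbitMQ recovery steps"
    return "keep monitoring"


if __name__ == "__main__":
    print(queue_alert_action("medium", [190, 205, 230]))
    print(queue_alert_action("large", [260, 255, 252]))
```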
6. IDM
  1. S4 Active user (per profile: medium profile > 1100 for more than 30 mins, large profile > 3000 for more than 30 mins)
    1. Summary:
      The number of active users is more than the target size.
    2. Todo
      1. Keep monitoring
7. Gateway
  1. S2 Tomcat https connector currentThreadsBusy > 30 for 30 mins
    1. Summary:
      The number of active users is more than the target size.
    2. Todo
      1. If the number does not drop, consider a rolling restart of the current deployment, for example gateway/platform/serviceportal.
      2. If the number does not drop after the step above, do a rolling restart of xmpp.
      3. If the number still does not drop, take a thread dump of the pod with the issue. See: How to generate thread dump and memory dumps for java applications
  2. S2 Httpclient InUse > 20 for 30 mins
    1. Summary:
      The number of active users is more than the target size.
    2. Todo
      1. If the number does not drop, consider a rolling restart of the current deployment, for example gateway/platform/serviceportal.
      2. If the number does not drop after the step above, do a rolling restart of xmpp.
      3. If the number still does not drop, take a thread dump of the pod with the issue. See: How to generate thread dump and memory dumps for java applications
8. Platform
  1. S2 Tomcat https connector currentThreadsBusy > 30 for 30 mins
    1. Summary:
      The number of active users is more than the target size.
    2. Todo
      1. If the number does not drop, consider a rolling restart of the current deployment, for example gateway/platform/serviceportal.
      2. If the number does not drop after the step above, do a rolling restart of xmpp.
      3. If the number still does not drop, take a thread dump of the pod with the issue. See: How to generate thread dump and memory dumps for java applications
  2. S2 Httpclient InUse > 20 for 30 mins
    1. Summary:
      The number of active users is more than the target size.
    2. Todo
      1. If the number does not drop, consider a rolling restart of the current deployment, for example gateway/platform/serviceportal.
      2. If the number does not drop after the step above, do a rolling restart of xmpp.
      3. If the number still does not drop, take a thread dump of the pod with the issue. See: How to generate thread dump and memory dumps for java applications
9. Serviceportal
  1. S2 Tomcat https connector currentThreadsBusy > 30 for 30 mins
    1. Summary:
      The number of active users is more than the target size.
    2. Todo
      1. If the number does not drop, consider a rolling restart of the current deployment, for example gateway/platform/serviceportal.
      2. If the number does not drop after the step above, do a rolling restart of xmpp.
      3. If the number still does not drop, take a thread dump of the pod with the issue. See: How to generate thread dump and memory dumps for java applications
  2. S2 Httpclient InUse > 20 for 30 mins
    1. Summary:
      The number of active users is more than the target size.
    2. Todo
      1. If the number does not drop, consider a rolling restart of the current deployment, for example gateway/platform/serviceportal.
      2. If the number does not drop after the step above, do a rolling restart of xmpp.
      3. If the number still does not drop, take a thread dump of the pod with the issue. See: How to generate thread dump and memory dumps for java applications
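The Todo for the Tomcat and Httpclient alerts of Gateway, Platform and Serviceportal is the same three-step escalation. The sketch below models it as an ordered list and also builds the `kubectl rollout restart` command line that a rolling restart would typically use. The function names, the namespace and the deployment name are hypothetical; the actual deployment names in the cluster may differ from the short names used in this runbook.

```python
ESCALATION_STEPS = [
    "rolling restart of the current deployment (gateway/platform/serviceportal)",
    "rolling restart of xmpp",
    "take a thread dump of the pod with the issue",
]


def next_step(steps_already_tried):
    """Return the next escalation step, or None when all steps were tried."""
    if steps_already_tried < 0:
        raise ValueError("steps_already_tried must not be negative")
    if steps_already_tried >= len(ESCALATION_STEPS):
        return None
    return ESCALATION_STEPS[steps_already_tried]


def rollout_restart_command(namespace, deployment):
    """Build a kubectl command line that triggers a rolling restart."""
    return ["kubectl", "rollout", "restart", "deployment", deployment, "-n", namespace]


if __name__ == "__main__":
    print(next_step(0))
    # "demo-namespace" and "demo-gateway" are made-up names for illustration.
    print(" ".join(rollout_restart_command("demo-namespace", "demo-gateway")))
```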
Instrumental
