ESM Cloud Unified Monitoring v1.1
Legends
S2
S3
S4
NEW
Check here for the severity definitions.
Introduction
This guide presents all the items related to monitoring the ESM product on SaaS.
Levels of monitoring
Alerts
Alerts come from monitoring and experience.
Here is a reference list of items to be sent as alerts. Grafana monitoring dashboards are developed based on the list below.
| Monitoring Level | Category | Severity | Code | Alert Description AWS | Alert Description GCP | Sample Chart | Alert Message | Runbook AWS | Runbook GCP |
|---|---|---|---|---|---|---|---|---|---|
| Infrastructure | Compute | ALB HTTP 5XX Count (more than 34 within a 3-minute window) | N/A | Link | [ S0 - Urgent ] [ farm-name ] ALB HTTP 5XX Count alert | Runbook | |||
S2 | ALB Target 5xx Count | N/A | Link | ||||||
| Storage | S3 | EBS Disk Queue Depth (EBS disk queue depth more than 5 for more than 10 mins) | Disk queue length avg (disk queue length is more than 5 for more than 10 mins) | Link | [ S3 - Warning ] [ farm-name ] EBS Disk Queue Depth alert | Runbook | |||
S2 | EBS Burst Balance Average (EBS burst balance below 40% for more than 30 mins ) | N/A | Link | [ S2 - Error ] [ farm-name ] EBS Burst Balance Average alert | Runbook | ||||
| EBS Burst Balance Average (EBS burst balance is below 0) | N/A | Link | [ S0 - Urgent ] [ farm-name ] EBS Burst Balance Average alert | Runbook | |||||
S2 | EFS Burst Credit Balance (Burst credit below 40% for more than 15 mins ) | N/A | Link | [ S2 - Error ] [ farm-name ] EFS Burst Credit Balance alert | Runbook | ||||
| EFS Burst Credit Balance (Burst credit is 0) | N/A | Link | [ S0 - Urgent ] [ farm-name ] EFS Burst Credit Balance alert | Runbook | |||||
| Disk average latency (?) | |||||||||
| Filestore: Average read latency (?) | |||||||||
| Filestore: Average write latency (?) | |||||||||
| Filestore: Used space percent (?) | |||||||||
Virtualization | |||||||||
Database | S2 | RDS CPU Utilization (CPU more than 97% for more than 30 mins) | CPU utilization (CPU more than 97% for more than 30 mins) | Link | [ S2 - Error ] [ farm-name ] RDS CPU Utilization alert | Runbook | |||
S2 | CPU (sy: system >70% for more than 60 mins ) | N/A | Link | [ S2 - Error ] [ farm-name ] RDS cpuUtilization System alert | Runbook | ||||
S2 | CPU (si: soft interrupts > 15% for more than 60 mins ) | N/A | Link | [ S2 - Error ] [ farm-name ] RDS CPU Soft Interrupts alert | Runbook | ||||
S3 | Disk queue depth (EBS disk queue depth more than 5 for more than 10 mins) | IO wait (Total of IO_time,?) | Link | [ S3 - Warning ] [ farm-name ] RDS Disk queue depth alert | Runbook | ||||
S2 | Disk (Free Storage Space is below 500 MB) | Disk (Free Storage Space= (1-Disk Utilization)* Disk allocation / Disk Utilization is below 500 MB) | Link | [ S2 - Error ] [ farm-name ] RDS Disk Free Storage Space alert | Runbook | ||||
S2 | Disk (storage auto-scaling headroom is low: (Free Space + Max Autoscaling Storage - Allocated Storage) / Allocated Storage < 0.2) | Disk (storage auto-scaling headroom is low: (Free Space + Max Autoscaling Storage - Allocated Storage) / Allocated Storage < 0.2) | Runbook | ||||||
S2 | Memory (Free memory less than 5% for more than 5 mins) | Memory components(sum of all components) (Free memory less than 5% for more than 5 mins) | Link | [ S2 - Error ] [ farm-name ] RDS Free Memory Percentage alert | Runbook | ||||
| Memory (Free memory less than 2% for more than 5 mins) | Memory components(sum of all components) (Free memory less than 2% for more than 5 mins) | Link | [ S0 - Urgent ] [ farm-name ] RDS Free Memory Percentage alert | Runbook | |||||
S2 | Storage (Burst Balance below 40% for more than 30 mins ) | N/A | Link | [ S2 - Error ] [ farm-name ] RDS Burst Balance alert | Runbook | ||||
| RDS Burst Balance (Burst Balance is 0) | N/A | Link | [ S0 - Urgent ] [ farm-name ] RDS Burst Balance alert | Runbook | |||||
S2 | RDS DBLoad (AWS-specific, via Performance Insights; more than 2 times the CPU count for more than one hour) | Database load (via Query Insights, execution_time; more than 2 times the CPU capacity) | [ S2 - Error ] [ farm-name ] SMA RDS DBLoad alert / [ S2 - Error ] [ farm-name ] CMS RDS DBLoad alert | Runbook | |||||
| RDS DBLoad (AWS-specific, via Performance Insights; more than 4 times the CPU count for more than one hour) | Database load (via Query Insights, execution_time; more than 4 times the CPU capacity) | [ S1 - Critical ] [ farm-name ] SMA RDS DBLoad alert / [ S1 - Critical ] [ farm-name ] CMS RDS DBLoad alert | Runbook | ||||||
S3 | RDS DBLoadNonCPU (AWS-specific, via Performance Insights; more than 1 times the CPU count for more than one hour) | IO wait time + Lock wait time (via Query Insights; more than 1 times the CPU capacity) | [ S3 - Warning ] [ farm-name ] SMA RDS DBLoadNonCPU alert / [ S3 - Warning ] [ farm-name ] CMS RDS DBLoadNonCPU alert | Runbook | |||||
Wait events (Total of all events,?) | |||||||||
Query latency (Total of all the latencies,?) | |||||||||
| Link | Block Session Count | ||||||||
| Link | long active query duration | ||||||||
Capture RDS top 10 query (TBD)
| RDS top 10 query | ||||||||
dead tuple ems / dead tuple rms / dead tuple idm | |||||||||
| OS (Node level) | CPU | S2 | CPU more than 97% for more than 60 mins | Same as AWS | Link | [ S2 - Error ] [ farm-name ] Node CPU Usage alert | Runbook | ||
S2 | CPU (sy: system >70% for more than 60 mins )(mark for review) | Same as AWS | Link | [ S2 - Error ] [ farm-name ] Node CPU System alert | Runbook | ||||
S2 | CPU (si: soft interrupts > 15% for more than 60 mins )(mark for review) | Same as AWS | Link | [ S2 - Error ] [ farm-name ] Node CPU Soft Interrupts alert | Runbook | ||||
Memory | S3 | Memory more than 95% for more than 10 mins | Same as AWS | Link | [ S3 - Warning ] [ farm-name ] Node Mem Usage alert | Runbook | |||
Disk | S3 | Disk usage more than 95% | Same as AWS | Link | [ S3 - Warning ] [ farm-name ] Node Disk Usage alert | Runbook | |||
Disk read/write latency (TBD) | Same as AWS | Disk Read Latency / Disk Write Latency | |||||||
S3 | Inode usage > 97% | Same as AWS | Link | [ S3 - Warning ] [ farm-name ] Disk Inode Usage alert | Runbook | ||||
Node disk IO load (TBD) | Same as AWS | Link | Disk IOPS | ||||||
Network | network operation latency(TBD) | Same as AWS | |||||||
network transit error rate(TBD) | Same as AWS | Link | Network Transit Error Rate | ||||||
network transit drop rate(TBD) | Same as AWS | Link | Network Transit Drop Rate | ||||||
network transit queue length(TBD) | Same as AWS | ||||||||
Throughput / bandwidth (TBD) | Same as AWS | ||||||||
S3 | Load (Load Avg 15m/core number > 200% for 35 mins ) | Same as AWS | Link | [ S3 - Warning ] [ farm-name ] Node Load Avg 15m/core | Runbook | ||||
| Container | CPU | S2 | CPU (CPU more than 97% for more than 60 mins) | Same as AWS | Link | [ S2 - Error ] [ farm-name ] Pod CPU usage alert | Runbook | ||
Memory | swap usage | Same as AWS | Link | Pod Swap Usage | |||||
Disk | Disk read/write latency (TBD) | Same as AWS | Pod Disk Read Latency / Pod Disk Write Latency | ||||||
S3 | Inode usage(free/total) > 97% | Same as AWS | Link | [ S3 - Warning ] [ farm-name ] Pod Inode Usage alert | Runbook | ||||
Network | network transit error rate(TBD) | Same as AWS | Link | Pod Network Transit Error Rate | |||||
network transit drop rate(TBD) | Same as AWS | Link | Pod Network Transit Drop Rate | ||||||
Unavailable service | SMAX critical path unavailable: svc portal / runtime ui / gateway / platform / redis / rabbitmq / bo-login / idm / bo-ats / ingress-nginx / sma-ui / bo-farcade | Same as AWS | Link | [ S0 - Urgent ] [ farm-name ] SMA Unavailable k8s resource alert | Runbook | ||||
S2 | SMAX partial business impact: services not listed under S0, search-related (content, DIH, DAH, search, proxy) / auto pass / bo-ui / bo-user | Same as AWS | Link | [ S2 - Error ] [ farm-name ] SMA Unavailable k8s resource alert | Runbook | ||||
S3 | SMAX no obvious impact on business: XMPP / XIE / Smart Ticket / stx / virtual agent / ppo / web socket gateway / smart-ui / ocr / smarta-installer | Same as AWS | Link | [ S3 - Warning ] [ farm-name ] SMA Unavailable k8s resource alert | Runbook | ||||
S4 | SMAX services outside of ESM / toolkit | Same as AWS | Link | [ S4 - Info ] [ farm-name ] SMA Unavailable k8s resource alert | Runbook | ||||
CMS critical path unavailable: itom-cms-gateway, itom-idm, itom-ingress-controller, itom-ucmdb-browser, tom-ucmdb-solr, itom-ucmdb (both are down) | Same as AWS | Link | [ S0 - Urgent ] [ farm-name ] CMS Unavailable k8s resource alert | Runbook | |||||
S2 | CMS partial business impact: itom-autopass-lms, itom-vault, itom-ucmdb (either is down) | Same as AWS | Link | [ S2 - Error ] [ farm-name ] CMS Unavailable k8s resource alert | Runbook | ||||
S3 | CMS no obvious impact on business: | Same as AWS | |||||||
S4 | CMS services outside of ESM / toolkit: itom-ucmdb-probe, itom-ucmdb-dfp-lunux-installer, itom-ucmdb-dfp-windows-installer, itom-ucmdb-localclient-installers | Same as AWS | Link | [ S4 - Info ] [ farm-name ] CMS Unavailable k8s resource alert | Runbook | ||||
Load | S3 | Load Avg 15m/core number > 200% for 35 mins (TBD, because it's not observable via current metrics) | Same as AWS | Link | Pod Load Avg 10s | Runbook | |||
Threads | container_threads on process (TBD) | Same as AWS | Link | Threads | |||||
Pod balancing (TBD) | |||||||||
| App metrics | Thread | ||||||||
Connections | |||||||||
Limits | |||||||||
Smart Analytics | S3 | SMAX Content data ratio (total doc / committed doc) > 1.20 | Same as AWS | Link | [ S3 - Warning ] [ farm-name ] SmartA Data Compact Ration alert | Runbook | |||
Rabbitmq (each node) | S3 | SMAX queue > 200 / 250 for more than 30 mins (200 for medium profile or lower, 250 for large profile) | Same as AWS | Link | [ S3 - Warning ] [ farm-name ] Rabbitmq Queue alert | Runbook | |||
S3 | SMAX Pending Messages/Minute > 500 for more than 30 mins (mark for review) | Same as AWS | Link | [ S3 - Warning ] [ farm-name ] Rabbitmq Messages/Minute alert | Runbook | ||||
SMAX Message queue not equally distributed across cluster nodes (TBD) | Same as AWS | Runbook | |||||||
IDM | S4 | SMAX Active user (per profile, medium profile > 1100 for more than 30 mins, large profile > 3000 for more than 30 mins) | Same as AWS | Link | [ S4 - Info ] [ farm-name ] IDM active users alert | Runbook | |||
Gateway | S2 | SMAX Tomcat https connector currentThreadsBusy > 30 for 30 mins (EU8-Prod) Tomcat https connector currentThreadsBusy > 30 for 30 mins or Tomcat https connector currentThreadsBusy > 60 for 15 mins or Tomcat https connector currentThreadsBusy > 90 for 5 mins | Same as AWS | Link | [ S2 - Error ] [ farm-name ] Gateway Tomcat https connector currentThreadsBusy alert | Runbook | |||
S2 | SMAX Httpclient InUse > 20 for 30 mins (EU8-Prod) Httpclient InUse > 20 for 30 mins or Httpclient InUse > 30 for 15 mins or Httpclient InUse > 80 for 5 mins | Same as AWS | Link | [ S2 - Error ] [ farm-name ] Gateway Httpclient InUse alert | Runbook | ||||
Platform | S2 | SMAX Tomcat https connector currentThreadsBusy > 30 for 30 mins (EU8-Prod) Tomcat https connector currentThreadsBusy > 30 for 30 mins or Tomcat https connector currentThreadsBusy > 60 for 15 mins or Tomcat https connector currentThreadsBusy > 90 for 5 mins | Same as AWS | Link | [ S2 - Error ] [ farm-name ] Platform Tomcat https connector currentThreadsBusy alert | Runbook | |||
S2 | SMAX Httpclient InUse > 20 for 30 mins (EU8-Prod) Httpclient InUse > 20 for 30 mins or Httpclient InUse > 30 for 15 mins or Httpclient InUse > 80 for 5 mins | Same as AWS | Link | [ S2 - Error ] [ farm-name ] Platform Httpclient InUse alert | Runbook | ||||
Serviceportal | S2 | SMAX Tomcat https connector currentThreadsBusy > 30 for 30 mins (EU8-Prod) Tomcat https connector currentThreadsBusy > 30 for 30 mins or Tomcat https connector currentThreadsBusy > 60 for 15 mins or Tomcat https connector currentThreadsBusy > 90 for 5 mins | Same as AWS | Link | [ S2 - Error ] [ farm-name ] Serviceportal Tomcat https connector currentThreadsBusy alert | Runbook | |||
S2 | SMAX Httpclient InUse > 20 for 30 mins (EU8-Prod) Httpclient InUse > 20 for 30 mins or Httpclient InUse > 30 for 15 mins or Httpclient InUse > 80 for 5 mins | Same as AWS | Link | [ S2 - Error ] [ farm-name ] Serviceportal Httpclient InUse alert | Runbook | ||||
OpenSearch based Monitoring (TBD) | Access 5xx | ||||||||
Access Response time | |||||||||
Database level customer metrics | Same as AWS | Link | NativeSACM Transaction Context Queue | ||||||
Same as AWS | Link | NativeSACM Transaction Context Queue retries | |||||||
Same as AWS | |||||||||
Same as AWS | Link | ||||||||
TextDetection Job queue | Same as AWS | Link | |||||||
IndexEntities Job queue | Same as AWS | Link | |||||||
EntitiesHandler Job queue | Same as AWS | Link | |||||||
SLT Job Delay time[mins] | Same as AWS | Link | |||||||
TextDetection Job Delay time[mins] | Same as AWS | Link | |||||||
IndexEntities Job Delay time[mins] | Same as AWS | Link | |||||||
EntitiesHandler Job Delay time[mins] | Same as AWS | Link | |||||||
| Instrumental | Method | ||||||||
Query | |||||||||
| Others | When to scale out (overloaded) | ||||||||
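
Many rows in the table pair a threshold with a duration (for example, currentThreadsBusy > 30 for 30 mins or > 60 for 15 mins or > 90 for 5 mins), and alert messages share the shape `[ severity - label ] [ farm-name ] title alert`. The sketch below illustrates how such rules could be evaluated and formatted. It is not the Grafana implementation used for the dashboards: the names `Window`, `sustained_above`, `any_window_breached`, `autoscale_headroom`, and `format_alert` are hypothetical, and it assumes one metric sample per minute, ordered oldest first.

```python
from dataclasses import dataclass
from typing import Sequence

# Severity labels as they appear in the alert messages of the table.
SEVERITY_LABELS = {
    "S0": "Urgent",
    "S1": "Critical",
    "S2": "Error",
    "S3": "Warning",
    "S4": "Info",
}


@dataclass(frozen=True)
class Window:
    """One 'value > threshold for N minutes' condition (hypothetical helper)."""
    threshold: float
    minutes: int


def sustained_above(samples: Sequence[float], window: Window) -> bool:
    """True if each of the latest `window.minutes` samples exceeds the threshold.

    Assumes one sample per minute, oldest first. With too little history
    the condition is treated as not met.
    """
    if window.minutes <= 0 or len(samples) < window.minutes:
        return False
    return all(v > window.threshold for v in samples[-window.minutes:])


def any_window_breached(samples: Sequence[float], windows: Sequence[Window]) -> bool:
    """Combine several windows with 'or', as in the multi-window rows."""
    return any(sustained_above(samples, w) for w in windows)


def autoscale_headroom(free: float, max_autoscaling: float, allocated: float) -> float:
    """(Free Space + Max Autoscaling Storage - Allocated Storage) / Allocated Storage."""
    return (free + max_autoscaling - allocated) / allocated


def format_alert(severity: str, farm: str, title: str) -> str:
    """Build a message in the '[ S2 - Error ] [ farm-name ] ... alert' shape."""
    return f"[ {severity} - {SEVERITY_LABELS[severity]} ] [ {farm} ] {title} alert"


# The three-window rule from the Gateway / Platform / Serviceportal rows.
THREADS_BUSY_WINDOWS = [Window(30, 30), Window(60, 15), Window(90, 5)]

if __name__ == "__main__":
    # 25 quiet minutes followed by 5 minutes above 90 busy threads.
    busy = [20.0] * 25 + [95.0] * 5
    if any_window_breached(busy, THREADS_BUSY_WINDOWS):
        print(format_alert("S2", "farm-name",
                           "Gateway Tomcat https connector currentThreadsBusy"))
```

In this sketch only the shortest window fires for the sample data, because the longer windows still contain quiet minutes. Requiring every sample in the window to exceed the threshold is one possible reading of "for more than N mins"; an average over the window would behave differently.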