# ESM-Cloud-Unified-Monitoring-v1.1_686083891

## Legends

S2 S3 S4 NEW

Check here for the [severity definitions](https://rndwiki.houston.softwaregrp.net/confluence/display/ICS/Monitoring+Alert+Serverity+Definition).

## Introduction

This guide presents all the items related to monitoring the ESM product on SaaS.

## Levels of monitoring

## Alerts

Alerts come with monitoring and experience. The table below is a reference list of items to be sent as alerts. The [Grafana monitoring dashboards](https://github.houston.softwaregrp.net/smax-saas-ops/ESM-Saas-Monitoring) are developed based on this list.

| Monitoring Level | Category | Severity | Code | Alert Description AWS | Alert Description GCP | Sample Chart | Alert Message | Runbook AWS | Runbook GCP |
|---|---|---|---|---|---|---|---|---|---|
| Infrastructure | Compute | S0 | | ALB HTTP 5XX Count (more than 34 in a 3-minute time frame) | N/A | Link | [ S0 - Urgent ] [ farm-name ] ALB HTTP 5XX Count alert | Runbook | |
| | | S2 | | ALB Target 5xx Count | N/A | Link | | | |
| | Storage | S3 | | EBS Disk Queue Depth (EBS disk queue depth more than 5 for more than 10 mins) | Disk queue length avg (disk queue length more than 5 for more than 10 mins) | Link | [ S3 - Warning ] [ farm-name ] EBS Disk Queue Depth alert | Runbook | |
| | | S2 | | EBS Burst Balance Average (EBS burst balance below 40% for more than 30 mins) | N/A | Link | [ S2 - Error ] [ farm-name ] EBS Burst Balance Average alert | Runbook | |
| | | S0 | | EBS Burst Balance Average (EBS burst balance is below 0) | N/A | Link | [ S0 - Urgent ] [ farm-name ] EBS Burst Balance Average alert | Runbook | |
| | | S2 | | EFS Burst Credit Balance (burst credit below 40% for more than 15 mins) | N/A | Link | [ S2 - Error ] [ farm-name ] EFS Burst Credit Balance alert | Runbook | |
| | | S0 | | EFS Burst Credit Balance (burst credit is 0) | N/A | Link | [ S0 - Urgent ] [ farm-name ] EFS Burst Credit Balance alert | Runbook | |
| | | | | | Disk average latency (?) | | | | |
| | | | | | Filestore: Average read latency (?) | | | | |
| | | | | | Filestore: Average write latency (?) | | | | |
| | | | | | Filestore: Used space percent (?) | | | | |
| | Virtualization | | | | | | | | |
| | Database | S2 | | RDS CPU Utilization (CPU more than 97% for more than 30 mins) | CPU utilization (CPU more than 97% for more than 30 mins) | Link | [ S2 - Error ] [ farm-name ] RDS CPU Utilization alert | Runbook | |
| | | S2 | | CPU (sy: system > 70% for more than 60 mins) | N/A | Link | [ S2 - Error ] [ farm-name ] RDS cpuUtilization System alert | Runbook | |
| | | S2 | | CPU (si: soft interrupts > 15% for more than 60 mins) | N/A | Link | [ S2 - Error ] [ farm-name ] RDS CPU Soft Interrupts alert | Runbook | |
| | | S3 | | Disk queue depth (EBS disk queue depth more than 5 for more than 10 mins) | IO wait (Total of IO_time,?) | Link | [ S3 - Warning ] [ farm-name ] RDS Disk queue depth alert | Runbook | |
| | | S2 | | Disk (Free Storage Space is below 500 MB) | Disk (Free Storage Space = (1 - Disk Utilization) * Disk allocation / Disk Utilization is below 500 MB) | Link | [ S2 - Error ] [ farm-name ] RDS Disk Free Storage Space alert | Runbook | |
| | | S2 | | Disk (checks that storage has enough space to auto-scale; alerts when (Free Space + Max Autoscaling Storage - Allocated Storage) / Allocated Storage < 0.2) | Disk (checks that storage has enough space to auto-scale; alerts when (Free Space + Max Autoscaling Storage - Allocated Storage) / Allocated Storage < 0.2) | | | Runbook | |
| | | S2 | | Memory (free memory less than 5% for more than 5 mins) | Memory components (sum of all components) (free memory less than 5% for more than 5 mins) | Link | [ S2 - Error ] [ farm-name ] RDS Free Memory Percentage alert | Runbook | |
| | | S0 | | Memory (free memory less than 2% for more than 5 mins) | Memory components (sum of all components) (free memory less than 2% for more than 5 mins) | Link | [ S0 - Urgent ] [ farm-name ] RDS Free Memory Percentage alert | Runbook | |
| | | S2 | | Storage (Burst Balance below 40% for more than 30 mins) | N/A | Link | [ S2 - Error ] [ farm-name ] RDS Burst Balance alert | Runbook | |
| | | S0 | | RDS Burst Balance (Burst Balance is 0) | N/A | Link | [ S0 - Urgent ] [ farm-name ] RDS Burst Balance alert | Runbook | |
| | | S2 | | RDS DBLoad (AWS specific, via Performance Insights, more than 2 times the number of CPUs for more than one hour) | Database load (via Query Insights, execution_time, more than 2 times the CPU capacity) | | [ S2 - Error ] [ farm-name ] SMA RDS DBLoad alert; [ S2 - Error ] [ farm-name ] CMS RDS DBLoad alert | Runbook | |
| | | S1 | | RDS DBLoad (AWS specific, via Performance Insights, more than 4 times the number of CPUs for more than one hour) | Database load (via Query Insights, execution_time, more than 4 times the CPU capacity) | | [ S1 - Critical ] [ farm-name ] SMA RDS DBLoad alert; [ S1 - Critical ] [ farm-name ] CMS RDS DBLoad alert | Runbook | |
| | | S3 | | RDS DBLoadNonCPU (AWS specific, via Performance Insights, more than 1 times the number of CPUs for more than one hour) | IO wait time + Lock wait time (via Query Insights, more than 1 times the CPU capacity) | | [ S3 - Warning ] [ farm-name ] SMA RDS DBLoadNonCPU alert; [ S3 - Warning ] [ farm-name ] CMS RDS DBLoadNonCPU alert | Runbook | |
| | | | | | Wait events (Total of all events,?) | | | | |
| | | | | | Query latency (Total of all the latencies,?) | | | | |
| | | | | | | Link | Block Session Count | | |
| | | | | | | Link | long active query duration | | |
| | | | | Capture RDS top 10 query (TBD) | | | RDS top 10 query | | |
| | | | | dead tuple ems, dead tuple rms, dead tuple idm | | | | | |
| OS (Node level) | CPU | S2 | | CPU more than 97% for more than 60 mins | Same as AWS | Link | [ S2 - Error ] [ farm-name ] Node CPU Usage alert | Runbook | |
| | | S2 | | CPU (sy: system > 70% for more than 60 mins) (mark for review) | Same as AWS | Link | [ S2 - Error ] [ farm-name ] Node CPU System alert | Runbook | |
| | | S2 | | CPU (si: soft interrupts > 15% for more than 60 mins) (mark for review) | Same as AWS | Link | [ S2 - Error ] [ farm-name ] Node CPU Soft Interrupts alert | Runbook | |
| | Memory | S3 | | Memory more than 95% for more than 10 mins | Same as AWS | Link | [ S3 - Warning ] [ farm-name ] Node Mem Usage alert | Runbook | |
| | Disk | S3 | | Disk usage more than 95% | Same as AWS | Link | [ S3 - Warning ] [ farm-name ] Node Disk Usage alert | Runbook | |
| | | | | Disk read/write latency (TBD) | Same as AWS | | Disk Read Latency, Disk Write Latency | | |
| | | S3 | | Inode usage > 97% | Same as AWS | Link | [ S3 - Warning ] [ farm-name ] Disk Inode Usage alert | Runbook | |
| | | | | Node disk IO load (TBD) | Same as AWS | Link | Disk IOPS | | |
| | Network | | | Network operation latency (TBD) | Same as AWS | | | | |
| | | | | Network transit error rate (TBD) | Same as AWS | Link | Network Transit Error Rate | | |
| | | | | Network transit drop rate (TBD) | Same as AWS | Link | Network Transit Drop Rate | | |
| | | | | Network transit queue length (TBD) | Same as AWS | | | | |
| | | | | Throughput / bandwidth (TBD) | Same as AWS | | | | |
| | | S3 | | Load (Load Avg 15m / number of cores > 200% for 35 mins) | Same as AWS | Link | [ S3 - Warning ] [ farm-name ] Node Load Avg 15m/core | Runbook | |
| Container | CPU | S2 | | CPU (CPU more than 97% for more than 60 mins) | Same as AWS | Link | [ S2 - Error ] [ farm-name ] Pod CPU usage alert | Runbook | |
| | Memory | | | Swap usage | Same as AWS | Link | Pod Swap Usage | | |
| | Disk | | | Disk read/write latency (TBD) | Same as AWS | | Pod Disk Read Latency, Pod Disk Write Latency | | |
| | | S3 | | Inode usage (free/total) > 97% | Same as AWS | Link | [ S3 - Warning ] [ farm-name ] Pod Inode Usage alert | Runbook | |
| | Network | | | Network transit error rate (TBD) | Same as AWS | Link | Pod Network Transit Error Rate | | |
| | | | | Network transit drop rate (TBD) | Same as AWS | Link | Pod Network Transit Drop Rate | | |
| | Unavailable service | S0 | | SMAX critical path unavailable: svc portal / runtime ui / gateway / platform / redis / rabbitmq / bo-login / idm / bo-ats / ingress-nginx / sma-ui / bo-farcade | Same as AWS | Link | [ S0 - Urgent ] [ farm-name ] SMA Unavailable k8s resource alert | Runbook | |
| | | S2 | | SMAX partial business impact: others not in S0, search related (content, DIH, DAH, search, proxy) / auto pass / bo-ui / bo-user | Same as AWS | Link | [ S2 - Error ] [ farm-name ] SMA Unavailable k8s resource alert | Runbook | |
| | | S3 | | SMAX no obvious impact on business: XMPP / XIE / Smart Ticket / stx / virtual agent / ppo / web socket gateway / smart-ui / ocr / smarta-installer | Same as AWS | Link | [ S3 - Warning ] [ farm-name ] SMA Unavailable k8s resource alert | Runbook | |
| | | S4 | | SMAX services outside of ESM / toolkit | Same as AWS | Link | [ S4 - Info ] [ farm-name ] SMA Unavailable k8s resource alert | Runbook | |
| | | S0 | | CMS critical path unavailable: itom-cms-gateway, itom-idm, itom-ingress-controller, itom-ucmdb-browser, tom-ucmdb-solr, itom-ucmdb (both are down) | Same as AWS | Link | [ S0 - Urgent ] [ farm-name ] CMS Unavailable k8s resource alert | Runbook | |
| | | S2 | | CMS partial business impact: itom-autopass-lms, itom-vault, itom-ucmdb (either is down) | Same as AWS | Link | [ S2 - Error ] [ farm-name ] CMS Unavailable k8s resource alert | Runbook | |
| | | S3 | | CMS no obvious impact on business: | Same as AWS | | | | |
| | | S4 | | CMS services outside of ESM / toolkit: itom-ucmdb-probe, itom-ucmdb-dfp-lunux-installer, itom-ucmdb-dfp-windows-installer, itom-ucmdb-localclient-installers | Same as AWS | Link | [ S4 - Info ] [ farm-name ] CMS Unavailable k8s resource alert | Runbook | |
| | Load | S3 | | Load Avg 15m / number of cores > 200% for 35 mins (TBD, because it is not observable via current metrics) | Same as AWS | Link | Pod Load Avg 10s | Runbook | |
| | Threads | | | container_threads on process (TBD) | Same as AWS | Link | Threads | | |
| | Pod balancing (TBD) | | | | | | | | |
| App metrics | Thread | | | | | | | | |
| | Connections | | | | | | | | |
| | Limits | | | | | | | | |
| | Smart Analytics | S3 | | SMAX: Content data ratio (total doc / committed doc) > 1.20 | Same as AWS | Link | [ S3 - Warning ] [ farm-name ] SmartA Data Compact Ration alert | Runbook | |
| | Rabbitmq (each node) | S3 | | SMAX: queue > 200 / 250 for more than 30 mins (200 for medium profile or lower, 250 for large profile) | Same as AWS | Link | [ S3 - Warning ] [ farm-name ] Rabbitmq Queue alert | Runbook | |
| | | S3 | | SMAX: Pending Messages/Minute > 500 for more than 30 mins (mark for review) | Same as AWS | Link | [ S3 - Warning ] [ farm-name ] Rabbitmq Messages/Minute alert | Runbook | |
| | | | | SMAX: Message queue not equally distributed to different cluster nodes (TBD) | Same as AWS | | | Runbook | |
| | IDM | S4 | | SMAX: Active user (per profile, medium profile > 1100 for more than 30 mins, large profile > 3000 for more than 30 mins) | Same as AWS | Link | [ S4 - Info ] [ farm-name ] IDM active users alert | Runbook | |
| | Gateway | S2 | | SMAX: Tomcat https connector currentThreadsBusy > 30 for 30 mins (EU8-Prod) Tomcat https connector currentThreadsBusy > 30 for 30 mins or Tomcat https connector currentThreadsBusy > 60 for 15 mins or Tomcat https connector currentThreadsBusy > 90 for 5 mins | Same as AWS | Link | [ S2 - Error ] [ farm-name ] Gateway Tomcat https connector currentThreadsBusy alert | Runbook | |
| | | S2 | | SMAX: Httpclient InUse > 20 for 30 mins (EU8-Prod) Httpclient InUse > 20 for 30 mins or Httpclient InUse > 30 for 15 mins or Httpclient InUse > 80 for 5 mins | Same as AWS | Link | [ S2 - Error ] [ farm-name ] Gateway Httpclient InUse alert | Runbook | |
| | Platform | S2 | | SMAX: Tomcat https connector currentThreadsBusy > 30 for 30 mins (EU8-Prod) Tomcat https connector currentThreadsBusy > 30 for 30 mins or Tomcat https connector currentThreadsBusy > 60 for 15 mins or Tomcat https connector currentThreadsBusy > 90 for 5 mins | Same as AWS | Link | [ S2 - Error ] [ farm-name ] Platform Tomcat https connector currentThreadsBusy alert | Runbook | |
| | | S2 | | SMAX: Httpclient InUse > 20 for 30 mins (EU8-Prod) Httpclient InUse > 20 for 30 mins or Httpclient InUse > 30 for 15 mins or Httpclient InUse > 80 for 5 mins | Same as AWS | Link | [ S2 - Error ] [ farm-name ] Platform Httpclient InUse alert | Runbook | |
| | Serviceportal | S2 | | SMAX: Tomcat https connector currentThreadsBusy > 30 for 30 mins (EU8-Prod) Tomcat https connector currentThreadsBusy > 30 for 30 mins or Tomcat https connector currentThreadsBusy > 60 for 15 mins or Tomcat https connector currentThreadsBusy > 90 for 5 mins | Same as AWS | Link | [ S2 - Error ] [ farm-name ] Serviceportal Tomcat https connector currentThreadsBusy alert | Runbook | |
| | | S2 | | SMAX: Httpclient InUse > 20 for 30 mins (EU8-Prod) Httpclient InUse > 20 for 30 mins or Httpclient InUse > 30 for 15 mins or Httpclient InUse > 80 for 5 mins | Same as AWS | Link | [ S2 - Error ] [ farm-name ] Serviceportal Httpclient InUse alert | Runbook | |
| | OpenSearch based Monitoring (TBD) | | | Access 5xx | | | | | |
| | | | | Access Response time | | | | | |
| | Database level customer metrics | | | | Same as AWS | Link | NativeSACM Transaction Context Queue | | |
| | | | | | Same as AWS | Link | NativeSACM Transaction Context Queue retries | | |
| | | | | | Same as AWS | | | | |
| | | | | | Same as AWS | Link | | | |
| | | | | TextDetection Job queue | Same as AWS | Link | | | |
| | | | | IndexEntities Job queue | Same as AWS | Link | | | |
| | | | | EntitiesHandler Job queue | Same as AWS | Link | | | |
| | | | | SLT Job Delay time [mins] | Same as AWS | Link | | | |
| | | | | TextDetection Job Delay time [mins] | Same as AWS | Link | | | |
| | | | | IndexEntities Job Delay time [mins] | Same as AWS | Link | | | |
| | | | | EntitiesHandler Job Delay time [mins] | Same as AWS | Link | | | |
| Instrumental | Method | | | | | | | | |
| | Query | | | | | | | | |
| Others | When to scale out (overloaded) | | | | | | | | |
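
The Alert Message column follows a recurring pattern: a severity tag, the farm name, and the alert name, usually ending in "alert". A minimal sketch of that pattern is shown here, assuming the severity labels that appear in the table (S0 Urgent, S1 Critical, S2 Error, S3 Warning, S4 Info). The function name `format_alert_message` and the dictionary `SEVERITY_LABELS` are hypothetical and are not part of the ESM-Saas-Monitoring repository.

```python
# Severity labels as they appear in the Alert Message column of the table.
SEVERITY_LABELS = {
    "S0": "Urgent",
    "S1": "Critical",
    "S2": "Error",
    "S3": "Warning",
    "S4": "Info",
}


def format_alert_message(severity: str, farm_name: str, alert_name: str) -> str:
    """Build a message such as '[ S2 - Error ] [ farm-name ] RDS CPU Utilization alert'.

    Hypothetical helper: it only mirrors the text pattern seen in the table.
    """
    if severity not in SEVERITY_LABELS:
        raise ValueError(f"unknown severity: {severity}")
    label = SEVERITY_LABELS[severity]
    return f"[ {severity} - {label} ] [ {farm_name} ] {alert_name} alert"


if __name__ == "__main__":
    print(format_alert_message("S2", "farm-name", "RDS CPU Utilization"))
```

A few table entries (for example the Node Load Avg 15m/core message) do not end in "alert", so a real implementation would need to allow for such exceptions.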
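
The two Database disk rows carry formulas that are easy to misread in table form. The sketch below applies them literally as written in the table: the GCP free-space expression `(1 - Disk Utilization) * Disk allocation / Disk Utilization` and the auto-scaling ratio `(Free Space + Max Autoscaling Storage - Allocated Storage) / Allocated Storage < 0.2`. It assumes that Disk Utilization is a fraction in (0, 1], that all sizes are in bytes, and that "500 MB" means 500 * 1024 * 1024 bytes. All function names are hypothetical.

```python
MB = 1024 * 1024  # assumption: the table's "MB" is read as 1024 * 1024 bytes


def gcp_free_storage_bytes(disk_utilization: float, disk_allocation_bytes: float) -> float:
    """Free space per the GCP column: (1 - utilization) * allocation / utilization."""
    if not 0.0 < disk_utilization <= 1.0:
        raise ValueError("disk_utilization must be a fraction in (0, 1]")
    return (1.0 - disk_utilization) * disk_allocation_bytes / disk_utilization


def free_space_alert(free_bytes: float, threshold_bytes: float = 500 * MB) -> bool:
    """True when free storage is below the 500 MB threshold from the table."""
    return free_bytes < threshold_bytes


def autoscale_headroom_ratio(free_bytes: float, max_autoscaling_bytes: float,
                             allocated_bytes: float) -> float:
    """(Free Space + Max Autoscaling Storage - Allocated Storage) / Allocated Storage."""
    if allocated_bytes <= 0:
        raise ValueError("allocated_bytes must be positive")
    return (free_bytes + max_autoscaling_bytes - allocated_bytes) / allocated_bytes


def autoscale_headroom_alert(free_bytes: float, max_autoscaling_bytes: float,
                             allocated_bytes: float, min_ratio: float = 0.2) -> bool:
    """True when the headroom ratio drops below 0.2, as in the table."""
    ratio = autoscale_headroom_ratio(free_bytes, max_autoscaling_bytes, allocated_bytes)
    return ratio < min_ratio


if __name__ == "__main__":
    # 10 GB free, autoscaling cap equal to the current 100 GB allocation.
    gb = 1024 * MB
    print(autoscale_headroom_alert(10 * gb, 100 * gb, 100 * gb))
```

With the autoscaling cap equal to the current allocation, the ratio reduces to free space divided by allocated storage, which is why the example instance with 10% free space would alert.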
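
The Gateway, Platform and Serviceportal rows combine several "value above X for N mins" conditions with "or". One way to evaluate such tiers is sketched below. It assumes one sample per minute, ordered oldest first, and reads "for N mins" as "every one of the latest N samples exceeds the threshold". These assumptions, and the names `sustained_above`, `tiered_alert`, `TOMCAT_BUSY_TIERS` and `HTTPCLIENT_INUSE_TIERS`, are illustrative only; the actual dashboards may evaluate the rule differently.

```python
from typing import Iterable, Sequence, Tuple

# (threshold, minutes) pairs taken from the table rows.
TOMCAT_BUSY_TIERS: Tuple[Tuple[float, int], ...] = ((30, 30), (60, 15), (90, 5))
HTTPCLIENT_INUSE_TIERS: Tuple[Tuple[float, int], ...] = ((20, 30), (30, 15), (80, 5))


def sustained_above(samples: Sequence[float], threshold: float, minutes: int) -> bool:
    """True if each of the latest `minutes` per-minute samples exceeds `threshold`."""
    if minutes <= 0 or len(samples) < minutes:
        return False  # not enough history to decide
    return all(value > threshold for value in samples[-minutes:])


def tiered_alert(samples: Sequence[float], tiers: Iterable[Tuple[float, int]]) -> bool:
    """True if any (threshold, minutes) tier is satisfied."""
    return any(sustained_above(samples, threshold, minutes) for threshold, minutes in tiers)


if __name__ == "__main__":
    # Five minutes at 95 busy threads satisfies the "> 90 for 5 mins" tier.
    print(tiered_alert([95] * 5, TOMCAT_BUSY_TIERS))
```

Keeping the tiers as data makes it straightforward to hold a different tier list per farm, such as the separate EU8-Prod condition mentioned in the table.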