# ESM-Cloud-Unified-Monitoring_686074338

## Legends

S2 S3 S4 NEW

Check here for the [severity definitions](https://rndwiki.houston.softwaregrp.net/confluence/display/ICS/Monitoring+Alert+Serverity+Definition).

## Introduction

This guide presents all the items related to monitoring the ESM product on SaaS.

## Levels of monitoring

## Alerts

Alerts come with monitoring and experience. Here is a reference list of items to be sent as alerts. The [Grafana monitoring dashboards](https://github.houston.softwaregrp.net/smax-saas-ops/ESM-Saas-Monitoring) are developed based on the list below.

| Monitoring Level | Category | Severity | Code | Alert Description | Sample Chart | Alert Message | Runbook |
|---|---|---|---|---|---|---|---|
| Infrastructure | Compute | S0 | | ALB HTTP 5XX Count (more than 34 in a 3 mins time frame) | Link | [ S0 - Urgent ] [ farm-name ] ALB HTTP 5XX Count alert | Runbook |
| | | S2 | | ALB Target 5xx Count | Link | | |
| | Storage | S3 | | EBS Disk Queue Depth (EBS disk queue depth more than 5 for more than 10 mins) | Link | [ S3 - Warning ] [ farm-name ] EBS Disk Queue Depth alert | Runbook |
| | | S2 | | EBS Burst Balance Average (EBS burst balance below 40% for more than 30 mins) | Link | [ S2 - Error ] [ farm-name ] EBS Burst Balance Average alert | Runbook |
| | | S0 | | EBS Burst Balance Average (EBS burst balance is below 0) | Link | [ S0 - Urgent ] [ farm-name ] EBS Burst Balance Average alert | Runbook |
| | | S2 | | EFS Burst Credit Balance (burst credit below 40% for more than 15 mins) | Link | [ S2 - Error ] [ farm-name ] EFS Burst Credit Balance alert | Runbook |
| | | S0 | | EFS Burst Credit Balance (burst credit is 0) | Link | [ S0 - Urgent ] [ farm-name ] EFS Burst Credit Balance alert | Runbook |
| | Virtualization | | | | | | |
| | Database | S2 | | RDS CPU Utilization (CPU more than 97% for more than 30 mins) | Link | [ S2 - Error ] [ farm-name ] RDS CPU Utilization alert | Runbook |
| | | S2 | | CPU (sy: system > 70% for more than 60 mins) | Link | [ S2 - Error ] [ farm-name ] RDS cpuUtilization System alert | Runbook |
| | | S2 | | CPU (si: soft interrupts > 15% for more than 60 mins) | Link | [ S2 - Error ] [ farm-name ] RDS CPU Soft Interrupts alert | Runbook |
| | | S3 | | Disk queue depth (EBS disk queue depth more than 5 for more than 10 mins) | Link | [ S3 - Warning ] [ farm-name ] RDS Disk queue depth alert | Runbook |
| | | S2 | | Disk (Free Storage Space is below 500 MB) | Link | [ S2 - Error ] [ farm-name ] RDS Disk Free Storage Space alert | Runbook |
| | | S2 | | Disk (check that storage has enough space to auto-scale: (Free Space + Max Autoscaling Storage - Allocated Storage) / Allocated Storage < 0.2) | | | Runbook |
| | | S2 | | Memory (free memory less than 5% for more than 5 mins) | Link | [ S2 - Error ] [ farm-name ] RDS Free Memory Percentage alert | Runbook |
| | | S0 | | Memory (free memory less than 2% for more than 5 mins) | Link | [ S0 - Urgent ] [ farm-name ] RDS Free Memory Percentage alert | Runbook |
| | | S2 | | Storage (Burst Balance below 40% for more than 30 mins) | Link | [ S2 - Error ] [ farm-name ] RDS Burst Balance alert | Runbook |
| | | S0 | | RDS Burst Balance (Burst Balance is 0) | Link | [ S0 - Urgent ] [ farm-name ] RDS Burst Balance alert | Runbook |
| | | S2 | | RDS DBLoad (AWS specific, via Performance Insights: more than 2 times the CPU count for more than one hour) | | [ S2 - Error ] [ farm-name ] SMA RDS DBLoad alert; [ S2 - Error ] [ farm-name ] CMS RDS DBLoad alert | Runbook |
| | | S1 | | RDS DBLoad (AWS specific, via Performance Insights: more than 4 times the CPU count for more than one hour) | | [ S1 - Critical ] [ farm-name ] SMA RDS DBLoad alert; [ S1 - Critical ] [ farm-name ] CMS RDS DBLoad alert | Runbook |
| | | S3 | | RDS DBLoadNonCPU (AWS specific, via Performance Insights: more than 1 times the CPU count for more than one hour) | | [ S3 - Warning ] [ farm-name ] SMA RDS DBLoadNonCPU alert; [ S3 - Warning ] [ farm-name ] CMS RDS DBLoadNonCPU alert | Runbook |
| | | | | | Link | Block Session Count | |
| | | | | | Link | Long active query duration | |
| | | | | Capture RDS top 10 query (TBD) | | RDS top 10 query | |
| | | | | Dead tuple ems, dead tuple rms, dead tuple idm | | | |
| OS (Node level) | CPU | S2 | | CPU more than 97% for more than 60 mins | Link | [ S2 - Error ] [ farm-name ] Node CPU Usage alert | Runbook |
| | | S2 | | CPU (sy: system > 70% for more than 60 mins) (mark for review) | Link | [ S2 - Error ] [ farm-name ] Node CPU System alert | Runbook |
| | | S2 | | CPU (si: soft interrupts > 15% for more than 60 mins) (mark for review) | Link | [ S2 - Error ] [ farm-name ] Node CPU Soft Interrupts alert | Runbook |
| | Memory | S3 | | Memory more than 95% for more than 10 mins | Link | [ S3 - Warning ] [ farm-name ] Node Mem Usage alert | Runbook |
| | Disk | S3 | | Disk usage more than 95% | Link | [ S3 - Warning ] [ farm-name ] Node Disk Usage alert | Runbook |
| | | | | Disk read/write latency (TBD) | | Disk Read Latency, Disk Write Latency | |
| | | S3 | | Inode usage > 97% | Link | [ S3 - Warning ] [ farm-name ] Disk Inode Usage alert | Runbook |
| | | | | Node disk IO load (TBD) | Link | Disk IOPS | |
| | Network | | | Network operation latency (TBD) | | | |
| | | | | Network transit error rate (TBD) | Link | Network Transit Error Rate | |
| | | | | Network transit drop rate (TBD) | Link | Network Transit Drop Rate | |
| | | | | Network transit queue length (TBD) | | | |
| | | | | Throughput / bandwidth (TBD) | | | |
| | | S3 | | Load (Load Avg 15m/core number > 200% for 35 mins) | Link | [ S3 - Warning ] [ farm-name ] Node Load Avg 15m/core | Runbook |
| Container | CPU | S2 | | CPU (CPU more than 97% for more than 60 mins) | Link | [ S2 - Error ] [ farm-name ] Pod CPU usage alert | Runbook |
| | Memory | | | Swap usage | Link | Pod Swap Usage | |
| | Disk | | | Disk read/write latency (TBD) | | Pod Disk Read Latency, Pod Disk Write Latency | |
| | | S3 | | Inode usage (free/total) > 97% | Link | [ S3 - Warning ] [ farm-name ] Pod Inode Usage alert | Runbook |
| | Network | | | Network transit error rate (TBD) | Link | Pod Network Transit Error Rate | |
| | | | | Network transit drop rate (TBD) | Link | Pod Network Transit Drop Rate | |
| | Unavailable service | S0 | | SMAX: critical path unavailable: svc portal / runtime ui / gateway / platform / redis / rabbitmq / bo-login / idm / bo-ats / ingress-nginx / sma-ui / bo-farcade | Link | [ S0 - Urgent ] [ farm-name ] SMA Unavailable k8s resource alert | Runbook |
| | | S2 | | SMAX: partial business impact: others not in S0, search related (content, DIH, DAH, search, proxy) / auto pass / bo-ui / bo-user | Link | [ S2 - Error ] [ farm-name ] SMA Unavailable k8s resource alert | Runbook |
| | | S3 | | SMAX: no obvious impact on business: XMPP / XIE / Smart Ticket / stx / virtual agent / ppo / web socket gateway / smart-ui / ocr / smarta-installer | Link | [ S3 - Warning ] [ farm-name ] SMA Unavailable k8s resource alert | Runbook |
| | | S4 | | SMAX: services outside of ESM / toolkit | Link | [ S4 - Info ] [ farm-name ] SMA Unavailable k8s resource alert | Runbook |
| | | S0 | | CMS: critical path unavailable: itom-cms-gateway, itom-idm, itom-ingress-controller, itom-ucmdb-browser, tom-ucmdb-solr, itom-ucmdb (both are down) | Link | [ S0 - Urgent ] [ farm-name ] CMS Unavailable k8s resource alert | Runbook |
| | | S2 | | CMS: partial business impact: itom-autopass-lms, itom-vault, itom-ucmdb (either is down) | Link | [ S2 - Error ] [ farm-name ] CMS Unavailable k8s resource alert | Runbook |
| | | S3 | | CMS: no obvious impact on business: | | | |
| | | S4 | | CMS: services outside of ESM / toolkit: itom-ucmdb-probe, itom-ucmdb-dfp-lunux-installer, itom-ucmdb-dfp-windows-installer, itom-ucmdb-localclient-installers | Link | [ S4 - Info ] [ farm-name ] CMS Unavailable k8s resource alert | Runbook |
| | Load | S3 | | Load Avg 15m/core number > 200% for 35 mins (TBD, because it is not observable via current metrics) | Link | Pod Load Avg 10s | Runbook |
| | Threads | | | container_threads on process (TBD) | Link | Threads | |
| | Pod balancing (TBD) | | | | | | |
| App metrics | Thread | | | | | | |
| | Connections | | | | | | |
| | Limits | | | | | | |
| | Smart Analytics | S3 | | SMAX: Content data ratio (total doc/committed doc) > 1.20 | Link | [ S3 - Warning ] [ farm-name ] SmartA Data Compact Ration alert | Runbook |
| | Rabbitmq (each node) | S3 | | SMAX: queue > 200 / 250 for more than 30 mins (200 for medium profile or lower, 250 for large profile) | Link | [ S3 - Warning ] [ farm-name ] Rabbitmq Queue alert | Runbook |
| | | S3 | | SMAX: Pending Messages/Minute > 500 for more than 30 mins (mark for review) | Link | [ S3 - Warning ] [ farm-name ] Rabbitmq Messages/Minute alert | Runbook |
| | | | | SMAX: Message queue not equally distributed to different cluster nodes (TBD) | | | Runbook |
| | IDM | S4 | | SMAX: Active user (per profile: medium profile > 1100 for more than 30 mins, large profile > 3000 for more than 30 mins) | Link | [ S4 - Info ] [ farm-name ] IDM active users alert | Runbook |
| | Gateway | S2 | | SMAX: Tomcat https connector currentThreadsBusy > 30 for 30 mins (EU8-Prod); Tomcat https connector currentThreadsBusy > 30 for 30 mins, or > 60 for 15 mins, or > 90 for 5 mins | Link | [ S2 - Error ] [ farm-name ] Gateway Tomcat https connector currentThreadsBusy alert | Runbook |
| | | S2 | | SMAX: Httpclient InUse > 20 for 30 mins (EU8-Prod); Httpclient InUse > 20 for 30 mins, or > 30 for 15 mins, or > 80 for 5 mins | Link | [ S2 - Error ] [ farm-name ] Gateway Httpclient InUse alert | Runbook |
| | Platform | S2 | | SMAX: Tomcat https connector currentThreadsBusy > 30 for 30 mins (EU8-Prod); Tomcat https connector currentThreadsBusy > 30 for 30 mins, or > 60 for 15 mins, or > 90 for 5 mins | Link | [ S2 - Error ] [ farm-name ] Platform Tomcat https connector currentThreadsBusy alert | Runbook |
| | | S2 | | SMAX: Httpclient InUse > 20 for 30 mins (EU8-Prod); Httpclient InUse > 20 for 30 mins, or > 30 for 15 mins, or > 80 for 5 mins | Link | [ S2 - Error ] [ farm-name ] Platform Httpclient InUse alert | Runbook |
| | Serviceportal | S2 | | SMAX: Tomcat https connector currentThreadsBusy > 30 for 30 mins (EU8-Prod); Tomcat https connector currentThreadsBusy > 30 for 30 mins, or > 60 for 15 mins, or > 90 for 5 mins | Link | [ S2 - Error ] [ farm-name ] Serviceportal Tomcat https connector currentThreadsBusy alert | Runbook |
| | | S2 | | SMAX: Httpclient InUse > 20 for 30 mins (EU8-Prod); Httpclient InUse > 20 for 30 mins, or > 30 for 15 mins, or > 80 for 5 mins | Link | [ S2 - Error ] [ farm-name ] Serviceportal Httpclient InUse alert | Runbook |
| | OpenSearch based Monitoring (TBD) | | | Access 5xx | | | |
| | | | | Access Response time | | | |
| | Database level customer metrics | | | | Link | NativeSACM Transaction Context Queue | |
| | | | | | Link | NativeSACM Transaction Context Queue retries | |
| | | | | | Link | | |
| | | | | TextDetection Job queue | Link | | |
| | | | | IndexEntities Job queue | Link | | |
| | | | | EntitiesHandler Job queue | Link | | |
| | | | | SLT Job Delay time [mins] | Link | | |
| | | | | TextDetection Job Delay time [mins] | Link | | |
| | | | | IndexEntities Job Delay time [mins] | Link | | |
| | | | | EntitiesHandler Job Delay time [mins] | Link | | |
| Instrumental | Method | | | | | | |
| | Query | | | | | | |
| Others | When to scale out (overloaded) | | | | | | |
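
Several rules in the table combine a threshold with a duration, and a few use a derived ratio. The sketch below illustrates how three of them could be evaluated: the alert message format, the RDS auto-scaling headroom ratio, and the multi-window Tomcat `currentThreadsBusy` rule. It is a minimal illustration, not the dashboards' actual implementation. All function and class names are hypothetical, and it assumes one metric sample per minute.

```python
from dataclasses import dataclass
from typing import Sequence

# Severity labels as they appear in the alert messages listed in the table.
SEVERITY_LABELS = {
    "S0": "Urgent",
    "S1": "Critical",
    "S2": "Error",
    "S3": "Warning",
    "S4": "Info",
}


def format_alert(severity: str, farm_name: str, alert_name: str) -> str:
    """Build a message shaped like '[ S2 - Error ] [ farm-name ] <name> alert'."""
    label = SEVERITY_LABELS[severity]
    return f"[ {severity} - {label} ] [ {farm_name} ] {alert_name} alert"


def rds_autoscale_headroom(free: float, max_autoscaling: float, allocated: float) -> float:
    """(Free Space + Max Autoscaling Storage - Allocated Storage) / Allocated Storage.

    All three arguments must use the same unit (for example GiB).
    The table flags a value below 0.2.
    """
    return (free + max_autoscaling - allocated) / allocated


@dataclass(frozen=True)
class Window:
    threshold: float  # the metric must stay strictly above this value
    minutes: int      # for this many consecutive minutes


# Non-EU8 Tomcat rule from the table: > 30 for 30 mins, > 60 for 15, > 90 for 5.
TOMCAT_BUSY_WINDOWS = (Window(30, 30), Window(60, 15), Window(90, 5))


def breaches(samples: Sequence[float], windows: Sequence[Window]) -> bool:
    """Return True if any window is violated by the most recent samples.

    `samples` holds one value per minute, oldest first (an assumption of
    this sketch). A window is violated when its last `minutes` samples all
    exceed its threshold.
    """
    for window in windows:
        recent = samples[-window.minutes:]
        if len(recent) == window.minutes and all(v > window.threshold for v in recent):
            return True
    return False


if __name__ == "__main__":
    print(format_alert("S2", "farm-name", "RDS CPU Utilization"))
    print(rds_autoscale_headroom(free=5, max_autoscaling=110, allocated=100) < 0.2)
    print(breaches([95] * 5, TOMCAT_BUSY_WINDOWS))
```

Treating the three windows as alternatives means a short, severe spike and a long, moderate plateau raise the same S2 alert, which matches the "or" wording of the rule.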