nexus/knowledgebase/csd-wiki/ICSD/ESM-Cloud-Unified-Monitoring_686074338.md
2026-04-18 17:09:43 +08:00


ESM-Cloud-Unified-Monitoring_686074338

Legends

S2

S3

S4

NEW

Check here for the severity definitions.

Introduction

This guide presents all the items related to monitoring the ESM product on SaaS.

Levels of monitoring

Alerts

Alerts come with monitoring and experience.

Here is a reference list of items to be sent as alerts. The Grafana monitoring dashboards are developed based on the list below.
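
The alert messages in this list share a common shape: a severity tag, the farm name, and the alert name followed by the word "alert". A minimal sketch of that pattern follows. The function name `format_alert_message` and the label mapping are illustrative assumptions inferred from the messages in this guide, not an existing tool.

```python
# Severity labels as they appear in the alert messages of this guide.
SEVERITY_LABELS = {
    "S0": "Urgent",
    "S1": "Critical",
    "S2": "Error",
    "S3": "Warning",
    "S4": "Info",
}


def format_alert_message(severity: str, farm_name: str, alert_name: str) -> str:
    """Build a message such as '[ S2 - Error ] [ farm-name ] RDS CPU Utilization alert'."""
    label = SEVERITY_LABELS[severity]  # raises KeyError for an unknown severity code
    return f"[ {severity} - {label} ] [ {farm_name} ] {alert_name} alert"


if __name__ == "__main__":
    print(format_alert_message("S2", "farm-name", "RDS CPU Utilization"))
```

Keeping the mapping in one place would make it harder for a message to carry a severity code and label that disagree.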

Monitoring Level | Category | Severity | Code | Alert Description | Sample Chart | Alert Message | Runbook
Infrastructure

Compute

ALB HTTP 5XX Count (more than 34 in a 3-minute time frame) | Link | [ S0 - Urgent ] [ farm-name ] ALB HTTP 5XX Count alert | Runbook

S2

ALB Target 5xx Count | Link
Storage

S3

EBS Disk Queue Depth (EBS disk queue depth more than 5 for more than 10 mins) | Link | [ S3 - Warning ] [ farm-name ] EBS Disk Queue Depth alert | Runbook

S2

EBS Burst Balance Average (EBS burst balance below 40% for more than 30 mins) | Link | [ S2 - Error ] [ farm-name ] EBS Burst Balance Average alert | Runbook
EBS Burst Balance Average (EBS burst balance is below 0) | Link | [ S0 - Urgent ] [ farm-name ] EBS Burst Balance Average alert | Runbook

S2

EFS Burst Credit Balance (burst credit below 40% for more than 15 mins) | Link | [ S2 - Error ] [ farm-name ] EFS Burst Credit Balance alert | Runbook
EFS Burst Credit Balance (burst credit is 0) | Link | [ S0 - Urgent ] [ farm-name ] EFS Burst Credit Balance alert | Runbook

Virtualization

Database

S2

RDS CPU Utilization (CPU more than 97% for more than 30 mins) | Link | [ S2 - Error ] [ farm-name ] RDS CPU Utilization alert | Runbook

S2

CPU (sy: system > 70% for more than 60 mins) | Link | [ S2 - Error ] [ farm-name ] RDS cpuUtilization System alert | Runbook

S2

CPU (si: soft interrupts > 15% for more than 60 mins) | Link | [ S2 - Error ] [ farm-name ] RDS CPU Soft Interrupts alert | Runbook

S3

Disk queue depth (EBS disk queue depth more than 5 for more than 10 mins) | Link | [ S3 - Warning ] [ farm-name ] RDS Disk queue depth alert | Runbook

S2

Disk (Free Storage Space is below 500 MB) | Link | [ S2 - Error ] [ farm-name ] RDS Disk Free Storage Space alert | Runbook

S2

Disk (checks whether storage has enough space to auto-scale; alert when (Free Space + Max Autoscaling Storage - Allocated Storage) / Allocated Storage < 0.2) | Runbook
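
The auto-scaling headroom check above is a simple ratio. A minimal sketch of the arithmetic follows, assuming all three inputs use the same unit (GiB here). The function names and the example numbers are illustrative and do not come from the dashboard configuration.

```python
def autoscale_headroom_ratio(free_gib: float, max_autoscale_gib: float,
                             allocated_gib: float) -> float:
    """(Free Space + Max Autoscaling Storage - Allocated Storage) / Allocated Storage."""
    if allocated_gib <= 0:
        raise ValueError("allocated storage must be positive")
    return (free_gib + max_autoscale_gib - allocated_gib) / allocated_gib


def autoscale_headroom_low(free_gib: float, max_autoscale_gib: float,
                           allocated_gib: float, threshold: float = 0.2) -> bool:
    """Return True when the remaining headroom ratio is below the threshold."""
    return autoscale_headroom_ratio(free_gib, max_autoscale_gib, allocated_gib) < threshold


if __name__ == "__main__":
    # Hypothetical instance: 900 GiB allocated, 10 GiB free, autoscaling capped at 1000 GiB.
    print(autoscale_headroom_low(10, 1000, 900))
```

In the hypothetical case, the ratio is (10 + 1000 - 900) / 900, roughly 0.12, so the check would fire.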

S2

Memory (free memory less than 5% for more than 5 mins) | Link | [ S2 - Error ] [ farm-name ] RDS Free Memory Percentage alert | Runbook
Memory (free memory less than 2% for more than 5 mins) | Link | [ S0 - Urgent ] [ farm-name ] RDS Free Memory Percentage alert | Runbook

S2

Storage (Burst Balance below 40% for more than 30 mins) | Link | [ S2 - Error ] [ farm-name ] RDS Burst Balance alert | Runbook
RDS Burst Balance (Burst Balance is 0) | Link | [ S0 - Urgent ] [ farm-name ] RDS Burst Balance alert | Runbook

S2

RDS DBLoad (AWS-specific, via Performance Insights: more than 2 times the number of CPUs for more than one hour)

Link

Link

[ S2 - Error ] [ farm-name ] SMA RDS DBLoad alert

[ S2 - Error ] [ farm-name ] CMS RDS DBLoad alert

Runbook
RDS DBLoad (AWS-specific, via Performance Insights: more than 4 times the number of CPUs for more than one hour)

Link

Link

[ S1 - Critical ] [ farm-name ] SMA RDS DBLoad alert

[ S1 - Critical ] [ farm-name ] CMS RDS DBLoad alert

Runbook

S3

RDS DBLoadNonCPU (AWS-specific, via Performance Insights: more than 1 times the number of CPUs for more than one hour)

Link

Link

[ S3 - Warning ] [ farm-name ] SMA RDS DBLoadNonCPU alert

[ S3 - Warning ] [ farm-name ] CMS RDS DBLoadNonCPU alert

Runbook
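
The DBLoad rules above compare the load to a multiple of the CPU count over a one-hour window. A minimal sketch of that classification follows, assuming one DBLoad sample per minute (most recent last) and treating "for more than one hour" as "every one of the last 60 samples exceeds the limit". The function names and the sampling interval are assumptions, not the actual alert rule.

```python
from typing import Optional, Sequence

WINDOW_SAMPLES = 60  # assumed: one sample per minute, one-hour window


def _sustained_floor(samples: Sequence[float]) -> Optional[float]:
    """Lowest value in the last hour, or None if the window is not yet full."""
    recent = samples[-WINDOW_SAMPLES:]
    if len(recent) < WINDOW_SAMPLES:
        return None
    return min(recent)


def classify_db_load(samples: Sequence[float], cpu_count: int) -> Optional[str]:
    """Return 'S1' above 4x CPUs, 'S2' above 2x CPUs, otherwise None."""
    floor = _sustained_floor(samples)
    if floor is None:
        return None
    if floor > 4 * cpu_count:
        return "S1"
    if floor > 2 * cpu_count:
        return "S2"
    return None


def classify_db_load_non_cpu(samples: Sequence[float], cpu_count: int) -> Optional[str]:
    """Return 'S3' when DBLoadNonCPU stays above 1x CPUs for the whole window."""
    floor = _sustained_floor(samples)
    if floor is not None and floor > 1 * cpu_count:
        return "S3"
    return None
```

Using the window minimum means a single dip below the limit resets the condition, which matches the "sustained" reading assumed here.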

Locks (TBD)

Link | Block Session Count

Long active queries (TBD)

Link | Long active query duration

Capture RDS top 10 query (TBD)

    1. Clean stat_statement daily
    2. capture during runtime if CPU is more than 97% for 60 mins

Link

RDS top 10 query

Dead tuple (TBD)

Link

Link

Link

dead tuple ems

dead tuple rms

dead tuple idm

OS (Node level)

CPU

S2

CPU more than 97% for more than 60 mins | Link | [ S2 - Error ] [ farm-name ] Node CPU Usage alert | Runbook

S2

CPU (sy: system > 70% for more than 60 mins) (mark for review) | Link | [ S2 - Error ] [ farm-name ] Node CPU System alert | Runbook

S2

CPU (si: soft interrupts > 15% for more than 60 mins) (mark for review) | Link | [ S2 - Error ] [ farm-name ] Node CPU Soft Interrupts alert | Runbook

Memory

S3

Memory more than 95% for more than 10 mins | Link | [ S3 - Warning ] [ farm-name ] Node Mem Usage alert | Runbook

Disk

S3

Disk usage more than 95% | Link | [ S3 - Warning ] [ farm-name ] Node Disk Usage alert | Runbook

Disk read/write latency (TBD)

Link

Link

Disk Read Latency

Disk Write Latency

S3

Inode usage > 97%

Link | [ S3 - Warning ] [ farm-name ] Disk Inode Usage alert | Runbook

Node disk IO load (TBD)

Link | Disk IOPS

Network

Network operation latency (TBD)

Network transit error rate (TBD)

Link | Network Transit Error Rate

Network transit drop rate (TBD)

Link | Network Transit Drop Rate

Network transit queue length (TBD)

Throughput / bandwidth (TBD)

S3

Load (Load Avg 15m / core number > 200% for 35 mins) | Link | [ S3 - Warning ] [ farm-name ] Node Load Avg 15m/core | Runbook
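
The node load rule above normalizes the 15-minute load average by the core count and requires the result to stay above 200% for 35 minutes. A minimal sketch follows, assuming one load sample per minute with the most recent last. The function names and the sampling interval are assumptions.

```python
from typing import Sequence


def load_per_core_percent(load_avg_15m: float, cores: int) -> float:
    """Express the 15-minute load average as a percentage of the core count."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return load_avg_15m / cores * 100.0


def node_load_alert(load_samples: Sequence[float], cores: int,
                    threshold_pct: float = 200.0, window_minutes: int = 35) -> bool:
    """True when every sample in the last window exceeds the per-core threshold."""
    recent = load_samples[-window_minutes:]
    if len(recent) < window_minutes:
        return False  # not enough history to call the condition sustained
    return all(load_per_core_percent(v, cores) > threshold_pct for v in recent)
```

For example, a 4-core node with a 15-minute load average of 8.0 sits at exactly 200%, which does not exceed the threshold. A load of 9.0 held for 35 samples would.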
Container

CPU

S2

CPU (CPU more than 97% for more than 60 mins) | Link | [ S2 - Error ] [ farm-name ] Pod CPU usage alert | Runbook

Memory

Swap usage

Link | Pod Swap Usage

Disk

Disk read/write latency (TBD)

Link

Link

Pod Disk Read Latency

Pod Disk Write Latency

S3

Inode usage (free/total) > 97%

Link | [ S3 - Warning ] [ farm-name ] Pod Inode Usage alert | Runbook

Network

Network transit error rate (TBD)

Link | Pod Network Transit Error Rate

Network transit drop rate (TBD)

Link | Pod Network Transit Drop Rate

Unavailable service

SMAX | Critical path unavailable: svc portal / runtime ui / gateway / platform / redis / rabbitmq / bo-login / idm / bo-ats / ingress-nginx / sma-ui / bo-farcade

Link | [ S0 - Urgent ] [ farm-name ] SMA Unavailable k8s resource alert | Runbook

S2

SMAX | Partial business impact: others not in S0, search related (content, DIH, DAH, search, proxy) / auto pass / bo-ui / bo-user

Link | [ S2 - Error ] [ farm-name ] SMA Unavailable k8s resource alert | Runbook

S3

SMAX | No obvious business impact: XMPP / XIE / Smart Ticket / stx / virtual agent / ppo / web socket gateway / smart-ui / ocr / smarta-installer

Link | [ S3 - Warning ] [ farm-name ] SMA Unavailable k8s resource alert | Runbook

S4

SMAX | Services outside of ESM / toolkit

Link | [ S4 - Info ] [ farm-name ] SMA Unavailable k8s resource alert | Runbook

CMS | Critical path unavailable: itom-cms-gateway, itom-idm, itom-ingress-controller, itom-ucmdb-browser, tom-ucmdb-solr, itom-ucmdb (both are down)

Link | [ S0 - Urgent ] [ farm-name ] CMS Unavailable k8s resource alert | Runbook

S2

CMS | Partial business impact: itom-autopass-lms, itom-vault, itom-ucmdb (either is down)

Link | [ S2 - Error ] [ farm-name ] CMS Unavailable k8s resource alert | Runbook

S3

CMS | No obvious business impact:

S4

CMS | Services outside of ESM / toolkit: itom-ucmdb-probe, itom-ucmdb-dfp-lunux-installer, itom-ucmdb-dfp-windows-installer, itom-ucmdb-localclient-installers

Link | [ S4 - Info ] [ farm-name ] CMS Unavailable k8s resource alert | Runbook

Load

S3

Load Avg 15m/core number > 200% for 35 mins (TBD, because it's not observable via current metrics)

Link | Pod Load Avg 10s | Runbook

Threads

container_threads on process (TBD)

Link | Threads

Pod balancing (TBD)

App metrics

Thread

Connections

Limits

Smart Analytics

S3

SMAX | Content data ratio (total doc / committed doc) > 1.20

Link | [ S3 - Warning ] [ farm-name ] SmartA Data Compact Ration alert | Runbook

Rabbitmq (each node)

S3

SMAX | Queue > 200 / 250 for more than 30 mins (200 for medium profile or lower, 250 for large profile)

Link | [ S3 - Warning ] [ farm-name ] Rabbitmq Queue alert | Runbook
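
The RabbitMQ queue rule above uses a profile-dependent threshold. A minimal sketch of that selection follows, assuming one queue sample per minute per node and assuming that any profile other than "large" uses the lower limit. The function names and the profile strings are illustrative.

```python
from typing import Sequence


def rabbitmq_queue_threshold(profile: str) -> int:
    """250 for the large profile, 200 for medium or lower."""
    return 250 if profile.strip().lower() == "large" else 200


def rabbitmq_queue_alert(queue_samples: Sequence[int], profile: str,
                         window_minutes: int = 30) -> bool:
    """True when the queue value stayed above the profile threshold for the whole window."""
    recent = queue_samples[-window_minutes:]
    if len(recent) < window_minutes:
        return False  # not enough history yet
    return min(recent) > rabbitmq_queue_threshold(profile)
```

Because the rule is per node, the check would be evaluated once for each RabbitMQ node's sample series.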

S3

SMAX | Pending Messages/Minute > 500 for more than 30 mins (mark for review)

Link | [ S3 - Warning ] [ farm-name ] Rabbitmq Messages/Minute alert | Runbook

SMAX | Message queue not equally distributed across cluster nodes (TBD)

Runbook

IDM

S4

SMAX | Active users (per profile: medium profile > 1100 for more than 30 mins, large profile > 3000 for more than 30 mins)

Link | [ S4 - Info ] [ farm-name ] IDM active users alert | Runbook

Gateway

S2

SMAX | Tomcat https connector currentThreadsBusy > 30 for 30 mins

(EU8-Prod) Tomcat https connector currentThreadsBusy > 30 for 30 mins, or > 60 for 15 mins, or > 90 for 5 mins

Link | [ S2 - Error ] [ farm-name ] Gateway Tomcat https connector currentThreadsBusy alert | Runbook
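
The EU8-Prod variant above combines three (threshold, duration) pairs with "or": the higher the value, the shorter it has to persist. A minimal sketch of that tiered evaluation follows, assuming one currentThreadsBusy sample per minute with the most recent last. The names `TOMCAT_TIERS`, `tier_breached` and `tiered_alert` are hypothetical.

```python
from typing import Iterable, Sequence, Tuple

# (threshold, minutes) pairs for the EU8-Prod currentThreadsBusy rule.
TOMCAT_TIERS: Tuple[Tuple[int, int], ...] = ((30, 30), (60, 15), (90, 5))


def tier_breached(samples: Sequence[float], threshold: float, minutes: int) -> bool:
    """True when each of the last `minutes` samples is above `threshold`."""
    recent = samples[-minutes:]
    return len(recent) == minutes and min(recent) > threshold


def tiered_alert(samples: Sequence[float],
                 tiers: Iterable[Tuple[int, int]] = TOMCAT_TIERS) -> bool:
    """True when any tier is breached (the tiers are combined with 'or')."""
    return any(tier_breached(samples, threshold, minutes)
               for threshold, minutes in tiers)
```

The same helper could cover the EU8-Prod Httpclient InUse rule by passing the tiers (20, 30), (30, 15) and (80, 5).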

S2

SMAX | Httpclient InUse > 20 for 30 mins

(EU8-Prod) Httpclient InUse > 20 for 30 mins, or > 30 for 15 mins, or > 80 for 5 mins

Link | [ S2 - Error ] [ farm-name ] Gateway Httpclient InUse alert | Runbook

Platform

S2

SMAX | Tomcat https connector currentThreadsBusy > 30 for 30 mins

(EU8-Prod) Tomcat https connector currentThreadsBusy > 30 for 30 mins, or > 60 for 15 mins, or > 90 for 5 mins

Link | [ S2 - Error ] [ farm-name ] Platform Tomcat https connector currentThreadsBusy alert | Runbook

S2

SMAX | Httpclient InUse > 20 for 30 mins

(EU8-Prod) Httpclient InUse > 20 for 30 mins, or > 30 for 15 mins, or > 80 for 5 mins

Link | [ S2 - Error ] [ farm-name ] Platform Httpclient InUse alert | Runbook

Serviceportal

S2

SMAX | Tomcat https connector currentThreadsBusy > 30 for 30 mins

(EU8-Prod) Tomcat https connector currentThreadsBusy > 30 for 30 mins, or > 60 for 15 mins, or > 90 for 5 mins

Link | [ S2 - Error ] [ farm-name ] Serviceportal Tomcat https connector currentThreadsBusy alert | Runbook

S2

SMAX | Httpclient InUse > 20 for 30 mins

(EU8-Prod) Httpclient InUse > 20 for 30 mins, or > 30 for 15 mins, or > 80 for 5 mins

Link | [ S2 - Error ] [ farm-name ] Serviceportal Httpclient InUse alert | Runbook

OpenSearch based Monitoring (TBD)

Access 5xx

Access Response time

Database level customer metrics

NativeSACM Transaction Context Queue

Link | NativeSACM Transaction Context Queue

NativeSACM Transaction Context Queue retries

Link | NativeSACM Transaction Context Queue retries

NativeSACM Transaction Context Queue stuck?

SLT Job queue

Link

TextDetection Job queue

Link

IndexEntities Job queue

Link

EntitiesHandler Job queue

Link

SLT Job Delay time[mins]

Link

TextDetection Job Delay time[mins]

Link

IndexEntities Job Delay time[mins]

Link

EntitiesHandler Job Delay time[mins]

Link
Instrumental

Method

Query

Others

When to scale out (overloaded)