ESM-Cloud-Unified-Monitoring-v1.1_686083891

Legends

S2

S3

S4

NEW

Check here for the severity definitions.
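
Alert messages throughout this guide use the pattern `[ <severity> - <label> ] [ farm-name ] <alert name> alert`. The following is a minimal sketch of that pattern, assuming only the five severity labels that appear in the messages on this page (S0 Urgent, S1 Critical, S2 Error, S3 Warning, S4 Info). The names `SEVERITY_LABELS` and `format_alert_message` are hypothetical and not part of any existing tooling.

```python
# Severity labels as they appear in the alert messages of this guide.
SEVERITY_LABELS = {
    "S0": "Urgent",
    "S1": "Critical",
    "S2": "Error",
    "S3": "Warning",
    "S4": "Info",
}


def format_alert_message(severity: str, farm_name: str, alert_name: str) -> str:
    """Build an alert message in the '[ Sx - Label ] [ farm ] name alert' form."""
    label = SEVERITY_LABELS[severity]  # raises KeyError for an unknown severity
    return f"[ {severity} - {label} ] [ {farm_name} ] {alert_name} alert"


if __name__ == "__main__":
    print(format_alert_message("S2", "farm-name", "RDS CPU Utilization"))
```

Keeping the label lookup in one place makes it harder for a message to carry a severity code and label that disagree.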

Introduction

This guide presents all the items related to monitoring the ESM product on SaaS.

Levels of monitoring

Alerts

Alerts are derived from monitoring data and operational experience.

Here is a reference list of items to be sent as alerts. The Grafana monitoring dashboards are developed based on the list below.

Monitoring Level | Category | Severity | Code | Alert Description AWS | Alert Description GCP | Sample Chart | Alert Message | Runbook AWS | Runbook GCP
Infrastructure

Compute

ALB HTTP 5XX Count (More than 34 in a 3 mins time frame) | N/A
Link | [ S0 - Urgent ] [ farm-name ] ALB HTTP 5XX Count alert | Runbook

S2

ALB Target 5xx Count | N/A | Link
Storage

S3

EBS Disk Queue Depth (EBS disk queue depth more than 5 for more than 10 mins) | Disk queue length avg (disk queue length is more than 5 for more than 10 mins)
Link | [ S3 - Warning ] [ farm-name ] EBS Disk Queue Depth alert | Runbook

S2

EBS Burst Balance Average (EBS burst balance below 40% for more than 30 mins) | N/A | Link | [ S2 - Error ] [ farm-name ] EBS Burst Balance Average alert | Runbook
EBS Burst Balance Average (EBS burst balance is below 0) | N/A | Link | [ S0 - Urgent ] [ farm-name ] EBS Burst Balance Average alert | Runbook

S2

EFS Burst Credit Balance (Burst credit below 40% for more than 15 mins) | N/A | Link | [ S2 - Error ] [ farm-name ] EFS Burst Credit Balance alert | Runbook
EFS Burst Credit Balance (Burst credit is 0) | N/A | Link | [ S0 - Urgent ] [ farm-name ] EFS Burst Credit Balance alert | Runbook
Disk average latency (?)
Filestore: Average read latency (?)
Filestore: Average write latency (?)
Filestore: Used space percent (?)

Virtualization

Database

S2

RDS CPU Utilization (CPU more than 97% for more than 30 mins) | CPU utilization (CPU more than 97% for more than 30 mins)
Link | [ S2 - Error ] [ farm-name ] RDS CPU Utilization alert | Runbook

S2

CPU (sy: system > 70% for more than 60 mins) | N/A | Link | [ S2 - Error ] [ farm-name ] RDS cpuUtilization System alert | Runbook

S2

CPU (si: soft interrupts > 15% for more than 60 mins) | N/A | Link | [ S2 - Error ] [ farm-name ] RDS CPU Soft Interrupts alert | Runbook

S3

Disk queue depth (EBS disk queue depth more than 5 for more than 10 mins) | IO wait (Total of IO_time,?) | Link | [ S3 - Warning ] [ farm-name ] RDS Disk queue depth alert | Runbook

S2

Disk (Free Storage Space is below 500 MB) | Disk (Free Storage Space = (1 - Disk Utilization) * Disk allocation / Disk Utilization is below 500 MB) | Link | [ S2 - Error ] [ farm-name ] RDS Disk Free Storage Space alert | Runbook

S2

Disk (Storage has enough space to auto-scale; alert when (Free Space + Max Autoscaling Storage - Allocated Storage) / Allocated Storage < 0.2) | Disk (Storage has enough space to auto-scale; alert when (Free Space + Max Autoscaling Storage - Allocated Storage) / Allocated Storage < 0.2) | Runbook
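
The auto-scale headroom check in the row above can be sketched as a small calculation. This is an illustrative example only: the function names are hypothetical, and it assumes all three inputs use the same unit (GiB here) and that the alert fires when the ratio drops below 0.2.

```python
def autoscale_headroom_ratio(free_gib: float,
                             max_autoscaling_gib: float,
                             allocated_gib: float) -> float:
    """(Free Space + Max Autoscaling Storage - Allocated Storage) / Allocated Storage."""
    if allocated_gib <= 0:
        raise ValueError("allocated storage must be positive")
    return (free_gib + max_autoscaling_gib - allocated_gib) / allocated_gib


def autoscale_headroom_alert(free_gib: float,
                             max_autoscaling_gib: float,
                             allocated_gib: float,
                             threshold: float = 0.2) -> bool:
    """True when the remaining headroom ratio is below the threshold."""
    ratio = autoscale_headroom_ratio(free_gib, max_autoscaling_gib, allocated_gib)
    return ratio < threshold


if __name__ == "__main__":
    # 5 GiB free, autoscaling capped at 110 GiB, 100 GiB allocated: ratio 0.15
    print(autoscale_headroom_alert(5, 110, 100))
```

A ratio near zero means the instance is close to both its current allocation and its autoscaling ceiling, so raising the maximum autoscaling storage is the likely follow-up.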

S2

Memory (Free memory less than 5% for more than 5 mins) | Memory components (sum of all components) (Free memory less than 5% for more than 5 mins) | Link | [ S2 - Error ] [ farm-name ] RDS Free Memory Percentage alert | Runbook
Memory (Free memory less than 2% for more than 5 mins) | Memory components (sum of all components) (Free memory less than 2% for more than 5 mins) | Link | [ S0 - Urgent ] [ farm-name ] RDS Free Memory Percentage alert | Runbook

S2

Storage (Burst Balance below 40% for more than 30 mins) | N/A | Link | [ S2 - Error ] [ farm-name ] RDS Burst Balance alert | Runbook
RDS Burst Balance (Burst Balance is 0) | N/A | Link | [ S0 - Urgent ] [ farm-name ] RDS Burst Balance alert | Runbook

S2

RDS DBLoad (AWS-specific, via Performance Insights, more than 2 times the number of CPUs for more than one hour) | Database load (via Query Insights, execution_time, more than 2 times the CPU capacity)

Link

Link

[ S2 - Error ] [ farm-name ] SMA RDS DBLoad alert

[ S2 - Error ] [ farm-name ] CMS RDS DBLoad alert

Runbook
RDS DBLoad (AWS-specific, via Performance Insights, more than 4 times the number of CPUs for more than one hour) | Database load (via Query Insights, execution_time, more than 4 times the CPU capacity)

Link

Link

[ S1 - Critical ] [ farm-name ] SMA RDS DBLoad alert

[ S1 - Critical ] [ farm-name ] CMS RDS DBLoad alert

Runbook

S3

RDS DBLoadNonCPU (AWS-specific, via Performance Insights, more than 1 times the number of CPUs for more than one hour) | IO wait time + Lock wait time (via Query Insights, more than 1 times the CPU capacity)

Link

Link

[ S3 - Warning ] [ farm-name ] SMA RDS DBLoadNonCPU alert

[ S3 - Warning ] [ farm-name ] CMS RDS DBLoadNonCPU alert

Runbook
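
The DBLoad thresholds above scale with the number of CPUs of the database instance. A minimal sketch of that classification follows, assuming the caller already has a list of DBLoad samples that covers the full one-hour window. How the samples are fetched is out of scope, and the function names are hypothetical.

```python
from typing import Optional, Sequence


def classify_db_load(samples: Sequence[float], cpus: int) -> Optional[str]:
    """Return 'S1', 'S2' or None for DBLoad samples covering one hour.

    "More than N times the number of CPUs for more than one hour" holds
    exactly when the lowest sample in the window is above N * cpus.
    """
    if not samples or cpus <= 0:
        return None
    floor = min(samples)
    if floor > 4 * cpus:
        return "S1"  # Critical
    if floor > 2 * cpus:
        return "S2"  # Error
    return None


def classify_db_load_non_cpu(samples: Sequence[float], cpus: int) -> Optional[str]:
    """Return 'S3' when DBLoadNonCPU stays above 1 x CPUs for the whole window."""
    if not samples or cpus <= 0:
        return None
    return "S3" if min(samples) > 1 * cpus else None


if __name__ == "__main__":
    print(classify_db_load([9.0, 10.0, 12.0], cpus=2))
    print(classify_db_load_non_cpu([3.0, 4.0], cpus=2))
```

Checking the higher multiplier first ensures a load that satisfies both rules is reported once, at the more severe level.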

Wait events (Total of all events,?)

Query latency (Total of all the latencies,?)

Locks (TBD)

Link | Block Session Count

Long active queries (TBD)

Link | long active query duration

Capture RDS top 10 query (TBD)

    1. Clean stat_statement daily
    2. Capture during runtime if CPU is more than 97% for 60 mins

Link

RDS top 10 query
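
As a rough illustration of the two capture steps above, the sketch below assumes the database is PostgreSQL with the `pg_stat_statements` extension enabled (suggested by the "stat_statement" wording, but not confirmed by this page) and PostgreSQL 13 or later, where the timing column is `total_exec_time` (older versions use `total_time`). The function names are hypothetical, and `cursor` stands for any DB-API cursor that accepts pyformat parameters, such as one from psycopg2.

```python
from typing import Any, List, Sequence

# Assumption: PostgreSQL 13+ column names in the pg_stat_statements view.
TOP_QUERIES_SQL = """
SELECT queryid, calls, total_exec_time, mean_exec_time, query
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT %(limit)s
"""

# Step 1: run once a day so the statistics cover a single day.
RESET_SQL = "SELECT pg_stat_statements_reset()"


def should_capture(cpu_pct_per_minute: Sequence[float],
                   threshold_pct: float = 97.0,
                   minutes: int = 60) -> bool:
    """Step 2 trigger: CPU above the threshold for the last `minutes` samples."""
    window = cpu_pct_per_minute[-minutes:]
    return len(window) == minutes and all(v > threshold_pct for v in window)


def capture_top_queries(cursor: Any, limit: int = 10) -> List[Any]:
    """Fetch the `limit` statements with the highest total execution time."""
    cursor.execute(TOP_QUERIES_SQL, {"limit": limit})
    return cursor.fetchall()


def reset_statement_stats(cursor: Any) -> None:
    """Clear accumulated statement statistics (daily clean-up)."""
    cursor.execute(RESET_SQL)
```

Ordering by total execution time surfaces the statements that consumed the most database time since the last reset, which is usually the first thing to check during a CPU spike.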

Dead tuple (TBD)

Link

Link

Link

dead tuple ems

dead tuple rms

dead tuple idm

OS (Node level)

CPU

S2

CPU more than 97% for more than 60 mins | Same as AWS | Link | [ S2 - Error ] [ farm-name ] Node CPU Usage alert | Runbook

S2

CPU (sy: system > 70% for more than 60 mins) (mark for review) | Same as AWS | Link | [ S2 - Error ] [ farm-name ] Node CPU System alert | Runbook

S2

CPU (si: soft interrupts > 15% for more than 60 mins) (mark for review) | Same as AWS | Link | [ S2 - Error ] [ farm-name ] Node CPU Soft Interrupts alert | Runbook

Memory

S3

Memory more than 95% for more than 10 mins | Same as AWS | Link | [ S3 - Warning ] [ farm-name ] Node Mem Usage alert | Runbook

Disk

S3

Disk usage more than 95% | Same as AWS | Link | [ S3 - Warning ] [ farm-name ] Node Disk Usage alert | Runbook

Disk read/write latency (TBD)

Same as AWS

Link

Link

Disk Read Latency

Disk Write Latency

S3

Inode usage > 97%

Same as AWS

Link | [ S3 - Warning ] [ farm-name ] Disk Inode Usage alert | Runbook

Node disk IO load (TBD)

Same as AWS

Link | Disk IOPS

Network

Network operation latency (TBD)

Same as AWS

Network transit error rate (TBD)

Same as AWS

Link | Network Transit Error Rate

Network transit drop rate (TBD)

Same as AWS

Link | Network Transit Drop Rate

Network transit queue length (TBD)

Same as AWS

Throughput / bandwidth (TBD)

Same as AWS

S3

Load (Load Avg 15m / core number > 200% for 35 mins) | Same as AWS | Link | [ S3 - Warning ] [ farm-name ] Node Load Avg 15m/core | Runbook
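
The node load rule normalizes the 15-minute load average by the number of cores before comparing it with 200%. The sketch below shows that normalization. The function names are hypothetical, the samples are assumed to cover the full 35-minute window, and `os.getloadavg()` is only available on Unix-like systems.

```python
import os
from typing import Sequence


def load_per_core_pct(load_avg_15m: float, cores: int) -> float:
    """15-minute load average expressed as a percentage of the core count."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return load_avg_15m / cores * 100.0


def current_load_per_core_pct() -> float:
    """Read the local 15-minute load average (Unix only) and normalize it."""
    _, _, load15 = os.getloadavg()
    return load_per_core_pct(load15, os.cpu_count() or 1)


def node_load_alert(samples_pct: Sequence[float],
                    threshold_pct: float = 200.0) -> bool:
    """True when every sample in the window is above the threshold."""
    return bool(samples_pct) and min(samples_pct) > threshold_pct


if __name__ == "__main__":
    # A load average of 8.0 on a 4-core node is 200% per core.
    print(load_per_core_pct(8.0, 4))
```

Normalizing by core count lets the same 200% threshold apply to nodes of different sizes.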
Container

CPU

S2

CPU (CPU more than 97% for more than 60 mins) | Same as AWS | Link | [ S2 - Error ] [ farm-name ] Pod CPU usage alert | Runbook

Memory

Swap usage

Same as AWS

Link | Pod Swap Usage

Disk

Disk read/write latency (TBD)

Same as AWS

Link

Link

Pod Disk Read Latency

Pod Disk Write Latency

S3

Inode usage(free/total) > 97%

Same as AWS

Link | [ S3 - Warning ] [ farm-name ] Pod Inode Usage alert | Runbook

Network

Network transit error rate (TBD)

Same as AWS

Link | Pod Network Transit Error Rate

Network transit drop rate (TBD)

Same as AWS

Link | Pod Network Transit Drop Rate

Unavailable service

SMAX | critical path unavailable: svc portal / runtime ui / gateway / platform / redis / rabbitmq / bo-login / idm / bo-ats / ingress-nginx / sma-ui / bo-farcade

Same as AWS

Link | [ S0 - Urgent ] [ farm-name ] SMA Unavailable k8s resource alert | Runbook

S2

SMAX | partial business impact: others not in S0, search related (content, DIH, DAH, search, proxy) / auto pass / bo-ui / bo-user

Same as AWS

Link | [ S2 - Error ] [ farm-name ] SMA Unavailable k8s resource alert | Runbook

S3

SMAX | no obvious impact on business: XMPP / XIE / Smart Ticket / stx / virtual agent / ppo / web socket gateway / smart-ui / ocr / smarta-installer

Same as AWS

Link | [ S3 - Warning ] [ farm-name ] SMA Unavailable k8s resource alert | Runbook

S4

SMAX | services outside of ESM / toolkit

Same as AWS

Link | [ S4 - Info ] [ farm-name ] SMA Unavailable k8s resource alert | Runbook

CMS | critical path unavailable: itom-cms-gateway, itom-idm, itom-ingress-controller, itom-ucmdb-browser, itom-ucmdb-solr, itom-ucmdb (both are down)

Same as AWS

Link | [ S0 - Urgent ] [ farm-name ] CMS Unavailable k8s resource alert | Runbook

S2

CMS | partial business impact: itom-autopass-lms, itom-vault, itom-ucmdb (either is down)

Same as AWS

Link | [ S2 - Error ] [ farm-name ] CMS Unavailable k8s resource alert | Runbook

S3

CMS | no obvious impact on business:

Same as AWS

S4

CMS | services outside of ESM / toolkit: itom-ucmdb-probe, itom-ucmdb-dfp-lunux-installer, itom-ucmdb-dfp-windows-installer, itom-ucmdb-localclient-installers

Same as AWS

Link | [ S4 - Info ] [ farm-name ] CMS Unavailable k8s resource alert | Runbook

Load

S3

Load Avg 15m/core number > 200% for 35 mins (TBD, because it's not observable via current metrics)

Same as AWS

Link | Pod Load Avg 10s | Runbook

Threads

container_threads on process (TBD)

Same as AWS

Link | Threads

Pod balancing (TBD)

App metrics

Thread

Connections

Limits

Smart Analytics

S3

SMAX | Content data ratio (total doc / committed doc) > 1.20

Same as AWS

Link | [ S3 - Warning ] [ farm-name ] SmartA Data Compact Ration alert | Runbook

Rabbitmq (each node)

S3

SMAX | queue > 200 / 250 for more than 30 mins (200 for medium profile or lower, 250 for large profile)

Same as AWS

Link | [ S3 - Warning ] [ farm-name ] Rabbitmq Queue alert | Runbook

S3

SMAX | Pending Messages/Minute > 500 for more than 30 mins (Mark for review)

Same as AWS

Link | [ S3 - Warning ] [ farm-name ] Rabbitmq Messages/Minute alert | Runbook

SMAX | Message queue not equally distributed to different cluster nodes (TBD)

Same as AWS

Runbook

IDM

S4

SMAX | Active user (per profile, medium profile > 1100 for more than 30 mins, large profile > 3000 for more than 30 mins)

Same as AWS

Link | [ S4 - Info ] [ farm-name ] IDM active users alert | Runbook
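
The RabbitMQ queue and IDM active-user rules both depend on the deployment profile. A small lookup along the following lines could keep those per-profile values in one place. This is a sketch under assumptions: the function names are hypothetical, profile names other than "medium" and "large" do not appear in this guide, and any profile that is not "large" is treated as "medium or lower" for the RabbitMQ rule.

```python
from typing import Optional

# Per-profile thresholds taken from the RabbitMQ and IDM rows of this guide.
ACTIVE_USER_THRESHOLDS = {"medium": 1100, "large": 3000}


def rabbitmq_queue_threshold(profile: str) -> int:
    """250 for the large profile, 200 for medium or lower."""
    return 250 if profile.strip().lower() == "large" else 200


def idm_active_user_threshold(profile: str) -> Optional[int]:
    """Active-user threshold, or None when the guide defines no value."""
    return ACTIVE_USER_THRESHOLDS.get(profile.strip().lower())


if __name__ == "__main__":
    for name in ("medium", "large"):
        print(name, rabbitmq_queue_threshold(name), idm_active_user_threshold(name))
```

Returning `None` for undefined profiles avoids silently applying a threshold the guide does not specify.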

Gateway

S2

SMAX | Tomcat https connector currentThreadsBusy > 30 for 30 mins

(EU8-Prod) Tomcat https connector currentThreadsBusy > 30 for 30 mins or Tomcat https connector currentThreadsBusy > 60 for 15 mins or Tomcat https connector currentThreadsBusy > 90 for 5 mins

Same as AWS

Link | [ S2 - Error ] [ farm-name ] Gateway Tomcat https connector currentThreadsBusy alert | Runbook

S2

SMAX | Httpclient InUse > 20 for 30 mins

(EU8-Prod) Httpclient InUse > 20 for 30 mins or Httpclient InUse > 30 for 15 mins or Httpclient InUse > 80 for 5 mins

Same as AWS

Link | [ S2 - Error ] [ farm-name ] Gateway Httpclient InUse alert | Runbook

Platform

S2

SMAX | Tomcat https connector currentThreadsBusy > 30 for 30 mins

(EU8-Prod) Tomcat https connector currentThreadsBusy > 30 for 30 mins or Tomcat https connector currentThreadsBusy > 60 for 15 mins or Tomcat https connector currentThreadsBusy > 90 for 5 mins

Same as AWS

Link | [ S2 - Error ] [ farm-name ] Platform Tomcat https connector currentThreadsBusy alert | Runbook

S2

SMAX | Httpclient InUse > 20 for 30 mins

(EU8-Prod) Httpclient InUse > 20 for 30 mins or Httpclient InUse > 30 for 15 mins or Httpclient InUse > 80 for 5 mins

Same as AWS

Link | [ S2 - Error ] [ farm-name ] Platform Httpclient InUse alert | Runbook

Serviceportal

S2

SMAX | Tomcat https connector currentThreadsBusy > 30 for 30 mins

(EU8-Prod) Tomcat https connector currentThreadsBusy > 30 for 30 mins or Tomcat https connector currentThreadsBusy > 60 for 15 mins or Tomcat https connector currentThreadsBusy > 90 for 5 mins

Same as AWS

Link | [ S2 - Error ] [ farm-name ] Serviceportal Tomcat https connector currentThreadsBusy alert | Runbook

S2

SMAX | Httpclient InUse > 20 for 30 mins

(EU8-Prod) Httpclient InUse > 20 for 30 mins or Httpclient InUse > 30 for 15 mins or Httpclient InUse > 80 for 5 mins

Same as AWS

Link | [ S2 - Error ] [ farm-name ] Serviceportal Httpclient InUse alert | Runbook
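
The EU8-Prod rules for Gateway, Platform and Serviceportal combine several (threshold, duration) pairs with "or": a lower value must persist longer, while a higher value triggers sooner. A minimal sketch of that evaluation follows, assuming one sample per minute with the most recent sample last. The constant and function names are hypothetical.

```python
from typing import Iterable, Sequence, Tuple

# (threshold, minutes) pairs taken from the EU8-Prod rules in this guide.
TOMCAT_BUSY_THREADS_RULES = ((30, 30), (60, 15), (90, 5))
HTTPCLIENT_IN_USE_RULES = ((20, 30), (30, 15), (80, 5))


def breaches_any_window(samples_per_minute: Sequence[float],
                        rules: Iterable[Tuple[float, int]]) -> bool:
    """True if, for any rule, the last `minutes` samples all exceed `threshold`."""
    for threshold, minutes in rules:
        window = samples_per_minute[-minutes:]
        # Require a full window so a short history cannot trigger a long rule.
        if len(window) == minutes and all(v > threshold for v in window):
            return True
    return False


if __name__ == "__main__":
    # Five minutes above 90 busy threads trips the shortest Tomcat rule.
    print(breaches_any_window([95] * 5, TOMCAT_BUSY_THREADS_RULES))
```

Farms other than EU8-Prod use only the first pair of each rule set, which corresponds to passing a single-element rule list.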

OpenSearch based Monitoring (TBD)

Access 5xx

Access Response time

Database level customer metrics

NativeSACM Transaction Context Queue

Same as AWS

Link | NativeSACM Transaction Context Queue

NativeSACM Transaction Context Queue retries

Same as AWS

Link | NativeSACM Transaction Context Queue retries

NativeSACM Transaction Context Queue stuck?

Same as AWS

SLT Job queue

Same as AWS

Link

TextDetection Job queue

Same as AWS

Link

IndexEntities Job queue

Same as AWS

Link

EntitiesHandler Job queue

Same as AWS

Link

SLT Job Delay time[mins]

Same as AWS

Link

TextDetection Job Delay time[mins]

Same as AWS

Link

IndexEntities Job Delay time[mins]

Same as AWS

Link

EntitiesHandler Job Delay time[mins]

Same as AWS

Link
Instrumental

Method

Query

Others

When to scale out (overloaded)