ESM-Cloud-Unified-Monitoring-v1.1_686083891

Legends

NEW

Check here for the severity definitions.

Introduction

This guide presents all the items related to monitoring the ESM product on SaaS.

Levels of monitoring

Alerts

Alerts come with monitoring and experience.

Here is a reference list of items to be sent as alerts. The Grafana monitoring dashboards are developed based on the list below.

Monitoring Level	Category	Severity	Alert Description AWS	Alert Description GCP	Sample Chart	Alert Message	Runbook AWS
Infrastructure	Compute		ALB HTTP 5XX Count (More than 34 in a 3 mins time frame)	N/A	Link	[ S0 - Urgent ] [ farm-name ] ALB HTTP 5XX Count alert	Runbook
	Compute	S2	ALB Target 5xx Count	N/A	Link
	Storage	S3	EBS Disk Queue Depth (EBS disk queue depth more than 5 for more than 10 mins)	Disk queue length avg (disk queue length is more than 5 for more than 10 mins)	Link	[ S3 - Warning ] [ farm-name ] EBS Disk Queue Depth alert	Runbook
		S2	EBS Burst Balance Average (EBS burst balance below 40% for more than 30 mins )	N/A	Link	[ S2 - Error ] [ farm-name ] EBS Burst Balance Average alert	Runbook
			EBS Burst Balance Average (EBS burst balance is below 0)	N/A	Link	[ S0 - Urgent ] [ farm-name ] EBS Burst Balance Average alert	Runbook
		S2	EFS Burst Credit Balance (Burst credit below 40% for more than 15 mins )	N/A	Link	[ S2 - Error ] [ farm-name ] EFS Burst Credit Balance alert	Runbook
			EFS Burst Credit Balance (Burst credit is 0)	N/A	Link	[ S0 - Urgent ] [ farm-name ] EFS Burst Credit Balance alert	Runbook
				Disk average latency (?)
				Filestore: Average read latency (?)
				Filestore: Average write latency (?)
				Filestore: Used space percent (?)
	Virtualization
	Database	S2	RDS CPU Utilization (CPU more than 97% for more than 30 mins)	CPU utilization (CPU more than 97% for more than 30 mins)	Link	[ S2 - Error ] [ farm-name ] RDS CPU Utilization alert	Runbook
		S2	CPU (sy: system >70% for more than 60 mins )	N/A	Link	[ S2 - Error ] [ farm-name ] RDS cpuUtilization System alert	Runbook
		S2	CPU (si: soft interrupts > 15% for more than 60 mins )	N/A	Link	[ S2 - Error ] [ farm-name ] RDS CPU Soft Interrupts alert	Runbook
		S3	Disk queue depth (EBS disk queue depth more than 5 for more than 10 mins)	IO wait (Total of IO_time,?)	Link	[ S3 - Warning ] [ farm-name ] RDS Disk queue depth alert	Runbook
		S2	Disk (Free Storage Space is below 500 MB)	Disk (Free Storage Space= (1-Disk Utilization)* Disk allocation / Disk Utilization is below 500 MB)	Link	[ S2 - Error ] [ farm-name ] RDS Disk Free Storage Space alert	Runbook
		S2	Disk (Storage has enough space to auto-scale, (Free Space + Max Autoscaling Storage - Allocated Storage) / Allocated Storage < 0.2 )	Disk (Storage has enough space to auto-scale, (Free Space + Max Autoscaling Storage - Allocated Storage) / Allocated Storage < 0.2 )			Runbook
		S2	Memory (Free memory less than 5% for more than 5 mins)	Memory components(sum of all components) (Free memory less than 5% for more than 5 mins)	Link	[ S2 - Error ] [ farm-name ] RDS Free Memory Percentage alert	Runbook
			Memory (Free memory less than 2% for more than 5 mins)	Memory components(sum of all components) (Free memory less than 2% for more than 5 mins)	Link	[ S0 - Urgent ] [ farm-name ] RDS Free Memory Percentage alert	Runbook
		S2	Storage (Burst Balance below 40% for more than 30 mins )	N/A	Link	[ S2 - Error ] [ farm-name ] RDS Burst Balance alert	Runbook
			RDS Burst Balance (Burst Balance is 0)	N/A	Link	[ S0 - Urgent ] [ farm-name ] RDS Burst Balance alert	Runbook
		S2	RDS DBLoad (AWS Specific, via performance insight, more than 2 times of CPU number for more than one hour)	Database load (via query insight, execution_time, more than 2 times of CPU capacity)	Link Link	[ S2 - Error ] [ farm-name ] SMA RDS DBLoad alert [ S2 - Error ] [ farm-name ] CMS RDS DBLoad alert	Runbook
			RDS DBLoad (AWS Specific, via performance insight, more than 4 times of CPU number for more than one hour)	Database load (via query insight, execution_time, more than 4 times of CPU capacity)	Link Link	[ S1 - Critical ] [ farm-name ] SMA RDS DBLoad alert [ S1 - Critical ] [ farm-name ] CMS RDS DBLoad alert	Runbook
		S3	RDS DBLoadNonCPU (AWS Specific, via performance insight, more than 1 times of CPU number for more than one hour)	IO wait time+Lock wait time (via query insight, more than 1 times of CPU capacity)	Link Link	[ S3 - Warning ] [ farm-name ] SMA RDS DBLoadNonCPU alert [ S3 - Warning ] [ farm-name ] CMS RDS DBLoadNonCPU alert	Runbook
				Wait events (Total of all events,?)
				Query latency (Total of all the latencies,?)
			Locks (TBD)		Link	Block Session Count
			Long active queries (TBD)		Link	long active query duration
			Capture RDS top 10 query (TBD) Clean stat_statement daily capture during runtime if CPU is more than 97% for 60 mins		Link	RDS top 10 query
			Dead tuple (TBD)		Link Link Link	dead tuple ems dead tuple rms dead tuple idm
OS (Node level)	CPU	S2	CPU more than 97% for more than 60 mins	Same as AWS	Link	[ S2 - Error ] [ farm-name ] Node CPU Usage alert	Runbook
		S2	CPU (sy: system >70% for more than 60 mins )(mark for review)	Same as AWS	Link	[ S2 - Error ] [ farm-name ] Node CPU System alert	Runbook
		S2	CPU (si: soft interrupts > 15% for more than 60 mins )(mark for review)	Same as AWS	Link	[ S2 - Error ] [ farm-name ] Node CPU Soft Interrupts alert	Runbook
	Memory	S3	Memory more than 95% for more than 10 mins	Same as AWS	Link	[ S3 - Warning ] [ farm-name ] Node Mem Usage alert	Runbook
	Disk	S3	Disk usage more than 95%	Same as AWS	Link	[ S3 - Warning ] [ farm-name ] Node Disk Usage alert	Runbook
			Disk read/write latency (TBD)	Same as AWS	Link Link	Disk Read Latency Disk Write Latency
		S3	Inode usage > 97%	Same as AWS	Link	[ S3 - Warning ] [ farm-name ] Disk Inode Usage alert	Runbook
			Node disk IO load (TBD)	Same as AWS	Link	Disk IOPS
	Network		network operation latency(TBD)	Same as AWS
			network transit error rate(TBD)	Same as AWS	Link	Network Transit Error Rate
			network transit drop rate(TBD)	Same as AWS	Link	Network Transit Drop Rate
			network transit queue length(TBD)	Same as AWS
			Throughput / bandwidth (TBD)	Same as AWS
		S3	Load (Load Avg 15m/core number > 200% for 35 mins )	Same as AWS	Link	[ S3 - Warning ] [ farm-name ] Node Load Avg 15m/core	Runbook
Container	CPU	S2	CPU (CPU more than 97% for more than 60 mins)	Same as AWS	Link	[ S2 - Error ] [ farm-name ] Pod CPU usage alert	Runbook
	Memory		swap usage	Same as AWS	Link	Pod Swap Usage
	Disk		Disk read/write latency (TBD)	Same as AWS	Link Link	Pod Disk Read Latency Pod Disk Write Latency
	Disk	S3	Inode usage(free/total) > 97%	Same as AWS	Link	[ S3 - Warning ] [ farm-name ] Pod Inode Usage alert	Runbook
	Network		network transit error rate(TBD)	Same as AWS	Link	Pod Network Transit Error Rate
	Network		network transit drop rate(TBD)	Same as AWS	Link	Pod Network Transit Drop Rate
	Unavailable service		SMAX critical path unavailable: svc portal / runtime ui / gateway / platform / redis / rabbitmq / bo-login / idm / bo-ats / ingress-nginx / sma-ui / bo-farcade	Same as AWS	Link	[ S0 - Urgent ] [ farm-name ] SMA Unavailable k8s resource alert	Runbook
		S2	SMAX partial impact on business: others not in S0, search related (content, DIH, DAH, search, proxy) / auto pass / bo-ui / bo-user	Same as AWS	Link	[ S2 - Error ] [ farm-name ] SMA Unavailable k8s resource alert	Runbook
		S3	SMAX no obvious impact on business: XMPP / XIE / Smart Ticket / stx / virtual agent / ppo / web socket gateway / smart-ui / ocr / smarta-installer	Same as AWS	Link	[ S3 - Warning ] [ farm-name ] SMA Unavailable k8s resource alert	Runbook
		S4	SMAX services outside of ESM / toolkit	Same as AWS	Link	[ S4 - Info ] [ farm-name ] SMA Unavailable k8s resource alert	Runbook
			CMS critical path unavailable: itom-cms-gateway, itom-idm, itom-ingress-controller, itom-ucmdb-browser, itom-ucmdb-solr, itom-ucmdb (both are down)	Same as AWS	Link	[ S0 - Urgent ] [ farm-name ] CMS Unavailable k8s resource alert	Runbook
		S2	CMS partial impact on business: itom-autopass-lms, itom-vault, itom-ucmdb (either is down)	Same as AWS	Link	[ S2 - Error ] [ farm-name ] CMS Unavailable k8s resource alert	Runbook
		S3	CMS no obvious impact on business:	Same as AWS
		S4	CMS services outside of ESM / toolkit: itom-ucmdb-probe, itom-ucmdb-dfp-linux-installer, itom-ucmdb-dfp-windows-installer, itom-ucmdb-localclient-installers	Same as AWS	Link	[ S4 - Info ] [ farm-name ] CMS Unavailable k8s resource alert	Runbook
	Load	S3	Load Avg 15m/core number > 200% for 35 mins (TBD, because it's not observable via current metrics)	Same as AWS	Link	Pod Load Avg 10s	Runbook
	Threads		container_threads on process (TBD)	Same as AWS	Link	Threads
	Pod balancing (TBD)
App metrics	Thread
	Connections
	Limits
	Smart Analytics	S3	SMAX Content data ratio (total doc/committed doc) > 1.20	Same as AWS	Link	[ S3 - Warning ] [ farm-name ] SmartA Data Compact Ration alert	Runbook
	Rabbitmq (each node)	S3	SMAX queue > 200 / 250 for more than 30 mins (200 for medium profile or lower, 250 for large profile)	Same as AWS	Link	[ S3 - Warning ] [ farm-name ] Rabbitmq Queue alert	Runbook
		S3	SMAX Pending Messages/Minute > 500 for more than 30 mins (Mark for review)	Same as AWS	Link	[ S3 - Warning ] [ farm-name ] Rabbitmq Messages/Minute alert	Runbook
			SMAX Message queue not equally distributed to different cluster nodes (TBD)	Same as AWS			Runbook
	IDM	S4	SMAX Active user (per profile, medium profile > 1100 for more than 30 mins, large profile > 3000 for more than 30 mins)	Same as AWS	Link	[ S4 - Info ] [ farm-name ] IDM active users alert	Runbook
	Gateway	S2	SMAX Tomcat https connector currentThreadsBusy > 30 for 30 mins (EU8-Prod) Tomcat https connector currentThreadsBusy > 30 for 30 mins or Tomcat https connector currentThreadsBusy > 60 for 15 mins or Tomcat https connector currentThreadsBusy > 90 for 5 mins	Same as AWS	Link	[ S2 - Error ] [ farm-name ] Gateway Tomcat https connector currentThreadsBusy alert	Runbook
		S2	SMAX Httpclient InUse > 20 for 30 mins (EU8-Prod) Httpclient InUse > 20 for 30 mins or Httpclient InUse > 30 for 15 mins or Httpclient InUse > 80 for 5 mins	Same as AWS	Link	[ S2 - Error ] [ farm-name ] Gateway Httpclient InUse alert	Runbook
	Platform	S2	SMAX Tomcat https connector currentThreadsBusy > 30 for 30 mins (EU8-Prod) Tomcat https connector currentThreadsBusy > 30 for 30 mins or Tomcat https connector currentThreadsBusy > 60 for 15 mins or Tomcat https connector currentThreadsBusy > 90 for 5 mins	Same as AWS	Link	[ S2 - Error ] [ farm-name ] Platform Tomcat https connector currentThreadsBusy alert	Runbook
		S2	SMAX Httpclient InUse > 20 for 30 mins (EU8-Prod) Httpclient InUse > 20 for 30 mins or Httpclient InUse > 30 for 15 mins or Httpclient InUse > 80 for 5 mins	Same as AWS	Link	[ S2 - Error ] [ farm-name ] Platform Httpclient InUse alert	Runbook
	Serviceportal	S2	SMAX Tomcat https connector currentThreadsBusy > 30 for 30 mins (EU8-Prod) Tomcat https connector currentThreadsBusy > 30 for 30 mins or Tomcat https connector currentThreadsBusy > 60 for 15 mins or Tomcat https connector currentThreadsBusy > 90 for 5 mins	Same as AWS	Link	[ S2 - Error ] [ farm-name ] Serviceportal Tomcat https connector currentThreadsBusy alert	Runbook
		S2	SMAX Httpclient InUse > 20 for 30 mins (EU8-Prod) Httpclient InUse > 20 for 30 mins or Httpclient InUse > 30 for 15 mins or Httpclient InUse > 80 for 5 mins	Same as AWS	Link	[ S2 - Error ] [ farm-name ] Serviceportal Httpclient InUse alert	Runbook
	OpenSearch based Monitoring (TBD)		Access 5xx
			Access Response time
	Database level customer metrics		NativeSACM Transaction Context Queue	Same as AWS	Link	NativeSACM Transaction Context Queue
			NativeSACM Transaction Context Queue retries	Same as AWS	Link	NativeSACM Transaction Context Queue retries
			NativeSACM Transaction Context Queue stuck?	Same as AWS
			SLT Job queue	Same as AWS	Link
			TextDetection Job queue	Same as AWS	Link
			IndexEntities Job queue	Same as AWS	Link
			EntitiesHandler Job queue	Same as AWS	Link
			SLT Job Delay time[mins]	Same as AWS	Link
			TextDetection Job Delay time[mins]	Same as AWS	Link
			IndexEntities Job Delay time[mins]	Same as AWS	Link
			EntitiesHandler Job Delay time[mins]	Same as AWS	Link
Instrumental	Method
	Query
Others			When to scale out (overloaded)
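
Several rows in the table combine a threshold with a duration, and the RDS auto-scale row relies on a small formula. The Python sketch below only illustrates how those conditions read; it is not the Grafana rule definition used on the farms. All function and class names (format_alert, autoscale_headroom, Window, breaches) are hypothetical, and the sketch assumes one metric sample per minute.

```python
from dataclasses import dataclass

# Severity labels as they appear in the alert messages of the table.
SEVERITY_LABELS = {
    "S0": "Urgent", "S1": "Critical", "S2": "Error", "S3": "Warning", "S4": "Info",
}


def format_alert(severity: str, farm_name: str, alert_name: str) -> str:
    """Build a message shaped like '[ S2 - Error ] [ farm-name ] <name> alert'."""
    label = SEVERITY_LABELS[severity]
    return f"[ {severity} - {label} ] [ {farm_name} ] {alert_name} alert"


def autoscale_headroom(free: float, max_autoscaling: float, allocated: float) -> float:
    """(Free Space + Max Autoscaling Storage - Allocated Storage) / Allocated Storage.

    The table raises the S2 alert when this ratio is below 0.2.
    All three arguments must use the same unit (for example GiB).
    """
    return (free + max_autoscaling - allocated) / allocated


@dataclass(frozen=True)
class Window:
    threshold: float  # value that must be exceeded
    minutes: int      # number of consecutive one-minute samples


def breaches(samples: list[float], windows: list[Window]) -> bool:
    """True if, for any window, the latest `minutes` samples all exceed `threshold`."""
    for w in windows:
        recent = samples[-w.minutes:]
        # Require a full window of data before deciding (assumption of this sketch).
        if len(recent) == w.minutes and all(v > w.threshold for v in recent):
            return True
    return False


# Tomcat currentThreadsBusy rule from the table:
# > 30 for 30 mins, or > 60 for 15 mins, or > 90 for 5 mins.
TOMCAT_BUSY = [Window(30, 30), Window(60, 15), Window(90, 5)]
```

For example, breaches([95] * 5, TOMCAT_BUSY) is true because the last five samples exceed 90, while ten samples of 40 do not yet fill the 30-minute window and so do not fire.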
