ESM-Cloud-Unified-Monitoring_686074338

Legends

NEW

Check here for the severity definitions.

Introduction

This guide presents all the items related to monitoring the ESM product on SaaS.

Levels of monitoring

Alerts

Alerts come with monitoring and experience.

Here is a reference list of items to be sent as alerts. The Grafana monitoring dashboards are developed based on the list below.

Monitoring Level	Category	Severity	Alert Description	Sample Chart	Alert Message	Runbook
Infrastructure	Compute	S0	ALB HTTP 5XX Count (more than 34 in a 3-minute time frame)	Link	[ S0 - Urgent ] [ farm-name ] ALB HTTP 5XX Count alert	Runbook
	Compute	S2	ALB Target 5xx Count	Link
	Storage	S3	EBS Disk Queue Depth (EBS disk queue depth more than 5 for more than 10 mins)	Link	[ S3 - Warning ] [ farm-name ] EBS Disk Queue Depth alert	Runbook
		S2	EBS Burst Balance Average (EBS burst balance below 40% for more than 30 mins)	Link	[ S2 - Error ] [ farm-name ] EBS Burst Balance Average alert	Runbook
		S0	EBS Burst Balance Average (EBS burst balance is below 0)	Link	[ S0 - Urgent ] [ farm-name ] EBS Burst Balance Average alert	Runbook
		S2	EFS Burst Credit Balance (Burst credit below 40% for more than 15 mins)	Link	[ S2 - Error ] [ farm-name ] EFS Burst Credit Balance alert	Runbook
		S0	EFS Burst Credit Balance (Burst credit is 0)	Link	[ S0 - Urgent ] [ farm-name ] EFS Burst Credit Balance alert	Runbook
	Virtualization
	Database	S2	RDS CPU Utilization (CPU more than 97% for more than 30 mins)	Link	[ S2 - Error ] [ farm-name ] RDS CPU Utilization alert	Runbook
		S2	CPU (sy: system > 70% for more than 60 mins)	Link	[ S2 - Error ] [ farm-name ] RDS cpuUtilization System alert	Runbook
		S2	CPU (si: soft interrupts > 15% for more than 60 mins)	Link	[ S2 - Error ] [ farm-name ] RDS CPU Soft Interrupts alert	Runbook
		S3	Disk queue depth (EBS disk queue depth more than 5 for more than 10 mins)	Link	[ S3 - Warning ] [ farm-name ] RDS Disk queue depth alert	Runbook
		S2	Disk (Free Storage Space is below 500 MB)	Link	[ S2 - Error ] [ farm-name ] RDS Disk Free Storage Space alert	Runbook
		S2	Disk (Storage does not have enough space left to auto-scale: (Free Space + Max Autoscaling Storage - Allocated Storage) / Allocated Storage < 0.2)			Runbook
		S2	Memory (Free memory less than 5% for more than 5 mins)	Link	[ S2 - Error ] [ farm-name ] RDS Free Memory Percentage alert	Runbook
		S0	Memory (Free memory less than 2% for more than 5 mins)	Link	[ S0 - Urgent ] [ farm-name ] RDS Free Memory Percentage alert	Runbook
		S2	Storage (Burst Balance below 40% for more than 30 mins)	Link	[ S2 - Error ] [ farm-name ] RDS Burst Balance alert	Runbook
		S0	RDS Burst Balance (Burst Balance is 0)	Link	[ S0 - Urgent ] [ farm-name ] RDS Burst Balance alert	Runbook
		S2	RDS DBLoad (AWS-specific, via Performance Insights, more than 2 times the number of CPUs for more than one hour)	Link Link	[ S2 - Error ] [ farm-name ] SMA RDS DBLoad alert [ S2 - Error ] [ farm-name ] CMS RDS DBLoad alert	Runbook
		S1	RDS DBLoad (AWS-specific, via Performance Insights, more than 4 times the number of CPUs for more than one hour)	Link Link	[ S1 - Critical ] [ farm-name ] SMA RDS DBLoad alert [ S1 - Critical ] [ farm-name ] CMS RDS DBLoad alert	Runbook
		S3	RDS DBLoadNonCPU (AWS-specific, via Performance Insights, more than 1 times the number of CPUs for more than one hour)	Link Link	[ S3 - Warning ] [ farm-name ] SMA RDS DBLoadNonCPU alert [ S3 - Warning ] [ farm-name ] CMS RDS DBLoadNonCPU alert	Runbook
			Locks (TBD)	Link	Block Session Count
			Long active queries (TBD)	Link	long active query duration
			Capture RDS top 10 queries (TBD): clean stat_statement daily; capture during runtime if CPU is more than 97% for 60 mins	Link	RDS top 10 query
			Dead tuple (TBD)	Link Link Link	dead tuple ems dead tuple rms dead tuple idm
OS (Node level)	CPU	S2	CPU more than 97% for more than 60 mins	Link	[ S2 - Error ] [ farm-name ] Node CPU Usage alert	Runbook
		S2	CPU (sy: system > 70% for more than 60 mins) (mark for review)	Link	[ S2 - Error ] [ farm-name ] Node CPU System alert	Runbook
		S2	CPU (si: soft interrupts > 15% for more than 60 mins) (mark for review)	Link	[ S2 - Error ] [ farm-name ] Node CPU Soft Interrupts alert	Runbook
	Memory	S3	Memory more than 95% for more than 10 mins	Link	[ S3 - Warning ] [ farm-name ] Node Mem Usage alert	Runbook
	Disk	S3	Disk usage more than 95%	Link	[ S3 - Warning ] [ farm-name ] Node Disk Usage alert	Runbook
			Disk read/write latency (TBD)	Link Link	Disk Read Latency Disk Write Latency
		S3	Inode usage > 97%	Link	[ S3 - Warning ] [ farm-name ] Disk Inode Usage alert	Runbook
			Node disk IO load (TBD)	Link	Disk IOPS
	Network		Network operation latency (TBD)
			Network transit error rate (TBD)	Link	Network Transit Error Rate
			Network transit drop rate (TBD)	Link	Network Transit Drop Rate
			Network transit queue length (TBD)
			Throughput / bandwidth (TBD)
		S3	Load (Load Avg 15m/core number > 200% for 35 mins )	Link	[ S3 - Warning ] [ farm-name ] Node Load Avg 15m/core	Runbook
Container	CPU	S2	CPU (CPU more than 97% for more than 60 mins)	Link	[ S2 - Error ] [ farm-name ] Pod CPU usage alert	Runbook
	Memory		swap usage	Link	Pod Swap Usage
	Disk		Disk read/write latency (TBD)	Link Link	Pod Disk Read Latency Pod Disk Write Latency
	Disk	S3	Inode usage (free/total) > 97%	Link	[ S3 - Warning ] [ farm-name ] Pod Inode Usage alert	Runbook
	Network		Network transit error rate (TBD)	Link	Pod Network Transit Error Rate
	Network		Network transit drop rate (TBD)	Link	Pod Network Transit Drop Rate
	Unavailable service	S0	SMAX critical path unavailable: svc portal / runtime ui / gateway / platform / redis / rabbitmq / bo-login / idm / bo-ats / ingress-nginx / sma-ui / bo-farcade	Link	[ S0 - Urgent ] [ farm-name ] SMA Unavailable k8s resource alert	Runbook
		S2	SMAX partial business impact: others not in S0, search related (content, DIH, DAH, search, proxy) / auto pass / bo-ui / bo-user	Link	[ S2 - Error ] [ farm-name ] SMA Unavailable k8s resource alert	Runbook
		S3	SMAX no obvious business impact: XMPP / XIE / Smart Ticket / stx / virtual agent / ppo / web socket gateway / smart-ui / ocr / smarta-installer	Link	[ S3 - Warning ] [ farm-name ] SMA Unavailable k8s resource alert	Runbook
		S4	SMAX services outside of ESM / toolkit	Link	[ S4 - Info ] [ farm-name ] SMA Unavailable k8s resource alert	Runbook
		S0	CMS critical path unavailable: itom-cms-gateway, itom-idm, itom-ingress-controller, itom-ucmdb-browser, tom-ucmdb-solr, itom-ucmdb (both are down)	Link	[ S0 - Urgent ] [ farm-name ] CMS Unavailable k8s resource alert	Runbook
		S2	CMS partial business impact: itom-autopass-lms, itom-vault, itom-ucmdb (either is down)	Link	[ S2 - Error ] [ farm-name ] CMS Unavailable k8s resource alert	Runbook
		S3	CMS no obvious business impact:
		S4	CMS services outside of ESM / toolkit: itom-ucmdb-probe, itom-ucmdb-dfp-lunux-installer, itom-ucmdb-dfp-windows-installer, itom-ucmdb-localclient-installers	Link	[ S4 - Info ] [ farm-name ] CMS Unavailable k8s resource alert	Runbook
	Load	S3	Load Avg 15m/core number > 200% for 35 mins (TBD, because it's not observable via current metrics)	Link	Pod Load Avg 10s	Runbook
	Threads		container_threads on process (TBD)	Link	Threads
	Pod balancing (TBD)
App metrics	Thread
	Connections
	Limits
	Smart Analytics	S3	SMAX Content data ratio (total doc/committed doc) > 1.20	Link	[ S3 - Warning ] [ farm-name ] SmartA Data Compact Ration alert	Runbook
	Rabbitmq (each node)	S3	SMAX queue > 200 / 250 for more than 30 mins (200 for medium profile or lower, 250 for large profile)	Link	[ S3 - Warning ] [ farm-name ] Rabbitmq Queue alert	Runbook
		S3	SMAX Pending Messages/Minute > 500 for more than 30 mins (mark for review)	Link	[ S3 - Warning ] [ farm-name ] Rabbitmq Messages/Minute alert	Runbook
			SMAX Message queue not equally distributed across cluster nodes (TBD)			Runbook
	IDM	S4	SMAX Active users (per profile: medium profile > 1100 for more than 30 mins, large profile > 3000 for more than 30 mins)	Link	[ S4 - Info ] [ farm-name ] IDM active users alert	Runbook
	Gateway	S2	SMAX Tomcat https connector currentThreadsBusy > 30 for 30 mins (EU8-Prod) Tomcat https connector currentThreadsBusy > 30 for 30 mins or Tomcat https connector currentThreadsBusy > 60 for 15 mins or Tomcat https connector currentThreadsBusy > 90 for 5 mins	Link	[ S2 - Error ] [ farm-name ] Gateway Tomcat https connector currentThreadsBusy alert	Runbook
		S2	SMAX Httpclient InUse > 20 for 30 mins (EU8-Prod) Httpclient InUse > 20 for 30 mins or Httpclient InUse > 30 for 15 mins or Httpclient InUse > 80 for 5 mins	Link	[ S2 - Error ] [ farm-name ] Gateway Httpclient InUse alert	Runbook
	Platform	S2	SMAX Tomcat https connector currentThreadsBusy > 30 for 30 mins (EU8-Prod) Tomcat https connector currentThreadsBusy > 30 for 30 mins or Tomcat https connector currentThreadsBusy > 60 for 15 mins or Tomcat https connector currentThreadsBusy > 90 for 5 mins	Link	[ S2 - Error ] [ farm-name ] Platform Tomcat https connector currentThreadsBusy alert	Runbook
		S2	SMAX Httpclient InUse > 20 for 30 mins (EU8-Prod) Httpclient InUse > 20 for 30 mins or Httpclient InUse > 30 for 15 mins or Httpclient InUse > 80 for 5 mins	Link	[ S2 - Error ] [ farm-name ] Platform Httpclient InUse alert	Runbook
	Serviceportal	S2	SMAX Tomcat https connector currentThreadsBusy > 30 for 30 mins (EU8-Prod) Tomcat https connector currentThreadsBusy > 30 for 30 mins or Tomcat https connector currentThreadsBusy > 60 for 15 mins or Tomcat https connector currentThreadsBusy > 90 for 5 mins	Link	[ S2 - Error ] [ farm-name ] Serviceportal Tomcat https connector currentThreadsBusy alert	Runbook
		S2	SMAX Httpclient InUse > 20 for 30 mins (EU8-Prod) Httpclient InUse > 20 for 30 mins or Httpclient InUse > 30 for 15 mins or Httpclient InUse > 80 for 5 mins	Link	[ S2 - Error ] [ farm-name ] Serviceportal Httpclient InUse alert	Runbook
	OpenSearch based Monitoring (TBD)		Access 5xx
			Access Response time
	Database level customer metrics		NativeSACM Transaction Context Queue	Link	NativeSACM Transaction Context Queue
			NativeSACM Transaction Context Queue retries	Link	NativeSACM Transaction Context Queue retries
			NativeSACM Transaction Context Queue stuck?
			SLT Job queue	Link
			TextDetection Job queue	Link
			IndexEntities Job queue	Link
			EntitiesHandler Job queue	Link
			SLT Job Delay time[mins]	Link
			TextDetection Job Delay time[mins]	Link
			IndexEntities Job Delay time[mins]	Link
			EntitiesHandler Job Delay time[mins]	Link
Instrumental	Method
	Query
Others			When to scale out (overloaded)
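
Several rows in the table above share the same mechanics: a value must stay above a threshold for a sustained window (for example, currentThreadsBusy > 30 for 30 mins or > 60 for 15 mins or > 90 for 5 mins), the RDS auto-scale check is a ratio compared against 0.2, and alert messages follow the "[ severity - label ] [ farm-name ] name alert" pattern. The following is a minimal Python sketch of those three pieces, not the actual Grafana rule definitions. The names Window, breached, rds_autoscale_headroom and format_alert are hypothetical, and the one-sample-per-minute spacing is an assumption.

```python
from dataclasses import dataclass
from typing import Sequence

# Severity labels as they appear in the alert messages of the table.
SEVERITY_LABELS = {
    "S0": "Urgent", "S1": "Critical", "S2": "Error", "S3": "Warning", "S4": "Info",
}


def format_alert(severity: str, farm_name: str, alert_name: str) -> str:
    """Build a message shaped like '[ S2 - Error ] [ farm-name ] <name> alert'."""
    label = SEVERITY_LABELS[severity]
    return f"[ {severity} - {label} ] [ {farm_name} ] {alert_name} alert"


@dataclass(frozen=True)
class Window:
    threshold: float  # value that must be exceeded
    minutes: int      # number of consecutive minutes it must be exceeded


def breached(samples: Sequence[float], windows: Sequence[Window]) -> bool:
    """Return True if any window is violated.

    Assumption: `samples` holds one reading per minute, oldest first.
    A window is violated when the most recent `minutes` readings all
    exceed its threshold; with fewer readings than that, it cannot fire.
    """
    for w in windows:
        recent = samples[-w.minutes:]
        if len(recent) == w.minutes and all(v > w.threshold for v in recent):
            return True
    return False


def rds_autoscale_headroom(free: float, max_autoscale: float, allocated: float) -> float:
    """(Free Space + Max Autoscaling Storage - Allocated Storage) / Allocated Storage.

    All three arguments must use the same unit (for example GiB).
    """
    return (free + max_autoscale - allocated) / allocated


# Thresholds taken from the Gateway currentThreadsBusy row.
TOMCAT_BUSY = [Window(30, 30), Window(60, 15), Window(90, 5)]

if __name__ == "__main__":
    # 25 quiet minutes followed by 5 minutes above 90 busy threads.
    readings = [10.0] * 25 + [95.0] * 5
    if breached(readings, TOMCAT_BUSY):
        print(format_alert("S2", "farm-name",
                           "Gateway Tomcat https connector currentThreadsBusy"))
    # Illustrative storage figures only: 50 free, 1000 max, 900 allocated.
    if rds_autoscale_headroom(50, 1000, 900) < 0.2:
        print("RDS storage is close to its auto-scaling limit")
```

Expressing each "or" branch as its own Window keeps the escalation tiers in data rather than in code, so a farm-specific profile (such as the EU8-Prod variant) would only need a different list.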
