# ESM Cloud Unified Monitoring
## Legends
- S2
- S3
- S4
- NEW

Check here for the [severity definitions](https://rndwiki.houston.softwaregrp.net/confluence/display/ICS/Monitoring+Alert+Serverity+Definition).
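
The alert messages listed in the Alerts table on this page pair a severity code with a label and a farm name, for example `[ S2 - Error ] [ farm-name ] RDS CPU Utilization alert`. The sketch below shows one way such a title could be assembled. The label mapping is read off the alert messages on this page; the function name `format_alert_title` and its parameters are hypothetical and not part of any existing tooling.

```python
# Severity labels as they appear in the alert messages on this page.
SEVERITY_LABELS = {
    "S0": "Urgent",
    "S1": "Critical",
    "S2": "Error",
    "S3": "Warning",
    "S4": "Info",
}


def format_alert_title(severity: str, farm_name: str, alert_name: str) -> str:
    """Build a title such as '[ S2 - Error ] [ farm-name ] RDS CPU Utilization alert'.

    Hypothetical helper for illustration; raises KeyError for an unknown severity code.
    """
    label = SEVERITY_LABELS[severity]
    return f"[ {severity} - {label} ] [ {farm_name} ] {alert_name} alert"
```

Keeping the mapping in one place means a title can only carry a severity code that has a defined label.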
## Introduction
This guide presents all the items related to monitoring the ESM product on SaaS.
## Levels of monitoring
## Alerts
Alerts come with monitoring and experience.
Here is a reference list of items to be sent as alerts. The [Grafana monitoring dashboards](https://github.houston.softwaregrp.net/smax-saas-ops/ESM-Saas-Monitoring) are developed based on the list below.
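
Most rows in the table that follows take the form "metric above a threshold for more than N minutes" (for example, RDS CPU more than 97% for more than 30 mins), and one row uses the ratio (Free Space + Max Autoscaling Storage - Allocated Storage) / Allocated Storage < 0.2. The sketch below illustrates both checks in plain Python. It is not the dashboards' actual rule configuration: the function names, the one-sample-per-interval assumption, and the GB units are all assumptions made for illustration.

```python
from typing import Sequence


def breached_for(samples: Sequence[float], threshold: float, minutes: int,
                 interval_minutes: int = 1) -> bool:
    """Return True if every sample in the most recent `minutes` is above `threshold`.

    Assumes `samples` is ordered oldest to newest with one value every
    `interval_minutes`. Returns False when there is not enough history.
    """
    needed = -(-minutes // interval_minutes)  # ceiling division
    if needed <= 0 or len(samples) < needed:
        return False
    return all(value > threshold for value in samples[-needed:])


def rds_autoscale_headroom(free_gb: float, max_autoscale_gb: float,
                           allocated_gb: float) -> float:
    """Compute (free + max autoscaling storage - allocated) / allocated.

    The table treats a result below 0.2 as an S2 condition.
    """
    if allocated_gb <= 0:
        raise ValueError("allocated_gb must be positive")
    return (free_gb + max_autoscale_gb - allocated_gb) / allocated_gb
```

For instance, `breached_for(cpu_samples, threshold=97, minutes=30)` corresponds to the "CPU more than 97% for more than 30 mins" row when `cpu_samples` holds one reading per minute, and `rds_autoscale_headroom(20, 1000, 900) < 0.2` flags an instance that has nearly used up its autoscaling quota.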
<table><colgroup><col> <col> <col> <col> <col> <col> <col> <col></colgroup><tbody><tr><th>Monitoring Level</th><th>Category</th><th>Severity</th><th>Code</th><th>Alert Description</th><th>Sample Chart</th><th>Alert Message</th><th>Runbook</th></tr><tr><td rowspan="25">Infrastructure</td><td rowspan="2"><p>Compute</p></td><td></td><td></td><td><strong>ALB HTTP 5XX Count</strong> (More than 34 in a 3 mins time frame)</td><td><a href="https://eu8-prod-monitoring.itsma-ng.com/grafana/d/wbQ5osfGz/1-cloudwatch-metrics-alert-dashboard?orgId=1&viewPanel=2">Link</a></td><td>[ S0 - Urgent ] [ farm-name ] ALB HTTP 5XX Count alert</td><td><a href="https://rndwiki.houston.softwaregrp.net/confluence/display/ICS/Alert+Runbooks+based+on+monitoring#AlertRunbooksbasedonmonitoring-ALBHTTP5XXCountalert">Runbook</a></td></tr><tr><td><p>S2</p></td><td></td><td><strong>ALB Target 5xx Count</strong></td><td><a href="https://eu8-prod-monitoring.itsma-ng.com/grafana/d/wbQ5osfGz/1-cloudwatch-metrics-alert-dashboard?orgId=1&viewPanel=29">Link</a></td><td></td><td></td></tr><tr><td rowspan="5">Storage</td><td><p>S3</p></td><td></td><td><strong>EBS Disk Queue Depth</strong> (EBS disk queue depth more than 5 for more than 10 mins)</td><td><a href="https://eu8-prod-monitoring.itsma-ng.com/grafana/d/wbQ5osfGz/1-cloudwatch-metrics-alert-dashboard?tab=alert&orgId=1&viewPanel=4">Link</a></td><td>[ S3 - Warning ] [ farm-name ] EBS Disk Queue Depth alert</td><td><a href="https://rndwiki.houston.softwaregrp.net/confluence/display/ICS/Alert+Runbooks+based+on+monitoring#AlertRunbooksbasedonmonitoring-EBSDiskQueueDepthalert">Runbook</a></td></tr><tr><td><p>S2</p></td><td></td><td><strong>EBS Burst Balance Average</strong> (EBS burst balance below 40% for more than 30 mins )</td><td><a href="https://eu8-prod-monitoring.itsma-ng.com/grafana/d/wbQ5osfGz/1-cloudwatch-metrics-alert-dashboard?orgId=1&viewPanel=6">Link</a></td><td>[ S2 - Error ] [ farm-name ] EBS Burst Balance Average alert</td><td><a 
href="https://rndwiki.houston.softwaregrp.net/confluence/display/ICS/Alert+Runbooks+based+on+monitoring#AlertRunbooksbasedonmonitoring-EBSBurstBalanceAveragealert">Runbook</a></td></tr><tr><td></td><td></td><td><strong>EBS Burst Balance Average</strong> (EBS burst balance is below 0)</td><td><a href="https://eu8-prod-monitoring.itsma-ng.com/grafana/d/wbQ5osfGz/1-cloudwatch-metrics-alert-dashboard?orgId=1&viewPanel=17">Link</a></td><td>[ S0 - Urgent ] [ farm-name ] EBS Burst Balance Average alert</td><td><a href="https://rndwiki.houston.softwaregrp.net/confluence/display/ICS/Alert+Runbooks+based+on+monitoring#AlertRunbooksbasedonmonitoring-S0EBSBurstBalanceAveragealert">Runbook</a></td></tr><tr><td><p>S2</p></td><td></td><td><strong>EFS Burst Credit Balance</strong> (Burst credit below 40% for more than 15 mins )</td><td><a href="https://eu8-prod-monitoring.itsma-ng.com/grafana/d/wbQ5osfGz/1-cloudwatch-metrics-alert-dashboard?orgId=1&viewPanel=8">Link</a></td><td>[ S2 - Error ] [ farm-name ] EFS Burst Credit Balance alert</td><td><a href="https://rndwiki.houston.softwaregrp.net/confluence/display/ICS/Alert+Runbooks+based+on+monitoring#AlertRunbooksbasedonmonitoring-EFSBurstCreditBalancealert">Runbook</a></td></tr><tr><td></td><td></td><td><strong>EFS Burst Credit Balance</strong> (Burst credit is 0)</td><td><a href="https://eu8-prod-monitoring.itsma-ng.com/grafana/d/wbQ5osfGz/1-cloudwatch-metrics-alert-dashboard?orgId=1&viewPanel=18">Link</a></td><td>[ S0 - Urgent ] [ farm-name ] EFS Burst Credit Balance alert</td><td><a href="https://rndwiki.houston.softwaregrp.net/confluence/display/ICS/Alert+Runbooks+based+on+monitoring#AlertRunbooksbasedonmonitoring-S0EFSBurstCreditBalancealert">Runbook</a></td></tr><tr><td><p>Virtualization</p></td><td></td><td></td><td></td><td></td><td></td><td></td></tr><tr><td rowspan="17"><p>Database</p></td><td><p>S2</p></td><td></td><td><strong>RDS CPU Utilization</strong> (CPU more than 97% for more than 30 mins)</td><td><a 
href="https://eu8-prod-monitoring.itsma-ng.com/grafana/d/wbQ5osfGz/1-cloudwatch-metrics-alert-dashboard?orgId=1&viewPanel=10">Link</a></td><td>[ S2 - Error ] [ farm-name ] RDS CPU Utilization alert</td><td><a href="https://rndwiki.houston.softwaregrp.net/confluence/display/ICS/Alert+Runbooks+based+on+monitoring#AlertRunbooksbasedonmonitoring-RDSCPUUtilizationalert">Runbook</a></td></tr><tr><td><p>S2</p></td><td></td><td>CPU (sy: system >70% for more than 60 mins )</td><td><a href="https://eu8-prod-monitoring.itsma-ng.com/grafana/d/wbQ5osfGz/1-cloudwatch-metrics-alert-dashboard?orgId=1&viewPanel=24">Link</a></td><td>[ S2 - Error ] [ farm-name ] RDS cpuUtilization System alert</td><td><a href="https://rndwiki.houston.softwaregrp.net/confluence/display/ICS/Alert+Runbooks+based+on+monitoring#AlertRunbooksbasedonmonitoring-RDScpuUtilizationSystemalert">Runbook</a></td></tr><tr><td><p>S2</p></td><td></td><td>CPU (si: soft interrupts > 15% for more than 60 mins )</td><td><a href="https://eu8-prod-monitoring.itsma-ng.com/grafana/d/wbQ5osfGz/1-cloudwatch-metrics-alert-dashboard?orgId=1&viewPanel=22">Link</a></td><td>[ S2 - Error ] [ farm-name ] RDS CPU Soft Interrupts alert</td><td><a href="https://rndwiki.houston.softwaregrp.net/confluence/display/ICS/Alert+Runbooks+based+on+monitoring#AlertRunbooksbasedonmonitoring-RDSCPUSoftInterruptsalert">Runbook</a></td></tr><tr><td><p>S3</p></td><td></td><td>Disk queue depth (EBS disk queue depth more than 5 for more than 10 mins)</td><td><a href="https://eu8-prod-monitoring.itsma-ng.com/grafana/d/wbQ5osfGz/1-cloudwatch-metrics-alert-dashboard?orgId=1&viewPanel=12">Link</a></td><td>[ S3 - Warning ] [ farm-name ] RDS Disk queue depth alert</td><td><a href="https://rndwiki.houston.softwaregrp.net/confluence/display/ICS/Alert+Runbooks+based+on+monitoring#AlertRunbooksbasedonmonitoring-RDSDiskqueuedepthalert">Runbook</a></td></tr><tr><td><p>S2</p></td><td></td><td>Disk (Free Storage Space is below 500 MB)</td><td><a 
href="https://eu8-prod-monitoring.itsma-ng.com/grafana/d/wbQ5osfGz/1-cloudwatch-metrics-alert-dashboard?orgId=1&viewPanel=31">Link</a></td><td>[ S2 - Error ] [ farm-name ] RDS Disk Free Storage Space alert</td><td><a href="https://rndwiki.houston.softwaregrp.net/confluence/display/ICS/Alert+Runbooks+based+on+monitoring#AlertRunbooksbasedonmonitoring-RDSDiskFreeStorageSpacealert">Runbook</a></td></tr><tr><td><p>S2</p></td><td></td><td>Disk (Storage has enough space to auto-scale, (Free Space + Max Autoscaling Storage - Allocated Storage) / Allocated Storage < 0.2 )</td><td></td><td></td><td><a href="https://rndwiki.houston.softwaregrp.net/confluence/display/ICS/Alert+Runbooks+based+on+monitoring#AlertRunbooksbasedonmonitoring-RDSstorageauto-scalingquotaisnotenough">Runbook</a></td></tr><tr><td><p>S2</p></td><td></td><td>Memory (Free memory less than 5% for more than 5 mins)</td><td><a href="https://eu8-prod-monitoring.itsma-ng.com/grafana/d/wbQ5osfGz/1-cloudwatch-metrics-alert-dashboard?orgId=1&viewPanel=16">Link</a></td><td>[ S2 - Error ] [ farm-name ] RDS Free Memory Percentage alert</td><td><a href="https://rndwiki.houston.softwaregrp.net/confluence/display/ICS/Alert+Runbooks+based+on+monitoring#AlertRunbooksbasedonmonitoring-RDSFreeMemoryPercentagealert">Runbook</a></td></tr><tr><td></td><td></td><td>Memory (Free memory less than 2% for more than 5 mins)</td><td><a href="https://eu8-prod-monitoring.itsma-ng.com/grafana/d/wbQ5osfGz/1-cloudwatch-metrics-alert-dashboard?orgId=1&viewPanel=19">Link</a></td><td>[ S0 - Urgent ] [ farm-name ] RDS Free Memory Percentage alert</td><td><a href="https://rndwiki.houston.softwaregrp.net/confluence/display/ICS/Alert+Runbooks+based+on+monitoring#AlertRunbooksbasedonmonitoring-S0RDSFreeMemoryPercentagealert">Runbook</a></td></tr><tr><td><p>S2</p></td><td></td><td>Storage (Burst Balance below 40% for more than 30 mins )</td><td><a 
href="https://eu8-prod-monitoring.itsma-ng.com/grafana/d/wbQ5osfGz/1-cloudwatch-metrics-alert-dashboard?orgId=1&viewPanel=14">Link</a></td><td>[ S2 - Error ] [ farm-name ] RDS Burst Balance alert</td><td><a href="https://rndwiki.houston.softwaregrp.net/confluence/display/ICS/Alert+Runbooks+based+on+monitoring#AlertRunbooksbasedonmonitoring-RDSBurstBalancealert">Runbook</a></td></tr><tr><td></td><td></td><td><strong>RDS Burst Balance</strong> (Burst Balance is 0)</td><td><a href="https://eu8-prod-monitoring.itsma-ng.com/grafana/d/wbQ5osfGz/1-cloudwatch-metrics-alert-dashboard?orgId=1&viewPanel=20">Link</a></td><td>[ S0 - Urgent ] [ farm-name ] RDS Burst Balance alert</td><td><a href="https://rndwiki.houston.softwaregrp.net/confluence/display/ICS/Alert+Runbooks+based+on+monitoring#AlertRunbooksbasedonmonitoring-S0RDSBurstBalancealert">Runbook</a></td></tr><tr><td><p>S2</p></td><td></td><td><strong>RDS DBLoad</strong> (AWS Specific, via performance insight, more than 2 times of CPU number for more than one hour)</td><td><p><a href="https://eu8-prod-monitoring.itsma-ng.com/grafana/d/wbQ5osfGz/1-cloudwatch-metrics-alert-dashboard?orgId=1&viewPanel=32">Link</a></p><p><a href="https://eu8-prod-monitoring.itsma-ng.com/grafana/d/wbQ5osfGz/1-cloudwatch-metrics-alert-dashboard?orgId=1&viewPanel=33">Link</a></p></td><td><p>[ S2 - Error ] [ farm-name ] SMA RDS DBLoad alert</p><p>[ S2 - Error ] [ farm-name ] CMS RDS DBLoad alert</p></td><td><a href="https://rndwiki.houston.softwaregrp.net/confluence/display/ICS/Alert+Runbooks+based+on+monitoring#AlertRunbooksbasedonmonitoring-RDSDBLoadalert">Runbook</a></td></tr><tr><td></td><td></td><td><strong>RDS DBLoad</strong> (AWS Specific, via performance insight, more than 4 times of CPU number for more than one hour)</td><td><p><a href="https://eu8-prod-monitoring.itsma-ng.com/grafana/d/wbQ5osfGz/1-cloudwatch-metrics-alert-dashboard?orgId=1&viewPanel=26">Link</a></p><p><a 
href="https://eu8-prod-monitoring.itsma-ng.com/grafana/d/wbQ5osfGz/1-cloudwatch-metrics-alert-dashboard?orgId=1&viewPanel=38">Link</a></p></td><td><p>[ S1 - Critical ] [ farm-name ] SMA RDS DBLoad alert</p><p>[ S1 - Critical ] [ farm-name ] CMS RDS DBLoad alert</p></td><td><a href="https://rndwiki.houston.softwaregrp.net/confluence/display/ICS/Alert+Runbooks+based+on+monitoring#AlertRunbooksbasedonmonitoring-S1RDSDBLoadalert">Runbook</a></td></tr><tr><td><p>S3</p></td><td></td><td><strong>RDS DBLoadNonCPU</strong> (AWS Specific, via performance insight, more than 1 times of CPU number more than one hour)</td><td><p><a href="https://eu8-prod-monitoring.itsma-ng.com/grafana/d/wbQ5osfGz/1-cloudwatch-metrics-alert-dashboard?orgId=1&viewPanel=28">Link</a></p><p><a href="https://eu8-prod-monitoring.itsma-ng.com/grafana/d/wbQ5osfGz/1-cloudwatch-metrics-alert-dashboard?orgId=1&viewPanel=39">Link</a></p></td><td><p>[ S3 - Warning ] [ farm-name ] SMA RDS DBLoadNonCPU alert</p><p>[ S3 - Warning ] [ farm-name ] CMS RDS DBLoadNonCPU alert</p></td><td><a href="https://rndwiki.houston.softwaregrp.net/confluence/display/ICS/Alert+Runbooks+based+on+monitoring#AlertRunbooksbasedonmonitoring-RDSDBLoadNonCPUalert">Runbook</a></td></tr><tr><td></td><td></td><td><p><a href="https://rndwiki.houston.softwaregrp.net/confluence/display/ICS/Monitoring+Database">Locks (TBD)</a></p></td><td><a href="https://eu8-prod-monitoring.itsma-ng.com/grafana/d/gDyTSt_4zz/6-postgresql-rds-monitoring?orgId=1&viewPanel=2">Link</a></td><td>Block Session Count</td><td></td></tr><tr><td></td><td></td><td><p><a href="https://rndwiki.houston.softwaregrp.net/confluence/display/ICS/Monitoring+Database">Long active queries (TBD)</a></p></td><td><a href="https://eu8-prod-monitoring.itsma-ng.com/grafana/d/gDyTSt_4zz/6-postgresql-rds-monitoring?orgId=1&viewPanel=4">Link</a></td><td>long active query duration</td><td></td></tr><tr><td></td><td></td><td><p><a 
href="https://rndwiki.houston.softwaregrp.net/confluence/display/ICS/Monitoring+Database">Capture RDS top 10 query (TBD)</a></p><ol><li><ol><li>Clean stat_statement daily</li><li>capture during runtime if CPU is more than 97% for 60 mins</li></ol></li></ol></td><td><p><a href="https://eu8-prod-monitoring.itsma-ng.com/grafana/d/gDyTSt_4zz/6-postgresql-rds-monitoring?orgId=1&viewPanel=6">Link</a></p></td><td>RDS top 10 query</td><td></td></tr><tr><td></td><td></td><td><p><a href="https://rndwiki.houston.softwaregrp.net/confluence/display/ICS/Monitoring+Database">Dead tuple (TBD)</a></p></td><td><p><a href="https://eu8-prod-monitoring.itsma-ng.com/grafana/d/gDyTSt_4zz/6-postgresql-rds-monitoring?orgId=1&viewPanel=8">Link</a></p><p><a href="https://eu8-prod-monitoring.itsma-ng.com/grafana/d/gDyTSt_4zz/6-postgresql-rds-monitoring?orgId=1&viewPanel=19">Link</a></p><p><a href="https://eu8-prod-monitoring.itsma-ng.com/grafana/d/gDyTSt_4zz/6-postgresql-rds-monitoring?orgId=1&viewPanel=20">Link</a></p></td><td><p>dead tuple ems</p><p>dead tuple rms</p><p>dead tuple idm</p></td><td></td></tr><tr><td rowspan="14">OS (Node level)</td><td><p>CPU</p></td><td><p>S2</p></td><td></td><td>CPU more than 97% for more than 60 mins</td><td><a href="https://eu8-prod-monitoring.itsma-ng.com/grafana/d/smaxmetrics1-testing/2-node-os-metrics-alert-dashboard?orgId=1&viewPanel=181">Link</a></td><td>[ S2 - Error ] [ farm-name ] Node CPU Usage alert</td><td><a href="https://rndwiki.houston.softwaregrp.net/confluence/display/ICS/Alert+Runbooks+based+on+monitoring#AlertRunbooksbasedonmonitoring-NodeCPUUsagealert">Runbook</a></td></tr><tr><td></td><td><p>S2</p></td><td></td><td>CPU (sy: system >70% for more than 60 mins )(mark for review)</td><td><a href="https://eu8-prod-monitoring.itsma-ng.com/grafana/d/smaxmetrics1-testing/2-node-os-metrics-alert-dashboard?orgId=1&viewPanel=211">Link</a></td><td>[ S2 - Error ] [ farm-name ] Node CPU System alert</td><td><a 
href="https://rndwiki.houston.softwaregrp.net/confluence/display/ICS/Alert+Runbooks+based+on+monitoring#AlertRunbooksbasedonmonitoring-NodeCPUSystemalert">Runbook</a></td></tr><tr><td></td><td><p>S2</p></td><td></td><td>CPU (si: soft interrupts > 15% for more than 60 mins )(mark for review)</td><td><a href="https://eu8-prod-monitoring.itsma-ng.com/grafana/d/smaxmetrics1-testing/2-node-os-metrics-alert-dashboard?orgId=1&viewPanel=213">Link</a></td><td>[ S2 - Error ] [ farm-name ] Node CPU Soft Interrupts alert</td><td><a href="https://rndwiki.houston.softwaregrp.net/confluence/display/ICS/Alert+Runbooks+based+on+monitoring#AlertRunbooksbasedonmonitoring-NodeCPUSoftInterruptsalert">Runbook</a></td></tr><tr><td><p>Memory</p></td><td><p>S3</p></td><td></td><td>Memory more than 95% for more than 10 mins</td><td><a href="https://eu8-prod-monitoring.itsma-ng.com/grafana/d/smaxmetrics1-testing/2-node-os-metrics-alert-dashboard?orgId=1&viewPanel=185">Link</a></td><td>[ S3 - Warning ] [ farm-name ] Node Mem Usage alert</td><td><a href="https://rndwiki.houston.softwaregrp.net/confluence/display/ICS/Alert+Runbooks+based+on+monitoring#AlertRunbooksbasedonmonitoring-NodeMemUsagealert">Runbook</a></td></tr><tr><td><p>Disk</p></td><td><p>S3</p></td><td></td><td>Disk usage more than 95%</td><td><a href="https://eu8-prod-monitoring.itsma-ng.com/grafana/d/smaxmetrics1-testing/2-node-os-metrics-alert-dashboard?orgId=1&viewPanel=187">Link</a></td><td>[ S3 - Warning ] [ farm-name ] Node Disk Usage alert</td><td><a href="https://rndwiki.houston.softwaregrp.net/confluence/display/ICS/Alert+Runbooks+based+on+monitoring#AlertRunbooksbasedonmonitoring-NodeDiskUsagealert">Runbook</a></td></tr><tr><td></td><td></td><td></td><td><p><a href="https://devconnected.com/monitoring-disk-i-o-on-linux-with-the-node-exporter/#b_Read_Write_Latencies">Disk read/write latency</a> (TBD)</p></td><td><p><a 
href="https://eu8-prod-monitoring.itsma-ng.com/grafana/d/smaxmetrics1-testing/2-node-os-metrics-alert-dashboard?orgId=1&viewPanel=191">Link</a></p><p><a href="https://eu8-prod-monitoring.itsma-ng.com/grafana/d/smaxmetrics1-testing/2-node-os-metrics-alert-dashboard?orgId=1&viewPanel=193">Link</a></p></td><td><p>Disk Read Latency</p><p>Disk Write Latency</p></td><td></td></tr><tr><td></td><td><p>S3</p></td><td></td><td><p><a href="https://devconnected.com/monitoring-disk-i-o-on-linux-with-the-node-exporter/#c_Number_of_inodes_on_our_system">Inode usage</a> > 97%</p></td><td><a href="https://eu8-prod-monitoring.itsma-ng.com/grafana/d/smaxmetrics1-testing/2-node-os-metrics-alert-dashboard?orgId=1&viewPanel=189">Link</a></td><td>[ S3 - Warning ] [ farm-name ] Disk Inode Usage alert</td><td><a href="https://rndwiki.houston.softwaregrp.net/confluence/display/ICS/Alert+Runbooks+based+on+monitoring#AlertRunbooksbasedonmonitoring-DiskInodeUsagealert">Runbook</a></td></tr><tr><td></td><td></td><td></td><td><p><a href="https://devconnected.com/monitoring-disk-i-o-on-linux-with-the-node-exporter/#d_Overall_IO_load_on_your_instance">Node disk IO load</a> (TBD)</p></td><td><a href="https://eu8-prod-monitoring.itsma-ng.com/grafana/d/smaxmetrics1-testing/2-node-os-metrics-alert-dashboard?orgId=1&viewPanel=195">Link</a></td><td>Disk IOPS</td><td></td></tr><tr><td><p>Network</p></td><td></td><td></td><td><p>network operation latency(TBD)</p></td><td></td><td></td><td></td></tr><tr><td></td><td></td><td></td><td><p>network transit error rate(TBD)</p></td><td><a href="https://eu8-prod-monitoring.itsma-ng.com/grafana/d/smaxmetrics1-testing/2-node-os-metrics-alert-dashboard?orgId=1&viewPanel=197">Link</a></td><td>Network Transit Error Rate</td><td></td></tr><tr><td></td><td></td><td></td><td><p>network transit drop rate(TBD)</p></td><td><a 
href="https://eu8-prod-monitoring.itsma-ng.com/grafana/d/smaxmetrics1-testing/2-node-os-metrics-alert-dashboard?orgId=1&viewPanel=199">Link</a></td><td>Network Transit Drop Rate</td><td></td></tr><tr><td></td><td></td><td></td><td><p>network transit queue length(TBD)</p></td><td></td><td></td><td></td></tr><tr><td></td><td></td><td></td><td><p>Throughput / bandwidth (TBD)</p></td><td></td><td></td><td></td></tr><tr><td></td><td><p>S3</p></td><td></td><td>Load (Load Avg 15m/core number > 200% for 35 mins )</td><td><a href="https://eu8-prod-monitoring.itsma-ng.com/grafana/d/smaxmetrics1-testing/2-node-os-metrics-alert-dashboard?orgId=1&viewPanel=13">Link</a></td><td>[ S3 - Warning ] [ farm-name ] Node Load Avg 15m/core</td><td><a href="https://rndwiki.houston.softwaregrp.net/confluence/display/ICS/Alert+Runbooks+based+on+monitoring#AlertRunbooksbasedonmonitoring-NodeLoadAvg">Runbook</a></td></tr><tr><td rowspan="17">Container</td><td><p>CPU</p></td><td><p>S2</p></td><td></td><td>CPU (CPU more than 97% for more than 60 mins)</td><td><a href="https://eu8-prod-monitoring.itsma-ng.com/grafana/d/jQP-QeYGz/3-k8s-pod-metrics-alert-dashboard?orgId=1&viewPanel=37">Link</a></td><td>[ S2 - Error ] [ farm-name ] Pod CPU usage alert</td><td><a href="https://rndwiki.houston.softwaregrp.net/confluence/display/ICS/Alert+Runbooks+based+on+monitoring#AlertRunbooksbasedonmonitoring-PodCPUusagealert">Runbook</a></td></tr><tr><td><p>Memory</p></td><td></td><td></td><td><p>swap usage</p></td><td><a href="https://eu8-prod-monitoring.itsma-ng.com/grafana/d/jQP-QeYGz/3-k8s-pod-metrics-alert-dashboard?orgId=1&viewPanel=29">Link</a></td><td>Pod Swap Usage</td><td></td></tr><tr><td rowspan="2"><p>Disk</p></td><td></td><td></td><td><p><a href="https://devconnected.com/monitoring-disk-i-o-on-linux-with-the-node-exporter/#b_Read_Write_Latencies">Disk read/write latency</a> (TBD)</p></td><td><p><a 
href="https://eu8-prod-monitoring.itsma-ng.com/grafana/d/jQP-QeYGz/3-k8s-pod-metrics-alert-dashboard?orgId=1&viewPanel=53">Link</a></p><p><a href="https://eu8-prod-monitoring.itsma-ng.com/grafana/d/jQP-QeYGz/3-k8s-pod-metrics-alert-dashboard?orgId=1&viewPanel=57">Link</a></p></td><td><p>Pod Disk Read Latency</p><p>Pod Disk Write Latency</p></td><td></td></tr><tr><td><p>S3</p></td><td></td><td><p>Inode usage(free/total) > 97%</p></td><td><a href="https://eu8-prod-monitoring.itsma-ng.com/grafana/d/jQP-QeYGz/3-k8s-pod-metrics-alert-dashboard?orgId=1&viewPanel=31">Link</a></td><td>[ S3 - Warning ] [ farm-name ] Pod Inode Usage alert</td><td><a href="https://rndwiki.houston.softwaregrp.net/confluence/display/ICS/Alert+Runbooks+based+on+monitoring#AlertRunbooksbasedonmonitoring-PodInodeUsage">Runbook</a></td></tr><tr><td rowspan="2"><p>Network</p></td><td></td><td></td><td><p>network transit error rate(TBD)</p></td><td><a href="https://eu8-prod-monitoring.itsma-ng.com/grafana/d/jQP-QeYGz/3-k8s-pod-metrics-alert-dashboard?orgId=1&viewPanel=25">Link</a></td><td>Pod Network Transit Error Rate</td><td></td></tr><tr><td></td><td></td><td><p>network transit drop rate(TBD)</p></td><td><a href="https://eu8-prod-monitoring.itsma-ng.com/grafana/d/jQP-QeYGz/3-k8s-pod-metrics-alert-dashboard?orgId=1&viewPanel=27">Link</a></td><td>Pod Network Transit Drop Rate</td><td></td></tr><tr><td rowspan="8"><p>Unavailable service</p></td><td></td><td></td><td><p>SMAXcritical path unavailable: svc portal / runtime ui/ gateway/ platform / redis / rabbitmq / bo-login / idm / bo-ats / ingress-nginx / sma-ui / bo-farcade</p></td><td><a href="https://eu8-prod-monitoring.itsma-ng.com/grafana/d/jQP-QeYGz/3-k8s-pod-metrics-alert-dashboard?orgId=1&viewPanel=60">Link</a></td><td>[ S0 - Urgent ] [ farm-name ] SMA Unavailable k8s resource alert</td><td><a 
href="https://rndwiki.houston.softwaregrp.net/confluence/display/ICS/Alert+Runbooks+based+on+monitoring#AlertRunbooksbasedonmonitoring-SMAUnavailablek8sresourcealert">Runbook</a></td></tr><tr><td><p>S2</p></td><td></td><td><p>SMAXimpact partial of business: others not in S0, search related (content, DIH, DAH, search, proxy) / auto pass / bo-ui / bo-user</p></td><td><a href="https://eu8-prod-monitoring.itsma-ng.com/grafana/d/jQP-QeYGz/3-k8s-pod-metrics-alert-dashboard?orgId=1&viewPanel=61">Link</a></td><td>[ S2 - Error ] [ farm-name ] SMA Unavailable k8s resource alert</td><td><a href="https://rndwiki.houston.softwaregrp.net/confluence/display/ICS/Alert+Runbooks+based+on+monitoring#AlertRunbooksbasedonmonitoring-S2SMAUnavailablek8sresourcealert">Runbook</a></td></tr><tr><td><p>S3</p></td><td></td><td><p>SMAXno obvious impact on business: XMPP / XIE / Smart Ticket / stx / virtual agent / ppo / web socket gateway / smart-ui / ocr / smarta-installer</p></td><td><a href="https://eu8-prod-monitoring.itsma-ng.com/grafana/d/jQP-QeYGz/3-k8s-pod-metrics-alert-dashboard?orgId=1&viewPanel=62">Link</a></td><td>[ S3 - Warning ] [ farm-name ] SMA Unavailable k8s resource alert</td><td><a href="https://rndwiki.houston.softwaregrp.net/confluence/display/ICS/Alert+Runbooks+based+on+monitoring#AlertRunbooksbasedonmonitoring-S3SMAUnavailablek8sresourcealert">Runbook</a></td></tr><tr><td><p>S4</p></td><td></td><td><p>SMAXservices out side of ESM / toolkit</p></td><td><a href="https://eu8-prod-monitoring.itsma-ng.com/grafana/d/jQP-QeYGz/3-k8s-pod-metrics-alert-dashboard?orgId=1&viewPanel=59">Link</a></td><td>[ S4 - Info ] [ farm-name ] SMA Unavailable k8s resource alert</td><td><a href="https://rndwiki.houston.softwaregrp.net/confluence/display/ICS/Alert+Runbooks+based+on+monitoring#AlertRunbooksbasedonmonitoring-S4SMAUnavailablek8sresourcealert">Runbook</a></td></tr><tr><td></td><td></td><td><p>CMScritical path unavailable: itom-cms-gateway, itom-idm, itom-ingress-controller, 
itom-ucmdb-browser, tom-ucmdb-solr, <strong>itom-ucmdb (both are down)</strong></p></td><td><a href="https://eu8-prod-monitoring.itsma-ng.com/grafana/d/jQP-QeYGz/3-k8s-pod-metrics-alert-dashboard?orgId=1&viewPanel=64">Link</a></td><td>[ S0 - Urgent ] [ farm-name ] CMS Unavailable k8s resource alert</td><td><a href="https://rndwiki.houston.softwaregrp.net/confluence/display/ICS/Alert+Runbooks+based+on+monitoring#AlertRunbooksbasedonmonitoring-CMSUnavailablek8sresourcealert">Runbook</a></td></tr><tr><td><p>S2</p></td><td></td><td><p>CMSimpact partial of business: itom-autopass-lms, itom-vault, <strong>itom-ucmdb (either is down)</strong></p></td><td><a href="https://eu8-prod-monitoring.itsma-ng.com/grafana/d/jQP-QeYGz/3-k8s-pod-metrics-alert-dashboard?orgId=1&viewPanel=66">Link</a></td><td>[ S2 - Error ] [ farm-name ] CMS Unavailable k8s resource alert</td><td><a href="https://rndwiki.houston.softwaregrp.net/confluence/display/ICS/Alert+Runbooks+based+on+monitoring#AlertRunbooksbasedonmonitoring-S2CMSUnavailablek8sresourcealert">Runbook</a></td></tr><tr><td><p>S3</p></td><td></td><td><p>CMS no obvious impact on business:</p></td><td></td><td></td><td></td></tr><tr><td><p>S4</p></td><td></td><td><p>CMSservices out side of ESM / toolkit: itom-ucmdb-probe, itom-ucmdb-dfp-lunux-installer, itom-ucmdb-dfp-windows-installer, itom-ucmdb-localclient-installers</p></td><td><a href="https://eu8-prod-monitoring.itsma-ng.com/grafana/d/jQP-QeYGz/3-k8s-pod-metrics-alert-dashboard?orgId=1&viewPanel=68">Link</a></td><td>[ S4 - Info ] [ farm-name ] CMS Unavailable k8s resource alert</td><td><a href="https://rndwiki.houston.softwaregrp.net/confluence/display/ICS/Alert+Runbooks+based+on+monitoring#AlertRunbooksbasedonmonitoring-S4CMSUnavailablek8sresourcealert">Runbook</a></td></tr><tr><td><p>Load</p></td><td><p>S3</p></td><td></td><td><p>Load Avg 15m/core number > 200% for 35 mins (TBD, because it's not observable via current metrics)</p></td><td><a 
href="https://eu8-prod-monitoring.itsma-ng.com/grafana/d/jQP-QeYGz/3-k8s-pod-metrics-alert-dashboard?orgId=1&viewPanel=55">Link</a></td><td>Pod Load Avg 10s</td><td><a href="https://rndwiki.houston.softwaregrp.net/confluence/display/ICS/Alert+Runbooks+based+on+monitoring#AlertRunbooksbasedonmonitoring-PodLoadAvg10s">Runbook</a></td></tr><tr><td><p>Threads</p></td><td></td><td></td><td><p>container_threads on process (TBD)</p></td><td><a href="https://eu8-prod-monitoring.itsma-ng.com/grafana/d/jQP-QeYGz/3-k8s-pod-metrics-alert-dashboard?orgId=1&viewPanel=33">Link</a></td><td>Threads</td><td></td></tr><tr><td><p>Pod balancing (TBD)</p></td><td></td><td></td><td></td><td></td><td></td><td></td></tr><tr><td>App metrics</td><td><p>Thread</p></td><td></td><td></td><td></td><td></td><td></td><td></td></tr><tr><td></td><td><p>Connections</p></td><td></td><td></td><td></td><td></td><td></td><td></td></tr><tr><td></td><td><p>Limits</p></td><td></td><td></td><td></td><td></td><td></td><td></td></tr><tr><td></td><td><p>Smart Analytics</p></td><td><p>S3</p></td><td></td><td><p>SMAXContent data ratio(total doc/committed doc) > 1.20</p></td><td><a href="https://eu8-prod-monitoring.itsma-ng.com/grafana/d/yxiPJiEMz/4-smax-application-metrics-alert-dashboard?orgId=1&viewPanel=2">Link</a></td><td>[ S3 - Warning ] [ farm-name ] SmartA Data Compact Ration alert</td><td><a href="https://rndwiki.houston.softwaregrp.net/confluence/display/ICS/Alert+Runbooks+based+on+monitoring#AlertRunbooksbasedonmonitoring-SmartADataCompactRationalert">Runbook</a></td></tr><tr><td></td><td><p>Rabbitmq (each node)</p></td><td><p>S3</p></td><td></td><td><p>SMAXqueue > 200 / 250 for more than 30 mins (200 for medium profile or lower, 250 for large profile)</p></td><td><a href="https://eu8-prod-monitoring.itsma-ng.com/grafana/d/yxiPJiEMz/4-smax-application-metrics-alert-dashboard?orgId=1&viewPanel=4">Link</a></td><td>[ S3 - Warning ] [ farm-name ] Rabbitmq Queue alert</td><td><a 
href="https://rndwiki.houston.softwaregrp.net/confluence/display/ICS/Alert+Runbooks+based+on+monitoring#AlertRunbooksbasedonmonitoring-RabbitmqQueuealert">Runbook</a></td></tr><tr><td></td><td></td><td><p>S3</p></td><td></td><td><p>SMAXPending Messages/Minute > 500 for more than 30 mins (Mark for review)</p></td><td><a href="https://eu8-prod-monitoring.itsma-ng.com/grafana/d/yxiPJiEMz/4-smax-application-metrics-alert-dashboard?orgId=1&viewPanel=22">Link</a></td><td>[ S3 - Warning ] [ farm-name ] Rabbitmq Messages/Minute alert</td><td><a href="https://rndwiki.houston.softwaregrp.net/confluence/display/ICS/Alert+Runbooks+based+on+monitoring#AlertRunbooksbasedonmonitoring-RabbitmqMessagesMinutealert">Runbook</a></td></tr><tr><td></td><td></td><td></td><td></td><td><p>SMAXMessage queue not equally distributed to different cluster nodes(TBD)</p></td><td></td><td></td><td><a href="https://rndwiki.houston.softwaregrp.net/confluence/display/ICS/Alert+Runbooks+based+on+monitoring#AlertRunbooksbasedonmonitoring-Messagequeuenotequallydistributedtodifferentclusternodes">Runbook</a></td></tr><tr><td></td><td><p>IDM</p></td><td><p>S4</p></td><td></td><td><p>SMAXActive user (per profile, medium profile > 1100 for more than 30 mins, large profile > 3000 for more than 30 mins )</p></td><td><a href="https://eu8-prod-monitoring.itsma-ng.com/grafana/d/yxiPJiEMz/4-smax-application-metrics-alert-dashboard?orgId=1&viewPanel=6">Link</a></td><td>[ S4 - Info ] [ farm-name ] IDM active users alert</td><td><a href="https://rndwiki.houston.softwaregrp.net/confluence/display/ICS/Alert+Runbooks+based+on+monitoring#AlertRunbooksbasedonmonitoring-IDMactiveusersalert">Runbook</a></td></tr><tr><td></td><td><p>Gateway</p></td><td><p>S2</p></td><td></td><td><p>SMAXTomcat https connector currentThreadsBusy > 30 for 30 mins</p><p>(EU8-Prod) Tomcat https connector currentThreadsBusy > 30 for 30 mins or Tomcat https connector currentThreadsBusy > 60 for 15 mins or Tomcat https connector currentThreadsBusy 
> 90 for 5 mins</p></td><td><a href="https://eu8-prod-monitoring.itsma-ng.com/grafana/d/yxiPJiEMz/4-smax-application-metrics-alert-dashboard?orgId=1&viewPanel=12">Link</a></td><td>[ S2 - Error ] [ farm-name ] Gateway Tomcat https connector currentThreadsBusy alert</td><td><a href="https://rndwiki.houston.softwaregrp.net/confluence/display/ICS/Alert+Runbooks+based+on+monitoring#AlertRunbooksbasedonmonitoring-GatewayTomcathttpsconnectorcurrentThreadsBusyalert">Runbook</a></td></tr><tr><td></td><td></td><td><p>S2</p></td><td></td><td><p>SMAXHttpclient InUse > 20 for 30 mins</p><p>(EU8-Prod) Httpclient InUse > 20 for 30 mins or Httpclient InUse > 30 for 15 mins or Httpclient InUse > 80 for 5 mins</p></td><td><a href="https://eu8-prod-monitoring.itsma-ng.com/grafana/d/yxiPJiEMz/4-smax-application-metrics-alert-dashboard?orgId=1&viewPanel=10">Link</a></td><td>[ S2 - Error ] [ farm-name ] Gateway Httpclient InUse alert</td><td><a href="https://rndwiki.houston.softwaregrp.net/confluence/display/ICS/Alert+Runbooks+based+on+monitoring#AlertRunbooksbasedonmonitoring-GatewayHttpclientInUsealert">Runbook</a></td></tr><tr><td></td><td><p>Platform</p></td><td><p>S2</p></td><td></td><td><p>SMAXTomcat https connector currentThreadsBusy > 30 for 30 mins</p><p>(EU8-Prod) Tomcat https connector currentThreadsBusy > 30 for 30 mins or Tomcat https connector currentThreadsBusy > 60 for 15 mins or Tomcat https connector currentThreadsBusy > 90 for 5 mins</p></td><td><a href="https://eu8-prod-monitoring.itsma-ng.com/grafana/d/yxiPJiEMz/4-smax-application-metrics-alert-dashboard?orgId=1&viewPanel=16">Link</a></td><td>[ S2 - Error ] [ farm-name ] Platform Tomcat https connector currentThreadsBusy alert</td><td><a href="https://rndwiki.houston.softwaregrp.net/confluence/display/ICS/Alert+Runbooks+based+on+monitoring#AlertRunbooksbasedonmonitoring-PlatformTomcathttpsconnectorcurrentThreadsBusyalert">Runbook</a></td></tr><tr><td></td><td></td><td><p>S2</p></td><td></td><td><p>SMAXHttpclient 
InUse > 20 for 30 mins</p><p>(EU8-Prod) Httpclient InUse > 20 for 30 mins or Httpclient InUse > 30 for 15 mins or Httpclient InUse > 80 for 5 mins</p></td><td><a href="https://eu8-prod-monitoring.itsma-ng.com/grafana/d/yxiPJiEMz/4-smax-application-metrics-alert-dashboard?orgId=1&viewPanel=14">Link</a></td><td>[ S2 - Error ] [ farm-name ] Platform Httpclient InUse alert</td><td><a href="https://rndwiki.houston.softwaregrp.net/confluence/display/ICS/Alert+Runbooks+based+on+monitoring#AlertRunbooksbasedonmonitoring-PlatformHttpclientInUsealert">Runbook</a></td></tr><tr><td></td><td><p>Serviceportal</p></td><td><p>S2</p></td><td></td><td><p>SMAXTomcat https connector currentThreadsBusy > 30 for 30 mins</p><p>(EU8-Prod) Tomcat https connector currentThreadsBusy > 30 for 30 mins or Tomcat https connector currentThreadsBusy > 60 for 15 mins or Tomcat https connector currentThreadsBusy > 90 for 5 mins</p></td><td><a href="https://eu8-prod-monitoring.itsma-ng.com/grafana/d/yxiPJiEMz/4-smax-application-metrics-alert-dashboard?orgId=1&viewPanel=18">Link</a></td><td>[ S2 - Error ] [ farm-name ] Serviceportal Tomcat https connector currentThreadsBusy alert</td><td><a href="https://rndwiki.houston.softwaregrp.net/confluence/display/ICS/Alert+Runbooks+based+on+monitoring#AlertRunbooksbasedonmonitoring-ServiceportalTomcathttpsconnectorcurrentThreadsBusyalert">Runbook</a></td></tr><tr><td></td><td></td><td><p>S2</p></td><td></td><td><p>SMAXHttpclient InUse > 20 for 30 mins</p><p>(EU8-Prod) Httpclient InUse > 20 for 30 mins or Httpclient InUse > 30 for 15 mins or Httpclient InUse > 80 for 5 mins</p></td><td><a href="https://eu8-prod-monitoring.itsma-ng.com/grafana/d/yxiPJiEMz/4-smax-application-metrics-alert-dashboard?orgId=1&viewPanel=20">Link</a></td><td>[ S2 - Error ] [ farm-name ] Serviceportal Httpclient InUse alert</td><td><a 
href="https://rndwiki.houston.softwaregrp.net/confluence/display/ICS/Alert+Runbooks+based+on+monitoring#AlertRunbooksbasedonmonitoring-ServiceportalHttpclientInUsealert">Runbook</a></td></tr><tr><td></td><td><p>OpenSearch based Monitoring (TBD)</p></td><td></td><td></td><td><p>Access 5xx</p></td><td></td><td></td><td></td></tr><tr><td></td><td></td><td></td><td></td><td><p>Access Response time</p></td><td></td><td></td><td></td></tr><tr><td></td><td><p>Database level customer metrics</p></td><td></td><td></td><td><p><a href="https://rndwiki.houston.softwaregrp.net/confluence/display/ICS/Monitoring+Database">NativeSACM Transaction Context Queue</a></p></td><td><a href="https://eu8-prod-monitoring.itsma-ng.com/grafana/d/gDyTSt_4zz/6-postgresql-rds-monitoring?orgId=1&viewPanel=10">Link</a></td><td>NativeSACM Transaction Context Queue</td><td></td></tr><tr><td></td><td></td><td></td><td></td><td><p><a href="https://rndwiki.houston.softwaregrp.net/confluence/display/ICS/Monitoring+Database">NativeSACM Transaction Context Queue retries</a></p></td><td><a href="https://eu8-prod-monitoring.itsma-ng.com/grafana/d/gDyTSt_4zz/6-postgresql-rds-monitoring?orgId=1&viewPanel=12">Link</a></td><td>NativeSACM Transaction Context Queue retries</td><td></td></tr><tr><td></td><td></td><td></td><td></td><td><p><a href="https://rndwiki.houston.softwaregrp.net/confluence/display/ICS/Monitoring+Database">NativeSACM Transaction Context Queue stuck?</a></p></td><td></td><td></td><td></td></tr><tr><td></td><td></td><td></td><td></td><td><p><a href="https://rndwiki.houston.softwaregrp.net/confluence/display/ICS/Monitoring+Database">SLT Job queue</a></p></td><td><a href="https://eu8-prod-monitoring.itsma-ng.com/grafana/d/gDyTSt_4zz/6-postgresql-rds-monitoring?orgId=1&viewPanel=24">Link</a></td><td></td><td></td></tr><tr><td></td><td></td><td></td><td></td><td><p>TextDetection Job queue</p></td><td><a 
href="https://eu8-prod-monitoring.itsma-ng.com/grafana/d/gDyTSt_4zz/6-postgresql-rds-monitoring?orgId=1&viewPanel=25">Link</a></td><td></td><td></td></tr><tr><td></td><td></td><td></td><td></td><td><p>IndexEntities Job queue</p></td><td><a href="https://eu8-prod-monitoring.itsma-ng.com/grafana/d/gDyTSt_4zz/6-postgresql-rds-monitoring?orgId=1&viewPanel=26">Link</a></td><td></td><td></td></tr><tr><td></td><td></td><td></td><td></td><td><p>EntitiesHandler Job queue</p></td><td><a href="https://eu8-prod-monitoring.itsma-ng.com/grafana/d/gDyTSt_4zz/6-postgresql-rds-monitoring?orgId=1&viewPanel=27">Link</a></td><td></td><td></td></tr><tr><td></td><td></td><td></td><td></td><td><p>SLT Job Delay time[mins]</p></td><td><a href="https://eu8-prod-monitoring.itsma-ng.com/grafana/d/gDyTSt_4zz/6-postgresql-rds-monitoring?orgId=1&viewPanel=16">Link</a></td><td></td><td></td></tr><tr><td></td><td></td><td></td><td></td><td><p>TextDetection Job Delay time[mins]</p></td><td><a href="https://eu8-prod-monitoring.itsma-ng.com/grafana/d/gDyTSt_4zz/6-postgresql-rds-monitoring?orgId=1&viewPanel=28">Link</a></td><td></td><td></td></tr><tr><td></td><td></td><td></td><td></td><td><p>IndexEntities Job Delay time[mins]</p></td><td><a href="https://eu8-prod-monitoring.itsma-ng.com/grafana/d/gDyTSt_4zz/6-postgresql-rds-monitoring?orgId=1&viewPanel=29">Link</a></td><td></td><td></td></tr><tr><td></td><td></td><td></td><td></td><td><p>EntitiesHandler Job Delay time[mins]</p></td><td><a href="https://eu8-prod-monitoring.itsma-ng.com/grafana/d/gDyTSt_4zz/6-postgresql-rds-monitoring?orgId=1&viewPanel=30">Link</a></td><td></td><td></td></tr><tr><td>Instrumental</td><td><p>Method</p></td><td></td><td></td><td></td><td></td><td></td><td></td></tr><tr><td></td><td><p>Query</p></td><td></td><td></td><td></td><td></td><td></td><td></td></tr><tr><td>Others</td><td></td><td></td><td></td><td><p>When to scale out 
(overloaded)</p></td><td></td><td></td><td></td></tr><tr><td></td><td></td><td></td><td></td><td></td><td></td><td></td><td></td></tr></tbody></table>
|
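
The (EU8-Prod) application rows in the table combine several "value above threshold for N mins" conditions with "or". The sketch below shows one possible reading of that rule. It assumes one metric sample per minute and that "for N mins" means every one of the newest N samples is strictly above the threshold; the names `Window`, `breaches`, and `should_alert` are hypothetical, and the actual Grafana alert rules may aggregate differently (for example, by average over the window).

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Window:
    threshold: float  # metric must stay strictly above this value
    minutes: int      # for this many consecutive one-minute samples (assumption)


# Threshold/duration pairs copied from the (EU8-Prod) rows of the table.
HTTPCLIENT_IN_USE = (Window(20, 30), Window(30, 15), Window(80, 5))
TOMCAT_THREADS_BUSY = (Window(30, 30), Window(60, 15), Window(90, 5))


def breaches(samples: Sequence[float], window: Window) -> bool:
    """Return True if the newest `window.minutes` samples all exceed the threshold."""
    if len(samples) < window.minutes:
        return False  # not enough history to cover the whole window
    recent = samples[-window.minutes:]
    return all(value > window.threshold for value in recent)


def should_alert(samples: Sequence[float], windows: Sequence[Window]) -> bool:
    """Return True if any of the windows is breached (the "or" in the table)."""
    return any(breaches(samples, w) for w in windows)
```

For example, five consecutive Httpclient InUse samples of 85 would trip the "> 80 for 5 mins" condition, while samples of 25 would only trip the rule once 30 of them have accumulated.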