Auto-sync: 2026-04-18 17:09

2026-04-18 17:09:43 +08:00
parent 60d2f8254b
commit 3f2e1765d8
276 changed files with 17241 additions and 20 deletions
--- a/knowledgebase/csd-wiki/ICSD/ITOM-SaaS-Pain-Points_686083998.md
+++ b/knowledgebase/csd-wiki/ICSD/ITOM-SaaS-Pain-Points_686083998.md
@@ -0,0 +1,68 @@
+# ITOM-SaaS-Pain-Points_686083998
+## Introduction
+
+This page presents all the pain points for SaaS delivery at the ITOM BU level.
+
+## Legends
+
+SMAX
+
+HCMX
+
+FINOPS
+
+OO
+
+OPS
+
+## Pain points
+
+- Process
+	- OPS Don’t wait until the retro; we need a system that can record such inputs.
+- Quality
+	- Recent system crashes were caused by new features. (Incident ID, Feature ID?)
+		- SMAX SMAX platform pod loads are not well balanced. (24.2 may fix it)
+		- It's not...
+		- SMAX nativeSACM has high resource usage, especially network.
+		- SMAX SMAX still has OOTB index issues.
+		- SMAX Redis is a single point of failure and is not easy to debug.
+		- SMAX HCMX FINOPS OO Need more post-upgrade tests. / CMS post-upgrade issues are not acceptable.
+		- SMAX SLT task is weekly (unplanned changes are not submitted)
+		- Since 2024.1, CMS quality has been poor, with many issues.
+- SLA  
+	- SMAX Missing monitoring metrics for CMS/Native SACM
+		- CMS is not fully ready for auto-healing (a rolling restart takes more than 1 minute)
+		- SMAX Missing correlations on several S1 / S2 alerts, for example, 5xx errors and soft interrupts.
+			- 5xx errors: put errors into categories
+			- Soft interrupt: more metrics / diagnostics to get the detailed breakdown of the interrupt
+		- SMAX An overall throttling mechanism/rate limit is missing, which causes unexpected outages on the farm.
+		- 24.4?
+		- OO OO upgrade takes hours to finish; the OORAS pods can only be upgraded one after another.
+		- Solved in 24.3.
+- Security
+	- OPS WAF rules have not been rolled out; the farm receives malicious requests every day
+		- WIP
+		- OPS Major security KPIs are missing, including Qualys score, SIEM integration, etc.
+- Compliance
+	- Missing the standard/certification for EU-managed
+- Maintenance & Operation
+	- OPS Operational effort (upgrades, patches, etc.) increases as the number of farms grows. (Automation rate is low.)
+		- OPS Monitoring needs to be improved: more meaningful alerts, fewer false alerts.
+		- Aligning the CPU alert threshold to 99.9%.
+		- OPS Troubleshooting takes lots of time.
+		- OPS Cannot always leverage Ops from other regions / how to develop Ops in other regions
+		- OPS Too many threads; all members need to be self-driven.
+		- OPS Need a way to make better use of Shen Wei’s time
+		- SMAX Logging issue
+			- Accumulated logs cost more
+			- Too many logs slow down troubleshooting
+			- Too much log writing uses up the network throughput
+		- OO The tenant import feature cannot handle integrations like nativeSACM and OO.
+		- SMAX OO Too many special settings are needed to keep the system stable, and many of them can be lost during upgrades.
+- Cost
+	- SMAX HCMX FINOPS OO When customer usage increases, resource usage doesn't increase linearly.
+		- SMAX HCMX FINOPS OO FinOps, SMAX, and OO consume lots of resources; CMS resource usage is OK
+			- HCMX and OO sizing is based on tenant count instead of usage. For almost all ESM farms, OO needs the medium profile ($65K/y) or even larger, which doesn't contribute any license revenue.
+			- FinOps cost is usually more than the SMAX large profile ($113K/y)
+			- SMAX sizing is not helpful for medium or large customers. The farm usually needs double or triple the resources required by the sizing guide.
+			- There is no sizing guide for integration, including API integration, nativeSACM, etc.