ishenwei/nexus

weishen 3f2e1765d8 Auto-sync: 2026-04-18 17:09

2026-04-18 17:09:43 +08:00

5.2 KiB

Auto-healing-2.0_686083907

This page presents all the specifications for auto-fixing, also called auto-healing. It is a newer version based on v1.0.

Requirement

Customers now have very strict SLA requirements. To get the earliest possible alert on availability issues, some even run an Application Performance Monitoring probe three times every minute to check the health of the target application they are using.

An SLA of 99.99% is becoming the standard for most business-critical applications.

To meet these requirements, a critical application issue that impacts the SLA needs to be resolved or recovered within minutes; otherwise, customer escalations will follow.

However, a human cannot reliably react to every critical issue within a few minutes. This is where the auto healing idea comes from: auto healing applications resolve the issues without human intervention.

Architecture Diagram

Overall workflow

Auto healing can be triggered by a schedule, an alert, or manual input.
The analysis process decides the workflow and triggers the actions.
Examples of actions are collecting logs, a rolling restart, and compacting storage.
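
The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `Trigger`, `analyze`, `handle`, and the `ACTIONS` registry are hypothetical and do not come from the actual implementation, and the rules inside `analyze` are made up to show the shape of the flow.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, List


class TriggerKind(Enum):
    SCHEDULE = "schedule"
    ALERT = "alert"
    MANUAL = "manual"


@dataclass
class Trigger:
    kind: TriggerKind
    name: str  # hypothetical identifier, e.g. an alert or schedule name
    farm: str  # target farm


# Hypothetical action registry. Real actions would talk to the farm;
# here each one just returns a description of what it would do.
ACTIONS: Dict[str, Callable[[str], str]] = {
    "collect_logs": lambda farm: f"collected logs from {farm}",
    "rolling_restart": lambda farm: f"rolling restart on {farm}",
    "compact_storage": lambda farm: f"compacted storage on {farm}",
}


def analyze(trigger: Trigger) -> List[str]:
    """Decide which actions run, and in which order (illustrative rules only)."""
    if trigger.kind is TriggerKind.ALERT:
        return ["collect_logs", "rolling_restart"]
    if trigger.kind is TriggerKind.SCHEDULE:
        return ["compact_storage"]
    return ["collect_logs"]


def handle(trigger: Trigger) -> List[str]:
    """Run the actions chosen by the analysis step against the trigger's farm."""
    return [ACTIONS[name](trigger.farm) for name in analyze(trigger)]
```

For example, `handle(Trigger(TriggerKind.ALERT, "alb-5xx", "farm-a"))` would first collect logs and then perform a rolling restart on the hypothetical farm `farm-a`.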

An example of auto healing

The monitoring system continuously monitors the SMAX App IDOL Content data ratio (total doc/committed doc). When the ratio exceeds 1.2, Grafana sends a request to the API gateway.
A healing action is then triggered. Since there is only one action, the analysis process is not triggered.
The app performing the action fetches the configuration and credentials from AWS Parameter Store. (In this case, DynamoDB is not used. It will be used when there is a lot of data to be collected and consolidated.)
The app sends the request to the farm to resolve the issue.
All audit records and logs are kept in S3.
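
The alert condition in the first step of this example is plain arithmetic and can be sketched as follows. The function names and the `threshold` parameter are hypothetical; only the ratio definition (total doc/committed doc) and the 1.2 threshold come from the description above. The calls to the API gateway, Parameter Store, and S3 are deliberately left out.

```python
RATIO_THRESHOLD = 1.2  # threshold taken from the example above


def content_data_ratio(total_docs: int, committed_docs: int) -> float:
    """Return total doc / committed doc for the IDOL Content index."""
    if committed_docs <= 0:
        raise ValueError("committed_docs must be positive")
    return total_docs / committed_docs


def needs_compact(total_docs: int, committed_docs: int,
                  threshold: float = RATIO_THRESHOLD) -> bool:
    """True when the ratio is strictly above the threshold."""
    return content_data_ratio(total_docs, committed_docs) > threshold
```

With 1,300 total documents and 1,000 committed documents the ratio is 1.3, so `needs_compact(1300, 1000)` is true and the compact action would be requested.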

Scope

Auto healing 1.0 mainly rolled out a quick recovery option as a PoC of the solution's capability.

For Auto healing 2.0, the scope changes to the following:

Expanding to more farms in an easy way.
1. Todo: add tasks in Basecamp
Expanding to collection actions.
1. Define the runbooks
2. Roll out the collection actions (PoC)
Expanding the triggers and actions.
1. Define the triggers
2. Define the actions
3. Roll out the new triggers and actions
Exploring the analysis process, if possible.
Exploring the possibility of leveraging OpsB.

Concepts

Trigger

The entrance of the auto-healing.

Scheduler, e.g. 2:00 AM daily
User input
Event, e.g.:
  - ALB HTTP 5XX Count (more than 34 in a 3-minute time frame)
  - Database Memory (free memory less than 2% for more than 5 minutes)
  - SMAX App IDOL Content data ratio (total doc/committed doc) > 1.20
  - SMAX App Tomcat https connector currentThreadsBusy > 30 for 30 minutes
  - SMAX App Httpclient InUse > 20 for 30 minutes
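
Several of the event triggers above only fire when a condition holds for a period of time (for example, currentThreadsBusy > 30 for 30 minutes). A minimal sketch of that "sustained breach" check is shown below. In practice this evaluation is normally done by the monitoring system's alert rules; `SustainedRule` and `fires` are hypothetical names used only to make the logic explicit.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, metric value)


@dataclass
class SustainedRule:
    name: str
    breached: Callable[[float], bool]  # True when a single value violates the limit
    hold_seconds: float                # how long the breach must last


def fires(rule: SustainedRule, samples: List[Sample]) -> bool:
    """True if the latest unbroken run of breaching samples spans hold_seconds."""
    if not samples:
        return False
    ordered = sorted(samples)  # oldest first
    latest_ts = ordered[-1][0]
    breach_start = None
    # Walk backwards from the newest sample until the first non-breaching one.
    for ts, value in reversed(ordered):
        if not rule.breached(value):
            break
        breach_start = ts
    if breach_start is None:
        return False
    return latest_ts - breach_start >= rule.hold_seconds


# Rule taken from the list above; the rule name is an assumption.
tomcat_busy = SustainedRule("tomcat_busy_threads", lambda v: v > 30, 30 * 60)
```

A single sample below the limit resets the window, so a short dip in the metric delays the trigger. That is a deliberate design choice in this sketch: it favors fewer false alarms over faster reaction.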

Analysis Process (Optional)

This process analyzes the trigger and decides which collection and healing actions run, and in what order. If there is only one action, the analysis process can be skipped.

Collection Actions

The group of actions to do collection jobs.

Collect application logs
Collect application dumps (thread dump, memory dump, etc)
Collect application traces
Add information to an incident

Healing Actions

The group of actions to do healing jobs.

Rolling restart key deployments
SMAX App Smart Analytics Content Compact

Target environment

The farms affected by a specific issue.

Farm

A deployment of the suite product.

Combined triggers and healings

Scheduled healing

Weekly - Rolling restart key deployments
Weekly - Smart Analytics Content Compact

Event triggered healing

ALB 5xx alert - Rolling restart key deployments
Database free memory alert - Rolling restart key deployments
Smart Analytics Content data ratio(total doc/committed doc) alert - Smart Analytics Content Compact
Tomcat https connector threads/MAX threads alert - Rolling restart specific deployments
Httpclient InUse/Max alert - Rolling restart specific deployments
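
The combinations above amount to a lookup table from trigger to healing action. One possible way to encode them is sketched below. The keys and action identifiers are hypothetical names chosen for this example; only the pairings themselves come from the lists above.

```python
from typing import Dict, List, Optional

# Event triggered healing: alert name -> healing action.
EVENT_HEALINGS: Dict[str, str] = {
    "alb_5xx": "rolling_restart_key_deployments",
    "database_free_memory": "rolling_restart_key_deployments",
    "smart_analytics_content_data_ratio": "smart_analytics_content_compact",
    "tomcat_https_connector_threads": "rolling_restart_specific_deployments",
    "httpclient_in_use": "rolling_restart_specific_deployments",
}

# Scheduled healing: schedule name -> healing actions.
SCHEDULED_HEALINGS: Dict[str, List[str]] = {
    "weekly": [
        "rolling_restart_key_deployments",
        "smart_analytics_content_compact",
    ],
}


def healing_for_alert(alert: str) -> Optional[str]:
    """Return the healing action for an alert, or None if the alert is unmapped."""
    return EVENT_HEALINGS.get(alert)
```

Returning `None` for an unknown alert, rather than falling back to a default action, keeps an unmapped or misnamed alert from restarting anything.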

Mechanism to survive with false alarms

Auto healing steps may be triggered by false alarms. To protect the farm in that case, the actions used must have no impact on availability or performance.

In other words, even if the auto healing steps are triggered by accident, they should not affect the availability or performance of the farm. The mechanisms include, but are not limited to, the following:

The jobs can only be triggered once an hour
Once restart is required, rolling restart should be used
If the job is not executed successfully, notifications will be sent to administrators
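
Two of the safeguards above, the once-an-hour limit and the failure notification, can be sketched as a small wrapper around a job. `GuardedRunner`, its `notify` callback, and the injectable `clock` are assumptions made for this example and not part of the actual implementation. The sketch also assumes the limit applies per job name and that a failed attempt counts toward it.

```python
import time
from typing import Callable, Dict

COOLDOWN_SECONDS = 3600  # "once an hour", from the list above


class GuardedRunner:
    """Runs jobs at most once per cooldown window and reports failures."""

    def __init__(self, notify: Callable[[str], None],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._notify = notify  # e.g. sends a message to administrators
        self._clock = clock    # injectable so the cooldown can be tested
        self._last_run: Dict[str, float] = {}

    def run(self, job_name: str, job: Callable[[], None]) -> str:
        now = self._clock()
        last = self._last_run.get(job_name)
        if last is not None and now - last < COOLDOWN_SECONDS:
            return "skipped"  # triggered again within the hour
        # Record the attempt before running, so failures also count.
        self._last_run[job_name] = now
        try:
            job()
        except Exception as exc:
            self._notify(f"{job_name} failed: {exc}")
            return "failed"
        return "ok"
```

Counting failed attempts toward the limit means a repeatedly failing job cannot hammer the farm. The trade-off is that a retry has to wait for the next window or be started manually.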

Reference

ESM Cloud Unified Monitoring
