Auto-healing-2.0_686083907
This page presents the specifications for auto-fixing, also called auto-healing. It is a newer version based on v1.0.
Requirement
Customers now have very strict SLA requirements. To get the earliest possible alert on an availability issue, some even run an Application Performance Monitoring probe three times every minute to check the health of the target application they are using.
An SLA of 99.99% is becoming a standard for most of the business critical applications.
To meet these requirements, a critical application issue that impacts the SLA needs to be resolved or recovered within minutes; otherwise customer escalations will arrive.
However, a human cannot reliably react to every critical issue within a few minutes. This is the motivation for auto healing, which uses auto healing applications to resolve issues without human intervention.
Architecture Diagram
Overall workflow
- Auto healing can be triggered by a schedule, an alert, or manual input.
- Analysis process decides the workflow and triggers actions
- Example actions are collecting logs, a rolling restart, and compacting storage.
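The workflow above can be sketched as a small dispatcher. This is only an illustration under stated assumptions: the names `Trigger`, `analyze`, `handle`, the trigger name `idol_content_ratio`, and the rule that other triggers collect logs before restarting are all hypothetical and not part of the specification.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Trigger:
    kind: str  # "schedule", "alert", or "manual"
    name: str  # hypothetical trigger name


# Placeholder actions; real ones would send requests to the farm.
def collect_logs(farm: str) -> str:
    return f"collected logs from {farm}"


def rolling_restart(farm: str) -> str:
    return f"rolling restart on {farm}"


def compact_storage(farm: str) -> str:
    return f"compacted storage on {farm}"


ACTIONS: Dict[str, Callable[[str], str]] = {
    "collect_logs": collect_logs,
    "rolling_restart": rolling_restart,
    "compact_storage": compact_storage,
}


def analyze(trigger: Trigger) -> List[str]:
    """Decide which actions to run, in order (illustrative rules only)."""
    if trigger.name == "idol_content_ratio":
        return ["compact_storage"]
    # Assumed default: gather evidence first, then heal.
    return ["collect_logs", "rolling_restart"]


def handle(trigger: Trigger, farm: str) -> List[str]:
    """Run the actions chosen by the analysis step against one farm."""
    return [ACTIONS[name](farm) for name in analyze(trigger)]
```

For example, `handle(Trigger("alert", "idol_content_ratio"), "farm-a")` would run only the storage compaction, while any other trigger in this sketch collects logs and then performs a rolling restart.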
An example of auto healing
- The monitoring system continuously monitors the SMAX App IDOL Content data ratio (total doc/committed doc). When the ratio exceeds 1.2, Grafana sends a request to the API gateway.
- A healing action is then triggered. Since there is only one action, the analysis process is not triggered.
- The app doing the action fetches the configuration and credentials from AWS Parameter Store. (In this case, DynamoDB is not used. It will be used when there are lots of data to be collected and consolidated.)
- The app sends the request to the farm to resolve the issue.
- All audit records and logs are kept in S3.
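The ratio check in the first step of this example can be illustrated as follows. This is a minimal sketch: in the described setup the check is evaluated by the monitoring system, and the function names `content_ratio` and `needs_compact` are hypothetical.

```python
RATIO_THRESHOLD = 1.2  # threshold taken from the example above


def content_ratio(total_docs: int, committed_docs: int) -> float:
    """Return total doc / committed doc for the IDOL Content data."""
    if committed_docs <= 0:
        raise ValueError("committed_docs must be positive")
    return total_docs / committed_docs


def needs_compact(total_docs: int, committed_docs: int,
                  threshold: float = RATIO_THRESHOLD) -> bool:
    """True when the ratio is strictly above the threshold."""
    return content_ratio(total_docs, committed_docs) > threshold
```

With these assumptions, 1300 total documents against 1000 committed documents gives a ratio of 1.3 and would trigger the compact action, while 1100 against 1000 would not.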
Scope
Auto healing 1.0 mainly rolled out a quick recovery option as a PoC of the solution's capability.
For Auto healing 2.0, the scope changes to the following:
- Expanding to more farms in an easy way.
- Todo: add tasks in basecamp
- Expanding to collection actions.
- 1. Define the runbooks 2. Roll out the collection actions (PoC)
- Expanding the trigger and actions.
- 1. Define the trigger 2. Define the action 3. Roll out the new triggers and actions
- Exploring the Analysis process if possible.
- Exploring the possibility of leveraging OpsB
Concepts
Trigger
The entrance of the auto-healing.
- Scheduler: e.g.: 2:00 AM Daily
- User input
- Event: e.g.:
- ALB HTTP 5XX Count (more than 34 in a 3-minute time frame)
- Database Memory (free memory less than 2% for more than 5 minutes)
- SMAX App IDOL Content data ratio (total doc/committed doc) > 1.20
- SMAX App Tomcat https connector currentThreadsBusy > 30 for 30 minutes
- SMAX App Httpclient InUse > 20 for 30 minutes
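To make the semantics of a windowed event trigger concrete, the sketch below counts events in a sliding time window, using the ALB HTTP 5XX rule (more than 34 in 3 minutes) as defaults. It is an assumption-laden illustration: the class name `WindowCounter` is hypothetical, and in practice the monitoring system would evaluate this rule rather than custom code.

```python
from collections import deque
from typing import Deque


class WindowCounter:
    """Count events inside a sliding time window (hypothetical helper)."""

    def __init__(self, window_seconds: float = 180, limit: int = 34) -> None:
        self.window_seconds = window_seconds
        self.limit = limit
        self._events: Deque[float] = deque()

    def record(self, timestamp: float) -> bool:
        """Record one event; return True if the count exceeds the limit."""
        self._events.append(timestamp)
        cutoff = timestamp - self.window_seconds
        # Drop events that have fallen out of the window.
        while self._events and self._events[0] <= cutoff:
            self._events.popleft()
        return len(self._events) > self.limit
```

Timestamps are passed in explicitly (in seconds) so that the behavior can be checked without waiting for real time to pass.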
Analysis Process (Optional)
This process analyzes the trigger and decides which actions (collection actions and healing actions) to run and in what order. If there is only one action, the analysis process is optional.
Collection Actions
The group of actions to do collection jobs.
- Collect application logs
- Collect application dumps (thread dump, memory dump, etc)
- Collect application traces
- Add information to an incident
Healing Actions
The group of actions to do healing jobs.
- Rolling restart key deployments
- SMAX App Smart Analytics Content Compact
Target environment
The farms affected by a specific issue.
Farm
A deployment of the suite product.
Combined triggers and healings
Scheduled healing
- Weekly - Rolling restart key deployments
- Weekly - Smart Analytics Content Compact
Event triggered healing
- ALB 5xx alert - Rolling restart key deployments
- Database free memory alert - Rolling restart key deployments
- Smart Analytics Content data ratio(total doc/committed doc) alert - Smart Analytics Content Compact
- Tomcat https connector threads/MAX threads alert - Rolling restart specific deployments
- Httpclient InUse/Max alert - Rolling restart specific deployments
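The combinations above could be kept as plain configuration data. The following is a sketch only; the identifiers (`alb_5xx`, `rolling_restart_key_deployments`, and so on) are hypothetical labels for the alerts and actions listed in this section, not names used by any existing system.

```python
from typing import Dict, List

# Healing actions that run on a weekly schedule.
WEEKLY_SCHEDULE: List[str] = [
    "rolling_restart_key_deployments",
    "smart_analytics_content_compact",
]

# Alert name -> healing action, mirroring the event-triggered list.
HEALING_BY_ALERT: Dict[str, str] = {
    "alb_5xx": "rolling_restart_key_deployments",
    "database_free_memory": "rolling_restart_key_deployments",
    "smart_analytics_content_ratio": "smart_analytics_content_compact",
    "tomcat_https_connector_threads": "rolling_restart_specific_deployments",
    "httpclient_in_use": "rolling_restart_specific_deployments",
}


def healing_for(alert: str) -> str:
    """Look up the healing action for an alert; unknown alerts are rejected."""
    try:
        return HEALING_BY_ALERT[alert]
    except KeyError:
        raise ValueError(f"no healing action defined for {alert!r}") from None
```

Rejecting unknown alerts, instead of falling back to a default action, keeps an unexpected alert from triggering a restart that was never agreed on.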
Mechanism to survive with false alarms
Auto healing steps may be caused by false alarms. To protect the farm in that case, auto healing must always use actions that have no impact on availability or performance.
In other words, even if auto healing steps are triggered by accident, they should not impact the availability or performance of the farm. The mechanisms include, but are not limited to, the following:
- The jobs can only be triggered once an hour
- Once restart is required, rolling restart should be used
- If the job is not executed successfully, notifications will be sent to administrators
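Two of the mechanisms above, the once-an-hour limit and the failure notification, can be sketched together as a small guard around each job. The class `GuardedRunner`, its `notify` callback, and the return values are hypothetical; how administrators are actually notified is not specified here.

```python
from typing import Callable, Dict

COOLDOWN_SECONDS = 3600  # a job may run at most once an hour


class GuardedRunner:
    """Run jobs with a per-job cooldown and notify on failure."""

    def __init__(self, notify: Callable[[str], None],
                 cooldown: float = COOLDOWN_SECONDS) -> None:
        self._notify = notify
        self._cooldown = cooldown
        self._last_run: Dict[str, float] = {}

    def run(self, job_name: str, job: Callable[[], None], now: float) -> str:
        """Return "skipped", "failed", or "ok" for this attempt."""
        last = self._last_run.get(job_name)
        if last is not None and now - last < self._cooldown:
            return "skipped"  # likely a repeated or false alarm
        # Record the attempt before running, so failures also cool down.
        self._last_run[job_name] = now
        try:
            job()
        except Exception as exc:
            self._notify(f"{job_name} failed: {exc}")
            return "failed"
        return "ok"
```

In this sketch a failed attempt also starts the cooldown, on the assumption that retrying a failing job immediately is riskier than waiting for an administrator to respond to the notification.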

