Major Incident Management Process

Introduction

This document describes the process and best practices for identifying, assessing, responding to, communicating about, and tracking a Major Incident in a Customer Cloud environment.

Identification and Detection

  • Automated Monitoring: Utilize robust monitoring tools to detect anomalies, performance issues, and potential outages.
  • User Reports: Encourage users to report issues promptly via designated channels.

Best Practice

  • In the current Cloud service, a Major Incident is defined as any of the following:
    • Service Outage - users cannot access the application to use any cloud services
    • Performance Degradation - performance issues are evident in the application through monitoring and user feedback
    • Major Functionalities - currently this refers mainly to the main functions of each product, as monitored through APM

Initial Assessment

  • Incident Triage: Quickly assemble a cross-functional incident response team, including representatives from development, operations, and support.
  • Impact Analysis: Evaluate the scope and impact of the incident on users, systems, and business operations.

Best Practice

  • We analyze an incident and assess its impact on our customers in several ways:
    • From APM monitoring - Major Function Service Availability Check. Service Availability checks are currently defined for the major features in the product, and the Service Center team watches this monitor and raises alerts 24x7. As soon as a problem occurs, a notification is posted in the Teams channel. However, the probability of a false alert on this monitor is high, so manual validation is necessary to determine whether it is a real Major Incident.
    • From Unified Monitoring - monitor infrastructure, K8S nodes, K8S pods, and the application with Grafana against various pre-defined metric levels. For details, please refer to: ESM Cloud Unified Monitoring
    • Confirm by Manual Validation - once APM Monitoring and Unified Monitoring have both raised alerts, we can also check the system manually by logging in. Team members should keep a saved login for a monitoring tenant on each farm so that they can log in quickly and pin down the problem.
  • Our goal is to determine as quickly as possible whether the farm has an S0/S1 level problem, so that we can initiate the Incident Response process without delay.
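The triage flow described above can be sketched as a small decision function. This is only an illustration under stated assumptions: the names `Signals`, `Verdict`, and `triage` are hypothetical, and the rule that any automated alert triggers manual validation is one reading of the practice, not a documented policy.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Verdict(Enum):
    NO_ACTION = "no_action"
    VALIDATE_MANUALLY = "validate_manually"
    MAJOR_INCIDENT = "major_incident"


@dataclass
class Signals:
    """Hypothetical snapshot of what is known about one farm."""
    apm_alert: bool       # APM Service Availability check raised an alert
    unified_alert: bool   # Unified Monitoring (Grafana) raised an alert
    # Result of logging in to the monitoring tenant by hand:
    # True = problem reproduced, False = system looks healthy,
    # None = manual validation has not been done yet.
    manual_check_failed: Optional[bool] = None


def triage(s: Signals) -> Verdict:
    """Assumed rule: automated alerts alone never confirm a Major Incident,
    because the APM monitor is known to produce false alerts."""
    if not (s.apm_alert or s.unified_alert):
        return Verdict.NO_ACTION
    if s.manual_check_failed is None:
        return Verdict.VALIDATE_MANUALLY
    if s.manual_check_failed:
        return Verdict.MAJOR_INCIDENT
    return Verdict.NO_ACTION  # alert was a false alert


if __name__ == "__main__":
    print(triage(Signals(apm_alert=True, unified_alert=False)))
    print(triage(Signals(apm_alert=True, unified_alert=True, manual_check_failed=True)))
```

The point of the sketch is that the manual login check is the deciding step: an alert only moves the farm into validation, and the Incident Response process starts once the problem is reproduced.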

Incident Logging

  • Centralized Logging: Maintain a centralized incident log that captures all relevant details, timestamps, and initial impact assessment.
  • Severity Classification: Categorize incidents based on severity to prioritize response efforts.

Best Practice

  • Once the Major Incident has been confirmed, we need to notify the Service Center team to create a Centralized Incident in the PPM Essential system. This record is used to follow up on the RCA update and to define Corrective Actions and Preventive Actions. Usually the Incident Manager is designated as the RCA Owner and provides the detailed information.
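As an illustration of what a centralized incident record might capture, here is a minimal sketch. The field names and the `IncidentRecord` class are assumptions made for this example; they do not reflect the actual PPM Essential schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import List, Tuple


class Category(Enum):
    # The three Major Incident categories named in this document.
    SERVICE_OUTAGE = "Service Outage"
    PERFORMANCE_DEGRADATION = "Performance Degradation"
    MAJOR_FUNCTIONALITIES = "Major Functionalities"


def _now() -> datetime:
    return datetime.now(timezone.utc)


@dataclass
class IncidentRecord:
    """Hypothetical record; not the real PPM Essential data model."""
    farm: str
    category: Category
    severity: str            # e.g. "S0" or "S1"
    initial_impact: str      # initial impact assessment
    rca_owner: str           # usually the Incident Manager
    detected_at: datetime = field(default_factory=_now)
    timeline: List[Tuple[datetime, str]] = field(default_factory=list)

    @property
    def is_major(self) -> bool:
        # This document treats S0/S1 problems as Major Incidents.
        return self.severity in ("S0", "S1")

    def log(self, note: str) -> None:
        """Append a timestamped entry to the incident timeline."""
        self.timeline.append((_now(), note))


if __name__ == "__main__":
    rec = IncidentRecord(
        farm="EU8",
        category=Category.SERVICE_OUTAGE,
        severity="S0",
        initial_impact="Users cannot log in",
        rca_owner="incident.manager",
    )
    rec.log("Teams chat group created")
    print(rec.is_major, len(rec.timeline))
```

Keeping timestamps in UTC avoids ambiguity when team members in different regions add entries to the same timeline.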

Communication

  • Internal Communication: Establish communication channels for the incident response team, ensuring timely updates and coordination.
  • External Communication: Prepare predefined messages for customers and stakeholders, providing transparency about the incident.

Best Practice

Internal Communication:

  • Create a new Teams chat group promptly and add the relevant stakeholders, so that the major incident gets timely attention and support.
  • Relevant stakeholders include:
    • Incident Manager: can be a Cloud team lead or a senior team member. This role coordinates and directs the response during an incident.
    • CORE CPE Engineer: CORE CPE engineers follow up on customers' incident-related tickets and respond to customers' related questions in a timely manner.
    • Cloud Ops Engineer
    • RnD Emergency Contact
  • Internal communication is very important for saving time and getting all the relevant people involved in supporting the incident.

External Communication:

  • When a major incident occurs, we should communicate with the customer as soon as possible in order to keep them up to date.
  • There are currently two main types of communication:
    • Send a notification to specified customer groups via PCS. For details, please refer to: Send email notification to SaaS customers via PCS
    • Publish an Incident Report on the SaaS Service Health Page. For details, please refer to: Operation guide for SaaS Service Health Page
  • It is best to follow the given format when posting an incident notification: Major Incident Customer Communication Template

Resolution

  • Runbooks and Playbooks: Develop detailed runbooks and playbooks for common incident scenarios, outlining step-by-step resolution procedures.
  • Escalation Procedures: Define clear escalation paths for issues that require higher-level expertise or management attention.

Best Practice

  • The Cloud Service team has developed a detailed runbook that addresses some of the most common problems and lets you choose the appropriate way to recover services for different types of alerts. For details, please refer to: Alert Runbooks based on monitoring
  • If the service is still not restored properly through the existing runbook, we need to immediately involve RnD engineers through a pre-defined escalation path.

Post-Incident Review (PIR)

  • Root Cause Analysis (RCA): Conduct a thorough RCA to identify the underlying cause of the incident.
  • Documentation: Document the incident resolution process, lessons learned, and preventive measures for future incidents.

Best Practice

  • The current best practice is that, for each major incident, the Cloud Service team creates a wiki page to track important information, such as the important changes made during the incident, the focus of the discussion, and the preventive actions planned for the future. For example: 2023/11/08 - EU8 - SMAX- Service Outage
  • In addition, we track each major incident and clearly define its Owner. See: ESM Cloud Incident Tracking List
  • The Cloud Service team will drive these processes as the primary Incident Owner.
  • Once the relevant Corrective Actions and Preventive Actions have been defined, the Cloud Service team's Incident Owner needs to record the CAPA information into the Major Incident in PPM Essential for ongoing tracking. For details, please refer to: Incident Report and Actions from RCA Owner on Essentials.pdf

Continuous Improvement

  • Iterative Updates: Regularly update incident response procedures based on lessons learned from past incidents.
  • Training and Drills: Conduct regular training sessions and simulated drills to ensure the incident response team is well-prepared.

Best Practice

  • We regularly hold refresher training sessions to enhance the team's understanding of the Major Incident process and to share best practices.

Monitoring and Alerting Enhancements

  • Continuous Monitoring: Implement ongoing improvements to monitoring and alerting systems to proactively detect potential issues.
  • Automated Remediation: Integrate automated remediation tools to address common incidents swiftly.

Best Practice

  • Adjust monitoring metrics in a timely manner to reduce the probability of false alerts. We need more accurate and effective monitoring to catch real problems.
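One common way to reduce false alerts is to require several consecutive failed checks before raising an alert. The sketch below illustrates that general technique only; it is not how the APM or Unified Monitoring checks are actually configured, and the class name and default threshold are hypothetical.

```python
from collections import deque


class ConsecutiveFailureGate:
    """Fire an alert only after `threshold` failed checks in a row.

    Hypothetical helper for illustration; the threshold value is an
    assumption, not a recommended setting.
    """

    def __init__(self, threshold: int = 3) -> None:
        if threshold < 1:
            raise ValueError("threshold must be at least 1")
        self.threshold = threshold
        # Keeps only the most recent `threshold` results.
        self._recent = deque(maxlen=threshold)

    def observe(self, check_passed: bool) -> bool:
        """Record one check result; return True if an alert should fire."""
        self._recent.append(check_passed)
        window_full = len(self._recent) == self.threshold
        return window_full and not any(self._recent)


if __name__ == "__main__":
    gate = ConsecutiveFailureGate(threshold=3)
    for passed in [False, True, False, False, False]:
        print(passed, gate.observe(passed))
```

A single transient failure is ignored, at the cost of detecting a real outage a few check intervals later. As written, the gate keeps returning True on every further failure, so a real alerting pipeline would also need to deduplicate notifications.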

Documentation and Knowledge Sharing

  • Knowledge Base: Maintain a comprehensive knowledge base with troubleshooting guides, FAQs, and resolutions for known issues.
  • Documentation Accessibility: Ensure that incident response documentation is easily accessible to all team members.

Best Practice

  • We need to keep improving the runbook so that team members have a consistent way to monitor and resolve issues at all levels. See: Alert Runbooks based on monitoring

Review and Audit

  • Periodic Audits: Conduct periodic reviews and audits of the major incident management process to identify areas for improvement.
  • Compliance Checks: Ensure that the process aligns with industry best practices and regulatory requirements.

Best Practice

  • We need to plan and regularly conduct Major Incident rehearsals to ensure that team members are familiar with the process and understand the importance of the division of labor.

Training Record:

https://opentextcorporation-my.sharepoint.com/:v:/g/personal/wshen_opentext_com/EaP0NtIYS1pCn3LWaMDkpMMBX5AVF2HOQlMos7L39PMRaA?referrer=Teams.TEAMS-ELECTRON&referrerScenario=MeetingChicletGetLink.view.view