--- title: "Kubernetes" type: entity tags: - cloud - container - orchestration - devops sources: [cloud-operating-model-key-strategies-and-best-practices] created: 2026-04-25 --- # Kubernetes ## Definition Kubernetes (K8s) 是 Google 开源的**容器编排平台**,用于自动化容器化应用的部署、扩缩容和管理。是云原生 (Cloud-Native) 架构的核心基础设施,也是 Agentic AI 自主修复 (Self-Healing) 的主要目标环境。 ## Aliases - K8s - Kubernetes - Container Orchestration Platform ## Major Cloud Implementations | Provider | Service | Description | |----------|---------|-------------| | AWS | EKS (Elastic Kubernetes Service) | 托管 Kubernetes on AWS | | GCP | GKE (Google Kubernetes Engine) | 托管 Kubernetes on GCP | | Azure | AKS (Azure Kubernetes Service) | 托管 Kubernetes on Azure | ## Kubernetes Self-Healing Capabilities Kubernetes 原生提供基础 Self-Healing 能力: ```yaml # Kubernetes Self-Healing 原生机制 apiVersion: apps/v1 kind: Deployment spec: replicas: 3 strategy: type: RollingUpdate template: spec: terminationGracePeriodSeconds: 30 # 内置机制: # - 自动重启失败的容器 # - 替换不健康的 Pod # - 滚动更新确保服务可用 ``` Agentic AI 在原生能力基础上提供**更高级的自我修复**: | 能力 | Kubernetes 原生 | Agentic AI Enhanced | |------|---------------|-------------------| | Pod 重启 | ✅ 自动重启崩溃容器 | ✅ 智能分析根因 + 预防性重启 | | 扩缩容 | ✅ HPA 基于指标 | ✅ 预测性扩缩容 | | 节点恢复 | ✅ 节点故障迁移 | ✅ 主动健康检查 + 预防性迁移 | | 配置修复 | ❌ 需人工介入 | ✅ AI 自动修正 ConfigMap/Secret | ## Agentic AI Monitoring Targets ``` ┌─────────────────────────────────────────────────┐ │ Agentic AI for Kubernetes │ ├─────────────────────────────────────────────────┤ │ 监控层 │ │ ├── Pod Metrics (CPU/Memory/Network) │ │ ├── Workload Health (Deployment/ReplicaSet) │ │ ├── Node Status (Ready/Condition) │ │ └── Cluster Components (etcd, API Server) │ │ │ │ 决策层 │ │ ├── Anomaly Detection (AI) │ │ ├── Root Cause Analysis (AI) │ │ └── Action Planning (AI) │ │ │ │ 执行层 │ │ ├── kubectl API (restart/migrate/scale) │ │ ├── HPA Override (AI-driven scaling) │ │ └── Config Updates (AI-driven fixes) │ └─────────────────────────────────────────────────┘ ``` ## Example > An AI agent monitoring AWS EKS clusters detects high CPU usage due to a rogue pod: > - Pod `payment-service-v2-abc123` CPU usage: 95% > - AI correlates with recent deployment timestamp > - AI identifies: Memory leak in new version > - AI Actions: > 1. Scale deployment to 3 replicas (distribute load) > 2. Create rollback ticket > 3. Notify team via Slack > 4. Auto-rollback after approval ## Related Concepts - [[Self-Healing Systems]] — Kubernetes 是 Self-Healing 的主要载体 - [[Cloud-Native]] — Kubernetes 是 Cloud-Native 的核心 - [[Deployment Automation]] — Kubernetes 部署的自动化 - [[Container Lifecycle Hardening]] — 容器安全加固 ## Related Entities - [[Agentic AI]] — Kubernetes 是 Agentic AI 的管理对象 - EKS, GKE, AKS — 具体云服务商实现 ## Related Sources - [[how-agentic-ai-can-help-for-cloud-devops]] - [[ctp-topic-70-eks-deployment-using-iac]]