---
title: "Kubernetes"
type: entity
tags:
- cloud
- container
- orchestration
- devops
sources: [cloud-operating-model-key-strategies-and-best-practices]
created: 2026-04-25
---

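# Kubernetes

## Definition

Kubernetes (K8s) is an open-source **container orchestration platform**, originally developed by Google, that automates the deployment, scaling, and management of containerized applications. It is core infrastructure for cloud-native architectures and a primary target environment for Agentic AI self-healing.
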
## Aliases

- K8s
- Kubernetes
- Container Orchestration Platform

## Major Cloud Implementations

| Provider | Service | Description |
|----------|---------|-------------|
| AWS | EKS (Elastic Kubernetes Service) | Managed Kubernetes on AWS |
| GCP | GKE (Google Kubernetes Engine) | Managed Kubernetes on GCP |
| Azure | AKS (Azure Kubernetes Service) | Managed Kubernetes on Azure |

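## Kubernetes Self-Healing Capabilities

Kubernetes natively provides basic self-healing capabilities:

```yaml
# Kubernetes native self-healing mechanisms
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app              # placeholder name, not from the source
spec:
  replicas: 3
  selector:
    matchLabels:
      app: example-app
  strategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: example-app
    spec:
      terminationGracePeriodSeconds: 30
      containers:
        - name: app
          image: example/app:1.0 # placeholder image, not from the source
# Built-in mechanisms:
#   - Automatically restart failed containers
#   - Replace unhealthy Pods
#   - Rolling updates keep the service available
```
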
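
The built-in mechanisms above share one control-loop pattern: a controller compares the desired state with the observed state and acts on the difference. The following is a minimal Python simulation of that idea, not the actual controller code. `Pod`, `reconcile`, and the `pod-N` naming scheme are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class Pod:
    name: str
    healthy: bool = True


_ids = count(1)  # hypothetical name generator for replacement Pods


def reconcile(pods: list[Pod], desired_replicas: int) -> list[Pod]:
    """Return the Pod set after one reconcile pass.

    Unhealthy Pods are dropped, then new Pods are created (or surplus
    ones removed) until the count matches desired_replicas.
    """
    alive = [p for p in pods if p.healthy]
    while len(alive) < desired_replicas:
        alive.append(Pod(name=f"pod-{next(_ids)}"))
    return alive[:desired_replicas]


pods = [Pod("a"), Pod("b", healthy=False), Pod("c")]
pods = reconcile(pods, desired_replicas=3)
print([p.name for p in pods])
```

After one pass the unhealthy Pod `b` is gone and a replacement has been added, so the replica count is back at 3 without any outside intervention.
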
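Agentic AI builds on these native capabilities to provide **more advanced self-healing**:

| Capability | Kubernetes Native | Agentic AI Enhanced |
|------------|-------------------|---------------------|
| Pod restart | ✅ Automatically restarts crashed containers | ✅ Root-cause analysis + preventive restarts |
| Scaling | ✅ HPA, metric-based | ✅ Predictive scaling |
| Node recovery | ✅ Reschedules Pods away from failed nodes | ✅ Proactive health checks + preventive migration |
| Config repair | ❌ Requires manual intervention | ✅ AI automatically corrects ConfigMap/Secret |
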
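
To make the scaling row concrete, the sketch below contrasts the ratio rule documented for the Horizontal Pod Autoscaler, `ceil(currentReplicas * currentMetric / targetMetric)`, with one possible predictive variant. The predictive function is an assumption for illustration (a naive linear extrapolation of the last two samples), not a description of any particular product. The CPU samples are made up, and the HPA tolerance band and min/max replica bounds are omitted.

```python
import math


def hpa_desired_replicas(current_replicas: int,
                         current_metric: float,
                         target_metric: float) -> int:
    """Reactive scaling: the HPA ratio rule, without tolerance or bounds."""
    return math.ceil(current_replicas * current_metric / target_metric)


def predictive_desired_replicas(current_replicas: int,
                                samples: list[float],
                                target_metric: float,
                                steps_ahead: int = 1) -> int:
    """Hypothetical predictive variant.

    Extrapolates the last two samples linearly, then applies the same
    ratio rule to the forecast instead of the current value.
    """
    slope = samples[-1] - samples[-2]
    forecast = max(samples[-1] + slope * steps_ahead, 0.0)
    return hpa_desired_replicas(current_replicas, forecast, target_metric)


cpu = [40.0, 55.0, 70.0]  # made-up average CPU utilization (%) per interval
reactive = hpa_desired_replicas(3, cpu[-1], 50.0)
predictive = predictive_desired_replicas(3, cpu, 50.0)
print(reactive, predictive)  # prints "5 6"
```

With a 50% target, the reactive rule asks for 5 replicas based on the current 70% reading, while the extrapolated 85% forecast asks for 6, which is the kind of head start predictive scaling is meant to provide.
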
## Agentic AI Monitoring Targets

```
┌─────────────────────────────────────────────────┐
│            Agentic AI for Kubernetes            │
├─────────────────────────────────────────────────┤
│  Monitoring layer                               │
│  ├── Pod Metrics (CPU/Memory/Network)           │
│  ├── Workload Health (Deployment/ReplicaSet)    │
│  ├── Node Status (Ready/Condition)              │
│  └── Cluster Components (etcd, API Server)      │
│                                                 │
│  Decision layer                                 │
│  ├── Anomaly Detection (AI)                     │
│  ├── Root Cause Analysis (AI)                   │
│  └── Action Planning (AI)                       │
│                                                 │
│  Execution layer                                │
│  ├── kubectl API (restart/migrate/scale)        │
│  ├── HPA Override (AI-driven scaling)           │
│  └── Config Updates (AI-driven fixes)           │
└─────────────────────────────────────────────────┘
```

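
The three layers in the diagram can be read as a monitor, decide, act pipeline. The sketch below is one way to wire them together, under stated assumptions: `PodSample`, `detect_anomalies`, `plan_actions`, and `execute` are hypothetical names, a simple z-score stands in for the AI anomaly detector, root-cause analysis is skipped, and the execution layer only echoes commands and never touches a cluster.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class PodSample:
    """Monitoring layer: one CPU reading for one Pod."""
    pod: str
    deployment: str
    cpu_percent: float


def detect_anomalies(history: list[float],
                     latest: list[PodSample],
                     z_threshold: float = 3.0) -> list[PodSample]:
    """Decision layer, step 1: flag readings far above the historical mean."""
    mu, sigma = mean(history), pstdev(history)
    if sigma == 0:
        return [s for s in latest if s.cpu_percent > mu]
    return [s for s in latest if (s.cpu_percent - mu) / sigma > z_threshold]


def plan_actions(anomalies: list[PodSample]) -> list[str]:
    """Decision layer, step 2: one restart command per affected Deployment."""
    deployments = sorted({a.deployment for a in anomalies})
    return [f"kubectl rollout restart deployment/{d}" for d in deployments]


def execute(commands: list[str]) -> list[str]:
    """Execution layer: dry run only, returns what would be executed."""
    return [f"[dry-run] {c}" for c in commands]


history = [20.0, 22.0, 19.0, 21.0, 18.0]  # made-up baseline CPU readings (%)
latest = [PodSample("web-1", "web", 21.0),
          PodSample("pay-1", "payments", 95.0)]
result = execute(plan_actions(detect_anomalies(history, latest)))
print(result)
```

Keeping each layer as a separate function makes it easy to swap the z-score for a learned model, or to replace the dry-run executor with a real API client, without touching the other layers.
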
## Example

> An AI agent monitoring AWS EKS clusters detects high CPU usage caused by a rogue pod:
> - Pod `payment-service-v2-abc123` CPU usage: 95%
> - AI correlates the spike with the recent deployment timestamp
> - AI identifies the likely cause: a memory leak in the new version
> - AI actions:
>   1. Scale the deployment to 3 replicas (distribute load)
>   2. Create a rollback ticket
>   3. Notify the team via Slack
>   4. Roll back automatically after approval

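
The scenario above can be sketched as a small decision function with a human approval gate. Everything beyond the pod name and the 95% figure is an assumption made for illustration: the `Incident` type, the 90% CPU threshold, the one-hour correlation window, and the timestamps are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    pod: str
    cpu_percent: float
    detected_at: datetime
    last_deploy_at: datetime


def deploy_is_suspect(incident: Incident,
                      window: timedelta = timedelta(hours=1)) -> bool:
    """Correlate: suspect the deployment if the spike began within `window` after it."""
    delta = incident.detected_at - incident.last_deploy_at
    return timedelta(0) <= delta <= window


def plan(incident: Incident, approved: bool) -> list[str]:
    """Return the ordered action list; the rollback itself waits for approval."""
    steps: list[str] = []
    if incident.cpu_percent >= 90 and deploy_is_suspect(incident):
        steps += ["scale deployment to 3 replicas",
                  "create rollback ticket",
                  "notify team via Slack"]
        steps.append("roll back" if approved else "await approval for rollback")
    return steps


incident = Incident(
    pod="payment-service-v2-abc123",
    cpu_percent=95.0,
    detected_at=datetime(2026, 4, 25, 12, 30),    # made-up timestamp
    last_deploy_at=datetime(2026, 4, 25, 12, 0),  # made-up timestamp
)
print(plan(incident, approved=False))
print(plan(incident, approved=True))
```

The low-risk steps (scaling, ticketing, notification) run unconditionally once the deployment is suspected, while the rollback stays behind the `approved` flag, matching the "after approval" step in the example.
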
## Related Concepts

- [[Self-Healing Systems]]: Kubernetes is the primary platform for self-healing
- [[Cloud-Native]]: Kubernetes is at the core of cloud-native architecture
- [[Deployment Automation]]: automating Kubernetes deployments
- [[Container Lifecycle Hardening]]: container security hardening

## Related Entities

- [[Agentic AI]]: Kubernetes is an environment that Agentic AI manages
- EKS, GKE, AKS: cloud-provider-specific implementations

## Related Sources

- [[how-agentic-ai-can-help-for-cloud-devops]]
- [[ctp-topic-70-eks-deployment-using-iac]]