---
title: "Infrastructure Maintainer"
type: source
tags: [agent, infrastructure, devops]
date: 2026-04-21
---
## Source File
- [[raw/Agent/agency-agents/support/support-infrastructure-maintainer.md]]
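## Summary
- Core topic: a complete definition of the Infrastructure Maintainer agent's professional role
- Problem domain: system reliability, performance optimization, and technical operations management
- Methods/mechanisms: IaC, monitoring, automation, security hardening, disaster recovery, and cost optimization
- Conclusion/value: provides operations capability targeting 99.9%+ uptime, and makes infrastructure observable through standardized deliverables and processes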
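## Key Claims
- The Infrastructure Maintainer ensures 99.9%+ system uptime
- The IaC framework (Terraform) enables declarative, cross-platform infrastructure management
- The Prometheus monitoring configuration supports multi-tier alerting (infrastructure/application/database)
- The automated backup system provides disaster recovery through encryption and S3 storage
- Security hardening is integrated into every infrastructure change
- Cost optimization strategies deliver 20%+ annual efficiency gains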
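
To make the uptime claim in the list above concrete, the sketch below converts an availability target into the downtime it permits. This is plain arithmetic rather than anything taken from the source agent definition; the helper name `allowed_downtime_hours` and the 365-day year are assumptions made for illustration.

```python
def allowed_downtime_hours(availability_pct: float, period_hours: float = 365 * 24) -> float:
    """Return the downtime (in hours) permitted by an availability target.

    Assumes a 365-day year by default; this helper is illustrative only.
    """
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    return period_hours * (1 - availability_pct / 100)


if __name__ == "__main__":
    for target in (99.9, 99.99):
        hours = allowed_downtime_hours(target)
        print(f"{target}% uptime allows about {hours:.2f} hours of downtime per year")
```

Under these assumptions, a 99.9% target leaves roughly 8.76 hours of downtime per year, and a 99.99% target leaves roughly 53 minutes.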
## Key Quotes
> "Monitoring indicates 85% disk usage on DB server - scaling scheduled for tomorrow" (Proactive communication style)
> "Implemented redundant load balancers achieving 99.99% uptime target" (Reliability focus)
> "Auto-scaling policies reduced costs 23% while maintaining <200ms response times" (Systematic optimization)
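## Key Concepts
- [[Infrastructure as Code (IaC)]]: managing infrastructure through code for consistency and version control
- [[Prometheus Monitoring]]: a monitoring approach built on a time-series database, with multi-dimensional alerting rules
- [[Terraform]]: an infrastructure-as-code tool for declaratively configuring cloud resources across platforms
- [[Disaster Recovery]]: disaster recovery strategy, with RTO and RPO as the core metrics
- [[Security Hardening]]: the security hardening process, built on zero-trust architecture and the principle of least privilege
- [[Cost Optimization]]: cloud cost optimization strategies such as right-sizing and Reserved Instances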
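## Key Entities
- [[The Agency]]: an open-source collection of AI agents; Infrastructure Maintainer is one of its Support roles
- [[AWS]]: the cloud platform hosting the infrastructure, providing services such as VPC, RDS, and EC2
- [[Prometheus]]: an open-source monitoring and alerting tool
- [[Terraform]]: HashiCorp's infrastructure-as-code tool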
## Connections
- [[Support Infrastructure Maintainer]] ← is_a ← [[The Agency Agent]]
- [[DevOps 成熟度模型|DevOps Maturity Model]] ← relates_to ← [[Infrastructure as Code (IaC)]]
- [[ITSM(IT 服务管理)|ITSM (IT Service Management)]] ← relates_to ← [[Disaster Recovery]]
## Contradictions
- No conflicts with existing wiki content were detected
## Workflow Deliverables
### Monitoring System
- Prometheus `scrape_configs`: infrastructure (30s), application (15s), database (30s)
- Alert rules: HighCPUUsage, HighMemoryUsage, DiskSpaceLow, ServiceDown
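
As a rough illustration of the scrape layout listed above, the following sketch builds a structure shaped like the `scrape_configs` section of a `prometheus.yml`. It assumes the parenthesized values are per-job scrape intervals. The target host names and ports are hypothetical placeholders, and the helper `build_scrape_configs` is not part of the source deliverable.

```python
import json

# Intervals come from the summary above, read here as per-job scrape intervals.
# Target host:port values are hypothetical placeholders.
SCRAPE_JOBS = {
    "infrastructure": ("30s", ["node-exporter.example.internal:9100"]),
    "application": ("15s", ["app.example.internal:8080"]),
    "database": ("30s", ["postgres-exporter.example.internal:9187"]),
}


def build_scrape_configs(jobs: dict) -> list:
    """Build a list shaped like the `scrape_configs` section of prometheus.yml."""
    return [
        {
            "job_name": name,
            "scrape_interval": interval,
            "static_configs": [{"targets": targets}],
        }
        for name, (interval, targets) in jobs.items()
    ]


if __name__ == "__main__":
    config = {"scrape_configs": build_scrape_configs(SCRAPE_JOBS)}
    # Printed as JSON for inspection; a real prometheus.yml would be written as YAML.
    print(json.dumps(config, indent=2))
```

Keeping the job table in one dictionary makes it easy to see that the application tier is scraped twice as often as the infrastructure and database tiers.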
### IaC Framework
- Terraform backend: S3 + DynamoDB state locking
- VPC with private/public subnets across availability zones
- Auto Scaling Group with ELB health checks
- RDS PostgreSQL with encrypted storage and backup retention
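
The subnet layout mentioned above can be planned before any Terraform is written. The sketch below is a Python planning aid, not the Terraform configuration itself: it splits a VPC range into one public and one private subnet per availability zone using the standard `ipaddress` module. The CIDR block, zone names, `/24` subnet size, and the helper `plan_subnets` are all assumptions for illustration.

```python
import ipaddress


def plan_subnets(vpc_cidr: str, zones: list, new_prefix: int = 24) -> dict:
    """Assign one public and one private subnet per availability zone.

    Carves consecutive blocks out of the VPC range: public subnets first,
    then private ones.
    """
    vpc = ipaddress.ip_network(vpc_cidr)
    blocks = vpc.subnets(new_prefix=new_prefix)  # generator of equal-sized subnets
    plan = {"public": {}, "private": {}}
    for tier in ("public", "private"):
        for zone in zones:
            plan[tier][zone] = str(next(blocks))
    return plan


if __name__ == "__main__":
    # Hypothetical CIDR and zone names, chosen only for illustration.
    layout = plan_subnets("10.0.0.0/16", ["us-east-1a", "us-east-1b"])
    for tier, subnets in layout.items():
        for zone, cidr in subnets.items():
            print(f"{tier:<8}{zone}  {cidr}")
```

The resulting CIDR strings could then be passed to the subnet resources as input variables, so that the address plan stays in one reviewable place.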
### Backup & Recovery
- Encrypted backup script (GPG AES256)
- S3 storage using the STANDARD_IA storage class
- Retention: 30 days locally; S3 retention handled by lifecycle rules
- Backup verification and Slack notification
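
The 30-day local retention rule above can be sketched as a small selection function. This is not the source's backup script: the file names, dates, and the helper `expired_backups` are hypothetical, and the sketch only decides which local files fall outside the window, leaving deletion and the S3 lifecycle side out of scope.

```python
from datetime import date, timedelta

LOCAL_RETENTION_DAYS = 30  # from the retention policy above


def expired_backups(backups: dict, today: date, retention_days: int = LOCAL_RETENTION_DAYS) -> list:
    """Return names of local backups older than the retention window, oldest first.

    `backups` maps a backup file name to the date it was created.
    """
    cutoff = today - timedelta(days=retention_days)
    stale = [(created, name) for name, created in backups.items() if created < cutoff]
    return [name for _, name in sorted(stale)]


if __name__ == "__main__":
    # Hypothetical file names and dates, for illustration only.
    local = {
        "db-2026-03-01.sql.gpg": date(2026, 3, 1),
        "db-2026-04-01.sql.gpg": date(2026, 4, 1),
        "db-2026-04-20.sql.gpg": date(2026, 4, 20),
    }
    print(expired_backups(local, today=date(2026, 4, 21)))
```

A backup that is exactly 30 days old is kept here, since only files created strictly before the cutoff are treated as expired; a stricter policy would change the comparison to `<=`.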
## Agent Characteristics
- **Role**: System reliability, infrastructure optimization, operations specialist
- **Personality**: Proactive, systematic, reliability-focused, security-conscious
- **Success Metrics**: 99.9%+ uptime, MTTR <4 hours, 20%+ annual cost-efficiency improvement, 70%+ reduction in manual work through automation
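
As one way to check the MTTR target listed above, the sketch below averages recovery durations over a set of incidents and compares the result with the 4-hour threshold. The incident timestamps and the helper `mean_time_to_recovery` are hypothetical; the source does not say how MTTR is measured, so detection-to-resolution time is an assumption.

```python
from datetime import datetime, timedelta

MTTR_TARGET = timedelta(hours=4)  # success metric quoted above


def mean_time_to_recovery(incidents: list) -> timedelta:
    """Average the (detected, resolved) durations of a list of incidents."""
    if not incidents:
        raise ValueError("at least one incident is required")
    total = sum((resolved - detected for detected, resolved in incidents), timedelta())
    return total / len(incidents)


if __name__ == "__main__":
    # Hypothetical incident timestamps, for illustration only.
    incidents = [
        (datetime(2026, 4, 1, 9, 0), datetime(2026, 4, 1, 11, 30)),    # 2h30m
        (datetime(2026, 4, 9, 22, 15), datetime(2026, 4, 10, 1, 45)),  # 3h30m
    ]
    mttr = mean_time_to_recovery(incidents)
    print(f"MTTR: {mttr}, within target: {mttr < MTTR_TARGET}")
```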
## Advanced Capabilities
- Multi-cloud architecture design
- Container orchestration (Kubernetes)
- Zero-trust security architecture
- Compliance automation (SOC2, ISO27001)