nexus/wiki/sources/support-infrastructure-maintainer.md

---
title: "Infrastructure Maintainer"
type: source
tags: [agent, infrastructure, devops]
date: 2026-04-21
---

## Source File
- [[raw/Agent/agency-agents/support/support-infrastructure-maintainer.md]]

## Summary
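- Core topic: the complete definition of the Infrastructure Maintainer agent's professional role
- Problem domain: system reliability, performance optimization, technical operations management
- Methods/mechanisms: IaC, monitoring, automation, security hardening, disaster recovery, cost optimization
- Conclusion/value: targets 99.9%+ uptime and achieves infrastructure observability through standardized deliverables and processes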

## Key Claims
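- Infrastructure Maintainer ensures 99.9%+ system uptime
- An IaC framework (Terraform) provides declarative, cross-platform infrastructure management
- The Prometheus monitoring configuration supports multi-tier alerting (infrastructure/application/database)
- An automated backup system provides disaster recovery through encryption and S3 storage
- Security Hardening is integrated into every infrastructure change
- Cost optimization strategies deliver 20%+ annual efficiency improvements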

## Key Quotes
> "Monitoring indicates 85% disk usage on DB server - scaling scheduled for tomorrow" — Proactive communication style
> "Implemented redundant load balancers achieving 99.99% uptime target" — Reliability focus
> "Auto-scaling policies reduced costs 23% while maintaining <200ms response times" — Systematic optimization

## Key Concepts
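- [[Infrastructure as Code (IaC)]]: managing infrastructure through code for consistency and version control
- [[Prometheus Monitoring]]: monitoring built on a time-series database, with multi-dimensional alert rules
- [[Terraform]]: an infrastructure-as-code tool for declaratively configuring cloud resources across platforms
- [[Disaster Recovery]]: disaster recovery strategy, with RTO and RPO as the core metrics
- [[Security Hardening]]: the security hardening process, built on zero-trust architecture and the principle of least privilege
- [[Cost Optimization]]: cloud cost optimization strategies, including Right-Sizing and Reserved Instances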

## Key Entities
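- [[The Agency]]: an open-source collection of AI agents; Infrastructure Maintainer is one of its Support roles
- [[AWS]]: the cloud platform for the infrastructure, providing services such as VPC, RDS, and EC2
- [[Prometheus]]: an open-source monitoring and alerting tool
- [[Terraform]]: HashiCorp's infrastructure-as-code tool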

## Connections
- [[Support Infrastructure Maintainer]] ← is_a ← [[The Agency Agent]]
- [[DevOps 成熟度模型|DevOps Maturity Model]] ← relates_to ← [[Infrastructure as Code (IaC)]]
- [[ITSM（IT 服务管理）|ITSM (IT Service Management)]] ← relates_to ← [[Disaster Recovery]]

## Contradictions
- No conflicts with existing wiki content were detected

## Workflow Deliverables
### Monitoring System
- Prometheus scrape_configs: infrastructure(30s), application(15s), database(30s)
- Alert rules: HighCPUUsage, HighMemoryUsage, DiskSpaceLow, ServiceDown
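
The scrape intervals listed above can be sketched as a data structure shaped like the `scrape_configs` section of a Prometheus configuration. This is a minimal illustration, not the agent's actual configuration: the job names and intervals come from the list above, while the target addresses and the helper name `build_scrape_configs` are hypothetical.

```python
import json

# Scrape interval per job, taken from the deliverable list above.
SCRAPE_INTERVALS = {
    "infrastructure": "30s",
    "application": "15s",
    "database": "30s",
}

# Hypothetical target addresses, used only for illustration.
HYPOTHETICAL_TARGETS = {
    "infrastructure": ["node-exporter.example.internal:9100"],
    "application": ["app.example.internal:8080"],
    "database": ["postgres-exporter.example.internal:9187"],
}


def build_scrape_configs():
    """Return a list shaped like the scrape_configs section of prometheus.yml."""
    return [
        {
            "job_name": job,
            "scrape_interval": interval,
            "static_configs": [{"targets": HYPOTHETICAL_TARGETS[job]}],
        }
        for job, interval in SCRAPE_INTERVALS.items()
    ]


if __name__ == "__main__":
    # Printed as JSON for inspection; converting to YAML is left to the reader.
    print(json.dumps({"scrape_configs": build_scrape_configs()}, indent=2))
```

The application job is scraped twice as often as the other two, which matches the shorter 15s interval in the list.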

### IaC Framework
- Terraform backend: S3 + DynamoDB state locking
- VPC with private/public subnets across availability zones
- Auto Scaling Group with ELB health checks
- RDS PostgreSQL with encrypted storage and backup retention
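
As a rough sketch of the S3 + DynamoDB backend item, the snippet below emits a backend block in Terraform's JSON configuration syntax (a `.tf.json` file). The bucket, table, and region values are hypothetical placeholders, and `build_backend_config` is an illustrative helper rather than anything defined in the source file.

```python
import json


def build_backend_config(bucket, lock_table, region, key="terraform.tfstate"):
    """Build an S3 backend block with DynamoDB state locking in Terraform JSON syntax."""
    return {
        "terraform": {
            "backend": {
                "s3": {
                    "bucket": bucket,              # S3 bucket holding the state file
                    "key": key,                    # object path of the state file
                    "region": region,
                    "dynamodb_table": lock_table,  # table used for state locking
                    "encrypt": True,               # server-side encryption of state
                }
            }
        }
    }


if __name__ == "__main__":
    # All three argument values are hypothetical.
    config = build_backend_config(
        bucket="example-terraform-state",
        lock_table="example-terraform-locks",
        region="us-east-1",
    )
    print(json.dumps(config, indent=2))
```

Saved as `backend.tf.json`, the output would sit alongside the regular `.tf` files. The original source presumably uses HCL, so treat this only as an equivalent shape.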

### Backup & Recovery
- Encrypted backup script (GPG AES256)
- S3 storage with STANDARD_IA
- Retention: 30 days local, lifecycle managed in S3
- Verification and Slack notification
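
The backup steps above can be sketched as follows: build the encryption and upload commands, then pick local backups past the 30-day retention window. This is an assumption-laden outline, not the original script. File names, the bucket, and all function names are hypothetical, and the sketch only constructs the commands rather than running them.

```python
from datetime import datetime, timedelta

LOCAL_RETENTION_DAYS = 30  # local retention from the list above


def gpg_encrypt_command(src, dest, passphrase_file):
    """Command line for symmetric AES256 encryption with GnuPG."""
    return [
        "gpg", "--batch", "--yes",
        "--pinentry-mode", "loopback",
        "--symmetric", "--cipher-algo", "AES256",
        "--passphrase-file", passphrase_file,
        "--output", dest, src,
    ]


def s3_upload_command(local_path, bucket, key):
    """Command line for uploading to S3 in the STANDARD_IA storage class."""
    return [
        "aws", "s3", "cp", local_path, f"s3://{bucket}/{key}",
        "--storage-class", "STANDARD_IA",
    ]


def expired_backups(backups, now):
    """Return names of local backups strictly older than the retention window.

    `backups` maps a file name to its creation time.
    """
    cutoff = now - timedelta(days=LOCAL_RETENTION_DAYS)
    return sorted(name for name, created in backups.items() if created < cutoff)


if __name__ == "__main__":
    now = datetime(2026, 4, 21)
    backups = {
        "db-old.sql.gpg": now - timedelta(days=45),
        "db-new.sql.gpg": now - timedelta(days=2),
    }
    print(gpg_encrypt_command("db.sql", "db.sql.gpg", "/path/to/passphrase"))
    print(s3_upload_command("db.sql.gpg", "example-backups", "db/db.sql.gpg"))
    print(expired_backups(backups, now))
```

Verification and the Slack notification are omitted here; in a real script the commands would be passed to `subprocess.run` with error checking.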

## Agent Characteristics
- **Role**: System reliability, infrastructure optimization, operations specialist
- **Personality**: Proactive, systematic, reliability-focused, security-conscious
- **Success Metrics**: 99.9%+ uptime, MTTR <4 hours, 20%+ cost efficiency improvement, 70%+ reduction in manual operational tasks through automation
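
To make the uptime target concrete, the arithmetic below converts an uptime percentage into a yearly downtime budget, assuming a 365-day year. A 99.9% target allows roughly 525.6 minutes (about 8.76 hours) of downtime per year, and 99.99% allows roughly 52.56 minutes. The function names are illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a 365-day year


def downtime_budget_minutes(uptime_percent):
    """Minutes of downtime per year permitted by an uptime target."""
    return (1 - uptime_percent / 100) * MINUTES_PER_YEAR


def max_full_length_incidents(uptime_percent, mttr_hours):
    """How many incidents lasting the full MTTR fit inside the yearly budget."""
    return int(downtime_budget_minutes(uptime_percent) // (mttr_hours * 60))


if __name__ == "__main__":
    for target in (99.9, 99.99):
        budget = downtime_budget_minutes(target)
        print(f"{target}% uptime: about {budget:.2f} minutes of downtime per year")
    print(max_full_length_incidents(99.9, 4))
```

Under these assumptions, a 99.9% target leaves room for at most two incidents that each run the full 4-hour MTTR, which shows how tightly the two metrics are coupled.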

## Advanced Capabilities
- Multi-cloud architecture design
- Container orchestration (Kubernetes)
- Zero-trust security architecture
- Compliance automation (SOC2, ISO27001)