Auto-sync: 2026-04-21 00:02

2026-04-21 00:02:55 +08:00
parent 177469a1cd
commit cb7c11e14f
235 changed files with 16567 additions and 237 deletions

@@ -0,0 +1,64 @@
---
title: "SRE Weekly Issue #513"
source: "https://sreweekly.com/sre-weekly-issue-513/"
author:
- "[[lex]]"
published:
created: 2026-04-20
description:
tags:
- "clippings"
---
[View on sreweekly.com](https://sreweekly.com/sre-weekly-issue-513/)
[Organizational Second Hit Syndrome](https://www.adaptivecapacitylabs.com/2026/03/07/organizational-second-hit-syndrome/)
A previously unpublished article by the late Dr. Richard Cook!
> Organizational Second Hit Syndrome is an incident-related phenomenon analogous to neurological second-impact-syndrome (SIS). It occurs when a major incident creates a vulnerable period during which a second incident generates strong, widespread, and sometimes destructive organizational reactions.
John Allspaw and Dr. Richard I. Cook — Adaptive Capacity Labs
[Mount Mayhem at Netflix: Scaling Containers on Modern CPUs](https://netflixtechblog.com/mount-mayhem-at-netflix-scaling-containers-on-modern-cpus-f3b09b68beac)
Over 20k mounts to run 100 containers! And NUMA issues too. This one really drives home the fact that SREs need to be cognizant of all layers of the stack.
Harshad Sane and Andrew Halaney — Netflix
[Cost Is a Distributed Systems Bug](https://dzone.com/articles/cost-is-a-distributed-systems-bug)
Cost explosion is a reliability problem. I love the idea of surfacing a sudden cost increase as an alert that something is probably going wrong.
David Iyanu Jonathan — DZone
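A minimal sketch of the "cost increase as an alert" idea, not taken from the linked article: it flags the latest spend sample when it exceeds a multiple of the trailing median. The function name `cost_spike`, the `ratio` threshold, and the sample figures are all hypothetical.

```python
from statistics import median


def cost_spike(history, latest, ratio=2.0):
    """Return True when `latest` exceeds `ratio` times the median of `history`.

    `history` is a list of recent spend samples (for example hourly cost);
    `ratio` is an assumed threshold, to be tuned per system.
    """
    if not history:
        # No baseline yet, so there is nothing to compare against.
        return False
    baseline = median(history)
    return baseline > 0 and latest > ratio * baseline


# Hypothetical hourly spend figures, for illustration only.
hourly_spend = [41.0, 39.5, 40.2, 42.1, 40.8]
print(cost_spike(hourly_spend, 43.0))
print(cost_spike(hourly_spend, 130.0))
```

The median is used as the baseline here so that a single earlier outlier does not shift the threshold much; a real alert would likely also account for seasonality.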
[Autoscaling Is Not Elasticity](https://dzone.com/articles/autoscaling-is-not-elasticity-1)
> Autoscaling is reactive, not resilient. Without caps, metrics, or overrides, it can worsen failures. True elasticity requires policy, testing, and bottleneck awareness.
Raise your hand if your system has ever autoscaled itself to death. ✋
David Iyanu Jonathan — DZone
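To make the "caps and overrides" point concrete, here is a small sketch of a proportional scaling decision that is clamped to a floor and a ceiling and can be pinned by an operator. It is an illustration under assumed names and defaults (`desired_replicas`, `min_replicas`, `max_replicas`, `override`), not the article's code or any particular autoscaler's API.

```python
import math


def desired_replicas(current, utilization_pct, target_pct=60,
                     min_replicas=2, max_replicas=20, override=None):
    """Pick a replica count from observed utilization, with guard rails.

    All parameter names and defaults are hypothetical.
    """
    if override is not None:
        # A manual override lets an operator pin capacity during an incident.
        return override
    # Proportional rule: scale the current count by observed/target utilization.
    wanted = math.ceil(current * utilization_pct / target_pct)
    # The cap keeps a runaway metric from scaling the system without bound.
    return max(min_replicas, min(max_replicas, wanted))


print(desired_replicas(10, 90))
print(desired_replicas(10, 300))
print(desired_replicas(10, 300, override=5))
```

Without the `max_replicas` clamp, the second call would ask for 50 replicas, which is the kind of reaction that can overload a downstream bottleneck.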
[The On-Call Problem AI Can Actually Solve](https://www.runllm.com/blog/the-on-call-problem-ai-can-actually-solve)
> Heinrich Hartmann argues AI's most valuable role in SRE isn't autonomous remediation. It's making sure on-call engineers have the context to fix incidents fast.
Peter Farago — RunLLM
[Quick thoughts on GitHub CTO's post on availability](https://surfingcomplexity.blog/2026/03/12/quick-thoughts-on-github-ctos-post-on-availability/)
As usual, I enjoy reading Lorin's analysis of GitHub's writeup on their incidents just as much as the writeup itself, if not more. Saturation, a security mechanism causing an outage, and more.
Lorin Hochstein
[From vendors to vanguard: Airbnb's hard-won lessons in observability ownership](https://medium.com/airbnb-engineering/from-vendors-to-vanguard-airbnbs-hard-won-lessons-in-observability-ownership-3811bf6c1ac3)
Airbnb made a big move, migrating to a new observability stack. They explain how they structured the project to deliver a big win as early as possible, building buy-in.
Callum Jones — Airbnb
[5 Ways That Resilience Can't Be Automated](https://uptimelabs.io/articles/5-ways-that-resilience-cant-be-automated/)
Each one of these is like a pile of War Stories all gathered up into a tidy package that we can learn from.
Karan Nagarajagowda — Uptime Labs