Auto-sync: 2026-04-21 00:02

2026-04-21 00:02:55 +08:00
parent 177469a1cd
commit cb7c11e14f
235 changed files with 16567 additions and 237 deletions
--- a/Clippings/SRE
+++ b/Clippings/SRE
@@ -0,0 +1,64 @@
+---
+title: "SRE Weekly Issue #513"
+source: "https://sreweekly.com/sre-weekly-issue-513/"
+author:
+  - "[[lex]]"
+published:
+created: 2026-04-20
+description:
+tags:
+  - "clippings"
+---
+[View on sreweekly.com](https://sreweekly.com/sre-weekly-issue-513/)
+
+[Organizational Second Hit Syndrome](https://www.adaptivecapacitylabs.com/2026/03/07/organizational-second-hit-syndrome/)
+
+A previously unpublished article by the late Dr. Richard Cook!
+
+> Organizational Second Hit Syndrome is an incident-related phenomenon analogous to neurological second-impact-syndrome (SIS). It occurs when a major incident creates a vulnerable period during which a second incident generates strong, widespread, and sometimes destructive organizational reactions.
+
+John Allspaw and Dr. Richard I. Cook — Adaptive Capacity Labs
+
+[Mount Mayhem at Netflix: Scaling Containers on Modern CPUs](https://netflixtechblog.com/mount-mayhem-at-netflix-scaling-containers-on-modern-cpus-f3b09b68beac)
+
+Over 20k mounts to run 100 containers! And NUMA issues too. This one really drives home the fact that SREs need to be cognizant of all layers of the stack.
+
+Harshad Sane and Andrew Halaney — Netflix
+
+[Cost Is a Distributed Systems Bug](https://dzone.com/articles/cost-is-a-distributed-systems-bug)
+
+Cost explosion is a reliability problem. I love the idea of surfacing a sudden cost increase as an alert that something is probably going wrong.
+
+David Iyanu Jonathan — DZone
+
+[Autoscaling Is Not Elasticity](https://dzone.com/articles/autoscaling-is-not-elasticity-1)
+
+> Autoscaling is reactive, not resilient. Without caps, metrics, or overrides, it can worsen failures. True elasticity requires policy, testing, and bottleneck awareness.
+
+Raise your hand if your system has ever autoscaled itself to death. ✋
+
+David Iyanu Jonathan — DZone
+
+[The On-Call Problem AI Can Actually Solve](https://www.runllm.com/blog/the-on-call-problem-ai-can-actually-solve)
+
+> Heinrich Hartmann argues AI’s most valuable role in SRE isn’t autonomous remediation. It’s making sure on-call engineers have the context to fix incidents fast.
+
+Peter Farago — RunLLM
+
+[Quick thoughts on GitHub CTO’s post on availability](https://surfingcomplexity.blog/2026/03/12/quick-thoughts-on-github-ctos-post-on-availability/)
+
+As usual, I enjoy reading Lorin’s analysis of GitHub’s writeup on their incidents just as much as the writeup itself, if not more. Saturation, a security mechanism causing an outage, and more.
+
+Lorin Hochstein
+
+[From vendors to vanguard: Airbnb’s hard-won lessons in observability ownership](https://medium.com/airbnb-engineering/from-vendors-to-vanguard-airbnbs-hard-won-lessons-in-observability-ownership-3811bf6c1ac3)
+
+Airbnb made a big move, migrating to a new observability stack. They explain how they structured the project to deliver a big win as early as possible, building buy-in.
+
+Callum Jones — Airbnb
+
+[5 Ways That Resilience Can’t Be Automated](https://uptimelabs.io/articles/5-ways-that-resilience-cant-be-automated/)
+
+Each one of these is like a pile of War Stories all gathered up into a tidy package we can learn from.
+
+Karan Nagarajagowda — Uptime Labs