Auto-sync: 2026-04-21 00:02
64
Clippings/SRE Weekly Issue 513.md
Normal file
@@ -0,0 +1,64 @@
---
title: "SRE Weekly Issue #513"
source: "https://sreweekly.com/sre-weekly-issue-513/"
author:
- "[[lex]]"
published:
created: 2026-04-20
description:
tags:
- "clippings"
---
[View on sreweekly.com](https://sreweekly.com/sre-weekly-issue-513/)

[Organizational Second Hit Syndrome](https://www.adaptivecapacitylabs.com/2026/03/07/organizational-second-hit-syndrome/)

A previously unpublished article by the late Dr. Richard Cook!

> Organizational Second Hit Syndrome is an incident-related phenomenon analogous to neurological second-impact-syndrome (SIS). It occurs when a major incident creates a vulnerable period during which a second incident generates strong, widespread, and sometimes destructive organizational reactions.

John Allspaw and Dr. Richard I. Cook — Adaptive Capacity Labs

[Mount Mayhem at Netflix: Scaling Containers on Modern CPUs](https://netflixtechblog.com/mount-mayhem-at-netflix-scaling-containers-on-modern-cpus-f3b09b68beac)

Over 20k mounts to run 100 containers! And NUMA issues too. This one really drives home the fact that SREs need to be cognizant of all layers of the stack.

Harshad Sane and Andrew Halaney — Netflix

[Cost Is a Distributed Systems Bug](https://dzone.com/articles/cost-is-a-distributed-systems-bug)

Cost explosion is a reliability problem. I love the idea of surfacing sudden cost increase as an alert that something is probably going wrong.

David Iyanu Jonathan — DZone

[Autoscaling Is Not Elasticity](https://dzone.com/articles/autoscaling-is-not-elasticity-1)

> Autoscaling is reactive, not resilient. Without caps, metrics, or overrides, it can worsen failures. True elasticity requires policy, testing, and bottleneck awareness.

Raise your hand if your system has ever autoscaled itself to death. ✋

David Iyanu Jonathan — DZone

[The On-Call Problem AI Can Actually Solve](https://www.runllm.com/blog/the-on-call-problem-ai-can-actually-solve)

> Heinrich Hartmann argues AI’s most valuable role in SRE isn’t autonomous remediation. It’s making sure on-call engineers have the context to fix incidents fast.

Peter Farago — RunLLM

[Quick thoughts on GitHub CTO’s post on availability](https://surfingcomplexity.blog/2026/03/12/quick-thoughts-on-github-ctos-post-on-availability/)

As usual, I enjoy reading Lorin’s analysis of GitHub’s writeup on their incidents just as much as the writeup itself, if not more. Saturation, a security mechanism causing an outage, and more.

Lorin Hochstein

[From vendors to vanguard: Airbnb’s hard-won lessons in observability ownership](https://medium.com/airbnb-engineering/from-vendors-to-vanguard-airbnbs-hard-won-lessons-in-observability-ownership-3811bf6c1ac3)

Airbnb made a big move, migrating to a new observability stack. They explain how they structured the project to deliver a big win as early as possible, building buy-in.

Callum Jones — Airbnb

[5 Ways That Resilience Can’t Be Automated](https://uptimelabs.io/articles/5-ways-that-resilience-cant-be-automated/)

Each one of these is like a pile of War Stories all gathered up into a tidy package we can learn from.

Karan Nagarajagowda — Uptime Labs