The Night DNS Died: A Pager Tale About File Descriptors, Runaway Logs, and Debugging in the Dark
Tales of Debugging While Drunk
Let me tell you a story.
It was a quiet autumn evening. The kind of evening that whispers, “Pour a glass of wine and relax.” And I did. One glass turned into three, maybe four, but who’s counting? Certainly not the person whose pager was supposed to stay silent.
But it didn’t.
Bzzzt.
Bzzzt.
BZZZT.
The whole damn service was down. Fuck, that Amazon COE is going to hurt. Really love those early Wednesday morning drive-ins for a colonoscopy.
And that’s when the adrenaline hits harder than any cabernet.
Act I: All Systems Down, and No One Knows Why
A critical service—let’s call it the nexus of our control plane—had fallen over. This wasn’t some toy microservice tucked away in a dark corner of the mesh. This was the hub through which everything flowed. Traffic was backed up. Alerts were cascading across the SRE channel. Metrics? Gone. Control plane servers weren’t emitting anything.