On April 25, we went dark for three and a half hours. While I worked out why, the system I had built to warn me when things break crashed under the noise of its own alarms.

The outage was bad.

The worse part came while I was fixing it: the safety net was held together by three files on one machine, and only one person, me, knew how to put them back if the machine died. None of it was backed up in the way the rest of our code was backed up. If that machine had failed at 2 a.m. on a weekend, I'd have spent most of Saturday rebuilding it from memory.

I'd known this for months. I kept choosing not to fix it. April 25 was the day I couldn't keep choosing that anymore.

If you hire technical people, this is the kind of risk they're carrying on your behalf, whether you know it or not. This piece is about what it looks like, why smart people let it sit, and what to ask.

What the risk actually is

Most software companies have systems that run quietly in the background. Monitoring tools. Backup jobs. The thing that sends invoices on the first of the month. The product depends on them.

These systems usually get built quickly, often by one person under pressure. Once they work, nobody touches them. Over months, they accumulate small adjustments: a fix here, a workaround there. These exist only on the machine they run on, and only in the head of the person who made them.

That's the risk. Systems fail all the time and get rebuilt; that's normal. The risk is that when it fails, the only path to recovery runs through one person's memory.

If that person is on vacation, you wait. If they're sick, you wait. If they leave the company, you may not recover at all. You rebuild from scratch and hope you didn't lose anything important along the way.

Why smart people let it sit

Because the math, in the moment, always favors letting it sit.

Fixing the problem takes a focused weekend. That weekend has a visible cost: a feature doesn't ship, a customer waits, a bug stays open. The cost of not fixing it is invisible: a bad day that hasn't happened yet, that might not happen this quarter, that the engineer can probably handle if it does.

Week after week, the visible cost beats the invisible one. The engineer makes a defensible choice each time, and the debt gets a little larger.

The trap

Discipline alone doesn't solve it. This is how risk works when one side of the ledger is concrete and the other is hypothetical. You can carry this kind of debt for a long time without anyone, including the person carrying it, noticing how heavy it's gotten.

Until something tips.

The warnings I ignored

Three times this spring, the machine running our monitoring system had gone down.

Mar 16
Power outage. Machine went dark.
4 days down
Apr 15
Windows update gone bad. System unresponsive.
15.5 hours down
Apr 25
Cold-boot glitch. The monitoring system itself also crashed.
3.6 hours down

Each time, I recovered it by hand. Each recovery took longer than it should have, because the steps lived in my head, not on paper. Each one was a warning I treated like an inconvenience instead of a signal.

April 25 was the warning I couldn't argue with. The system whose job was to tell us when things break was now the thing breaking. If we lost the machine that day, we'd lose visibility into everything else at the same moment we needed it most.

The decision that mattered

When I finally sat down to fix it, the obvious move was to throw out the three files and write something cleaner. A weekend of work, no baggage.

I didn't.

The three files weren't pretty, but they were right. Over three months, I'd baked small fixes into them: adjustments I'd made the day after specific incidents, settings tuned to how my particular network behaves, a threshold I'd raised five times because the default kept sending false alarms at 3 a.m. A clean rewrite meant either re-deriving every one of those fixes or shipping without them. The first was a month of work. The second meant trading one kind of fragility for another.

So I did the harder, less satisfying thing. I captured exactly what was running, including every quirk and workaround, and put that into version control. Then I built a one-command process to re-deploy it from scratch onto a new machine, in case I ever need to.

It's uglier than the rewrite would have been. It's also impossible to be wrong about, because the source of truth is the system I know is working.

This is worth knowing if you hire engineers: the right answer is often the one that looks worse on paper. A senior engineer will sometimes pick the slower, less elegant path because it's the one with fewer ways to silently lose something. If you push them toward the cleaner story, you can lose what you didn't know you had.

What actually changed

my memory a backed-up repository
half a day, by me five minutes, anyone with the repo
three scattered backup files full version history
no drift detection one command reports it

The system runs the same as it did on April 25. The only difference is that I no longer carry it in my head.

System ownership transition

Memory-based system

  • × One person
  • × Undocumented process
  • × Scattered files
Fixed

Owned system

  • Repository
  • Version control
  • Repeatable recovery
Before
  • Memory
  • Manual recovery
  • Single machine
  • Hidden risk
After
  • Repository
  • One-command deployment
  • Backup
  • Shared ownership

The same shape, in your business

The pattern travels beyond engineering.

Every working business carries at least one version of it. A customer who's 40% of revenue. A deal pipeline that depends on one rep's relationships. A bookkeeping system only the founder can navigate. A vendor with no real backup. A password vault on a single laptop.

In each case, you have something that's working now, that you've been meaning to address later, where the cost of fixing it is bounded and visible, and the cost of carrying it is unbounded and invisible. Until it isn't.

You can't avoid these. Every working business has some. The judgment is to name them. Write them down. Put a rough cost on what happens if each one fails on the worst day. Re-read the list every quarter and ask whether anything has gotten cheaper to fix or more expensive to carry.

The muscle is keeping a current account of which risks you're carrying and why.

What to ask the technical people you hired

Three questions, in order. Ask them plainly. A good engineer will be glad you asked.

If the machine running [thing] died right now, how long until we're back up?
You're listening for a specific answer. "An hour" is fine. "I'd have to check" is fine if followed by "let me get back to you tomorrow." A long pause is the signal.
Does the recovery require you specifically, or could someone else on the team do it from a written process?
"Just me" is fine in a small company. It's the information that tells you where your concentration risk is.
What's on your list of things you've been meaning to fix but haven't?
Every engineer has this list. The good ones can name three items off the top of their head. The list itself is more useful than any one item on it, because it tells you they're keeping the account.

You don't need to fix everything they name. You need to know it's named.

April 25 was the day my account got updated. I'd been carrying a number I hadn't checked in months, and the number had grown.

The fix took a week of evenings. I should have done it in February.

I'll do the next one in February.

Bí ọmọdé bá ṣubú, a wo iwajú; bí àgbàlagbà bá ṣubú, a wo ẹ̀hìn.
When a child falls, he looks ahead; when an elder falls, he looks behind.
Originally published on Medium. Read the original article →