Every technology executive remembers the incident that changed how they think about reliability. Not a blip on a dashboard. Not a P2 bumped to P1 during a Monday morning review. The one that brought public attention, uncomfortable board questions and a sudden awareness that the reliability of your systems has much broader consequences than you thought.
I think about that moment often.
When reliability graduates to a board-level problem, the context shifts entirely. Talk of “error budgets,” “latency percentiles” and “partial service degradation” fades. The conversation centers on revenue loss, long-term shareholder value and whether risk was actually managed. Named customer accounts, and how their customers were impacted, matter more than any dashboard. An incident in this context has ripple effects beyond the architecture diagram. It becomes less a technical event and more a fiduciary event with technical roots. That distinction matters more than most engineers realize.
Boards don’t think in terms of uptime. Their focus is on negligence, fiduciary duty and whether the people they trusted to manage risk actually managed it. Increasingly, boards treat systems like payroll, payment rails and healthcare not as highly available infrastructure but as non-negotiable obligations. That’s not a reliability target. It’s a governance stance.
After a major disruption and public attention, many companies execute a reliability reset. I’ve done one at almost every company I’ve worked for. At Salesforce, we called it a “Trust Reset.” The work is real and it’s hard: hardening systems, building chaos engineering practices, injecting failures to expose problems before production does it for you, tightening runbooks and building more graceful degradation patterns. That work matters. It’s the foundation. But I’ve seen organizations execute technically flawless reliability programs and still lose customer trust after a major incident.
What many executives miss is that system reliability is about protecting revenue. But to galvanize action that drives meaningful change, the mission needs to be anchored to people. Not just systems. The humans on the other end of the SLA. Everything else follows from that.
The ‘chief apology officer’
I used to joke that my real title was “Chief Apology Officer.” After a major incident, I was often the one sitting across from impacted customers. These weren’t comfortable conversations, but they gave me something most senior engineering executives never get: a direct line to the people actually using the product.
Be careful what you wish for.
I was utterly unprepared for the emotional weight of hearing how our incidents affected real lives. The cold grip a 911 dispatcher feels when your system stops responding. “I just kept hitting refresh,” she told me. On the other end of that frozen screen was someone waiting for help that wasn’t coming. Or the small shoe-repair business owner in Milwaukee who missed his payroll processing cutoff. He didn’t care about “elevated latency.” He cared about paying the college student he employed part-time.
“I used to babysit that kid,” he told me.
These are the stories that stay with me. They change the stakes entirely. They make you realize that behind every severity-one incident, there are people doing real work and making real decisions about real lives. We owe them more than a status page update.
The trust rebuild journey
What made the difference for us was treating postmortems not as retrospective artifacts, but as customer contracts. Before the shift, some remediation items died quietly in quarterly planning cycles. After it, every action item was tied to a named sponsor and a delivery date with executive-visible reporting.
The result wasn’t just cleaner postmortems. It was a measurable reduction in repeat incidents and easier renewal conversations with previously impacted accounts.
I remember a specific conversation six weeks after a major outage. The customer’s VP of IT had our postmortem open on her laptop. She pointed to an action item and asked, “Did you actually do this?” We had. She nodded slowly, closed the laptop and for the first time in the meeting, she relaxed. That moment changed how I viewed those documents forever. They weren’t internal artifacts. They were promises. To our customers. To their businesses. To their customers.
We began sharing postmortems with impacted customers and attending renewal meetings to speak to the outcomes directly. We called these “trust rebuild journeys.” The customer didn’t care about our blameless culture. They cared that we kept our word.
This led to restraint as much as execution. Not every action item could be pursued. Resources are finite. But the ones that mattered were clearly owned, tracked and resolved. That visibility rebuilt the trust our customers had in us.
The unified command
As organizations scale, ownership becomes distributed by design. Decision-making that moves through hierarchies at a deliberate pace works in peacetime. But during a major service disruption (wartime), it can create hesitation, competing narratives or a quiet assumption that someone else will make the important call.
I call this decision latency. The gap between understanding the problem and acting on it.
In one major incident, it took a few minutes to roll back a change and restore stability. The rollback was automated but the decision wasn’t. The costliest moment was the 37 minutes spent debating whether to roll back or fix-forward. That delay extended customer impact far more than the underlying defect. To an engineer, 37 minutes can pass quickly. To a 911 dispatcher with a frozen screen, it doesn’t.
So we introduced the game clock. When a severity-one incident was declared, a clock started. Not from when the bridge opened, but from the moment impact began. Every major decision had a predefined time limit. If the decision wasn’t made in time, escalation was automatic. Decision latency shortened immediately.
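To make the mechanism concrete, here is a minimal sketch of how a game clock might be wired up. The decision names, deadline values and escalation chain are illustrative assumptions, not the actual playbook we ran:

```python
from dataclasses import dataclass, field
import time

# Illustrative decision deadlines in minutes, measured from the moment
# customer impact began, not from when the incident bridge opened.
# Names and values are assumptions for the sketch.
DECISION_DEADLINES_MINUTES = {
    "declare_severity": 5,
    "rollback_or_fix_forward": 15,
}

# Hypothetical escalation chain: who gets paged when a decision stalls.
ESCALATION_CHAIN = ("incident_commander", "vp_engineering", "cto")


@dataclass
class GameClock:
    impact_started_at: float                    # epoch seconds when impact began
    decided: set[str] = field(default_factory=set)

    def minutes_since_impact(self) -> float:
        return (time.time() - self.impact_started_at) / 60.0

    def record_decision(self, decision: str) -> None:
        self.decided.add(decision)

    def overdue_decisions(self) -> list[str]:
        """Decisions whose time limit expired before a call was made."""
        elapsed = self.minutes_since_impact()
        return [
            name for name, limit in DECISION_DEADLINES_MINUTES.items()
            if name not in self.decided and elapsed > limit
        ]

    def escalate(self, decision: str, level: int = 0) -> None:
        # A real implementation would page the next person in the chain.
        target = ESCALATION_CHAIN[min(level, len(ESCALATION_CHAIN) - 1)]
        print(f"ESCALATION: '{decision}' is overdue, paging {target}")
```

The mechanics are trivial. The value is that the time limit and the escalation path exist before the incident does, so nobody has to improvise authority under pressure.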
But we had to be careful. A game clock can easily create a bigger problem. In complex distributed systems, a hasty rollback can be more catastrophic than the original issue. So we paired the game clock with Unified Command: a temporary power structure where one person holds explicit authority, everyone else supports them and the mandate expires when the incident closes. CEO-level command. Real authority.
We codified the whole sequence in a playbook: all impacted customers notified within 10 minutes of impact, customer-facing teams alerted by SMS, email and automated phone calls five minutes before that. Clarity created buy-in. Execution consistency created credibility.
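As a sketch of what that codification might look like (the 10-minute and five-minute marks come from the sequence above; the step structure and channel names are my assumptions):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlaybookStep:
    deadline_minutes: int       # must complete within N minutes of impact
    audience: str
    channels: tuple[str, ...]


# Sev-1 notification sequence: customer-facing teams get a five-minute
# head start before all impacted customers are notified at the ten-minute mark.
SEV1_PLAYBOOK = (
    PlaybookStep(5, "customer_facing_teams", ("sms", "email", "automated_call")),
    PlaybookStep(10, "impacted_customers", ("status_page", "email")),
)


def overdue_steps(minutes_since_impact: float) -> list[PlaybookStep]:
    """Steps past their deadline; an incident runner would alert on these."""
    return [s for s in SEV1_PLAYBOOK if minutes_since_impact > s.deadline_minutes]
```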
When we combined explicit command authority with a predefined and automated playbook, incident response became less of an ‘emergency’. It became routine. The rules of engagement were clear. Executives knew when to lean in and when to trust the person running the incident.
As we strengthened our follow-through from postmortem to engineering solution, failovers happened earlier and eventually became automated. They evolved into continuous site switches across hundreds of global data centers and public cloud regions. That turned a reactive recovery step into a proactive way to identify hidden dependencies, before they caused customer impact.
At a different company, I embedded a large team of reliability engineers horizontally across 213 software scrum teams. A simple truth emerged during that time. The complex task of reliability engineering can go unseen, yet it’s the only engineering function whose failures surface directly as revenue risk. I started to speak the language of revenue.
The language of revenue
We were at a downtown restaurant in San Francisco, sipping old-fashioneds in the quiet hush of the before-dinner crowd, huddled over a 13-inch laptop looking at slides. That’s when a VP of Sales at Datadog told me, “No one else talks to us about revenue impacted by incidents or wants to track it the way you do.”
Many CTOs and CIOs still frame incidents in terms of severity and blast radius. Those are engineering constructs. They describe the shape of the problem without describing its cost. CFOs think about net revenue retention, churn risk, customer lifetime value and expansion revenue. If we want reliability to be treated as a strategic investment rather than an insurance policy, we have to speak that language.
The reframe is straightforward. When you present an outage as “we had a P1 for 90 minutes affecting 12% of customers,” you’re describing a technical event. When you present it as “we put $4.2 million in annual recurring revenue at risk across 37 named accounts, three of which are in active renewal cycles,” you’re describing a business problem that demands a business response.
That reframe changes everything. It changes the resource allocation. It changes the urgency of remediation. It changes whether reliability earns a seat at the strategy table or remains a line item in the infrastructure budget.
We put in place a simple rule: every severity-one incident required a revenue impact assessment within 24 hours. It included the named accounts affected, ARR at risk and renewal stage exposure. We treated reliability not just as a business response to a crisis, but as a long-term capital improvement project. Once revenue entered the conversation, it didn’t just justify the emergency fixes. It justified funding the deep, often invisible work that keeps everyone online.
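A back-of-the-envelope version of that assessment is easy to automate. Here is a minimal sketch; the account names, figures and renewal-stage labels are invented for illustration, and in practice the inputs would come from your CRM:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImpactedAccount:
    name: str
    arr_usd: int            # annual recurring revenue, in dollars
    renewal_stage: str      # e.g. "active_renewal", "mid_term"


def revenue_impact_assessment(accounts: list[ImpactedAccount]) -> dict:
    """Summarize a severity-one incident in the CFO's terms."""
    return {
        "named_accounts_affected": len(accounts),
        "arr_at_risk_usd": sum(a.arr_usd for a in accounts),
        "accounts_in_active_renewal": sum(
            1 for a in accounts if a.renewal_stage == "active_renewal"
        ),
    }


# Invented example in the shape of the reframe above.
accounts = [
    ImpactedAccount("acme_logistics", 850_000, "active_renewal"),
    ImpactedAccount("northside_health", 1_200_000, "mid_term"),
    ImpactedAccount("harbor_payroll", 400_000, "active_renewal"),
]
print(revenue_impact_assessment(accounts))
# {'named_accounts_affected': 3, 'arr_at_risk_usd': 2450000,
#  'accounts_in_active_renewal': 2}
```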
The ripple effects, the future and the window of opportunity
Speaking the language of revenue gave systems reliability a seat at the boardroom table. But some of the stories in this article belong to a specific era.
And I can see that era ending.
Building systems that can withstand real-world failure is genuinely hard. It requires judgment as much as engineering. Chaos engineering isn’t a buzzword. It takes deep engineering strength to inject failure into systems that process payroll, payments and healthcare transactions, and years of testing for hidden dependencies before you have the confidence to do it in production. The teams that built these systems didn’t just keep the lights on. They engineered resilience into massive systems that couldn’t afford to be wrong.
That’s what makes the current moment so charged.
A new generation of well-funded startups is building autonomous agents that detect and remediate failure patterns without human intervention. Operational playbooks that took years to codify have become training data. But the hardest reliability problems were the cascading failures across service boundaries that nobody thought about or modeled, the strange behavior under load that didn’t exist in staging and the data corruption that only surfaced three days after a deployment. Those are judgment problems as much as engineering problems. And we’re already hollowing out the teams that can solve them.
I wrote in a previous piece about how AI cost savings create tomorrow’s problems when the humans who understood the systems are no longer there to catch what automation misses. That window of opportunity to plan for our people is still open. But it’s narrowing. And the broader economic engine I explored in The Genesis Gamble, where technology investment needs to build genuine productive capacity, applies here too. Reliability infrastructure isn’t exempt from that test.
So this is where the tension lives. The work described in this article (the game clocks, the unified command, the trust rebuild journeys) was built by people, for people. The next chapter may look very different. But the obligation doesn’t change.
Every technology executive remembers the incident that changed how they think about reliability. What actually changed for me wasn’t the tooling or the processes, though those mattered. What changed was the answer to the board’s real question. They never asked whether our systems were stable. They asked whether we understood what we were protecting.
The 911 dispatcher, the shoe-repair shop owner in Milwaukee, the college student waiting on his paycheck: they are not edge cases in our SLA calculations.
They are the reason the SLA exists.