What actually changes when reliability becomes a board-level problem

Every technology executive remembers the incident that changed how they think about reliability. Not a blip on a dashboard. Not a P2 bumped to P1 during a Monday morning review. The one that brought public attention, uncomfortable board questions and a sudden awareness that the reliability of your systems has much broader consequences than you thought.  

I think about that moment often. 

When reliability graduates to a board-level problem, the context shifts entirely. The importance of “error budgets,” “latency percentiles” and “partial service degradation” fades. Revenue loss becomes the focus. Protecting long-term shareholder value is where the conversation centers. Risk management itself is called into question. Named customer accounts, and how their customers were impacted, become more important. Incidents in this context have ripple effects beyond the architecture diagram: each one becomes less a technical event and more a fiduciary event with technical roots. That distinction matters more than most engineers realize.

Boards don’t think in terms of uptime. Their focus is on negligence, fiduciary duty and whether the people they trusted to manage risk actually managed it. Increasingly, boards treat systems like payroll, payment rails and healthcare not as highly available infrastructure but as non-negotiable obligations. That’s not a reliability target. It’s a governance stance. 

After a major disruption and public attention, many companies execute a reliability reset. I’ve done one at almost every company I’ve worked for. At Salesforce, we called it a “Trust Reset.” The work is real and it’s hard: hardening systems, building chaos engineering practices, injecting failures to expose problems before production does it for you, tightening runbooks and building more graceful degradation patterns. That work matters. It’s the foundation. But I’ve seen organizations execute technically flawless reliability programs and still lose customer trust after a major incident. 

What many executives miss is that system reliability is about protecting revenue. But to galvanize action that drives meaningful change, the mission needs to be anchored to people. Not just systems. The humans on the other end of the SLA. Everything else follows from that. 

The ‘chief apology officer’

I used to joke that my real title was “Chief Apology Officer.” After a major incident, I was often the one sitting across from impacted customers. These weren’t comfortable conversations, but they gave me something most senior engineering executives never get: a direct line to the people actually using the product.

Be careful what you wish for. 

I was utterly unprepared for the emotional weight of hearing how our incidents affected real lives. The cold grip a 911 dispatcher feels when your system stops responding. “I just kept hitting refresh,” she told me. On the other end of that frozen screen was someone waiting for help that wasn’t coming. Or the small shoe-repair business owner in Milwaukee who missed his payroll processing cutoff. He didn’t care about “elevated latency.” He cared about paying the college student he employed part-time.

“I used to babysit that kid,” he told me.

These are the stories that stay with me. They change the stakes entirely. They make you realize that behind every severity one incident, there are people doing real work and making real decisions about real lives. We owe them more than a status page update. 

The trust rebuild journey 

What made the difference for us was treating postmortems not as retrospective artifacts, but as customer contracts. Before the shift, some remediation items died quietly in quarterly planning cycles. After it, every action item was tied to a named sponsor and a delivery date with executive-visible reporting. 
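To make that concrete, here is a minimal sketch of what a tracked remediation item can look like. The field names, statuses and the overdue check below are illustrative assumptions, not a description of the tooling we actually used.

    from dataclasses import dataclass
    from datetime import date

    @dataclass
    class PostmortemActionItem:
        """One remediation commitment from a postmortem, treated as a customer contract."""
        description: str
        sponsor: str              # named executive sponsor, not a team alias
        due: date                 # committed delivery date shared with impacted customers
        impacted_accounts: list   # account names tied to this commitment
        status: str = "open"      # open | done | accepted-risk (explicitly closed, never silently dropped)

        def is_overdue(self, today: date) -> bool:
            return self.status == "open" and today > self.due

    # Executive-visible reporting is just a filter over the open items.
    def overdue_items(items, today):
        return [item for item in items if item.is_overdue(today)]

The point is less the code than the shape: every item has a name attached, a date attached and a place where it shows up if it slips.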

The result wasn’t just cleaner postmortems. It was a measurable reduction in repeat incidents and easier renewal conversations with previously impacted accounts. 

I remember a specific conversation six weeks after a major outage. The customer’s VP of IT had our postmortem open on her laptop. She pointed to an action item and asked, “Did you actually do this?” We had. She nodded slowly, closed the laptop and for the first time in the meeting, she relaxed. That moment changed how I viewed those documents forever. They weren’t internal artifacts. They were promises. To our customers. To their businesses. To their customers. 

We began sharing postmortems with impacted customers and attending renewal meetings to speak to the outcomes directly. We called these “trust rebuild journeys.” The customer didn’t care about our blameless culture. They cared that we kept our word. 

This led to restraint as much as execution. Not every action item could be pursued. Resources are finite. But the ones that mattered were clearly owned, tracked and resolved. That visibility rebuilt the trust our customers had in us.  

The unified command 

As organizations scale, ownership becomes distributed by design. Decision-making that moves through hierarchies at a deliberate pace works during peacetime. But during a major service disruption (wartime), it can create hesitation, competing narratives or a quiet assumption that someone else will make an important call.

I call this decision latency. The gap between understanding the problem and acting on it. 

In one major incident, it took a few minutes to roll back a change and restore stability. The rollback was automated but the decision wasn’t. The costliest moment was the 37 minutes spent debating whether to roll back or fix-forward. That delay extended customer impact far more than the underlying defect. To an engineer, 37 minutes can pass quickly. To a 911 dispatcher with a frozen screen, it doesn’t. 

So we introduced the game clock. When a severity-one incident was declared, a clock started. Not from when the bridge opened, but from the moment impact began. Every major decision had a predefined time limit. If a decision wasn’t made within its limit, escalation was automatic. Decision latency shortened immediately.
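Mechanically, a game clock is little more than a set of deadlines measured from the moment impact began, with escalation wired in as the default. A rough sketch; the class name, time limits and escalation callback are illustrative assumptions, not our actual thresholds or tooling:

    from datetime import datetime, timedelta

    class GameClock:
        """Tracks deadlines from first customer impact (not from when the bridge opened)
        and escalates any decision still open past its predefined limit."""

        def __init__(self, impact_start, escalate):
            self.impact_start = impact_start   # moment impact began
            self.escalate = escalate           # callback, e.g. page the next level of command
            self.deadlines = {}                # decision name -> absolute deadline

        def open_decision(self, name, limit):
            # Deadlines count from when impact began, not from when someone raised the question.
            self.deadlines[name] = self.impact_start + limit

        def tick(self, now):
            for name, deadline in list(self.deadlines.items()):
                if now > deadline:
                    self.escalate(name)        # escalation is automatic, not optional
                    del self.deadlines[name]

    # Illustrative limit only: the rollback-vs-fix-forward call gets 15 minutes from impact.
    clock = GameClock(datetime(2025, 1, 1, 3, 0), escalate=print)
    clock.open_decision("rollback_or_fix_forward", timedelta(minutes=15))
    clock.tick(datetime(2025, 1, 1, 3, 20))    # past the limit, so the decision escalates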

But we had to be careful. A game clock can easily create a bigger problem. In complex distributed systems, a hasty rollback can be more catastrophic than the original issue. So we paired the game clock with Unified Command: a temporary power structure where one person holds explicit authority, everyone else supports them and the mandate expires when the incident closes. CEO-level command. Real authority.

We codified the whole sequence in a playbook: all impacted customers notified within 10 minutes of impact, customer-facing teams alerted by SMS, email and automated phone calls five minutes before that. Clarity created buy-in. Consistent execution created credibility.
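A playbook like that is easier to execute consistently when it lives as data rather than prose. A minimal sketch using the timings above; the audience and channel names are hypothetical:

    from datetime import timedelta

    # Notification steps measured from the moment impact began.
    SEV1_PLAYBOOK = [
        {"at": timedelta(minutes=5),  "audience": "customer_facing_teams",
         "channels": ["sms", "email", "automated_call"]},
        {"at": timedelta(minutes=10), "audience": "all_impacted_customers",
         "channels": ["status_page", "email"]},
    ]

    def due_steps(elapsed):
        # Return every step whose deadline has been reached since impact began.
        return [step for step in SEV1_PLAYBOOK if elapsed >= step["at"]]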

When we combined explicit command authority with a predefined and automated playbook, incident response became less of an ‘emergency’. It became routine. The rules of engagement were clear. Executives knew when to lean in and when to trust the person running the incident.  

As we strengthened our follow-through from postmortem to engineering solution, failovers happened earlier and eventually became automated. They evolved into continuous site switches across hundreds of global data centers and public cloud regions. That turned a reactive recovery step into a proactive way to identify hidden dependencies, before they caused customer impact.  

At a different company, I embedded a large team of reliability engineers horizontally across 213 software scrum teams. A simple truth emerged during that time: the complex task of reliability engineering can go unseen, yet it’s the only engineering function whose failures surface directly as revenue risk. I started to speak the language of revenue.

The language of revenue 

We were at a downtown restaurant in San Francisco, sipping old-fashioneds in the quiet hush of the before-dinner crowd, huddled over a 13-inch laptop looking at slides. That’s when a VP of Sales at Datadog told me, “No one else talks to us about revenue impacted by incidents or wants to track it the way you do.”

Many CTOs and CIOs still frame incidents in terms of severity and blast radius. Those are engineering constructs. They describe the shape of the problem without describing its cost. CFOs think about net revenue retention, churn risk, customer lifetime value and expansion revenue. If we want reliability to be treated as a strategic investment rather than an insurance policy, we have to speak that language. 

The reframe is straightforward. When you present an outage as “we had a P1 for 90 minutes affecting 12% of customers,” you’re describing a technical event. When you present it as “we put $4.2 million in annual recurring revenue at risk across 37 named accounts, three of which are in active renewal cycles,” you’re describing a business problem that demands a business response. 

That reframe changes everything. It changes the resource allocation. It changes the urgency of remediation. It changes whether reliability earns a seat at the strategy table or remains a line item in the infrastructure budget. 

We put in place a simple rule: every severity-one incident required a revenue impact assessment within 24 hours. It included named accounts affected, ARR at risk and renewal stage exposure. We treated reliability not just as a business response to a crisis, but as a long-term capital improvement project. Once revenue entered the conversation, it didn’t just justify the emergency fixes. It justified funding the deep, often invisible work that keeps everyone online.
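The assessment itself is simple arithmetic once the affected accounts are identified. A minimal sketch, assuming a hypothetical account record with ARR and renewal stage; the names and figures in the example are invented:

    from dataclasses import dataclass

    @dataclass
    class AffectedAccount:
        name: str
        arr: float            # annual recurring revenue, in dollars
        renewal_stage: str    # e.g. "active", "upcoming", "closed"

    def revenue_impact_assessment(accounts):
        """Summarize a severity-one incident in business terms."""
        return {
            "named_accounts": [a.name for a in accounts],
            "arr_at_risk": sum(a.arr for a in accounts),
            "in_active_renewal": [a.name for a in accounts if a.renewal_stage == "active"],
        }

    # Illustrative figures only.
    affected = [
        AffectedAccount("Acme Health", 900_000, "active"),
        AffectedAccount("Metro Payroll", 450_000, "upcoming"),
    ]
    print(revenue_impact_assessment(affected))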

The ripple effects, the future and the window of opportunity 

Speaking the language of revenue gave systems reliability a seat at the boardroom table. But some of the stories in this article belong to a specific era. 

And I can see that era ending. 

Building systems that can withstand real-world failure is genuinely hard. It requires judgment as much as engineering. Chaos engineering isn’t a buzzword. It takes deep engineering strength to inject failure into systems that process payroll, payments and healthcare transactions, and years of testing for hidden dependencies, to have the confidence to do it in production. The teams that built these systems didn’t just keep the lights on. They engineered resilience into massive systems that couldn’t afford to be wrong.

That’s what makes the current moment so charged.  

A new generation of well-funded startups is building autonomous agents that detect and remediate failure patterns without human intervention. Operational playbooks that took years to codify have become training data. But the hardest reliability problems were the cascading failures across service boundaries that nobody thought about or modeled, the strange behavior under load that didn’t exist in staging and the data corruption that only surfaced three days after a deployment. Those are judgment problems as much as engineering problems. And we’re already hollowing out the teams that can solve them.

I wrote in a previous piece about how AI cost savings create tomorrow’s problems when the humans who understood the systems are no longer there to catch what automation misses. That window of opportunity to plan for our people is still open. But it’s narrowing. And the broader economic engine I explored in The Genesis Gamble, where technology investment needs to build genuine productive capacity, applies here too. Reliability infrastructure isn’t exempt from that test. 

So this is where the tension lives. The work described in this article (the game clocks, unified command, the trust rebuild journeys) was built by people, for people. The next chapter may look very different. But the obligation doesn’t change.

Every technology executive remembers the incident that changed how they think about reliability. What actually changed for me wasn’t the tooling or the processes, though those mattered. What changed was the answer to the board’s real question. They never asked whether our systems were stable. They asked whether we understood what we were protecting. 

The 911 dispatcher, the homebuyer, the shoe-repair shop owner in Milwaukee: they are not edge cases in our SLA calculations.

They are the reason the SLA exists. 

This article is published as part of the Foundry Expert Contributor Network.