Skip to content
Tiatra, LLCTiatra, LLC
Tiatra, LLC
Information Technology Solutions for Washington, DC Government Agencies
  • Home
  • About Us
  • Services
    • IT Engineering and Support
    • Software Development
    • Information Assurance and Testing
    • Project and Program Management
  • Clients & Partners
  • Careers
  • News
  • Contact
 
  • Home
  • About Us
  • Services
    • IT Engineering and Support
    • Software Development
    • Information Assurance and Testing
    • Project and Program Management
  • Clients & Partners
  • Careers
  • News
  • Contact

Your cloud provider is a single point of failure

The morning of Monday, Oct 20, 2025, I went to my healthcare provider’s portal to pay a bill. This was my experience:

Screenshot of healthcare portal showing internal server error

Jim Wilt

Upon calling my provider to pay over the phone, they were unable to take my payment as their internal systems were also down, leaving us customers hanging with no direction on how to proceed.

My healthcare provider’s SaaS was completely functional; however, their integrated payment vendor, which is reliant on AWS infrastructure, apparently has ineffective redundancies. So, the 10/20/2025 AWS outage resulted in a most unfortunate experience for any customer or employee hoping to utilize this important capability while hindering my healthcare organization from receiving revenue.

Who is to blame? AWS? The payment vendor? Ultimately, my healthcare provider is responsible for their customers’ (and employees’) inability to interact with their services. A cloud outage is not in the same acts-of-nature class as hurricanes, earthquakes, tornadoes, etc., but we do treat them as such and that is simply wrong because these outages can be mitigated.

This is a clear and far too common industry-wide epidemic: poor adoption and execution of cloud computing resilience, resulting in unreliable critical services to both customers and employees.

As reported directly by AWS’ Summary of the Amazon DynamoDB Service Disruption in Northern Virginia (US-EAST-1) Region, a latent race condition in DynamoDB’s DNS management system led to an empty DNS record for the US-EAST regional endpoint, causing resolution failures affecting both customer and internal AWS service connections. This adversely affected the following services: Lambda, ECS, EKS, Fargate, Amazon Connect, STS, IAM Console Sign-In and Redshift.

On 10/29/2024, Microsoft 365 (m365.cloud.microsoft or portal.office.com) experienced an outage due to the rollout of an impacting code change. This affected Microsoft 365 admin center, Entra, Purview, Defender, Power Apps, Intune and add-ins & network connectivity in Outlook. This is all documented by Microsoft in Users may have experienced issues when accessing m365.cloud.microsoft or portal.office.com.

Both of these recent outages required vendors to halt automated processes and manually navigate recovery to bring affected systems back to an operational state. Let’s face it: Cloud providers are not magical and are subject to the same recovery patterns as any enterprise.

Outages are a reality of any system or platform and affect literally every organization. Hence:

Your cloud provider is a single point of failure!

Corporate infrastructure strategies vary from total dependence on provider vendors to actively taking ownership and architecting necessary redundancies for critical systems. When underlying provider outages occur, it is often a catalyst to revisit enterprise resilience strategy, even if you are not directly affected.

When examining an enterprise’s fault-tolerant architectures (which rarely even exist), it may be a good time to instead consider fault avoidance architectures. The latter kicks in when bad happens and the former actively monitors triggers to avoid bad.

This type of introspective examination is too often overlooked, as it is far easier for enterprises to fall into believing the many myths that govern their IT strategy and operations, especially when it comes to Cloud.

Unpacking the myths

Myth #1 – A single cloud provider reduces complexity

Vendors will place every kind of study and incentive in front of enterprise leadership to back the fallacy that locking into their platform is in the best interests of their company. Let’s be clear: It is always in the best interests of the vendor. This concept is then passed down from leadership to engineers who are encouraged to believe what their leadership tells them, and we get into a situation where thousands upon thousands of companies are under the control of a single vendor. Scary, right?

When it comes to multi-zonal resilience, app-cross-region resilience, blast-radius reduction, and resilient app patterns, there is additional complexity. Knowing that these approaches have dependencies on complex fine-tuned cloud infrastructures means there is no easy button.

Myth #2 – Cloud platform component defaults are generally a good starting point

Relying on easy-button best practices is what gets enterprises into trouble. The responsibility of an IT cloud infrastructure team is to work with solution architects and engineers to fine-tune their designs to optimize efficiencies, resilience and performance while controlling costs. Cloud vendor default configurations are necessary as they set a functional starting point, but they should never be trusted as a sound design. In fact, they can produce unnecessarily large loads on default regions when left unchecked. The AWS US-East-1 region is historically the most affected region when it comes to outages, and yet so many critical enterprise systems run exclusively in that region.

Vendor plug-and-play architectures must be scrutinized before going into production.

A responsible architecture governance practice should have a policy to avoid known outage-prone regions and single-point-of-failure configurations. These should be vetted in the architecture review board before ever going to production.

Myth #3 – My cloud provider/vendor will take care of me

Service level agreements (SLA) are paid out service credits tied to the cost of the affected service, not cash refunds. They generally start at 10% of service charges, never resulting losses. Your enterprise will literally get pennies back on dollars lost.

The July 2024 CloudStrike outage cost CloudStrike around $75M + $60M they paid out in service credits. This pales an order of magnitude when compared to just one customer, Delta Airlines, which lost $500M net. Parametrix Insurance’s detailed analysis estimates the total direct financial loss facing the US Fortune 500 companies is $5.4B. CloudStrike literally paid pennies on the dollar for their error, so an enterprise’s reliance on a vendor must be managed knowing this reality.

The 11/18/2025 Cloudflare outage, with its 20% hold on global web traffic, equally affected hundreds of millions of accounts, including major systems like X (Twitter), OpenAI/ChatGPT, Google’s Gemini, Perplexity AI, Spotify, Canva and even all three cloud providers. This heightens how a single vendor/platform dependency is a real threat to business continuity.

Enterprises must protect themselves because their vendors won’t.

Future purchase and contract negotiations should pivot toward SLA penalties that are based on enterprise losses over enterprise service costs. Unfortunately, this will drive service costs higher, but it builds in better financial protection when reliant on systems outside of your control.

Myth #4 – Multi-cloud is too expensive and too demanding

To mitigate the impact of regional cloud outages, enterprises that adopt multi-cloud architectures that prioritize resilience, portability, and failover orchestration find additional benefits when they are implemented with a mindset of fault avoidance and cost/performance optimization. This means multiple triggers govern where workloads run, resulting in optimal efficiencies.

This needs to be a priority effort backed by the C-Suite and requires a culture shift to succeed; hence, multi-cloud deployments are exceedingly rare. Still, those who have done this reap benefits well beyond resilience (e.g., large orgs like Walmart, Goldman Sachs, General Electric and BMW as well as SMBs like FirstDigital, Visma and Assorted Data Protection).

The NIST Cloud Federation Reference Architecture (NIST Special Publication 500-332) is a great document to establish a baseline grounding of these concepts.

  • Active-active resilience is a pattern for mission-critical apps (e.g., financial trading, healthcare, e-commerce checkout). It maximizes resilience and availability, but at a higher cost due to duplicated infrastructure and complex synchronization. This pattern lends itself best toward fault avoidance with all the goodness of proactive efficiency and optimization triggers.
  • Active-passive failover is a pattern where a primary cloud handles all traffic, and a secondary cloud is on standby. It provides disaster recovery without the full cost of active-active, but will introduce some downtime and requires a robust replication strategy. It clearly is only a fault-tolerant approach.
  • Cloud bursting is a pattern where applications run primarily in one cloud but “burst” into another during demand spikes, providing elastic scalability without over-provisioning. It can also provide a good degree of fault tolerance.
  • Workload partitioning (best-of-breed placement) is a pattern where different workloads are assigned to the cloud provider best suited for them. It greatly optimizes performance, compliance, and cost by leveraging provider strengths, but will not be fully fault-tolerant.

Myth #5 – Cloud has failed. It’s time to get out

This is a recurring theme each time there is a major cloud outage, often tied equally to a cost comparison between on-premises vs. cloud (yes, cloud almost always costs more). The reality is that while there is true value to cloud in the overall infrastructure strategy, there also is value in prioritizing an investment in infrastructure choices, leveraging sensible hybrid strategies. Two effective strategic architectures are based on edge and Kubernetes. Edge reduces blast radius, while Kubernetes provides portable resilience across providers. Both are recommended when aligned with workload architecture and operational maturity.

  • Edge-integrated resilience extends workloads to the edge while maintaining synchronization with central clouds. Local edge nodes can continue operations even if cloud connectivity is disrupted, then reconcile state once reconnected. In addition to adding a moderate level of resiliency, it also benefits from ultra-low latency for real-time processing (e.g., IoT, manufacturing robotics, autonomous vehicles). This approach is often found in factory, retail store, and branch office use cases.
  • Kubernetes-orchestrated resilience is a cloud-agnostic orchestration layer that can be leveraged locally and across multiple providers. In addition to a prominent level of resilience, this type of service mesh (e.g., Istio, Linkerd) adds traffic routing and failover capabilities that reduce vendor lock-in. Overall, it is a foundational enabler for multi-cloud, giving enterprises a consistent control plane across providers and on-premises.

Calls to action

There are two major enterprise IT leadership bias camps: Build and Buy. Both play a factor in every enterprise.

The reference architecture patterns shared above address Build bias workloads, which include integrations with Buy workloads.

Buy bias workloads are too often subject to vendor-defined SLAs discussed above, which are terribly limiting to 10-100% credits for charges based on the duration of an outage as penalties. Realistically, that really is not going to change; however, SaaS quality over the past 20 years has increased substantially:

Chart: SaaS Quality Dimensions, 2005-2025

Jim Wilt

This becomes the new bar and offers a great measure an enterprise can leverage for both themself and their vendors:

The 1-9 Challenge: Every SaaS vendor, integrator and internal enterprise solution should provide one “9” better than their underlying individual hosting platforms alone.

For example, when each cloud vendor provides a 99.9% SLA for a given service, leveraging an active-active multi-cloud architecture raises that SLA well beyond 99.99%.

Take control of your critical services first and leverage these patterns as a baseline for net-new initiatives moving forward, making high resilience your new norm.

Bottom line: the enterprise is always responsible for its own resiliency. It’s time to own this and take control!

This article was made possible by our partnership with the IASA Chief Architect Forum. The CAF’s purpose is to test, challenge and support the art and science of Business Technology Architecture and its evolution over time as well as grow the influence and leadership of chief architects both inside and outside the profession. The CAF is a leadership community of the IASA, the leading non-profit professional association for business technology architects.

This article is published as part of the Foundry Expert Contributor Network.
Want to join?


Read More from This Article: Your cloud provider is a single point of failure
Source: News

Category: NewsDecember 10, 2025
Tags: art

Post navigation

PreviousPrevious post:生成AIの熱狂が直面する「物理的な壁」――サーバー室の外側で起きている電力・冷却・サプライチェーンの地殻変動NextNext post:Digitalización e IA, el cambio estructural que está redefiniendo la automoción

Related posts

샤오미, MIT 라이선스 ‘미모 V2.5’ 공개···장시간 실행 AI 에이전트 시장 겨냥
April 29, 2026
SAS makes AI governance the centerpiece of its agent strategy
April 29, 2026
The boardroom divide: Why cyber resilience is a cultural asset
April 28, 2026
Samsung Galaxy AI for business: Productivity meets security
April 28, 2026
Startup tackles knowledge graphs to improve AI accuracy
April 28, 2026
AI won’t fix your data problems. Data engineering will
April 28, 2026
Recent Posts
  • 샤오미, MIT 라이선스 ‘미모 V2.5’ 공개···장시간 실행 AI 에이전트 시장 겨냥
  • SAS makes AI governance the centerpiece of its agent strategy
  • The boardroom divide: Why cyber resilience is a cultural asset
  • Samsung Galaxy AI for business: Productivity meets security
  • Startup tackles knowledge graphs to improve AI accuracy
Recent Comments
    Archives
    • April 2026
    • March 2026
    • February 2026
    • January 2026
    • December 2025
    • November 2025
    • October 2025
    • September 2025
    • August 2025
    • July 2025
    • June 2025
    • May 2025
    • April 2025
    • March 2025
    • February 2025
    • January 2025
    • December 2024
    • November 2024
    • October 2024
    • September 2024
    • August 2024
    • July 2024
    • June 2024
    • May 2024
    • April 2024
    • March 2024
    • February 2024
    • January 2024
    • December 2023
    • November 2023
    • October 2023
    • September 2023
    • August 2023
    • July 2023
    • June 2023
    • May 2023
    • April 2023
    • March 2023
    • February 2023
    • January 2023
    • December 2022
    • November 2022
    • October 2022
    • September 2022
    • August 2022
    • July 2022
    • June 2022
    • May 2022
    • April 2022
    • March 2022
    • February 2022
    • January 2022
    • December 2021
    • November 2021
    • October 2021
    • September 2021
    • August 2021
    • July 2021
    • June 2021
    • May 2021
    • April 2021
    • March 2021
    • February 2021
    • January 2021
    • December 2020
    • November 2020
    • October 2020
    • September 2020
    • August 2020
    • July 2020
    • June 2020
    • May 2020
    • April 2020
    • January 2020
    • December 2019
    • November 2019
    • October 2019
    • September 2019
    • August 2019
    • July 2019
    • June 2019
    • May 2019
    • April 2019
    • March 2019
    • February 2019
    • January 2019
    • December 2018
    • November 2018
    • October 2018
    • September 2018
    • August 2018
    • July 2018
    • June 2018
    • May 2018
    • April 2018
    • March 2018
    • February 2018
    • January 2018
    • December 2017
    • November 2017
    • October 2017
    • September 2017
    • August 2017
    • July 2017
    • June 2017
    • May 2017
    • April 2017
    • March 2017
    • February 2017
    • January 2017
    Categories
    • News
    Meta
    • Log in
    • Entries feed
    • Comments feed
    • WordPress.org
    Tiatra LLC.

    Tiatra, LLC, based in the Washington, DC metropolitan area, proudly serves federal government agencies, organizations that work with the government and other commercial businesses and organizations. Tiatra specializes in a broad range of information technology (IT) development and management services incorporating solid engineering, attention to client needs, and meeting or exceeding any security parameters required. Our small yet innovative company is structured with a full complement of the necessary technical experts, working with hands-on management, to provide a high level of service and competitive pricing for your systems and engineering requirements.

    Find us on:

    FacebookTwitterLinkedin

    Submitclear

    Tiatra, LLC
    Copyright 2016. All rights reserved.