In 2016, I was working on software for field area network gateways — routers installed in substations and roadside utility cabinets and expected to run unattended for years. Each gateway sat at the root of a low-power wireless mesh connecting thousands of smart meters. The radios were slow, the links were lossy and the backhaul was expensive.
We didn’t debate architecture. We reacted to constraints.
The gateways validated meter events, aggregated readings and made local decisions before anything reached a centralized system. Raw telemetry rarely left the field; the network couldn’t support it, and the latency of a round-trip to a datacenter would have broken real-time grid operations.
At the time, no one called this “compute everywhere.” It wasn’t a strategy — it was simply the only design that worked.
Years later, I saw the same pattern repeat in very different systems: video pipelines that moved inference closer to where content was already served, ML models pushed onto devices that once only forwarded data, and cloud platforms evolving to support compute outside centralized regions. Cloud-first didn’t fail; the assumptions underneath it stopped holding.
This article isn’t about distributed compute as a trend. It’s about the mechanics: where inference actually runs, how data really moves and what breaks operationally once workloads leave the cloud.
What “compute everywhere” actually means
“Compute everywhere” isn’t a rebrand of edge computing. It’s a recognition that modern systems need computation at multiple tiers — and that those tiers must cooperate.
A useful way to think about it is as a spectrum:
Device layer
Sensors and microcontrollers doing basic filtering or inference locally. If you want a concrete reference point, the community around the Edge AI Foundation (formerly the tinyML Foundation) is a good place to start.
Gateway layer
Aggregation points that translate protocols, correlate events and decide what's worth sending upstream — often using constrained-network IP and routing stacks such as 6LoWPAN and RPL. (A sketch of this filter-and-forward pattern follows below.)
Edge layer
Compute running in regional points of presence (PoPs) close to data sources and users, often operated by CDN or edge-compute providers.
Cloud layer
Centralized resources for training models, coordinating fleets and doing analytics that benefit from global context.
The shift isn’t about replacing the cloud. It’s about recognizing that “where should this computation run?” no longer has a default answer.
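To make the gateway layer concrete, here is a minimal sketch of the filter-and-forward pattern: aggregate locally, forward anomalies in full, and otherwise send only a compact summary. The thresholds and field names are invented for illustration, not taken from any real deployment.

```python
from statistics import mean

# Illustrative gateway-layer logic: aggregate locally, forward only what matters.
# VOLTAGE_LIMIT and the reading fields are invented for this example.
VOLTAGE_LIMIT = 260.0


def process_window(readings: list[dict]) -> dict | None:
    """Summarize one reporting window; return an upstream message or None."""
    anomalies = [r for r in readings if r["voltage"] > VOLTAGE_LIMIT]
    if anomalies:
        # Anomalies are worth the bandwidth: send them in full, right away.
        return {"type": "alarm", "events": anomalies}
    if readings:
        # Normal operation: a compact summary is enough for upstream analytics.
        return {
            "type": "summary",
            "count": len(readings),
            "avg_voltage": round(mean(r["voltage"] for r in readings), 1),
        }
    return None  # nothing to report; keep the link quiet
```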
Why the equation changed
A few forces converged to break the cloud-first assumption.
IoT was never cloud-first
Early industrial IoT systems were built around the assumption that data was expensive to move. Devices were power-constrained, often battery-operated and communicated over lossy, low-bandwidth networks.
In utility and smart-metering deployments, that reality shaped the stack. Standards such as IEEE 802.15.4g were developed specifically for Smart Utility Networks operating under these constraints.
Shipping everything upstream wasn’t just inefficient; it was often impossible. Architectures assumed local aggregation and selective reporting because the network simply could not sustain continuous raw backhaul.
That constraint wasn’t new.
What changed was the data — and how quickly it stopped being manageable.
Data got heavier
As systems began incorporating cameras, radar, lidar and high-frequency industrial sensors, payloads stopped looking like measurements and started looking like streams.
A single video feed can consume sustained megabits per second even after compression. Multiply that across dozens of cameras in a factory, retail location or intersection, and continuous upstream transport stops being a cost optimization problem and becomes a hard architectural constraint.
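The arithmetic is easy to check. Here is a back-of-envelope sketch; the camera count, per-stream bitrate and uplink capacity are illustrative assumptions, not measurements from any particular deployment.

```python
# Back-of-envelope uplink math for a multi-camera site (all numbers illustrative).
cameras = 48
mbps_per_stream = 4.0          # compressed 1080p, sustained
uplink_mbps = 100.0            # typical business-grade uplink

aggregate_mbps = cameras * mbps_per_stream
print(f"aggregate: {aggregate_mbps:.0f} Mb/s vs uplink: {uplink_mbps:.0f} Mb/s")
# -> aggregate: 192 Mb/s vs uplink: 100 Mb/s: continuous raw backhaul is already impossible

monthly_tb = aggregate_mbps / 8 * 3600 * 24 * 30 / 1e6   # Mb/s -> TB per month
print(f"~{monthly_tb:.0f} TB/month if you shipped it all upstream")
```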
This is why large-scale video and sensor deployments rarely ship raw data upstream once they move past pilot scale. Bandwidth adds up faster than most teams expect, and the cost isn’t just monetary.
Links saturate. Latency gets spiky. And once uplinks are congested, failures start to couple: a problem that used to be isolated to one site suddenly bleeds into the broader system because everyone is fighting for the same constrained path.
At a macro level, industry data reflects the same pressure. IDC’s Datasphere work (in a Seagate-hosted report) captures the scale of global data growth and how much of it originates outside centralized data centers.
Network forecasts tell a similar story: Cisco’s Annual Internet Report consistently highlights video as a major driver of IP traffic growth.
Those reports don’t tell you how to design your system, but they do explain why the old defaults keep breaking.
In practice, teams respond in remarkably similar ways across domains. They reduce data before it moves. They aggregate, filter and extract features close to where data is generated. Raw data stays local unless there’s a clear operational or analytical reason to ship it upstream.
Once volume crosses a certain threshold, compute follows it, not as a matter of fashion, but because the network is no longer a neutral substrate.
ML learned to run small
Until recently, meaningful inference required GPUs in centralized clusters. That constraint shaped architectures as much as any design preference.
That’s no longer true.
Post-training quantization, model distillation and hardware-aware optimization are now mainstream — and supported directly in production toolchains. Google’s edge documentation for post-training quantization (via LiteRT / TensorFlow Lite workflows) is a good concrete reference.
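As a rough sketch of that workflow: the tiny functional-API model and random calibration data below are stand-ins; a real pipeline would convert the trained network and calibrate with representative inputs.

```python
import numpy as np
import tensorflow as tf

# Tiny stand-in model; in practice this would be your trained network.
inputs = tf.keras.Input(shape=(32, 32, 3))
x = tf.keras.layers.Conv2D(8, 3, activation="relu")(inputs)
x = tf.keras.layers.GlobalAveragePooling2D()(x)
outputs = tf.keras.layers.Dense(4)(x)
model = tf.keras.Model(inputs, outputs)

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]

# A representative dataset lets the converter calibrate activation ranges for
# full integer quantization, which most edge accelerators expect.
def representative_data_gen():
    for _ in range(100):
        yield [np.random.rand(1, 32, 32, 3).astype(np.float32)]

converter.representative_dataset = representative_data_gen
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

with open("model_int8.tflite", "wb") as f:
    f.write(converter.convert())
```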
As a result, models that once demanded datacenter-class hardware can now run within power and memory budgets measured in single-digit watts, particularly when paired with purpose-built edge accelerators and optimized runtimes. (Again, the Edge AI Foundation community is a useful signpost here.)
What made this viable wasn’t a single breakthrough, but a convergence: smaller models and better tooling that let the edge run inference continuously without blowing power or cost budgets.
Physics and regulations
Some constraints are absolute. In fiber, propagation delay alone sets a floor on latency. A commonly used rule of thumb is roughly 4.9 microseconds per kilometer, often rounded to 5 µs/km.
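That rule of thumb makes the floor easy to compute for any deployment. A minimal sketch, with illustrative distances:

```python
# Rough latency floor from fiber propagation alone (~5 µs per km, one way).
PROPAGATION_US_PER_KM = 5.0  # commonly used rule of thumb


def round_trip_floor_ms(distance_km: float) -> float:
    """One-way distance in km -> minimum round-trip time in ms, ignoring all processing."""
    return 2 * distance_km * PROPAGATION_US_PER_KM / 1000.0


for km in (100, 1000, 4000):
    print(f"{km:>5} km: >= {round_trip_floor_ms(km):.1f} ms round trip")
# 100 km ~ 1 ms, 1000 km ~ 10 ms, 4000 km ~ 40 ms, before queuing, routing or TLS.
```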
Regulatory constraints are just as unforgiving. Data residency and processing requirements under GDPR and similar frameworks shape where certain data can be processed.
Edge inference helps keep sensitive data local, with only aggregated or anonymized results sent upstream.
What this looks like in production
Once you accept those constraints, you keep seeing the same architectural shapes.
Decisions at the grid edge
In utility systems, milliseconds matter. Fault detection and isolation must happen locally to maintain grid stability. Gateways execute control logic continuously, while centralized systems focus on planning, analytics and model updates.
The cloud remains essential — but it’s not in the real-time control loop.
Video processed near where content lives
CDN operators and edge platforms increasingly provide compute capabilities at or near their PoPs. When video is already distributed close to users for delivery efficiency, processing it locally avoids unnecessary data movement.
You can see this kind of edge/cloud split discussed in live video analytics work, including Microsoft Research’s Rocket project.
Devices now decide
Across industrial and retail environments, devices that once forwarded raw measurements now filter, classify and act locally. Central systems still matter for aggregation, long-term analysis and retraining — but they’re no longer in the critical path for every decision the system makes.
Operational complexity
Here’s where the “compute everywhere” pitch gets fuzzy. The tooling evolved fast. Operating it is slower, harder work.
Deployment isn’t continuous anymore
Cloud deployments assume constant connectivity. Edge devices do not. Some synchronize once a day. Others disappear for weeks.
Updating software or models turns into a logistics problem: staged rollouts, health checks and the ability to stop or roll back when things go wrong. Those patterns show up explicitly in job-based fleet update mechanisms — for example, AWS IoT jobs.
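As a hedged sketch of what that looks like with AWS IoT Jobs via boto3 (the job document, thing group ARN and thresholds below are illustrative placeholders, not a recommended configuration):

```python
import json
import boto3

iot = boto3.client("iot")

# Staged rollout of a new model bundle to a device group, with an automatic
# abort if too many executions fail. All identifiers and limits are examples.
iot.create_job(
    jobId="model-update-2024-07-rev2",
    targets=["arn:aws:iot:us-east-1:123456789012:thinggroup/gateways-prod"],
    document=json.dumps({"operation": "update_model",
                         "url": "https://example.com/model_int8.tflite"}),
    targetSelection="SNAPSHOT",
    jobExecutionsRolloutConfig={
        "maximumPerMinute": 50,
        "exponentialRate": {
            "baseRatePerMinute": 5,                      # start slow
            "incrementFactor": 2.0,                      # speed up as devices succeed
            "rateIncreaseCriteria": {"numberOfSucceededThings": 50},
        },
    },
    abortConfig={
        "criteriaList": [{
            "failureType": "FAILED",
            "action": "CANCEL",
            "thresholdPercentage": 10.0,                 # stop if >10% of executions fail
            "minNumberOfExecutedThings": 25,
        }]
    },
    timeoutConfig={"inProgressTimeoutInMinutes": 60},
)
```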
Partial failures are normal
In fleets of thousands of devices, something is always broken. Power issues, network partitions, hardware variation and firmware bugs create a steady state of partial failure.
Observability is harder, too. A silent device might be offline — or dead. Distinguishing between the two requires explicit design, often based on heartbeats and deadlines rather than continuous metrics.
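A minimal sketch of that heartbeat-and-deadline approach; the intervals and thresholds are illustrative:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Liveness based on heartbeats and deadlines rather than continuous metrics:
# a device is only declared suspect or offline after missing known deadlines.
HEARTBEAT_INTERVAL = timedelta(minutes=15)
GRACE = timedelta(minutes=5)          # tolerate jitter and retries
OFFLINE_AFTER = 4                     # missed intervals before we call it down


@dataclass
class DeviceState:
    device_id: str
    last_heartbeat: datetime


def classify(device: DeviceState, now: datetime) -> str:
    silence = now - device.last_heartbeat
    if silence <= HEARTBEAT_INTERVAL + GRACE:
        return "healthy"
    if silence <= OFFLINE_AFTER * HEARTBEAT_INTERVAL:
        return "suspect"      # may be on a scheduled sleep or a flaky link
    return "offline"          # escalate: ticket, truck roll, or peer check
```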
Fleet diversity
Over time, edge fleets drift. Hardware revisions, firmware versions and configuration exceptions accumulate. A model that works on most devices fails on a minority due to subtle differences no one documented.
Maintaining homogeneity becomes an operational necessity, not an aesthetic preference.
How teams actually decide what runs where
The teams that navigate this transition well don’t start with an “edge strategy.” They start by asking uncomfortable questions about their workload.
- Where does the data originate, and what does it cost to move? Data gravity usually matters more than latency. If data is generated at the edge, shipping models outward is often cheaper and simpler than pulling raw data back to the cloud.
- What constraints are non-negotiable? Physics sets latency floors. Regulations restrict data movement. Power and connectivity shape what you can assume about availability. When one of these forces compute outward, it’s better to accept it early than fight it later.
- What are you actually optimizing for? I’ve seen teams push inference to the edge in the name of “latency” when their application could tolerate hundreds of milliseconds. The result was a large increase in operational complexity with no user-visible benefit. Measure what actually matters before you distribute anything.
- Can you operate it? This is the question teams skip. Running edge infrastructure requires skills many cloud-native organizations don’t have: embedded systems experience, fleet management and tolerance for intermittent connectivity. If you can’t reliably update devices or reason about partial failures, keeping workloads centralized is often the safer choice.
The new default
Compute everywhere isn’t a new layer you bolt onto an existing architecture. It’s a change in what teams assume by default.
The cloud didn’t become irrelevant. It stopped being the reflexive answer to every placement question.
Organizations that navigate this well don’t frame the problem as edge versus cloud. They treat the device-to-cloud continuum as a design space and make explicit choices within it. Inference runs close to where data is generated. Training and coordination stay centralized, where aggregation pays off. Analytics lives where global visibility actually adds value.
What surprised me wasn’t that teams moved compute out of the cloud. It was how rarely they did it because they wanted to — and how often they did it because they had to.
This article is published as part of the Foundry Expert Contributor Network.