AI efficiency beyond the model: Rethinking code, hardware and cloud

As AI adoption grows, I see fellow enterprise leaders realizing that just implementing AI is not enough. We need to develop and adopt the best, fastest and most efficient AI models. It’s not just a matter of pride about who has the shiniest toy; optimizing models for efficiency can be the difference between a failed pilot and an effective business strategy.

At the most extreme end of the spectrum, inefficient use of AI can cost billions of dollars. Sam Altman, CEO of OpenAI, made headlines when he admitted on X that people saying “please” and “thank you” to his AI models has cost his company tens of millions of dollars, even though he added that he feels it’s money well spent.

Model efficiency also matters for those of us not operating at OpenAI’s scale. A more efficient model helps reduce overall costs because it doesn’t require as powerful or expensive hardware, uses less electricity, delivers output faster and can operate with a smaller cloud footprint.

Models that are optimized for efficiency deliver lower latency, improved scalability and increased flexibility, and they are less likely to drift. In my experience, all of this adds up to higher profit margins, a sharper competitive edge and a faster time to market, which are crucial whether you’re planning to use your model internally or sell it to others.

The new CIO investment dilemma

For a long time, it was believed that hardware must continually increase in power to enable models to grow in size. Then DeepSeek v2 came along and upended that assumption. It showed that smaller, smarter models can deliver equivalent results with less compute power.

Now, those of us in the CIO seat face a new dilemma: should we increase investment in computing power, focus on hardware or concentrate on software?

In my view, the correct answer is: all of the above. AI efficiency is a full-stack problem. Hardware, compilers, runtime and model architecture must be co-designed to work in harmony; otherwise, we’re wasting money and failing to achieve the results we need. Today, choosing GPUs vs. custom accelerators vs. CPUs affects which model optimizations are viable.

Hardware power constrains model capabilities

It remains true that even the most powerful model in the world can’t function without access to the necessary hardware. Hardware performance is ultimately bounded by memory bandwidth, interconnect speed and compute units, no matter how optimized our models are.

This means that scalability depends on interconnects. Multi-node training and large inference clusters hinge on the performance of NVLink, InfiniBand or Ethernet fabric, not just model quality, so decisions about hardware investments or cloud providers can be critical to overall functionality.

“The pace of innovation is directly tied to advances in GPUs, tensor processing units (TPUs) and custom accelerators. The real question isn’t just what models we can build, but whether we have the compute infrastructure to support them,” says Gaurav Dewan, a research director at Avasant. “Models can only grow as powerful as the chips, memory systems and data center networks sustaining them.”

Compute power isn’t everything

That said, in my experience, you can’t just throw computing power at every problem. Choices about hardware and cloud architecture determine how effectively users can tap into the potential of compute resources. Modern AI workloads are often memory-bound rather than compute-bound, so faster HBM, cache hierarchies and interconnects directly lower latency.
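To make the memory-bound point concrete, here is a minimal roofline-style sketch in Python. The hardware figures (PEAK_FLOPS, MEM_BANDWIDTH) and the workload intensity are illustrative assumptions rather than measurements of any particular chip, and the function names are hypothetical.

```python
def attainable_flops(intensity, peak_flops, mem_bandwidth):
    """Roofline bound: the lower of the compute peak and
    memory bandwidth times arithmetic intensity (FLOPs per byte)."""
    return min(peak_flops, mem_bandwidth * intensity)


def is_memory_bound(intensity, peak_flops, mem_bandwidth):
    """A workload is memory-bound when its intensity sits below the
    ridge point, where the bandwidth and compute limits meet."""
    ridge_point = peak_flops / mem_bandwidth
    return intensity < ridge_point


# Assumed accelerator, for illustration only.
PEAK_FLOPS = 1.0e15      # 1 PFLOP/s of peak compute
MEM_BANDWIDTH = 2.0e12   # 2 TB/s of memory bandwidth

# Assumed workload: a batch-1 matrix-vector multiply with 2-byte weights
# does roughly 2 FLOPs per 2 bytes read, i.e. about 1 FLOP per byte.
DECODE_INTENSITY = 1.0

if __name__ == "__main__":
    bound = attainable_flops(DECODE_INTENSITY, PEAK_FLOPS, MEM_BANDWIDTH)
    print("memory-bound:", is_memory_bound(DECODE_INTENSITY, PEAK_FLOPS, MEM_BANDWIDTH))
    print("share of peak compute reachable:", bound / PEAK_FLOPS)
```

Under these assumed numbers the workload can reach only a small fraction of the chip's peak compute, which is why faster memory and interconnects can matter more than additional compute units.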

What’s more, the energy available to power that compute is limited. Companies can’t always afford the compute power they want, with 58% saying their AI cloud costs are too high. Cost per inference is hardware-driven and compute is usually the biggest line item in AI TCO. It’s not even easy to find space for enough GPUs, and board-level power and cooling limits add further constraints in enterprise AI. More efficient silicon reduces data center strain, sustainability risk and cost per token/inference.

Additionally, reliability and utilization affect ROI. Features like MIG partitioning, hardware scheduling and fault tolerance determine how fully we can monetize expensive accelerators. Performance per watt is now the bottom line, with CIOs like me striving to get more out of every existing GPU per watt, dollar and square meter. We need to make our hardware more efficient by fine-tuning models and software to maximize capability.
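The link between utilization and cost per token can be sketched with simple arithmetic. The following Python snippet is a back-of-the-envelope model, not a pricing tool: the function name, the hourly rate and the throughput figure are all assumptions chosen for illustration.

```python
def cost_per_million_tokens(gpu_hourly_cost, tokens_per_second, utilization):
    """Dollars per one million generated tokens on a single accelerator.

    utilization is the assumed fraction of each hour the accelerator
    spends doing useful work (0 < utilization <= 1).
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    effective_tokens_per_hour = tokens_per_second * 3600 * utilization
    return gpu_hourly_cost / effective_tokens_per_hour * 1_000_000


if __name__ == "__main__":
    # Assumed figures: $4 per GPU-hour, 1,000 tokens per second at full load.
    for utilization in (1.0, 0.5, 0.25):
        cost = cost_per_million_tokens(4.0, 1000.0, utilization)
        print(f"utilization {utilization:.2f}: ${cost:.2f} per million tokens")
```

In this simplified model, halving utilization doubles the cost per token on the same hardware, which is why idle accelerators show up so quickly in the ROI.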

“DeepSeek’s breakthrough suggests that AI models no longer need to scale indefinitely in size and complexity to achieve superior performance. Instead, they can be algorithmically optimized to deliver the same, if not better, results while consuming significantly fewer resources,” explains Matthew Taylor in his post on LinkedIn.

Rethinking cloud strategy in the age of AI

That cost pressure has forced many of us to revisit assumptions we held for the better part of a decade. Cloud computing has reached an uncertain crossroads. The hyperscaler-by-default posture that defined the last era of enterprise IT no longer survives a serious look at AI economics.

When inference costs scale linearly with usage and training runs can consume an annual infrastructure budget in weeks, the question I hear in every CIO conversation is the same: does our cloud strategy still match the workload we are actually running?

In my experience, the answer is increasingly no, at least not without significant rebalancing. Private clouds, written off as legacy not long ago, are quietly making a comeback. The combination of predictable cost structures, tighter control over data residency and the sensitivity of the proprietary data feeding our AI systems is making on-premise and colocation options compelling again, particularly for regulated industries.

At the same time, purpose-built neoclouds for GPU workloads, along with sovereign clouds responding to jurisdictional and data-protection mandates, are steadily chipping away at the dominance of AWS, Azure and GCP. None of these alternatives replace the hyperscalers outright, but they are forcing every CIO I know to think about cloud as a portfolio rather than a single vendor relationship.

What I have found is that navigating this shift takes more than a procurement decision. It takes a clear-eyed view of where each workload genuinely belongs. Training, inference, retrieval, fine-tuning and experimentation each carry different cost curves, latency profiles and data-gravity considerations. As organizations move towards the agentic AI era, the underlying data platform becomes equally important, requiring architectures that can support multimodal data, real-time processing and governance at scale.

The enterprises I have seen handle this best treat cloud strategy as an ongoing exercise in workload placement, not a one-time platform commitment.

That is also where the conversation tends to outgrow internal teams.

As AI moves from pilots to production, the questions get harder: how to architect data foundations that survive model churn, how to govern AI without strangling it, how to translate technical efficiency into measurable business value. I have seen organizations lean on specialist partners to think through these problems alongside them. Among the consultancies working at this intersection is Artefact, founded in Paris and operating across data strategy, AI engineering and enterprise transformation. Its work includes governance, platform development, operating models and workforce enablement—areas that have become increasingly important as organizations move from AI pilots to large-scale deployment.

What I find useful about these consultancies is not the technology recommendations themselves; it is the pattern recognition they bring from seeing similar cloud and AI transitions play out across geographies and sectors. In a moment when every CIO is rewriting the playbook simultaneously, that outside vantage point matters more than it used to.

Hardware is often underused and misused

A lot of hardware goes unused or underutilized. Often, GPUs sit idle due to deployment complexity and data infrastructure bottlenecks, so enterprises don’t see the value of the compute power they’re paying for. When data and compute sit on two separate chips, time and energy are wasted moving data between the two locations.

Likewise, models that exceed accelerator memory or require excessive HBM traffic suffer steep latency and cost penalties. Optimizing models to align with hardware means that all the compute power is being put to good use.
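A rough way to check whether a model's weights fit in accelerator memory is sketched below in Python. The parameter counts, the 80 GiB capacity and the 20% reserve for activations and caches are illustrative assumptions, and the function names are hypothetical.

```python
def weights_gib(num_params, bytes_per_param):
    """Approximate size of the model weights in GiB."""
    return num_params * bytes_per_param / 2**30


def fits_on_accelerator(num_params, bytes_per_param, hbm_gib, reserve_fraction=0.2):
    """True if the weights fit in the usable share of accelerator memory.

    reserve_fraction is an assumed share of HBM held back for
    activations, caches and runtime overhead.
    """
    usable_gib = hbm_gib * (1 - reserve_fraction)
    return weights_gib(num_params, bytes_per_param) <= usable_gib


if __name__ == "__main__":
    HBM_GIB = 80  # assumed accelerator memory capacity
    for params in (70_000_000_000, 30_000_000_000):
        size = weights_gib(params, 2)  # 2 bytes per parameter (16-bit weights)
        print(f"{params:,} params: {size:.1f} GiB, fits: {fits_on_accelerator(params, 2, HBM_GIB)}")
```

Under these assumptions the larger model spills out of a single accelerator while the smaller one stays resident, which is the situation that pruning or fine-tuning a smaller model is meant to create.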

Techniques like operator fusion, activation management, fine-tuning smaller models, pruning unnecessary parameters and memory-aware architectures keep more of the model resident on the accelerator, reduce unnecessary read/write cycles and combine steps so data is touched fewer times.
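The idea behind operator fusion can be illustrated with a toy example in plain Python. This is only a sketch of the concept: real fusion is performed by compilers and hand-written kernels, and the function names and the traffic model here are hypothetical simplifications.

```python
def unfused(xs, scale, bias):
    """Three separate operators; each one materializes a full intermediate."""
    scaled = [x * scale for x in xs]        # pass 1: read xs, write scaled
    shifted = [s + bias for s in scaled]    # pass 2: read scaled, write shifted
    return [max(0.0, v) for v in shifted]   # pass 3: read shifted, write output


def fused(xs, scale, bias):
    """One combined operator; each element is read once and written once."""
    return [max(0.0, x * scale + bias) for x in xs]


def memory_traffic(n_elements, n_ops, is_fused):
    """Element reads plus writes, assuming every pass over the data
    reads each element once and writes each element once."""
    passes = 1 if is_fused else n_ops
    return 2 * n_elements * passes


if __name__ == "__main__":
    data = [-1.0, 0.5, 2.0]
    print(unfused(data, 2.0, 1.0) == fused(data, 2.0, 1.0))
    print(memory_traffic(1_000_000, 3, False), "vs", memory_traffic(1_000_000, 3, True))
```

Both versions compute the same result, but in this simplified model the fused version touches memory a third as often. Plain Python cannot reproduce the hardware effect; the point is only to show why combining steps reduces read/write cycles.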

Kfir Aberman, founding member at Decart AI, explains this approach. “Our solution to this was to optimize our kernels for how [Nvidia GPU] Hopper works. Essentially, we created a single ‘mega kernel’ that enables the chip to process all of a model’s computations in a single, continuous pass. By doing this, we eliminate all of the stopping, starting and data movement, allowing more of the GPU to be utilized more of the time, speeding up processing by an order of magnitude.”

Matching models to accelerator characteristics such as tensor core shapes, SIMD widths and kernel libraries keeps expensive silicon working effectively and translates theoretical FLOPs into real throughput.

More hardware can’t overcome model mismatch

Another way that organizations undermine ROI on their own AI investments is by ignoring coordination efficiency.

They’ll buy large GPU clusters but pay little attention to what seem like minor issues with batching and alignment. Unfortunately, when batch sizes are wrong, work is split inefficiently and network links become bottlenecks, you see expensive but underutilized clusters.

Ultimately, more GPUs don’t guarantee more performance. Parallelism and batching must match the system topology. Effective scaling depends on aligning data, tensor and pipeline parallelism and batch sizing with the actual interconnect bandwidth and node configuration.
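A simple sanity check for this kind of alignment might look like the Python sketch below. The function name and the rules it encodes are assumptions for illustration; in particular, keeping each tensor-parallel group inside a single node is a common rule of thumb, not a universal requirement.

```python
def check_layout(num_nodes, gpus_per_node, tensor_parallel,
                 pipeline_parallel, data_parallel, global_batch):
    """Return a list of problems with a proposed parallelism layout.

    An empty list means the layout passes these basic checks.
    """
    problems = []
    total_gpus = num_nodes * gpus_per_node

    # The three parallelism degrees must account for every GPU exactly once.
    if tensor_parallel * pipeline_parallel * data_parallel != total_gpus:
        problems.append("parallelism degrees do not multiply to the GPU count")

    # Assumed rule of thumb: tensor-parallel traffic is the heaviest, so keep
    # each tensor-parallel group on the fast links inside one node.
    if gpus_per_node % tensor_parallel != 0:
        problems.append("tensor-parallel group would span nodes")

    # Each data-parallel replica should receive an equal share of the batch.
    if global_batch % data_parallel != 0:
        problems.append("global batch does not divide evenly across replicas")

    return problems


if __name__ == "__main__":
    print(check_layout(4, 8, 8, 2, 2, 64))
    print(check_layout(4, 8, 16, 2, 1, 64))
```

The first layout passes all three checks on an assumed cluster of four 8-GPU nodes. The second uses the same number of GPUs but stretches tensor parallelism across nodes, the kind of mismatch that leaves a large cluster waiting on its network links.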

The magic happens when model and hardware come together

The lesson that those of us in CIO roles are learning is that symbiosis between model and hardware is critical. Code determines what our AI can do, hardware determines how efficiently we can afford to do it and co-design determines whether our AI program scales economically and successfully.

This article is published as part of the Foundry Expert Contributor Network.