Most enterprises right now are running two AIs.
The first AI is the visible, exciting one: developer-led copilots, RAG pilots in customer support, agentic PoCs someone spun up in a cloud notebook and the AI that quietly arrived inside SaaS apps. It’s fast, easy to stand up, full of potential and usually lives just outside the formal IT perimeter.
The other AI is the one the CIO has to defend: the one that must be governed, costed, secured and mapped to board expectations. Those two AIs are starting to collide — which is exactly what May Habib described when she said 42% of Fortune 500 executives feel AI is “tearing their companies apart.”
As with past waves of innovation, AI follows a familiar path: new tech starts in the developer’s playground, then becomes the CIO’s headache and finally matures into a centrally managed platform. We saw that with virtualization, then with cloud, then with Kubernetes. AI is no exception.
So far, generative AI has given application and business teams powerful tools that help them solve real problems without waiting for a 12-month IT cycle. Yet success breeds sprawl: enterprises are now dealing with multiple RAG stacks, different model providers, overlapping copilots in SaaS and no shared guardrails.
That’s the tension showing up in 2025 enterprise reporting — AI value is uneven and organizational friction is high. We have definitely reached the point where IT has to step in and say: this is how our company approaches AI — a single way to expose models, consistent policies, better economics and plenty of visibility. That’s the move McKinsey describes as “build a platform so product teams can consume it.”
What’s different with AI is where the pain is. With cloud adoption, for example, security and network were the first blockers. With AI, the blocker is inference — the part that delivers the business returns, touches private and confidential data and is now the main source of opex. That’s why McKinsey talks about “rewiring to capture value,” not just adding more pilots. And this matches the widely reported results of a recent MIT study: 95% of enterprise gen-AI implementations have had no measurable P&L impact because they weren’t integrated into existing workflows.
The issue isn’t that models don’t work — it’s that they weren’t put on a common, governed path.
Platformization as the path to governance and margin
The biggest mistake we can make today is treating AI infrastructure like a static, dedicated resource. The demands of language models (large and small), the pressure of data sovereignty and the relentless drive for cost reduction all converge on one conclusion: AI inference is now an infrastructure imperative. And the solution is not more hardware; it’s a CIO-led platformization strategy that enforces accountability and control, making AI a strategic infrastructure service. This requires a strong separation of duties and the implementation of a scale-smart philosophy versus just a scale-up approach.
Enforce a separation of duties and create the AI P&L center
We must elevate the management of AI infrastructure to a financial priority. This mandates a clear split: the infrastructure team focuses entirely on the platform — ensuring security, managing the distributed topology and driving down the $/million tokens cost — while the data science teams focus solely on business value and model accuracy.
This framework, which I call the AI P&L center, ensures that resource choices are treated as direct financial levers that increase margin and guarantee compliance. Research highlights that CIOs are increasingly tasked with establishing strong AI governance and cost control frameworks to deliver measurable value.
Shift from scale-up to scale-smart optimization
The technical strategy must implement a scale-smart philosophy: a continuous process of monitoring, analyzing, optimizing and deploying models based on economic policy, not just load. This means matching each model’s needs closely to the capabilities of the infrastructure it runs on. The shift is essential because it lets the platform use resources efficiently for two of the most important developments in AI:
- Small language models (SLMs). Highly specialized SLMs fine-tuned on proprietary data deliver far greater accuracy and contextual relevance for specific enterprise tasks than giant, generic LLMs. This move saves money not just because the models are smaller, but because their higher precision reduces costly errors. Studies show that enterprises deploying SLMs report better model accuracy and faster ROI compared to those using general-purpose models. Gartner has predicted that by 2027, organizations will use task-specific SLMs three times more often than general-use LLMs.
- Agentic workflows. Next-generation applications use agentic AI, meaning a single user query cascades through multiple models. Managing these sequential, multimodel workflows requires an intelligent platform that can route requests based on key-value (KV) cache proximity and seamlessly execute optimizations like automatic prefill/decode split, flash attention, quantization, speculative decoding and model sharding across heterogeneous GPUs and CPUs. These are techniques that, in plain terms, drastically reduce latency and cost for complex AI tasks.
In both cases, and more generally any time a model performs inference, a double-digit reduction in $/million tokens is possible only when every request is automatically routed based on cost policy and optimized by techniques that continuously tune the model’s execution against heterogeneous hardware. That, in turn, requires a centralized, unified platform designed and built to support inference across the enterprise.
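To make routing by cost policy concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not a reference design: the `Backend` and `Request` types, the backend names and every price are hypothetical, and a real platform would also weigh latency, KV cache locality and current load.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backend:
    """One place a model can be served. All values are hypothetical."""
    name: str
    usd_per_million_tokens: float  # blended serving cost (assumed figure)
    on_premises: bool              # True if data stays inside the secure environment
    max_context_tokens: int        # largest prompt this backend accepts


@dataclass(frozen=True)
class Request:
    prompt_tokens: int
    requires_on_premises: bool     # data residency constraint for this request


def route(request: Request, backends: list[Backend]) -> Backend:
    """Return the cheapest backend that satisfies the request's constraints."""
    eligible = [
        b for b in backends
        if b.max_context_tokens >= request.prompt_tokens
        and (b.on_premises or not request.requires_on_premises)
    ]
    if not eligible:
        raise LookupError("no backend satisfies the policy for this request")
    # Policy: among compliant backends, minimize $/million tokens.
    return min(eligible, key=lambda b: b.usd_per_million_tokens)


# Illustrative fleet: a small on-prem model, a large on-prem model, a cloud API.
BACKENDS = [
    Backend("slm-onprem-cpu", 0.40, True, 8_000),
    Backend("llm-onprem-gpu", 6.00, True, 128_000),
    Backend("llm-cloud-api", 3.00, False, 128_000),
]
```

With these assumed figures, a short prompt that must stay on premises goes to the small model, a long one with the same constraint goes to the on-premises GPU pool, and a long prompt with no residency constraint goes to the cheaper cloud endpoint. Compliance acts as a filter and cost as the tiebreaker, which is the ordering the argument above implies.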
Addressing today’s inefficiencies of AI inference serving
The traditional approach we use to manage most of our enterprise infrastructure — what I call the scale-up mentality — is failing when applied to continuous AI inference and can’t be used to build the inference platform needed by CIOs. We’ve been provisioning dedicated, oversized clusters, often purchasing the newest and largest GPUs and replicating the resource-intensive environment required for training.
This is fundamentally inefficient for at least two key reasons:
- Inference is characterized by massive variability and idle time. Unlike training, which is a continuous, long-running job, inference requests are spiky, unpredictable and often separated by periods of inactivity. If you’re running a massive cluster to serve intermittent requests, you’re paying for megawatts of wasted capacity. Our utilization rates drop and the finance team asks tough questions. The true cost metric that matters now isn’t theoretical throughput; it’s dollars per million tokens. Gartner research shows that managing the unpredictable and often spiraling cost of generative AI is a top challenge for CIOs. We are optimizing for economics, not just theoretical performance.
- The deployment landscape is hybrid by mandate. It’s unrealistic to expect AI inference to run in a centralized, homogeneous environment. For regulated industries, such as financial services and health care, or for operations that rely on proprietary internal data, the data often cannot leave the secure environment. Inference must occur on premises, at the data edge or in secure colocation facilities to meet strict data residency and sovereignty requirements. Forcing mission-critical workloads through generic cloud API endpoints often cannot satisfy these regulatory and security requirements, which is driving enterprises toward hybrid and edge services. One level down, the hardware is heterogeneous as well: a mix of CPUs, GPUs, DPUs and specialized processing units that the platform must manage seamlessly.
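The first point, that idle capacity inflates the metric that matters, can be shown with simple arithmetic. The sketch below is a simplification under assumed numbers: the function name, the $36 hourly cluster cost and the 10,000 tokens-per-second peak are all hypothetical, and it ignores batching effects, power and staffing.

```python
def usd_per_million_tokens(hourly_cost_usd: float,
                           peak_tokens_per_second: float,
                           utilization: float) -> float:
    """Effective serving cost when a cluster is busy only part of the time.

    utilization is the fraction of peak throughput actually served (0 < u <= 1).
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_hour = peak_tokens_per_second * 3600 * utilization
    return hourly_cost_usd / tokens_per_hour * 1_000_000


# Hypothetical cluster: $36 per hour, 10,000 tokens per second at peak.
HOURLY_COST_USD = 36.0
PEAK_TOKENS_PER_SECOND = 10_000.0

fully_used = usd_per_million_tokens(HOURLY_COST_USD, PEAK_TOKENS_PER_SECOND, 1.0)
mostly_idle = usd_per_million_tokens(HOURLY_COST_USD, PEAK_TOKENS_PER_SECOND, 0.1)
```

Under these assumptions the same cluster costs about $1 per million tokens when fully used and about $10 when it is busy 10% of the time. The hardware bill is identical in both cases; only utilization changes, which is why the comparison is made in dollars per million tokens and not in theoretical throughput.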
Mastering the inference platform: An infrastructure imperative for the CIO
A unified platform is not about forcing alignment to a single model; it’s about establishing the governance layer necessary to unlock a much wider variety of models, agents and applications that meet enterprise security and cost management requirements.
The transition from scale-up to scale-smart is the essential, unifying task for the technology leader. The future of AI is not defined by the models we train, but by the margin we capture from the inference we run.
The strategic mandate for every technology leader must be to embrace the function of platform owner and financial architect of the AI P&L center. This structural change ensures that data science teams can continue to innovate at speed, knowing the foundation is secure, compliant and cost-optimized.
By enforcing platformization and adopting a scale-smart approach, we move beyond the wild west of uncontrolled AI spending. The choice for CIOs is clear: continue trying to manage the escalating cost and chaos of decentralized AI, or seize the mandate to build the AI P&L center that turns inference into a durable, margin-driving advantage.
This article is published as part of the Foundry Expert Contributor Network.

