Inferencing is an important part of how the AI sausage is made. But, unlike the proverbial sausage — whose making is best left unobserved — it’s critical to know what inference is, why it’s important and what decisions you can and should make around it. Over the past two years, I have had the opportunity to observe and understand how generative AI (gen AI) inferencing is evolving and accelerating, both through my interactions with practitioners here at Red Hat and through my work with customers.
Inferencing is actually the driving force behind what most people, in my experience, have come to understand as AI: It’s the operational phase of the machine learning process — the point at which a trained model applies learned patterns to draw conclusions from information it hasn’t seen before. This is the foundation for generative AI and for AI’s ability to mimic human-style reasoning. As you integrate AI more deeply into new and existing applications, the decisions you make around inferencing technology will become more and more impactful.
Redefining what’s possible
During the past few years, I’ve watched the rise of AI agents redefine what’s possible in software delivery, automation and intelligent operations. But pretty much every conversation I have about deploying agents eventually circles back to the same core tension: serving and inferencing language models efficiently and securely — whether massive multi-task models capable of reasoning or smaller, fine-tuned ones trained for precision.
There are endless use cases for inferencing in enterprise applications, but I find it useful to explain the impact of inference in terms of applications in healthcare and finance. In both fields, a previously unseen piece of data — an extra heartbeat in a patient’s ECG, say, or an unexpected transaction on an account — can be put into the context of an AI model’s training over time to identify potential risk. This combination of training and inferencing enables AI to mimic human thinking based on evidence and reasoning — the “why” of AI. Without inferencing, AI models and agents cannot apply learned data to new situations or problems presented by users.
Indeed, if you want to integrate new and existing data for decision-making, automation and/or insight generation — especially if the results must be delivered in real time at scale — you must prioritize AI inferencing capabilities as a key part of your technology stack.
OK…how do you do that?
AI inference servers enable the jump from training to inferencing. An AI inference server is the runtime software that hosts a compatible AI model: when a user prompts or queries the model for an answer, the server uses the logic and knowledge encoded in the model to produce that answer, token by token. I believe it’s one of the most — if not the most — important pieces in the puzzle for enterprises building AI applications. And it’s not just me saying this: According to Market.US, the global AI inference server market will grow from $24.6 billion in 2024 to about $133.2 billion by 2034.
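To make that concrete, here’s a minimal sketch of what querying an inference server looks like from an application’s point of view. It assumes a server exposing an OpenAI-compatible HTTP API, which has become a common convention among inference servers; the address, endpoint and model name are illustrative placeholders rather than any specific product’s API.

```python
# A minimal sketch of inference from the client side: send a prompt to a
# running inference server over HTTP and read back the generated answer.
# Assumes an OpenAI-compatible API at a hypothetical local address.
import requests

response = requests.post(
    "http://localhost:8000/v1/chat/completions",  # hypothetical server address
    json={
        "model": "my-deployed-model",  # whatever model the server has loaded
        "messages": [
            {"role": "user", "content": "Is this heartbeat pattern anomalous?"}
        ],
        "max_tokens": 128,
    },
    timeout=30,
)
print(response.json()["choices"][0]["message"]["content"])
```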
With that said, it’s important to understand that there are some significant challenges associated with delivering inferencing capabilities. Based on the work I’ve done with customers, the biggest issues include:
- Inferencing is resource-intensive and expensive. As with AI model training, inferencing requires a great deal of computational power, specialized hardware and robust data pipelines. All of this is especially the case when inferencing at scale, in real time, while ensuring low latency and high availability.
- Inferencing requires sophisticated infrastructure, application and API management, and specialized skills. Deploying inference servers and applications involves hardware orchestration, networking and storage management, and the need to optimize for both performance and cost. And because inferencing ultimately serves AI agents and applications, traditional API management and DevOps practices still apply. Any mistakes you make along the way will reduce the usefulness of inferencing and could even result in inferences that are inaccurate or outdated.
- Inferencing increases security and privacy risk. Inferencing often involves processing sensitive and proprietary data, increasingly at the network edge or on devices. Ensuring compliance with data privacy laws and protecting data at every step of the process requires a defense-in-depth strategy built into the infrastructure and across platforms.
The right inference server solution can help you address these challenges. In fact, getting it right from the start is imperative: As gen AI models explode in complexity and production deployments scale, inference can become a significant bottleneck, devouring hardware resources, crippling responsiveness to users and customers and inflating operational costs.
Compounding this is the linkage among configuration (platform, deployment details, tooling), models and inference. I’ve seen incorrect or mismatched configurations that have led to inefficiency, poor performance, deployment failures and scalability issues. Any one of these can result in significant negative consequences for the business, from a slow or questionable response that creates mistrust among users, to exposure of sensitive private data, to costs that quickly spiral out of control.
Inference server pathways
Once you have decided that, yes, inference servers are a must, you can go down the usual pathways to procure one: build, buy or subscribe. However, there are some issues unique to the technology that you will have to consider for each of these pathways.
Build your own inference server
Building a bespoke inference server involves assembling hardware tailored for AI, deploying and operating advanced serving software, optimizing for scale and efficiency and securing the full environment — not to mention maintaining and updating all that over time. This approach provides the greatest flexibility and control, with the ability to leverage what makes the most sense for the business.
For example, vLLM has become the de facto standard for DIY inference servers, with many organizations choosing the open source library for its efficient use of the hardware needed to support AI workloads. This significantly improves speed, scalability and resource efficiency when deploying AI models in production environments. However, building an inference server also requires extensive knowledge and skills, as well as strict operational discipline. You will have to determine whether you have these resources in-house and, if you do, whether you can sustain all of this over time. Trust me, it is not easy!
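To give a flavor of the DIY path, here’s a minimal sketch using vLLM’s offline Python API; the model name is illustrative, and in production you would more likely run vLLM as a long-lived server process (it also ships an OpenAI-compatible server mode).

```python
# A minimal sketch of DIY inference with vLLM's offline Python API.
# The model name is illustrative; vLLM loads Hugging Face-format weights.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # loads weights onto the GPU
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(
    ["Summarize the operational risks of serving models in production."],
    params,
)
for output in outputs:
    print(output.outputs[0].text)
```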
Subscribe to IaaS (Inference-as-a-Service)
Inference servers can also be subscribed to as a service. IaaS (no, not that IaaS, but inference-as-a-service) offerings allow organizations to deploy, scale and operate AI model inference without the need to buy and maintain dedicated hardware or set up complex local infrastructure. However, you will need to consider your organization’s tolerance for the risk associated with sending sensitive data to third-party cloud providers, as well as the latency and performance issues that can result when inferencing is performed remotely.
Other considerations include the potential for cost fluctuations and vendor lock-in. Using third-party providers may not even be an option depending on the regulatory mandates and privacy laws your organization must comply with.
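In practice, the consumption model typically looks like calling a hosted endpoint with an API key: the provider runs the hardware and the serving software, and you pay per token or per call. Here’s a minimal sketch using the OpenAI Python client, which many hosted inference providers support as a compatibility layer; the base URL, model name and key are hypothetical placeholders.

```python
# A minimal sketch of the inference-as-a-service path: call a hosted,
# OpenAI-compatible endpoint. Base URL, model and key are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="https://inference.example-provider.com/v1",  # hypothetical provider
    api_key="YOUR_API_KEY",
)
reply = client.chat.completions.create(
    model="hosted-llm-large",  # hypothetical hosted model
    messages=[
        {"role": "user", "content": "Flag anything unusual in this transaction summary."}
    ],
)
print(reply.choices[0].message.content)
```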
Buy a pre-built inference server
A number of providers offer inference server software, and this pathway to inference bridges some of the gaps in the DIY and IaaS models. For example, you can purchase inference servers — on their own or as part of a platform — that provide DIY levels of control with managed-level ease, while also offering portability across platforms and freedom from lock-in to a single cloud provider or runtime environment.
Indeed, when acquiring a pre-built inference server, it’s important to keep cost, security, manageability and a commitment to openness and flexibility top of mind. An inference server needs to support your specific AI strategy, not the other way around. This means that a given inference server must support any AI model, on any AI hardware accelerator (GPUs or otherwise), running on any cloud, preferably as close to your data as technically feasible.
Beyond supporting choice and flexibility, additional requirements include strong security measures (such as confidential computing practices, cryptographic model validation and strong endpoint protection), robust data governance and compliance controls, safety and guardrails (to protect against input threats via prompt injection and leaking of sensitive, private and regulated information) and flexible support for a variety of hardware and deployment environments (including edge, cloud and on-premises). These features help keep sensitive data and models protected, help meet regulatory requirements and make it possible to keep choice and control front-and-center with regards to the technology stack and vendors.
You should also prioritize inference servers that can deliver high performance and scalability while keeping costs and resources in check, with tools for efficient resource use, model version management and traffic optimization. For example, pre-built inference servers that use vLLM take advantage of the library’s advanced memory management techniques, such as PagedAttention, continuous batching and optimized GPU kernel execution. As the number of users, prompts and models to be served grows, you have to think about scaling to meet demand within the required response times while optimizing the available compute resources. This is where distributed, inference-aware serving with open source technologies such as llm-d comes into the picture. Continuous monitoring, centralized management and automated security updates are essential for operational resilience, as are model explainability, bias detection and transparent audit trails for trusted, transparent and accountable AI.
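To give a sense of what those efficiency levers look like in practice, here’s a minimal sketch of a few of vLLM’s serving parameters; the model name and values are illustrative, not recommendations for any particular workload.

```python
# A minimal sketch of tuning vLLM's resource-efficiency knobs. Values are
# illustrative; the right settings depend on hardware and traffic patterns.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative model
    gpu_memory_utilization=0.90,  # fraction of GPU memory for weights + PagedAttention KV cache
    max_num_seqs=256,             # cap on sequences handled per continuous-batching step
    tensor_parallel_size=2,       # shard the model across two GPUs
)
```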
Inference decision-making
As AI becomes increasingly embedded in business operations, adopting the right inferencing strategy — whether by building, subscribing or buying — is essential for enabling intelligent, real-time decision-making at scale.
I’ve observed that any decision around inferencing is among the most critical you can make as your organization starts or continues along the AI path. It’s no wonder, then, that these decisions are also among the most overwhelming, especially given the rate at which AI technology, its applications and its place in society are evolving. The most practical way forward is to be action-oriented and to start somewhere; then experiment, learn and iterate rapidly. While it may be difficult at first to identify what’s “right,” understand that it’s more important than ever to anchor any decision in principles that will move the business forward, no matter how things change.
This article is published as part of the Foundry Expert Contributor Network.