There’s no end in sight to the gen AI boom. Every week, we see new advances in the technology, new use cases, and new fears of AI overwhelming humanity or, at least, some industries. Experts predict radical realignments and the emergence of new industrial superpowers, similar to what we saw during the dot-com transition.
Some companies, with their very survival at stake, are willing to spend any amount of money to stay relevant. Others just want to stay ahead of their slowest competitors, or simply capture the productivity gains and new business opportunities gen AI is expected to deliver. But no matter how important AI may or may not be to a company, there’s no point in wasting money. Gen AI offers many opportunities to spend too much and get too little in return. Companies that use their gen AI budgets more strategically can reap more benefit from the same investment and pull ahead of their competitors. The key to getting further, faster, while spending less is being more thoughtful and careful about first steps.
According to the latest McKinsey data, 65% of organizations report they’re now regularly using gen AI, nearly double the percentage from 10 months earlier, and three-quarters predict that gen AI will lead to significant or disruptive change in their industries in the years ahead.
And gen AI spend will double in 2024 compared to 2023, IDC projects, reaching $151 billion in 2027. But according to a mid-June Lucidworks survey of 2,500 business leaders, the rate of growth in gen AI spend is leveling off, driven in large part by cost concerns. Last year, only 3% of respondents said that gen AI implementation cost was a concern; this year, 46% did, a more than fourteenfold increase. A similar Gartner survey from May showed that estimating and demonstrating business value was the top barrier to gen AI adoption. The primary reasons these costs can escalate quickly when a company starts to deploy AI at scale include token costs, unexpected additional costs, and AI sprawl.
Token costs
One company that’s seen all of these is cloud consultancy DoIT, both in its own internal projects, and on the projects it works on for its customers. Tokens, or pieces of words that form the basis of most gen AI pricing structures, are a strange metric.
“Tokens are not a unit of value,” says Eric Moakley, the company’s head of product management. “So the way you value something and the way you pay for it are completely different.”
With token-based pricing, customers pay AI vendors based on the length of the questions they ask and the length of answers they get from the AI in return. In order to get more accurate answers, companies make the questions, or prompts, longer, embedding specific instructions on how the answers should be formed, providing general information about the company, and information from internal databases. Some answers require follow-up questions, or fact-checking. And it all adds up. Spending tokens is a bit like gambling in a casino, says Moakley.
“You’ve got chips all of a sudden, and you need to constantly think about connecting it back to the return you’re getting,” he says.
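The arithmetic behind token-based pricing is simple to sketch. The per-token prices and call volumes below are illustrative placeholders, not any vendor's actual rates, but they show how a long prompt plus a verbose answer compounds across thousands of calls a day:

```python
# Rough token-cost estimator for token-based pricing. The prices below are
# made-up placeholders; substitute your provider's actual price sheet.

def estimate_cost(input_tokens: int, output_tokens: int,
                  price_in_per_1k: float, price_out_per_1k: float) -> float:
    """Return the dollar cost of one request under token-based pricing."""
    return (input_tokens / 1000) * price_in_per_1k \
         + (output_tokens / 1000) * price_out_per_1k

# A long prompt (instructions + company context + retrieved documents)
# plus a generous answer, priced at hypothetical rates:
per_call = estimate_cost(input_tokens=3000, output_tokens=800,
                         price_in_per_1k=0.01, price_out_per_1k=0.03)

# At 10,000 calls a day, small per-call costs become real money.
per_day = per_call * 10_000
```

This is why tracking token spend per feature, as DoIT does, matters: the unit economics are invisible unless someone does this multiplication.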
So to control operational costs, DoIT is strategic in its gen AI investments and expenses, he says. “We do track it,” he says. For example, one of the best use cases he’s found was also one of the cheapest. When company engineers spin up an AWS server, and a bill arrives, it’s written in a language of SKUs, hourly rates, discounts and credits. If there’s a cost anomaly, it can be hard to figure out what a specific line item means. So DoIT added functionality, asking a gen AI model to explain these terms.
“It’s a very narrow use case,” says Moakley. “It’s just a button next to the information. You don’t get prompted; you don’t get to adjust it. And we found it to be very valuable.”
Sure, this is functionality that AWS itself might eventually provide, but DoIT was experimenting with gen AI anyway, and this was a very simple project.
“It’s an easy thing for an LLM to do,” he says. “We get the right information at the right time, and we were able to build it fast thanks to AI. The generative AI was already trained on the data we needed for it because we were also working on other things.”
And the functionality only took a couple of hours of development time. “We just asked, how hard would it be to add the views they were looking at anyway,” he adds. But after that came the governance piece. Who was making the request? What service are they calling? How many tokens will it take, and how will it translate into dollars? And is it worth building, or easier to wait for the vendor to add the functionality themselves?
“I think the time to market advantage is often worth it, speaking from a product perspective,” says Moakley.
But the company has also terminated a number of gen AI investments because the performance indicators weren’t there, he says. “The customers didn’t respond to it,” he says. “It wasn’t giving us the lift we wanted.”
DoIT also optimizes its LLM interactions to control the number of tokens.
“We’re careful to prune the data and inputs,” he says. “And responses can’t be beyond a certain length — we’re not writing a book. And when possible, we try to be less open-ended and more targeted. The more you can reduce interactivity the easier it is and the costs become more fixed.”
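The pruning Moakley describes can be sketched as a simple budget check: rank retrieved context by relevance, keep passages until a token budget is exhausted, and cap the response length on the request. This is a minimal sketch, not DoIT's implementation; token counts are approximated by whitespace-separated words, where a real system would use the model's own tokenizer:

```python
# Sketch of input pruning before a prompt reaches the model.
# Word counts stand in for real token counts here.

def prune_context(passages: list[str], budget_tokens: int) -> list[str]:
    """Keep passages (assumed pre-sorted by relevance) until the budget is hit."""
    kept, used = [], 0
    for p in passages:
        n = len(p.split())          # crude token estimate
        if used + n > budget_tokens:
            break
        kept.append(p)
        used += n
    return kept

passages = [
    "short relevant passage",
    "a much longer but less relevant passage " * 50,  # 350 words
]
trimmed = prune_context(passages, budget_tokens=100)
# Only the first passage fits the budget; the oversized one is dropped.
# The response side is capped the same way, via the request's max-output
# setting, so costs stay closer to fixed.
```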
Test the waters
Another way to reduce token costs is to be strategic about which model is being used. A cheaper model might still give good results, and it might be faster. For example, consulting firm Publicis Sapient recently worked on a customer-facing project for Marriott Homes & Villas, a short-term rental company.
“If you want to vacation on a beach home and bring your dogs, it’ll give you a list of homes based on queries on a back-end that was fine-tuned on property data,” says Sheldon Monteiro, the company’s chief product officer. Then the firm looked at the improvement to conversions — the increase in revenues from adding the gen AI search functionality. After all, the most expensive model doesn’t necessarily provide the best business value.
“You might get a better response from GPT-4, but the actual conversion rates weren’t too different from GPT-3.5,” he says. “So we eventually settled on GPT-3.5.”
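The comparison Monteiro describes boils down to cost per conversion rather than raw answer quality. The per-query prices and conversion rates below are invented for the sketch, not Marriott's numbers, but they show why a cheaper model with nearly identical conversion can win on business value:

```python
# Compare models on cost per conversion, not on benchmark quality.
# All numbers below are hypothetical.

def cost_per_conversion(cost_per_query: float, conversion_rate: float) -> float:
    return cost_per_query / conversion_rate

expensive = cost_per_conversion(cost_per_query=0.060, conversion_rate=0.031)
cheaper   = cost_per_conversion(cost_per_query=0.004, conversion_rate=0.030)

# Conversion rates within a rounding error of each other, but an order of
# magnitude apart on query cost: the cheaper model delivers far more
# conversions per dollar.
```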
And, like DoIT, Marriott Homes &amp; Villas found that a controlled LLM query, embedded into the application, worked better than an open-ended chatbot.
“We realized that people don’t want to have a conversation,” Monteiro says. “They immediately want to get into it, to show what their vacation might look like.”
Once the AI model got the results, visitors would be immediately taken to a standard search experience, familiar to everyone who’s used online services.
“We don’t give you a text response ever; it’s just a list of homes with a new parameterized search,” he adds. Not only does this eliminate the opportunity for back-and-forth chatbot conversations to rack up token costs, but it also removes any possibility for users to abuse the system.
Another way to get a good handle on total costs is to not jump straight from proof of concept to production, but to do a small-scale roll-out first.
“If you put it out to your entire customer base, you might be surprised by how widespread the adoption is,” Monteiro says. “But if you expose it first to a small number of users, say, 1%, and base your modeling on how users will actually use the experience, you can predict what will happen when you scale up to the full 100%.”
The key is to take a disciplined approach to modeling the costs. “Not just as a paper exercise, but with a small percentage of users in production,” he says. And once a model is chosen, it’s not the end of the road.
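The small-cohort modeling Monteiro recommends amounts to a linear extrapolation with a safety margin. A minimal sketch, with hypothetical pilot numbers:

```python
# Project full-scale monthly cost from a limited production rollout.
# Pilot figures below are hypothetical.

def project_monthly_cost(observed_cost: float, observed_fraction: float,
                         adoption_headroom: float = 1.0) -> float:
    """Scale observed spend linearly to 100% of users, padded by a
    headroom factor in case adoption runs hotter than the pilot cohort."""
    return observed_cost / observed_fraction * adoption_headroom

pilot_cost = 420.0   # dollars spent by the 1% cohort in a month
full_scale = project_monthly_cost(pilot_cost, observed_fraction=0.01,
                                  adoption_headroom=1.25)
# Budget for the 100% rollout, with 25% headroom for heavier-than-expected use.
```

The value of doing this in production rather than on paper is that `pilot_cost` reflects real prompt lengths and real follow-up behavior, not assumptions.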
“With the rate of model evolution, the good news is that as the technology continues to improve, inference costs are actually dropping,” he says. “OpenAI and other providers are reducing the costs of their older models and also making dramatically improved capabilities available, which cost more money.”
These new capabilities are yet another opportunity for enterprises to decide whether they’ll create actual business value.
But there are also plenty of use cases where a smaller LLM, traditional machine learning, or even a keyword search might be good enough. “Don’t use a large language model to do something that a small language model or a rules-based system can do,” Monteiro says. And there are more benefits to doing so than just cost reduction.
“If we use a small language model trained on a particular domain, we can get responses very rapidly,” he says. “But a keyword search is going to be much faster than putting it into a language model.”
Latency costs
The costs of using gen AI go beyond figuring out what any particular prompt might cost. There’s also the cost of latency. That might not be apparent in a proof of concept, but once a project is in production with real documents and real users and starts to scale up, performance will begin to suffer.
“When we ingest thousands of documents, on any of the LLMs, the response time is anywhere from 30 to 60 seconds because the context window gets filled up,” says Swaminathan Chandrasekaran, head of solution architecture for digital solutions at KPMG. “People say they can’t wait 60 seconds to ask their next question. So we increase capacity, add dedicated instances, and costs start to spiral.”
There’s also a throughput-per-minute limit set by the hyperscalers, which is an issue for many large enterprises, including KPMG itself. “Client zero is KPMG,” he says. “We’re experimenting with setting up our own Nvidia cluster to see if we can solve for the latency problem.”
In addition to swapping out expensive commercial models for open source ones, or small language models (SLMs), KPMG is also experimenting with alternatives to traditional AI processing hardware. For example, it’s possible to run some SLMs on general purpose hardware, or even embed them into web applications for in-memory classification and generation.
For example, an e-commerce system that needs gen AI to summarize product reviews doesn’t need to use a big language model in the cloud. “It can be embedded into my e-commerce application,” Chandrasekaran says.
Similarly, a product classification engine can classify new SKUs as they come in, or a health care application can classify claims. “These are very specialized language models,” he says. Quantization is another technique for getting better performance out of a language model, he says, though it results in lower precision.
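The precision-for-performance trade Chandrasekaran mentions is easiest to see in the memory arithmetic. A back-of-envelope sketch for weight storage at different precisions (real deployments also need memory for activations and KV cache, so these are lower bounds):

```python
# Back-of-envelope memory footprint of model weights at various precisions.
# Quantizing from 16-bit to 4-bit weights cuts weight memory 4x, at the
# price of lower numerical precision.

def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

fp16 = weight_memory_gb(70, 16)   # a 70B model in 16-bit: multi-GPU territory
int4 = weight_memory_gb(70, 4)    # the same model quantized to 4-bit
```

This is why quantization and small language models come up together: both shrink the hardware bill, and the question is whether the task tolerates the lost precision.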
Finally, caching is another option to solve the latency issue when people ask the same questions all the time.
“The challenge is if the question is worded differently,” he says. “But there are similarity techniques.”
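The similarity techniques he alludes to can be sketched as a cache keyed on question similarity rather than exact text. This toy version uses bag-of-words cosine similarity, standing in for the embedding-based similarity a production system would use; the threshold and sample questions are assumptions for illustration:

```python
# Toy similarity-based answer cache: differently worded questions can still
# hit the cache if they are close enough, avoiding a paid LLM call.
from collections import Counter
import math

def cosine(a: str, b: str) -> float:
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(v * v for v in va.values())) \
         * math.sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0

class SemanticCache:
    def __init__(self, threshold: float = 0.8):
        self.entries: list[tuple[str, str]] = []   # (question, answer)
        self.threshold = threshold

    def get(self, question: str):
        for cached_q, answer in self.entries:
            if cosine(question, cached_q) >= self.threshold:
                return answer        # cache hit: no tokens spent
        return None

    def put(self, question: str, answer: str):
        self.entries.append((question, answer))

cache = SemanticCache(threshold=0.8)
cache.put("what is our refund policy", "Refunds within 30 days.")
hit = cache.get("what is our refund policy ?")   # reworded, still similar
miss = cache.get("completely unrelated question")
```

The threshold is the tuning knob: too low and users get stale or wrong answers, too high and the cache rarely hits.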
Gen AI has all the old costs, as well. “There’s the cost of storage, development, and running the application,” says Chandrasekaran. For example, he adds, it recently cost his team $7,000 to set up a Llama 3 deployment on Azure because it wasn’t yet available on a pay-as-you-go basis.
“You had to set it up,” he says. “And the compute needed to run a 70-billion-parameter model is significant. We set it up ourselves, provisioned a server, deployed the model, and then there was usage on top of that.”
Azure now offers a pay-as-you-go option where customers just pay the token costs, but for enterprises looking to deploy on-prem models, the set-up costs still exist.
“In an ideal world, that would be the best scenario because you’re no longer constrained by token costs,” he says. “The only cost you pay is for infrastructure. But you still need to have the compute capacity and other things, like networking.”
Oversight costs
When gen AI is moved into production, another unexpected cost might be the required oversight. Many systems require humans in the loop or expensive technical guardrails to check for accuracy, reduce risk, or for compliance reasons.
“I don’t think we expected the regulations to come so soon,” says Sreekanth Menon, global head of AI at Genpact. “Once generative AI came in, it became a top leadership topic, and all the governments woke up and said we need regulations.”
The EU AI Act is already in place, and similar work is in progress in the US. “Now companies have to accommodate that when developing AI, and that’s a cost,” he says. But the regulations aren’t a bad thing, he adds. “We need regulations for the AI decisions to be good and fair,” he says.
Adding in regulatory compliance after systems are built is expensive, too, but companies can plan ahead by putting good AI governance systems in place. Ensuring the security of gen AI models and associated systems is also a cost that companies might not be prepared for. Running a small-scale production test will not only help enterprises identify compliance and security issues, he says, but will help them better calculate other ancillary costs like those associated with additional infrastructure, search, databases, API, and more. “Think big, test small, and scale quick,” he says.
AI sprawl
In the past, with traditional AI, it might have taken a year or two of experimenting before an AI model was ready for use, but gen AI projects move quickly.
“The foundation models available today are allowing enterprises to quickly think of use cases,” says Menon. “Now we’re in a stage where we can think of an experiment and then go into production quickly.” He suggests that enterprises restrain themselves from doing all the AI projects all at once, have a cost mechanism in place and clear objectives for each project, then start small, scale wisely, and continuously invest in upskilling.
“Upskilling is a cost, but it will help you to save on other costs,” he says.
Matthew Mettenheimer, associate director at S-RM Intelligence and Risk Consulting, says he often sees gen AI sprawl within companies.
“A CIO or a board of directors wants to enable AI across their business, and before they know it, there’s quite a bit of spending and use cases,” he says.
For example, S-RM recently worked with a large consumer manufacturer that decided to push AI enablement through their business without first building a governance structure. “And every single department went off to the races and started trying to implement generative AI,” he says. “You had overlapping contracts with different tools for different parts of the organization, which really started to bloat their spend. Their marketing department was using one tool, their IT team was using another. Even within the same department, different teams used different tools.”
As a result, the company was paying for similar services over and over again, with each group having its own contracts, and no efficiencies from doing things at scale. And people were getting subscriptions to gen AI products they didn’t know how to use.
“There were a lot of great intentions and half-baked ideas,” he says, and the result was a massive uptick in IT spending. Enterprises need to start by understanding where gen AI can really make an impact, then build their projects step by step, in a sustainable way, rather than going out and buying as much as possible. Some areas of particular concern, where companies might want to hold off on spending, are use cases that could create liability for the organization.
“If you’re an insurance provider, using AI to determine if a claim will be paid or not can land you in a bit of liability if the AI mechanism isn’t used or calibrated properly,” Mettenheimer says. Instead, prioritize use cases where workers can be freed up to handle more complex tasks.
“If someone is spending five hours a week updating the same spreadsheet and you can reduce that time to zero hours per week, that really frees up that individual to be more productive,” he adds. But if it takes as much time to check the AI’s work product as it saves, it’s not really making the job more efficient.
“Generative AI is a really powerful and incredible tool, but it’s not magic,” he says. “There’s a misconception that AI will be able to do everything without the need for any manual processes or validation, but we’re not at that point yet.”
He also recommends against doing AI projects where there are already perfectly good solutions in place.
“I know of a few cases where people want to use AI so they can feel like they’re getting a competitive edge and can say that they’re using AI for their product,” he says. “So they lay AI on top of it, but they’re not getting any benefits other than just saying they’re using AI.”
Senior executives are eager to get going on gen AI, says Megan Amdahl, SVP of partner alliances and operations at Insight.
“But without a firm destination in mind, they can spend a lot of time on cycles that don’t achieve the outcomes they’re hoping for,” she says. For example, clients often go after small use cases that improve efficiency for a small number of people. It can sound like a great project, but if there’s no way for it to be expanded, you can easily wind up with a sea of point solutions, none of which produces real business impact.
“Here at Insight, we were selecting which team to go after in order to improve help desk feedback,” she says. One strong use case had a team size of 50 who were checking the status of customer orders. But not only was the team small, the people were located in low-cost locations. Improving their efficiency with gen AI would have some impact, but not a significant one. Another team was creating bills of materials for clients, and it was much larger. “We went after the team size of 850 instead so it would have a broader impact,” she says.
In addition to selecting projects with the widest possible impact, she also recommends looking for those that have a narrower scope, as far as data requirements are concerned. Take for example a gen AI help desk assistant.
“Don’t go after every type of question that the company can get,” she says. “Narrow it down, and monitor the responses you get back. Then the amount of data you need to pull in is reduced as well.”
Organizing data is a significant challenge for companies deploying AI, and an expensive one, as well. The data should be clean, and in a structured format to reduce inaccuracy. She recommends that companies looking to decide which gen AI projects to do first should look at ones that focus on revenue generation, cost reduction, and improving affinity to their brand.