Salesforce AI Research today unveiled new benchmarks, guardrails, and models aimed at advancing agentic AI in the enterprise.
The goal, said Silvio Savarese, EVP and chief scientist of Salesforce Research, is achieving enterprise general intelligence (EGI), which he defined as business-optimized AI capable of delivering reliable performance across complex business scenarios while maintaining seamless integration with existing systems.
“An agent is not just an LLM,” Savarese said in a roundtable discussion on Tuesday. “An agent is actually a complex system with four components: a memory, a brain, an actuator, and an interface.”
As Savarese explained, memory enables agents to be persistent, facilitating their ability to retrieve useful information, such as best practices, policies, specific customer information, and previous conversations. The “brain” represents the agent’s ability to reason, plan actions, and orchestrate flows. The actuator, or function calls, allows the agent to execute actions planned by the brain. And the interface is how agents connect with humans through language, audio, video, and other modalities.
“The brain and actuator go hand-in-hand,” Savarese said. “We are planning to power those using large action models (xLAMs). Large action models are specialized LLMs that have been explicitly trained to act and adjust their behavior to take into account that these actions are taken in environments.”
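Those four components map onto a simple control loop. Here is a minimal sketch of such an architecture; the `call_action_model` stub and the tool registry are hypothetical stand-ins for illustration, not Salesforce’s actual xLAM interfaces:

```python
# A minimal agent loop built from Savarese's four components: memory,
# brain, actuator, and interface. The action-model call is a stub.
from dataclasses import dataclass, field

def call_action_model(goal: str, context: list) -> list[tuple[str, str]]:
    """Stub: a real action model (xLAM-style) would plan grounded steps."""
    return [("inventory_check", goal)]

@dataclass
class Agent:
    tools: dict                                  # actuator: name -> callable
    memory: list = field(default_factory=list)   # persistent context

    def brain(self, goal: str) -> list[tuple[str, str]]:
        """Reason and plan: turn a goal into (tool, argument) steps."""
        return call_action_model(goal, context=self.memory)

    def actuate(self, plan: list[tuple[str, str]]) -> list:
        """Execute the planned function calls against the environment."""
        return [self.tools[name](arg) for name, arg in plan]

    def interface(self, goal: str) -> str:
        """Human-facing entry point: plan, act, remember, respond."""
        results = self.actuate(self.brain(goal))
        self.memory.append((goal, results))      # retrievable later
        return f"Done: {results}"

agent = Agent(tools={"inventory_check": lambda sku: f"{sku}: 12 in stock"})
print(agent.interface("SKU-42"))                 # Done: ['SKU-42: 12 in stock']
```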
Savarese stressed that Salesforce views autonomous agents as “force multipliers” for humans rather than replacements.
“This is about having a human deploy or dispatch a group of agents based on specific goals or tasks,” he said. “For instance, you have a service representative that has available a fleet of agents that can do inventory check, can do account summary, billing summary, customer interaction summaries.”
He noted that there is another, more complex scenario in which a human employee has a personal AI assistant as their chief of staff, a sort of “orchestrator agent” to manage the fleet of agents.
“These AI systems will know my preferences as a service representative,” he said. “They’ll know my style, what kind of customers and needs I have. They’ll help me orchestrate this fleet of agents.”
Benchmarking jagged intelligence
One obstacle to fully leveraging autonomous AI agents is what Salesforce calls “jaggedness” or “jagged intelligence”: AI systems that excel at complex tasks can unexpectedly fail at simpler ones that humans solve reliably.
Salesforce AI Research has created an initial dataset of 225 basic reasoning questions that it calls SIMPLE (Simple, Intuitive, Minimal, Problem-solving Logical Evaluation) to evaluate and benchmark the jaggedness of models. Here’s a sample question from SIMPLE:
A man has to get a fox, a chicken, and a sack of corn across a river. He has a rowboat, and it can only carry him and three other things. If the fox and the chicken are left together without the man, the fox will eat the chicken. If the chicken and the corn are left together without the man, the chicken will eat the corn. How does the man do it in the minimum number of steps?
This looks like a classic logic puzzle, except for one altered constraint. In the classic puzzle, the rowboat can only carry the man and one additional thing, requiring a complex sequence of crossings to get the fox, chicken, and sack of corn all safely across the river. The SIMPLE version stipulates that the rowboat can carry the man and three other things, meaning the man can bring all three across the river in a single crossing.
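The altered puzzle is easy to check mechanically. The breadth-first search below, written for this article rather than taken from SIMPLE, confirms that a boat with capacity three needs exactly one crossing, while the classic capacity-one version needs seven:

```python
# Breadth-first search over river-crossing states. A state records which
# bank (0 or 1) the man and each item are on; capacity is how many items
# can share the boat with the man.
from collections import deque
from itertools import combinations

ITEMS = ("fox", "chicken", "corn")

def safe(state):
    man = state["man"]
    if state["fox"] == state["chicken"] != man:
        return False  # fox eats chicken
    if state["chicken"] == state["corn"] != man:
        return False  # chicken eats corn
    return True

def solve(capacity):
    start = {"man": 0, "fox": 0, "chicken": 0, "corn": 0}
    goal = {"man": 1, "fox": 1, "chicken": 1, "corn": 1}
    key = lambda s: tuple(s[k] for k in ("man",) + ITEMS)
    queue, seen = deque([(start, [])]), {key(start)}
    while queue:
        state, path = queue.popleft()
        if state == goal:
            return path  # list of cargo tuples, one per crossing
        side = state["man"]
        here = [i for i in ITEMS if state[i] == side]
        for n in range(min(capacity, len(here)) + 1):
            for cargo in combinations(here, n):
                nxt = dict(state, man=1 - side, **{c: 1 - side for c in cargo})
                if safe(nxt) and key(nxt) not in seen:
                    seen.add(key(nxt))
                    queue.append((nxt, path + [cargo]))

print(solve(capacity=3))       # [('fox', 'chicken', 'corn')]: one crossing
print(len(solve(capacity=1)))  # 7 crossings, the classic solution
```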
Yet state-of-the-art reasoning models such as OpenAI’s o1 and o3-mini-high both regurgitate the classic seven-step solution to the puzzle without accounting for the altered constraint.
Jaggedness is why “autonomous” agents often require human oversight. Savarese noted that solving the jaggedness issue is especially important for enterprise AI applications, where many problems require human context and reliability more than they require sophisticated math-solving abilities. If a model stumbles in executing tasks in the enterprise, it can mean disrupted operations, eroded customer trust, and potential financial or reputational damage.
The capability-consistency matrix
Much of the work on enterprise AI, and generative AI in particular, has focused on enhancing AI’s capabilities: its ability to navigate complex business environments, interface with multiple technology systems, reason through business rules, and deliver value aligned with business goals. But Savarese argued that consistency is just as important: the delivery of reliable, predictable results, with seamless integration into existing systems and rigorous adherence to governance frameworks. In other words, consistent AI minimizes jaggedness.
Salesforce uses what Savarese calls the “Capability-Consistency Matrix” to describe AI agents. The matrix has capability as its x-axis and consistency as its y-axis, creating four quadrants:
- The generalist (low capability, low consistency): These systems neither perform complex tasks nor deliver reliable results. They are typically early-stage AI implementations with limited business value that represent steppingstones rather than solutions.
- The prodigy (high capability, low consistency): These systems perform impressive, complex tasks but deliver inconsistent results. By occasionally missing the mark, they erode trust because users can’t depend on them to deliver accurate results for mission-critical functions.
- The workhorse (low capability, high consistency): These systems perform a narrow range of simple tasks well but can’t handle complex situations.
- The champion (high capability, high consistency): This is the goal for EGI. These systems can handle complex business scenarios flawlessly while delivering consistent, reliable results.
While prodigies might work for consumer applications, EGI requires champions. In business contexts, AI agents must be both capable and consistent to deliver value.
The enterprise general intelligence journey
According to Savarese, the road to EGI involves three distinct phases:
- Pre-training: EGI systems must first go through a pre-training phase to create a foundation of general capabilities such as language understanding, pattern recognition, and basic reasoning. This is the frontier model stage.
- Fine-tuning: An EGI system must then undergo fine-tuning for specific industry contexts and business functions. Fine-tuning might help an EGI system specialize in financial regulations, supply chain terminology, or healthcare protocols, for example.
- Ultra fine-tuning: This phase further specializes an EGI system within a specific organizational context (a minimal sketch of these tuning stages follows this list).
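As a rough illustration of phases two and three, here is a minimal sketch of sequential fine-tuning passes using Hugging Face transformers as stand-in tooling; the model name, toy corpora, and output paths are illustrative, not Salesforce’s pipeline:

```python
# Sequential fine-tuning: one pass on industry text, then another on
# organization-specific text. Model and corpora are toy placeholders.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base = "sshleifer/tiny-gpt2"                      # phase 1: pre-trained base
tok = AutoTokenizer.from_pretrained(base)
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(base)

def tune(texts: list[str], out_dir: str) -> None:
    """One fine-tuning pass over a small causal-LM corpus."""
    ds = Dataset.from_dict({"text": texts}).map(
        lambda row: tok(row["text"], truncation=True, max_length=64),
        remove_columns=["text"],
    )
    Trainer(
        model=model,
        args=TrainingArguments(output_dir=out_dir, num_train_epochs=1,
                               per_device_train_batch_size=2, report_to=[]),
        train_dataset=ds,
        data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
    ).train()

# Phase 2: fine-tune on (toy) industry text, e.g. financial regulations.
tune(["KYC checks are required before opening an account."] * 8, "ft-industry")
# Phase 3: "ultra fine-tune" on the organization's own policies.
tune(["Refunds over $500 are routed to a human reviewer."] * 8, "ft-org")
```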
“This evolution isn’t just about creating a single ‘general’ system that does everything,” Savarese wrote in a blog post today. “Just as sports have specialized variants (singles tennis, doubles tennis, squash, and the ever-popular pickleball!), enterprises will likely deploy multiple specialized agents rather than a single, general-purpose system, with each agent reaching ‘championship level’ performance in its specific domain. Different types of businesses and use cases may require different specialized agents — much like how various sports require different skill sets from their athletes.”
An EGI readiness framework
To successfully harness EGI, organizations must consider the journey as a comprehensive business transformation rather than a technology implementation, Savarese said. To help organizations achieve EGI, Salesforce AI Research has created the EGI Readiness Framework:
1. Integrated infrastructure. EGI depends on more than just models; its foundation is a set of interconnected components:
- Components that store, retrieve, and process information intelligently, like retrieval-augmented generation (RAG).
- Interface systems that connect agents to users and other enterprise systems.
- Action systems and actuators that translate decisions into operations through APIs, workflow automation, physical systems, etc.
- Data architecture that provides well-structured, contextualized data repositories.
2. Risk governance. EGI requires guardrails that define appropriate autonomy levels across business functions. Savarese noted that sophisticated organizations are moving beyond binary “human-in-the-loop” models to “human-at-the-helm” frameworks in which oversight intensity varies based on context, confidence, and consequence. A rough sketch of what such adaptive oversight could look like appears after this list.
3. Skills development. EGI necessitates training employees to collaborate effectively with AI systems and develop understanding of AI capabilities and appropriate use cases. Successful organizations will build cross-functional teams that combine domain expertise with AI literacy, and they will establish feedback mechanisms for continuous system improvement.
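To make the “human-at-the-helm” idea concrete, here is a hedged sketch of a policy that routes an agent’s proposed action to an oversight tier based on confidence, consequence, and regulatory context; the tiers and thresholds are invented for illustration, not Salesforce guidance:

```python
# Illustrative "human-at-the-helm" policy: oversight intensity scales with
# context, confidence, and consequence. Tiers and thresholds are invented.
from enum import Enum

class Oversight(Enum):
    AUTONOMOUS = "execute without review"
    SPOT_CHECK = "execute; sample for after-the-fact audit"
    APPROVE_FIRST = "queue for human approval before executing"
    HUMAN_ONLY = "hand off to a human entirely"

def route(confidence: float, consequence: str, regulated: bool) -> Oversight:
    """Pick an oversight tier for a proposed agent action."""
    if regulated or consequence == "high":
        # e.g. refunds above policy limits, contract changes
        return Oversight.HUMAN_ONLY if confidence < 0.9 else Oversight.APPROVE_FIRST
    if consequence == "medium":
        return Oversight.APPROVE_FIRST if confidence < 0.7 else Oversight.SPOT_CHECK
    return Oversight.SPOT_CHECK if confidence < 0.5 else Oversight.AUTONOMOUS

print(route(0.95, "low", regulated=False))   # Oversight.AUTONOMOUS
print(route(0.60, "high", regulated=True))   # Oversight.HUMAN_ONLY
```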
Additional tools for achieving EGI
As part of its goal of helping customers realize EGI, Salesforce AI Research also announced:
- An upgrade to action model capabilities. The organization has upgraded the xLAM family with multi-turn conversation support and a wider range of smaller models for increased accessibility. This family of models is designed to predict actions.
- A multimodal action model family for multi-step problem solving. TACO is a new multimodal action model family that generates chains of thought-and-action (CoTA) to break tasks down into simple steps while integrating real-time action.
- Enhanced embedding model capabilities. Several days ago, Salesforce AI Research unveiled SFR-Embedding, an advanced text-embedding model that can convert text to structured data for better AI information retrieval. SFR-Embedding will soon be available in Salesforce Data Cloud.
- Specialized code embedding models for developers. SFR-Embedding-Code is a specialized code embedding model family based on SFR-Embedding. It can map code and text into a shared embedding space for high-quality code search (see the sketch after this list).
- A framework for testing and evaluating AI agents. CRMArena is a new benchmarking framework that evaluates AI agents on realistic CRM scenarios.
- Agent guardrail features. SFR-Guard is a new family of guardrails trained on publicly available data and CRM-specialized internal data to enhance the trust and reliability of AI agents.
- A benchmark for assessing models in contextual settings. ContextualJudgeBench is a new benchmark for evaluating LLM-based judge models in context. It assesses accuracy, conciseness, faithfulness, and appropriate refusal to answer by testing more than 2,000 challenging response pairs.
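To illustrate what a shared code-text embedding space enables, here is a minimal code-search sketch; it substitutes a generic open-source sentence-transformers model for SFR-Embedding-Code, purely as a stand-in:

```python
# Minimal embedding-based code search: embed snippets and a natural-language
# query into one vector space, then rank by cosine similarity. The model
# below is a generic stand-in, not SFR-Embedding-Code.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

snippets = [
    "def days_between(a, b): return abs((a - b).days)",
    "def retry(fn, n=3):\n    for _ in range(n):\n"
    "        try: return fn()\n        except Exception: pass",
    "def to_csv(rows, path):\n    import csv\n"
    "    with open(path, 'w', newline='') as f:\n"
    "        csv.writer(f).writerows(rows)",
]

query = "retry a function a few times if it raises"
scores = util.cos_sim(model.encode(query), model.encode(snippets))[0]
print(snippets[int(scores.argmax())])   # the retry helper ranks first
```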