The buzz around agentic AI and agent development is inescapable. That shouldn’t be surprising based on a Gartner study that shows “by 2028, 33% of enterprise software applications will include agentic AI, up from less than 1% in 2024, enabling 15% of day-to-day work decisions to be made autonomously.” The future of AI is agentic. Agentic AI is an AI system that uses components that can act autonomously to complete tasks and achieve goals. These AI systems are evolving beyond having conversations and are progressing to a state of getting things done.
While there are plenty of articles that discuss building agentic AI solutions, not many cover agentic AI deployment.
AI teams face many complex challenges as they work to develop and scale their agentic AI systems. The traditional approaches for implementing AI have focused on AI chatbots, single declarative agents or synchronous processes. As the industry moves to take advantage of more complex solutions, we are seeing multi-agent architectures and dynamic workflows orchestrating agents to perform tasks. These more complex solutions require more focus and thought on the activities needed for deployment and operations.
As we look at agentic AI DevOps it is important to remember that it brings together the strengths of generative AI and agentic AI. Generative AI brings insights, content and recommendations while agentic AI brings autonomous decisions and executing actions. Together they work in unison, but the combination introduces more complexity, more moving parts and therefore more difficulties. Agentic AI DevOps pushes beyond existing and traditional DevOps practices and activities.
Agentic AI development is more complex than other types of AI application development. More activities come into play, including business logic and workflows, natural language, machine learning, data management, security, monitoring and more. All these pieces need to come together seamlessly for a successful deployment.
This requires the development and operations teams (and dedicated DevOps teams, if your organization has that) to lean in together and embrace the activities that will be required. It is no longer about individual development teams working in their silos on AI proofs-of-concept, it is about the larger team efforts that require the skills of the developers, data scientists, DevOps teams, data analysts and the rest of the teams. Agentic AI calls for a more horizontal approach to deployment, to coordinate and automate the deployment processes and prevent issues before they impact business operations.
We are going to cover the activities that are needed as part of your agentic AI DevOps activities.
The art and science of deployment…or, productionization
There is no one-size-fits-all when it comes to agentic AI. As you build out your agentic AI solution, keep in mind that no single architecture will fit all agentic scenarios. However, the design factors and architecture you choose will affect the deployment and operations activities.
Deploying AI solutions has been a change from traditional deployment activities. Agentic AI is another change. Agentic AI moves you away from delivering code at specific milestones to a process of continual deployment with testing and evaluation activities that are ongoing. As we mature in our AI solutions, we need to move away from the stages we have known that consisted of coding, testing and deploying to production and instead move to a process of experiment and build, evaluate and test, adapt, transition to the deployment environment, version, repeat.
Productionization is the activity of transitioning to the production environment. This process requires careful planning. It needs to consider deployment automation, testing and evaluations, live data, synthetic test data sets integration with external systems, whole system monitoring, scalability as well as performance all under normal and abnormal conditions.
As you think about the activities that need to be planned for this productionization, you need to consider the design patterns and architecture attributes that were used in the development of the agentic solution as those will have a direct effect on the deployment, automation, monitoring and operations. This is because the development team is already working on developing the solution and understands the requirements, implementation and assumptions that need to be made.
Experiment and build phase
- Design factors:
- Did the implementation isolate agents by function to ensure maximum reuse?
- Were evaluation tests created for each scenario for each agent?
- Is there monitoring for each agent and segmentation to ease the versioning tasks?
- Have you built in and created tests for scale, performance, reliability, repeatability, cost tracking, risk detection and other business metrics?
- Behavioral composition
- Will agents be used across multiple solutions?
- Who is allowed to modify an agent?
- Who versions agents?
- Have individual agents been designed for a specific use case and is their behavior constrained to that use case?
- Communication
- In a multi-agent solution, are your agents communicating directly with each other or through an orchestration layer? Directly agent-to-agent communication affects how you will be able to version your agents and must be understood for a successful implementation.
- Are your agents communicating directly to an API endpoint or through an API gateway? Communicating directly to an API endpoint may affect versioning of your agent if that API endpoint changes.
Ensure that you understand the following from the experiment and build phase to know what needs to be done in the evaluate and test and automation phases.
Evaluate and test phase
As you evaluate and test the solution, keep in mind that outages aren’t only the results of hackers, but typically the result of preventable glitches and oversights in the development process. These issues can be found and mitigated during this phase and are essential to make sure that the consumers of your solution don’t lose patience.
Evaluations are a critical step in building generative AI applications. Evaluations are tests that can be applied to retrieval quality, answer generation, jailbreaking, content moderation, responsible AI and application guardrails. They are how you ensure that your agentic AI application is performing to your expectations. Each deployable unit should have its own evaluations.
By using tools such as the Azure AI Evaluation SDK you can harness the capabilities that are designed to assess the performance of generative AI applications.
When incorporating the Evaluation SDK, you can take advantage of evaluation metrics, evaluators, safety and quality evaluations as well as generating synthetic data.
The SDK has built-in evaluators for different scenarios including query and response, retrieval-augmented generation as well as additional scenarios. Two of the many evaluators are the GroundednessEvaluator and the RelevanceEvaluator. The Groundedness Evaluator measures how well the response is grounded in the source context and the Relevance Evaluator assesses the ability to understand the input and produce a coherent and appropriate response.
The evaluators measure performance using mathematical metrics and AI-assisted quality and safety evaluators. You can also create your own evaluators to tailor the evaluation process to your specific needs. This lets you get insights into your custom solution’s capabilities and limitations.
The SDK includes evaluators for assessing the safety and quality of AI-generated content as well. This is crucial for ensuring that your AI applications meet safety standards and provide high-quality output.
In addition, you can generate synthetic datasets for evaluation activities. This is particularly useful for creating test datasets that closely mimic real-world scenarios.
This SDK is essential for ensuring that AI applications meet the desired performance, quality and safety standards before deployment.
As your solution grows and you continue to scale, you need to make sure that the same rigorous testing and evaluation you did in the pre-deployment phase is being done continuously as your solution scales in production.
In addition to performing evaluation tests, you also need to make sure that you are performing red team testing. Red team testing is crucial in simulating real-world testing to get a realistic assessment of how well you are prepared for vulnerabilities. Red team testing unmasks these vulnerabilities and allows you to mitigate them before your solution moves to production. It also lets you test your incident response in a controlled environment.
Having insufficient red-team activities can introduce risk. You need to ensure that your system is secure, provide unbiased results as well as ensure that the system complies with company compliance and governance rules. Use tools like PyRIT to help with your red teaming activities specifically to proactively identify risks in generative AI systems. You need to ensure that the system will perform as expected in real-world activities and take care of identifying blind spots.
Lastly, as you add your custom metrics and monitoring output as part of the evaluation and testing phase you should also include the following measures in every agentic AI implementation:
- Measuring task completion quality. Task completion quality is a critical metric that directly affects system performance. This measure tracks efficiency, accuracy and resource utilization in agent task execution. Measure the percentage of activities resolved without human intervention. Output metrics on instances that required human intervention. The instances that did require human intervention will need to be reviewed to determine if any changes are required.
- Measuring the handling of edge cases and failure modes. Evaluation activities need to test agents to evaluate their ability to handle edge cases and recover from failures. Testing needs to cover these scenarios so that they can be safely modeled to contribute insights and understanding of the limitations and robustness of the solution.
- Measuring quality metrics. Quality metrics provide quantifiable data on performance standards, including performance, reliability and scalability. Implementing metrics in this phase ensures that, through continuous evaluation, agents will achieve greater efficiency and reliability. You will also have the metrics and output to prove that they can effectively adapt to dynamic environments.
Transition to the deployment environment
During the transition to deployment, we need to ensure that we have automated the entire process. We need to automate the testing and evaluation activities and the full deployment process.
The most crucial safeguard around recovery from outages and ensuring a smooth software delivery, with more reliability and efficiency, is to automate the software development pipeline.
Automation reduces deployment timelines and increases reliability. We need to always be thinking about shifting left to detect defects earlier and quicker. To eliminate repetitive, error-prone tasks, increase productivity and provide faster releases into production. Automation helps the process of quickly deploying new versions, or quickly deploying for scale.
As the team is building the automation process, one of the decisions that needs to be made is around the number of deployment pipelines. If the solution components will always be deployed together and versioned at the same time, then you may be more productive by creating a single pipeline. However, larger implementations containing multiple agents, where those agents need to be versioned independently, or agents that will be shared across solutions, should have a deployment pipeline dedicated to each agent or agent type.
In the process of automating everything, we need to invest in the time required to make sure that we are getting the benefits that come from being able to deploy consistently and reliably. We need to know that we can trust the deployment process and have the consistency needed. We also need to ensure that we are automating the testing activities that will be required to run during and afterward. We need to have trust in both the deployment as well as ensuring that we can trust the results and actions of each agent.
We also need to make sure that we include automating monitoring and governance processes to ensure that the solution adheres to compliance and security standards.
Your solution may have been built using tools such as Azure AI Foundry with the AI Agent capabilities, including Semantic Kernel for orchestrating the workflow along with Autogen for agent development or built in Google Vertex AI with the Agent Builder, or others, and all have the base scripts required to build and deploy. However, there are deployment frameworks that are purpose-built to help with the automation of deployment. Some of those are tools like GitHub Actions, Azure DevOps Pipelines or Daggar.
While GitHub Actions and Azure DevOps pipelines have been around for a while. Dagger is an interesting addition to the frameworks listed. Dagger is a DevOps platform co-founded by Solomon Hykes, who co-founded Docker. Dagger lets you build composable CI/CD pipelines and workflows quickly through composable software components. This lets you automate, test and debug pipelines locally so that muti-person teams can collaborate on the process. Dagger provides modules to let you glue together the numerous specialized components required for a deployment process. This helps reduce the number of homemade scripts that typically must be written, tested and maintained. Dagger also helps prevent Development to Integration drift (Dev/CI). Dev/CI drift occurs when you end up having multiple configuration or deployment components created in multiple phases of the process. Instead of writing the same automation twice, you can write a Dagger plan once instead of specifying a Docker compose file. This is very handy if you are deploying to Docker packages or utilizing Kubernetes to manage and scale containerized solutions.
I had mentioned Governance and monitoring earlier in this section, but we need to dive into this again as both topics are so critical to the success of your solution. As we look at the process of experiment and build, evaluate and test, adapt and transition to the deployment environment governance and monitoring must be incorporated into every step. Build this into your culture. We have seen countless case studies that show that DevOps activities and practices work best in an environment of collaboration and continuous learning. Successful DevOps relies on creating a mindset and moving through action, with experimentation, feedback and shared ownership between teams.
AI governance is about having a strategy to stay in control. It is about action and it provides the guardrails that help mitigate risks and ensure ethical AI usage. Monitoring provides the ability to get feedback that folds into the continuous learning aspects. It provides feedback and answers to the experimentation activities. It builds trust with stakeholders and drives better business outcomes.
Governance guardrails are an essential mechanism to ensure that agents operate appropriately, within safe and ethical boundaries but more importantly, that their actions are consistent with the policies that your organization has put in place.
Governance of AI agents still requires a focus on operations, error handling, security, input validation, grounding, output validation and safety of responses.
With agentic AI we also need to have governance policies that provide guidance for input validation, grounding, output validation, safety of responses, ethical responses, real-time validation and testing, requirements for monitoring and the required functionality of having a human in the loop.
This human-in-the-loop functionality is essential during run time, but also during testing before and after deployment. Since agents can make autonomous actions, guardrails need to be factored into the process. Guardrails such as automation of runtime activities to mitigate risk and continue alignment of rules set through governance function.
You will need to incorporate monitoring at every facet of the solution from output logs, to output from every automation process to full runtime logging. Within your monitoring process, you should also include information that will assist you with quota management, cost management and visibility of all errors, requests and responses. Collecting the monitoring output isn’t enough. It used to be acceptable to store all the monitoring data and then review it once an error or outage was detected. With agentic AI systems, you need to implement continuous monitoring and review of the data to search for anomalies or activities that would necessitate a human in the loop scenario. You may even want to create an agent to review and monitor your monitoring data and take action in real time.
Deployment: Automating the LLM operations lifecycle
Keep in mind that everything surrounding artificial intelligence and agentic AI is still evolving. We are seeing models being released faster, which introduces model management activities that we didn’t have to manage previously. Tooling is evolving and new frameworks are being released that make processes easier and more streamlined and that can reduce technical debt. You need to ensure your AI solution evolves as well. You will need to iterate your solutions more frequently than you would have with your traditional non-AI solutions. You also need to ensure that you have a versioning strategy to keep up with modifications and new features.
If you aren’t planning updates, with a versioning strategy, as well as updating the iterative tests, your AI system will become obsolete. This can cause unreliability and it becomes a technical debt that you will struggle to maintain.
The benefits of fully automating the LLM operations lifecycle to enhance efficiency, consistency and reliability, while also supporting continuous improvement, cost-effectiveness and compliance far outweigh the cost.
Agentic AI solutions have immense potential for businesses seeking to automate tasks, enhance efficiency and incorporate the benefits of agentic AI. But if you aren’t deploying, testing, monitoring and automating the process it doesn’t matter how good your solution is or what the potential could have been.
In this article, we have covered the processes around agentic AI DevOps but I want you to take away five things that you should ensure are your foundational baseline required as the basis for every successful implementation:
- Automate, automate, automate: Automate tasks, create automation pipelines, automate testing, automate evaluations, automate the deployment of monitoring.
- Deploy to containers and virtual environments: Run solutions in Docker containers to isolate the agents and constrain their access.
- Restrict access: Limit the agents’ access to resources, and to the internet, as well as data repositories to prevent unauthorized access or data oversharing.
- Monitor: Monitor output logs, performance logs and custom metrics during and after execution to identify issues that require human review. Create and compare against the baseline to identify and easily identify unintended behavior.
- Human oversight: Run tests with humans in the loop to supervise the agents and ensure that you have included all scenarios that will require human intervention.
Fully automating the LLM operations lifecycle will enhance efficiency, consistency and reliability, while also supporting continuous improvement, cost-effectiveness and compliance.
Stephen Kaufman serves as a chief architect in the Microsoft Customer Success Unit Office of the CTO focusing on AI and cloud computing. He brings more than 30 years of experience across some of the largest enterprise customers, helping them understand and utilize AI ranging from initial concepts to specific application architectures, design, development and delivery.
This article was made possible by our partnership with the IASA Chief Architect Forum. The CAF’s purpose is to test, challenge and support the art and science of Business Technology Architecture and its evolution over time as well as grow the influence and leadership of chief architects both inside and outside the profession. The CAF is a leadership community of the IASA, the leading non-profit professional association for business technology architects.
Read More from This Article: From concept to reality: A practical guide to agentic AI deployment
Source: News