Tiatra, LLC
Information Technology Solutions for Washington, DC Government Agencies
AI is ready to take over Python programming, but not much else

Tests of how well 19 large language models (LLMs) perform complicated multi-step tasks have shown that they are both error-prone and, in many cases, unreliable.

The findings are contained in a preprint paper, LLMs Corrupt Your Documents When You Delegate, written by Microsoft researchers Philippe Laban, Tobias Schnabel and Jennifer Neville, based on a benchmark they created called DELEGATE-52 that allowed them to simulate workflows that might be part of a knowledge worker’s tasks. The paper is currently under review.

They said that the benchmark contains 310 work environments across 52 professional domains including coding, crystallography, genealogy and music sheet notation. Each environment consists of real documents totaling around 15K tokens in length, and five to 10 complex editing tasks that a user might ask an LLM to perform.
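The paper does not publish its data schema, but the described structure — an environment per domain, real documents totaling roughly 15K tokens, and five to 10 editing tasks — can be sketched as a simple data model. All names here (`WorkEnvironment`, `EditTask`) are hypothetical illustrations, not the benchmark's actual code:

```python
from dataclasses import dataclass, field

@dataclass
class EditTask:
    """One delegated editing instruction (hypothetical shape)."""
    instruction: str   # e.g. "rename every occurrence of add to plus"
    domain: str        # e.g. "python", "genealogy", "music sheet notation"

@dataclass
class WorkEnvironment:
    """One benchmark environment: real documents plus 5-10 editing tasks."""
    domain: str
    documents: dict[str, str]                        # filename -> contents
    tasks: list[EditTask] = field(default_factory=list)

# A toy environment in the "python" domain
env = WorkEnvironment(
    domain="python",
    documents={"utils.py": "def add(a, b):\n    return a + b\n"},
    tasks=[EditTask("rename function add to plus", "python")],
)
```

Each of the 310 environments would then be one such record, with per-domain evaluators judging the edited documents.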

And, they stated in the paper’s abstract: “Our analysis shows that current LLMs are unreliable delegates: they introduce sparse but severe errors that silently corrupt documents, compounding over long interaction.”

Those mistakes are significant, they said. “The findings show that current LLMs introduce substantial errors when editing work documents, with frontier models (Gemini 3.1 Pro, Claude 4.6 Opus, and GPT 5.4) losing an average of 25% of document content over 20 delegated interactions, and an average degradation across all models of 50%.”

Benchmark exercise receives a thumbs up

Brian Jackson, principal research director at Info-Tech Research Group, found the results very interesting. “Putting a list of LLMs to the test across different work domains yields a lot of useful insights,” he said. “I think this type of benchmark exercise could be helpful to enterprise developers who are looking to leverage agentic AI to automate specific workflows and understand the limits of what can be achieved.”

However, he said, “what we shouldn’t conclude from this is that, because these foundation models caused document degradation after 20 edits, they can’t be used to automate work in a certain field. It just means they can’t do all of the work as they are currently constructed.”

But, Jackson stated, “in an enterprise environment where having an accurate output is crucial, you wouldn’t take that approach. You would design the automation flow with stronger guardrails in place to prevent errors. This could be done by using multiple agents that play different roles, such as one that makes the edits and another that checks for errors and makes corrections.”
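The editor-plus-checker pattern Jackson describes can be sketched generically. Everything below — the function name, the toy agents — is a hypothetical illustration of the guardrail idea, not code from the paper or any specific platform:

```python
from typing import Callable

def delegate_with_guardrail(
    document: str,
    editor: Callable[[str], str],
    checker: Callable[[str, str], bool],
    max_retries: int = 3,
) -> str:
    """Run an editor agent, then have a checker agent validate the result.

    If the checker rejects every attempt, fall back to the original
    document rather than silently accepting a corrupted one.
    """
    for _ in range(max_retries):
        edited = editor(document)
        if checker(document, edited):
            return edited
    return document  # no acceptable edit produced: keep the artifact intact

# Toy agents: the editor wrongly drops a line; the checker insists it stays.
drop_line = lambda doc: doc.replace("keep me\n", "")
nothing_lost = lambda before, after: "keep me" in after

result = delegate_with_guardrail("keep me\nedit me\n", drop_line, nothing_lost)
# The bad edit is rejected every time, so the original document comes back.
```

The key design choice is the fallback: a rejected edit degrades to "no change" instead of silent corruption, which is exactly the failure mode the paper measures.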

Sanchit Vir Gogia, chief analyst at Greyhound Research, said, “the Microsoft paper should be read as a serious warning about delegated AI, not as a claim that enterprise AI has failed. That distinction matters. The paper is still a preprint, so it deserves careful handling, but its central question is exactly the one CIOs should be asking: can AI preserve the integrity of complex work over repeated delegation?”

The study, he said, is stronger than what he described as “the usual AI benchmark theatre,” because it tests work products rather than clever one-off answers. “It uses reversible editing tasks, domain-specific evaluators, and a round-trip method to see whether a document returns intact after repeated edits. In too many cases, it does not.”
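The round-trip idea can be illustrated with a minimal sketch. The helper names and toy edits below are assumptions for illustration, not the benchmark's actual evaluators:

```python
def round_trip_intact(original: str, apply_edit, apply_inverse) -> bool:
    """Edit the document, apply the inverse edit, and check whether the
    document returns byte-for-byte to its original state."""
    return apply_inverse(apply_edit(original)) == original

doc = "value = 1\nvalue = value + 1\n"
rename = lambda d: d.replace("value", "total")        # reversible edit
rename_back = lambda d: d.replace("total", "value")   # its inverse
lossy = lambda d: d.replace("value = 1\n", "")        # destructive edit

clean_round_trip = round_trip_intact(doc, rename, rename_back)  # intact
lossy_round_trip = round_trip_intact(doc, lossy, rename_back)   # corrupted
```

Because each benchmark task has a known inverse, any residual difference after the round trip is, by construction, damage the model introduced.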

That is the point, explained Gogia. “This is not merely about hallucinations. It is about artefact integrity.”

AI is ‘not yet trustworthy enough’

He added that the headline finding is “uncomfortable: even the strongest models corrupt about a quarter of document content by the end of long workflows, while average degradation across all tested models reaches roughly 50%. The paper also finds that performance varies sharply by domain. Python is the only domain where most models are ‘ready,’ and the best model reaches that threshold in only 11 of 52 domains.”

AI is not failing because it cannot write, said Gogia; it is failing because it cannot yet preserve.

The study, he pointed out, “is especially useful because it shows how errors accumulate. Bigger documents worsen outcomes. Longer interaction worsens outcomes. Distractor files worsen outcomes. Short tests flatter the system, while longer workflows expose it. That maps rather neatly to the enterprise world, where work is messy, files are stale, context is noisy and the most important documents are rarely the simplest ones.”
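The compounding Gogia describes is easy to see with a little arithmetic. Assuming an independent per-edit retention rate (a simplified illustrative model, not the paper's metric), even a seemingly high retention rate decays sharply over a long workflow:

```python
def fraction_retained(per_edit_retention: float, n_edits: int) -> float:
    """Fraction of content surviving n edits under an independent
    per-edit retention rate (illustrative model, not the paper's metric)."""
    return per_edit_retention ** n_edits

# A seemingly benign ~98.6% per-edit retention compounds to roughly 25%
# total loss over 20 delegated interactions -- in the ballpark of the
# figure reported for frontier models.
loss_after_20 = 1 - fraction_retained(0.9857, 20)
```

This is why short tests flatter the system: at one or two edits the loss is barely measurable, but it compounds over the length of a real workflow.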

The honest conclusion, he said, “is not that AI should be kept out of enterprise workflows. It is that delegated AI is not yet trustworthy enough to be left alone with consequential artefacts.”

When AI edits an important document such as a contract, a ledger, a policy, a codebase, a board paper, or a compliance record, Gogia warned, the enterprise still owns the damage.

Mitigation approaches

To prevent that damage, Jackson suggested, enterprises can do additional training and fine-tuning of models to better adapt them to their specific workflows: “These foundation models are very good at doing a lot of different tasks, but less good at doing one specific task very well. So, enterprises that want to achieve that may need to improve the models themselves by training on their own data.”

For example, “[the Microsoft paper] points out one multi-agent setup that led to more degradation instead of less, so the method to detect degradation must be well-designed to be effective,” he said. “Another approach that some enterprise platforms have introduced is a way to deterministically verify the output for accuracy using mathematical verification. So, knowing what domains prove more difficult for a single LLM to automate is useful, as developers can plan to add more verification steps to the process.”
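One way deterministic verification can work, sketched here with hypothetical helpers (this is an assumption about the general technique, not the quoted platforms' implementation): hash every section the edit was not supposed to touch and confirm each survives byte-for-byte.

```python
import hashlib

def untouched_sections_intact(before: list[str], after: list[str],
                              edited: set[int]) -> bool:
    """Deterministically verify an edit: hash every section the edit was
    NOT supposed to touch and confirm each survived byte-for-byte."""
    if len(before) != len(after):
        return False  # sections were added or deleted outside the edit
    digest = lambda s: hashlib.sha256(s.encode()).hexdigest()
    return all(digest(before[i]) == digest(after[i])
               for i in range(len(before)) if i not in edited)

before = ["intro", "body", "outro"]
good   = ["intro", "BODY (edited)", "outro"]   # only section 1 changed
bad    = ["INTRO", "BODY (edited)", "outro"]   # section 0 silently altered
```

Unlike an LLM reviewer, a check like this cannot itself hallucinate: any out-of-scope change fails the hash comparison, which is the appeal of mathematical verification for the harder domains.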

He said, “depending on the model, for example, if it’s totally open source or if it’s proprietary, you can have more flexibility in terms of how much you can customize it. So, an enterprise developer might look at these results, pick the LLM best at automating their desired domain, and then send it in for additional training to master the process.”

People do not disappear

According to Gogia, the paper also shows something more precise than ‘AI still needs people.’ “It shows that AI changes the human layer from production to supervision, validation, and accountability. That is a rather different operating model from the one being sold in many boardroom conversations.”

People, he said, “do not disappear. Their work moves. This is the uncomfortable part for enterprises chasing headcount reduction. The people best placed to catch AI errors are often the same people organizations are hoping to replace, reduce, or redeploy. Remove too much domain expertise from the workflow, and the enterprise also removes the people who know when the AI has quietly damaged the work.”

Expertise becomes more valuable, not less, said Gogia: “The paper reinforces this because stronger models do not merely delete content. They often corrupt it. Weaker models are easier to catch when they visibly drop material. Frontier models are more awkward because the content remains present but becomes wrong, distorted, or subtly altered. That requires knowledgeable review, not casual inspection.”


Source: News
Category: News
May 13, 2026
