Anthropic’s Claude 3.5 Sonnet large language model has gained a new ability: operating a computer.
The new ability, which the company is calling “computer use,” is currently in beta. It enables developers to instruct Claude 3.5 Sonnet, through the Anthropic API, to read and interpret what’s on the display, type text, move the cursor, click buttons, and switch between windows or applications, much as today’s robotic process automation (RPA) tools can be instructed, far more laboriously, to do.
To apply its ability to use a computer, Claude 3.5 Sonnet starts from a prompt defining its goal, identifies the steps necessary to reach that goal, and then scans screenshots much as a human would look at the screen of a computer to figure out how to perform those steps.
Key to that is Claude 3.5 Sonnet’s new-found ability to return the coordinates of a feature in an image, enabling it to position the cursor on a button or in a text box on screen.
Claude 3.5 Sonnet needs definitions of the tools and software on the computer it will operate, and authorization to access them. It then sends requests to use the tools, and examines each response to see whether it has succeeded or needs to continue using the tool to complete its task.
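That request-and-response cycle can be sketched roughly in Python. This is a minimal illustration under stated assumptions, not Anthropic's API: `ToolRequest`, `run_agent`, `scripted_model`, and the three tool names are hypothetical, and a scripted function stands in for the model so the example runs on its own.

```python
from dataclasses import dataclass


@dataclass
class ToolRequest:
    """Hypothetical shape of a model's request to use one tool."""
    tool: str
    args: dict


def run_agent(next_step, tools, goal, max_steps=10):
    """Ask the model for a step, run the requested tool, feed the result back."""
    history = [("goal", goal)]
    for _ in range(max_steps):
        request = next_step(history)
        if request is None:  # the model judges the task complete
            break
        result = tools[request.tool](**request.args)
        history.append((request.tool, result))
    return history


def scripted_model(history):
    """Stand-in for the model: replays a fixed plan, one step per call."""
    script = [
        ToolRequest("screenshot", {}),
        ToolRequest("click", {"x": 120, "y": 48}),
        ToolRequest("type_text", {"text": "hello"}),
    ]
    done = len(history) - 1  # the first history entry is the goal itself
    return script[done] if done < len(script) else None


# Hypothetical tools; real ones would drive an actual screen and keyboard.
tools = {
    "screenshot": lambda: "screenshot captured",
    "click": lambda x, y: f"clicked at ({x}, {y})",
    "type_text": lambda text: f"typed {text!r}",
}

history = run_agent(scripted_model, tools, "Say hello in the text box")
for step in history:
    print(step)
```

The `max_steps` cap is a design choice for the sketch: it stops the loop if the model never reports that the task is complete.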
How does this help with automation?
Anthropic claims that the computer use ability represents a shift in AI development that could open up use cases that previously remained untapped.
“Up until now, LLM developers have made tools fit the model, producing custom environments where AIs use specially-designed tools to complete various tasks. Now, we can make the model fit the tools,” Anthropic wrote in a blog post, adding that the idea is to fit Claude into computer environments that people use daily and allow it to use software in that environment as a human user would.
The message has clearly been heard by robotic process automation (RPA) vendor UiPath, which said Tuesday that it has integrated Claude 3.5 Sonnet in three of its products: UiPath Autopilot for everyone, Clipboard AI, and a new medical record summarization tool.
Claude’s computer use capability could shake up the RPA market, wrote Paul Chada, co-founder of AI startup Doozer AI, because it is not held up by constraints such as requiring constant maintenance or breaking when user interfaces change. “Anthropic’s new approach addresses these core challenges: Adaptive Interaction: Instead of hard-coded scripts, it actually understands what it’s looking at,” Chada wrote in a LinkedIn post, adding that other advantages of the system include it working across any interface and its potential to improve through usage and feedback.
Limitations
Anthropic cautioned that the computer use ability is still in beta and comes with several limitations. For example, it said, it may struggle to operate applications on screens with resolutions higher than XGA (1024×768) or WXGA (1280×800) due to issues with image scaling.
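One possible workaround, assumed here rather than taken from Anthropic's guidance, is to downscale each screenshot to fit within WXGA before sending it, then map the coordinates the model returns back to the real screen. The function names and the 1280×800 default below are illustrative only.

```python
def scale_factor(screen_w, screen_h, max_w=1280, max_h=800):
    """Factor that shrinks a screenshot to fit max_w x max_h (never enlarges)."""
    return min(1.0, max_w / screen_w, max_h / screen_h)


def to_screen(x, y, factor):
    """Map a point in the downscaled screenshot back to real screen pixels."""
    return round(x / factor), round(y / factor)


factor = scale_factor(2560, 1600)   # a 2560x1600 display
print(factor)                       # 0.5
print(to_screen(100, 50, factor))   # (200, 100)
```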
The company also warned users of the risk of prompt-injection attacks: If Claude navigates to a webpage with images or text containing instructions, these “may override user instructions or cause Claude to make mistakes,” it said.
To reduce such risks, Anthropic recommends restricting Claude 3.5 Sonnet’s internet access to approved domains to cut exposure to malicious content; withholding sensitive data such as account login information to prevent information theft; and using a dedicated virtual machine or container with minimal privileges to prevent direct system attacks or accidents.
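The approved-domains recommendation could be enforced in whatever layer fetches pages on the model's behalf. The sketch below uses Python's standard `urllib.parse`; the `ALLOWED_DOMAINS` set and the `is_allowed` helper are hypothetical, and a production filter would more likely sit at the network or proxy level.

```python
from urllib.parse import urlparse

# Hypothetical allowlist; substitute the domains you actually approve.
ALLOWED_DOMAINS = {"example.com", "docs.example.org"}


def is_allowed(url, allowed=ALLOWED_DOMAINS):
    """True only if the URL's host is an approved domain or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in allowed)


print(is_allowed("https://example.com/pricing"))     # True
print(is_allowed("https://example.com.evil.net/"))   # False
```

Matching on the parsed hostname, rather than searching the raw URL string, keeps look-alike hosts such as `example.com.evil.net` from slipping through.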
Additionally, it suggests that a human supervisor should be on hand to “confirm decisions that may result in meaningful real-world consequences as well as any tasks requiring affirmative consent, such as accepting cookies, executing financial transactions, or agreeing to terms of service.”
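In code, that kind of supervision amounts to a gate in front of sensitive actions. The following is a minimal sketch, assuming hypothetical action names and a `confirm` callback that stands in for the human supervisor; it is not part of Anthropic's tooling.

```python
# Hypothetical labels for the kinds of actions Anthropic singles out.
SENSITIVE_ACTIONS = {"accept_cookies", "financial_transaction", "agree_to_terms"}


def execute(action, perform, confirm):
    """Run `perform` unless the action is sensitive and the human declines."""
    if action in SENSITIVE_ACTIONS and not confirm(action):
        return "blocked"
    return perform(action)


def perform(action):
    return f"done: {action}"


print(execute("scroll_down", perform, confirm=lambda a: False))            # done: scroll_down
print(execute("financial_transaction", perform, confirm=lambda a: False))  # blocked
```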
These limitations, including the chance that Claude might follow commands found in content even if they conflict with a user’s instructions, have also sparked scepticism among testers.
“Anthropic’s agent is not really usable right now, it gets stuck constantly and consumes probably about $1 of tokens every 4 minutes of browsing or so,” Peter Gostev, head of AI at gift retailer Moonpig, wrote in a LinkedIn post after testing the new ability.
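Taken at face value, Gostev's rough figure works out to about $15 per hour of browsing. The short calculation below simply extrapolates his estimate linearly, which is an assumption; actual token costs will vary with the task.

```python
def hourly_cost(dollars, minutes):
    """Extrapolate a cost observed over `minutes` to a full hour."""
    return dollars / minutes * 60


print(hourly_cost(1.0, 4))  # 15.0
```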
Do it yourself?
When it comes to software development — another of Claude 3.5 Sonnet’s capabilities — the new computer use ability still leaves much to be desired, Martin Bechard wrote in another LinkedIn post.
“If this seems like Anthropic left a lot of the work to be done by you, you are correct. And all of the other Agentic frameworks essentially do the same — they use a model to figure out what needs to be done, then an application builder actually interprets the instructions and does the actual data retrieval or other work as commanded by the LLM,” Bechard wrote.
The software engineer went on to suggest that OpenAI has a similar tool.
“OpenAI also has a Tools feature that works essentially the same way – define the callable tools, which allows GPT to interrupt its chain of thought and return a request for a function call to be performed by the calling application in order to continue with the appropriate system data,” Bechard wrote in his post, adding that OpenAI also has an Assistants API that introduces a layer between the application and the LLM in order to maintain the context without having to send it on every call, but otherwise is still interruption-based.