Claude just patched the biggest hole in every agent workflow ever built
Anthropic shipped three new agent capabilities last week. One of them changes the maintenance math entirely.
Your agent runs great for a week.
Then it starts failing. Not catastrophically, just quietly. It forgets a file format quirk it handled fine on day three. It stops following a tone preference that worked in session twelve. You go back in, reteach it, it runs clean for a few days, then it forgets again.
You are not fixing a bug. You are refilling a leaky bucket. For the last three years, this has been the hidden tax on building with AI agents. The tools get smarter. The models get better. But the maintenance load stays the same, because nothing carries forward between sessions except what you manually wrote into the system prompt when you set the thing up.
That changed on May 6.
Anthropic held its Code with Claude event in San Francisco. Most of the coverage went straight to the model news. The part worth paying attention to as an operator was buried under it: three features shipped the same day for Claude Managed Agents. Dreaming, outcomes, and multiagent orchestration. Two are in public beta. One is in research preview. Together they cut the maintenance cost of running an agent that actually keeps working.
What “Dreaming” Actually Does
The name sounds like marketing. The mechanism is straightforward.
Dreaming runs in the background between sessions. It reads what the agent did in recent jobs and scans for three kinds of patterns: mistakes the agent keeps repeating, workflows it has converged on across different jobs, and preferences that show up across your team of agents. Then it rewrites the agent’s memory store based on what it finds. Old notes get condensed. Important ones get promoted. The next session starts with a curated set of notes from the agent’s own past instead of a blank slate.
A few things worth flagging.
Model weights don’t change. Dreaming is structured note-taking applied to an agent’s persistent memory, not training. The agent gets a text summary of what worked and what failed. You can let dreaming update memory automatically, or you can require human review before any change lands. On anything high-stakes, the latter is the right default.
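Anthropic hasn’t published dreaming’s internals, but the pattern is easy to sketch yourself. Here’s a rough Python version on the plain Messages API; the memory file, the consolidation prompt, and the review gate are all my stand-ins to make the mechanic concrete, not the real feature.

```python
# A rough sketch of the dreaming pattern, NOT Anthropic's implementation.
# The memory file, consolidation prompt, and review gate are hypothetical.
import pathlib

import anthropic  # pip install anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
MEMORY_FILE = pathlib.Path("agent_memory.md")  # hypothetical persistent store


def consolidate_memory(session_logs: list[str], require_review: bool = True) -> None:
    """Distill recent session logs into durable notes for the next session."""
    current_notes = MEMORY_FILE.read_text() if MEMORY_FILE.exists() else ""
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": (
                "Existing agent notes:\n" + current_notes
                + "\n\nRecent session logs:\n" + "\n---\n".join(session_logs)
                + "\n\nRewrite the notes: condense stale entries; promote "
                "recurring mistakes, converged workflows, and stated "
                "preferences. Return only the updated notes."
            ),
        }],
    )
    updated = response.content[0].text
    if require_review:  # the right default for anything high-stakes
        print(updated)
        if input("Apply these memory updates? [y/N] ").strip().lower() != "y":
            return
    MEMORY_FILE.write_text(updated)  # the next session starts from these notes
```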
Harvey, the legal AI startup, piloted dreaming before the public launch. According to Anthropic’s announcement, Harvey saw task completion rates rise roughly 6x in internal testing. The root cause was mundane: their agents kept forgetting filetype quirks and tool-specific workarounds between sessions, so the same legal-drafting jobs failed the same way over and over. Dreaming made the workarounds stick.
That’s one data point and Anthropic didn’t ship an independent benchmark next to it. Harvey’s workflow is long-form legal drafting, which is the exact shape of problem where persistent memory pays off most. On simpler stateless tasks, the lift is going to be smaller.
Outcomes: The Feature With the Most Immediate Use
Dreaming is a research preview. Access is by request and it’s not production-ready for most operators yet. Outcomes is in public beta and worth using right now.
Here’s how it works. You write a rubric in plain language describing what a good output actually looks like. The agent does its work. A separate grader runs in its own context window, scoring the output against your rubric without picking up the agent’s reasoning along the way. When something falls short, the grader tells the agent exactly what to fix and sends it back for another pass. You can wire this to a webhook too, so the agent runs, the grader signs off, and you get notified only when the output meets your criteria.
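The hosted feature manages that loop for you, but if you want to feel the shape of it, here’s a minimal sketch built on the plain Messages API. The rubric text, the PASS/FAIL protocol, and the three-pass cap are illustrative assumptions, not Anthropic’s spec.

```python
# A minimal sketch of the outcomes loop on the plain Messages API.
# Rubric text, PASS/FAIL protocol, and pass cap are assumptions.
import anthropic

client = anthropic.Anthropic()

RUBRIC = """\
- Opens with the key finding, not background.
- Every claim names the source document it came from.
- Tone is direct, with no filler phrases."""


def grade(output: str) -> tuple[bool, str]:
    """Score the output against the rubric in a fresh context window."""
    verdict = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=512,
        messages=[{"role": "user", "content":
            f"Rubric:\n{RUBRIC}\n\nOutput:\n{output}\n\n"
            "Reply PASS, or FAIL followed by exactly what to fix."}],
    ).content[0].text
    return verdict.strip().startswith("PASS"), verdict


def run_with_outcomes(task: str, max_passes: int = 3) -> str:
    draft, feedback = "", ""
    for _ in range(max_passes):
        draft = client.messages.create(
            model="claude-sonnet-4-5",
            max_tokens=2048,
            messages=[{"role": "user", "content": task + feedback}],
        ).content[0].text
        passed, verdict = grade(draft)  # grader never sees the agent's reasoning
        if passed:
            return draft  # this is where a webhook notification would fire
        feedback = f"\n\nA reviewer flagged these issues; fix them:\n{verdict}"
    return draft  # best attempt after max_passes; flag for human review
```

The separate grade() call is the part that matters: the grader judges the artifact in a clean context, not the chain of reasoning that produced it.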
Anthropic’s internal benchmarks show outcomes improved task success by up to 10 points over a standard prompting loop. File generation quality rose 8.4% on .docx outputs and 10.1% on .pptx outputs. Those are Anthropic’s own numbers so apply the appropriate discount, but the direction is consistent with what practitioners are reporting externally.
Wisedocs, a document-review startup, built a quality-check agent using outcomes to grade each review against their internal guidelines. They reported that reviews now run 50% faster while staying aligned with the team’s standards. The speed gain comes from cutting the back-and-forth: the agent self-corrects before a human reviewer ever sees the output.
What gets me about outcomes is how much of the current agent-quality problem it addresses without technical complexity. Most operators still manually check agent output not because they distrust Claude, but because there has been no way to tell Claude what “good” looks like in a form that persists across runs. A rubric is that mechanism. Writing it forces you to get specific about your standard, which surfaces assumptions you probably hadn’t made explicit.
Multiagent Orchestration: When a Single Agent Hits Its Ceiling
Multiagent orchestration is for workflows that have grown beyond what one agent can handle well.
Here’s the setup. A lead agent breaks a complex job into pieces and delegates each piece to a specialist subagent with its own model, prompt, and tools. Up to 20 specialists run in parallel on a shared filesystem. The lead agent can check back in with subagents mid-workflow. Every agent’s activity is individually traceable in the Claude Console.
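If you want a feel for the fan-out before touching the hosted version, here’s a toy approximation using the async SDK. The real feature adds delegation logic, a shared filesystem, and Console tracing; the chunking, model choices, and prompts below are my assumptions.

```python
# A toy version of the fan-out pattern with the async SDK. Chunking,
# model choices, and prompts are illustrative assumptions.
import asyncio

import anthropic

client = anthropic.AsyncAnthropic()


async def subagent(chunk: str) -> str:
    """One specialist: scan a slice of the logs for actionable patterns."""
    response = await client.messages.create(
        model="claude-haiku-4-5",  # cheaper model for the scanning work
        max_tokens=512,
        messages=[{"role": "user", "content":
            "Scan these build logs and report only patterns worth acting on:\n"
            + chunk}],
    )
    return response.content[0].text


async def lead_agent(build_logs: list[str]) -> str:
    """Fan the batch out to parallel subagents, then merge their reports."""
    chunks = build_logs[:20]  # mirrors the 20-specialist cap
    reports = await asyncio.gather(*(subagent(c) for c in chunks))
    summary = await client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        messages=[{"role": "user", "content":
            "Merge these subagent reports into one prioritized list:\n"
            + "\n---\n".join(reports)}],
    )
    return summary.content[0].text

# Kick off a run: asyncio.run(lead_agent(list_of_log_batches))
```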
Netflix’s platform team is already using it. They built an analysis agent that processes build logs from hundreds of source repositories. Multiagent orchestration lets the lead agent fan the batch out to subagents scanning in parallel, reporting back only the patterns worth acting on. What would have been a serialized scan across a huge codebase now runs in a fraction of the time.
Spiral, a writing tool built by the publication Every, runs the same structure differently. A Haiku-based lead agent handles incoming requests and delegates drafting to Opus-based subagents. When a user asks for multiple drafts, the Opus subagents run side by side. Drafts only reach the user if they clear an outcomes rubric scored against Every’s editorial principles.
That combination shows how these features layer. Multiagent orchestration handles the work distribution. Outcomes handles the quality check. You get parallel speed without dropping the bar.
For most SMB operators, multiagent orchestration is a month or two away from being the right next step. It makes sense once you have a workflow that’s consistently producing clean output but taking too long because it’s processing things one at a time. If your current agent is still producing inconsistent output, fix that first with outcomes before adding parallel complexity.
The Honest View on What Is Still Rough
A few things to hold in mind before going deep on any of these.
Dreaming is a research preview. You can request access but it’s not production-ready, and Anthropic has flagged the security risk clearly. Giving agents structured persistent memory expands the attack surface for prompt-injection. If a malicious input convinces an agent that a wrong instruction is correct, dreaming can consolidate that wrong instruction into long-term memory where it applies to future sessions. Human review on memory updates is the right default for any workflow where that risk matters.
Harvey’s 6x number is compelling, and it should still be treated skeptically. It’s Anthropic’s customer publishing a result on Anthropic’s platform at Anthropic’s developer conference. The result might hold up broadly. It might also be specific to the legal-drafting workflow, where Harvey had an unusually clear before-and-after. Real-world results will vary by workflow type.
Outcomes and multiagent orchestration are in public beta. Things will break. Edge cases will surface. The docs are solid but still evolving.
Managed Agents pricing runs at $0.08 per session-hour plus token costs. For light workloads that’s basically nothing. For heavy parallel orchestration across a fleet of agents, it adds up fast. Run the math before you scale that.
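A quick back-of-envelope, with made-up usage numbers, shows how the session-hour line scales:

```python
# Back-of-envelope at the stated $0.08 per session-hour. The fleet size
# and hours are made-up usage numbers, and token costs come on top.
agents = 20         # lead agent plus subagents in a heavy orchestration setup
hours_per_day = 2   # active session time per agent
days = 30

session_hours = agents * hours_per_day * days  # 1,200 session-hours
print(f"${session_hours * 0.08:,.2f}/month before tokens")  # $96.00/month before tokens
```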
What To Do With This
First: Set up an outcomes rubric for one existing agent workflow this week. Pick a workflow where you’re currently reviewing output manually before using it. Write 3 to 5 plain-language criteria describing what “good” actually looks like. Attach it to your Managed Agent. The rubric forces you to articulate your standard, which is useful even before the grading loop shows results.
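If it helps to see the shape, here’s a hypothetical rubric for an agent that drafts client status updates. Yours will look different; the point is criteria concrete enough to check.

```
1. Leads with what changed since the last update, not background.
2. Every number comes from the attached report, with the source named.
3. Action items sit in their own list, each with an owner and a date.
4. Under 300 words, no filler ("I hope this finds you well").
```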
Second: Request dreaming access if you have an agent that runs the same workflow repeatedly. The access form is in the Claude Managed Agents docs. If your agent keeps failing in familiar ways between sessions, dreaming is built exactly for that problem. Use human review on memory updates until you have enough confidence in what patterns it’s surfacing.
Third: Skip multiagent orchestration for now unless you already have a clean, high-volume workflow. The complexity overhead isn’t worth it until your single-agent setup is producing consistent output. Build the foundation with outcomes first.
The pattern with operators who pull ahead isn’t that they chase every new feature. It’s that they find the specific friction point in their current workflow and fix that one thing. Right now, for most people running agents, that friction point is output quality variability and session-to-session memory loss. Outcomes and dreaming are the most direct tools that have ever shipped for both of those.
If you want to think through whether Managed Agents are the right architecture for what you’re building, that’s a good conversation to have before you start wiring things together. muddventures.com/book
Andrew

