News

Microsoft Launches ASSERT to Test AI Agents Before They Ship

Microsoft’s ASSERT turns plain-English policies into scored AI agent tests, targeting the evaluation gap that keeps most enterprise agents out of production.

Published

1 month ago

June 3, 2026

Henry Fox

Microsoft has released ASSERT, an open-source framework that converts plain-English descriptions of how an AI system should behave into scored, repeatable tests. Built by the company’s Responsible AI group and detailed on Tuesday, the tool (Adaptive Spec-driven Scoring for Evaluation and Regression Testing) writes problem scenarios, runs them against a developer’s own agent, and flags exactly where the behavior breaks. It goes after a problem that broad benchmarks miss: whether one specific product follows its own specific rules.

On the surface that reads like another developer convenience. Underneath, it marks where the industry’s attention has moved. Building a capable model is no longer the hard part for most teams; proving that a model stays inside the narrow rules of a single product is. ASSERT aims straight at that gap, and it arrives while most enterprise AI agents still never make it out of the pilot stage.

How ASSERT Turns a Policy Into a Test Suite

The pitch is that a developer should not have to hand-write hundreds of adversarial prompts to know whether an agent misbehaves. Instead, you describe the rules in ordinary sentences and let the framework do the work of breaking them.

Take the example Microsoft uses. A developer building a document-research agent can write that it must never email anyone outside the company, must keep confidential material restricted to C-level executives, and should return concise summaries that account for prior context. ASSERT reads those constraints, turns them into a structured map of acceptable and unacceptable application-specific behavior, and then manufactures the situations most likely to trip the agent up.

The run itself follows a fixed sequence:

The developer describes goals, policies, and intended behavior in natural language.
The framework converts that into a structured set of allowed and disallowed actions.
It generates problem scenarios and concrete test cases from those rules.
It executes the cases against the target system and records the paths taken, including intermediate steps and tool calls.
It scores the results so a team can see where, and how, the agent failed.

That recording of intermediate tool calls is the part working engineers will notice first. When a multi-step agent goes wrong, the failure usually hides several actions deep, not in the final answer. Capturing the full trace is what lets a developer find the exact decision that sent confidential data to the wrong recipient, rather than just learning that something leaked.

Microsoft ASSERT framework for testing AI agent behavior before production deployment.

Why Evaluation Became the Gate to Production

ASSERT is easier to understand once you look at the wall most AI projects hit. Companies are not short on demos. They are short on agents they trust enough to ship.

The numbers around that trust problem are stark, and they come from across the field rather than any single vendor.

Only about 11% of enterprises that are exploring AI agents have actually put them into production, a gap surfaced in Anaconda and Forrester survey work and repeated in independent panels.
66% is the task-success rate Stanford’s AI Index recorded for agents this year, even as a large majority of enterprise agents never reach live use.
40% of enterprise applications are projected by research firm Gartner to feature task-specific AI agents by the end of this year, which means most of those evaluation habits are being invented on the fly.

The connective tissue is testing discipline. Teams that run automated evaluations on every change to a prompt or to agent code catch regressions before users do; teams that do not tend to roll their agents back. The emerging best practice in production shops is to block on regression, meaning a code change that makes the agent behave worse against its own rules fails the build, the same way a broken unit test stops a normal software release.

That is the habit ASSERT is built to feed. Sarah Bird, chief product officer of Responsible AI at Microsoft, framed evaluation as the thing standing between a guess and a decision.

One of the things we’ve learned is that evaluations are absolutely critical to making good decisions. Because if you don’t understand the behavior of the AI system, it’s really hard to know if it’s meeting your organization’s bar.

Bird added that what teams keep discovering is the need to test many more dimensions that are specific to their own application, not just the generic ones a public leaderboard covers.

Where ASSERT Fits Among HELM, AILuminate and METR

Microsoft is not the first to ship serious evaluation tooling, and ASSERT is not trying to do the same job as the best-known names. The big public benchmarks measure a model in the abstract. ASSERT measures your system against your policy.

Tool	Who runs it	What it measures	Application-specific?
ASSERT	Microsoft, open source, run by the developer	Your written policies, turned into scored tests against your agent	Yes
HELM	Stanford’s Center for Research on Foundation Models	Broad model capability and safety across standard tasks	No, model level
AILuminate	MLCommons, an industry consortium	Standardized safety benchmark scores for a model	No, model level
METR	Model Evaluation and Threat Research, a nonprofit	Dangerous autonomous capabilities in frontier models	No, frontier-model level

You can see the split clearly. Stanford’s Holistic Evaluation of Language Models benchmark suite tells you how a model stacks up on reasoning, accuracy, and bias against a fixed battery of tasks. MLCommons does something similar for safety with its AILuminate safety benchmark for large language models. The work that METR publishes on autonomous model capabilities probes whether a frontier system could do real damage if left to act on its own. All three answer questions about the model. None of them know that your document agent must never email outsiders. That last mile is the space ASSERT is claiming.

The Framework-Agnostic Bet on 13 Million Developers

The strategic tell is what ASSERT does not require. It is not tied to Microsoft’s own Foundry platform, and it runs across LangChain, CrewAI, LiteLLM, and OpenAI tooling, among others. Microsoft says it is aimed at the 6 to 13 million generative AI developers building today, most of whom will never standardize on a single vendor’s stack.

That openness is deliberate. By giving away the testing layer and supporting rival frameworks, Microsoft positions itself at the checkpoint every agent has to pass, regardless of whose model or orchestration sits underneath. Partners including CrewAI, Arize AI, LiteLLM, Pipecat, and Pydantic are already building with and validating the framework, per details Microsoft published on its open trust stack for AI agents.

ASSERT also ships next to a companion piece, the Agent Control Specification (ACS, an open standard that defines validation checkpoints across an agent’s run). ACS names five points to check, covering input, the model call, state, tool execution, and output, and lets teams express controls as portable policy files rather than code locked to one platform. It comes out of the same Build cycle that produced experiments like Microsoft’s AI wearable badge shown at Build, part of a wider push to own the plumbing around agents rather than just the models.

What Developers Still Have to Build Themselves

A framework that generates its own test cases is powerful and slightly unnerving, because the quality of the tests now depends on how well you wrote the policy. A vague rule produces vague checks. The work shifts from writing tests to writing precise specifications, which is its own skill.

A few things ASSERT does not hand you out of the box:

The policy itself. The framework can only test rules you have actually articulated; the hard product thinking about what the agent should and should not do is still on the team.
A judgment on whether the AI-generated scenarios cover the failure modes that matter most for your domain, which still needs human review.
The fixes. ASSERT tells you where behavior breaks; closing the gap with better prompts, guardrails, or tool constraints is a separate loop.

None of that diminishes the tool. It reframes the job. For most teams the bottleneck to shipping an agent has quietly become the part where you prove it behaves, and a free, framework-agnostic way to automate that proof is the kind of thing that decides whether an agent lives in production or dies in a pilot.

Frequently Asked Questions

Is Microsoft ASSERT free to use?

Yes. ASSERT is open source and published on GitHub under Microsoft’s Responsible AI organization, so developers can download and run it without a license fee or a Microsoft platform subscription.

What AI frameworks does ASSERT support?

It is built to be framework-agnostic and works across LangChain, CrewAI, LiteLLM, and OpenAI tooling, among others. It is intentionally not tied to Microsoft’s Foundry platform, which lets teams use it whatever stack their agent runs on.

How is ASSERT different from benchmarks like HELM or AILuminate?

HELM and AILuminate score a model against fixed, general tasks. ASSERT tests your specific application against your own written policies, generating scenarios tailored to the rules your product is supposed to follow rather than a standardized leaderboard.

What is the Agent Control Specification?

The Agent Control Specification (ACS) is an open companion standard that defines five validation checkpoints across an agent’s run, covering input, the model call, state, tool execution, and output. Controls are written as portable policy files so they move between platforms.

Can ASSERT be used after an agent is already deployed?

Yes. Microsoft says the framework can run during build, after deployment, and as part of continuous monitoring, so teams can keep checking behavior against policy long after an agent goes live.