DeepTest: Systematic Testing Grounded in Specification Knowledge Bases
- Authors: Sarah Fakhoury, Saikat Chakraborty, Lef Ioannidis, Mei Yang, Denny Sun, Madan Musuvathi, Nikhil Swamy
TL;DR: In the era of agentic exploit-generation tools, proactively hardening codebases is essential. We are sharing early results from DeepTest, a framework for systematic, specification-driven testing. We have deployed it across dozens of Microsoft components and open-source projects, contributing thousands of new tests and finding and fixing dozens of bugs in the process. DeepTest is part of our efforts to make intent explicit and checkable in the era of agentic software engineering.
Agentic tools, powered by the latest frontier models, are quickly transforming the space of software defect detection. Anthropic’s Mythos and OpenAI’s Codex Security are able to identify high- and critical-severity zero-days and produce exploits at unprecedented scale.
While exploits generated by these tools allow teams to address bugs and harden their code (e.g. Hardening Firefox), these tools come late in the development process and encourage a reactive form of software security, where, under time pressure, one is rarely confident that issues found and patched are fixed properly and do not break other system behaviors.
In a world where everyone has access to agentic exploit-producing tools, we believe that effective defenses must involve systematic analysis and testing of software early in the development process, ironing out user intent and guarding software against future regressions with comprehensive test suites. A downstream benefit of a comprehensive test suite is substantially reducing the burden for review at patch time.
However, a test suite is only as good as the oracle that accompanies it. In the past, building a good oracle has required users to write down specifications in various formal notations (ranging from runtime assertions to more elaborate specification languages). Such formalisms have always been a barrier to adoption. Besides, certain properties are just plain hard to specify, though they may be easy to describe informally. Conversely, some properties are hard to pin down informally and benefit from formal specification (such as those described in a previous post).
In this post, we introduce DeepTest, an agentic framework that uses a combination of AI and symbolic tools to:
- Infer specifications for software from a diversity of sources, including:
  - requirements documents
  - code comments
  - formal annotations (like SAL)
  - code assertions
  - and the code itself
- Author tests to validate the behavior of a software component against its inferred specification:
  - contributing new tests to a project, making the expected behavior explicit and guarding against future regressions
  - finding that the inferred specification does not match a user’s intent, allowing the specification to be refined
  - finding bugs in the software that can be fixed
We have used DeepTest on dozens of projects at Microsoft, including some open source projects, and have:
- inferred detailed specifications ranging from procedure-level contracts to state-machine descriptions of entire components
- contributed thousands of new tests
- found and fixed dozens of functional and behavioral bugs
Whereas exploit-oriented agentic tools focus on finding “universal bugs” (e.g., memory safety errors that can be exploited or validated through a universal oracle), DeepTest infers behavioral specifications and finds bugs across the full spectrum: low-level violations like memory safety errors that arise from contract mismatches between functions, as well as higher-level behavioral bugs that violate protocol or application intent.
In the rest of this post, we give a flavor of the kinds of specifications, tests, and bugs that DeepTest produces, and conclude with some thoughts for the future.
Inferring Program Specifications
DeepTest uses symbolic tools to ground AI agents through modular analysis of the codebase, building a central specification knowledge base. It extracts call graphs to understand procedure relationships, then infers procedure-level contracts (pre/postconditions, module invariants) and state machines guided by call chains. It also stores mappings between natural language requirements and the code that implements them.
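As a purely illustrative sketch, one entry in such a knowledge base might pair a procedure with its inferred contract and the evidence it was derived from. The schema, field names, and the example contract below are our assumptions for exposition, not DeepTest's actual internal format:

```python
from dataclasses import dataclass, field

# Hypothetical schema for one knowledge-base entry; illustrative only.
@dataclass
class ProcedureContract:
    function: str                 # fully qualified procedure name
    preconditions: list[str]      # conditions assumed on entry
    postconditions: list[str]     # conditions guaranteed on exit
    sources: list[str] = field(default_factory=list)  # evidence trail

entry = ProcedureContract(
    function="CubicCongestionControlOnDataAcknowledged",
    preconditions=["connection is in SLOW_START or CONGESTION_AVOIDANCE"],
    postconditions=["congestion window is non-decreasing on ACK"],
    sources=["cubic.c comments", "RFC 8312 §4.1"],
)
print(entry.function)
```

Recording the evidence trail alongside each contract is what lets a requirement in a document later be traced to the code path that implements it.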
To illustrate one such example, we describe our experience on MSQuic, Microsoft’s open-source QUIC implementation that powers HTTP/3 across Windows, .NET, and many production services. The codebase spans roughly 200,000 lines of C and implements multiple protocol RFCs. In particular, we applied DeepTest to MSQuic’s implementation of the Cubic congestion control algorithm, a core protocol component responsible for regulating sending rate in response to network conditions.
DeepTest begins by analyzing the implementation (cubic.c), associated comments, and RFC descriptions to reconstruct a structured view of the protocol’s behavior. From these sources, it synthesizes a state machine that captures how the congestion controller evolves over time. This model is not explicitly defined in the codebase but is inferred by combining control- and data-flow structure in the implementation, semantic hints from comments, and behavioral constraints described in RFC documents.
The resulting state machine captures key phases such as growth under acknowledgments, transitions after congestion events, and recovery toward prior operating points. To make this structure executable, DeepTest synthesizes action functions that encode how to move between states. These functions (e.g., function to make a connection enter congestion avoidance state) simulate protocol events such as acknowledgment arrival, time progression, and congestion signals, ensuring that any constructed execution follows a valid sequence of steps.
By combining the inferred state machine with these synthesized actions, DeepTest creates a representation of protocol behavior that is both structured and executable, forming the foundation for systematic test generation.
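To make this concrete, here is a minimal, self-contained sketch of an executable state-machine model in the spirit described above. The states, events, and window arithmetic are illustrative simplifications of a Cubic-style controller, not MSQuic's actual implementation:

```python
# Toy transition relation: (state, event) -> next state. Illustrative only.
TRANSITIONS = {
    ("SLOW_START", "ack"): "SLOW_START",
    ("SLOW_START", "congestion"): "RECOVERY",
    ("CONGESTION_AVOIDANCE", "ack"): "CONGESTION_AVOIDANCE",
    ("CONGESTION_AVOIDANCE", "congestion"): "RECOVERY",
    ("RECOVERY", "recovery_done"): "CONGESTION_AVOIDANCE",
}

class Controller:
    def __init__(self):
        self.state = "SLOW_START"
        self.cwnd = 10

    def step(self, event):
        # Action functions only fire along valid transitions, so every
        # constructed execution follows a legal sequence of protocol steps.
        key = (self.state, event)
        assert key in TRANSITIONS, f"invalid event {event} in {self.state}"
        if event == "ack":
            # Exponential growth in slow start, linear otherwise (simplified).
            self.cwnd += self.cwnd if self.state == "SLOW_START" else 1
        elif event == "congestion":
            self.cwnd = max(2, self.cwnd // 2)
        self.state = TRANSITIONS[key]

c = Controller()
for e in ["ack", "congestion", "recovery_done", "ack"]:
    c.step(e)
print(c.state, c.cwnd)  # prints: CONGESTION_AVOIDANCE 11
```

The assertion inside `step` is the key property: an agent driving this model cannot fabricate a state the protocol could never reach.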
Systematic Test Generation
DeepTest approaches test generation systematically, iterating over AST-extracted paths or traversing an inferred state-transition graph, and identifying gaps in existing test suites along the way. For example, the cubic.c state machine is traversed exhaustively to cover sequences such as steady-state growth, congestion events, and recovery cycles. Each path is translated into a concrete test by composing the corresponding action functions, so every execution follows a valid sequence of protocol steps rather than jumping to an arbitrary state. The resulting tests are systematic across relevant transitions, faithful to real protocol behavior, and compositional across multi-step scenarios.
As tests are generated, DeepTest also produces assertions to check how the system behaves along each path, for example, how the congestion window evolves under acknowledgments or how state transitions are triggered. By turning the state machine into executable paths, DeepTest reduces non-determinism and provides agents with a systematic way to comprehensively explore real protocol behaviors, uncovering subtle bugs that depend on specific sequences of events.
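The traversal step can be sketched as a bounded enumeration over the inferred transition graph, where each enumerated event sequence becomes one candidate test. The graph and depth bound below are toy stand-ins, not MSQuic's actual model:

```python
# Illustrative transition graph: state -> {event: next state}.
TRANSITIONS = {
    "SLOW_START": {"ack": "SLOW_START", "congestion": "RECOVERY"},
    "CONGESTION_AVOIDANCE": {"ack": "CONGESTION_AVOIDANCE",
                             "congestion": "RECOVERY"},
    "RECOVERY": {"recovery_done": "CONGESTION_AVOIDANCE"},
}

def enumerate_paths(state, depth):
    """Yield every valid event sequence of exactly `depth` steps."""
    if depth == 0:
        yield []
        return
    for event, nxt in TRANSITIONS[state].items():
        for rest in enumerate_paths(nxt, depth - 1):
            yield [event] + rest

# Each path is a candidate test: a legal sequence of protocol events
# to replay through the synthesized action functions.
paths = list(enumerate_paths("SLOW_START", 3))
for p in paths:
    print(p)
```

From this starting state and depth, the enumeration yields five distinct event sequences, covering steady-state growth, a congestion event, and a recovery cycle in one sweep.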
DeepTest generated 50 tests covering 99.41% of the lines in cubic.c. The remaining two lines are dead code or defensive code paths. Out of the 50 contributed tests, 23 validated behavioral specification conformance with RFC 8312, RFC 9002, and HyStart++; 7 covered API-level contracts; and 20 asserted on internal struct invariants.
Finding Bugs with DeepTest
We ran DeepTest on dozens of Microsoft components and open-source projects, finding and fixing dozens of bugs ranging from subtle specification violations to API contract breaches and internal invariant failures. We describe examples from two representative classes of bugs below.
Violations of Behavioral Specifications
MSQuic
Issue #5306: DeepTest identified a standards-compliance gap in MSQuic’s handshake connection ID validation. RFC 9000 §7.3 requires that an endpoint authenticate all handshake connection IDs, including the Source Connection ID used in a Retry packet, by comparing it against the retry_source_connection_id transport parameter sent by the peer. MSQuic checks for the presence of this transport parameter but never compares its value against the actual Retry SCID it received. A // TODO - Validate comment in QuicConnValidateTransportParameterCIDs (connection.c) confirms the validation was known to be missing, and the encoder stores no record of the Retry SCID, making the comparison impossible without a code change.
This violation does not produce a crash, sanitizer signal, or obvious malfunction. Detecting it requires understanding the behavioral contract defined by the QUIC specification, which DeepTest captured by mapping the natural-language requirement in RFC 9000 §7.3 to the implementing code path.
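The required check can be sketched as follows. The function and parameter names are hypothetical, intended only to make the RFC 9000 §7.3 requirement concrete; they are not MSQuic identifiers:

```python
# Illustrative sketch of the RFC 9000 §7.3 requirement: the
# retry_source_connection_id transport parameter sent by the peer must
# equal the SCID actually observed in the Retry packet.
def validate_retry_scid(received_retry_scid, retry_scid_transport_param):
    # A presence check alone (the behavior the bug report describes)
    # is insufficient; the values must also match.
    if retry_scid_transport_param is None:
        return False
    return received_retry_scid == retry_scid_transport_param

print(validate_retry_scid(b"\x01\x02", b"\x01\x02"))  # True: values match
print(validate_retry_scid(b"\x01\x02", b"\xff\xff"))  # False: mismatch
```

The bug is precisely the gap between the presence check and the value comparison, which is why no crash or sanitizer signal ever flags it.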
DeepTest also surfaced a cross-origin cookie eviction bug in Servo’s storage layer.
Issue #44097: In Servo, the reg_host function in cookie_storage.rs determines which storage bucket a cookie belongs to by delegating to reg_suffix, a public-suffix lookup designed for domain names. When the host is an IP address, no octet matches a public suffix rule, so reg_suffix falls back to returning the last two dot-separated segments as the bucket key. This treats 192.168.0.1, 10.0.0.1, and 172.16.0.1 as if they were subdomains of a shared parent, mapping all three to the bucket 0.1. Each bucket is capped at max_per_host cookies, so a server at one IP can flood the shared bucket and evict cookies belonging to a completely unrelated IP that happens to share the same bucket key.
The correct handling already exists elsewhere in the codebase: registered_domain_name checks whether the host is an IP and returns the full address as-is rather than passing it through suffix logic. reg_host simply does not apply the same check before bucketing.
DeepTest inferred the relevant contracts automatically:
- reg_suffix assumes its input is a dot-separated domain name
- reg_host is the sole determinant of bucket identity for max_per_host enforcement
- CookieStorage::push evicts entries when a bucket exceeds its cap, regardless of which origin contributed them
By composing these contracts across the call chain from set_cookies_from_headers into CookieStorage::push, DeepTest identified that no sanitization step prevents the collision and generated a concrete triggering test in which cookies stored for one IP vanish after an unrelated IP fills the shared bucket.
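The fallback behavior and its fix can be sketched in miniature. The function names and the digit-based IP check below are illustrative stand-ins for exposition, not Servo's actual code:

```python
# Model of the bucketing flaw: a public-suffix lookup meant for domain
# names falls back to the last two dot-separated labels, so distinct IP
# addresses collide on a single bucket key. Illustrative only.
def reg_suffix_fallback(host):
    """Mimic the fallback: no suffix rule matches, keep last two labels."""
    parts = host.split(".")
    return ".".join(parts[-2:])

def bucket_key_buggy(host):
    return reg_suffix_fallback(host)      # applied even to IP addresses

def bucket_key_fixed(host):
    # The fix pattern: an IP address is its own bucket (full address).
    if all(p.isdigit() for p in host.split(".")):
        return host
    return reg_suffix_fallback(host)

for ip in ["192.168.0.1", "10.0.0.1", "172.16.0.1"]:
    print(ip, "->", bucket_key_buggy(ip), "/", bucket_key_fixed(ip))
```

Under the buggy keying, all three unrelated addresses map to the single bucket `0.1`, so filling that bucket from one address evicts cookies stored for the others.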
Violations of Safety Specifications
MSQuic
Issue #5841: DeepTest also discovered a subtle memory management bug involving allocation contracts.
In MSQuic, the STRM_PROVIDE_RECV_BUFFERS cleanup path in QuicOperationFree releases leftover QUIC_RECV_CHUNK entries by calling CXPLAT_FREE, which ultimately passes the pointer directly to free(). However, these chunks are allocated via CxPlatPoolAlloc in MsQuicStreamProvideReceiveBuffers and marked with AllocatedFromPool = TRUE. CxPlatPoolAlloc returns an interior pointer within a pooled allocation rather than the original malloc return value. Passing this interior pointer directly to free() violates the allocator’s precondition.
The correct deallocation path already exists in QuicRecvChunkFree, which checks the AllocatedFromPool flag and routes the pointer to either CxPlatPoolFree or CXPLAT_FREE. Freeing an interior pointer often corrupts allocator metadata silently, with failures appearing only much later in unrelated code; under ASAN it surfaces as attempting free on address which was not malloc()-ed.
DeepTest inferred the relevant contracts automatically:
- CxPlatPoolAlloc returns an interior pointer
- free() requires the original allocation address
- the AllocatedFromPool flag determines the correct deallocation path
By composing these contracts across functions, DeepTest identified the violation and generated a concrete triggering test.
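The contract mismatch can be modeled in miniature. The toy allocator below is our own illustration of the precondition violation, not cxplat's actual code:

```python
# Toy model: a pool hands out interior addresses inside a block it got
# from the base allocator, so the base free() must never see a pool
# address. Illustrative only.
class BaseAllocator:
    def __init__(self):
        self.live = set()     # addresses returned by malloc and not freed
        self.next = 0x1000

    def malloc(self, size):
        addr = self.next
        self.next += size
        self.live.add(addr)
        return addr

    def free(self, addr):
        # Precondition: addr must be exactly a value returned by malloc.
        assert addr in self.live, "free on address which was not malloc()-ed"
        self.live.remove(addr)

class Pool:
    def __init__(self, base, block_size=256):
        self.block = base.malloc(block_size)  # one backing block
        self.offset = 16                      # pool header before chunks

    def alloc(self):
        addr = self.block + self.offset       # interior pointer
        self.offset += 32
        return addr

base = BaseAllocator()
pool = Pool(base)
chunk = pool.alloc()          # analogous to AllocatedFromPool = TRUE
try:
    base.free(chunk)          # the buggy path: interior pointer to free()
except AssertionError as e:
    print(e)                  # prints: free on address which was not malloc()-ed
```

In the real allocator there is no such assertion: the interior free silently corrupts metadata, which is why the failure surfaces only under ASAN or much later in unrelated code.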
Similar bugs found by DeepTest and fixed in MSQuic: #5255, #5389, #5820, #5822
Mimalloc
DeepTest also uncovered bugs in Mimalloc, Microsoft’s open-source high-performance memory allocator.
Fix adf4dfb: The function mi_heap_realloc_aligned guarantees that the returned pointer will satisfy a requested alignment. However, when the original allocation was created using mi_heap_alloc_aligned_at, the reallocation preserves the original offset instead of honoring the new alignment parameter.
The returned pointer remains valid memory, so no crash or sanitizer signal occurs. The violation only surfaces later when downstream code executes SIMD instructions that assume the promised alignment. The fix was merged soon after discovery.
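A small sketch with hypothetical addresses shows why preserving the old offset breaks the alignment guarantee; this models the bug pattern only and is not mimalloc's code:

```python
# The aligned-realloc contract: the returned pointer must honor the
# *newly requested* alignment, not whatever offset the original
# aligned_at allocation happened to use. Addresses are made up.
def align_up(addr, alignment):
    return (addr + alignment - 1) & ~(alignment - 1)

def realloc_aligned_buggy(new_base, old_offset, alignment):
    return new_base + old_offset          # bug pattern: keep the old offset

def realloc_aligned_fixed(new_base, old_offset, alignment):
    return align_up(new_base, alignment)  # honor the requested alignment

new_base = 0x1003
p_bug = realloc_aligned_buggy(new_base, 8, 64)
p_fix = realloc_aligned_fixed(new_base, 8, 64)
print(hex(p_bug), p_bug % 64 == 0)   # misaligned: SIMD users would fault
print(hex(p_fix), p_fix % 64 == 0)   # satisfies the promised alignment
```

Both pointers are valid memory, which is exactly why only an assertion on the alignment contract itself, rather than a crash, detects the violation.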
Similar bugs found by DeepTest and fixed in Mimalloc: 0adf251, f0cd215, acd6f6c, 661fe25
Looking Ahead
In the era of agentic software engineering, high-quality test collateral is foundational to:
- Trustworthy coding agents: comprehensive tests provide natural guardrails for AI coding agents, catching regressions before they propagate
- Security patch assessment: vulnerabilities found later in the development process can be patched with confidence that the fix does not break existing behavior
- Developer productivity: writing good tests has always been painstaking, and LLM-driven codebase understanding makes systematic testing tractable at scale
We are at the beginning of the specification inference journey. AI is already remarkably effective at inferring informal specifications by correlating evidence across documents, code comments, naming conventions, and the code itself. We envision an interactive workflow where developers vet, refine, and extend inferred specifications, building on emerging methods described in previous posts. Future posts will dig deeper into how DeepTest infers and refines specifications.
Finally, an exciting prospect is to connect the specifications inferred by DeepTest, or at least the parts that can be formalized, to program proof tools that prove the code correct, ensuring that certain classes of program defects are ruled out entirely. Tests find bugs; proofs guarantee their absence.
This involves building tools that allow agents to reason comprehensively about code bases, and provide guardrails for agentic software engineering. Such tools would allow agents to understand the implications of code changes, verify that modifications adhere to inferred specifications, and ensure that patches or new features do not introduce regressions or violate critical invariants, effectively enabling a more proactive and reliable approach to software development.
As software development becomes increasingly automated, specifications will become the primary mechanism by which humans communicate intent to machines. Building systems that can infer, validate, and maintain those specifications is therefore not just a testing problem, it is foundational to trustworthy software engineering in the age of AI.
Acknowledgements: Shuvendu Lahiri, Angelica Moreira, Daan Leijen, Guillaume Hetier