Spotting Specification Gaps with Small Proof-Oriented Tests
- Authors: Nik Swamy and Shuvendu Lahiri
In our last two blog posts, we discussed using AI agents to author verified software libraries at scale and the importance of intent formalization in the software development process, especially when untrustworthy AI agents write code. This post puts these ideas together by using a simple, folklore methodology (that we’re calling small proof-oriented tests, or SPOTs) to identify and automatically fix specification gaps in AutoCLRS, our agentic formalization of the CLRS textbook.

Note: This image was produced by Copilot from the image in the previous post, which in turn uses Wikipedia’s cover image for the book.
Small Proof-Oriented Tests
An old idea, popularized in the last decade especially by the DeepSpec project, is for specifications of verified components to be “two-sided”. That is, a component specification should not only be strong enough to prove the correctness of the implementation, but also be strong enough to conduct proofs of clients of that component. In other words, specifications of components should be “precise” enough to be used in client proofs.
A natural question then is whether the specifications of AutoCLRS are precise. One way to judge this (building on an idea explored in this paper) is to exercise each verified algorithm by writing a small, concrete, verified test case, proving that the test always succeeds. Small proof-oriented tests have some useful properties:
- A concrete test for a verified library is always useful. As Donald Knuth famously once said: “Beware of bugs in the above code; I have only proved it correct, not tried it.” Actually running a test against the verified code can shake out important specification bugs, e.g., issues with how the specification models the execution environment of the program.
- Verifying that the test always succeeds is a useful way to assess whether or not the specification is precise enough to prove a client program correct.
- The test exercises the specification on a small concrete instance, e.g., sorting a list of three numbers. This makes the expected output easy to judge.
- Such tests are also useful API samples, e.g., documenting how to use verified libraries, both concretely and for proof.
We had agents generate SPOTs for all the algorithms in AutoCLRS and found several specification gaps. Then, with several further iterations and specification critique (in natural language) from the authors, we had agents fix these specification gaps and associated proofs.
Some Takeaways
- Writing SPOTs has always been good proof-engineering practice. Instructing agents to write such tests remains good practice, and is a useful way to find specification gaps.
- A key principle of agentic engineering is to find ways to “close the feedback loop”, so that agents can iterate towards meeting some objective. Using SPOTs as a feedback mechanism for agentic proof-oriented programming is a great example of this principle in action.
- We applied SPOTs to AutoCLRS after most of the code was already written, but it would be interesting to apply SPOTs earlier in the development process. For example, one could start by settling the interfaces of libraries and sketch SPOTs to check them for precision before any proofs are written. Of course, settling on specifications early runs the risk of choosing specs that cannot actually be proven, so the right model would strike a balance between early specification tests, subsequent proof attempts, and repeated refinement.
- SPOTs are not absolute or foolproof: they are small, and may not cover all the aspects of a specification, so they can miss some gaps. There’s also the danger of agents over-fitting specifications to the SPOTs, though we have yet to see this happen in practice.
- SPOTs also need to be reviewed, though because they are small and concrete, they can be easier to review than large, abstract specifications.
A full report of SPOTs on AutoCLRS is available here, with a more detailed discussion written by us, as well as an agent-authored report of the findings.
A First Example: Insertion Sort
Let’s take a very simple example to start with: the signature of a verified implementation of insertion sort.
```
fn insertion_sort
    (a: array int)
    (#s0: Ghost.erased (Seq.seq int))
    (len: SZ.t)
    (ctr: GR.ref nat)
    (#c0: erased nat)
requires A.pts_to a s0
requires GR.pts_to ctr c0
requires pure (
  SZ.v len == Seq.length s0 /\
  Seq.length s0 <= A.length a)
ensures exists* s (cf: nat).
  A.pts_to a s **
  GR.pts_to ctr cf **
  pure (
    Seq.length s == Seq.length s0 /\
    sorted s /\
    permutation s0 s /\
    complexity_bounded cf (reveal c0) (SZ.v len))
```

The postcondition of this function says that the output sequence s is a sorted permutation of the input sequence s0, and that a ghost variable counting the number of operations in insertion_sort is bounded by the expression complexity_bounded. (Ghost variables exist only for specification and proof and do not influence the runtime behavior of the program.)
Now, this looks like a reasonable specification of a sorting function, but are
the sorted and permutation predicates accurate, and is complexity_bounded
a tight complexity bound for insertion sort? To be entirely sure, one should
read these specifications carefully and form an opinion. But, as we’ll see
shortly, even a careful reading can miss some subtle aspects.
A SPOT for insertion sort is available here and excerpted below:
```
(* Spec validation: insertion_sort on [3;1;2] produces [1;2;3].
   Returns bool with runtime checks that survive extraction to C. *)
fn test_insertion_sort_3 ()
returns r: bool
ensures pure (r == true)
{
  // Input: [3; 1; 2]
  let arr = alloc 0 3sz;
  arr.(0sz) <- 3;
  arr.(1sz) <- 1;
  arr.(2sz) <- 2;
  // Create ghost counter (starts at 0)
  let ctr = Ghost.alloc #nat 0;
  insertion_sort arr 3sz ctr; //call the verified library
  //a lemma proving that a sorted permutation of [3;1;2] is [1;2;3]
  completeness_sort3 (value_of arr);
  // Verify complexity bound: at most n*(n-1)/2 = 3 comparisons
  assert (pure (value_of ctr <= 3));
  // Runtime check (survives extraction to C)
  let pass = (arr.(0sz) = 1) && (arr.(1sz) = 2) && (arr.(2sz) = 3);
  // Cleanup
  Ghost.free ctr;
  free arr;
  pass //proves that pass = true
}
```

The test allocates an array arr with [3;1;2] and a ghost counter ctr for complexity tracking, calls insertion_sort, and then proves afterwards that the array contains [1;2;3] and that value_of ctr <= 3. The check on the array contents is executable and runs in the concrete C code produced by Pulse, but the check on ctr is just a proof artifact---ctr is ghost and does not exist at runtime.
This provides good evidence that the specification of insertion_sort is precise and usable to prove a client program correct. The proof is not entirely trivial: the completeness_sort3 lemma proves that the sorted permutation of [3;1;2] can only be [1;2;3], which takes a few lines to do. Though it’s time-consuming, writing small proof-oriented tests is a standard part of proof-engineering practice, and, perhaps unsurprisingly by now, agents can easily write SPOTs too.
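To make the flavor of such a test concrete outside of Pulse, here is a minimal Python sketch of the same check: an insertion sort instrumented with a comparison counter, exercised on the same [3;1;2] instance. The function and counter here are illustrative, not part of AutoCLRS; an ordinary runtime assertion stands in for the verified proof obligation.

```python
def insertion_sort(a):
    """Sort list a in place; return the number of comparisons performed."""
    comparisons = 0
    for i in range(1, len(a)):
        key = a[i]
        j = i - 1
        # Shift larger elements right until key's position is found.
        while j >= 0:
            comparisons += 1
            if a[j] > key:
                a[j + 1] = a[j]
                j -= 1
            else:
                break
        a[j + 1] = key
    return comparisons

# Concrete SPOT-style check on a 3-element instance.
arr = [3, 1, 2]
count = insertion_sort(arr)
assert arr == [1, 2, 3]           # a sorted permutation of the input
assert count <= 3 * (3 - 1) // 2  # at most n*(n-1)/2 comparisons
```

The difference, of course, is that the Pulse SPOT discharges these assertions statically, for every execution, rather than checking one run.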
SPOTs for AutoCLRS
As discussed in our previous post, a big part of what remains to be done in AutoCLRS is a scalable way to review specifications and find gaps in them. Using agents to write SPOTs has been an effective way to do this. Our first run of SPOTs revealed that 8 algorithms had interesting specification gaps, as shown below.
| Algorithm | Key Spec Gaps |
|---|---|
| Hash Table (Ch11) — ⚠️ Weak insert | Insert postcondition doesn’t guarantee success |
| BST Array (Ch12) — ❌ Too weak | No reachability; insert→search broken |
| Union-Find (Ch21) — ⚠️ Moderate | Rank bound not preserved in postcondition |
| BFS (Ch22) — ⚠️ Weak | No shortest-path optimality in postcondition |
| DFS (Ch22) — ⚠️ Weak | Spec↔Impl disconnect; theorems not exposed |
| Kruskal (Ch23) — ⚠️ Weak | No spanning tree or MST property |
| Prim (Ch23) — ❌ Critical | Postcondition admits incorrect outputs |
| Bellman-Ford (Ch24) — ⚠️ Moderate | Correctness conditional on runtime boolean |
The remaining 44 algorithms were deemed to already have sufficiently precise specifications, and we have reviewed the tests and found them to be useful. Still, it would be hasty to treat this as a definitive assessment of precision---SPOTs are small and do not necessarily cover all the aspects of a specification, so it’s quite possible that some gaps were missed.
Others have observed that writing provable tests is a useful way to evaluate specifications (e.g., [1] and [2]), but writing proofs of tests at scale has been a barrier to using this approach in practice. Now, with agents authoring SPOTs, we can use them in an automated feedback loop to spot and fix specification gaps at scale.
Let’s look in detail at a couple of the algorithms with specification gaps, and how they were fixed.
Hash Table: Under-specified Error Conditions
Chapter 11 of CLRS defines a linear-probing hash table, implemented in a single
fixed-size array: when a collision occurs, the table searches for the next
available slot “further down” the array. This means that the hash table can run
out of space and insertions can fail, so insert returns a boolean result
indicating whether or not the insertion succeeded. An initial version of
insert had a precise specification in case result=true, and also proved that
the hash table is unchanged when result=false. But, it did not guarantee when
result=false would occur, so the specification allowed the implementation to
always return false, even when the hash table actually still had space.
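To see why this weak postcondition is a real gap, here is a small Python sketch (with hypothetical names, not the AutoCLRS API): a linear-probing insert alongside a deliberately lazy variant that always fails. Both satisfy "the table is unchanged when the result is false", but only the strengthened property, that insert fails only when the table is genuinely full, tells them apart.

```python
# Hypothetical linear-probing table; names are illustrative, not AutoCLRS's API.
EMPTY = None

def hash_insert(table, key):
    """Linear probing: try slots h(key), h(key)+1, ...; True on success."""
    n = len(table)
    start = key % n
    for i in range(n):
        slot = (start + i) % n
        if table[slot] is EMPTY or table[slot] == key:
            table[slot] = key
            return True
    return False  # every slot probed was occupied: the table is full

def lazy_insert(table, key):
    """Always fails, yet satisfies the weak spec: table unchanged on False."""
    return False

table = [EMPTY] * 3
# The weak postcondition cannot distinguish the two implementations on failure...
assert lazy_insert(table, 0) is False and table == [EMPTY] * 3
# ...but the strengthened spec does: insert must succeed while a slot is free.
assert hash_insert(table, 0) is True
assert hash_insert(table, 3) is True   # collides with 0, probes to next slot
assert hash_insert(table, 6) is True
assert hash_insert(table, 9) is False  # now genuinely full
assert not any(s is EMPTY for s in table)
```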
Such specification gaps (e.g., in describing the error behavior of APIs) are easy to miss in a manual review, but a SPOT for insert caught it. Here’s an excerpt of the SPOT; the comments are also produced by the agent.
```
fn test_insert_then_search ()
{
  let table = hash_table_create 3sz;
  // Insert key 0
  let b = hash_insert table 3sz 0 ctr;
  if b
  {
    // === Insert succeeded ===
    // From insert postcondition:
    //   key_in_table s' 3 0 /\ key_findable s' 3 0 /\ valid_ht s' 3
    // Search for key 0 — should find it
    let r = hash_search table 3sz 0 ctr;
    // Postcondition precision: search must find key 0.
    assert (pure (SZ.v r < 3));
    // Cleanup
    hash_table_free table;
  }
  else
  {
    // === Insert failed ===
    // Postcondition: s' == s (table unchanged)
    // This branch represents a spec weakness: the postcondition of hash_insert
    // does not guarantee success when empty slots are available.
    // Cleanup
    hash_table_free table;
  }
}
```

After finding and fixing this gap, a revised SPOT for insert is straight-line code like this:

```
fn test_insert_then_search ()
returns res: bool
ensures pure (res == true)
{
  let table = hash_table_create 3sz;
  // Insert key 0
  let b = hash_insert table 3sz 0 ctr;
  // Search for key 0 — must find it
  let r = hash_search table 3sz 0 ctr;
  // Compute runtime-observable result
  let ok = (b && (r <^ 3sz));
  // Cleanup
  hash_table_free table;
  ok
}
```

Prim: An Incomplete Proof
Prim’s algorithm for minimal spanning trees had a critical specification gap.
The original version of the algorithm had a postcondition prim_correct key_seq parent_seq weights_seq (SZ.v n) (SZ.v source), but using this predicate to conclude that the result of Prim is actually a minimal spanning tree required the use of a lemma, prim_result_is_mst, whose precondition was too strong to be useful.
An initial SPOT for Prim set up a small three-vertex triangle graph below and found that one could not actually prove that the output of Prim had the expected minimal spanning tree.
```
// --- Set up weight matrix for 3-vertex triangle ---
// Graph: 0 --1-- 1 --2-- 2
//        |               |
//        +------3--------+
```

The agents report:

```
// FINDING: prim_correct only captures array shapes and boundary
// values (key[source]=0, keys bounded, parent[source]=source).
// It does NOT capture any MST structural property (spanning,
// acyclic, minimum weight, correct parent relationships).
```

Fixing the proof of Prim required several rounds of agent iteration, but eventually the specification was strengthened to capture the necessary properties, and the proof was completed.
The revised SPOT for Prim contains the same triangle graph but can prove that the output tree contains only the edges from 0 to 1 and 1 to 2 and that the total weight of the tree is 3, as expected.
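For readers who want to check the expected answer by hand, here is a plain-Python sketch of Prim's algorithm run on the same triangle graph. The adjacency-matrix encoding and function names are illustrative assumptions, not AutoCLRS code, but key[source]=0 and parent[source]=source mirror the boundary values mentioned in the agent's finding.

```python
# Prim's algorithm on the 3-vertex triangle: edges 0-1 (w=1), 1-2 (w=2), 0-2 (w=3).
INF = float("inf")
weights = [[INF, 1, 3],
           [1, INF, 2],
           [3, 2, INF]]

def prim(weights, source):
    """Return parent[] encoding the MST rooted at source."""
    n = len(weights)
    key = [INF] * n          # cheapest known edge weight into each vertex
    parent = [None] * n
    key[source] = 0          # boundary values from the spec: key[source]=0,
    parent[source] = source  # parent[source]=source
    in_tree = [False] * n
    for _ in range(n):
        # Pick the cheapest vertex not yet in the tree (key is the list above).
        u = min((v for v in range(n) if not in_tree[v]), key=lambda v: key[v])
        in_tree[u] = True
        for v in range(n):
            if not in_tree[v] and weights[u][v] < key[v]:
                key[v] = weights[u][v]
                parent[v] = u
    return parent

parent = prim(weights, 0)
edges = {(min(v, parent[v]), max(v, parent[v])) for v in range(3) if v != 0}
assert edges == {(0, 1), (1, 2)}                     # the expected MST edges
assert sum(weights[u][v] for (u, v) in edges) == 3   # total weight 3
```

The revised SPOT proves exactly these two facts statically, for the verified Pulse implementation.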
Parting Thoughts
The interplay between testing and proving is a rich area of study, and using SPOTs to review agent-authored specifications is yet another example of testing and proving at work together. While SPOTs have been useful in finding and fixing specification gaps in AutoCLRS, we should also point out some limitations.
**Over-fitting Specifications to Tests** One obvious danger is for agents that author both the libraries and their SPOTs to over-fit the specifications to the tests, e.g., the specification of insertion_sort might only specify its behavior on 3-element lists. However, the kinds of specification gaps that we have observed in practice are not due to over-specialization but rather under-specification. A more systematic way to guard against over-fitting would be to have separate agents author the SPOTs and the specifications, and for the SPOTs to be held back (much like in programming assignments) from the proof-authoring agents. One could also have separate SPOT-reviewing agents, a “rubber-duck” feature that Copilot now makes available by default.
**Tasteful Specification** Although SPOTs are effective in finding specification gaps, they do not provide any guidance on the style of specification. Good specifications are abstract, concise, and elegant, matters of taste at which agents often fall short. One could perhaps get at some aspects of this by assessing the level of automation in the proofs of SPOTs. Good, clean specs usually lead to more concise proofs, and the proofs of SPOTs in AutoCLRS are often quite verbose, so there’s still lots of room for improvement.