If deleting your codebase feels terrifying, your evaluations are insufficient. That's not a moral failure. It's a technical one—and in the age of AI-assisted development, it's an increasingly expensive one.

Here is the shift: language models have made code generation cheap. Not free, not perfect, but cheap enough that regenerating a service is often faster than understanding and modifying it. This changes what counts as a durable asset. Code isn't it. Code is now a materialized view of understanding—useful while current, disposable when stale.

The durable asset is the thing that lets you regenerate with confidence: evaluations that encode what the system must do, independent of how any particular implementation does it.

Code Is Cache

Traditional software culture treated code as the memory of the system. It encoded intent, explained decisions, and preserved behavior over time. Protecting it was rational because replacing it was expensive.

That expense has collapsed. When a model can produce working code from a description in minutes, the calculus inverts. Keeping code around "just in case" stops being wisdom and starts being hoarding. The implementation is a cache. It's a snapshot of your current understanding, useful for running in production, not precious in itself.

If you delete a codebase and can't confidently regenerate it, that's not a tragedy. It's a diagnosis. The problem wasn't the deletion. The problem was that nothing important lived outside the code. The intent, the constraints, the behavioral requirements were all implicit in the implementation rather than explicit in artifacts that survive the implementation's death.

The Spectrum of Test Durability

Most engineers, asked how they ensure a system works, answer "tests." But tests vary enormously in what they actually protect.

Consider a unit test that verifies a specific function's behavior by calling it with specific inputs and checking specific outputs. This test is coupled to that function's existence, its signature, its language. Rewrite the service in a different language and the test doesn't just fail. It can't run at all. The test's lifetime is bounded by the implementation's lifetime.
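
To make that concrete, here is a hypothetical sketch in Python: a small billing function and a unit test that names it directly. Nothing about the test is badly written; its subject is simply the implementation itself.

```python
# billing.py -- one specific implementation of a late-fee rule
def calculate_late_fee(days_overdue: int, balance: float) -> float:
    """Charge 0.5% of the balance per overdue day."""
    return round(balance * 0.005 * days_overdue, 2)

# test_billing.py -- an ephemeral test: it names this module, this function,
# this signature. Rewrite the service in another language and the test cannot
# even import its subject, let alone pass.
def test_late_fee_for_ten_days_overdue():
    assert calculate_late_fee(days_overdue=10, balance=200.00) == 10.00
```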

This isn't because the test was poorly written. Even exemplary TDD—testing behavior over structure, focusing on public interfaces—produces tests that assume the codebase continues to exist in the same language with the same entry points. That assumption was safe when reimplementation was rare. It's not safe when regeneration is routine.

The alternative is tests specified at a boundary that survives reimplementation:

Invariants are properties that hold regardless of implementation. "Balances never go negative." "Events maintain causal ordering." "Round-trip serialization is lossless." These can be verified against any implementation in any language.
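
A minimal sketch of what that can look like, with a hypothetical AccountAPI protocol standing in for whatever boundary your system actually exposes: the check replays events through the boundary and asserts the invariant, never touching internals.

```python
# Invariant check at the system boundary. AccountAPI and the event tuples are
# illustrative assumptions; any implementation, in any language, that satisfies
# this interface can be checked the same way.
from typing import Iterable, Protocol

class AccountAPI(Protocol):
    def deposit(self, account_id: str, cents: int) -> None: ...
    def withdraw(self, account_id: str, cents: int) -> None: ...
    def balance(self, account_id: str) -> int: ...

def check_balance_never_negative(api: AccountAPI, events: Iterable[tuple]) -> None:
    """Replay events against any implementation and assert the invariant holds."""
    for kind, account_id, cents in events:
        if kind == "deposit":
            api.deposit(account_id, cents)
        else:
            api.withdraw(account_id, cents)
        assert api.balance(account_id) >= 0, "invariant violated: negative balance"
```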

Contracts specify what crosses boundaries between components. If service A sends this shape, service B returns that shape. The contract survives reimplementation of either service.

Property-based tests verify behavioral properties across generated inputs. "Sorting is idempotent." "Encryption and decryption are inverses." These encode what must be true, not how to make it true.

End-to-end behavioral checks verify the system's observable outputs. Given this input, the system produces output in this class. The internal path doesn't matter.
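
For instance, a black-box check like the sketch below (the endpoint and payload are hypothetical) exercises the deployed system over HTTP and asserts only on the class of output it observes.

```python
# End-to-end behavioral check using the `requests` library against a
# hypothetical /quotes endpoint. It asserts on the class of output (a positive
# price in the expected currency), not on any internal path that produced it.
import requests

def test_quote_endpoint_returns_a_priced_quote():
    resp = requests.post(
        "https://staging.example.com/quotes",   # hypothetical deployment URL
        json={"sku": "WIDGET-1", "quantity": 3},
        timeout=5,
    )
    assert resp.status_code == 200
    body = resp.json()
    assert body["currency"] == "USD"
    assert body["total_cents"] > 0              # output class, not an exact value
```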

These are durable evaluations. They encode intent at a level of abstraction that outlives any particular implementation. A codebase can be deleted and regenerated; if these evaluations pass, the system still works.

The Cost of Durability

This is where most manifestos would stop: sell the destination, ignore the hike. But writing durable evaluations is hard. Genuinely hard. Harder than writing the code they specify.

Identifying true invariants requires deep domain understanding. Most systems have implicit invariants that no one has articulated. They're embedded in code that "just works" without anyone knowing exactly why. Extracting these invariants is archaeological work.

Property-based testing requires thinking in universals rather than examples. Instead of "when I call sort([3,1,2]), I get [1,2,3]," you must specify "for all lists, sorting produces a list with the same elements in non-decreasing order." This is a different mental motion than example-based testing, and most engineers haven't practiced it.
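
Written out with the Hypothesis library, that universal claim becomes an executable check. This is a sketch and assumes Hypothesis is installed; the property names are mine, not a standard.

```python
# For all integer lists: sorting is a permutation into non-decreasing order,
# and sorting an already-sorted list changes nothing (idempotence).
from collections import Counter
from hypothesis import given, strategies as st

@given(st.lists(st.integers()))
def test_sort_is_a_non_decreasing_permutation(xs):
    out = sorted(xs)
    assert Counter(out) == Counter(xs)                # same elements
    assert all(a <= b for a, b in zip(out, out[1:]))  # non-decreasing order
    assert sorted(out) == out                         # idempotent
```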

Formal contracts require precision that natural language resists. "The API returns user data" is not a contract. "The API returns a JSON object with fields id (string, non-empty), email (string, valid RFC 5322), and created_at (ISO 8601 timestamp)" is a contract. The gap between these is where bugs hide.
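
One way to close that gap is to write the contract in a machine-checkable form. The sketch below uses JSON Schema with the jsonschema library; the field names mirror the example above, and enforcing the email and date-time formats depends on jsonschema's optional format validators being installed.

```python
# The user-response contract as a JSON Schema. The schema itself is the
# contract; the helper just checks a payload against it.
import jsonschema

USER_RESPONSE_CONTRACT = {
    "type": "object",
    "required": ["id", "email", "created_at"],
    "properties": {
        "id": {"type": "string", "minLength": 1},
        "email": {"type": "string", "format": "email"},
        "created_at": {"type": "string", "format": "date-time"},
    },
}

def assert_meets_contract(payload: dict) -> None:
    """Raises jsonschema.ValidationError if the response violates the contract."""
    jsonschema.validate(payload, USER_RESPONSE_CONTRACT,
                        format_checker=jsonschema.FormatChecker())
```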

The investment is real. But the alternative (keeping code around because you're afraid to delete it, because nothing external specifies what it does) is also an investment. You pay it in cognitive load, in context-window costs, in the compounding complexity of systems that only grow.

The question isn't whether durable evaluations are expensive. The question is whether they're cheaper than the alternative. As regeneration gets cheaper, the answer increasingly favors evaluation.

Why the Boundary Matters

The distinction between ephemeral and durable tests reduces to one question: is the test specified at a boundary that survives reimplementation?

Tests against internal functions, private methods, specific call sequences: these are specified at the implementation boundary. They verify decisions that might change. Their lifetime is coupled to the implementation's lifetime.

Tests against inputs and outputs, observable behavior, interface contracts: these are specified at the system boundary. They verify obligations the system owes the outside world. The implementation can change completely as long as these obligations are met.

This isn't a new idea at all. Information hiding, API design, coupling versus cohesion: the software engineering literature has understood interface boundaries for fifty years. What's new is the economic weight. When regeneration was expensive, careful interface specification was good hygiene. When regeneration is cheap, it's the difference between systems that can evolve and systems that calcify.

A simple check: if reimplementing your service in a different language would invalidate your test suite, your tests are specified at the wrong boundary.

Monitoring as Continuous Evaluation

Even rigorous evaluations only verify intent at a point in time. They don't verify that production behavior matches intent continuously.

This matters more as regeneration frequency increases. Each regeneration is an opportunity for drift: subtle changes in behavior that pass every explicit check but diverge from the baseline in ways no one anticipated. Monitoring catches what tests miss.

The relevant signals include standard operational metrics (latency distributions, error rates, throughput) but also business metrics specific to each application: conversion rates, fraud detection accuracy, revenue per transaction, whatever invariants matter in your domain. And for AI-assisted systems, add inference cost per request, token usage patterns, and context window consumption. If a regenerated system passes all tests but doubles your API costs or quietly degrades decision quality, that's a failure your evaluations didn't catch.
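
As a sketch of what this can look like in code, the drift check below compares a window of production metrics against a recorded baseline. The metric names and the 20% tolerance are illustrative assumptions, not recommendations.

```python
# Live evaluation as a drift check: flag any tracked metric that moves more
# than a relative tolerance away from its recorded baseline.
from dataclasses import dataclass

@dataclass
class Baseline:
    p95_latency_ms: float
    error_rate: float
    cost_per_request_usd: float

def drift_report(baseline: Baseline, window: Baseline, tolerance: float = 0.20) -> list[str]:
    """Return the metrics that drifted more than `tolerance` (relative) from baseline."""
    drifted = []
    for field in ("p95_latency_ms", "error_rate", "cost_per_request_usd"):
        before, now = getattr(baseline, field), getattr(window, field)
        if before > 0 and abs(now - before) / before > tolerance:
            drifted.append(f"{field}: {before:.4g} -> {now:.4g}")
    return drifted

# A regenerated service that passes every test but more than doubles inference cost:
alerts = drift_report(
    Baseline(p95_latency_ms=180, error_rate=0.002, cost_per_request_usd=0.004),
    Baseline(p95_latency_ms=175, error_rate=0.002, cost_per_request_usd=0.009),
)
assert alerts == ["cost_per_request_usd: 0.004 -> 0.009"]
```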

Monitoring is not separate from evaluation. It's evaluation that runs continuously against reality rather than periodically against test fixtures.

The Real Codebase

Three tiers of evaluation, three lifetimes:

Ephemeral tests verify implementation decisions. Unit tests, structural assertions, mock-heavy integration tests. Useful during development, disposable when the implementation changes. Write them freely; delete them without guilt.

Durable evaluations verify behavioral intent. Property tests, contract tests, invariants, end-to-end checks. These survive reimplementation because they're specified at boundaries that survive reimplementation. They're expensive to write and worth the expense.

Live evaluations verify production reality. Monitoring, drift detection, anomaly alerts. These run continuously because intent and reality can diverge even when all explicit tests pass.

A system with only ephemeral tests cannot be safely regenerated. You don't know what behavior you're trying to preserve. A system with durable evaluations but no live evaluation will drift without warning. A system with all three can be deleted and rebuilt with confidence.

That confidence is the product. Code is a byproduct.

The real codebase is everything that lets you throw code away without fear: the properties that define correctness, the contracts that specify interfaces, the monitors that detect drift. If that set is empty, no amount of careful implementation will save you. If that set is rich, the implementation is just a detail, regenerable on demand, disposable without loss.

This is the promise of regenerative software. It requires investment in specification that most teams haven't made. It requires honesty about what your tests actually protect. And it requires accepting that the code you wrote yesterday might not exist tomorrow and that this is fine, because the behavior it encoded is preserved in artifacts that outlive it.