Every layer of the software toolchain is changing at the same time. New spec formats, new generation strategies, new evaluation frameworks, and new feedback mechanisms show up weekly, each one supposedly the one that matters. Most teams respond by trying to pick winners.
Maybe I'm just old and too cranky to follow the trends day in and day out, but I think that's exactly wrong.
What you actually want is an architecture where independent tools and representations coexist at every phase. Competing, reinforcing, composable. A system that absorbs change without requiring you to place a bet. Because pipelines that lock in today's best options? Those become next quarter's migration projects. I've seen it happen enough times to stop being surprised by it.
That's what I mean by the generative stack.
The Full Pipeline
The generative stack covers every phase between a human's intent and a system's verified behavior. Not just specifications.
Spec inputs (markdown documents, conversations, diagrams, existing code, domain models) feed into structured clauses. Canonicalization structures those clauses into requirements and invariants. Requirements drive evaluations. Evaluations constrain implementation units. Implementation units become actual code. Running code produces feedback (eval results, runtime behavior, performance data) that flows back into the loop, refining every upstream layer.
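The flow above can be sketched as a handful of phase functions. Everything here is illustrative, not a real framework: the names `Clause`, `Requirement`, `canonicalize`, and `derive_evaluations` are hypothetical stand-ins for whatever tools occupy each phase.

```python
from dataclasses import dataclass

# Hypothetical types for the pipeline's phases; names are illustrative.
@dataclass
class Clause:
    source: str   # e.g. "markdown", "conversation", "existing-code"
    text: str

@dataclass
class Requirement:
    clause_ids: list   # upstream clauses this requirement derives from
    statement: str

@dataclass
class Implementation:
    behaviors: set     # behaviors the generated code actually exhibits

def canonicalize(clauses):
    """Structure raw clauses into requirements (trivially, one each here)."""
    return [Requirement(clause_ids=[i], statement=c.text)
            for i, c in enumerate(clauses)]

def derive_evaluations(requirements):
    """Each requirement becomes at least one behavioral check."""
    return [lambda impl, r=r: r.statement in impl.behaviors
            for r in requirements]

def feedback(impl, evaluations):
    """Run every evaluation; failing indices flow back upstream."""
    return [i for i, ev in enumerate(evaluations) if not ev(impl)]
```

The point of the sketch is the shape of the loop, not the contents: each function is a seam where independent tools can be swapped in or stacked up.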
Each of those phases is its own site of diversity. Each one can absorb independent tools, overlapping representations, several approaches. And the trick is you don't choose between them. You compose them.
At the spec layer, you might combine natural language descriptions, example interactions, and formal constraints. No single artifact captures the full intent. Together, they triangulate it.
At the clause layer, several structuring strategies (e.g. one tool extracting behavioral requirements from conversation, another identifying invariants from existing code) produce a reconciled set of requirements that's more complete because each tool parses intent differently.
At the evaluation layer, the same principle holds. I say evaluations rather than tests deliberately: these are behavioral contracts that outlive any particular implementation. Property-based checks, example-based specs, integration contracts, performance bounds, LLM-as-judge assessments. Each catches failures the others miss. A system verified by one evaluation strategy has blind spots. A system verified by five has fewer. (Not zero. Fewer.)
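One way to picture an additive suite, using sorting as a toy target. The strategy names are hypothetical, and an LLM-as-judge check would slot into `checks` the same way; nothing here is a real library.

```python
import random

# Hypothetical evaluation strategies composed into one additive suite.
# Each returns a list of failure messages; an empty list means it passed.

def property_check(sort_fn):
    """Property-based: output must equal the canonical sorted input."""
    failures = []
    for _ in range(50):
        xs = [random.randint(-100, 100) for _ in range(random.randint(0, 10))]
        if sort_fn(list(xs)) != sorted(xs):
            failures.append(f"wrong result for {xs}")
    return failures

def example_check(sort_fn):
    """Example-based: a few pinned input/output pairs."""
    cases = [([], []), ([3, 1, 2], [1, 2, 3]), ([5, 5], [5, 5])]
    return [f"failed on {inp}" for inp, want in cases
            if sort_fn(list(inp)) != want]

def run_suite(fn, checks):
    """New strategies get appended to `checks`; nothing is replaced."""
    return {check.__name__: check(fn) for check in checks}
```

Adding a third strategy is one more entry in the list passed to `run_suite`, which is the additive property the essay is arguing for.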
Diversity at each phase is the mechanism that catches what any single tool misses.
At the generation layer, different models and strategies have different strengths. One might produce cleaner abstractions. Another might handle edge cases more carefully. You don't have to pick one. Generate candidates from several approaches and let your evaluations arbitrate.
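Letting evaluations arbitrate among candidates might look like the sketch below. The two "strategies" are stand-ins for calls to different models; the checks and names are hypothetical.

```python
# Hypothetical: two generation strategies produce candidates for the same
# requirement ("maximum of a list"), and the evaluation suite arbitrates.

def strategy_a(spec):
    # Stand-in for model A: clean, but ignores the empty-input edge case.
    return lambda xs: max(xs)

def strategy_b(spec):
    # Stand-in for model B: also handles the empty-input edge case.
    return lambda xs: max(xs) if xs else None

def evaluations():
    return [
        lambda f: f([1, 9, 3]) == 9,   # example check
        lambda f: f([]) is None,       # edge-case contract
    ]

def arbitrate(spec, strategies):
    """Score each candidate against the full suite; highest score wins."""
    scored = []
    for gen in strategies:
        candidate = gen(spec)
        passed = 0
        for ev in evaluations():
            try:
                if ev(candidate):
                    passed += 1
            except Exception:
                pass  # a crash counts as a failed evaluation
        scored.append((passed, candidate))
    return max(scored, key=lambda t: t[0])
```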
At the feedback layer, runtime telemetry, eval results, and user behavior data all flow back upstream. Each signal illuminates different gaps. Eval failures point to broken implementations. Performance data reveals implementations that are correct but insufficient. And sometimes user behavior exposes the real problem: the specification itself was wrong. Independent feedback channels make the whole loop self-correcting in ways that any single channel can't.
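The routing described above can be made concrete with a small dispatch table. The signal kinds and layer names are hypothetical, chosen to mirror the examples in the paragraph.

```python
# Hypothetical routing of independent feedback signals to the upstream
# layer each one actually illuminates.

ROUTES = {
    "eval_failure":    "implementation",  # the code is broken
    "perf_regression": "implementation",  # correct but insufficient
    "user_behavior":   "specification",   # the intent itself was wrong
}

def route(signals):
    """Group incoming (kind, detail) signals by the layer they refine."""
    inbox = {}
    for kind, detail in signals:
        layer = ROUTES.get(kind, "triage")  # unknown signals go to a human
        inbox.setdefault(layer, []).append(detail)
    return inbox
```

Adding a new feedback channel means adding a `ROUTES` entry, not rewiring the loop.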
Every phase is additive. Every phase benefits from redundancy.
Why Redundancy Works Here
In traditional development, maintaining overlapping representations of the same concern was expensive. Two spec documents meant two things to keep in sync. Three test frameworks meant three things to maintain. The cost of reconciliation exceeded the benefit of coverage. We all learned this the hard way.
AI inverts that calculus.
When generation is cheap and reconciliation can be automated, the economics flip. You can afford to maintain a natural language spec and a formal constraint set and a suite of example interactions, not because any one of them is sufficient, but because an AI system can reconcile them faster than a human can maintain a single canonical version. The same applies at every other layer. Run multiple generation strategies and let evaluations pick the winner. Maintain independent evaluation approaches; their disagreements surface ambiguity you'd otherwise miss. Ingest diverse feedback signals and let canonicalization absorb them all.
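A minimal sketch of reconciliation, assuming each representation can be projected onto a shared key space. The key names are invented for illustration; the point is that disagreements are surfaced as questions rather than silently resolved.

```python
# Hypothetical reconciliation of two overlapping spec representations.
# Conflicts become explicit questions instead of silent overwrites.

def reconcile(rep_a, rep_b):
    """Merge two key->value views of the spec; disagreements are returned."""
    merged, conflicts = {}, []
    for key in rep_a.keys() | rep_b.keys():
        a, b = rep_a.get(key), rep_b.get(key)
        if a is not None and b is not None and a != b:
            conflicts.append((key, a, b))  # ambiguity a lone spec would hide
        else:
            merged[key] = a if a is not None else b
    return merged, conflicts
```

The `conflicts` list is the payoff: it is exactly the ambiguity a single canonical document would have papered over.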
The cost of redundancy drops. The value of redundancy stays the same, or increases, because blind spots don't overlap across independent approaches.
Fault-tolerant systems in every other engineering discipline depend on redundancy across independent channels. Flight control systems don't rely on one sensor. Distributed databases don't rely on one replica. The principle is old and boring and correct: if you want resilience, don't depend on a single representation of anything critical.
In a generative stack, every layer is critical. Specs, clauses, requirements, evaluations, generation strategies, feedback loops. A single point of fragility at any layer makes the whole pipeline brittle.
No Layer Is Settling
There probably isn't going to be a single dominant way to specify intent, structure requirements, generate code, evaluate correctness, or close the feedback loop. This has been the observable pattern across every layer of the software toolchain for decades. Build systems, test frameworks, deployment strategies, monitoring approaches; none of them converged to one winner. We keep thinking they will. They keep not doing it.
What's different now is degree, maybe not kind. Every layer between intent and implementation is becoming a site of rapid evolution simultaneously, and they're all getting better at different rates. Take evaluation alone. Two years ago, LLM-as-judge was a research curiosity. Today it's a production evaluation strategy that catches semantic regressions no unit test would surface. It coexists with property-based testing, which catches algebraic violations no LLM judge would notice. Neither replaces the other. Both arrived on different timelines. This is happening at every layer at once.
The stack keeps expanding. An architecture that assumes convergence will need to be rebuilt every time a layer lurches forward.
Designing for Absorption
If you build your generative pipeline around a specific tool at any layer (a particular spec format, a particular LLM, a particular evaluation framework), you've created a coupling that will fight you when that layer evolves. And it will evolve. I don't think I need to convince anyone of this at this point.
The alternative is to design each layer as a composition point. A place where inputs, tools, and representations can be added without replacing what's already there.
At the spec layer, accept multiple input formats without requiring a single canonical source of truth for intent. Canonicalization should structure requirements so new clauses from new sources don't invalidate existing ones. Evaluations work as an additive suite: new strategies strengthen the contract rather than replacing it. Generation stays model-agnostic, able to swap or combine approaches as capabilities shift. Feedback ingests signals from independent sources, routing them to the right upstream layer.
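A composition point can be as simple as a registry that tools plug into. This sketch is hypothetical: `Layer`, `register`, and the two lambda "tools" are illustrative, not a real API.

```python
# Hypothetical composition point: each layer is a registry of tools.
# Registering a new tool never removes or replaces an existing one.

class Layer:
    def __init__(self, name):
        self.name = name
        self.tools = []

    def register(self, tool):
        self.tools.append(tool)   # additive: nothing is subtracted
        return tool

    def run(self, artifact):
        """Every registered tool contributes; outputs are concatenated."""
        results = []
        for tool in self.tools:
            results.extend(tool(artifact))
        return results

spec_layer = Layer("spec")
spec_layer.register(lambda doc: [("markdown", doc)])
# A new input format lands later; the existing tool is untouched.
spec_layer.register(lambda doc: [("conversation", doc.upper())])
```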
Each layer becomes a site where you can add without subtracting, where new tools strengthen the pipeline instead of forcing you to rip out what's already working. This is possible because the layers and their interfaces are well defined and stable; the specific tools and techniques inside each layer are encapsulated.
This is what makes regeneration robust. If your regeneration pipeline depends on one tool at each layer, it's as fragile as the weakest tool. If each layer draws from overlapping representations and approaches, regeneration tolerates the failure or obsolescence of any single component.
You're no longer depending on one interpretation being correct at any layer. Which, if you think about it, was always a pretty optimistic assumption.
You don't optimize the generative stack once. You cultivate it. Every phase absorbs the next tool, the next representation, the next approach that hasn't been invented yet. Convergence is a bet against the entire history of software tooling. I wouldn't take that bet.