Something quiet has shifted in how engineering teams build with AI agents. A year ago, a feature began as a Jira ticket, became a design doc, became code, and the documentation lived a short life. Today, more teams are writing the specification first — in a structured form a coding agent can consume — and treating the spec, plan, tasks, and constitution as the primary artifacts. The code is generated downstream.
GitHub’s Spec Kit, Anthropic’s Skills, and the broader pattern of spec-as-source are reorganizing how teams ship. The promise is real. The new questions branch in two directions, depending on what’s being generated.
What Spec-Driven Development Actually Is
In Spec-Driven Development (SDD), the specification is the source of truth. A spec.md describes user stories and acceptance criteria. A plan.md derives technical decisions from the spec. A tasks.md decomposes the plan into orderable work items. A constitution.md defines the engineering principles that gate the whole thing. An AI coding agent reads these documents and generates the implementation. When requirements change, you change the spec — not the code — and regenerate.
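Concretely, a Spec Kit project typically lays these artifacts out something like this (a simplified sketch; exact paths and naming vary by Spec Kit version):

```
.specify/
└── memory/
    └── constitution.md    # project-wide engineering principles
specs/
└── 001-refund-flow/
    ├── spec.md            # user stories and acceptance criteria (the what)
    ├── plan.md            # stack and architecture decisions (the how)
    └── tasks.md           # ordered, dependency-aware work items
```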
The shift isn’t just a workflow change. It’s a change in what code is. Code becomes the expression of the spec in a particular language and framework, the way assembly is the expression of C. The artifact you maintain is the spec.
Why Teams Are Adopting It
Three reasons keep coming up:
The gap between requirements and code closes. For decades, the PRD and the implementation tended to drift. SDD lets one generate the other, so pivots become regenerations rather than rewrites.
AI coding agents thrive on structured input. “Build a refund flow” gives an agent enormous latitude. A spec with explicit acceptance scenarios, a plan with constitution-checked architectural choices, and a tasks list with dependency ordering removes that latitude — and the variance in output that comes with it.
Onboarding gets faster. A new engineer reads the spec, plan, and tasks for a feature and understands not just what the team built but why. The reasoning lives in the artifact, not in Slack.
These benefits are real. They all reward rigor in the spec, and the level of rigor that pays off depends on what the spec is generating.
Scenario 1: When the Generated Code Is Conventional Software
Most teams adopting SDD today are using it to generate ordinary application code — a refund flow, a CRUD service, a data pipeline. In this scenario, the practice rewards spec quality. Five patterns are worth planning for.
Acceptance criteria that travel
A spec that lists user stories without measurable acceptance criteria reads well during the kickoff and produces code that matches the developer’s current mental model — not a documented one. Six months later, when the agent regenerates the same feature for a model upgrade or framework migration, the regeneration can drift from the original behavior, because the original behavior was never written down.
The opportunity is mechanical: every user story benefits from Given/When/Then-style acceptance scenarios or an enumerated Definition of Done. Spec Kit’s templates make this the default. When teams override the templates, this is the first place to bring the discipline back in.
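As a sketch, a story with criteria that travel might look like the block below; the story, the refund window, and the timing threshold are placeholders, not recommendations:

```markdown
## User Story 2: Customer requests a refund

**Acceptance Scenarios**

- Given a delivered order less than 30 days old,
  When the customer requests a refund,
  Then the refund is issued to the original payment method
  and a confirmation is sent within 5 minutes.

- Given a delivered order more than 30 days old,
  When the customer requests a refund,
  Then the request is routed to a human reviewer with the order history attached.
```

A regeneration six months from now has something concrete to preserve, and a test suite has something concrete to assert.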
Ambiguity that gets resolved on purpose
Spec Kit ships a structured marker, NEEDS CLARIFICATION, and a /speckit.clarify command that turns those markers into open questions an engineer resolves before the spec advances. Teams that adopt the templates and lean into this loop catch ambiguity early. Teams that don’t can find their specs accumulating prose like “we should figure this out,” “TBD,” or “the team will align later”: phrasing that reads fine to a human and gives the coding agent nothing to anchor on.
The opportunity here is process: treat ambiguity markers as PR-blocking, the way you’d treat a failing test. Specs become more decisive, and the generated code follows.
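In a spec, a marked line looks something like this (the bracketed form mirrors Spec Kit’s templates; the refund window is a placeholder):

```markdown
- Customers can request refunds for recent orders.
  [NEEDS CLARIFICATION: refund window is undefined. 30, 60, or 90 days?]
```

Because the marker is a single grep-able token, the gate can be as simple as a CI step that fails the build whenever the string appears anywhere under specs/.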
Constitutions that operationalize
A constitution.md encodes the project’s non-negotiable principles — the things every plan and every PR must satisfy. The most useful constitutions read like checks a coding agent can verify. “Every PR must include integration tests for new public API surface” is a principle a tool can confirm. “We value simplicity” is a principle a team can share but a tool cannot enforce.
Both kinds of statements have a place: values shape culture, principles shape gates. The opportunity is to make sure each constitution has at least one operationalizable principle per dimension the team cares about: testing, documentation, observability, error handling, dependency hygiene. The agent generating code can then keep the constitution honored without anyone having to re-explain what the words meant.
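A sketch of what that can look like in a constitution.md, one verifiable principle per dimension (the principles are illustrative, not a recommended set):

```markdown
## Testing
- Every PR that adds public API surface MUST include integration tests
  covering the new endpoints.

## Observability
- Every outbound call MUST be logged with a request ID and its latency.

## Dependency hygiene
- Every new runtime dependency MUST be named in plan.md with a one-line
  justification before the task that introduces it lands.
```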
Plans that stay aligned with implementation
A plan.md will declare that storage is PostgreSQL, the framework is FastAPI, and the testing approach is pytest. Six iterations later, the implementation may have evolved one or two of those without the plan catching up. The next regeneration — driven by the original plan — will reach for the original choices.
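The drift is easy to picture. A plan written at iteration one and never revisited might still read like this (the section name and the evolved stack choice are illustrative):

```markdown
## Technical Context
- Storage: PostgreSQL    <!-- moved to DynamoDB in iteration 4; plan never updated -->
- Framework: FastAPI
- Testing: pytest
```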
Keeping the plan synced with the implementation is part of the new discipline SDD asks for. It’s also the place where ARIAS adds the most leverage: the platform reads both, surfaces where they’ve diverged, and points at the section of the plan that needs an update before the next regeneration runs.
Specs and code that stay paired
In code-first development, dead code is a known concept. SDD introduces a sibling: a spec that describes a feature no code references, or a module in the code with no spec backing it. These are natural side effects of an iterative workflow: features that got reshaped, modules that came in via dependency. Surfacing them keeps the spec credible: someone reading it should be able to trust that the code reflects it.
How ARIAS reads these artifacts
ARIAS treats .specify/, SKILL.md, agents.yaml, and .claude/agents/ as first-class artifacts on equal footing with code. When a Spec Kit project is detected, the platform surfaces the same kinds of gaps a careful spec review would: user stories that would benefit from acceptance scenarios, ambiguity markers waiting to be resolved, constitutions that could use a few more operationalizable principles, plans without a paired tasks.md, specs and implementations that have drifted out of pairing.
Each finding names a specific file, a specific section, and the change that would round it out. “spec.md user story #2 has no acceptance scenarios — add a Given/When/Then block before the next regeneration” is the kind of finding an engineer can act on.
These findings carry the rigor of SDD a long way when the output is conventional code. When the output is an agent, they stay useful, and a second layer joins them.
Scenario 2: When the Generated Code Is an AI Agent Itself
Coding agents are increasingly being asked to write other agents. A spec that says “build a customer-support agent with refund capability” gets handed to a coding agent, which produces an agent definition: a system prompt, a tool registration, a memory configuration, an orchestration pattern. Everything from Scenario 1 still applies. A new layer joins it, because the generated artifact is itself an autonomous system.
The decisions the spec doesn’t make, the coding agent makes by default
In conventional code generation, an unspecified detail produces a small concrete choice — what HTTP framework to use, what name to give a variable. In agent generation, an unspecified detail produces an ambient default. Most specs talk about features. They less often enumerate the autonomy level (L1 human-in-the-loop versus L5 fully autonomous), the maximum number of tool calls per task, the timeout on long-running operations, the memory strategy, or the observability hooks. The coding agent picks something — usually whatever the framework’s getting-started example does.
The opportunity is to be intentional. We’ve written about autonomy levels and architecture patterns in detail; the relevant point here is that the spec is the natural place to make these choices, and a few additional sections in the user-story template can make them part of the standard rhythm.
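A sketch of what those sections can look like (the L-scale labels follow the autonomy levels mentioned above; every number is a placeholder meant to be argued over, which is the point):

```markdown
## Autonomy & Bounds
- Autonomy level: L2 (agent proposes; a human approves destructive actions)
- Max tool calls per task: 20
- Timeout on long-running operations: 120 seconds
- On repeated failure: escalate to a human after 2 attempts
```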
The negative space
A spec describes what the agent should do. It rarely describes what the agent should refuse. When a coding agent generates a system prompt from a spec, it generates the positive instructions. The refusal scope — “do not delete, drop, or destroy anything in production,” “do not move data outside the configured tenant,” “do not make irreversible changes without explicit human approval” — joins the prompt only when the spec enumerates it.
We discussed this pattern at length in the post on the PocketOS-style nine-second incident. In code-first development, an engineer often adds the refusal scope by hand after reviewing the prompt. In SDD, the spec is the prompt’s ancestor, so the refusal scope wants a home in the spec.
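One way to give it that home is a dedicated section the coding agent is instructed to carry into the system prompt verbatim. The entries below echo the examples above; a real refusal scope would be longer and domain-specific:

```markdown
## Refusal Scope
The agent MUST refuse to:
- delete, drop, or destroy anything in production
- move data outside the configured tenant
- make irreversible changes without explicit human approval
When a request falls inside this scope, the agent states the refusal and
offers the nearest permitted alternative.
```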
Tool surface as a first-class spec concern
A spec might say “the agent should be able to issue refunds and look up order status.” A reasonable plan derives “the agent needs the payments API and the orders API.” A coding agent generating from that plan often takes the next step and registers the full SDK. The result is an agent with read-write-delete access to twenty endpoints when the spec asked for two.
The opportunity is to specify the tool surface at the operation level: issue_refund(order_id, amount), lookup_order_status(order_id). Operation-level specs translate directly into operation-level tool registrations, and the coding agent has a precise target instead of an SDK to wrap.
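In the spec, that can be as small as a table; the side-effect labels are illustrative, and the two operations come from the example above:

```markdown
## Tool Surface
| Operation                      | Access    | Side effect               |
|--------------------------------|-----------|---------------------------|
| issue_refund(order_id, amount) | write     | financial, approval-gated |
| lookup_order_status(order_id)  | read-only | none                      |

No other operations from the payments or orders SDKs may be registered.
```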
Memory and orchestration deserve a spec section
User stories don’t usually talk about whether the agent uses buffer memory, summary memory, or RAG. They don’t talk about iteration limits on autonomous loops. They don’t talk about whether the agent shares state with other agents in the same process. These are architectural decisions that show up in the generated code as defaults — and defaults shape how the agent behaves under load.
Adding short, dedicated sections to the spec template — “Memory Strategy,” “Orchestration Bounds,” “Observability Surface” — turns these from defaults into choices. The coding agent gets clearer targets, and the team gets a record of what was chosen and why.
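A sketch, assuming one reasonable set of choices (each line is a decision a team might legitimately make differently):

```markdown
## Memory Strategy
Summary memory within a session; no cross-session persistence.

## Orchestration Bounds
Single agent; max 10 iterations per autonomous loop; no shared state with
other agents in the same process.

## Observability Surface
Every tool call emits a structured event: tool name, arguments, result
status, latency.
```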
Constitutions that include agent-specific principles
A constitution authored for conventional code typically says “every module must have unit tests” or “every public API needs a contract test.” When the generated artifact is an agent, the constitution does its best work when it adds a few agent-shaped principles alongside the code-shaped ones: “every destructive tool registration must include an approval interceptor,” “every autonomous agent must declare a max-iterations bound,” “every tool registration must declare its side-effect class.”
This is a small addition in length and a meaningful addition in coverage. It also keeps the constitution honored by the next regeneration, not just the current one.
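Rendered as a constitution section, the three principles above might read like this (the side-effect classes are one illustrative taxonomy):

```markdown
## Agent Principles
- Every destructive tool registration MUST include an approval interceptor.
- Every autonomous agent MUST declare a max-iterations bound.
- Every tool registration MUST declare its side-effect class:
  read-only, reversible-write, or irreversible-write.
```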
Drift across two layers
When SDD generates conventional code, drift can happen between the spec and the code. When SDD generates an agent, there are two pairings to keep aligned: (spec → agent definition) and (agent definition → agent runtime behavior). A constitution principle gets relaxed; the next regeneration changes the agent’s prompt; the agent’s behavior shifts in production; each step in isolation looks routine.
The Agent Behavioral Fingerprint that ARIAS computes captures the agent’s goals, tools, memory, orchestration, and error posture as a structured record at each scan. When the input artifacts change — a constitution amendment, a spec rewording, an added tool in the plan — the resulting fingerprint shifts, and the platform surfaces the drift with a citation back to the file and section that changed. SDD doesn’t remove behavioral drift; it moves the source of drift upstream into prose. Detecting it benefits from a fingerprint that reads both the prose and the code.
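To make the idea concrete, a fingerprint can be pictured as a structured snapshot like the one below. This is an illustrative shape, not ARIAS’s actual format, and the values continue the refund-agent example:

```markdown
goals:          handle refund requests for delivered orders
tools:          issue_refund (approval-gated), lookup_order_status (read-only)
memory:         summary, session-scoped
orchestration:  single agent, max 10 iterations per loop
error posture:  escalate to a human after 2 failed attempts
```

A diff between two such snapshots, with a citation to the file and section that moved each value, is what drift detection looks like at this layer.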
How ARIAS reads agent-shaped specs
The same six-dimension assessment ARIAS applies to agent code — prompt engineering, agent design, memory architecture, orchestration soundness, observability, governance alignment — applies to the specification of an agent. The platform surfaces the gaps that turn a feature spec into a complete agent spec: tool surfaces registered at the SDK level rather than the operation level, missing refusal scopes, constitutions without agent-specific principles, plans that don’t yet name autonomy level or iteration bounds, generated agents that have drifted past the spec’s stated boundaries.
Each finding cites the file and section, names the constraint that would round it out, and proposes the change. The result is a spec that produces a more reliable agent — and a generated agent that can be certified against the spec it came from.
What This Means for the Way You Adopt SDD
The teams getting the most out of SDD are the ones that treat the spec the way they treat the code: written carefully, reviewed seriously, kept in sync with implementation. The level of rigor depends on what the spec is generating.
If the generated output is conventional code, the rigor that pays off is spec quality: clear acceptance criteria, resolved ambiguity, operationalizable constitutions, traceable plans. If the generated output is an AI agent, all of that still applies — and the spec also benefits from constraining dimensions that code-only specs typically don’t address: tool surface, refusal scope, autonomy level, memory, orchestration, observability.
ARIAS exists to bring that rigor before the regeneration runs. The same control plane reads your spec, your plan, your tasks, your constitution, and your generated code, and surfaces the gaps that round out each artifact. Whether the next thing you ship is a refund service or the agent that builds the refund service, the questions are similar. The answers belong in the spec.
ARIAS is the pre-production control plane for AI agents — and for the spec-driven workflows that produce them. We read your spec.md, plan.md, tasks.md, and constitution.md the same way we read your code, and surface the gaps before the next regeneration ships them. Start your free trial and run the assessment against your first spec-driven project.