JumpFlow AI

The AI software factory: what it is, why it matters, how I build it

AI gets you fast code. A factory around the AI gets you safe, reliable, secure code that actually ships. Here's the five-stage delivery system I run on my products today, and the risk-tier pass I'm rolling out next.

Figure: AI Software Factory. Safe, reliable, secure, and fast AI development across the understand, research, plan, develop, and review stages.

Where this comes from

I have mentioned the idea of an AI software factory in a few places without ever going into detail. This is the detail.

It is not a manifesto, and I am not claiming it is the way to do this. It is the way I have come to think about delivering software with AI on my products. What I run today, what I think works, and what I am about to try next. Take the bits that are useful, ignore the rest.

The need, as I see it

The way I think about it: software produced with AI has to be safe, reliable, secure, and fast to ship. All four. Lose any one, in my experience, and the rest stops being commercially useful. Fast and unsafe is just incident generation. Safe and slow is a competitor's opportunity.

That is the bar I try to clear. The model on its own does not get me there. What does, for me, is the system around the model. I call that system the AI software factory.

Most of what follows — the five-stage flow, the agent tooling, the review loop, the deterministic gates — is in place on my line today. Risk tiers are the piece I am introducing next, and I will call that out where it appears.

How I describe it

A structured software delivery system that uses AI inside a controlled engineering process.

An assembly line, not a single tool. Raw materials in (code, documentation, requirements). Validated change out. Staged production, inspection points, deterministic gates at the end. Continuously tuned around the specific business, its risk profile, its systems, and its people.

A bank, a startup, a healthcare provider, and a SaaS company probably should not run the same line. I think of the Factory as a pattern. The implementation is shaped to context — which is most of the work we do under JumpFlow services.

What I do not mean by it

A few things worth being explicit about, because I have seen the metaphor get misread.

Not a coding bot. A coding bot optimises for generating code. What I am after is getting validated change into production.

Not "more AI". The goal, the way I think about it, is fewer incidents, safer releases, predictable delivery, and better use of engineering time. I do not think AI earns its place on the line unless it moves one of those.

Not a replacement for engineering discipline. TDD, code review, rollback planning, and deterministic checks are all still required. The Factory just makes them cheaper for me to apply consistently. It takes top engineers to build and evolve the factory, so they remain critical.

Not finished once built. A factory is a product. It gets tuned forever. Bottlenecks removed, checks refined, tooling upgraded.

Not "access to the best model". Every competitor can buy the same models from OpenAI, Anthropic, or Google. Model access feels like a commodity to me. It is not, on its own, a durable edge.

Why I think it pays off

Speed, measured end-to-end

The clock that matters, in my view, is elapsed delivery time. Request raised to validated change live. Typing is a small slice of that clock. The larger slices, at least where I have looked, are:

  • waiting for a human to be free (PM, BA, getting work into sprints)
  • context-switching back into a problem
  • POCs and estimation to get to delivery
  • deciding whether a change is safe
  • gathering evidence for review
  • coordinating across people

The Factory compresses those slices by running planning, coding, testing, review, and evidence-gathering in parallel across multiple work streams — including while the developer is in meetings, solving another problem, or asleep. The throughput gain comes from more concurrent, validated change, not faster keystrokes.

The metrics that matter

System throughput, measured directly:

  • lead time for validated changes
  • change failure rate and rework volume
  • streams of work in flight per engineer
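
As a rough illustration, the three measures above could be computed from a simple log of changes. This is a minimal sketch: the `Change` record and its field names are assumptions made for the example, not a schema from my line.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass(frozen=True)
class Change:
    """One change on the line. The record shape is hypothetical."""
    engineer: str
    requested_at: datetime
    live_at: Optional[datetime] = None  # None while the change is still in flight
    caused_incident: bool = False       # True if it failed or needed rework once live


def median_lead_time(changes: list[Change]) -> timedelta:
    """Median elapsed time from request raised to validated change live."""
    durations = [c.live_at - c.requested_at for c in changes if c.live_at is not None]
    return median(durations)  # raises StatisticsError if nothing has shipped yet


def change_failure_rate(changes: list[Change]) -> float:
    """Share of shipped changes that caused an incident or rework."""
    shipped = [c for c in changes if c.live_at is not None]
    if not shipped:
        return 0.0
    return sum(c.caused_incident for c in shipped) / len(shipped)


def streams_in_flight_per_engineer(changes: list[Change]) -> float:
    """Open changes divided by the number of engineers seen in the data."""
    engineers = {c.engineer for c in changes}
    if not engineers:
        return 0.0
    in_flight = sum(1 for c in changes if c.live_at is None)
    return in_flight / len(engineers)
```

Note that lead time here runs from the request being raised, not from the first commit, which matches the elapsed-delivery clock described earlier.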

Audience-by-audience framing

CTO. Predictable delivery, lower change failure rate, more streams in flight per engineer.

CISO. Deterministic security checks at the end of every line, an evidence trail per change, agent actions auditable.

COO or business owner. Faster time from request to live, fewer incidents, fewer late surprises. If you want to put numbers on that for your own situation, our ROI calculator uses the same back-of-envelope model I use with clients.

Engineering team. Less time on toil, more time on judgement, agents as contributors with tooling.

My rule of thumb: if a layer of AI does not move one of these, I do not think it has earned its place on the line.

The high-level design (what I run today)

Five stages, run in order. To me this looks more like mature engineering than a replacement for it. This part of the line is in production for me.

Understand → Research → Plan → Develop → Review
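
One way to picture the flow is as an ordered pipeline in which each stage has to hand a named artifact to the next. The sketch below is illustrative only: the artifact names (`brief`, `findings`, and so on) and the `run_line` helper are assumptions, and each handler stands in for whatever agent or person does that stage.

```python
from typing import Callable

# Each stage and the artifact it must hand to the next one.
# The artifact names are illustrative assumptions, not a real schema.
STAGES = [
    ("understand", "brief"),
    ("research", "findings"),
    ("plan", "plan"),
    ("develop", "change"),
    ("review", "verdict"),
]

Handler = Callable[[dict], dict]


class StageFailed(Exception):
    """Raised when a stage does not produce the artifact the next stage needs."""


def run_line(request: str, handlers: dict[str, Handler]) -> dict:
    """Run the five stages in order, halting at the first missing artifact."""
    work = {"request": request}
    for stage, artifact in STAGES:
        work = handlers[stage](dict(work))  # each stage receives its own copy
        if not work.get(artifact):
            raise StageFailed(f"{stage} produced no {artifact}")
    return work
```

Halting on a missing artifact is the point of the ordering: a change with no plan never reaches the develop stage.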

1. Understand

Interpret the request properly before anything else. Scope, constraints, business purpose, affected users, success measures. Most rework starts with a misunderstood request. The Factory front-loads this so the rest of the line is not building the wrong thing efficiently.

2. Research

Inspect the surrounding system before proposing changes. Codebase, architecture, existing patterns, integrations, prior decisions, relevant documentation. Coding agents do this well if your patterns and decisions live where they can see them (markdown in the repo, accessible by MCP). A change that ignores prior decisions creates inconsistency, which creates rework, which destroys the speed argument.

3. Plan

Define the implementation path before coding begins. Files likely to change, risks, rollback considerations, testing strategy, verification steps, deployment shape. Plans are cheap to revise. Code is not. I use a combination of Speckit/GSD and open, conversational planning sessions, with the resulting plans documented in markdown in the repo. Every plan is researched and its theories are proven before implementation starts, so we burn the tokens here rather than on wasted implementation rework.

4. Develop

This is where many teams focus first, but it is only one stage. Inside it, the strongest results come from disciplined sub-steps.

Keep context small. Planning will have split work into narrow units. API contract, validation, UI state, tests. Features like subagents pay huge dividends here when executed against a great plan.

Test-driven where practical. As AI speeds up coding, tests become more important, not less. They are the control system that makes higher throughput safe. TDD also has the magical property of making sure code is written in a testable way from day 0.

Build first, refactor second. One pass to get it working. A separate pass for clarity, reuse, and consistency. Each pass with a single objective. This is a huge unlock. AI agents are far faster in my opinion when you disconnect these steps.

Give agents proper tools. Logs, shell, database access, test runners. Playwright for visual exploration and front-end tests. Mobile automation where the work needs it. Docker and seed scripts to test and run systems with known states. Without tooling, agents are limited observers. With it, they verify their own work. These things were commercially difficult to justify investing in once a product was live, so we devs mostly lived without them, but they are now accessible and critical.

Run the system end-to-end. I use Docker to spin up services and databases locally in a controlled way, with cleanup jobs that remove containers agents forget to tidy up themselves. Seeded, repeatable. If the change cannot be run, it cannot be verified.
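
A cleanup job of the kind mentioned above might look something like this. It is a sketch under stated assumptions: the `factory.agent` label, the two-hour cutoff, and the `Container` record are all hypothetical, and the container listing itself (for example from `docker ps --all`) is left out. The only Docker call assumed is `docker rm` with its `--force` and `--volumes` flags.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

AGENT_LABEL = "factory.agent"  # hypothetical label set when an agent starts a container


@dataclass
class Container:
    """Minimal view of a container, however the listing was obtained."""
    id: str
    created_at: datetime
    labels: dict = field(default_factory=dict)


def stale_container_ids(containers, now, max_age=timedelta(hours=2)):
    """IDs of agent-started containers older than max_age (the cutoff is an assumption)."""
    return [
        c.id
        for c in containers
        if c.labels.get(AGENT_LABEL) == "true" and now - c.created_at > max_age
    ]


def removal_command(ids):
    """Docker CLI call that force-removes the containers and their anonymous volumes."""
    if not ids:
        return []
    return ["docker", "rm", "--force", "--volumes", *ids]
```

Filtering on a label keeps the job away from containers a developer started by hand, which matters when agents and people share a machine.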

Finish with deterministic checks. SonarQube for SAST, code quality, and complexity. Dependabot for package vulnerability checks, though Snyk is a strong alternative. Linters and standard unit test runners on every change, plus type checking and coverage. AI assists. Deterministic tools remain the gate.
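
The gate itself can be very plain: run each tool, look only at its exit code, and refuse to pass unless every one succeeded. A minimal sketch follows; the check names are placeholders and the commands are stand-ins so the example runs anywhere. On a real line they would be the project's own linter, test runner, type checker, and scanner invocations.

```python
import subprocess
import sys


def run_gate(checks: dict[str, list[str]]) -> dict[str, bool]:
    """Run every check and record pass or fail by exit code. Nothing is skipped."""
    results = {}
    for name, command in checks.items():
        completed = subprocess.run(command, capture_output=True, text=True)
        results[name] = completed.returncode == 0
    return results


def gate_passes(results: dict[str, bool]) -> bool:
    """The gate is open only if at least one check ran and all of them passed."""
    return bool(results) and all(results.values())


# Stand-in commands (assumptions for the example, not real tool invocations).
CHECKS = {
    "lint": [sys.executable, "-c", "raise SystemExit(0)"],
    "unit-tests": [sys.executable, "-c", "raise SystemExit(0)"],
    "types": [sys.executable, "-c", "raise SystemExit(1)"],  # simulated failure
}
```

Running every check even after a failure is a deliberate choice here: the implementer, human or agent, gets the full list of problems in one pass instead of one at a time.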

5. Review

Once the work is complete and a PR is raised, a separate model or reviewer inspects it. Comments go back to the implementer, sensible fixes are applied, and then a human reviews. The separation between implementer and reviewer matters, even when both are agents.

Specialise the reviews. Security, standards, design and user experience. A single generic review pass is weaker than several focused ones.

Reserve human attention for the questions machines do not answer well. Does this solve the real problem? Is anything risky or surprising? Would a customer understand this? Is the trade-off sensible?
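
The review stage described above, with separate and specialised passes before a human looks, could be sketched like this. Everything named here (`Finding`, `Reviewer`, `run_reviews`, and the toy security check) is hypothetical; a real pass would call a model or a person in place of the string match.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Finding:
    focus: str
    message: str
    blocking: bool = False


@dataclass(frozen=True)
class Reviewer:
    name: str    # which model or person is reviewing
    focus: str   # for example "security", "standards", "ux"
    check: Callable[[str], list]  # takes the diff text, returns Finding objects


def run_reviews(diff: str, implementer: str, reviewers: list) -> list:
    """Run each focused pass, refusing any reviewer that also wrote the change."""
    findings = []
    for reviewer in reviewers:
        if reviewer.name == implementer:
            raise ValueError(f"{reviewer.name} cannot review its own change")
        findings.extend(reviewer.check(diff))
    return findings


def ready_for_human(findings: list) -> bool:
    """Human review starts only once no blocking findings remain."""
    return not any(f.blocking for f in findings)


def toy_security_check(diff: str) -> list:
    """Placeholder for a real security-focused review pass."""
    if "password =" in diff:
        return [Finding("security", "possible hard-coded credential", blocking=True)]
    return []
```

Blocking findings go back to the implementer first, so human attention is spent on the judgement questions and not on issues a machine already caught.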

Risk tiers — what I am trying next

Everything above is in place. The next iteration I am about to trial is risk tiers — gating the depth of process by the risk of the change, instead of running every change through the same controls.

The reasoning, as I see it. Applying maximum rigour to every change is not safety, it is waste. Applying minimum rigour to every change is not speed, it is incident generation. Today my line treats most changes the same. My hypothesis is that splitting it into tiered routes will lift throughput without lifting incidents. I will find out.

The model I am starting with:

  • Low risk. Copy, internal tools, isolated UI behind a flag. Build check and screenshot review.
  • Medium risk. New features, changes to existing user-facing flows. Tests, security scan, peer review.
  • High risk. Payments, auth, data migration, regulated surfaces. Full regression, adversarial review, staged rollout.

The tier is decided at planning time, not at PR time. Tiering after the work is done is not a tier model, it is a triage queue. By PR time the implementation choices are already locked in. Tier early, and the tier shapes the design, the evidence, and how the work is built.
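
To make the starting model concrete, here is one way the tier could be assigned from a plan and mapped to controls. Treat it as a sketch: the controls per tier come from the list above, but the area tags, the "highest risk wins" rule, and the default to medium for anything unrecognised are assumptions for the example.

```python
from enum import Enum


class Tier(Enum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


# Controls per tier, taken from the starting model described above.
CONTROLS = {
    Tier.LOW: ["build check", "screenshot review"],
    Tier.MEDIUM: ["tests", "security scan", "peer review"],
    Tier.HIGH: ["full regression", "adversarial review", "staged rollout"],
}

# Area tags are hypothetical; a plan would list the areas it touches.
HIGH_RISK_AREAS = {"payments", "auth", "data-migration", "regulated"}
LOW_RISK_AREAS = {"copy", "internal-tool", "flagged-ui"}


def tier_for(areas: set) -> Tier:
    """Highest risk wins; anything not clearly low risk defaults to medium."""
    if areas & HIGH_RISK_AREAS:
        return Tier.HIGH
    if areas and areas <= LOW_RISK_AREAS:
        return Tier.LOW
    return Tier.MEDIUM


def required_controls(areas: set) -> list:
    """Controls the plan has to budget for before any code is written."""
    return CONTROLS[tier_for(areas)]
```

Because `tier_for` takes the plan's areas as input and never looks at a diff, the tier is known before implementation starts, which is the property the paragraph above argues for.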

I will write up what I learn once it has been on the line for long enough to mean something.

What has to be in place

Three groups of things the line depends on.

Inputs. Existing code, reachable documentation, clear requirements. Input quality is the single biggest lever. Vague requests produce convincing-looking output that misses the mark.

Tooling for agents. Logs, shell, browser automation, mobile automation, database access, test runners. And a runnable system. Containers, seeded data, repeatable startup.

Confidence layers. I use Claude Code and its hooks to enforce execution of parts of the process. Git hooks work too, but I have found them a bit blunt. Screenshots, recordings, and walkthroughs provide the evidence. Reviewers should not have to take an agent's word for what happened.
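
As an illustration of enforcing a step with a hook, the sketch below shows the decision logic a pre-tool-use hook script might apply. It rests on assumptions I want to flag: that the hook receives a JSON event on stdin with `tool_name` and `tool_input` fields, and that exit code 2 blocks the tool call, which is my reading of the Claude Code hooks documentation and worth checking against the current docs. The rule itself (no `git commit` until checks have passed) and the marker-file idea are hypothetical.

```python
import json

ALLOW = 0
BLOCK = 2  # assumed to block the tool call, per my reading of the hooks docs


def decide(event_json: str, checks_passed: bool) -> tuple:
    """Return (exit_code, message) for one hook event."""
    event = json.loads(event_json)
    if event.get("tool_name") != "Bash":
        return ALLOW, ""
    command = event.get("tool_input", {}).get("command", "")
    if "git commit" in command and not checks_passed:
        return BLOCK, "Run the deterministic checks before committing."
    return ALLOW, ""


# As a hook script, the entry point would be roughly (paths are hypothetical):
#   code, message = decide(sys.stdin.read(), Path(".factory/checks-passed").exists())
#   print(message, file=sys.stderr)
#   sys.exit(code)
```

Keeping the decision in a pure function makes the rule easy to unit test without starting an agent session.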

A warning. Teams that rely on a single fragile shared staging environment slow down once AI increases output volume. Local, reproducible environments scale with throughput and are now critical. One shared bottleneck does not.

Where I think the edge sits

Most companies can buy the same models. Model access feels like a commodity to me. I do not think it produces a durable advantage on its own.

The edge, for my money, comes from the surrounding system. Clear inputs, staged delivery, the right tooling, the right review specialisations, the right risk-matched controls, and the discipline to keep tuning all of it. Products like CallRelay are what comes off this line — a small, focused tool that has to be safe, reliable, secure, and fast to ship, all at once.

I suspect the winners will not be the teams that generate the most code. They will be the teams that can reliably turn change into production outcomes.

So to sum it up

That is what I mean when I say AI software factory. A delivery system in which AI helps produce software that is safe, reliable, secure, and fast to ship. The model is one component. The system around it is, in my experience, where the actual work is.

If you are running something similar, or you are figuring out how to start, I would love to chat and learn from each other. The fastest way in is the contact page.