2026-03-23

Building Kaze: A Production App in 27 Days with AI Agents

A safe AI chat companion for kids, 545 commits, 97%+ test coverage, and only 1.7% human-written code. This is the overview of the Kaze project and the beginning of a series exploring what it takes to build production software with AI agents.

~7 min read · ai-assisted-development, case-study, kaze, claude-code, engineering-process

I built a production Rails application in 27 days, and I wrote 1.7% of the code. AI agents handled the rest: 545 commits, 2,236 tests at 97%+ coverage, authentication, admin panel, compliance preparation, bilingual support. What surprised me was how much the job changed when I stopped typing code and started steering agents instead.

What Kaze Is

Kaze is a safe AI chat companion for kids ages 5-18. Think ChatGPT, but designed for children with parental oversight and a 4-layer safety architecture: input screening, built-in chat rules, output moderation, and parent review. Parents create accounts via magic links, add up to 5 child profiles, and each child connects with an 8-character code. No ads, no data sold, COPPA-aligned. The safety requirements shaped every architectural decision, from prompt construction to conversation storage.
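The four layers can be sketched roughly like this. This is an illustrative sketch only: the class, method, and rule names below are invented for the example, not Kaze's actual code.

```ruby
require "securerandom"

# Illustrative sketch of a 4-layer safety pipeline; names are invented,
# not Kaze's real implementation.
class SafetyPipeline
  Result = Struct.new(:allowed, :reason, keyword_init: true)

  # Placeholder pattern standing in for real input-screening rules.
  PERSONAL_INFO = /\b(address|phone number)\b/i

  # Layer 1: screen the child's message before it reaches the model.
  def screen_input(message)
    return Result.new(allowed: false, reason: :input_blocked) if message.match?(PERSONAL_INFO)

    Result.new(allowed: true)
  end

  # Layer 2: built-in chat rules, injected into the system prompt.
  def system_prompt(child_age)
    "You are a friendly companion for a #{child_age}-year-old. " \
      "Never ask for personal information. Keep answers age-appropriate."
  end

  # Layer 3: moderate the model's reply before the child sees it.
  def moderate_output(reply)
    return Result.new(allowed: false, reason: :output_blocked) if reply.match?(PERSONAL_INFO)

    Result.new(allowed: true)
  end

  # Layer 4: store every exchange so a parent can review it later.
  def record_for_review(log, child_id:, message:, reply:)
    log << { child_id: child_id, message: message, reply: reply, at: Time.now }
  end

  # The 8-character code a child uses to connect to a parent account.
  def self.connect_code
    SecureRandom.alphanumeric(8).upcase
  end
end
```

The point of the shape, not the rules: each layer is a small, independently testable step, which matters when the safety behavior has to be reviewed and audited rather than trusted.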

The codebase includes 34 models, 60 controllers, 157 views, ~23,000 lines of application code, and full English/Spanish bilingual support from day 2. Not a demo. Built entirely through AI agents operating under a structured engineering process (OpenUP).

The Stack: Boring Technology Wins

I picked Rails 8.1 with PostgreSQL because the AI writes better Rails than anything else I tried. Rails is extremely well represented in training data, so the AI produces idiomatic code and knows the ecosystem's conventions. The Solid Stack (SolidQueue, SolidCache, SolidCable) eliminated Redis entirely, and RubyLLM gave me provider-agnostic LLM integration across OpenAI, Anthropic, and Ollama.
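For a rough idea of what provider-agnostic setup looks like with RubyLLM: the configuration keys below follow the gem's README, but verify them against the docs for your version, and the model string is just a placeholder.

```ruby
# config/initializers/ruby_llm.rb — illustrative sketch; check RubyLLM's
# documentation for the exact configuration keys in your gem version.
RubyLLM.configure do |config|
  config.openai_api_key    = ENV["OPENAI_API_KEY"]
  config.anthropic_api_key = ENV["ANTHROPIC_API_KEY"]
end

# Elsewhere in the app: the same chat interface regardless of provider.
chat  = RubyLLM.chat(model: "claude-sonnet-4") # model name is a placeholder
reply = chat.ask("Tell me a kid-friendly fact about wind.")
```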

I experimented briefly with a more modern frontend setup early on. Wasted half a day before switching back. The AI's strength is producing correct code in well-established ecosystems. Boring technology is a feature.

The Numbers

545 total commits across 27 calendar days. AI co-authored 429 of them (78.7%). Merge commits account for 98 (18.0%), Dependabot handled 9 (1.7%), and human-only commits total 9 (1.7%). Those 9 human commits were all quick configuration changes, things faster to type directly than to describe to an agent. No feature code was written by me.

Three Claude models shared the work: Opus 4.6 handled 215 commits (39.4%) for complex architecture and compliance tasks, Sonnet 4.6 took 197 commits (36.1%) for standard features, and Haiku 4.5 covered 17 commits (3.1%) for quick fixes. I picked the model for each task the same way a tech lead assigns people, based on how hard the problem is.

What I Actually Did

Most of my time went into three roles: product owner, PR reviewer, and process designer. Sometimes all three in the same hour.

Product owner. Every feature started as my decision about what to build and why. The AI never decided what to build next. The most impactful decisions in Kaze were product decisions: what features to include, when to pause for quality work, when to address security mid-construction rather than at the end.

PR reviewer. 98 pull requests merged in 27 days, 3.6 per day. AI-generated code rarely has obvious bugs, which makes the subtle ones harder to catch. I got burned once by a PR that looked clean but missed the actual intent of the task. After that I started reviewing the plan before the code.

The role I care most about is process designer. I adapted OpenUP for AI agents, built 26 custom Claude Code skills, defined 5 agent team roles, and designed the documentation structure that serves as the AI's institutional memory. A good process makes everything else better, but it has to stay out of the way: running in the background like a supervisor, observing and taking notes to act on in the next iteration. If your process is solid, it does not matter much which model runs the task.

And then there is the quality gate work. The test trajectory tells the story: 0 tests on day 1, 69 by day 2, 371 by day 3, 1,067 by day 5 at 91.1% coverage. Then during a high-velocity sprint (admin panel, compliance, monetization), coverage dropped to 85.5% by day 16 even though the absolute test count kept growing. I stopped shipping features and ran a recovery effort: day 17, back to 97.9%. Final count: 2,236 tests at 97%+. The AI will not tell you coverage is slipping. You have to watch the numbers yourself.
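One way to make a coverage slip fail loudly instead of silently is a hard gate in the test helper. This is a generic SimpleCov sketch, not Kaze's actual configuration; the thresholds are illustrative.

```ruby
# test_helper.rb (or spec_helper.rb) — generic sketch, not Kaze's config.
require "simplecov"

SimpleCov.start "rails" do
  enable_coverage :branch      # track branch coverage as well as line coverage
  minimum_coverage 97          # fail the suite if total coverage drops below 97%
  minimum_coverage_by_file 80  # and guard against any single file rotting
end
```

With a gate like this, the high-velocity sprint that dragged coverage from 91% to 85% would have broken the build long before day 16.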

What Is Coming Next

This article is the overview. The rest of the series covers the structured process, the custom skills and agent roles, documentation as AI memory, a full metrics breakdown, and what went wrong along the way.

What I Am Still Figuring Out

For 30 years I have been a software engineer. In this project, I wrote 9 commits out of 545. Every architectural decision, every quality standard, every feature priority, every safety constraint came from me. But I am still not sure how to feel about the fact that my most productive month ever involved almost no coding.

I found it exciting. The main testing so far has been my 11-year-old kid and me using the software daily. I built it for us. It works for us. Whether it works for other families is something I can only find out by putting it in front of them. If you want to try it, grab an invitation code and see for yourself.

This is part 1 of the Building Kaze series, exploring what it takes to build production software with AI agents.