Notes on AI-native Workflows

July 1, 2026

AI, Agents, Claude Code, Workflow, EdTech

Most "AI features" are AI bolted onto a workflow that already existed. A chat box in the corner. A "summarize" button. A copilot that autocompletes the thing you were going to type anyway. Useful, sometimes. But the shape of the work doesn't change — a human still drives, and the model rides along.

AI-native is the other direction: you design the workflow around the model doing the work, and you take the seat of the orchestrator. You stop asking "where can I add AI here?" and start asking "if agents do the heavy lifting, what is my job?"

I've been building this way at InfyBytes, and the clearest example is a pipeline I call Alfred. This post is less a tutorial than a set of notes on what I've learned about the shape of these systems.

The case study: a game factory

The goal is narrow and concrete. We make math games for students. A creator has an idea — "a ratio comparison game for Class 5, three lives, ten rounds, multiple choice" — and wants a finished, tested game that kids can actually play.

Alfred turns that sentence into a deployed game. The concept flows through a sequence of stages:

Spec — draft a complete, unambiguous game specification from the description
Review the spec — run it through a checklist, catch scope creep and missing decisions
Plan — screen flow, round-by-round breakdown, scoring and lives logic
Build — generate a single self-contained HTML file
Validate — deterministic contract and static checks
Test — drive the game in a real browser, fix what breaks
Visual review — screenshot every state, check layout and polish
Final review — compare the built game against the original spec
Deploy — upload, register, health-check
Gauge — after students play, read the data and decide what to change

Each stage is an agent with one job. The output of the build stage is a single HTML file that gets injected into a student-facing harness — the harness handles the platform (progress, scoring hooks, audio), the generated file is the game.

That's the system. The interesting part is the principles that made it work.

Orchestration over prompting

The naive version of this is one enormous prompt: "here's a concept, give me a finished game." It fails, and it fails in a way that's hard to debug — you can't tell whether the spec was wrong, the plan was wrong, or the code was wrong, because they all happened in one breath.

The fix is to decompose. Each stage is a specialized agent that reads only the skills it needs and produces one artifact. A spec is a thing you can inspect. A plan is a thing you can inspect. When something is wrong, you know which artifact is wrong, and you can fix that stage without re-rolling the whole pipeline.

The orchestrator's job isn't to write the game. It's to route work between agents, decide what runs where, and hold the line on quality between stages.
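The decomposition described above can be sketched as a loop over named stages. This is a minimal illustration, not Alfred's actual code: the stage functions are hypothetical stubs standing in for agents, and the names `stages` and `runPipeline` are assumptions made for the example. A real orchestrator would call agents asynchronously; the sketch is synchronous to stay short.

```javascript
// Hypothetical stubs: each stage reads earlier artifacts and returns one new artifact.
const stages = [
  { name: 'spec', run: (a) => ({ title: a.concept, rounds: 10, lives: 3 }) },
  { name: 'plan', run: (a) => ({ screens: ['welcome', 'play', 'victory'], rounds: a.spec.rounds }) },
  { name: 'build', run: (a) => `<html><!-- ${a.plan.rounds} rounds --></html>` },
];

// Run stages in order, storing each artifact under its stage name.
// On failure, report which stage broke and keep the artifacts produced so far.
// fromStage lets the caller re-run from one stage without redoing earlier ones.
function runPipeline(stageList, artifacts, fromStage = stageList[0].name) {
  const start = stageList.findIndex((s) => s.name === fromStage);
  if (start === -1) throw new Error(`unknown stage: ${fromStage}`);
  const out = { ...artifacts };
  for (const stage of stageList.slice(start)) {
    try {
      out[stage.name] = stage.run(out);
    } catch (err) {
      return { ok: false, failedStage: stage.name, error: err.message, artifacts: out };
    }
  }
  return { ok: true, failedStage: null, error: null, artifacts: out };
}

const firstRun = runPipeline(stages, { concept: 'ratio comparison game' });
console.log(firstRun.ok, Object.keys(firstRun.artifacts));
```

Because every artifact is kept under its stage name, a bad plan can be fixed by calling `runPipeline(stages, previous.artifacts, 'plan')`, which reuses the existing spec instead of re-rolling it.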

Human gates, not full autonomy

It's tempting to let the whole thing run end to end and hand you a finished game. Don't. The most valuable design decision in Alfred is that it stops — at every phase transition, it presents what it produced and waits.

PHASE 1: NAIL THE INTENT
  Draft spec → HUMAN REVIEWS SPEC
  Validate   → HUMAN REVIEWS VALIDATION
  Plan       → HUMAN REVIEWS PLAN
PHASE 2: BUILD, TEST, REVIEW
  ...
  Human preview → creator plays the game

This isn't a lack of trust in the model. It's that the cost of a wrong turn compounds. A wrong assumption in the spec becomes a wrong plan becomes a wrong game becomes a wasted deployment. Catching it at the spec gate costs a sentence of feedback. Catching it after deploy costs an afternoon.

The gates put the human where humans are actually good — judging intent and taste — and keep them out of where the machine is better: generating, checking, and grinding through the mechanical middle.
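One way to picture a gate is a loop that will not advance until a reviewer approves the current artifact. The sketch below is an assumption about shape, not Alfred's implementation: `runWithGates`, the `gated` flag, and the `review` callback are all hypothetical names, and the human is simulated by a stub function so the example runs on its own.

```javascript
// review(stageName, artifact) stands in for the human. It returns
// { approved: true } or { approved: false, feedback: '...' }.
function runWithGates(stageList, concept, review, maxRevisions = 3) {
  const artifacts = { concept };
  for (const stage of stageList) {
    let feedback = null;
    let approved = false;
    for (let attempt = 0; attempt <= maxRevisions && !approved; attempt++) {
      // Each retry regenerates only this stage, with the reviewer's feedback.
      artifacts[stage.name] = stage.run(artifacts, feedback);
      if (!stage.gated) {
        approved = true;
        break;
      }
      const verdict = review(stage.name, artifacts[stage.name]);
      approved = verdict.approved;
      feedback = verdict.feedback ?? null;
    }
    // Stop the whole pipeline rather than build on an unapproved artifact.
    if (!approved) return { done: false, stoppedAt: stage.name, artifacts };
  }
  return { done: true, stoppedAt: null, artifacts };
}

// Stub stages: the first spec draft wrongly drops the lives mechanic.
const gatedStages = [
  { name: 'spec', gated: true, run: (a, fb) => ({ concept: a.concept, lives: fb === 'keep three lives' ? 3 : 0 }) },
  { name: 'plan', gated: true, run: (a) => ({ lives: a.spec.lives, rounds: 10 }) },
];

// Stub reviewer: rejects a spec without three lives, approves everything else.
const reviewer = (stage, artifact) =>
  stage === 'spec' && artifact.lives !== 3
    ? { approved: false, feedback: 'keep three lives' }
    : { approved: true };

const gatedRun = runWithGates(gatedStages, 'ratio game, three lives', reviewer);
console.log(gatedRun.done, gatedRun.artifacts.plan);
```

In the stub run, the wrong assumption is caught at the spec gate and costs one sentence of feedback; the plan is only ever generated from the corrected spec.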

Verify, don't trust

An agent will tell you it tested the game. It is often lying, not maliciously but structurally — if you ask a model to "test" something without giving it a way to actually run it, it will read the code, reason about it, and report confident results that never touched a browser.

So the pipeline never takes an agent's word for correctness. It has three kinds of ground truth:

Deterministic validators — scripts that exit non-zero on a contract violation. The exit code is the source of truth, not the screenshot and not the agent's summary.
Real browser testing — the game is served locally and driven with Playwright. A test "passes" only if it ran.
Adversarial review — a separate review stage compares the built game against the original spec, looking for what's missing rather than confirming what's there.

There's a subtle failure mode worth naming: a fix that makes the screenshot look clean while the code underneath violates a rule. A green-looking screen is not a passing check. The validator is.
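The "exit code is the source of truth" rule can be illustrated with a small wrapper that runs a validator as a child process and derives the verdict from its exit status alone. This is a sketch under assumptions: `runValidator` is a hypothetical helper, and the two "validators" are throwaway Node one-liners so the example runs without any real validator scripts.

```javascript
const { spawnSync } = require('child_process');

// Run a validator command and decide pass/fail from its exit code only.
// agentSummary is accepted just to show that it never influences the verdict.
function runValidator(command, args, agentSummary) {
  const result = spawnSync(command, args, { encoding: 'utf8' });
  return {
    passed: result.status === 0, // a null status (failed to start) also counts as a failure
    exitCode: result.status,
    output: (result.stdout || '') + (result.stderr || ''),
    agentSummary, // kept for the log, never consulted
  };
}

// Stand-in validator that reports a violation and exits non-zero.
const failing = runValidator(
  process.execPath,
  ['-e', 'console.error("correctOption missing"); process.exit(1)'],
  'All checks passed, the game looks great.'
);

// Stand-in validator that exits cleanly.
const passing = runValidator(process.execPath, ['-e', 'process.exit(0)'], 'Not sure this works.');

console.log(failing.passed, failing.exitCode, passing.passed);
```

The optimistic summary attached to the failing run changes nothing, and neither does the pessimistic one attached to the passing run.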

Context boundaries are real

This one is specific to how agents run, but it bit us hard enough to be a principle. Sub-agents don't inherit everything the orchestrator has — in our case, they can't reach the browser-automation tools. Delegate "test this game" to a sub-agent and it silently falls back to reading the code instead of running it. Tests "pass." Confidence is false.

So some stages must run in the main context, and the pipeline says so explicitly, per stage:

Stage	Runs where	Why
Draft spec / plan / build	Sub-agent	Text and code generation only
Test / visual / final review	Main context	Needs the real browser
Deploy / gauge	Sub-agent	API calls and queries

The lesson generalizes: know what each agent can and can't reach, and design the topology around it. Capability boundaries aren't an implementation detail — they change where work is allowed to happen.
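A topology like the table above can be written down as data and checked before anything runs. The following is a rough sketch, assuming hypothetical capability names (`text`, `code`, `api`, `browser`) and a made-up `findUnreachable` helper; it is not how Alfred encodes its stages.

```javascript
// What each execution context can reach. The capability names are assumptions.
const contexts = {
  main: new Set(['text', 'code', 'api', 'browser']),
  subAgent: new Set(['text', 'code', 'api']),
};

// Placement per stage, following the table above, plus what each stage needs.
const topology = [
  { stage: 'draft-spec', runsIn: 'subAgent', needs: ['text'] },
  { stage: 'build', runsIn: 'subAgent', needs: ['code'] },
  { stage: 'test', runsIn: 'main', needs: ['browser'] },
  { stage: 'visual-review', runsIn: 'main', needs: ['browser'] },
  { stage: 'deploy', runsIn: 'subAgent', needs: ['api'] },
];

// List every stage that was placed somewhere it cannot do its job.
function findUnreachable(plan, ctx) {
  const problems = [];
  for (const { stage, runsIn, needs } of plan) {
    for (const capability of needs) {
      if (!ctx[runsIn].has(capability)) {
        problems.push(`${stage}: ${runsIn} cannot reach ${capability}`);
      }
    }
  }
  return problems;
}

console.log(findUnreachable(topology, contexts)); // []

// The failure mode from the text: browser testing delegated to a sub-agent.
const misrouted = [{ stage: 'test', runsIn: 'subAgent', needs: ['browser'] }];
console.log(findUnreachable(misrouted, contexts)); // [ 'test: subAgent cannot reach browser' ]
```

Checking placement up front turns the silent fallback into a loud error before any stage runs.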

The artifact is deterministic even when the process isn't

The pipeline is probabilistic. The contract its output must satisfy is not. Every game is a single self-contained HTML file that has to pass the same static validators, honor the same data contract, and mount into the same harness. The creativity lives in the generation; the reliability lives in a hard, deterministic boundary the output must cross.

That boundary is what makes it safe to let a model generate the thing at all. You don't have to trust the process if you can verify the artifact.
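As an illustration of what one such boundary check might look like, here is a crude static check that an HTML string is self-contained. Everything in it is an assumption: the rule IDs are invented for the example, and a production validator would more likely parse the document than pattern-match it.

```javascript
// Hypothetical rules. Each pattern flags one way a file can depend on something outside itself.
const selfContainedRules = [
  { id: 'NO-EXTERNAL-SCRIPT', pattern: /<script\b[^>]*\bsrc\s*=/i },
  { id: 'NO-EXTERNAL-STYLESHEET', pattern: /<link\b[^>]*\brel\s*=\s*["']?stylesheet/i },
  { id: 'NO-REMOTE-URL-IN-ATTR', pattern: /\b(?:src|href)\s*=\s*["']?https?:\/\//i },
];

// Return the IDs of violated rules; an empty list means the file passed.
function checkSelfContained(html) {
  return selfContainedRules
    .filter((rule) => rule.pattern.test(html))
    .map((rule) => rule.id);
}

const inlineOnly =
  '<html><head><style>p{margin:0}</style></head><body><script>var x = 1;</script></body></html>';
const usesCdn =
  '<html><head><script src="https://cdn.example.com/lib.js"></script></head></html>';

console.log(checkSelfContained(inlineOnly)); // []
console.log(checkSelfContained(usesCdn)); // [ 'NO-EXTERNAL-SCRIPT', 'NO-REMOTE-URL-IN-ATTR' ]
```

The same function gives the same answer for the same file no matter which model run produced it, which is the property the boundary depends on.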

An example run — Variant 1 (input + spec + HTML)

Here's a real run — the same one that produced the screenshots for this post.

The input was one line pointing Alfred at a concept file:

using this pipeline generate a game for below concept .../text-only-mcq-single-select-with-submit/concept.md

The concept, "Better Way," describes a word-problem game where a Class 3–4 child picks the strategy to solve a problem — not the answer, the method:

"You tried to give away 108 pins to 4 students. Giving out one pin at a time is slow. How can you find the number of pins each student gets in a faster way? Choose the best answer."

The spec stage turned that prose into a precise, buildable contract — identity, Bloom level, a target-skills table, round-by-round progression, a scoring formula, and (my favourite part) a Diff from creator description that lists every line it added and why:

Round-set cycling — Sets A, B, C — added because validator GEN-ROUNDSETS-MIN-3 is mandatory for multi-round games.

Lives (3 hearts) — RETAINED from the creator's concept; this contradicts a suggested pedagogy default, so it's surfaced in Warnings for me to confirm.

That section is the whole faithfulness story in one place: nothing enters the game silently.

The build stage emitted a single index.html. Each round is a plain data object the harness renders:

{
  set: 'A', id: 'A_r1_pins108div4', round: 1, stage: 1,
  scenarioHtml:
    '<p>You tried to give away <strong>108 pins</strong> to <strong>4 students</strong>.</p>' +
    '<p>How can you find how many each student gets in a faster way?</p>',
  options: [
    { id: 'opt-1', text: "No, I don't think there is a faster way.", kind: 'hedge' },
    { id: 'opt-2', text: 'Use division to find how many pins each student gets.', kind: 'operation' },
    { id: 'opt-3', text: 'Guess how many pins each student should get.', kind: 'slow' }
  ],
  correctOption: 'opt-2',
  operationVerb: '÷',
  misconception_tags: { 'opt-1': 'self-hedge', 'opt-3': 'count-one-by-one-instead-of-operation' }
}
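A data object in this shape lends itself to a deterministic contract check. The sketch below guesses at rules such a check might enforce (unique option ids, a `correctOption` that exists, a misconception tag on every wrong option); the function name `checkRound` and the rules themselves are assumptions, not the pipeline's real validators.

```javascript
// Return a list of contract violations for one round; empty means it passed.
function checkRound(round) {
  const errors = [];
  const ids = (round.options || []).map((option) => option.id);
  if (new Set(ids).size !== ids.length) errors.push('option ids must be unique');
  if (!ids.includes(round.correctOption)) errors.push('correctOption must match an option id');

  const tags = round.misconception_tags || {};
  for (const id of ids) {
    // Assumed rule: every wrong option explains which misconception it represents.
    if (id !== round.correctOption && !tags[id]) {
      errors.push(`wrong option ${id} has no misconception tag`);
    }
  }
  if (round.correctOption in tags) errors.push('correct option must not carry a misconception tag');
  return errors;
}

// Trimmed copy of the round shown above.
const sampleRound = {
  set: 'A', id: 'A_r1_pins108div4', round: 1,
  options: [{ id: 'opt-1' }, { id: 'opt-2' }, { id: 'opt-3' }],
  correctOption: 'opt-2',
  misconception_tags: { 'opt-1': 'self-hedge', 'opt-3': 'count-one-by-one-instead-of-operation' },
};

console.log(checkRound(sampleRound)); // []
console.log(checkRound({ ...sampleRound, correctOption: 'opt-4' }));
```

The second call reports two problems: the dangling `correctOption`, and `opt-2`, which is now treated as a wrong answer but has no tag.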

One sentence in; a tested, playable ten-round game out.

An example run — Variant 2 (every stage)

To make it concrete, here's an actual run of the pipeline, stage by stage — the same one behind the screenshots above.

The input

One line, pointing Alfred at a concept file:

using this pipeline generate a game for below concept .../concept.md — create a new game if one already exists

The concept, "Better Way," is a word-problem game for Class 3–4: the child reads a sharing scenario and picks the method they'd use, not the answer.

"You tried to give away 108 pins to 4 students… How can you find the number each student gets in a faster way? Choose the best answer."

Stage 1 — Spec

Prose becomes a buildable contract: identity, Bloom level, a target-skills table, and a round plan.

Skill	Cue	Rounds
Equal-share → division	share equally, each, left over	1–3
Equal-groups → multiplication	N groups of M, each box has	4–6
Combine vs separate → +/−	altogether vs gave away	7–9

The spec ends with a Diff from creator description — every added line, justified:

Sets A, B, C — added; validator GEN-ROUNDSETS-MIN-3 is mandatory for multi-round games.

Lives — RETAINED from the concept; contradicts a suggested default, so surfaced in Warnings.
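The post only names the GEN-ROUNDSETS-MIN-3 rule, so the following is a guess at what a rule with that name could check: that the rounds span at least three distinct sets. The function `checkRoundSets` and its return shape are hypothetical.

```javascript
// Assumed meaning of the rule: a multi-round game needs at least three distinct round sets.
function checkRoundSets(rounds, minSets = 3) {
  const sets = [...new Set(rounds.map((round) => round.set))].sort();
  if (sets.length >= minSets) return { ok: true, sets };
  return {
    ok: false,
    sets,
    message: `GEN-ROUNDSETS-MIN-3: found ${sets.length} set(s), need at least ${minSets}`,
  };
}

// Minimal stand-in rounds; only the set label matters for this check.
const threeSets = [{ set: 'A' }, { set: 'B' }, { set: 'C' }, { set: 'A' }];
const oneSet = [{ set: 'A' }, { set: 'A' }];

console.log(checkRoundSets(threeSets)); // { ok: true, sets: [ 'A', 'B', 'C' ] }
console.log(checkRoundSets(oneSet).message);
```

A rule this mechanical is what lets the spec stage justify its additions by validator ID rather than by opinion.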

Stage 2 — Plan

The plan fixes the concrete shape before any code: screen flow (Welcome → Round intro → play → Victory / Game Over), the per-round loop, and the scoring math:

3★ at ≥27/30 internal stars (≥90%), 2★ at 18–26, 1★ at 1–17. Three lives; −1 per wrong submit; game over at 0.
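The scoring and lives rules are precise enough to write as two small functions. This is a sketch of the arithmetic only: the names `starRating` and `applySubmit` are made up, and the plan does not say what a score of 0 earns, so returning 0 stars there is an assumption.

```javascript
// Thresholds from the plan: 3 stars at 27 or more out of 30, 2 at 18 to 26, 1 at 1 to 17.
function starRating(internalStars) {
  if (!Number.isInteger(internalStars) || internalStars < 0 || internalStars > 30) {
    throw new RangeError('internalStars must be an integer from 0 to 30');
  }
  if (internalStars >= 27) return 3;
  if (internalStars >= 18) return 2;
  if (internalStars >= 1) return 1;
  return 0; // assumption: the plan does not specify the zero case
}

// Lives: start at 3, lose one per wrong submit, game over at 0.
function applySubmit(state, isCorrect) {
  if (state.gameOver) return state; // no further changes once the game has ended
  const lives = isCorrect ? state.lives : state.lives - 1;
  return { lives, gameOver: lives === 0 };
}

console.log(starRating(27), starRating(26), starRating(18), starRating(17), starRating(1));
// 3 2 2 1 1

// Three wrong submits, in any order, end the game.
const submits = [true, false, false, true, false];
const finalState = submits.reduce(applySubmit, { lives: 3, gameOver: false });
console.log(finalState); // { lives: 0, gameOver: true }
```

Writing the thresholds as code also exposes the edges a prose plan can leave open, such as the zero-star case.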

Stage 3 — Build

A single index.html. Each round is a data object:

{
  set: 'A', id: 'A_r1_pins108div4', round: 1,
  scenarioHtml: '<p>You tried to give away <strong>108 pins</strong> to <strong>4 students</strong>…</p>',
  options: [
    { id: 'opt-1', text: "No, I don't think there is a faster way.", kind: 'hedge' },
    { id: 'opt-2', text: 'Use division to find how many pins each student gets.', kind: 'operation' },
    { id: 'opt-3', text: 'Guess how many pins each student should get.', kind: 'slow' }
  ],
  correctOption: 'opt-2', operationVerb: '÷',
  misconception_tags: { 'opt-1': 'self-hedge', 'opt-3': 'count-one-by-one-instead-of-operation' }
}

Stage 4 — Validate, test, review

Deterministic validators run first (exit 0 required), then the game is served and driven in a real browser, then a review compares the built game against the spec. Only after all three does it reach me for preview.

The result: one sentence in, a tested ten-round game out — with a paper trail at every stage.

What this changes

The part that surprised me is how the job shifts. I write far less code by hand now. But I spend more thought on the system — where the gates go, what "done" means precisely enough for a validator to check it, which stage owns which decision. The skill isn't prompting. It's decomposition, verification, and knowing when to step in.

That's what AI-native means to me in practice: not a smarter autocomplete, but a workflow where the model does the making and I do the deciding — and where nothing the model claims is believed until something deterministic confirms it.

More of these to come as we build them.

— Rishabh