Dec 18, 2024
8 min read

What Level 4 Actually Requires (And Why We're Not There Yet)

A technical analysis of what needs to be true for Level 4 (fully autonomous AI development) to work. Not hype, not dismissal—honest assessment of the gaps.

I can see Level 4 from where I’m standing. That’s not a throwaway line—it’s a genuine observation from working at Level 3 every day.

Level 4 is fully autonomous AI development: you describe what you want at a high level, and AI handles everything. Requirements breakdown, architecture, implementation, testing, deployment. Human involvement becomes strategic and supervisory.

We’re not there yet. But the gaps aren’t mysterious. They’re specific, observable, and—critically—being actively worked on.

Here’s what needs to be true for Level 4 to actually work.


The Four Requirements

1. Reliability

The problem: Current AI systems hallucinate. They make confident mistakes. They produce code that looks right but has subtle bugs.

At Level 3, this is manageable. Human review catches the hallucinations. I read every PRD, verify architecture decisions, inspect code, validate test coverage. The AI proposes; I verify.

At Level 4, there’s no human review of individual outputs. The AI needs to be right—or at least reliably catch and correct its own mistakes—without human verification at each step.

Where we are: Hallucination rates have dropped dramatically but aren't zero. With Claude Opus 4.5, maybe 5-10% of outputs have some kind of issue: a wrong assumption, a missed edge case, a subtle bug. That's acceptable at Level 3, where I review everything. It's not good enough for autonomous operation.

What needs to change: Either hallucination rates need to drop to near-zero, or AI systems need robust self-verification. Run the code. Check the tests. Validate the behavior. Catch mistakes before they propagate.

The self-verification path is probably more achievable. You don’t need perfect first-draft code if you have reliable error detection and correction.
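To make that concrete, here's a minimal sketch of what a verification gate could look like, assuming a pytest-style test suite. The run_tests helper is illustrative, not a description of any particular tool:

```python
import subprocess


def run_tests(project_dir: str, timeout: int = 300) -> dict:
    """Run the project's test suite and return structured results.

    Assumes a pytest-based project; any runner that exits nonzero on
    failure would slot in the same way.
    """
    result = subprocess.run(
        ["pytest", "--maxfail=5", "-q"],
        cwd=project_dir,
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return {
        "passed": result.returncode == 0,
        # The captured output is what the model would read to diagnose failures.
        "output": result.stdout + result.stderr,
    }
```

The runner doesn't matter. What matters is that generated code gets checked against something executable before anything builds on top of it.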

2. Context Management

The problem: Current AI systems have limited context windows. They can hold a conversation, remember what was discussed, process a reasonable amount of code. But they can’t maintain coherent understanding of an entire large project.

At Level 3, I help with context management. I know where things are in the codebase. I provide relevant files. I give background that the AI doesn’t have. I’m the “project memory” that persists across sessions.

At Level 4, the AI needs project-scale memory. It needs to understand the entire codebase—not just what’s in the current context window. It needs to know the history: why decisions were made, what’s been tried before, what constraints exist.

Where we are: Context windows have expanded dramatically (200K+ tokens), but large codebases still exceed them. Solutions like RAG (retrieval-augmented generation) help—the AI can search for relevant code—but “search for relevant code” isn’t the same as “understand the whole project.”

What needs to change: Either context windows need to grow by another order of magnitude, or we need much better retrieval and memory systems. The AI needs to act like a developer who’s worked on the project for months, not a contractor who just showed up.

Some combination of expanded context, persistent memory across sessions, and intelligent retrieval seems most likely.
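To make "retrieval plus persistent memory" concrete, here's a deliberately naive sketch. Real systems use embeddings and a vector store; this toy version just scores files by keyword overlap and keeps a decision log, and every name in it is illustrative:

```python
from pathlib import Path


class ProjectMemory:
    """Toy project memory: retrieves relevant files and records decisions."""

    def __init__(self, root: str):
        self.root = Path(root)
        # "Why we did X" notes that should persist across sessions.
        self.decisions: list[str] = []

    def record_decision(self, note: str) -> None:
        self.decisions.append(note)

    def retrieve(self, query: str, k: int = 5) -> list[Path]:
        """Return the k source files sharing the most words with the query.

        Stand-in for embedding search; limited to .py files to keep the sketch short.
        """
        terms = set(query.lower().split())
        scored = []
        for path in self.root.rglob("*.py"):
            words = set(path.read_text(errors="ignore").lower().split())
            scored.append((len(terms & words), path))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [path for score, path in scored[:k] if score > 0]
```

The interesting part isn't the scoring. It's the decisions list: the "why" that currently lives in my head and gets lost between sessions.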

3. Self-Correction

The problem: When AI makes a mistake at Level 3, I catch it. I provide feedback. We iterate. The correction loop includes a human.

At Level 4, the AI needs to detect its own mistakes and fix them. Without human intervention. Run tests, see failures, debug, fix. See runtime errors, trace the cause, patch the issue.

Where we are: AI can debug when you tell it “this test is failing, here’s the error.” It struggles more with “something is wrong, figure out what.” The difference is between directed debugging and autonomous debugging.

What needs to change: AI needs to run its outputs, interpret the results, and act on them. Not just “generate code” but “generate code, run it, check if it works, fix it if it doesn’t, repeat until correct.”

This is more achievable than it sounds. The loop is simple in principle: execute → observe → fix → repeat. The hard part is doing it reliably, without getting stuck or making things worse.
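Here's roughly what that loop looks like, with a hard cap so it can't spin forever. The generate, verify, and revise callables are placeholders for whatever model calls and checks you actually wire in:

```python
from typing import Callable


def fix_until_green(
    generate: Callable[[], str],
    verify: Callable[[str], tuple[bool, str]],
    revise: Callable[[str, str], str],
    max_attempts: int = 5,
) -> str:
    """Execute -> observe -> fix -> repeat, with a hard iteration cap.

    generate: produce an initial artifact (e.g. a patch).
    verify:   run it and return (ok, diagnostics).
    revise:   produce a new artifact from the old one plus the diagnostics.
    """
    artifact = generate()
    for _ in range(max_attempts):
        ok, diagnostics = verify(artifact)
        if ok:
            return artifact
        artifact = revise(artifact, diagnostics)
    raise RuntimeError("Could not converge; escalate to a human.")
```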

4. Judgment

The problem: Software development involves judgment calls. When to refactor vs. ship. How much to optimize. Where to draw the scope boundary. When requirements are ambiguous and need human clarification.

At Level 3, I make these judgment calls. The AI executes; I decide when execution is “good enough” or “needs more work.”

At Level 4, the AI needs to make these calls autonomously—or at least know when to escalate to a human.

Where we are: AI judgment is inconsistent. Sometimes it makes great tradeoff decisions. Sometimes it over-engineers. Sometimes it takes shortcuts that cause problems later. The judgment isn’t reliably good.

What needs to change: Better calibration on when to ask vs. decide. Clear guidelines for quality thresholds. Probably some way to encode “taste” or “standards” that the AI applies consistently.

This might be the hardest requirement. Reliability, context, and self-correction are technical problems with technical solutions. Judgment is… squishier. It might require AI to have a model of user preferences that it can reason about.
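One plausible stopgap is making the escalation rules explicit instead of hoping the model infers them. A hypothetical policy, with thresholds and field names invented for illustration:

```python
from dataclasses import dataclass


@dataclass
class EscalationPolicy:
    """Explicit 'ask vs. decide' rules the agent applies consistently."""

    min_confidence: float = 0.8      # below this, ask the human
    max_blast_radius: int = 3        # modules a change may touch unattended
    always_escalate: tuple = ("auth", "billing", "data migration")

    def should_escalate(self, confidence: float, modules_touched: int, area: str) -> bool:
        return (
            confidence < self.min_confidence
            or modules_touched > self.max_blast_radius
            or area in self.always_escalate
        )
```

This doesn't solve taste. But it turns "the AI sometimes takes shortcuts" into a reviewable artifact you can tighten or loosen.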


What Level 4 Looks Like (When We Get There)

Let me paint a picture:

You: “Build me an invoicing application for small service businesses.”

The AI:

  1. Asks clarifying questions about scope, features, and constraints
  2. Proposes a high-level architecture for your approval
  3. Breaks the project into phases and milestones
  4. Executes each phase: requirements → design → implement → test
  5. Self-verifies each piece: runs tests, checks behavior, fixes issues
  6. Asks for human input at decision points: “These two approaches have tradeoffs. Which do you prefer?”
  7. Deploys incrementally, verifying each deployment works
  8. Reports progress and final result

Human involvement: strategic decisions, preference choices, final acceptance. Not line-by-line code review. Not catching hallucinations in PRDs.

The human moves from “director” to “executive”—setting direction and approving outcomes, not managing execution.
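Strip away the prose and the outer structure is a phase pipeline with verification and escalation hooks. A rough sketch, where every callable stands in for a model call or human touchpoint that doesn't exist as a real API today:

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Phase:
    name: str
    needs_human_decision: bool = False


@dataclass
class Level4Agent:
    """Wires the workflow above out of pluggable model calls and human touchpoints."""

    clarify: Callable[[str], dict]
    plan: Callable[[dict], list[Phase]]
    implement: Callable[[Phase], str]
    verify: Callable[[str], bool]
    fix: Callable[[str], str]
    ask_human: Callable[[Phase, str], str]
    deploy: Callable[[str], None]

    def build(self, goal: str, max_fix_attempts: int = 5) -> None:
        spec = self.clarify(goal)                  # 1. clarifying questions
        for phase in self.plan(spec):              # 2-3. architecture, phases, milestones
            artifact = self.implement(phase)       # 4. requirements -> design -> implement -> test
            for _ in range(max_fix_attempts):      # 5. self-verification loop (capped)
                if self.verify(artifact):
                    break
                artifact = self.fix(artifact)
            if phase.needs_human_decision:         # 6. escalate preference calls
                artifact = self.ask_human(phase, artifact)
            self.deploy(artifact)                  # 7. incremental deployment
        # 8. progress reporting would go here
```

The hard parts from the four requirements all hide inside verify, fix, and ask_human.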


Timeline Speculation

Timelines are dangerous—they age poorly and create false expectations. But here’s my honest assessment:

Near-term (12-24 months):

  • Level 4 for constrained, well-defined greenfield projects
  • Scope: Simple CRUD apps, standard integrations, common patterns
  • Constraint: Human still approves architecture and reviews major decisions

Medium-term (2-4 years):

  • Level 4 for broader greenfield development
  • AI can handle ambiguous requirements through clarifying questions
  • Self-correction is reliable enough for autonomous iteration
  • Human involvement: goal-setting and final acceptance

Longer-term (5+ years):

  • Level 4 for maintenance and modification of existing codebases
  • AI understands project history and can evolve systems over time
  • Human involvement: strategy and taste

What might not reach Level 4 for a long time:

  • Novel research problems with no existing patterns
  • Systems where mistakes have high consequences (safety-critical, financial)
  • Domains that require deep human expertise the AI can’t acquire


Why This Matters Now

If you’re operating at Level 2 or early Level 3, why care about Level 4?

Skill preparation. The skills that matter at Level 4 are already becoming valuable at Level 3. Goal clarity. Outcome validation. Strategic thinking. Knowing what you want built. These skills are investments in your future leverage.

Tooling choices. Systems designed for Level 4 compatibility will have an easier transition. Simple tooling, clear interfaces, well-documented codebases—these help AI at Level 3 and become essential at Level 4.

Mental models. Understanding where this is going helps you think about your role. If Level 4 arrives and you’ve been optimizing for line-by-line coding review, you’ve built the wrong muscles. If you’ve been developing judgment about goals and outcomes, you’re prepared.


The Things We Don’t Know

Some genuine uncertainty:

How will judgment be encoded? I don’t have a clear picture of how AI develops good taste or consistent standards. This might emerge from better training, or might require explicit preference modeling, or might remain a gap that limits Level 4 scope.

How fast will reliability improve? Hallucination reduction has been steady but not exponential. Will there be a breakthrough, or will we approach reliability asymptotically?

How will context scale? Bigger windows vs. better retrieval vs. persistent memory—the path isn’t clear. Different approaches might work for different use cases.

Will there be hard limits? Some optimists believe AI will solve everything. Some pessimists believe hard limits exist. The honest answer is we don’t know yet.


Closing Thought

Level 4 isn’t science fiction. It’s engineering work on specific, identifiable problems: reliability, context, self-correction, judgment.

Progress on these is measurable. You can track hallucination rates, context window sizes, debugging success rates. The gaps are closing.

I can see Level 4 from where I’m standing. The view from Level 3 is actually quite good—close enough to understand what’s missing, far enough to see the work that remains.

If you’re building skills for AI-assisted development, build them for where this is going, not just for where it is. The transition from Level 3 to Level 4 is coming. Being ready means building the skills that matter on the other side.


This is Part 9 of a series on AI-assisted software development. Previously: The Lean AI Stack. Next: You’re Not Replaced, You’re Promoted.