

Software developer and data geek with 18+ years delivering web, mobile, and defense systems that ship to production. My focus is on analytics platforms, APIs and developer tooling. My open-source work ranges from compilers and automation frameworks to GIS data products. I weave AI-assisted workflows into day-to-day engineering to accelerate delivery and quality.
I've been vibe-coding since before it had a name, feeding prompts to models and stitching the output into real projects. Early GPT-3 couldn't get a swap function right. By 2024, Codex could handle small tasks: individual scripts and one-off functions. By mid-2025, AI could connect modules and fix simple bugs. The grind was always the same: 5-10 iterations with the model, beat my head against the wall, then surgically hand-fix the remaining bugs myself.
Opus 4.5 changed the game. My first realization of its power came from working on the Points of Interest feature for Investomation. I already had a database of points, but building a functional UI that exposes that data to the user is no small feat. It requires changes to the backend, the frontend, the protocol between them, and coming up with a sane UI in Svelte. Plenty of things to get confused about, and had I given the task to a lesser model, it would've gladly slapped some unnecessary React in there too. But Opus produced a plan that seemed... right.
Opus figured out the architecture correctly, on the first try. I was impressed: I didn't have to break a complex task down into smaller chunks; it could swallow it whole, or at least it claimed it could. "Go for it" – I figured – "I'll fix the bugs later myself, as usual." But this time... there was almost nothing to fix. It not only got the design right on the first try, it got the implementation right too. It had the intuition to ask the right questions during the design phase like no model before it. A few minor bugs remained for me to fix, but to its credit, it wrote the code in the dark, with no access to the browser to verify its work.
I started wondering what was different. Initially the model just felt more powerful, as if they had thrown more compute at it, but as I dug deeper into its operation, I realized that the difference wasn't that it stopped making mistakes. It was better at discovering them: it would test its work. And when things didn't work, it would ask "why?", like a developer would.
The difference was self-correction. Previous models would make an error and keep building on top of it, producing increasingly broken output with complete confidence. Opus 4.5 would catch its own mistakes. This sounds simple, but it's the difference between a tool you supervise and a tool you delegate to. Old models would take seconds to dump code that got 80% of the way there but epic-failed during the other 20%. Opus took off to do its thing for 40+ minutes (randomly interrupting itself to ask for permissions) but then produced a working solution. The idea that you could now "cook" code on a back burner and come back every once in a while to stir it raised a new question: how many of these things can I run at once?
During the two-week promo when Opus was priced at Sonnet rates, several. At full Opus pricing, running parallel sessions gets expensive fast. Perhaps there is a future where a single Opus architect guides a dozen cheaper DeepSeek models, but that would be a different blog post.
The "junior developer" comparison gets thrown around a lot in AI circles, but it misses the mark. A junior developer makes human mistakes: forgets edge cases, copies patterns without understanding them, gets lost in complex systems. AI makes alien mistakes, like explaining a clean architecture and then failing to follow it during implementation, or rewriting the implementation entirely when the user asks for a small bug fix.
I can definitely see the limitations of the context window coming into play here: the agents simply don't have enough space to keep both the architecture and the implementation in memory at once. Opus is better off, with a 200k-token context window (about a 500-page book), but even then, source code files eat through that window fast, and when compaction kicks in (the model summarizes the original thread and clears the old context), it gets noticeably dumber until the original memories resurface.
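To see why source files eat through the window so fast, here's a back-of-envelope sketch using the common rule of thumb of roughly 4 characters per token (real tokenizers vary, and the file sizes below are made-up averages, not measurements):

```python
# Rough estimate of how quickly source files consume a context window.
# Assumes ~4 characters per token, a common heuristic; actual tokenizers vary.

CONTEXT_WINDOW = 200_000  # tokens, an Opus-class window
CHARS_PER_TOKEN = 4       # rough average for source code

def estimate_tokens(chars: int) -> int:
    """Approximate token count for a blob of source text."""
    return chars // CHARS_PER_TOKEN

# A hypothetical medium-sized file: ~500 lines at ~40 characters per line.
file_chars = 500 * 40
tokens_per_file = estimate_tokens(file_chars)       # 5,000 tokens

files_until_full = CONTEXT_WINDOW // tokens_per_file  # 40 files
print(f"~{tokens_per_file} tokens per file; window full after ~{files_until_full} files")
```

Forty medium files fills the window before accounting for the conversation itself, tool output, or test logs, which is why compaction kicks in long before a real codebase fits.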
Opus 4.5 finally earns the "junior developer" badge: it brings an intuition that previous models didn't have. It can take a clearly-scoped task, implement it across multiple files, run the tests, iterate on failures, and deliver working code. Just like a real junior developer, the quality of its output depends heavily on the quality of the spec it's given. A vague "build me a user authentication system" will produce vague results. A specific "add RBAC with these roles, these permissions, using this library, following the patterns in this directory" produces code I'd merge after a normal review. The expert still sets the direction, still reviews the output, still makes the architectural decisions. But the grunt work of implementation can finally be outsourced to AI. In 2024, a dev who said AI wrote his code would be laughed out of the room by other devs. In late 2025, that dev would be envied.
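For readers unfamiliar with RBAC, the core of such a task reduces to a role-to-permissions mapping and a membership check. The roles and permissions below are hypothetical, purely for illustration, not from any real spec:

```python
# Minimal RBAC sketch. Role and permission names are hypothetical examples.
from enum import Enum, auto

class Permission(Enum):
    READ = auto()
    WRITE = auto()
    DELETE = auto()
    MANAGE_USERS = auto()

# Each role grants a fixed set of permissions.
ROLE_PERMISSIONS = {
    "viewer": {Permission.READ},
    "editor": {Permission.READ, Permission.WRITE},
    "admin":  {Permission.READ, Permission.WRITE,
               Permission.DELETE, Permission.MANAGE_USERS},
}

def has_permission(role: str, permission: Permission) -> bool:
    """Check whether a role grants a permission; unknown roles grant nothing."""
    return permission in ROLE_PERMISSIONS.get(role, set())

print(has_permission("editor", Permission.WRITE))   # True
print(has_permission("viewer", Permission.DELETE))  # False
```

The logic itself is trivial; the hard part of the real task is wiring checks like this consistently through every endpoint and following the existing patterns of the codebase, which is exactly the grunt work being outsourced.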
AI excels at writing boilerplate, and Opus is no exception. In fact, it can one-shot complex CRUD and RBAC permission logic because there are many open-source examples of it done right, and Opus can finally hold all that context in its head at once, iterate on it, troubleshoot without a human in the loop, and build a working solution autonomously. Back in 2019, I built a survey-builder app in a weekend. An elegant project that I ended up shelving just because I didn't have the time to work on it. A few days ago, I was able to dust off that project and create a fully working MVP. Opus handled the app boilerplate, replacing my old Marko stack with a modern Svelte frontend, but kept the enhancements I had made to the markdown parser. It used my old app as a pattern to follow, without requiring me to rebuild the entire web portal myself. With compaction, Opus can hold the entire roadmap in memory across a long session. Self-correction plus sustained context turns 40-minute autonomous runs into working code instead of confident garbage.
Opus still struggles with genuinely novel problems where there's no established pattern to follow. It can implement a well-known architecture beautifully, but ask it to design a new one and it'll default to whatever's most common in its training data, or latch on to some buzzword you may have mentioned in passing. It makes subtle errors on ambiguous requirements, the kind of mistakes where you'd need domain expertise to even notice something is wrong. And it still can't replace the expert's judgment about which approach is right for a given situation. It still doesn't understand performance bottlenecks (although it will address them if you point them out). It creates intuitive UX in some cases and horrible UX in others (it really depends on what it's paying attention to).
It does better in the hands of a generalist who's comfortable with full-stack development than in those of a specialist who focuses on one thing, mainly because the generalist is more comfortable looking at the big picture. Opus will never out-focus you in your expert niche, and it won't have the intuition to auto-pilot the niches you're weak at. It still needs a competent human in the loop, familiar with the architecture. It doesn't replace expertise, but it does multiply it. A senior engineer with Opus 4.5 is genuinely more productive than a small team without it.