Scaling long-running autonomous coding

(simonwillison.net)

65 points | by srameshc 6 hours ago

7 comments

  • vedmakk 4 minutes ago
    After reading that post, it feels so basic to sit here watching my single humble Claude Code agent go about its work... confident, but brittle and so easily distracted.
  • Agent_Builder 17 minutes ago
    This matches what I’ve seen with long-running agents. The failures usually aren’t one big mistake, but small assumptions compounding over time.

    What helped for me was forcing the agent into short, explicitly scoped steps. Each step declares what it can read, what it can do, and what it’s allowed to output, then that context gets torn down before the next step.
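
    A minimal sketch of what I mean, in Python (run_step, the agent callable, and the scope fields are hypothetical placeholders here, not any real API):

      from dataclasses import dataclass

      @dataclass
      class StepScope:
          readable_paths: list[str]   # what this step may read
          allowed_tools: list[str]    # what it may do
          output_keys: set[str]       # what it may emit

      def run_step(scope: StepScope, task: str, agent) -> dict:
          # Build a fresh context per step: only the declared inputs go in.
          context = {p: open(p).read() for p in scope.readable_paths}
          result = agent(task, context, scope.allowed_tools)  # your LLM call
          # Enforce the declared output shape before anything persists,
          # so permission drift is caught at the step boundary.
          assert set(result) <= scope.output_keys, "undeclared output"
          return result  # the context dies with this frame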

    I’ve been using GTWY for this kind of setup and it made long-running coding agents much more boring and predictable, which is exactly what you want at scale.

    Curious how you’re handling state reset and permission drift as runtimes get longer.

  • simonw 3 hours ago
    One of the big open questions for me right now concerns how library dependencies are used.

    Most of the big ones are things like skia, harfbuzz, wgpu - all totally reasonable IMO.

    The two that stand out for me as more notable are html5ever for parsing HTML and taffy for handling CSS grids and flexbox - that's vendored with an explanation of some minor changes here: https://github.com/wilsonzlin/fastrender/blob/19bf1036105d4e...

    Taffy is a solid library choice, but it's probably the most robust ammunition for anyone who wants to argue that this shouldn't count as a "from scratch" rendering engine.

    I don't think it detracts much, if at all, from FastRender as an example of what an army of coding agents can help a single engineer achieve in a few weeks of work.

    • sealeck 3 hours ago
      I think the other question is how far away this is from a "working" browser. It isn't impossible to render a meaningful subset of HTML (especially when you use external libraries to handle a lot of this). The real difficulty is doing this (a) quickly, (b) correctly and (c) securely. All of those are very hard problems, and also quite tricky to verify.

      I think this kind of approach is interesting, but it's a bit sad that Cursor didn't discuss how they close the feedback loop: testing/verification. As generating code becomes cheaper, I think effort will shift to how we can more cheaply and reliably determine whether an arbitrary piece of code meets a desired specification. For example, did they use https://web-platform-tests.org/, fuzz testing (e.g. feed in random webpages and inform the LLM when the fuzzer finds crashes), etc? I would imagine truly scaling long-running autonomous coding would put a heavy emphasis on this.
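
      The fuzzing half of that loop in particular could be tiny. A hypothetical Python sketch (the ./fastrender invocation and the report_to_agent hand-off are made up for illustration):

        import random, string, subprocess

        def random_page(rng: random.Random) -> str:
            # Crude HTML fuzzer: random nested tags wrapped around junk text.
            tags = ["div", "span", "table", "b", "p"]
            body = "".join(
                f"<{t}>{''.join(rng.choices(string.printable, k=20))}</{t}>"
                for t in rng.choices(tags, k=50))
            return f"<html><body>{body}</body></html>"

        rng = random.Random(0)
        for _ in range(1000):
            page = random_page(rng)
            proc = subprocess.run(["./fastrender", "/dev/stdin"],
                                  input=page.encode(), capture_output=True)
            if proc.returncode != 0:
                # Crash found: hand the page and stderr back to the LLM.
                report_to_agent(page, proc.stderr)  # hypothetical hand-off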

      Of course Cursor may well have done this, but it wasn't super deeply discussed in their blog post.

      I really enjoy reading your blog and it would be super cool to see you look at approaches people have to ensuring that LLM-produced code is reliable/correct.

      • simonw 3 hours ago
        Yeah, I'm hoping they publish a lot more about this project! It deserves way more than the few sentences they've shared about it so far.
    • shubhamjain 1 hour ago
      Why attempt something that has an abundant number of libraries to pick and choose from? To me, however impressive it is, 'browser built from scratch' simply overstates it. Why not attempt something like a 3D game, where it's hard to find open source code to use?
      • Banditoz 1 hour ago
        Is something like a 3D game engine even hard to find source code for? There's gotta be lots of examples/implementations scattered around.
    • janoelze 3 hours ago
      Any views on the nature of "maintainability" shifting now? If a fleet of agents demonstrated the ability to bootstrap a project like that, would that be enough indication to you that orchestration would be able to carry the code base forward? I've seen fully LLM'd codebases hit a certain critical weight where agents struggled to maintain coherent feature development and keep patterns aligned, and spiralled into quick fixes.
      • simonw 3 hours ago
        Almost no idea at all. Coding agents are messing with all 25+ years of my existing intuitions about what features cost to build and maintain.

        Features that I'd normally never have considered building because they weren't worth the added time and complexity are now just a few well-structured prompts away.

        But how much will it cost to maintain those features in the future? So far the answer appears to be a whole lot less than I would previously budget for, but I don't have any code more than a few months old that was built ~100% by coding agents, so it's way too early to judge how maintenance is going to work over a longer time period.

      • brianjeong 2 hours ago
        I think there's a somewhat valid perspective that the N+1th model can simply clean up the previous model's mess.

        Essentially a bet that the rate of model improvement is going to be faster than the rate of decay from bad coding.

        Now this hurts me personally to see, as someone who actually enjoys having quality code, but I don't see why it doesn't have a decent chance of holding.

    • teaearlgraycold 1 hour ago
      It looks like JS execution is outsourced to QuickJS?
  • halfcat 2 hours ago
    So AI makes it cheaper to remix anything already-seen, or anything with a stable pattern, if you’re willing to throw enough resources at it.

    AI makes it cheap (eventually almost free) to traverse the already-discovered and reach the edge of uncharted territory. If we think of a sphere, where we start at the center, and the surface is the edge of uncharted territory, then AI lets you move instantly to the surface.

    If anything solved becomes cheap to re-instantiate, does R&D reach a point where it can’t ever pay off? Why would one pay for the long-researched thing when they can get it for free tomorrow? There will be some value in having it today, just like having knowledge about a stock today is more valuable than the same knowledge learned tomorrow. But does value itself go away for anything digital, and only remain for anything non-copyable?

    The volume of a sphere grows faster than the surface area. But if traversing the interior is instant and frictionless, what does that imply?

    • tornikeo 1 hour ago
      > The volume of a sphere grows faster than the surface area. But if traversing the interior is instant and frictionless, what does that imply?

      It's nearly frictionless, not frictionless, because someone has to use the output (or at least verify it works). Also, why do you think the "shape" of the knowledge is spherical? I don't presume to know the shape, but whatever it is, it has to be a fractal-like, branching, repeating pattern.

    • ukuina 37 minutes ago
      Single-idea implementations ("one-trick ponies") will die off, and composites that are harder to disassemble will be worth more.
    • ramraj07 1 hour ago
        The fundamental idea that modern LLMs can only ever remix, even if it's technically true (doubt), in my opinion only says to me that all knowledge is only ever a remix, perhaps even mathematically so. Anyone who still keeps insisting these are statistical parrots or whatever is just going to regret that position in the future.
      • pseudosavant 52 minutes ago
        But all of my great ideas are purely from my own original inspiration, and not learning or pattern matching. Nothing derivative or remixed. /sarcasm
      • mrbungie 1 hour ago
        > Anyone who still keeps implying these are statistical parrots or whatever is just going to regret these decisions in the future.

        You know this is a false dichotomy, right? You can consider LLMs statistical parrots and at the same time take advantage of them.

      • heavyset_go 1 hour ago
        Yeah, Yann LeCun is just some luddite lol
        • NitpickLawyer 25 minutes ago
          I don't think he's a luddite at all. He's brilliant at what he does, but he can also be wrong in his predictions (as are all humans from time to time). He did have 3 main predictions in ~2023-24 that turned out to be wrong in hindsight. Debatable why they were wrong, but yeah.

          In a stage interview (a bit after the "sparks of AGI in GPT-4" paper came out) he made 3 statements:

          a) LLMs can't do math. They can trick us with poems and subjective prose, but at objective math they fail.

          b) they can't plan

          c) by the nature of their autoregressive architecture, errors compound, so a wrong token will make their output irreversibly wrong and spiral out of control.

          I think we can safely say that all of these turned out to be wrong. It's very possible that he meant something more abstract and technical at its core, but in real life all of these things were overcome. So, not a luddite, but also not a seer.

          • gjadi 19 minutes ago
            Have these shortcomings of LLMs been addressed by better models or by better integration with other tools? Like, are they better at coding because the models are truly better or because the agentic loops are better designed?
            • NitpickLawyer 3 minutes ago
              100% by better models. Since his talk, models have gained longer context windows (up to a usable 1M tokens), and RL (reinforcement learning) has been amazing both at picking out good traces and at teaching the LLMs how to backtrack and overcome earlier wrong tokens. On top of that, RLAIF (RL with AI feedback) made earlier models better, and RLVR (RL with verifiable rewards) has made them very good at both math and coding.

              The harnesses have helped in training the models themselves (i.e. every good trace was "baked into" the model) and have improved at enabling test-time compute. But at the end of the day this is all put back into the models, and they become better.

              The simplest proof of this is on benchmarks like terminalbench and swe-bench with simple agents. The current top models are much better than their previous versions, when put in a loop with just a "bash tool". There's a ~100LoC harness called mini-swe-agent [1] that does just that.

              So current models + minimal loop >> previous gen models with human written harnesses + lots of glue.
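
              For flavor, that minimal loop fits in about a dozen lines. A hypothetical Python sketch (not the actual mini-swe-agent code; llm here is any callable that maps a transcript to the next shell command):

                import subprocess

                def bash_tool_loop(llm, task: str, max_steps: int = 50) -> str:
                    # The entire "harness": show the model the task plus all
                    # prior observations, run the command it emits, repeat.
                    transcript = [f"Task: {task}"]
                    for _ in range(max_steps):
                        cmd = llm("\n".join(transcript))
                        if cmd.strip() == "DONE":
                            break
                        out = subprocess.run(cmd, shell=True, capture_output=True,
                                             text=True, timeout=120)
                        transcript.append(f"$ {cmd}\n{out.stdout}{out.stderr}")
                    return "\n".join(transcript)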

              > Gemini 3 Pro reaches 74% on SWE-bench verified with mini-swe-agent!

              [1] - https://github.com/SWE-agent/mini-swe-agent

  • tinyhouse 3 hours ago
    Well, software is measured over time. The devil is always in the details.
    • aronowb14 1 hour ago
      Yeah, curious what would happen if they asked for an additional big feature on top of the original spec.
  • anilgulecha 3 hours ago
    That's a wild idea - a browser from scratch! And Ladybird has been moving at a snail's pace for a long time.

    I think good abstraction design and a good test suite will make or break the success of future coding projects.

  • vivzkestrel 3 hours ago
    I am waiting for that guy or team that uses LLMs to write the most optimal version of Windows in existence, something that even surpasses what Microsoft has done over the years. Honestly, looking at the current state of Windows 11, it really feels like it shouldn't even be that hard to make something more user-friendly.
    • kimixa 2 hours ago
      Considering Microsoft's significant (and vocal) investment in LLMs, I fear the current state of Windows 11 is related to a team trying to do exactly that.
      • g947o 1 hour ago
        I noticed that a dialog that has worked correctly for the past 10+ years is using a new and apparently broken layout. Elements don't even align properly.

        It's hard to imagine a human developer missing something so obvious.