ian_j_butler 2 hours ago [-]
Hah, since it's open in another tab: Talk Isn’t Always Cheap: Understanding Failure Modes in Multi-Agent Debate @ https://arxiv.org/html/2509.05396v2
To actually engage more with the substance of TFA.. very refreshing to see someone bringing numbers. To me this shows we (still) need something somewhere between the numbers and the anecdata. It's annoying to hear claims of "1000x productivity" or claims of negative/neutral productivity without any extra context whatsoever. So you brought data? Great! But still no context. Boring CRUD? Complex UI? A rewrite/port of legacy? What industry, language, how many human collaborators, and what baseline for SLOC??
We need to get rigorous about this stuff and actually aim for a decent framework which can answer "Are LLMs a value add for this project? How much value? How much cost?"
Such a thing might be information-theoretic, complexity-based, or based on counting integration touchpoints / info boundaries / sources of ground truth? We could even try to implement that framework with LLMs and probably should! But the default answer of "Yes definitely, always useful, just token-maxx it since LLMs are the future" is (still) only marketing, not engineering.
vgordon 30 minutes ago [-]
The activity-vs-outcome distinction feels like the most interesting part of this discussion. Curious what metrics people have found most useful for separating the two.
sublinear 2 hours ago [-]
More tasks get "done" while rework is sky high and overall throughput to production drops.
First, I'd like to thank all the people working on testing and doing the lord's work.
Anyway, this isn't even a pattern unique to LLM use. We've all seen this exact same thing when more devs are added to a project running late, when teams are siloed, when work is outsourced to contractors, etc.
northstar702 4 minutes ago [-]
Definitely not unique. That said, given how "easy" it is to scale LLM output compared to human output, could this pattern be messier in the LLM era? The whole tokenmaxxing motion assumed more output equals more outcomes. What do you think?
gtirloni 2 hours ago [-]
People are still figuring things out, there's a lot of wasted tokens, etc.
This is like complaining a student isn't as productive as a senior engineer.
I think we as an industry haven't even graduated to junior level when it comes to figuring out how to use AI to improve things.
ozlikethewizard 2 hours ago [-]
This is discussed in the article, and I think the author makes pretty reasonable arguments for why, by nature, we will not see the reliability of LLM usage improve. They also discuss what I agree is the more effective way to use an LLM: as a feedback and refinement tool, not a decision maker.
deadbabe 2 hours ago [-]
I'm curious, LLMs have been around for a while now...
How many of you would say you need LLMs now for work? Not that you want it because it's nice to have, but rather you would literally not be able to do your job at all because you don't have an LLM to use?
If your company said "We're not paying for LLMs anymore.", would you begrudgingly pay for or host your own LLM that complies with company policies, or just go back to writing everything by hand?
I feel like companies could definitely just push the cost of LLMs back onto the engineers themselves (much like how people have to pay for their own gas to go to work), and engineers would have no choice but to either buy their own subscriptions or be very good at writing code by hand just to stay competitive.
This kind of shift is coming, partly because the costs of LLMs are too unsustainable for companies, but also because it sounds like the kind of diabolical idea some upper management thinks they can get away with, as peer pressure will naturally do its thing. Paying for your own token usage is a small price to pay for job security, isn't it?
Cyan488 2 hours ago [-]
I'm an embedded systems developer. I have almost fully "outsourced" the Python code for the frontend PC software that interacts with my firmware.
I deliberately continue to write all my firmware by hand, and will occasionally consult AI for review. I never use AI to write prose for me.
Python is better represented in training data, writing bench software was a bit boring, and I get to spend more time where I have (and continue to build) domain knowledge.
Agentic Opus is a nice-to-have and I get to explore the frontier tech, but if (or when) it's taken away, a self-hosted coding model would be fine - I'd just have to dust off my Python skills and it would take longer.
6stringmerc 2 hours ago [-]
“It’s hard for me to put into words how bad this is.”
AHAHAHAHA! I genuinely laughed out loud, filled the room.
This is the “citation needed” rebuttal most REASONABLE people in the software industry have been looking for, and the sample size is only going to get bigger! Does anyone really think - not believe, logically conclude based on evidence at hand - that there will be any contrary outcomes at scale? Honestly, this is going to put a lot of software pros into a “sabbatical” where cleaning up the mess should be billed like that old joke at auto mechanics’ shops:
$150 an hour to fix it
$500 an hour if you tried to fix it before bringing it here
Seriously laughing out loud. I needed this. Hahaha
Edit: no wonder he’s so astute - came up in construction. In that industry cost versus benefit and safety concerns are often a matter of life or death. Real consequences. A lot more educational than staring at a glowing box for most of one’s career. Disclosure: I did risk management for construction projects, like the Alabama MB plant.