Why iteration, not inspiration, is the key to success in productionising GenAI models


Generative AI is easy to demo but notoriously difficult to deploy at scale, with many promising applications failing under real-world conditions. Success depends on a new approach rooted in experimentation, rigorous evaluation, and relentless iteration.

Few technologies have captured the imagination of business leaders quite like generative AI. Its potential is widely discussed, from image creation to code generation and customer service. Yet, while ideas are easy to demonstrate, translating them into reliable, scalable systems remains a challenge that even the most advanced teams are still learning to navigate.

According to Lukas Biewald, Chief Executive Officer and Co-Founder of Weights & Biases, that challenge lies not in imagination but in iteration. “It is easy to demo and hard to productionise,” he says. “You see many apps come out with impressive adoption numbers and catastrophic retention. The imagination captured by generative AI masks how brittle and unpredictable it can be when deployed.”

This brittleness, he argues, stems from the essential difference between traditional software development and GenAI. Where conventional software follows deterministic logic, GenAI operates probabilistically. The outputs shift with context, prompt phrasing, or seemingly trivial changes. For business leaders seeking to deploy AI systems at scale, this presents not just a technical challenge but an operational one—because the rules of the game have changed.

Why GenAI demands new thinking

The shift from rule-based logic to generative models is not simply a matter of tool choice. It demands a new mental model. In traditional engineering, success comes from careful planning, execution, and testing. In generative AI, success is shaped by trial, observation, refinement, and frequent failure.

Biewald likens this to the difference between engineering and science. “In software, code is sacred,” he explains. “No one would discard code that works. But in AI, we see companies constantly throwing away learning. That is a critical mistake that creates huge inefficiencies. In AI development, your learning is your intellectual property. The trials that fail are just as valuable as the ones that succeed.”

This principle shaped the evolution of Weights & Biases’ own customer support agent, OneBot. What began as a relatively simple application built around a retrieval-augmented generation (RAG) pipeline quickly revealed the limits of surface-level evaluation. While the demo version seemed promising, real-world use exposed hallucinations, irrelevant responses, and inconsistent behaviour. User feedback alone proved insufficient; only a structured evaluation framework that measured accuracy, completeness, and relevance against actual usage data made meaningful improvement possible.

The first internal assessments rated OneBot’s accuracy at just 25 per cent. It was only after extensive prompt engineering, model testing, and the use of techniques like keyword re-ranking and sub-query detection that accuracy reached a more robust 80 per cent. Along the way, the team learned a lesson many enterprises are now discovering: performance is use-case specific, and benchmarks are no substitute for context-aware evaluation.
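To make this concrete, an evaluation harness of the kind described might look like the minimal sketch below. The three scoring dimensions come from the OneBot example; the `llm` helper, the judge prompt, and the dataset format are illustrative assumptions rather than Weights & Biases’ actual implementation.

```python
# Minimal sketch of a context-aware evaluation harness for a RAG support bot.
# The scoring dimensions (accuracy, completeness, relevance) come from the article;
# the `llm` helper, judge prompt, and dataset format are hypothetical placeholders.
from statistics import mean

def llm(prompt: str) -> str:
    """Placeholder for whatever chat-completion call the team actually uses."""
    raise NotImplementedError("wire up your model provider here")

JUDGE_PROMPT = """You are grading a support bot answer against a reference answer.
Question: {question}
Reference answer: {reference}
Bot answer: {answer}
Score accuracy, completeness and relevance from 0 to 1, returned as three
comma-separated numbers, e.g. "0.8, 0.5, 1.0"."""

def evaluate(bot, dataset):
    """dataset: list of {"question": ..., "reference": ...} built from real support tickets."""
    scores = {"accuracy": [], "completeness": [], "relevance": []}
    for example in dataset:
        answer = bot(example["question"])
        raw = llm(JUDGE_PROMPT.format(answer=answer, **example))
        acc, comp, rel = (float(x) for x in raw.split(","))
        scores["accuracy"].append(acc)
        scores["completeness"].append(comp)
        scores["relevance"].append(rel)
    return {k: mean(v) for k, v in scores.items()}
```

The point is less the specific prompt than the discipline: every change to the bot is scored against the same real-world dataset, so a jump from 25 to 80 per cent reflects measured progress rather than intuition.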

The illusion of progress

One of the more insidious challenges in GenAI development is what Biewald refers to as the ‘third 90 per cent’ problem. Teams repeatedly feel they are nearing completion, only to discover a new set of failures: previously hidden issues with security, compliance, or model output.

“You think you are nearly done multiple times, but then discover new security, compliance, or quality issues that were not anticipated,” he says. “That is even more true with agents. From what we see, every customer tries to put agents into production, but only a small handful have succeeded.”

This is not a critique of ambition. Instead, it is a warning that generative AI amplifies complexity, particularly agentic systems. Agents combine deterministic code with probabilistic reasoning. Their interactions unfold over multiple steps, each with the potential to go awry. Organisations treating them like traditional applications often find that their expectations unravel in contact with reality.

As Weights & Biases scaled OneBot, they added intent detection to divert off-topic questions, basic guardrails to manage inappropriate content, and domain-specific keyword techniques that outperformed semantic search in technical areas. No single breakthrough defined the success. Instead, the improvements came from layering small changes, each driven by rigorous evaluation, not assumption.
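A layered pipeline of this sort can be sketched as follows. The function names, thresholds, and canned replies are assumptions for illustration, not OneBot’s code; the structure simply mirrors the layers described above.

```python
# Sketch of a layered support pipeline: an intent check, a basic content guardrail,
# then keyword-boosted retrieval before generation. Names and replies are illustrative.

OFF_TOPIC_REPLY = "I can only help with Weights & Biases questions."
BLOCKED_REPLY = "Sorry, I can't help with that."

def answer(question: str, docs: list[str], llm, classify_intent, is_inappropriate) -> str:
    # Layer 1: divert questions that are not about the product at all.
    if classify_intent(question) == "off_topic":
        return OFF_TOPIC_REPLY
    # Layer 2: a coarse guardrail for clearly inappropriate content.
    if is_inappropriate(question):
        return BLOCKED_REPLY
    # Layer 3: simple keyword overlap as a ranking signal; in technical domains,
    # exact terms (API names, flags) can beat purely semantic search.
    keywords = set(question.lower().split())
    ranked = sorted(docs, key=lambda d: len(keywords & set(d.lower().split())), reverse=True)
    context = "\n\n".join(ranked[:3])
    return llm(f"Answer using only this context:\n{context}\n\nQuestion: {question}")
```

Each layer is individually unremarkable; the gains come from evaluating the stack as a whole after every change.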

“Subjective evaluations like ‘testing by vibes’ often suggest better performance than what real-world usage will reveal,” Biewald notes. “Once we had a consistent evaluation framework, we could test different models. GPT-3.5 Turbo outperformed GPT-4 in our setup, even though it is considered a weaker model overall.”

This outcome offers a valuable lesson: the most powerful model is not always the best. Enterprise deployment requires fit-for-purpose evaluation, not blind allegiance to hype or leaderboard rankings.

Complexity at scale

If OneBot represents a practical application, then Weights & Biases’ work on SWE-bench offers a glimpse into the bleeding edge of GenAI development. SWE-bench is a dataset of real GitHub issues with associated tests, designed to evaluate whether AI agents can resolve software bugs. The task is deceptively difficult. Early systems scored less than two per cent accuracy. Even Devin, a commercial product with significant engineering investment, reached just under 14 per cent.

Biewald’s co-founder and CTO, Sean, approached the challenge by building an agentic pipeline that used external tools to parse code and summarise relevant segments. One key insight was the limited context window available to LLMs. Rather than trying to feed all the information at once, the system used a combination of search and summarisation to locate only the most relevant code blocks.

He ran each task five times in parallel to generate diverse responses, then employed a discriminator model to select the best answer, based not on the SWE-bench tests but on proxy indicators to avoid data leakage. The result was a dramatic improvement: 64 per cent accuracy using five cross-checks, placing Weights & Biases at the top of the verified leaderboard.

“Discrimination is easier than generation, so that part is more scalable,” Biewald explains. “As LLMs get cheaper, we can increase rollouts and accuracy simply by adding more compute.” Yet this was not a story of a single breakthrough but of methodical iteration. Over two months, the team ran 977 experiments, each involving 500 test issues. The tools were not exotic. Google Sheets tracked progress. Counterfactuals were used to assess whether changes to prompts or tools meaningfully impacted results. The process was low-glamour but high-value.

“Looking at real examples, not just metrics, was essential to understand where the agent was failing,” Biewald continues. “Non-determinism is a huge factor. Even simple queries sometimes give different results every time. You must run multiple tests to gauge reliability.”

The implication for business leaders is clear: building reliable GenAI systems requires infrastructure that supports continuous testing, rapid iteration, and the ability to draw insight from both success and failure.

Democratisation and its demands

One of the more striking observations is how far the AI landscape has shifted in just a few years. Weights & Biases originally served highly technical teams at companies like Meta, OpenAI, and Nvidia. Today, its customer base includes many developers building AI into real-world applications. The democratisation of AI is no longer theoretical; it has happened.

That shift brings new pressures. As GenAI tools become more accessible, expectations rise. Yet the underlying complexity remains. While a five-year-old can use voice instructions to prompt an LLM into building a game, deploying that same model in an enterprise environment that demands security, compliance, accuracy, and auditability is entirely different.

To bridge this gap, Biewald and his team are building Weave, a platform designed to help teams evaluate and monitor their AI systems, collaborate securely, and manage iteration workflows. This reflects a broader truth that executives must now confront: deploying LLMs is a team sport. Success depends not on a single model or feature but on infrastructure that supports versioning, monitoring, testing, and adaptation.

“Having done this for over 20 years, I can confidently say we are in a transformative era,” Biewald concludes. “For 2025, that shift is AI. And it is not just a single trend but a platform that absorbs everything else.”

The path to maturity

For organisations looking to scale GenAI, the lesson from Weights & Biases is one of humility and rigour. Demos may impress, and proofs of concept may work. However, durable, enterprise-grade systems only emerge through methodical iteration. There is no substitute for evaluation frameworks that reflect real-world usage, no shortcut to learning from failure, and no single model that fits all tasks.

The question for executives is not whether GenAI can deliver value. It is how to create the systems, processes, and teams capable of turning potential into production. As Biewald’s experience shows, the difference between 25 per cent and 80 per cent accuracy is not magic; it is iteration, infrastructure, and a relentless focus on what works in the real world.
