The hidden cost of large language models is not scale, but governance

Mark Venables

Large language models promise fluency at an industrial scale, yet the difference between a pilot and a production system is not measured in parameters. The decisive factor is whether leaders can govern large language models with the same discipline they apply to finance, safety and cyber risk, and do so without blunting their value.

The past two years have been dominated by a race to size: bigger models, more tokens, faster inference. That energy has been valuable; it forced executives to treat AI as a board topic rather than a side project. It also created a blind spot. The organisations now getting value from large language models are not those chasing the largest model on the market. They are the ones that built the plumbing, set the guardrails, and learned how to embed systems that are persuasive by design but fallible by nature.

The lesson is simple to state and harder to execute. Large language models will improve the business only when governance strengthens the model instead of merely constraining it. The technical choices still matter: open source versus proprietary, fine-tune or prompt, edge or cloud. But they are secondary to a programme that manages data quality, risk, transparency and cost with intent.

What scale cannot buy

ChatGPT’s arrival made generative AI accessible to anyone with a browser and a question. It also made something else apparent: the most capable proprietary models sit behind APIs and remain opaque to customers. The compute required to train and serve them is vast, which keeps them under the control of their creators and obliges you to send data off-premises to use them. That raises privacy questions and leaves you with a black box you cannot inspect or fine-tune to your domain. The “absolutely enormous” cost footprint is not theoretical; GPT-4 reportedly cost over $100 million to develop, and running inference at scale compounds the bill.

Open source shifts the calculus. Communities have released a wave of capable models across tasks from text generation to classification and summarisation. They have not matched the very top tier on raw performance, but they are catching up rapidly. More importantly for an enterprise, they can be brought into your environment, governed under your controls, and adapted to your data. You avoid a black box and gain the option to fine-tune for the language, formats and edge cases that define your business. That is why many teams use a smaller open model, orders of magnitude lighter than a frontier system, and still achieve better domain performance, lower latency and tighter cost control.

This is not dogma. There will be workloads where a frontier API is the sensible choice, particularly for highly creative tasks or complex code synthesis. There will be others where an open model with targeted fine-tuning dominates in terms of cost-to-value. Treat the decision as a portfolio, not a bet. The one universal truth is that neither route absolves you from governance. Proprietary services require careful data-sharing policies and output validation. Open source demands ownership of training, evaluation and monitoring. Both need a strong data foundation.

What GPT-5 tells us about governance

OpenAI’s August 2025 release of GPT-5 highlights exactly why governance is non-negotiable and why model evolution alone is not enough. GPT-5 arrived as a unified system with fast and deeper reasoning models and an intelligent router that automatically selects the appropriate model based on conversation type, complexity, or explicit intent, including prompts such as “think hard about this”. It marked a significant step forward in reasoning, coding, accuracy, multimodal understanding, and control over output quality.

Yet the reaction was mixed. Some users missed the warmth and familiarity of earlier models; others criticised the router for reducing control, or even performance, in casual use. OpenAI responded by reinstating legacy models and promising greater steerability, an implicit admission that governance involves not just safety, but user experience, transparency, and the ability to tailor systems to human needs. For enterprises, that is a critical insight: even the best model can falter without interfaces, feedback loops, and fallback options that users understand and trust.

From experiments to systems you can trust

Most organisations started with proofs of concept: a summariser for meetings, a helper for customer service, a draft generator for documentation. Those are reasonable entry points and the guidebooks reflect them. LLMs summarise lengthy artefacts, cluster and classify text, analyse sentiment, translate content, and assist with code. Each of these is valuable; none of them on its own rewires a business.

The shift from showcases to systems begins when you close three gaps.

First, the data gap. Language models are not fact engines. They are probabilistic sequence predictors that can sound authoritative while being wrong. The moment you use them for decisions, you must constrain and ground them. One practical approach is to integrate the model with your own documentation and knowledge bases, enabling it to reference material it was never trained on. That simple idea, retrieve before you generate, turns a generic model into a context-aware assistant for your estate.
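
The retrieve-before-generate idea can be sketched in a few lines. This is a minimal illustration, not a production design: the corpus, the document ids and the word-overlap scoring are all assumptions made for the example, and a real system would more likely rank passages with embeddings or a search index.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def overlap_score(query: str, passage: str) -> int:
    """Count the tokens shared by query and passage (a crude relevance proxy)."""
    q, p = Counter(tokenize(query)), Counter(tokenize(passage))
    return sum((q & p).values())


def retrieve(query: str, corpus: dict[str, str], k: int = 2) -> list[str]:
    """Return the ids of the k passages with the highest non-zero overlap."""
    scored = [(overlap_score(query, text), doc_id) for doc_id, text in corpus.items()]
    scored = [(score, doc_id) for score, doc_id in scored if score > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best score first, ties by id
    return [doc_id for _, doc_id in scored[:k]]


def build_prompt(query: str, corpus: dict[str, str], k: int = 2) -> str:
    """Place the retrieved passages, tagged with their ids, ahead of the question."""
    ids = retrieve(query, corpus, k)
    context = "\n".join(f"[{doc_id}] {corpus[doc_id]}" for doc_id in ids)
    return (
        "Answer using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )


# Hypothetical documents standing in for an internal knowledge base.
CORPUS = {
    "hr-policy-7": "Employees may carry over up to five days of annual leave.",
    "it-runbook-2": "Restart the payment gateway with the failover script.",
    "hr-policy-9": "Parental leave requests need eight weeks of notice.",
}
```

Calling build_prompt("How many days of annual leave can I carry over?", CORPUS) yields a prompt that contains the two HR passages and leaves out the unrelated runbook; that prompt, rather than the bare question, is what would be sent to the model.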

Second, the feedback gap. Recent leaps in model quality have come from integrating human feedback into training and alignment. That principle carries over to the enterprise. You do not need to retrain a frontier model, but you do need structured feedback loops in production: capture thumbs-up and thumbs-down, annotate failure cases, and feed those signals into prompt templates, safety filters and fine-tuning sets. Human-in-the-loop is not a slogan; it is the only way you prevent persuasive nonsense from creeping into operations.
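
A feedback loop of this kind might start as little more than a structured log. The sketch below is illustrative only; the class names, the template identifiers and the +1/-1 rating convention are assumptions, not a reference to any particular product.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Feedback:
    prompt_template: str  # hypothetical identifier of the prompt template used
    rating: int           # +1 for thumbs-up, -1 for thumbs-down
    note: str = ""        # optional annotation from a reviewer


@dataclass
class FeedbackLog:
    events: list[Feedback] = field(default_factory=list)

    def record(self, prompt_template: str, rating: int, note: str = "") -> None:
        """Store one user rating, rejecting anything other than +1 or -1."""
        if rating not in (1, -1):
            raise ValueError("rating must be +1 or -1")
        self.events.append(Feedback(prompt_template, rating, note))

    def approval_rate(self) -> dict[str, float]:
        """Share of thumbs-up ratings per prompt template."""
        ups: dict[str, int] = defaultdict(int)
        totals: dict[str, int] = defaultdict(int)
        for event in self.events:
            totals[event.prompt_template] += 1
            if event.rating == 1:
                ups[event.prompt_template] += 1
        return {name: ups[name] / totals[name] for name in totals}

    def failure_cases(self) -> list[Feedback]:
        """Thumbs-down events, the candidates for annotation and fine-tuning sets."""
        return [event for event in self.events if event.rating == -1]
```

Comparing approval_rate() across template versions indicates whether a prompt change helped, while failure_cases() supplies the examples worth annotating.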

Third, the control gap. Governance is not a document on a shared drive; it is a living system. Register every model and prompt pattern as an artefact. Version them. Attach lineage to datasets. Decide which use cases are allowed per model class. Track cost per request and set budgets. None of this is glamorous. All of it is what separates a harmless pilot from a reliable platform.
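
To make the idea of a living control system concrete, here is a minimal sketch of a registry that versions models, records dataset lineage, restricts use cases and enforces a spending budget. Every name and figure in it is hypothetical, and a real deployment would persist this state rather than hold it in memory.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ModelRecord:
    name: str
    version: str
    allowed_use_cases: frozenset[str]
    training_datasets: tuple[str, ...]  # lineage: datasets this version was built on


@dataclass
class Registry:
    monthly_budget: float
    records: dict[tuple[str, str], ModelRecord] = field(default_factory=dict)
    spend: float = 0.0

    def register(self, record: ModelRecord) -> None:
        """File a model version as a governed artefact."""
        self.records[(record.name, record.version)] = record

    def authorise(self, name: str, version: str, use_case: str,
                  estimated_cost: float) -> bool:
        """Allow a request only for a registered model, a permitted use case,
        and a cost that keeps total spend within the monthly budget."""
        record = self.records.get((name, version))
        if record is None or use_case not in record.allowed_use_cases:
            return False
        if self.spend + estimated_cost > self.monthly_budget:
            return False
        self.spend += estimated_cost
        return True
```

The point of the design is that the policy check and the cost check sit on the same path as the request, so a use case that was never approved, or a budget that is already spent, stops the call rather than appearing in a report afterwards.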

Designing an enterprise LLM stack you can trust

Executives do not need another reference architecture; they need a map that links decisions to outcomes. The stack below is a pragmatic baseline. It will not be identical in every company, but each layer is there for a reason.

Data foundation. Your model is only as useful as the data you can safely expose to it. Bring structured and unstructured data into a common layer where they can be discovered, governed and retrieved at low latency. Meeting transcripts, PDFs, policies, emails and code repositories should be searchable by meaning, not just by keyword. Build the entitlement model early so sensitive sources are never retrieved for the wrong audience. Strength lies in the availability of data with strong access control.
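
One way to picture the entitlement model is as a filter applied before any ranking or generation takes place. The sketch below assumes a simple group-based scheme with made-up group names; real entitlement systems are usually richer, but the ordering of the steps is the point.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    doc_id: str
    text: str
    allowed_groups: frozenset[str]  # groups entitled to see this document


def visible_documents(docs: list[Document], user_groups: set[str]) -> list[Document]:
    """Keep only documents that share at least one group with the user.

    Filtering happens before retrieval, so restricted text can never be
    placed into a prompt on behalf of a user who may not read it.
    """
    return [doc for doc in docs if doc.allowed_groups & user_groups]


# Hypothetical documents and groups, for illustration only.
DOCS = [
    Document("handbook", "Office opening hours and travel policy.", frozenset({"all-staff"})),
    Document("salary-bands", "Salary bands by grade.", frozenset({"hr"})),
    Document("runbook", "Database failover procedure.", frozenset({"engineering"})),
]
```

An engineer in the groups "all-staff" and "engineering" would have the handbook and the runbook searched on their behalf, and the salary document would never enter the candidate set.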

Model portfolio. Operate a tiered set of models: frontier APIs for the few tasks that benefit from maximal capability; one or two mid-size open models fine-tuned to your domain; and small local models for privacy-critical or edge scenarios. Switching costs are real, so standardise interfaces and keep prompts and retrieval logic model-agnostic where possible. The open source route enables in-house fine-tuning and control; proprietary models offer peak performance at the expense of transparency and cost. Manage that trade-off deliberately.
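
A standardised interface can be as small as one method. The following sketch shows a tiered portfolio behind a common protocol; the tier names, the routing rule and the EchoModel stand-in are assumptions for illustration, and real clients for a frontier API or a local model would implement the same method.

```python
from typing import Protocol


class TextModel(Protocol):
    """The only contract calling code relies on, whatever sits behind it."""

    def complete(self, prompt: str) -> str: ...


class EchoModel:
    """Stand-in for a real model client; it simply labels and returns the prompt."""

    def __init__(self, label: str) -> None:
        self.label = label

    def complete(self, prompt: str) -> str:
        return f"[{self.label}] {prompt}"


def choose_tier(contains_sensitive_data: bool, needs_max_capability: bool) -> str:
    """Privacy takes precedence over capability in this illustrative policy."""
    if contains_sensitive_data:
        return "local"
    if needs_max_capability:
        return "frontier"
    return "mid"


def run(prompt: str, models: dict[str, TextModel], *,
        contains_sensitive_data: bool = False,
        needs_max_capability: bool = False) -> str:
    """Route the prompt to a tier and call it through the shared interface."""
    tier = choose_tier(contains_sensitive_data, needs_max_capability)
    return models[tier].complete(prompt)


MODELS: dict[str, TextModel] = {
    "frontier": EchoModel("frontier"),
    "mid": EchoModel("mid"),
    "local": EchoModel("local"),
}
```

Because callers depend only on complete(), replacing the model behind a tier changes one dictionary entry rather than every prompt and retrieval path that uses it.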

Retrieval and grounding. Use retrieval to inject live, authoritative context at query time. Point it at policy libraries, product specs, runbooks and recent tickets. Where the corpus is large or messy, normalise it first. The target is “fewer hallucinations, more citations.”

Safety and observability. Put classifiers and rule checks in front of and behind the model: prompt filters, PII redaction, toxicity screens, jailbreak detection. Log every request and response with metadata. Score outputs for quality. Alert on drift in both cost and correctness.
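
As an illustration of checks placed in front of and behind the model, the sketch below redacts two kinds of personal data and logs each exchange. The regular expressions are deliberately simple assumptions that will miss many real-world formats; production systems typically rely on dedicated PII detection.

```python
import re
from typing import Callable

# Deliberately simple patterns; they are illustrative, not exhaustive.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s-]{7,}\d")


def redact(text: str) -> str:
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


def guarded_call(prompt: str, model: Callable[[str], str], log: list[dict]) -> str:
    """Redact the prompt, call the model, redact the response, and log both."""
    safe_prompt = redact(prompt)
    response = redact(model(safe_prompt))
    log.append({
        "prompt": safe_prompt,
        "response": response,
        "prompt_was_redacted": safe_prompt != prompt,
    })
    return response
```

Note that the log stores only the redacted text, so the observability layer does not itself become a store of personal data.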

Adaptation and fine-tuning. Fine-tune where it pays. Techniques that adjust a model on your domain examples can produce step-changes in relevance with modest datasets. Treat it as a capability rather than an experiment.

Human-in-the-loop. Keep approval points in the workflow for actions with material impact. Capture the user’s edits and route them back into evaluation sets so the model learns from real operations rather than synthetic examples.

Measuring what matters

Benchmarks and leaderboards are useful for researchers; they are weaker guides for executives. Replace generic metrics with operational ones. Start with fitness for purpose. If the assistant writes change-control drafts, measure the time saved per ticket and the rework rate. If it summarises customer calls, measure downstream resolution time and NPS shift.

Track total cost of ownership. Training costs are headline-grabbing, but inference usually dominates at scale. Choose the smallest model that meets the task, then prove it with A/B tests.
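
The selection rule can be written down directly. In this sketch the prices, pass rates and model names are invented for illustration, and price per thousand tokens is used as a stand-in for model size; the pass rate is assumed to come from your own evaluation set rather than a public benchmark.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    name: str
    cost_per_1k_tokens: float  # hypothetical price, in your billing currency
    task_pass_rate: float      # measured on your own evaluation set


def monthly_cost(candidate: Candidate, requests_per_month: int,
                 tokens_per_request: int) -> float:
    """Inference cost for a month at a given volume."""
    return candidate.cost_per_1k_tokens * tokens_per_request / 1000 * requests_per_month


def cheapest_adequate(candidates: list[Candidate],
                      min_pass_rate: float) -> Optional[Candidate]:
    """Pick the cheapest candidate that clears the quality bar, if any does."""
    adequate = [c for c in candidates if c.task_pass_rate >= min_pass_rate]
    return min(adequate, key=lambda c: c.cost_per_1k_tokens) if adequate else None


# Invented figures, for illustration only.
CANDIDATES = [
    Candidate("small-local", 0.001, 0.82),
    Candidate("mid-open", 0.004, 0.93),
    Candidate("frontier-api", 0.030, 0.96),
]
```

With a quality bar of 0.90 the rule selects the mid-size model; at one million requests of 500 tokens each, its monthly bill under these invented prices is about 2,000 against about 15,000 for the frontier option. An A/B test then confirms whether the measured pass rate holds up with real users.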

Insist on explainability through evidence, not wishful thinking. If the model is answering from policy, show the relevant clauses. If it proposes code changes, link to similar accepted patches.
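
Evidence-based explainability can be checked mechanically. The sketch below assumes answers cite their sources with bracketed ids such as [hr-policy-7], which is a convention invented for this example, and verifies that every cited id was actually among the sources supplied to the model.

```python
import re

# Assumed citation convention: a source id in square brackets, e.g. [hr-policy-7].
CITATION = re.compile(r"\[([A-Za-z0-9_-]+)\]")


def cited_ids(answer: str) -> set[str]:
    """Extract every bracketed source id from the answer text."""
    return set(CITATION.findall(answer))


def check_citations(answer: str, retrieved_ids: set[str]) -> dict:
    """Report whether the answer cites anything, and list any cited ids
    that were not among the sources supplied to the model."""
    cited = cited_ids(answer)
    return {
        "has_citation": bool(cited),
        "unknown_sources": sorted(cited - retrieved_ids),
    }
```

An answer with no citation, or one that cites a source it was never given, can be blocked or routed to a reviewer before it reaches the user.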

Invest in evaluation as a product. Create gold-standard sets per use case. Mix automatic scoring with periodic human review. Score for helpfulness and for harm. Version and track metrics over time.
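
A gold-standard set with automatic scoring might begin like this. The sketch uses simple phrase matching to score helpfulness and harm, which is an assumption made for brevity; the case contents are invented, and periodic human review would still sit alongside it.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class GoldCase:
    prompt: str
    must_include: tuple[str, ...]           # phrases a helpful answer contains
    must_not_include: tuple[str, ...] = ()  # phrases that signal harm


def score_case(case: GoldCase, output: str) -> dict[str, bool]:
    """Score one output for helpfulness and for harm, case-insensitively."""
    text = output.lower()
    helpful = all(phrase.lower() in text for phrase in case.must_include)
    harmful = any(phrase.lower() in text for phrase in case.must_not_include)
    return {"helpful": helpful, "harmful": harmful}


def evaluate(cases: list[GoldCase], model: Callable[[str], str]) -> dict[str, float]:
    """Run every gold case through the model and return aggregate rates."""
    if not cases:
        raise ValueError("the gold set is empty")
    results = [score_case(case, model(case.prompt)) for case in cases]
    total = len(results)
    return {
        "helpful_rate": sum(r["helpful"] for r in results) / total,
        "harm_rate": sum(r["harmful"] for r in results) / total,
    }
```

Storing the returned rates alongside the model and prompt version gives the tracked history that the paragraph above calls for.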

The strategic posture for leaders

The temptation is to delegate governance to a technical team. Resist that. Leadership posture determines whether AI becomes just another function or the nervous system of the enterprise.

Set scope and ambition in plain language. Define where generative outputs may appear, what data stays private, and how errors are handled. Sponsor the unpopular work: data hygiene, entitlement modelling, and red-teaming. Treat them as prerequisites, not optional extras.

Remain deliberate about build versus buy. Use frontier APIs when performance is essential and risk is low. Use open models when privacy or cost matters. Build when your data and domain knowledge differentiate you. Prepare your people. Expertise in prompt design, interpreting model rationale, and spotting failure modes must extend across the workforce. Train, equip, and govern them.

The next sensible step

Executives often ask for a low-risk starting point that matters. Launch a grounded assistant for one high-value domain. Constrain the input, retrieve from vetted sources, apply safety filters, and measure usage. Let it fail in production, learn fast, then expand.

Follow with wider use cases: automated summarisation tailored to what each stakeholder needs, and agentic workflows that model domain logic. These are not speculative; they represent how grounded, governed LLMs evolve in practice.

AI is changing software. But it has not changed our responsibilities. The demands for accuracy, accountability, privacy and trust are higher than ever. Scale for its own sake might capture headlines, but governance ensures you capture value. In that difference lies leadership.