18 Months Running LLMs in Production: The Parts Nobody Talks About

The Tutorial vs. Reality Gap

Every LLM tutorial follows the same arc: install the SDK, write a prompt, get an impressive response, ship it. The tutorial stops before the part where your prompt behaves differently at 3am on a Tuesday, your costs run well past projections, the model returns a confident hallucination to a user who trusts it, and behavior starts degrading mysteriously after the provider updates their infrastructure.

These aren't edge cases. They're the default state of LLMs in production. The gap between tutorial code and production-ready code is not a few extra lines of error handling — it's an entirely different engineering discipline.

At Simplilearn, we've been running LLMs in production across several distinct features: a quiz generator that creates assessment questions from course content, chatbots for learner support, and voice agents built on a voice platform for interactive learning scenarios. Each surface taught us something different. This is what we actually encountered.

Choosing the Right Model for the Right Job

Before you can manage LLMs in production, you have to make a model selection decision that most tutorials skip entirely: which model for which task. The "just use the best available model" instinct is expensive and often wrong.

We evaluated three models across our use cases — Claude for complex reasoning and generation tasks, Codex for code generation, and Gemini for specific tasks where it fit the requirements. The evaluation wasn't just about output quality in isolation; it was about quality per dollar, latency characteristics, and fit for the specific prompt shapes each feature needed.

The lesson from this evaluation: model selection is a product decision, not just an infrastructure one. Claude's reasoning depth made it the right choice for quiz question generation, where ambiguity in source content needs to be resolved thoughtfully and output quality is evaluated by domain instructors. For code generation assistance, Codex's specialization produced better results on the specific task. The discipline of evaluating rather than defaulting paid off materially — our AI tooling costs settled at roughly $125/month against an estimated API-equivalent spend of over $1,500/month, a 12× difference driven largely by deliberate model tiering and caching rather than reaching for the largest model on every call.
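One way to keep the per-task decision explicit is a small routing table that maps each task type to a model tier. The sketch below is illustrative only: the task names, tier labels, and model identifiers are hypothetical placeholders, not the configuration described above.

```typescript
// Hypothetical task types and tiers; all names are placeholders for illustration.
type TaskType = "quiz_generation" | "code_assist" | "ticket_classification";
type ModelTier = "large" | "code" | "small";

interface ModelChoice {
  tier: ModelTier;
  model: string; // placeholder identifier, not a real model name
}

// Record<TaskType, ...> makes the compiler reject a task with no model assigned,
// so adding a new task forces an explicit model decision.
const MODEL_BY_TASK: Record<TaskType, ModelChoice> = {
  quiz_generation: { tier: "large", model: "large-reasoning-model" },
  code_assist: { tier: "code", model: "code-specialized-model" },
  ticket_classification: { tier: "small", model: "small-fast-model" },
};

function selectModel(task: TaskType): ModelChoice {
  return MODEL_BY_TASK[task];
}
```

Keeping the mapping in one place also gives cost reviews a single file to look at when a feature's spend changes.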

Latency Is a Feature

The first thing users taught us is that LLM latency is not just a performance metric — it's a trust signal. A response that takes several seconds feels wrong even when the content is correct. Users assume something failed.

We addressed this in two ways. First, we stream any LLM output that renders as text, so users see tokens appearing almost immediately even when the full response takes longer to complete. Second, we pre-compute responses asynchronously for predictable queries and serve them from cache on the actual user request. For our recommendations feature, page load went from waiting on a live LLM call to reading from a warm cache, with a non-blocking background job refreshing the cached value.


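// Assumes llm, cache, buildUserContext, and buildRecommendationPrompt are defined elsewhere.

// Background job: pre-compute recommendations, cache them, and return the result
async function refreshUserRecommendations(userId: string) {
  const context = await buildUserContext(userId);
  const recommendations = await llm.generate({ prompt: buildRecommendationPrompt(context) });
  await cache.set(`recs:${userId}`, recommendations, { ttl: 3600 });
  return recommendations; // returned so the cold-start path has something to serve
}

// Page load: serve from cache, trigger background refresh
async function getRecommendations(userId: string) {
  const cached = await cache.get(`recs:${userId}`);
  if (cached) {
    // Non-blocking refresh: not awaited, but errors are caught so a failed
    // refresh never becomes an unhandled promise rejection
    refreshUserRecommendations(userId).catch((err) => console.error("recs refresh failed", err));
    return cached;
  }
  return refreshUserRecommendations(userId); // cold start, wait for result
}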

The streaming and caching strategies together eliminate most of the cases where users experience the raw latency of a synchronous LLM call.
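The streaming half of that strategy can be sketched with an async iterator. This is a minimal illustration rather than a provider integration: `fakeTokenStream` stands in for whatever streaming interface your SDK exposes, and `onChunk` is a hypothetical callback that would write to the HTTP response or the UI.

```typescript
// Stand-in for a provider's streaming API (assumption): yields text chunks over time.
async function* fakeTokenStream(chunks: string[], delayMs = 0): AsyncGenerator<string> {
  for (const chunk of chunks) {
    await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
    yield chunk;
  }
}

// Forward each chunk as soon as it arrives, and return the full text at the end
// so the caller can still log or cache the complete response.
async function streamToClient(
  stream: AsyncIterable<string>,
  onChunk: (chunk: string) => void,
): Promise<string> {
  let full = "";
  for await (const chunk of stream) {
    onChunk(chunk); // the user sees partial output immediately
    full += chunk;
  }
  return full;
}
```

Returning the assembled text matters in practice: streaming improves perceived latency, but logging and caching still need the complete output.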

Cost Control Strategies

LLM costs scale in ways that surprise teams used to predictable infrastructure pricing. The failure modes we encountered: retry logic re-sending full context windows on transient failures, prompt templates that grew over time without token count reviews, and features that called the LLM on every user action when once per session would have sufficed.

The controls that actually worked:

Token budgets at the prompt-building layer. If context exceeds a defined limit, truncate with an explicit strategy rather than silently sending an oversized prompt. Oversized prompts are expensive and often don't produce proportionally better outputs.

Model tiering. Use smaller, cheaper models for classification and routing tasks. Reserve larger models for generation tasks that justify the cost. The quiz generator warranted Claude's reasoning depth. A support ticket classifier did not.

Result caching with appropriate TTLs. Deterministic-enough queries — where the same input reliably produces equivalent useful output — don't need a live model call every time. Cache the output.

Cost attribution by feature and call site. Invisible costs stay unchecked. When teams can see which feature is spending what, optimization conversations happen naturally. We built a simple cost dashboard that made this visible.
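The token-budget control from the list above can be sketched as a prompt builder that enforces a limit before anything is sent. Everything here is an assumption for illustration: the four-characters-per-token estimate is a rough heuristic (a real tokenizer would replace it), and the `priority` field and function names are hypothetical.

```typescript
interface ContextChunk {
  text: string;
  priority: number; // higher = more important (hypothetical field)
}

interface BudgetedPrompt {
  prompt: string;
  usedTokens: number; // estimated upper bound, not an exact count
  droppedChunks: number;
}

const SEPARATOR = "\n\n";

// Rough heuristic (assumption): about 4 characters per token.
// Swap in your provider's tokenizer for real counts.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Always keep the instructions; add context chunks from highest priority down
// while they fit, and report what was dropped instead of silently sending
// an oversized prompt.
function buildPromptWithinBudget(
  instructions: string,
  chunks: ContextChunk[],
  maxTokens: number,
): BudgetedPrompt {
  let usedTokens = estimateTokens(instructions);
  if (usedTokens > maxTokens) {
    throw new Error("Instructions alone exceed the token budget");
  }
  const kept: string[] = [];
  const byPriority = [...chunks].sort((a, b) => b.priority - a.priority);
  for (const chunk of byPriority) {
    const cost = estimateTokens(SEPARATOR + chunk.text);
    if (usedTokens + cost > maxTokens) continue; // skip chunks that do not fit
    kept.push(chunk.text);
    usedTokens += cost;
  }
  return {
    prompt: [instructions, ...kept].join(SEPARATOR),
    usedTokens,
    droppedChunks: chunks.length - kept.length,
  };
}
```

Surfacing `droppedChunks` is the point of the design: truncation becomes a visible, loggable event rather than something that quietly changes output quality.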

The 12× difference between our actual spend and a naive API-equivalent estimate came from applying these controls consistently, not from any single clever trick.
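Cost attribution by feature reduces to a small aggregation over per-call token counts. The sketch below assumes each call is logged with a feature tag and token counts; the record shape and the per-million-token price table are hypothetical, and real prices vary by provider and model.

```typescript
interface CallRecord {
  feature: string; // e.g. the call site or product feature that made the call
  model: string;
  inputTokens: number;
  outputTokens: number;
}

// Hypothetical price table in USD per million tokens.
interface ModelPrice {
  inputPerMTok: number;
  outputPerMTok: number;
}

// Sum estimated spend per feature so each team can see its own share.
function costByFeature(
  calls: CallRecord[],
  prices: Record<string, ModelPrice>,
): Map<string, number> {
  const totals = new Map<string, number>();
  for (const call of calls) {
    const price = prices[call.model];
    if (!price) throw new Error(`No price configured for model: ${call.model}`);
    const cost =
      (call.inputTokens / 1_000_000) * price.inputPerMTok +
      (call.outputTokens / 1_000_000) * price.outputPerMTok;
    totals.set(call.feature, (totals.get(call.feature) ?? 0) + cost);
  }
  return totals;
}
```

Failing loudly on an unpriced model is deliberate here: a silently skipped model is exactly the kind of invisible cost the dashboard is meant to expose.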

Hallucination Mitigation

You cannot eliminate hallucinations. You can build systems that catch them before users see them, or limit the blast radius when they occur.

The effective techniques depend on what you're generating. For factual claims — course information, dates, prerequisites, pricing — we constrain the model to only use information explicitly provided in the prompt context. No reliance on model knowledge. If the answer isn't in the context, the model says it doesn't know. This is architecturally important: it means the accuracy of factual responses is bounded by the accuracy of your retrieval layer, not by what the model happens to believe.
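A context-only constraint like this is mostly a matter of how the prompt is assembled. The following is a minimal sketch under stated assumptions: the `RetrievedFact` shape, the instruction wording, and the fallback sentence are all hypothetical, and the retrieval step itself is out of scope.

```typescript
interface RetrievedFact {
  source: string; // hypothetical identifier for where the fact came from
  text: string;
}

const UNKNOWN_ANSWER = "I don't know based on the information available.";

// Returns null when there is nothing to ground on, so the caller can reply with
// UNKNOWN_ANSWER directly instead of spending a model call.
function buildGroundedPrompt(question: string, facts: RetrievedFact[]): string | null {
  if (facts.length === 0) return null;
  const context = facts
    .map((fact, i) => `[${i + 1}] (${fact.source}) ${fact.text}`)
    .join("\n");
  return [
    "Answer using ONLY the context below. Do not use outside knowledge.",
    `If the context does not contain the answer, reply exactly: "${UNKNOWN_ANSWER}"`,
    "",
    "Context:",
    context,
    "",
    `Question: ${question}`,
  ].join("\n");
}
```

The instruction alone does not guarantee compliance, which is why the empty-context case is handled in code rather than left to the model.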

For open-ended generation like quiz questions, we run a validation pass: a second LLM call with a different prompt that scores the output for accuracy and flags low-confidence outputs for human review. This adds latency and cost; for content that instructors review before publishing to learners, it's worth it.
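The validation pass can be sketched as a thin wrapper around two model calls. This is an assumption-laden outline, not the pipeline described above: the generator and scorer are injected as functions, the 0 to 1 score scale is assumed, and the 0.8 review threshold is a hypothetical default.

```typescript
interface ValidationResult {
  score: number; // assumed scale: 0 (unsupported) to 1 (fully supported)
  reason: string;
}

// Both calls are injected so the wrapper stays independent of any provider SDK.
type Generator = (sourceText: string) => Promise<string>;
type Scorer = (sourceText: string, output: string) => Promise<ValidationResult>;

interface ReviewedOutput {
  output: string;
  validation: ValidationResult;
  needsHumanReview: boolean;
}

async function generateWithValidation(
  sourceText: string,
  generate: Generator,
  score: Scorer,
  reviewThreshold = 0.8, // hypothetical default
): Promise<ReviewedOutput> {
  const output = await generate(sourceText);
  // Second call with a different prompt: judge the output against the source.
  const validation = await score(sourceText, output);
  return {
    output,
    validation,
    needsHumanReview: validation.score < reviewThreshold,
  };
}
```

Low-scoring outputs are flagged rather than discarded, which matches a workflow where a reviewer makes the final call.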

The most underused mitigation is simply limiting scope. A model asked to "answer any question about our platform" will hallucinate. A model asked to "categorize this support ticket into one of these seven categories" has far less room to, and any answer outside the seven categories is trivial to detect and reject. Narrow the task and the failure modes narrow with it.
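A narrow task also makes the output easy to check in code. The sketch below validates a classifier's raw text against a closed set; the seven category names are hypothetical examples, not the actual taxonomy.

```typescript
// Hypothetical category set, for illustration only.
const CATEGORIES = [
  "billing",
  "login",
  "course_content",
  "certificates",
  "technical",
  "refunds",
  "other",
] as const;

type Category = (typeof CATEGORIES)[number];

// Accept the model's answer only if it is exactly one of the allowed labels.
// Anything else returns null so the caller can retry or route to a human.
function parseCategory(rawModelOutput: string): Category | null {
  const normalized = rawModelOutput.trim().toLowerCase();
  const match = CATEGORIES.find((category) => category === normalized);
  return match ?? null;
}
```

With an open-ended task there is no equivalent check, which is part of why scope limits are such a cheap mitigation.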

Monitoring LLM Outputs

Standard application monitoring doesn't cover what you need to watch in LLM systems. Response latency is table stakes. The metrics that matter more: output quality score (if you have a scorer), cache hit rate, token count distribution, fallback rate (how often the LLM call fails and you fall back to a non-AI path), and prompt template version by call volume.
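Several of these metrics fall out of a simple pass over per-call events. The sketch below is one possible shape, assuming each request is recorded with an outcome and a token count; the event fields and the choice of a nearest-rank 95th percentile are illustrative assumptions.

```typescript
type CallOutcome = "cache_hit" | "llm_success" | "fallback";

interface CallEvent {
  outcome: CallOutcome;
  tokenCount: number; // 0 for cache hits and fallbacks
}

interface MetricsSummary {
  cacheHitRate: number; // cache hits / all requests
  fallbackRate: number; // fallbacks / attempted LLM calls
  p95Tokens: number;    // nearest-rank 95th percentile over successful calls
}

function summarizeCalls(events: CallEvent[]): MetricsSummary {
  const count = (outcome: CallOutcome) =>
    events.filter((event) => event.outcome === outcome).length;
  const hits = count("cache_hit");
  const successes = count("llm_success");
  const fallbacks = count("fallback");
  const attempted = successes + fallbacks;

  const tokens = events
    .filter((event) => event.outcome === "llm_success")
    .map((event) => event.tokenCount)
    .sort((a, b) => a - b);
  const p95Tokens =
    tokens.length === 0 ? 0 : tokens[Math.ceil(0.95 * tokens.length) - 1];

  return {
    cacheHitRate: events.length === 0 ? 0 : hits / events.length,
    fallbackRate: attempted === 0 ? 0 : fallbacks / attempted,
    p95Tokens,
  };
}
```

Tracking a percentile rather than a mean for token counts helps catch prompt templates that have grown in the tail without moving the average much.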

Prompt versioning deserves special emphasis. Prompt templates change. When a template changes, behavior changes — sometimes subtly, sometimes dramatically. If you're not versioning prompts and tracking which version produced which output, you cannot debug regressions.
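A lightweight way to enforce this is to treat each prompt version as immutable and return the version identifier alongside every rendered prompt. The registry below is a hypothetical sketch; the class name, the `name@version` key format, and the template shape are assumptions.

```typescript
interface PromptTemplate {
  name: string;
  version: string;
  render: (vars: Record<string, string>) => string;
}

class PromptRegistry {
  private templates = new Map<string, PromptTemplate>();

  // Versions are immutable: changing a prompt means registering a new version.
  register(template: PromptTemplate): void {
    const key = `${template.name}@${template.version}`;
    if (this.templates.has(key)) {
      throw new Error(`Prompt ${key} is already registered; bump the version instead`);
    }
    this.templates.set(key, template);
  }

  // Returns the version key with the prompt so it can be logged with the output.
  render(
    name: string,
    version: string,
    vars: Record<string, string>,
  ): { prompt: string; promptVersion: string } {
    const key = `${name}@${version}`;
    const template = this.templates.get(key);
    if (!template) throw new Error(`Unknown prompt ${key}`);
    return { prompt: template.render(vars), promptVersion: key };
  }
}
```

Because `render` hands back `promptVersion`, the call site has no excuse to log an output without the version that produced it.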

We store every LLM call — input, output, model version, prompt version, latency, token count — in a dedicated logging table with a defined retention policy. When behavior regresses, we can replay historical inputs through new prompt versions before deploying them. This capability has caught regressions that would otherwise have been invisible until users reported them.
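The replay step can be sketched as a loop over logged calls. The record fields mirror the ones listed above, but the type names, the `runCandidate` function, and the exact-match comparison are assumptions made for illustration.

```typescript
interface LlmCallLog {
  input: string;
  output: string;
  modelVersion: string;
  promptVersion: string;
  latencyMs: number;
  tokenCount: number;
}

interface ReplayDiff {
  input: string;
  oldOutput: string;
  newOutput: string;
  changed: boolean;
}

// Re-run logged inputs through a candidate prompt or model and report what differs.
// runCandidate is hypothetical: it wraps the new prompt version plus the model call.
async function replayLogs(
  logs: LlmCallLog[],
  runCandidate: (input: string) => Promise<string>,
): Promise<ReplayDiff[]> {
  const diffs: ReplayDiff[] = [];
  for (const log of logs) {
    const newOutput = await runCandidate(log.input);
    diffs.push({
      input: log.input,
      oldOutput: log.output,
      newOutput,
      changed: newOutput !== log.output,
    });
  }
  return diffs;
}
```

Exact string comparison is only a starting point. Model output is rarely byte-identical between runs, so a real harness would more likely compare quality scores or structured fields than raw text.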

What Running Multiple Surfaces Taught Us

The quiz generator, chatbots, and voice agents each have different latency tolerances, different quality thresholds, and different failure modes. The quiz generator can afford to be slower because it runs in an async instructor workflow. A voice agent cannot afford to be slow because silence in a conversation is disorienting. A chatbot failure mode is a wrong answer; a voice agent failure mode is a broken conversational exchange.

Running LLMs across different surfaces forces you to reason about each one on its own terms rather than applying a single policy across the board. That granularity — in model selection, latency strategy, hallucination mitigation, and monitoring — is where the real operational complexity lives. It's also where the real differentiation comes from.

Production LLMs are not a feature you ship and monitor passively. They require ongoing attention in ways that traditional software features don't. The teams that treat them as infrastructure to be managed rather than demos to be deployed are the ones that sustain the capability over time.
