Open Source · Importance: Medium

How to debug an LLM app that gets worse over time

r/LLMDevs · Jun 11, 2026

A Reddit user asks how teams handle LLM products that start giving wrong answers after launch. Examples include a query that starts failing weeks later, a user introducing a new internal term, and retrieval pulling an outdated document. The post asks whether teams first group failures by cause or add them straight to an eval set for testing.

Key points

The post says LLM answers that were correct before can later become wrong.
It lists possible causes like new internal terms, old documents, and bad retrieval.
It points out that similar-looking failures may need different fixes.
It asks whether teams group failures first or add them to an eval set.
It also asks how teams stop the same fixed problem from coming back.
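
To make the two options in the post concrete, here is a minimal Python sketch that does both: it groups logged failures by a hand-assigned cause, and it reruns the same cases as a regression eval set so a fixed problem that comes back is caught. The post does not describe any implementation, so everything here is an assumption for illustration: the `Failure` record, the cause labels, the sample queries, the substring check, and `fake_answer`, which stands in for a real LLM app.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Failure:
    query: str     # the user question that got a wrong answer
    expected: str  # text a correct answer must contain
    cause: str     # hand-assigned label (hypothetical), e.g. "stale_doc"


def group_by_cause(failures: list[Failure]) -> dict[str, list[Failure]]:
    """Bucket failures so each cause can get its own fix."""
    groups: dict[str, list[Failure]] = defaultdict(list)
    for failure in failures:
        groups[failure.cause].append(failure)
    return dict(groups)


def run_eval(cases: list[Failure], answer_fn: Callable[[str], str]) -> list[Failure]:
    """Rerun every case and return the ones that still fail."""
    return [
        case for case in cases
        if case.expected.lower() not in answer_fn(case.query).lower()
    ]


# Hypothetical logged failures, made up for this example.
failures = [
    Failure("What is the refund window?", "30 days", "stale_doc"),
    Failure("What does Project Falcon cover?", "billing migration", "new_term"),
    Failure("How long do refunds take?", "5 business days", "stale_doc"),
]


def fake_answer(query: str) -> str:
    """Stand-in for the real app: only the first case has been fixed."""
    canned = {"What is the refund window?": "Refunds are accepted within 30 days."}
    return canned.get(query, "I don't know.")


for cause, items in group_by_cause(failures).items():
    print(cause, len(items))

for case in run_eval(failures, fake_answer):
    print("still failing:", case.query)
```

Keeping the cause label on each eval case means the two approaches do not have to compete: the same list supports triage by cause and a rerun after every change. A substring match is a crude correctness check and would likely need replacing in a real eval.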

Quick term guide

query: A question or request sent to a system to get specific information.
retrieval: The step where a system finds the most relevant text for a question.
eval set: A set of test questions used to check whether the model still works well.
AI agent: An AI program that can carry out steps toward a goal, such as inspecting information, suggesting what to do next, or making changes on your instructions, rather than just answering once.
system: Here, system means a repeatable way to use AI, such as steps, rules, or checks.
tokens: Small pieces of text that AI systems count when reading or writing.
