How to debug an LLM app that gets worse over time

A Reddit user asks how teams handle LLM products that start giving wrong answers after launch. They give examples such as a query that used to work failing weeks later, a user introducing a new internal term, or retrieval pulling an outdated document. The post asks whether people first group failures by cause or add them straight to an eval set for testing.

Key points

  • The post says LLM answers can become wrong after they worked before.
  • It lists possible causes like new internal terms, old documents, and bad retrieval.
  • It points out that similar-looking failures may need different fixes.
  • It asks whether teams group failures first or add them to an eval set.
  • It also asks how teams stop the same fixed problem from coming back.
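
The two options in the post, grouping failures by cause and adding them to an eval set, can also be combined. The following is a minimal sketch of that idea, not a method described in the post: the `Failure` record, the cause labels, the sample queries, and the substring check are all hypothetical, and `answer_fn` stands in for whatever function calls the real LLM app.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Failure:
    """One logged failure (hypothetical schema, for illustration only)."""
    query: str         # the user question that got a wrong answer
    cause: str         # label assigned during triage, e.g. "stale_document"
    must_contain: str  # phrase a correct answer is expected to include


def group_by_cause(failures: list[Failure]) -> dict[str, list[Failure]]:
    """Bucket failures by triage label so each bucket can get its own fix."""
    groups: dict[str, list[Failure]] = defaultdict(list)
    for failure in failures:
        groups[failure.cause].append(failure)
    return dict(groups)


def run_eval(failures: list[Failure],
             answer_fn: Callable[[str], str]) -> list[Failure]:
    """Re-run every logged failure; return the ones that still fail.

    The check is a crude case-insensitive substring match, which is an
    assumption of this sketch rather than a recommended grading method.
    """
    return [
        f for f in failures
        if f.must_contain.lower() not in answer_fn(f.query).lower()
    ]


# Made-up sample data mirroring the causes listed above.
EXAMPLE_FAILURES = [
    Failure("What is the refund window?", "stale_document", "30 days"),
    Failure("Who owns Project Falcon?", "new_internal_term", "platform team"),
    Failure("How long do refunds take?", "stale_document", "5 business days"),
]
```

Running `run_eval` on the full list after every change is one way to catch a fixed problem coming back, and `group_by_cause` keeps similar-looking failures with different causes in separate buckets.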

Quick term guide

query
A question or request that a user sends to a system to get specific information.
retrieval
The step where a system finds the most relevant text for a question.
eval set
A set of test questions used to check whether the model still works well.
AI agent
An AI tool that can carry out steps toward a goal rather than answer once: it can inspect information, suggest what to do next, and follow your instructions to make changes for you.
system
Here, system means a repeatable way to use AI, such as steps, rules, or checks.
tokens
Tokens are small pieces of text that AI systems count when reading or writing.