Reducing noisy search results in a mixed RAG document set

A government-focused retrieval-augmented generation platform is being built from scratch on a .NET stack, with several pilot use cases already lined up. The main problem is search contamination when documents from different work areas sit inside one collection. For example, reports about two-wheeler theft, pickpocketing, and chain snatching can get mixed together, so text chunks from the wrong work area rank highly for a query.

Hybrid search, reciprocal rank fusion (RRF), and confidence scoring are already in use, but the weak results seem to originate before ranking: the system first selects the wrong or an overly broad set of nodes, so the candidate pool is already noisy. The proposed fix is to require a category when documents are added, use BAAI/bge-m3 as the embedding model, and optionally use an LLM classifier to identify the query's category.
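The idea of a required category at ingestion can be sketched as follows. This is a minimal illustration in Python rather than .NET, and it is not the platform's actual code: the Category values are taken from the example above, and the Chunk and Collection names are hypothetical. Embedding with BAAI/bge-m3 is deliberately left out so the sketch stays self-contained.

```python
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    # Hypothetical categories, borrowed from the example in the text.
    VEHICLE_THEFT = "vehicle_theft"
    PICKPOCKETING = "pickpocketing"
    CHAIN_SNATCHING = "chain_snatching"


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    text: str
    category: Category  # stored as metadata next to the text


class Collection:
    """Toy in-memory collection that refuses uncategorized documents."""

    def __init__(self):
        self._chunks = []

    def add(self, doc_id, text, category):
        # Ingestion fails loudly when the category is missing or invalid,
        # so every stored chunk can later be filtered by category.
        if not isinstance(category, Category):
            raise ValueError(
                f"document {doc_id!r} needs a Category, got {category!r}"
            )
        chunk = Chunk(doc_id, text, category)
        self._chunks.append(chunk)
        return chunk

    def in_category(self, category):
        # Candidate set for a query that was classified into `category`.
        return [c for c in self._chunks if c.category is category]
```

Rejecting the document at ingestion, rather than assigning a default category, keeps uncategorized text from entering the collection and contaminating later searches.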

Search would then be limited to that category, with reranking added afterward if needed. Accuracy matters more than speed or latency.
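The classify-then-filter flow described above might look like the sketch below. It is written in Python for brevity, not in .NET, and every name in it is an assumption: the keyword rules in CATEGORY_KEYWORDS stand in for the LLM classifier, and the token-overlap score stands in for hybrid search and reranking. Falling back to all categories when the classifier finds no match is also an assumption, not something the original proposal specifies.

```python
import re

# Hypothetical corpus: each chunk carries the category assigned at ingestion.
CHUNKS = [
    {"id": "a1", "category": "vehicle_theft",
     "text": "Motorcycle stolen from parking lot near the market"},
    {"id": "b1", "category": "pickpocketing",
     "text": "Wallet stolen from a passenger on a crowded bus near the market"},
    {"id": "c1", "category": "chain_snatching",
     "text": "Gold chain snatched by two men on a motorcycle"},
]

# Placeholder keyword rules standing in for an LLM query classifier.
CATEGORY_KEYWORDS = {
    "vehicle_theft": {"motorcycle", "scooter", "bike", "vehicle"},
    "pickpocketing": {"wallet", "pocket", "pickpocket"},
    "chain_snatching": {"chain", "snatched", "snatching"},
}


def tokenize(text):
    return set(re.findall(r"[a-z]+", text.lower()))


def classify_query(query):
    """Return the best-matching category, or None when nothing matches."""
    tokens = tokenize(query)
    hits = {cat: len(tokens & kws) for cat, kws in CATEGORY_KEYWORDS.items()}
    best = max(hits, key=hits.get)
    return best if hits[best] > 0 else None


def search(query, top_k=3):
    category = classify_query(query)
    # Step 1: restrict candidates to the predicted category
    # (assumed fallback: search everything if the classifier is unsure).
    candidates = [
        c for c in CHUNKS if category is None or c["category"] == category
    ]
    # Step 2: rank only the filtered candidates; token overlap is a toy
    # stand-in for hybrid search followed by a reranker.
    q = tokenize(query)
    ranked = sorted(
        candidates, key=lambda c: len(q & tokenize(c["text"])), reverse=True
    )
    return category, ranked[:top_k]


category, results = search("motorcycle stolen near market")
print(category, [r["id"] for r in results])
```

In this toy corpus the pickpocketing chunk shares three query words ("stolen", "near", "market") and would rank second without the filter. With the filter it never reaches the ranking step, which is the contamination the proposal is trying to prevent.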

Key points

  • Mixed documents in one collection can create noisy search results.
  • The likely failure point is node selection before ranking happens.
  • Required category selection during ingestion is being considered.
  • A query classifier could restrict search to the right category.
  • Accuracy is the priority, even if latency increases.

Quick term guide

retrieval-augmented generation
A method where an AI first retrieves outside information and then uses it to answer.
retrieval
The step where a system finds the most relevant text for a question.
collection
A stored group of documents that the search system can look through.
hybrid search
A search method that combines keyword matching with meaning-based matching.
classifier
A system that sorts an input or task into a type so the tool can decide what to do.
reranking
A second pass that re-sorts search results by relevance so only the best ones are kept.
AI agents
AI tools that can carry out steps toward a goal, not just answer once.
categories
Groups used to organize similar items so they are easier to find.