An edge semantic cache idea for cutting LLM cost

High-volume LLM services can become slow and expensive when repeated requests are always sent back to the model. The proposed design replaces a heavy central gateway with a lightweight cache that runs near the user at the edge.

It would be written in Rust and compiled to WebAssembly so it can run on edge platforms such as Fastly Compute. When a prompt arrives, the edge module reads the text first and turns it into a vector with a small embedding model such as bge-small-en-v1.5.

It then compares that vector with stored ones to quickly check whether the new request is close to a previous one. The goal is to reduce cost and network delay for repetitive work such as customer support or extraction.
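The lookup step can be illustrated with a minimal sketch. It is not the proposed implementation: the summary does not say how closeness is measured, so cosine similarity, a linear scan, and the 0.9 threshold used in the usage note are all assumptions made here, and the names `SemanticCache`, `Entry`, and `cosine_similarity` are hypothetical. Embeddings are taken as plain `Vec<f32>` values produced elsewhere, for example by a model such as bge-small-en-v1.5.

```rust
/// One cached entry: the embedding of a past prompt and the response it produced.
struct Entry {
    embedding: Vec<f32>,
    response: String,
}

/// Toy in-memory semantic cache (hypothetical: linear scan, no eviction or TTL).
struct SemanticCache {
    entries: Vec<Entry>,
    /// Minimum cosine similarity for a lookup to count as a hit (assumed value).
    threshold: f32,
}

/// Cosine similarity of two vectors.
/// Returns 0.0 when the lengths differ or either vector has zero norm.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    if a.len() != b.len() {
        return 0.0;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return 0.0;
    }
    dot / (norm_a * norm_b)
}

impl SemanticCache {
    fn new(threshold: f32) -> Self {
        SemanticCache { entries: Vec::new(), threshold }
    }

    /// Stores a response under the embedding of the prompt that produced it.
    fn insert(&mut self, embedding: Vec<f32>, response: String) {
        self.entries.push(Entry { embedding, response });
    }

    /// Returns the response of the most similar stored entry,
    /// or None if no entry reaches the threshold (a cache miss).
    fn lookup(&self, query: &[f32]) -> Option<&str> {
        let mut best: Option<(f32, &Entry)> = None;
        for entry in &self.entries {
            let score = cosine_similarity(query, &entry.embedding);
            if score >= self.threshold && best.map_or(true, |(s, _)| score > s) {
                best = Some((score, entry));
            }
        }
        best.map(|(_, e)| e.response.as_str())
    }
}
```

In use, a caller would build the cache with something like `SemanticCache::new(0.9)`, call `lookup` with the embedding of the incoming prompt, and forward the request to the model only on `None`, storing the new response with `insert`. The threshold trades hit rate against the risk of returning an answer to a prompt that only looks similar, so it would need tuning per workload.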

Key points

  • The idea targets repeated LLM requests that create high cost and latency.
  • Python proxies and central caches may add too much delay for real-time agent steps.
  • Rust and WebAssembly are proposed for a lightweight cache at the CDN edge.
  • Incoming prompts would be converted into a vector and compared with those of earlier requests.
  • The strongest fit is repetitive work such as customer support or extraction.