An edge semantic cache idea for cutting LLM cost
High-volume LLM services can become slow and expensive when repeated requests are always sent back to the model. The proposed design replaces a heavy central gateway with a lightweight cache that runs near the user at the edge.
It would be written in Rust and compiled to WebAssembly so it can run on edge platforms such as Fastly Compute. When a prompt arrives, the edge module reads the text first and turns it into a vector with a small embedding model such as bge-small-en-v1.5.
It then runs a fast similarity check to see whether the new request is close to a previous one. The goal is to reduce cost and network delay for repetitive work such as customer support or extraction.
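The similarity check could look roughly like the sketch below. It is a minimal illustration, not the proposed implementation: the names SemanticCache, Entry, and cosine_similarity are hypothetical, the linear scan stands in for whatever index the design would actually use, and the embeddings are assumed to be produced elsewhere (for example by bge-small-en-v1.5).

```rust
/// One cached item: the prompt's embedding and the response stored for it.
/// (Hypothetical type, for illustration only.)
struct Entry {
    embedding: Vec<f32>,
    response: String,
}

/// A tiny in-memory semantic cache that scans every entry linearly.
/// A real edge cache would likely need a faster index and an eviction policy.
struct SemanticCache {
    entries: Vec<Entry>,
    /// Minimum cosine similarity that counts as "close enough" (assumed value).
    threshold: f32,
}

/// Cosine similarity of two vectors; returns 0.0 for mismatched or zero vectors.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    if a.len() != b.len() || a.is_empty() {
        return 0.0;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return 0.0;
    }
    dot / (norm_a * norm_b)
}

impl SemanticCache {
    fn new(threshold: f32) -> Self {
        SemanticCache { entries: Vec::new(), threshold }
    }

    /// Store a response under the embedding of the prompt that produced it.
    fn insert(&mut self, embedding: Vec<f32>, response: String) {
        self.entries.push(Entry { embedding, response });
    }

    /// Return the response of the most similar entry at or above the threshold.
    fn lookup(&self, query: &[f32]) -> Option<&str> {
        self.entries
            .iter()
            .map(|entry| (cosine_similarity(query, &entry.embedding), entry))
            .filter(|(score, _)| *score >= self.threshold)
            .max_by(|a, b| a.0.total_cmp(&b.0))
            .map(|(_, entry)| entry.response.as_str())
    }
}
```

The threshold is the main tuning knob in a sketch like this: set it too low and users get answers to questions they did not ask, set it too high and few requests ever hit the cache.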
Key points
- The idea targets repeated LLM requests that create high cost and latency.
- Python proxies and central caches may add too much delay for real-time agent steps.
- Rust and WebAssembly are proposed for a lightweight cache at the CDN edge.
- Incoming prompts would be converted into a vector and compared with earlier requests.
- The strongest fit is repetitive work such as customer support or extraction.
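
Taken together, the points above describe a read-through path: embed the prompt, look for a close earlier request, and only call the model on a miss. The sketch below shows that flow under stated assumptions. EdgeCache, Served, and handle are hypothetical names, the embedding function and the upstream model call are passed in as closures because the source does not specify either interface, and vectors are normalized so that a dot product equals cosine similarity.

```rust
/// How a prompt was answered: from the cache or by the upstream model.
/// (Hypothetical type, for illustration only.)
#[derive(Debug, PartialEq)]
enum Served {
    Hit(String),
    Miss(String),
}

/// Minimal read-through cache holding (unit-length embedding, response) pairs.
struct EdgeCache {
    entries: Vec<(Vec<f32>, String)>,
    threshold: f32,
}

/// Scale a vector to unit length; a zero vector is returned unchanged.
fn normalize(v: &[f32]) -> Vec<f32> {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm == 0.0 {
        return v.to_vec();
    }
    v.iter().map(|x| x / norm).collect()
}

/// Dot product; equals cosine similarity when both inputs are unit length.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

impl EdgeCache {
    fn new(threshold: f32) -> Self {
        EdgeCache { entries: Vec::new(), threshold }
    }

    /// Serve from the cache when a close match exists; otherwise call the
    /// model once and remember its answer for later requests.
    fn handle<E, U>(&mut self, prompt: &str, embed: E, call_model: U) -> Served
    where
        E: Fn(&str) -> Vec<f32>,   // stand-in for the small embedding model
        U: FnOnce(&str) -> String, // stand-in for the upstream LLM request
    {
        let query = normalize(&embed(prompt));
        let best = self
            .entries
            .iter()
            .map(|(vector, response)| (dot(&query, vector), response))
            .max_by(|a, b| a.0.total_cmp(&b.0));
        if let Some((score, response)) = best {
            if score >= self.threshold {
                return Served::Hit(response.clone());
            }
        }
        let response = call_model(prompt);
        self.entries.push((query, response.clone()));
        Served::Miss(response)
    }
}
```

Passing the model call as a closure keeps the expensive step lazy: on a hit it is never invoked, which is where the cost and latency savings would come from.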