Open Source · Importance: Medium

Parallel LLM calls can get slower when rate limits are tight

r/LLMDevs · Jun 12, 2026 · 17h ago

OxyJen is an open-source Java framework for coordinating AI work. Its MapNode feature applies the same task to many items at once, with controls for how many jobs can run together, timeouts, and separate error handling for each item. The problem appears when each task calls a large language model such as Gemini.
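To make the idea concrete, here is a minimal sketch of a bounded parallel map in plain Java (16+ for the record). It is not OxyJen's MapNode API: the class BoundedMap, the method mapAll, and the record ItemResult are hypothetical names chosen for illustration. It shows the three controls described above: a cap on concurrent jobs, a timeout, and a separate result or error for each item.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

/** Hypothetical sketch of a bounded parallel map; not OxyJen's MapNode API. */
public final class BoundedMap {

    /** Outcome for one item: either a value or the error that item hit. */
    public record ItemResult<R>(R value, Exception error) {
        public boolean ok() { return error == null; }
    }

    public static <T, R> List<ItemResult<R>> mapAll(
            List<T> items, Function<T, R> task, int maxConcurrent, long timeoutMillis) {
        // A fixed-size pool caps how many tasks run at the same time.
        ExecutorService pool = Executors.newFixedThreadPool(maxConcurrent);
        try {
            List<Future<R>> futures = new ArrayList<>();
            for (T item : items) {
                Callable<R> call = () -> task.apply(item);
                futures.add(pool.submit(call));
            }
            List<ItemResult<R>> results = new ArrayList<>();
            for (Future<R> f : futures) {
                try {
                    // Simplification: this bounds how long we wait on each result
                    // in turn, not each task's own running time.
                    results.add(new ItemResult<>(f.get(timeoutMillis, TimeUnit.MILLISECONDS), null));
                } catch (Exception e) {
                    // One failed or slow item does not abort the rest of the batch.
                    f.cancel(true);
                    results.add(new ItemResult<>(null, e));
                }
            }
            return results;
        } finally {
            pool.shutdownNow();
        }
    }
}
```

Results come back in input order, so a caller can pair each item with its own success or failure.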

Gemini’s free tier allows 15 calls per minute, so a burst of 3 simultaneous requests can cause some of them to fail with an HTTP 429 (Too Many Requests) error. LLMChain retries failed calls with exponential backoff, but waiting 30 seconds and then 60 seconds can make the whole batch slower than simply spacing the calls out. One possible fix is RateLimitedChatModel, which staggers call start times at intervals derived from the allowed calls per minute.
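The trade-off can be checked with simple arithmetic. The sketch below assumes the limit is applied as evenly spaced starts (60 seconds divided by 15 calls); how Gemini actually enforces the quota is not stated in the post. BatchTiming and its methods are hypothetical helper names, not part of OxyJen.

```java
/** Hypothetical helper for comparing paced starts with backoff waits. */
public final class BatchTiming {

    /** Gap between call starts for a per-minute limit, in milliseconds. */
    public static long intervalMillis(int callsPerMinute) {
        return 60_000L / callsPerMinute;
    }

    /** Start offset of the call at a 0-based index when starts are evenly paced. */
    public static long pacedStartMillis(int index, int callsPerMinute) {
        return index * intervalMillis(callsPerMinute);
    }

    /** Total wait after the given number of failed attempts: base, 2*base, 4*base, ... */
    public static long totalBackoffMillis(long baseMillis, int failedAttempts) {
        long total = 0;
        long delay = baseMillis;
        for (int i = 0; i < failedAttempts; i++) {
            total += delay;
            delay *= 2; // each retry waits twice as long as the previous one
        }
        return total;
    }
}
```

Under these assumptions, intervalMillis(15) is 4000, so the third of three paced calls starts 8 seconds in. A call that fails twice under a 30-second base backoff waits 30 + 60 = 90 seconds before its third attempt.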

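That can reduce retry storms and get close to the best possible total time. But if starts are spaced 4 seconds apart (60 seconds / 15 calls) and each model call takes 5 seconds, consecutive calls overlap by only about 1 second, so little real parallelism remains. The throttle code is being adjusted to use CAS (compare-and-swap), a lock-free atomic update.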
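A CAS-based throttle could look roughly like the sketch below. This is an assumption about the general technique, not the actual RateLimitedChatModel code: CasPacer and reserve are hypothetical names, and the caller is expected to wait until the returned time before sending its request.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical lock-free pacer; a sketch, not OxyJen's throttle code. */
public final class CasPacer {
    private final long intervalMillis;
    // Earliest time at which the next call may start.
    private final AtomicLong nextFree = new AtomicLong(Long.MIN_VALUE);

    public CasPacer(int callsPerMinute) {
        this.intervalMillis = 60_000L / callsPerMinute;
    }

    /**
     * Reserves a start slot and returns the time at which the caller may start.
     * The current time is passed in so the logic can be tested without sleeping.
     */
    public long reserve(long nowMillis) {
        while (true) {
            long prev = nextFree.get();
            // Start now if the pacer is idle, otherwise queue behind the last slot.
            long slot = Math.max(prev, nowMillis);
            // The CAS succeeds only if no other thread claimed a slot in between;
            // on failure, loop and recompute from the fresh value.
            if (nextFree.compareAndSet(prev, slot + intervalMillis)) {
                return slot;
            }
        }
    }
}
```

With a limit of 15 per minute, three threads calling reserve(0) at once receive the slots 0, 4000, and 8000 in some order, and no lock is held while they compete.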

Key points

OxyJen’s MapNode runs the same AI task across many items with controlled concurrency.
Gemini’s free tier allows 15 calls per minute, so 3 simultaneous LLM calls can trigger a 429 error.
Exponential backoff can prevent repeated failures, but 30-second and 60-second waits may make the batch much slower.
Spacing call start times by the allowed rate can reduce failed calls and retry delays.
For AI agents, rate limiting is part of cost and latency control, not just error handling.

Quick term guide

error handling: Code that decides what to do when something goes wrong, so the app doesn't just crash silently.
large language model: The type of AI behind ChatGPT or Claude — trained on huge amounts of text to read, write, and code.
exponential backoff: A retry strategy where each failed attempt waits longer than the last, typically doubling the delay, to reduce pressure on the server.
RateLimitedChatModel: A wrapper that slows model calls so they stay under a service’s allowed limit.
model calls: Requests sent to an AI model to get an answer or action.
model provider: The external AI service (such as Google for Gemini, OpenAI, or Anthropic) that an application calls to run the model.
concurrency: How many requests are being handled at the same time.
rate limiting: A cap on how many requests a client may send to a service in a given time window; requests over the cap are typically rejected with a 429 error.
