One open window eats 4,000 tokens — that's why local AI can't run your PC
For an AI to control your computer, it first has to read a list of every button and text field on screen — called an accessibility tree. Just one window fills about 4,000 tokens, which quickly maxes out small local AI models. That bottleneck explains why local computer-use loops keep breaking down.
'Computer-use' is a feature that lets an AI act like a human at a keyboard — clicking buttons, typing, and navigating apps. To do this, the AI reads an accessibility tree: a structured list of every interactive element currently on screen. The problem is that this list is enormous. A single window already consumes around 4,000 tokens, roughly the length of a 3,000-word article.
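To give a feel for why the list grows so quickly, here is a minimal sketch that serializes a toy accessibility tree and estimates its token cost. The tree structure, the role names, and the 4-characters-per-token rule of thumb are all assumptions for illustration. Real trees come from operating-system accessibility APIs, and real tokenizers will give different counts.

```python
import json

# Hypothetical, heavily simplified accessibility tree for one small dialog.
# A real window exposes far more nodes and attributes per node.
TOY_TREE = {
    "role": "window",
    "name": "Save As",
    "children": [
        {"role": "text_field", "name": "File name", "value": "report.txt"},
        {"role": "button", "name": "Save"},
        {"role": "button", "name": "Cancel"},
    ],
}

CHARS_PER_TOKEN = 4  # rough rule of thumb, not a real tokenizer (assumption)


def estimate_tokens(text: str) -> int:
    """Crude estimate: character count divided by CHARS_PER_TOKEN, rounded up."""
    return -(-len(text) // CHARS_PER_TOKEN)


def count_nodes(node: dict) -> int:
    """Count this node plus all of its descendants."""
    return 1 + sum(count_nodes(child) for child in node.get("children", []))


serialized = json.dumps(TOY_TREE)
print(count_nodes(TOY_TREE), "nodes,", estimate_tokens(serialized), "estimated tokens")
```

Even this four-node dialog costs a few dozen estimated tokens, so a full application window with hundreds of controls plausibly lands in the thousands.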
Every AI model has a fixed context window, a cap on how many tokens it can handle at once, and small local models typically run with a much lower cap than cloud models. Once you add a second window, a few steps of back-and-forth, or any history, you blow past that cap and the loop stalls or crashes. In practice, reliable PC automation currently tends to require a large cloud model (like GPT-4o or Gemini) with a much bigger context window. The workaround for local setups is to filter or compress the accessibility tree before passing it to the model, keeping only the elements that actually matter for the current task.
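The budget pressure can be made concrete with some back-of-the-envelope arithmetic. The sketch below assumes every step keeps a full 4,000-token snapshot in history (the figure from this article), plus a 500-token system prompt and 200 tokens of model output per step. Those last two numbers and the three context sizes are illustrative assumptions, not measurements.

```python
def steps_that_fit(
    context_limit: int,
    tokens_per_snapshot: int = 4000,   # per-window figure cited in the article
    system_prompt_tokens: int = 500,   # assumed
    reply_tokens_per_step: int = 200,  # assumed
) -> int:
    """How many observe-then-act steps fit if every snapshot stays in history."""
    available = context_limit - system_prompt_tokens
    per_step = tokens_per_snapshot + reply_tokens_per_step
    return max(available // per_step, 0)


# Illustrative context sizes, not tied to any specific model.
for limit in (8_192, 32_768, 128_000):
    print(f"{limit:>7} token window -> {steps_that_fit(limit)} steps")
# ->    8192 token window -> 1 steps
# ->   32768 token window -> 7 steps
# ->  128000 token window -> 30 steps
```

Under these assumptions, an 8K window holds a single step, which is consistent with multi-step loops failing early on small models.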
Key points
- A single window's accessibility tree uses ~4,000 tokens just to describe what's on screen
- Small local models hit their token limit fast, causing multi-step automation loops to fail
- Large cloud models with bigger context windows handle this much better
- Pre-filtering the accessibility tree to only relevant elements can make local models viable
- This bottleneck affects any AI agent scenario that involves reading the desktop UI
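The pre-filtering idea from the key points could look something like the sketch below. It is one possible approach, not an established method: the `Node` class, the set of interactive roles, and the keyword-overlap scoring are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical, simplified accessibility-tree node."""
    role: str
    name: str = ""
    children: list["Node"] = field(default_factory=list)


# Assumed set of roles worth showing to the model; real APIs use other names.
INTERACTIVE_ROLES = {"button", "text_field", "checkbox", "link", "menu_item"}


def flatten(node: Node):
    """Yield the node and all descendants in document order."""
    yield node
    for child in node.children:
        yield from flatten(child)


def filter_tree(root: Node, task_keywords: set[str], max_elements: int = 20) -> list[Node]:
    """Keep named interactive elements, most task-relevant first, up to a cap."""
    candidates = [n for n in flatten(root) if n.role in INTERACTIVE_ROLES and n.name]

    def score(n: Node) -> int:
        # Number of words in the element name that also appear in the task.
        return len(set(n.name.lower().split()) & task_keywords)

    # sorted() is stable, so equally scored elements keep document order.
    return sorted(candidates, key=score, reverse=True)[:max_elements]


root = Node("window", "Save As", [
    Node("pane", "", [Node("text_field", "File name"), Node("static_text", "Hint")]),
    Node("button", "Save"),
    Node("button", "Cancel"),
])
kept = filter_tree(root, task_keywords={"save"}, max_elements=2)
print([n.name for n in kept])  # -> ['Save', 'File name']
```

The cap on `max_elements` is what bounds the token cost. The trade-off is that an overly aggressive filter can hide the one control the task actually needs.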
Quick term guide
- accessibility tree: A structured list the operating system generates of every button, text box, and control on screen, used by screen readers (and AI) to 'see' the interface.
- AI model: The core brain or underlying program that powers an artificial intelligence tool.
- computer-use: A feature that lets an AI control a computer directly by clicking, typing, and reading the screen, just like a human would.
- context window: The maximum amount of text, measured in tokens, that an AI model can process in a single request, including any earlier conversation it needs to remember.
- automation: A way to make repeated work happen without doing every step by hand.
- workaround: An alternative way to get something done when the normal way doesn't work.
- local model: An AI model you run directly on your own computer, with no internet connection or external service needed.