One open window eats 4,000 tokens — that's why local AI can't run your PC

For an AI to control your computer, it first has to read a structured description of every button and text field on screen, called an accessibility tree. A single window takes about 4,000 tokens, which quickly fills the context window of a small local model. That bottleneck is why local computer-use loops keep breaking down.

'Computer-use' is a feature that lets an AI act like a human at a keyboard — clicking buttons, typing, and navigating apps. To do this, the AI reads an accessibility tree: a structured list of every interactive element currently on screen. The problem is that this list is enormous. A single window already consumes around 4,000 tokens, roughly the length of a 3,000-word article.
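The comparison between 4,000 tokens and a 3,000-word article can be checked with a little arithmetic. The sketch below assumes two common rules of thumb for English text (about 0.75 words per token and about 4 characters per token); real tokenizers vary, and the function names are made up for illustration.

```python
# Rough token arithmetic. Both ratios are assumed rules of thumb,
# not exact values for any particular tokenizer.
WORDS_PER_TOKEN = 0.75
CHARS_PER_TOKEN = 4


def tokens_to_words(tokens: int) -> int:
    """Convert a token count to an approximate English word count."""
    return round(tokens * WORDS_PER_TOKEN)


def estimate_tokens(text: str) -> int:
    """Estimate the token count of a piece of text from its length."""
    return max(1, len(text) // CHARS_PER_TOKEN)


print(tokens_to_words(4000))  # 3000
```

Calling estimate_tokens on a serialized accessibility tree gives a quick, approximate idea of how much of the context window one screen snapshot would use.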

Every AI model has a fixed context window, a cap on how many tokens it can handle at once, and small local models typically run with a much lower cap than cloud models. Once you add a second window, a few steps of back-and-forth, or any history, you blow past that cap and the loop stalls or crashes. In practice, reliable PC automation currently tends to need a large cloud model (like GPT-4o or Gemini) with a much bigger context window. The workaround for local setups is to filter or compress the accessibility tree before passing it to the model, keeping only the elements that matter for the current task.
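One way to picture the filtering workaround is the minimal sketch below. It is not the method of any specific tool: the Node class, the INTERACTIVE_ROLES set, and the keyword matching are all assumptions chosen for illustration. Real accessibility APIs expose far more fields per element.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Roles treated as actionable; this set is an assumption for the sketch.
INTERACTIVE_ROLES = {"button", "textbox", "link", "checkbox", "menuitem"}


@dataclass
class Node:
    """Hypothetical, simplified accessibility-tree node."""
    role: str
    name: str = ""
    children: list[Node] = field(default_factory=list)


def flatten_relevant(node: Node, keywords: set[str]) -> list[str]:
    """Return one short line per interactive node whose name matches the task."""
    lines = []
    label = node.name.lower()
    if node.role in INTERACTIVE_ROLES and any(k in label for k in keywords):
        lines.append(f'{node.role} "{node.name}"')
    for child in node.children:
        lines.extend(flatten_relevant(child, keywords))
    return lines


window = Node("window", "Editor", [
    Node("toolbar", "", [Node("button", "Save"), Node("button", "Print")]),
    Node("group", "", [Node("textbox", "File name"), Node("text", "Status: ready")]),
])

print(flatten_relevant(window, {"save", "file"}))
# ['button "Save"', 'textbox "File name"']
```

Only the two matching controls reach the model, instead of the whole tree. Keyword matching is deliberately crude here; a real setup might rank elements by relevance or keep only what is visible.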

Key points

  • A single window's accessibility tree uses ~4,000 tokens just to describe what's on screen
  • Small local models hit their token limit fast, causing multi-step automation loops to fail
  • Large cloud models with bigger context windows handle this much better
  • Pre-filtering the accessibility tree to only relevant elements can make local models viable
  • This bottleneck affects any AI agent scenario that involves reading the desktop UI
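
To make the token-limit point concrete, here is a rough budget calculation. It assumes every step's screen snapshot stays in the conversation history, and the context sizes (8,192 and 128,000 tokens) and the prompt and reply sizes are illustrative assumptions, not measurements of any specific model.

```python
def steps_before_overflow(context_tokens: int, tree_tokens: int = 4000,
                          prompt_tokens: int = 500, reply_tokens: int = 200) -> int:
    """Count full observe-act steps that fit if every snapshot stays in history."""
    per_step = tree_tokens + reply_tokens  # one snapshot plus the model's action
    return max(0, (context_tokens - prompt_tokens) // per_step)


print(steps_before_overflow(8_192))    # 1
print(steps_before_overflow(128_000))  # 30
```

Under these assumptions a small context window fits a single step before overflowing, while a large one fits about thirty. That gap is what tree filtering tries to close.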

Quick term guide

accessibility tree
A hierarchical list the operating system exposes of every button, text box, and control on screen, used by screen readers and AI agents to 'see' the interface.
AI models
The core brain or underlying program that powers an artificial intelligence tool.
computer-use
A feature that lets an AI control a computer directly by clicking, typing, and reading the screen, just like a human would.
context window
The maximum amount of text, measured in tokens, that an AI model can take in and use in a single request or chat.
automation
A way to make repeated work happen without doing every step by hand.
workaround
An alternative way to get something done when the normal way doesn't work.
local model
An AI model you run directly on your own computer, with no internet connection or external service needed.