12 small AI models tested on reasoning — results compared
Someone ran 15 reasoning questions across 12 small AI models ranging from 1B to 35B parameters. The lineup includes popular open-source models like Qwen, Gemma, Phi, and Llama. It's a practical reference for anyone choosing a model to run locally on their own machine.
The test covers local LLMs: models you can run on your own computer without a cloud service. Sizes range from 1B (1 billion parameters, very lightweight) to 35B (35 billion parameters, which needs a powerful machine). The 15 questions focus on reasoning tasks such as logic, math, and common sense.
The compared models include Qwen, Ministral, Nemotron, Gemma, Phi, Llama, LFM, and GPT-OSS variants. For solo developers or makers who want to run AI locally, this kind of benchmark helps find the sweet spot between performance and the hardware requirements that come with larger model sizes.
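To make the link between model size and hardware concrete, here is a rough back-of-the-envelope sketch. It assumes the weights dominate memory use and that each parameter is stored at 16, 8, or 4 bits. Those precision levels are an assumption for illustration, not something the benchmark reports, and the function name `weight_memory_gb` is made up for this example. Real usage will be higher once context and runtime overhead are included.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: int = 16) -> float:
    """Approximate memory needed just to hold the weights, in GB (10^9 bytes).

    Assumption: every parameter is stored at the same precision.
    Ignores context (KV cache) and runtime overhead.
    """
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# The two ends of the size range mentioned in the article.
for size in (1, 35):
    for bits in (16, 8, 4):
        print(f"{size}B at {bits}-bit: ~{weight_memory_gb(size, bits):.1f} GB")
```

Under these assumptions a 1B model needs roughly 2 GB for its weights at 16-bit precision, while a 35B model needs roughly 70 GB at 16-bit and about 17.5 GB at 4-bit, which is why the larger end of the range calls for a more powerful machine.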
Key points
- 12 models tested head-to-head on the same 15 reasoning questions
- Size range: 1B (smallest, fast) to 35B (largest, needs more hardware)
- Models covered: Qwen, Gemma, Phi, Llama, Ministral, Nemotron, LFM, GPT-OSS
- Useful for picking a local AI model that reasons well without requiring a high-end machine
- Helps identify which models punch above their weight in reasoning ability
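As a minimal sketch of how such a comparison could be tabulated, the snippet below ranks models by correct answers out of 15 and flags any model that matches or beats a strictly larger one. The model names and scores are placeholders invented for illustration, not the benchmark's results, and this definition of "punching above its weight" is an assumption made for the example.

```python
from dataclasses import dataclass

TOTAL_QUESTIONS = 15  # from the article: every model answers the same 15 questions


@dataclass
class Result:
    name: str               # placeholder label, not a real model
    params_billions: float  # model size in billions of parameters
    correct: int            # made-up count of correct answers out of 15

    @property
    def accuracy(self) -> float:
        return self.correct / TOTAL_QUESTIONS


def rank_by_accuracy(results):
    """Best score first; ties go to the smaller model."""
    return sorted(results, key=lambda r: (-r.correct, r.params_billions))


def punches_above_weight(results):
    """Models that match or beat at least one strictly larger model."""
    return [
        r for r in results
        if any(o.params_billions > r.params_billions and r.correct >= o.correct
               for o in results)
    ]


# Illustrative placeholder data only; these are NOT the benchmark's results.
results = [
    Result("model-a", 1, 6),
    Result("model-b", 8, 12),
    Result("model-c", 35, 11),
]

for r in rank_by_accuracy(results):
    print(f"{r.name} ({r.params_billions}B): {r.correct}/{TOTAL_QUESTIONS} ({r.accuracy:.0%})")

print("Punching above weight:", [r.name for r in punches_above_weight(results)])
```

With the placeholder data, the mid-sized entry is flagged because it outscores the largest one. Plugging in the real per-model scores would show which models offer the most reasoning ability for the hardware they need.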
Quick term guide
- AI models: The underlying programs that power artificial intelligence tools.
- parameters: The internal numbers an AI model learns during training; more parameters generally mean a more capable model.
- open-source: Software whose code is shared publicly so others can inspect, use, or change it.
- open-source models: AI models whose code and weights are freely available, so anyone can download and run them on their own computer or server.
- local LLMs: AI language models you run on your own computer instead of through a company's online service.
- developers: People who build software, apps, or websites.
- benchmark: A test used to compare speed, quality, or cost.