Ollama vs bare harness vs full agent stack (benchmark teaser)
Every team building with LLMs hits the same crossroads: how do you run inference? The options seem simple — run Ollama locally, use direct API calls, or wrap everything in an agent framework. But underneath, the performance differences are dramatic. We've been benchmarking these three approaches for months, and the numbers are telling.
The Problem We're Solving
When you're building LLM-powered features, performance isn't just about response speed — it's about cost, reliability, and scale. A chat interface needs fast responses. A batch processing job needs throughput. A production system needs predictable latency.
The challenge is that what works in development often fails at scale. That elegant agent framework that felt so productive in testing might add unacceptable overhead in production. That direct API approach that felt bare-bones might actually be the right call.
We've benchmarked three ways to run inference:
- Ollama — Local model inference with the Ollama runtime
- Bare harness — Direct API calls without framework overhead
- Full agent stack — Agent frameworks like LangChain, AutoGen, or similar
This is what we found.
Approach One: Ollama
Ollama runs models locally — Llama, Mistral, Qwen, and others. You download a model weight file, run the Ollama service, and query it over HTTP.
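As a rough illustration of the "query it over HTTP" step, the sketch below builds a request for Ollama's `/api/generate` endpoint using only the Python standard library. The address (Ollama's default port, 11434), the model tag `llama3`, and the helper names `build_generate_request` and `generate` are assumptions for illustration, not part of our benchmark harness.

```python
import json
import urllib.request

# Assumed address of a locally running Ollama service (default port).
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the request and return the generated text (needs a running server)."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    return body["response"]
```

With the service running and the model already pulled, `generate("llama3", "Why is the sky blue?")` would return the model's reply as a string.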
Ollama trades convenience for control. You own the hardware, you own the model, and you manage the runtime. The upside is predictable cost: there is no per-token billing. The downside is the hardware investment and the ongoing model management.
Ollama is ideal when you've got GPU capacity and want predictable costs at scale. But it's a commitment — you need to provision hardware, manage updates, and handle everything yourself.
Approach Two: Bare Harness
The bare harness is raw API calls. OpenAI's API, Anthropic's API, any model's API. You send a request, you get a response. Nothing sits in between.
This approach had the lowest raw latency in our tests because there's no middleware overhead. Its costs are also the most transparent: you pay per token, nothing more.
The downside is that you're responsible for everything: retries, fallbacks, context management, output parsing. The code gets verbose quickly.
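To give a sense of what "responsible for everything" looks like in practice, here is a minimal sketch of retries with exponential backoff plus an ordered fallback list. The function name `call_with_retries`, its parameters, and the idea of passing providers as plain callables are all hypothetical; a real harness would catch only transient errors (timeouts, rate limits) rather than every exception.

```python
import random
import time
from typing import Callable, Optional, Sequence


def call_with_retries(
    providers: Sequence[Callable[[str], str]],
    prompt: str,
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Try each provider in order, retrying failures with exponential backoff."""
    last_error: Optional[Exception] = None
    for provider in providers:
        for attempt in range(max_attempts):
            try:
                return provider(prompt)
            except Exception as error:  # sketch only: narrow this in real code
                last_error = error
                if attempt < max_attempts - 1:
                    # Backoff doubles each attempt, with a little random jitter.
                    sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
    raise RuntimeError("all providers failed") from last_error
```

Each entry in `providers` would wrap one API client call; the first is the primary and the rest are fallbacks. Injecting `sleep` keeps the function easy to test without real delays.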
Bare harness is ideal for production systems where every millisecond matters and where you're willing to manage the complexity yourself.
Approach Three: Full Agent Stack
Full agent frameworks — LangChain, AutoGen, CrewAI, and others — wrap the API calls in abstractions. Chains, agents, tools, memory. They make development faster and more productive.
The cost is overhead. Every abstraction layer adds latency. Every tool call carries framework metadata. Every memory operation involves framework logic.
In our benchmark, the full agent stack was roughly 80x slower than the bare approach on latency. The overhead wasn't a fixed cost paid once. It multiplied, because every chained model call pays the framework's overhead again.
The Benchmark Results
We ran identical prompts through each approach: same model, same input, same parameters. For each run we measured:
- Time to first token (latency)
- Time to complete response (total)
- Tokens per second (throughput)
- Cost per 1,000 tokens (direct cost only)
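One way the first three metrics could be captured from a streaming response is sketched below. This is an illustrative harness, not the one used for the report: `StreamMetrics` and `measure_stream` are hypothetical names, and throughput is assumed to mean tokens divided by total wall-clock time, including the wait for the first token.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class StreamMetrics:
    time_to_first_token_s: float
    total_time_s: float
    tokens: int
    tokens_per_second: float


def measure_stream(
    tokens: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> StreamMetrics:
    """Consume a token stream and record latency and throughput."""
    start = clock()
    first_token_at = None
    count = 0
    for _ in tokens:
        if first_token_at is None:
            first_token_at = clock()  # timestamp of the first token only
        count += 1
    total = clock() - start
    # An empty stream has no first token; report the total time instead.
    ttft = (first_token_at - start) if first_token_at is not None else total
    throughput = count / total if total > 0 else 0.0
    return StreamMetrics(ttft, total, count, throughput)
```

Passing the same function a streaming iterator from any of the three approaches would keep the measurement itself identical across runs.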
The results:
| Approach | Latency (ms) | Throughput (tok/s) | Relative Throughput (vs. Ollama) |
|---|---|---|---|
| Ollama | 2,400 | 12 | 1x |
| Bare API | 180 | 95 | 8x |
| Agent Stack | 14,200 | 2.8 | 0.23x |
By throughput, the bare API call was approximately 8x faster than Ollama for this model and input (95 vs. 12 tok/s). By latency, the agent stack was approximately 80x slower than the bare approach (14,200 ms vs. 180 ms).
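The two headline ratios come from different columns of the table, so it may help to see the arithmetic spelled out. The snippet below only re-derives ratios from the published figures; the dictionary layout and function names are illustrative.

```python
# Figures copied from the results table; nothing here is newly measured.
RESULTS = {
    "Ollama": {"latency_ms": 2400.0, "tok_per_s": 12.0},
    "Bare API": {"latency_ms": 180.0, "tok_per_s": 95.0},
    "Agent Stack": {"latency_ms": 14200.0, "tok_per_s": 2.8},
}


def throughput_ratio(fast: str, slow: str) -> float:
    """How many times more tokens per second `fast` produces than `slow`."""
    return round(RESULTS[fast]["tok_per_s"] / RESULTS[slow]["tok_per_s"], 1)


def latency_ratio(slow: str, fast: str) -> float:
    """How many times longer `slow` takes to respond than `fast`."""
    return round(RESULTS[slow]["latency_ms"] / RESULTS[fast]["latency_ms"], 1)


print(throughput_ratio("Bare API", "Ollama"))       # 7.9  (the "8x" figure)
print(latency_ratio("Agent Stack", "Bare API"))     # 78.9 (the "80x" figure)
print(latency_ratio("Ollama", "Bare API"))          # 13.3
print(throughput_ratio("Bare API", "Agent Stack"))  # 33.9
```

The gap between approaches looks different depending on whether you compare latency or throughput, which is worth keeping in mind when quoting a single multiplier.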
Those are dramatic differences. Let's break down why.
Why Ollama Is Slower
Ollama ran locally on our test hardware (an RTX 3090) with the model quantized to Q4. Even with GPU acceleration, local inference with smaller quantized models ran slower than optimized cloud APIs serving larger models. That's a hardware trade-off.
The counter-intuitive finding: Ollama isn't automatically faster. Cloud APIs optimize for inference in ways that local hardware often can't match.
Why Agent Stacks Are Dramatically Slower
The ~80x overhead comes from:
- Serialization: Every tool call gets serialized and deserialized
- Prompt manipulation: The framework modifies your prompt with instructions and examples
- Token counting: The framework re-tokenizes and recounts the prompt, often redundantly
- Memory operations: Every agent action triggers memory read/write
- Chain overhead: Multiple model calls in chains multiply the latency
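If you want to see how much of a run is framework time rather than model time, one approach is to wrap the model call in a timer and subtract. The sketch below assumes a pipeline that accepts the model as a callable; `ModelTimer` and `framework_overhead` are hypothetical names, and real frameworks may need a different hook (for example, a custom client or callback).

```python
import time
from typing import Callable, Tuple


class ModelTimer:
    """Wrap a model call so time spent inside it can be separated out."""

    def __init__(self, model_call: Callable[[str], str],
                 clock: Callable[[], float] = time.perf_counter):
        self._model_call = model_call
        self._clock = clock
        self.model_seconds = 0.0
        self.calls = 0

    def __call__(self, prompt: str) -> str:
        start = self._clock()
        try:
            return self._model_call(prompt)
        finally:
            # Accumulate time even if the model call raises.
            self.model_seconds += self._clock() - start
            self.calls += 1


def framework_overhead(
    pipeline: Callable[[ModelTimer, str], object],
    timed_model: ModelTimer,
    prompt: str,
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, float, float]:
    """Return (total_seconds, model_seconds, overhead_seconds) for one run."""
    start = clock()
    pipeline(timed_model, prompt)
    total = clock() - start
    return total, timed_model.model_seconds, total - timed_model.model_seconds
```

The `calls` counter also shows how many model round trips a single user request triggered, which is where chained latency adds up.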
The agent stack wasn't designed for low-latency inference — it was designed for agentic behavior.
What This Means for Your Build
The choice between approaches depends on your priorities:
Use Ollama when:
- You need predictable, uncapped usage at scale
- You have GPU capacity and want local control
- Latency isn't your primary constraint
- You're okay with quantized model trade-offs
Use bare harness when:
- You need the fastest possible inference
- You're building production systems with SLA requirements
- You don't need agentic abstractions
- You're prepared to write and maintain retry and fallback code yourself
Use agent stacks when:
- You're building complex multi-step agents
- Productivity matters more than marginal latency costs
- You need the abstractions to move fast
- You're okay with the overhead trade-off
What the Full Report Covers
This teaser covers the headline numbers. The full benchmark report includes:
- Complete methodology and test prompts
- Detailed per-model breakdowns (five models tested)
- Token cost analysis across all approaches
- Memory and CPU usage during inference
- Framework-by-framework overhead comparison
- Specific recommendations by use case
- Actual code examples showing each approach
This is the most comprehensive LLM inference benchmark we've ever run. It's available as a purchasable report.
Who This Is For
This report is for engineering teams building with LLMs. It's for technical leads evaluating infrastructure decisions. It's for founders comparing costs and trade-offs.
If you're choosing between running locally, calling APIs directly, or using an agent framework — this report helps you decide.
The Full Report
The complete benchmark is available for purchase. It includes:
- 30+ pages of methodology, data, and analysis
- Five models tested across three approaches
- Real production workloads replicated on each system
- Framework overhead analysis for LangChain, AutoGen, and CrewAI
- Cost modeling for each approach at scale
- Decision framework for choosing the right approach
- Source code for reproducing all benchmarks
This is what we wish every team building with LLMs had before they made their infrastructure decisions.
The report is available from Decision Science Corp. Reach out to get on the waitlist or to purchase directly.
Close
The numbers are dramatic, but the choice is contextual. Ollama is right for some teams. The bare approach is right for others. Agent stacks have their place.
What matters is making an informed choice. That's what this benchmark provides — informed decisions based on real data, not marketing claims.
If you are weighing build-vs-buy on infrastructure like this, and the real question is what to commit to next, describe the decision you are facing. We scope around outcomes, not open-ended tours.