Running an agent harness on a $5 VPS

Can a $5-class VPS and a tiny local model do real agent work? CPU-only, no GPU, no Letta-scale infrastructure — just the Bumblebee harness, Ollama, and a question. We ran it for weeks to find out what breaks before anything useful happens, and what holds up well enough to recommend.

This is an honest post about what we found.

Public repository: Bumblebee-AGI/bumblebee, an open-source agent harness (not ours) that connects to Ollama for local smoke tests.

What Bumblebee Is

Bumblebee is the open-source agent harness from the Bumblebee-AGI project (Bumblebee-AGI/bumblebee). It connects to Ollama for local model inference, defines entities with configuration, and exposes a CLI for interactive agent work. You define an entity, point it at an Ollama endpoint, and talk to it.

It's agent infrastructure — not hosting monitoring, not health checks, not Prometheus exporters. We ran Bumblebee because we wanted to understand the harness's behavior on hardware we actually want to use for experiments.

The Smoke Lab Setup

We provisioned a BitLaunch nibble-1024 instance:

  • 1 vCPU, 1 GiB RAM
  • CPU-only (no GPU)
  • Running Debian, Ollama service

The entity configuration in configs/entities/smoke.yaml:

 name: smoke
 ollama:
   endpoint: "http://localhost:11434"
   model: "gemma3:270m-it-qat"
   embedding: "nomic-embed-text"
 environment:
   BUMBLEBEE_OLLAMA_NO_TOOLS: "1"  # tools disabled; this model build doesn't support tool calling

We disabled tools because that model build doesn't support tool calling. The harness degrades gracefully, but the agent loses its ability to take actions.

CLI usage:

 bumblebee talk smoke --ollama
 bumblebee ask smoke --ollama "what is 47 times 12"

This is agent research on the cheapest sensible hardware. Not demo-quality.

What We Measured

We measured latency and failure modes on the smoke box:

Wall-clock latency: bumblebee ask commands took 48 to 122 seconds to return. That covers the full harness cycle: prompt building, Ollama inference, and output parsing. The model itself generated at roughly 19-20 tokens per second. The rest of the time went to harness overhead and network round-trips within the box.
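A minimal sketch of how wall-clock numbers like these can be collected, assuming Python 3 is available on the box. The helper name time_command and the 600-second timeout are ours for illustration and are not part of Bumblebee. The demo times a trivial Python child process so the sketch runs anywhere; the comment shows where the real CLI call would go.

```python
import subprocess
import sys
import time


def time_command(argv, timeout=600):
    """Run argv to completion; return (wall_seconds, exit_code, stdout)."""
    start = time.perf_counter()
    result = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    return time.perf_counter() - start, result.returncode, result.stdout


# Stand-in command so the sketch runs anywhere. On the smoke box, replace with:
# ["bumblebee", "ask", "smoke", "--ollama", "what is 47 times 12"]
demo = [sys.executable, "-c", "print('pong')"]
seconds, code, out = time_command(demo)
print(f"exit={code} wall={seconds:.2f}s output={out.strip()!r}")
```

Timing the whole process from the outside captures everything a user waits for, which is why it reports more than the model's generation time alone.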

Generation quality: The 270M-parameter model can hold a conversation and produce readable text. That's it. Multi-step reasoning failed, and structured output drifted away from the requested format. Math went wrong as soon as a problem had any complexity. Formatting requirements (JSON, bullet lists) frequently broke.
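Format breakage of this kind can be scored mechanically. The sketch below is one way to do it, assuming the prompt asked for a single JSON object and nothing else; the function names and the sample replies are hypothetical and are not taken from our logs.

```python
import json


def parses_as_json_object(reply: str) -> bool:
    """True only if the entire reply is one JSON object, with no extra prose."""
    try:
        parsed = json.loads(reply.strip())
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict)


def compliance_rate(replies) -> float:
    """Fraction of replies that pass the strict JSON-object check."""
    if not replies:
        return 0.0
    return sum(parses_as_json_object(r) for r in replies) / len(replies)


# Made-up replies for illustration only.
samples = ['{"answer": 564}', 'Sure! Here is the JSON: {"answer": 564}']
print(compliance_rate(samples))  # 0.5
```

The check is deliberately strict: a reply that wraps valid JSON in chatty prose still counts as a failure, because a downstream parser would reject it.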

Tool hallucination: Even with tools explicitly disabled in configuration, the model sometimes hallucinated tool calls. It assumed it had capabilities it didn't. A model this small can't keep track of what it can and cannot do.
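Hallucinated tool calls can be flagged with a rough text heuristic. The sketch below is an assumption about what such output tends to look like (tool-call tags or function-call-shaped JSON); the patterns are hypothetical and would need tuning against real transcripts.

```python
import re

# Hypothetical patterns for tool-call-shaped text; not an exhaustive list.
TOOL_CALL_PATTERNS = [
    re.compile(r"<tool_call>", re.IGNORECASE),
    re.compile(r'"tool_calls"\s*:'),
    re.compile(r'"name"\s*:\s*"[^"]+"\s*,\s*"arguments"\s*:'),
]


def looks_like_tool_call(reply: str) -> bool:
    """True if the reply contains text shaped like a tool invocation."""
    return any(pattern.search(reply) for pattern in TOOL_CALL_PATTERNS)


# Made-up replies for illustration only.
print(looks_like_tool_call('{"name": "calculator", "arguments": {"a": 47, "b": 12}}'))  # True
print(looks_like_tool_call("47 times 12 is 564."))  # False
```

With tools disabled, any match is a hallucination by definition, so false negatives matter more here than false positives.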

Policy regurgitation: Sometimes the model returned long policy-text responses instead of short answers. It defaults to cautious, verbose output, which is a defensible design choice but not useful for an agent workflow that needs concision.

What works on smoke: simple Q&A, short-form responses, basic information retrieval. What doesn't work: reasoning chains, math, structured output, tool use, formatting enforcement.

Working with Our Research Agents

We coordinated with Athena, our Letta-based research agent, on what to probe and measure from the smoke box. This wasn't the smoke box reporting to production; it was research-to-research communication about which metrics matter.

Athena helped us define the test matrix: latency characterization across prompt lengths (1 to 5 sentences), reasoning probes built on multi-step logic puzzles, instruction compliance for format enforcement (JSON, bullet lists), and failure mode identification (hallucination, policy drift, math errors). We fed Athena the raw timing and output data, and she annotated it with the patterns she found: which failure modes correlate with which prompt types, where reasoning breaks first, and where latency spikes relative to prompt complexity.
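The matrix described above can be written down as plain data so every run covers the same probes. This is a sketch of one possible layout, not the format we exchanged with Athena; the Probe class and the variant labels are illustrative names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Probe:
    category: str  # which dimension of behavior is being tested
    variant: str   # the specific condition within that dimension


def build_matrix():
    """Enumerate the probes: latency, reasoning, format compliance, failure modes."""
    latency = [Probe("latency", f"{n}-sentence prompt") for n in range(1, 6)]
    reasoning = [Probe("reasoning", "multi-step logic puzzle")]
    compliance = [Probe("format compliance", fmt) for fmt in ("JSON", "bullet list")]
    failures = [
        Probe("failure mode", mode)
        for mode in ("hallucination", "policy drift", "math errors")
    ]
    return latency + reasoning + compliance + failures


for probe in build_matrix():
    print(f"{probe.category}: {probe.variant}")
```

Keeping the matrix as data makes it easy to attach timings and outputs to each probe later and compare runs across boxes.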

This is how we do research: isolate the component, instrument it, coordinate with production agents to contextualize findings. We don't treat smoke as production. We treat it as a data point. The conversation between research and production agents is what makes the measurement meaningful.

What This Is NOT

This is NOT a production Sanctum agent. Letta, Athena, Ada, Broca, and the rest run on separate infrastructure with appropriate models. The smoke box is intentionally isolated and intentionally limited.

This is NOT hosting monitoring. We didn't check port 80, DNS resolution, or uptime. That's infrastructure monitoring, which we do separately with different tools.

This is NOT the limit of what small-model agents can do. We tested one specific model at one specific scale. Models with larger parameter counts may perform better. We'll test those.

Why This Matters

The smoke box exists to establish the lower bound: what can you do with the cheapest reasonable hardware, in exchange for what trade-offs?

The honest answer at this scale:

  • You can have a conversational agent that responds
  • You cannot have reasoning chains or reliable tool use
  • Latency makes interactive use painful past trivial queries
  • The model is too small to maintain instruction following for complex tasks

This informs our decisions: when someone asks "can we run agents on cheap VPS hosting," we now have a data point. The answer is: technically yes, but practically limited. If you need reasoning, structured output, or tool use, budget hosting isn't the constraint — the model is. Spend money on the model, not the harness.

The economics: harness overhead at this scale is substantial. We're seeing an 80× slowdown in the harness layer versus raw Ollama on the same hardware. The smaller the model, the more harness overhead dominates relative to generation speed. These are the honest economics of cheap agent experiments: you pay for the model, and you pay for the harness.
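The overhead figure is a ratio of two wall-clock timings for the same prompt: one through the harness, one against raw Ollama. A minimal sketch of that arithmetic follows; the function name and the sample timings are placeholders chosen to illustrate the calculation, not our measured values.

```python
def overhead_factor(harness_seconds: float, raw_seconds: float) -> float:
    """How many times slower the harness path is than raw Ollama."""
    if raw_seconds <= 0:
        raise ValueError("raw_seconds must be positive")
    return harness_seconds / raw_seconds


# Placeholder timings, not measurements.
print(overhead_factor(harness_seconds=40.0, raw_seconds=0.5))  # 80.0
```

Both timings should come from the same box and the same prompt, otherwise the ratio mixes hardware differences into the harness cost.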

The Follow-On

We also benchmarked a larger dev-agent box (8 GiB RAM, Qwen 2.5 7B, tools enabled) on the same line. Raw generation was roughly 5 tokens per second, significantly slower than the 270M model's 19-20 tokens per second, as expected for a much larger model on CPU. Tool calling worked. The harness added roughly 80× overhead for the same "pong" prompt compared to raw Ollama on the same CPU box. The larger model makes the harness cost more visible, but the model carries the work better.

The dev-agent box is now our agent lab line. We'll publish more on this in future Technonomicon benchmark posts.

Closing

The smoke box answered our question: no, you can't run useful agents on $5 VPS hosting with tiny models. Not for production work. But for experimentation, the smoke box has a purpose — it's a soak test for harness behavior, a baseline for latency measurement, and an existence proof that we're willing to do the honest measurement ourselves.

If you are choosing where to run agent smoke tests and want a production-grounded view of cost vs fidelity, tell us what you are deciding.