How walkindb scales to 10 000 concurrent walk-ins on a single €6 VPS

Share-nothing file-per-instance architecture, no connection pool, Landlock + seccomp sandbox, and why the filesystem is the database of instances.

Posted 2026-04-11 · 8 minute read

The claim

walkindb runs one Go binary on one OVHcloud VPS. The VPS costs roughly €6 per month. With backups it's €7.20. The binary's resident memory, right now, in production, is 3.1 MB. The whole thing — including the landing page, the SDKs, the legal docs, a Cloudflare Pages deployment, and walkindb.com itself — costs around $107 per year end-to-end.

On that setup we target 10 000 concurrent walk-ins, 1 000 sustained queries per second, and 100 new instance creations per second. This post explains why that's not a wild claim, why it's possible on hardware a laptop could outperform, and what architectural choices made it fall out naturally.

The honest framing: none of what follows is magic. It's what you get when you pick a problem that fits SQLite's shape and you aggressively refuse to add infrastructure.

The architecture in one picture

     +-----------------------------+
     |    Cloudflare (DNS only)    |
     +-------------+---------------+
                   |
              api.walkindb.com
                   |
     +-------------v---------------+
     |  Caddy (TLS, CORS, proxy)   |
     +-------------+---------------+
                   |
              127.0.0.1:8080
                   |
     +-------------v---------------+     +-------------+
     |  walkindb binary (Go)       |     |  Landlock + |
     |                             |<----+  seccomp    |
     |  - HTTP router              |     +-------------+
     |  - Session HMAC verifier    |
     |  - Rate limiter (per IP)    |
     |  - SQL keyword blocklist    |
     |  - SQLite executor (pure Go)|
     |  - TTL sweeper (goroutine)  |
     +-------------+---------------+
                   |
     +-------------v---------------+
     |   /var/walkindb/            |
     |     secrets/hmac.key        |
     |     instances/<uuid>/       |
     |         db.sqlite           |
     |         meta.json           |
     |     logs/access-*.jsonl     |
     +-----------------------------+

There is no Postgres. No Redis. No service mesh. No Kubernetes. No message queue. No Docker. There is a binary, a filesystem, and a single systemd unit.

Choice #1: share-nothing, file-per-instance

Every walk-in is its own SQLite file. The path is /var/walkindb/instances/<uuid>/db.sqlite. There is no shared schema, no shared catalog, no shared anything. If you have 10 000 concurrent walk-ins, you have 10 000 SQLite files.

This is the single most load-bearing decision in the whole product. It means:

  • Writes never contend across tenants. SQLite's single-writer lock is famously a bottleneck when many tenants share one database. When each tenant has their own database, it becomes a feature — it guarantees the walk-in's author is the only writer. Contention disappears.
  • There is no connection pool. Each HTTP request opens a fresh SQLite connection to its instance's file, runs one statement (maybe a batch), and closes. Opening a local SQLite file is ~100 μs in Go. We measured: the whole POST /sql round trip, including JSON decode, session verify, executor limits, and result encode, is under 1 ms for simple SELECTs.
  • Per-tenant quotas are trivial. Want to cap one walk-in at 10 MB? PRAGMA max_page_count = 2560. SQLite enforces the cap itself and returns SQLITE_FULL when the user exceeds it, which our executor maps to HTTP 507. No accounting logic, no billing-system plumbing, no quota service.
  • Data expiry is one rm -rf. The TTL sweeper is a goroutine that scans /var/walkindb/instances/ every 30 seconds, reads each meta.json, and removes any directory whose expires_at is in the past. There is no "DELETE FROM instances WHERE expired" query, because there is no instances table. The filesystem IS the database of instances.

Most managed-database products fight the database to isolate tenants. walkindb skips the fight by never sharing anything in the first place. The result is lower per-tenant overhead, not higher.

Choice #2: the filesystem is the database of metadata

Look at the files walkindb reads and writes:

/var/walkindb/
├── secrets/
│   └── hmac.key            # 32 bytes, rotated daily
├── instances/
│   ├── 018f2b...a3/
│   │   ├── db.sqlite       # the walk-in
│   │   └── meta.json       # {created_at, expires_at, ttl_seconds}
│   └── 018f2b...a4/
│       └── ...
└── logs/
    ├── access-2026-04-11.jsonl   # daily-rotated; 7-day retention
    └── access-2026-04-10.jsonl

The operations walkindb needs on this state are:

  • Given a session token → resolve to instance directory. That's an HMAC verify plus a stat(2).
  • Check whether an instance is expired. That's reading meta.json (~100 bytes) and comparing a timestamp.
  • Delete an expired instance. That's rm -rf <dir>.
  • Rotate the access log. That's opening a new file.

All of these are O(1) or O(number of instances being swept). None of them need indexes, transactions, or a query planner. They're syscalls. SQLite would be overkill for metadata about SQLite files.

The popular failure mode here is adding Postgres to store "just the tenant list". Don't. Once you have Postgres, you have a thing to back up, a thing to fail over, a thing to scale, a thing to budget for. The walkindb design says: the filesystem already has every property I need, and it's managed by systemd.

Choice #3: no connection pool

Typical managed databases maintain a connection pool per tenant. That's a scaling problem because pools are sized for peak load and sit idle at baseline. 10 000 tenants × 10 connections each = 100 000 connections, which is enough to exhaust a real Postgres before any queries run.

walkindb has zero long-lived SQLite connections. Every POST /sql does:

sql.Open("sqlite", dsn)   // open the instance file
db.Conn(ctx)              // get a dedicated connection for this request
Limit(conn, ...)          // apply per-connection sqlite3_limit
Query/Exec(ctx, sql)      // run the statement with a 2 s timeout
conn.Close()              // release the connection
db.Close()                // close the file handle

No pool. No keepalive. The Go runtime manages the file handles; the OS page cache keeps hot SQLite pages resident. We don't pre-allocate anything. Idle walk-ins cost literally zero CPU — the only state they have is meta.json on disk.

The consequence is profound: walkindb's memory footprint does not scale with the number of walk-ins it's serving. It scales with the number of in-flight requests, which is bounded by the rate limiter (60/min/IP) times the number of unique client IPs. At steady state, idle walk-ins cost us nothing in RAM.

Choice #4: the request path is the only hot path

Here's everything that happens on a POST /sql request, in order:

  1. Caddy accepts the HTTPS connection, terminates TLS, reverse-proxies to 127.0.0.1:8080. Caddy is the only long-lived process besides walkindb itself.
  2. Go's net/http server routes POST /sql to our handler.
  3. Per-IP rate limiter consumes a token (two buckets: request + new-instance). Implemented as an in-memory LRU of golang.org/x/time/rate limiters, capped at 10 K tracked IPs.
  4. Body is decoded as JSON, capped at 8 KB by http.MaxBytesReader.
  5. If X-Walkin-Session is present, the token is HMAC-verified against current + previous secrets in constant time. On failure: 404.
  6. If no token, mint a new one (UUIDv7 + 32-byte nonce + HMAC) and create a new instance directory.
  7. SQL keyword blocklist strips comments and rejects forbidden patterns. Regex matching, no parsing.
  8. Executor opens the instance's SQLite file, gets a fresh connection, applies 10 per-connection limits, runs the statement under a 2-second context deadline.
  9. Result is encoded as JSON and written to the response.
  10. Access log middleware writes one line to today's JSONL file: timestamp, IP, instance ID, method, status, SQL byte count, user-agent. Nothing else.
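Step 5, the constant-time verification against the current and previous key, needs nothing beyond the standard library. The token wire format below (base64 payload, dot, base64 tag) is an assumption for illustration; the post only specifies the ingredients (UUIDv7 + 32-byte nonce + HMAC), not the layout.

```go
package main

import (
	"crypto/hmac"
	"crypto/rand"
	"crypto/sha256"
	"encoding/base64"
	"strings"
)

// Mint produces base64(payload) + "." + base64(HMAC-SHA256(payload)).
// The payload here is instanceID plus a 32-byte random nonce; the exact
// production layout is not published, so treat this as a sketch.
func Mint(key []byte, instanceID string) (string, error) {
	nonce := make([]byte, 32)
	if _, err := rand.Read(nonce); err != nil {
		return "", err
	}
	payload := append([]byte(instanceID+"."), nonce...)
	mac := hmac.New(sha256.New, key)
	mac.Write(payload)
	return base64.RawURLEncoding.EncodeToString(payload) + "." +
		base64.RawURLEncoding.EncodeToString(mac.Sum(nil)), nil
}

// Verify checks the token against both the current and the previous key,
// since daily rotation means tokens minted just before a rotation must
// still verify. hmac.Equal is constant time, so comparison leaks nothing.
func Verify(current, previous []byte, token string) (instanceID string, ok bool) {
	i := strings.LastIndex(token, ".")
	if i < 0 {
		return "", false
	}
	payload, err := base64.RawURLEncoding.DecodeString(token[:i])
	if err != nil {
		return "", false
	}
	tag, err := base64.RawURLEncoding.DecodeString(token[i+1:])
	if err != nil {
		return "", false
	}
	for _, key := range [][]byte{current, previous} {
		mac := hmac.New(sha256.New, key)
		mac.Write(payload)
		if hmac.Equal(tag, mac.Sum(nil)) {
			id, _, _ := strings.Cut(string(payload), ".")
			return id, true
		}
	}
	return "", false
}
```

Note the failure mode on bad tokens is a bare "not found", matching the 404 in step 5: an attacker cannot distinguish "wrong key" from "no such instance".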

Ten steps. None of them allocate more than a few kilobytes. None of them open a network connection to anywhere else. The entire stack is RAM + local filesystem.

Choice #5: share-nothing security

Share-nothing is usually pitched as a performance property. At walkindb's scale it's more valuable as a security property, because it composes with the sandbox.

Consider ATTACH DATABASE '/etc/passwd' AS bad, the canonical SQLite-escape attempt. Three independent layers stop it:

  1. Application keyword blocklist. walkindb rejects the request with 400 forbidden sql keyword: ATTACH before SQLite parses the statement. This catches the attack 100 % of the time in practice, and it's cheap — one regex match after a comment strip.
  2. sqlite3_limit(LIMIT_ATTACHED, 0). Even if the blocklist is somehow bypassed, SQLite itself has been told it may attach zero databases. ATTACH returns an error at the engine level.
  3. Landlock. Even if SQLite has a CVE and honors the ATTACH despite the limit, the kernel blocks the open(2). Landlock (Linux LSM, available since 5.13) has restricted the walkindb process to paths under /var/walkindb/** plus /tmp. open("/etc/passwd") from the walkindb process returns EACCES. No amount of SQL cleverness can reach that file.
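Layer 1 is cheap enough to show whole: strip comments, then one case-insensitive regex match. The keyword set below is illustrative (the post names only ATTACH explicitly; the full production blocklist is an assumption), but the mechanism, regex matching with no SQL parsing, is as described.

```go
package main

import (
	"regexp"
	"strings"
)

var (
	lineComment  = regexp.MustCompile(`--[^\n]*`)
	blockComment = regexp.MustCompile(`(?s)/\*.*?\*/`)
	// Illustrative keyword set; the real blocklist is not published.
	forbidden = regexp.MustCompile(`(?i)\b(ATTACH|DETACH|PRAGMA|LOAD_EXTENSION)\b`)
)

// CheckSQL returns the forbidden keyword found, or "" if the statement
// passes. Comments are replaced with a space first, so a keyword hiding
// inside a comment is not a false positive, and SQLite's own tokenizer
// (which treats comments as whitespace) agrees with what we matched.
func CheckSQL(stmt string) string {
	clean := lineComment.ReplaceAllString(stmt, " ")
	clean = blockComment.ReplaceAllString(clean, " ")
	if m := forbidden.FindString(clean); m != "" {
		return strings.ToUpper(m)
	}
	return ""
}
```

A hit maps straight to the 400 response from layer 1; a miss falls through to sqlite3_limit and Landlock, which is the whole point of the layering.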

Three layers, none of which depend on the correctness of the one above them. That's the walkindb security story: we don't need any single check to be perfect, because failure is contained at the layer below it.

Compare to a typical shared-tenant SQL service: they have ONE layer — the authorizer callback — and if it misses a case, the attacker has root on the shared database engine. walkindb's share-nothing layout means the worst case isn't "compromised engine", it's "compromised your own walk-in instance", which is already empty of anything you didn't put there.

Choice #6: seccomp as icing

The walkindb systemd unit pins a syscall allowlist via SystemCallFilter=@system-service and adds explicit denies for the dangerous groups:

SystemCallFilter=~@mount @swap @reboot @raw-io @cpu-emulation @debug @obsolete @privileged @resources

This is belt over Landlock's suspenders. Even if an attacker could somehow get arbitrary syscalls to execute inside the walkindb process, they can't call mount, reboot, swapon, ptrace, raw I/O, debug registers, or any of the other groups that would be useful for escalation. Combined with NoNewPrivileges, CapabilityBoundingSet= (empty), MemoryDenyWriteExecute, and RestrictNamespaces, the process is locked down as hard as systemd allows without a container.

The cost of adding infrastructure

Every shared service we didn't add is a service we don't have to pay for, secure, backup, fail over, or explain:

  • Postgres for metadata: $20+/mo managed or ~500 MB RAM self-hosted; 1 more thing to back up; 1 more CVE feed to watch
  • Redis for rate limiting: $15+/mo or ~50 MB RAM self-hosted; a network hop on every request
  • Connection pool: ~100 MB RAM at rest; complexity around pool exhaustion
  • Kubernetes: 1 control plane, 1 set of manifests, 1 learning curve, 1 category of outage
  • Docker (for the hot path): ~50 MB overhead per container; longer start times; cgroups that systemd already does
  • Background job queue: the only "background job" is the TTL sweeper, which is a goroutine
  • Message broker: there are no messages; it's a request/response API
  • Sharding layer: one VPS; if we need to shard, we'll shard
  • Service mesh: one process on one box talks to itself on localhost

Resisting every one of those is what kept the memory footprint at 3 MB and the monthly bill at €6. We didn't optimize walkindb to be small — we refused to add anything that would make it large.

When this stops working

The honest failure modes of this architecture:

  • One VPS is one failure domain. If the OVH box loses power, walkindb is down. At 10-minute TTLs this is actually cheap — there's no catastrophic data loss possible, because there's no durable data. But availability is one-nine until we add a standby.
  • Vertical scaling has limits. An OVH VPS Value has 2 vCPUs and 4 GB RAM. At some point, 10 000 genuinely-active concurrent walk-ins would exceed it. The answer then is to add more boxes with a consistent-hash shard on instance ID — not to rewrite the architecture.
  • Network egress is a hidden cost. Cloudflare is DNS-only so there's no free egress bandwidth; all response bytes leave via OVH. At ~1 KB responses, that's fine for millions of requests.
  • The filesystem is the database of instances — including the bugs. If we ever hit a filesystem corruption, we have no database-level repair tools. Mitigation: the per-instance files are small (≤10 MB) and short-lived, so the blast radius is one walk-in, not the whole service.

What this enables

The share-nothing, file-per-instance design is what lets walkindb exist as a free service. If every walk-in cost us a connection pool slot and a row in a metadata table, we could not afford to give them away without a credit card. Because every walk-in costs us a directory and a 100-byte JSON file, both cleaned up automatically, the marginal cost of an additional walk-in is approximately nothing.

That's the whole product thesis: agents can't sign up for things, so walkindb doesn't ask. And the only way to make "don't ask" economically sustainable is to make each walk-in cheap enough that the operator genuinely doesn't care whether you use one.

The architecture is the business model.

Also see