How walkindb holds 100 000 concurrent walk-ins on a single €6 VPS

Share-nothing file-per-instance architecture, no connection pool, Landlock + seccomp sandbox, and why the filesystem is the database of instances.

Posted 2026-04-11 · 10 minute read

The claim, with receipts

walkindb runs one Go binary on one OVHcloud VPS. The VPS costs roughly €6/month; with backups €7.20. The whole service — landing page, SDKs, legal docs, Cloudflare Pages deployment, and walkindb.com itself — costs around $107/year end-to-end.

We went into this post with aspirational targets: 10 000 concurrent walk-ins, 1 000 sustained qps, 100 new instance creations per second. Then we benchmarked. The numbers below are measured on the actual production VPS by a Go harness in bench/ of the repo, running against a second walkindb instance at 127.0.0.1:9090 with rate limits cranked high. Reproduce it yourself; everything is open.

Measured on the €6 VPS (2 vCPU, 8 GB RAM, Gravelines):
  • 100 000 concurrent walk-ins on disk in 727 seconds at c=8 — zero failures. Box still had 7.1 GB of 8 GB RAM free and 68 GB of 72 GB disk free.
  • walkindb RSS stayed flat at ~19 MB between 5 000 and 12 000 resident walk-ins. Marginal RAM per idle walk-in: approximately zero.
  • 138 walk-in creations per second sustained (concurrent, c=8). 59/sec single-threaded. No backoff, no queueing.
  • ~930 requests per second sustained POST /sql SELECT 1 throughput at c=32 (saturation; c=128 gave the same rate with worse tail latency).
  • p50 3.0 ms, p99 21 ms for POST /sql SELECT 1 on a reused session (server-side, loopback to exclude internet).
  • ~14 KB per walk-in on disk (SQLite file + WAL sidecars + meta.json). Projected disk ceiling: ~4.8 million walk-ins.
  • Production walkindb process at rest: 11 MB RSS. The whole service fits in less RAM than a single Slack window.

The honest framing: none of what follows is magic. It's what you get when you pick a problem that fits SQLite's shape and you aggressively refuse to add infrastructure.

The architecture in one picture

     +-----------------------------+
     |    Cloudflare (DNS only)    |
     +-------------+---------------+
                   |
              api.walkindb.com
                   |
     +-------------v---------------+
     |  Caddy (TLS, CORS, proxy)   |
     +-------------+---------------+
                   |
              127.0.0.1:8080
                   |
     +-------------v---------------+     +-------------+
     |  walkindb binary (Go)       |     |  Landlock + |
     |                             |<----+  seccomp    |
     |  - HTTP router              |     +-------------+
     |  - Session HMAC verifier    |
     |  - Rate limiter (per IP)    |
     |  - SQL keyword blocklist    |
     |  - SQLite executor (pure Go)|
     |  - TTL sweeper (goroutine)  |
     +-------------+---------------+
                   |
     +-------------v---------------+
     |   /var/walkindb/            |
     |     secrets/hmac.key        |
     |     instances/<uuid>/       |
     |         db.sqlite           |
     |         meta.json           |
     |     logs/access-*.jsonl     |
     +-----------------------------+

There is no Postgres. No Redis. No service mesh. No Kubernetes. No message queue. No Docker. There is a binary, a filesystem, and a single systemd unit.

Choice #1: share-nothing, file-per-instance

Every walk-in is its own SQLite file. The path is /var/walkindb/instances/<uuid>/db.sqlite. There is no shared schema, no shared catalog, no shared anything. If you have 10 000 concurrent walk-ins, you have 10 000 SQLite files.

This is the single most load-bearing decision in the whole product. It means:

  • Writes never contend across tenants. SQLite's notorious single-writer lock is famously a bottleneck when you have one database. When each tenant has their own database, it's a feature — it guarantees the walk-in's author is the only writer. Contention disappears.
  • There is no connection pool. Each HTTP request opens a fresh SQLite connection to its instance's file, runs one statement (maybe a batch), and closes. Opening a local SQLite file is ~100 μs in Go. We measured: the whole POST /sql round trip, including JSON decode, session verify, executor limits, and result encode, has a p50 of 3.0 ms for SELECT 1 on a reused session over loopback.
  • Per-tenant quotas are trivial. Want to cap one walk-in at 10 MB? PRAGMA max_page_count = 2560. SQLite enforces the cap itself and returns SQLITE_FULL when the user exceeds it, which our executor maps to HTTP 507. No accounting logic, no billing-system plumbing, no quota service.
  • Data expiry is one rm -rf. The TTL sweeper is a goroutine that scans /var/walkindb/instances/ every 30 seconds, reads each meta.json, and removes any directory whose expires_at is in the past. There is no "DELETE FROM instances WHERE expired" query, because there is no instances table. The filesystem IS the database of instances.

Most managed-database products fight the database to isolate tenants. walkindb skips the fight by never sharing anything in the first place. The result is lower per-tenant overhead, not higher.

Choice #2: the filesystem is the database of metadata

Look at the files walkindb reads and writes:

/var/walkindb/
├── secrets/
│   └── hmac.key            # 32 bytes, rotated daily
├── instances/
│   ├── 018f2b...a3/
│   │   ├── db.sqlite       # the walk-in
│   │   └── meta.json       # {created_at, expires_at, ttl_seconds}
│   └── 018f2b...a4/
│       └── ...
└── logs/
    ├── access-2026-04-11.jsonl   # daily-rotated; 7-day retention
    └── access-2026-04-10.jsonl

The operations walkindb needs on this state are:

  • Given a session token → resolve to instance directory. That's an HMAC verify plus a stat(2).
  • Check whether an instance is expired. That's reading meta.json (~100 bytes) and comparing a timestamp.
  • Delete an expired instance. That's rm -rf <dir>.
  • Rotate the access log. That's opening a new file.

All of these are O(1) or O(number of instances being swept). None of them need indexes, transactions, or a query planner. They're syscalls. SQLite would be overkill for metadata about SQLite files.

The popular failure mode here is adding Postgres to store "just the tenant list". Don't. Once you have Postgres, you have a thing to back up, a thing to fail over, a thing to scale, a thing to budget for. The walkindb design says: the filesystem already has every property I need, and it's managed by systemd.

Choice #3: no connection pool

Typical managed databases maintain a connection pool per tenant. That's a scaling problem because pools are sized for peak load and sit idle at baseline. 10 000 tenants × 10 connections each = 100 000 connections, which is enough to exhaust a real Postgres before any queries run.

walkindb has zero long-lived SQLite connections. Every POST /sql does:

sql.Open("sqlite", dsn)  // open the instance file
db.Conn(ctx)             // get a dedicated connection for this request
Limit(conn, ...)         // apply per-connection sqlite3_limit
Query/Exec(ctx, sql)     // run the statement with a 2 s timeout
conn.Close()             // release the connection
db.Close()               // close the file handle

No pool. No keepalive. The Go runtime manages the file handles; the OS page cache keeps hot SQLite pages resident. We don't pre-allocate anything. Idle walk-ins cost literally zero CPU — the only state they have is meta.json on disk.

The consequence is profound: walkindb's memory footprint does not scale with the number of walk-ins it's serving. It scales with the number of in-flight requests, which is bounded by the rate limiter (60/min/IP) times the number of unique client IPs. At steady state, idle walk-ins cost us nothing in RAM.

Choice #4: the request path is the only hot path

Here's everything that happens on a POST /sql request, in order:

  1. Caddy accepts the HTTPS connection, terminates TLS, reverse-proxies to 127.0.0.1:8080. Caddy is the only long-lived process besides walkindb itself.
  2. Go's net/http server routes POST /sql to our handler.
  3. Per-IP rate limiter consumes a token (two buckets: request + new-instance). Implemented as an in-memory LRU of golang.org/x/time/rate limiters, capped at 10 K tracked IPs.
  4. Body is decoded as JSON, capped at 8 KB by http.MaxBytesReader.
  5. If X-Walkin-Session is present, the token is HMAC-verified against current + previous secrets in constant time. On failure: 404.
  6. If no token, mint a new one (UUIDv7 + 32-byte nonce + HMAC) and create a new instance directory.
  7. SQL keyword blocklist strips comments and rejects forbidden patterns. Regex matching, no parsing.
  8. Executor opens the instance's SQLite file, gets a fresh connection, applies 10 per-connection limits, runs the statement under a 2-second context deadline.
  9. Result is encoded as JSON and written to the response.
  10. Access log middleware writes one line to today's JSONL file: timestamp, IP, instance ID, method, status, SQL byte count, user-agent. Nothing else.

Ten steps. None of them allocate more than a few kilobytes. None of them open a network connection to anywhere else. The entire stack is RAM + local filesystem.
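
Steps 5 and 6 (mint a token, then verify it against the current and previous secrets) can be sketched with the standard library alone. Treat this as an illustration under stated assumptions rather than walkindb's wire format: the byte layout (id, then nonce, then MAC), HMAC-SHA-256, base64url encoding, and the names mintToken and verifyToken are all hypothetical, and a fixed 16-byte value stands in for a real UUIDv7.

```go
package main

import (
	"crypto/hmac"
	"crypto/rand"
	"crypto/sha256"
	"encoding/base64"
	"errors"
	"fmt"
)

const (
	idLen    = 16          // UUID-sized instance ID
	nonceLen = 32          // 32-byte nonce, as in the post
	macLen   = sha256.Size // HMAC-SHA-256 is an assumption
)

var errBadToken = errors.New("invalid session token")

// sign returns HMAC-SHA-256(key, payload).
func sign(key, payload []byte) []byte {
	m := hmac.New(sha256.New, key)
	m.Write(payload)
	return m.Sum(nil)
}

// mintToken builds id || nonce || HMAC(key, id || nonce), base64url-encoded.
func mintToken(key []byte, id [idLen]byte) (string, error) {
	payload := make([]byte, idLen+nonceLen, idLen+nonceLen+macLen)
	copy(payload, id[:])
	if _, err := rand.Read(payload[idLen:]); err != nil {
		return "", err
	}
	tok := append(payload, sign(key, payload)...)
	return base64.RawURLEncoding.EncodeToString(tok), nil
}

// verifyToken accepts a token signed by any of the given keys
// (current first, then previous) and returns the embedded instance ID.
func verifyToken(token string, keys ...[]byte) ([idLen]byte, error) {
	var id [idLen]byte
	raw, err := base64.RawURLEncoding.DecodeString(token)
	if err != nil || len(raw) != idLen+nonceLen+macLen {
		return id, errBadToken
	}
	payload, mac := raw[:idLen+nonceLen], raw[idLen+nonceLen:]
	ok := false
	for _, k := range keys {
		// hmac.Equal compares in constant time; every key is checked
		// instead of returning on the first match.
		if hmac.Equal(mac, sign(k, payload)) {
			ok = true
		}
	}
	if !ok {
		return id, errBadToken
	}
	copy(id[:], payload[:idLen])
	return id, nil
}

func main() {
	current, previous := []byte("current-demo-key"), []byte("previous-demo-key")
	var id [idLen]byte
	copy(id[:], "demo-instance-id") // stand-in for a UUIDv7

	// A token minted before the daily rotation is signed with the previous key.
	tok, err := mintToken(previous, id)
	if err != nil {
		fmt.Println(err)
		return
	}
	got, err := verifyToken(tok, current, previous)
	fmt.Println("id matches:", got == id, "err:", err)
}
```

Accepting the previous key alongside the current one is what lets a daily key rotation happen without invalidating sessions minted shortly before it. A handler built on this would map errBadToken to the 404 described in step 5.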

Choice #5: share-nothing security

Share-nothing is usually pitched as a performance property. At walkindb's scale it's more valuable as a security property, because it composes with the sandbox.

Consider ATTACH DATABASE '/etc/passwd' AS bad, the canonical SQLite-escape attempt. Three independent layers stop it:

  1. Application keyword blocklist. walkindb rejects the request with 400 forbidden sql keyword: ATTACH before SQLite parses the statement. This catches the attack 100 % of the time in practice, and it's cheap — one regex match after a comment strip.
  2. sqlite3_limit(LIMIT_ATTACHED, 0). Even if the blocklist is somehow bypassed, SQLite itself has been told it may attach zero databases. ATTACH returns an error at the engine level.
  3. Landlock. Even if SQLite has a CVE and honors the ATTACH despite the limit, the kernel blocks the open(2). Landlock (Linux LSM, available since 5.13) has restricted the walkindb process to paths under /var/walkindb/** plus /tmp. open("/etc/passwd") from the walkindb process returns EACCES. No amount of SQL cleverness can reach that file.

Three layers, none of which depend on the correctness of the one above them. That's the walkindb security story: we don't need any single check to be perfect, because failure is contained at the layer below it.
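
Layer 1 is simple enough to sketch. The version below strips comments and then runs one case-insensitive regex, as the post describes. Only ATTACH comes from the post; the other keywords, the exact comment patterns, and the name checkSQL are hypothetical. Note one limitation of this naive stripper: it does not understand string literals, so a quoted '/*' could make it discard text that SQLite would still execute. That kind of gap is exactly why the layers below it exist.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	// Block comments (/* ... */) and line comments (-- ...) are removed
	// before matching. These patterns ignore string literals (see lead-in).
	blockComment = regexp.MustCompile(`(?s)/\*.*?\*/`)
	lineComment  = regexp.MustCompile(`--[^\n]*`)

	// ATTACH is the keyword named in the post; the others are hypothetical.
	forbidden = regexp.MustCompile(`(?i)\b(ATTACH|DETACH|LOAD_EXTENSION)\b`)
)

// checkSQL returns an error naming the first forbidden keyword found
// outside comments, or nil if the statement passes the blocklist.
func checkSQL(sql string) error {
	stripped := blockComment.ReplaceAllString(sql, " ")
	stripped = lineComment.ReplaceAllString(stripped, " ")
	if m := forbidden.FindString(stripped); m != "" {
		return fmt.Errorf("forbidden sql keyword: %s", strings.ToUpper(m))
	}
	return nil
}

func main() {
	for _, q := range []string{
		"SELECT 1",
		"ATTACH DATABASE '/etc/passwd' AS bad",
		"select 1 -- attach later",
	} {
		fmt.Printf("%q -> %v\n", q, checkSQL(q))
	}
}
```

Replacing each comment with a space rather than an empty string matters: it prevents two fragments on either side of a comment from being glued into a keyword the regex would then have to reason about.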

Compare to a typical shared-tenant SQL service: they have ONE layer — the authorizer callback — and if it misses a case, the attacker has root on the shared database engine. walkindb's share-nothing layout means the worst case isn't "compromised engine", it's "compromised your own walk-in instance", which is already empty of anything you didn't put there.

Choice #6: seccomp as icing

The walkindb systemd unit pins a syscall allowlist via SystemCallFilter=@system-service and adds explicit denies for the dangerous groups:

SystemCallFilter=~@mount @swap @reboot @raw-io @cpu-emulation @debug @obsolete @privileged @resources

This is belt over Landlock's suspenders. Even if an attacker could somehow get arbitrary syscalls to execute inside the walkindb process, they can't call mount, reboot, swapon, ptrace, raw I/O, debug registers, or any of the other groups that would be useful for escalation. Combined with NoNewPrivileges, CapabilityBoundingSet= (empty), MemoryDenyWriteExecute, and RestrictNamespaces, the process is locked down as hard as systemd allows without a container.

The cost of adding infrastructure

Every shared service we didn't add is a service we don't have to pay for, secure, back up, fail over, or explain:

We don't have               Which saves
Postgres for metadata       $20+/mo managed or ~500 MB RAM self-hosted; 1 more thing to back up; 1 more CVE feed to watch
Redis for rate limiting     $15+/mo or ~50 MB RAM self-hosted; network hop on every request
Connection pool             ~100 MB RAM at rest; complexity around pool exhaustion
Kubernetes                  1 control plane, 1 set of manifests, 1 learning curve, 1 category of outage
Docker (for the hot path)   ~50 MB overhead per container; longer start times; cgroups that systemd already does
Background job queue        The only "background job" is the TTL sweeper, which is a goroutine
Message broker              There are no messages. It's a request/response API.
Sharding layer              One VPS. If we need to shard, we'll shard.
Service mesh                One process on one box talks to itself on localhost

Every one of those we resisted adding kept the memory footprint at ~11 MB at rest and the monthly bill at €6. We didn't optimize walkindb to be small — we refused to add anything that would make it large.

The benchmark run, in detail

Everything in the "measured" box at the top of this post came from one run of bench/main.go executed on the VPS itself, against a second walkindb instance at 127.0.0.1:9090 with rate limits tuned up (WALKINDB_RATE_LIMIT_REQS=1000000 WALKINDB_RATE_LIMIT_NEW_INSTANCES=1000000). Running the client on the same box eliminates internet latency from the numbers and isolates the Go server's own performance — this is "how fast is the binary," not "how fast is the wire." Internet latency is measured separately at the bottom of this section.

Latency (single-threaded, loopback)

Operation                             p50       p90       p99       p99.9
GET /healthz                          390 µs    776 µs    7.3 ms    17.1 ms
POST /sql SELECT 1 (reused session)   3.0 ms    10.1 ms   21.3 ms   29.7 ms
POST /sql (fresh walk-in each call)   14.3 ms   29.0 ms   47.6 ms   68.4 ms

Each POST /sql opens a fresh sql.Conn, applies all ten sqlite3_limit caps from SECURITY.md §3, runs the query under a 2 s context.WithTimeout, and tears down. That per-request setup is why the p50 is 3 ms instead of 300 µs — and why adding a connection pool could halve it if we ever need to.

Throughput (concurrent, reused session)

Concurrency   Total requests   RPS   p50        p99
1             3 814            191   3.2 ms     22.9 ms
8             12 529           626   9.7 ms     49.2 ms
32            18 690           933   28.7 ms    115.4 ms
128           18 791           934   125.3 ms   384.6 ms

Saturates at ~930 rps around c=32. c=128 gets the same rate but much worse tail latency — the CPUs are at the edge. This is the server's native ceiling; real users come through Caddy and Let's Encrypt TLS, which add a few milliseconds and don't meaningfully change the throughput story.

Concurrent walk-in capacity

Created 100 000 walk-ins back-to-back (eight concurrent creators) in 727 seconds. Every one of them succeeded. At that point the bench walkindb process had:

Resource                                            Before    After 100 000 walk-ins
walkindb process RSS (measured at 5 K→12 K, flat)   ~15 MB    ~19 MB
/var/walkindb disk usage                            ~0        1.4 GB (14 KB/walk-in)
Box free RAM (of 8 GB)                              ~7.2 GB   ~7.1 GB (page cache grew)
Box free disk (of 72 GB)                            69.4 GB   68.0 GB
Creation rate                                       -         138 / second sustained (c=8)
Failures                                            -         0

The row that matters most: walkindb's RSS was flat between 5 000 and 12 000 resident walk-ins at ~19 MB. I measured that directly before the 100 K run. The marginal cost of an additional idle walk-in is too small to show up in the RSS figure Linux reports in /proc/<pid>/status. Share-nothing pays.

Projecting forward from the 14 KB-per-walk-in disk figure, the ceiling on this box is not RAM — it's the ext4 filesystem running out of inodes or the disk filling. At 68 GB free, that's ~4.8 million walk-ins before the disk bottlenecks. Long before that we'd hit the inode ceiling or want to shard for other reasons.
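
The projection is plain division, spelled out below. The disk inputs come from the table above, read as decimal units; they give roughly 4.86 million, which the post quotes as ~4.8 million. The inode inputs are assumptions, not measurements: one inode per 16 KiB is a common mke2fs default, and three inodes per walk-in counts only the directory, db.sqlite, and meta.json. Check the real figure on your own box with df -i.

```go
package main

import "fmt"

// ceiling returns how many walk-ins fit in a budget when each walk-in
// consumes perWalkin units (bytes or inodes).
func ceiling(budget, perWalkin int64) int64 {
	if perWalkin <= 0 {
		return 0
	}
	return budget / perWalkin
}

func main() {
	const (
		freeDiskBytes  = 68_000_000_000 // 68 GB free after the run (decimal GB assumed)
		bytesPerWalkin = 14_000         // ~14 KB per walk-in

		// Assumptions, not measurements: a 72 GB ext4 volume created with
		// one inode per 16 KiB, and 3 inodes per walk-in (directory,
		// db.sqlite, meta.json; WAL sidecars would add more).
		totalInodes     = 72_000_000_000 / 16_384
		inodesPerWalkin = 3
	)
	fmt.Println("disk ceiling: ", ceiling(freeDiskBytes, bytesPerWalkin))
	fmt.Println("inode ceiling:", ceiling(totalInodes, inodesPerWalkin))
}
```

Under those assumed inode numbers the inode ceiling comes out well below the disk ceiling, which is consistent with the expectation above that inodes run out first.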

Internet latency (end-to-end over HTTPS)

The server is in Gravelines (GRA). From a laptop in Portugal via the public internet through Cloudflare DNS + Caddy + Let's Encrypt TLS:

Percentile   Latency
min          148 ms
p50          238 ms
p99          16.7 s (cold TLS handshake outlier; 99.9% of samples under 500 ms)

Most of the 148–238 ms is physics + TLS setup, not walkindb. A client in Europe closer to Gravelines would see ~40–80 ms. A client with a reused TLS session would drop another 50 ms off that.

When this stops working

The honest failure modes of this architecture:

  • One VPS is one failure domain. If the OVH box loses power, walkindb is down. At 10-minute TTLs this is actually cheap — there's no catastrophic data loss possible, because there's no durable data. But availability is one-nine until we add a standby.
  • Vertical scaling has limits. An OVH VPS Value has 2 vCPUs and 8 GB RAM. 100 000 idle walk-ins fit comfortably (disk is the ceiling, not RAM — we proved it). But 100 000 walk-ins all running queries at once would saturate both cores. The answer then is to add more boxes with a consistent-hash shard on instance ID — not to rewrite the architecture.
  • Network egress is a hidden cost. Cloudflare is DNS-only so there's no free egress bandwidth; all response bytes leave via OVH. At ~1 KB responses, that's fine for millions of requests.
  • The filesystem is the database of instances — including the bugs. If we ever hit a filesystem corruption, we have no database-level repair tools. Mitigation: the per-instance files are small (≤10 MB) and short-lived, so the blast radius is one walk-in, not the whole service.

What this enables

The share-nothing, file-per-instance design is what lets walkindb exist as a free service. If every walk-in cost us a connection pool slot and a row in a metadata table, we could not afford to give them away without a credit card. Because every walk-in costs us a directory and a 100-byte JSON file, both cleaned up automatically, the marginal cost of an additional walk-in is approximately nothing.

That's the whole product thesis: agents can't sign up for things, so walkindb doesn't ask. And the only way to make "don't ask" economically sustainable is to make each walk-in cheap enough that the operator genuinely doesn't care whether you use one.

The architecture is the business model.
