Storage — System AI-X

Sections

Why distributed storage matters
What the storage plane does
Replication: more copies, fewer 3am pages
Snapshots and restore
Off-site backups and the 3-2-1 rule
The trade-offs nobody tells you about
Takeaways

Why distributed storage matters

Most home-lab tutorials skip the storage layer. They mount single-host volumes, declare victory, and quietly wait for the day a disk dies and a database goes with it. That works fine until it doesn’t. The first time you lose six months of social posts to a single sector error, you stop being cavalier about persistence.

The storage layer solves the problem the way durable platforms solve it: replicate writes, fence failed owners, rebuild after disk death, and keep restore paths boring. The exact node map is not public information; the reliability discipline is.

A disk will die. The question is whether your data dies with it. — First law of self-hosting

What the storage plane does

The private storage controller presents durable volumes to applications while it handles replica placement, repair, snapshots, and restore orchestration. The application does not know its bytes are being replicated; it just sees reliable disk.

Under the hood there are three layers:

Engine: a per-volume controller that coordinates writes to replicas.
Replicas: sparse-file-backed copies of the volume. The default replica count is three, and a write is acknowledged only after at least two replicas have it durably.
Manager: the coordination brain. It decides when to rebuild, when to garbage-collect, and how to handle maintenance events.
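
The acknowledgement rule described for replicas can be sketched as a toy model. This is not the controller's code: `Replica`, `Engine`, and `min_acks` are hypothetical names, the "disk" is a dict, and the threshold of two is taken from the description above.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Hypothetical in-memory stand-in for one copy of a volume."""
    name: str
    healthy: bool = True
    blocks: dict = field(default_factory=dict)

    def write(self, offset: int, data: bytes) -> bool:
        # A real replica would persist to disk; a dict stands in for that here.
        if not self.healthy:
            return False
        self.blocks[offset] = data
        return True


class Engine:
    """Toy per-volume engine: fan a write out, ack once enough replicas have it."""

    def __init__(self, replicas, min_acks=2):
        self.replicas = replicas
        self.min_acks = min_acks  # assumption: "at least two" from the text

    def write(self, offset: int, data: bytes) -> bool:
        # Count the replicas that accepted the write. This toy model does not
        # roll back partial writes when the threshold is missed.
        acks = sum(1 for r in self.replicas if r.write(offset, data))
        return acks >= self.min_acks


replicas = [Replica("a"), Replica("b"), Replica("c")]
engine = Engine(replicas)

ok_all_healthy = engine.write(0, b"hello")
replicas[2].healthy = False   # one node down
ok_one_down = engine.write(1, b"world")
replicas[1].healthy = False   # two nodes down
ok_two_down = engine.write(2, b"!")

print(ok_all_healthy, ok_one_down, ok_two_down)  # True True False
```

With one replica down the write is still acknowledged. With two down it is refused, because only one copy would hold the data.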

PUBLIC VIEW
replication     healthy     node map hidden
snapshots       scheduled   volume names hidden
restore path    tested      route hidden
backups         encrypted   target hidden

Replication: more copies, fewer 3am pages

With three replicas, all three nodes holding a copy have to fail at the same time before data is lost, and even then you lose only what was written since your last backup. In practice, a single node going down (planned maintenance, kernel panic, power blip) is invisible to applications. The engine fails over to a healthy replica, the volume reattaches in seconds, and the workload keeps running.

The cost is real: 3x storage amplification. A 100 GiB volume consumes 300 GiB across the private storage pool. For databases and primary data, that’s the right answer. For cold caches, replica count can be lowered consciously to claw back disk.
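
The amplification arithmetic is easy to script when planning capacity. The sketch below is illustrative only: `raw_usage_gib` is a made-up helper, and the `cache` volume with a single replica is an invented example, not a real volume from this setup.

```python
def raw_usage_gib(volume_gib: float, replicas: int = 3) -> float:
    """Pool space consumed by one volume at a given replica count."""
    if replicas < 1:
        raise ValueError("a volume needs at least one replica")
    return volume_gib * replicas


# Hypothetical inventory: (size in GiB, replica count).
volumes = {
    "db": (100, 3),     # primary data keeps the default of three
    "cache": (50, 1),   # cold cache, replica count lowered on purpose
}

total = sum(raw_usage_gib(size, count) for size, count in volumes.values())

print(raw_usage_gib(100))  # 300
print(total)               # 350
```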

Replication is not backup. Replication protects against hardware failure. It does not protect against an operator running DROP TABLE on the wrong shell. For that you need snapshots and off-site copies. Keep replication and backup in your head as two different problems.

Snapshots and restore

Snapshots are copy-on-write, which makes them cheap and fast. Every important volume has a schedule attached inside the trust boundary:

schedule: "0 */6 * * *"      # every 6 hours
retain:   28                  # keep last 28
task:     snapshot
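
It helps to work out how much history that schedule actually keeps. A minimal sketch, with the interval and retention count copied by hand from the config above rather than parsed from it:

```python
from datetime import timedelta

# Values read off the schedule above; the variable names are illustrative.
interval = timedelta(hours=6)   # "0 */6 * * *" fires at 00:00, 06:00, 12:00, 18:00
retain = 28                     # retain: 28

snapshots_per_day = timedelta(days=1) // interval
window = interval * retain

print(f"{snapshots_per_day} snapshots/day, about {window.days} days of history")
# 4 snapshots/day, about 7 days of history
```

So a mistake has to be noticed within roughly a week to be recoverable from snapshots alone. Anything older needs the off-site backups.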

Restoring a snapshot takes two steps, whether through the API or the UI: pick the snapshot, then revert, and the volume rolls back. Even in a real disaster the procedure stays that short. The application has to be detached during the restore, which usually means a short pod restart. Total recovery time for a snapshot revert: under two minutes.

Off-site backups and the 3-2-1 rule

The classic backup rule: three copies, on two different media, with one off-site. The private storage layer handles the replicated copies automatically and ships encrypted off-site backups without publishing the target route.
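
The rule is simple enough to check mechanically. The sketch below uses a hypothetical inventory: the `Copy` type, the labels, and the media names are invented for illustration and say nothing about the real targets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    label: str
    medium: str     # e.g. "local-ssd", "object-storage" (illustrative names)
    offsite: bool


def satisfies_3_2_1(copies) -> bool:
    """Three copies, on at least two media, with at least one off-site."""
    return (
        len(copies) >= 3
        and len({c.medium for c in copies}) >= 2
        and any(c.offsite for c in copies)
    )


replicas_only = [
    Copy("replica-1", "local-ssd", False),
    Copy("replica-2", "local-ssd", False),
    Copy("replica-3", "local-ssd", False),
]
with_backup = replicas_only + [Copy("encrypted-backup", "object-storage", True)]

print(satisfies_3_2_1(replicas_only))  # False
print(satisfies_3_2_1(with_backup))    # True
```

Three replicas on the same kind of disk in the same place fail the check on their own. The off-site backup is what adds the second medium and the off-site copy.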

Encryption is client-side: remote storage sees ciphertext only. The restore key lives outside the runtime and gets re-injected only during a recovery drill.

capacity map        redacted
volume list         private
default replicas    3
no data loss        15 mo

The trade-offs nobody tells you about

Two things you find out only after running this in production:

Latency is real. Every write hits more than one replica before the engine acknowledges it. For databases, that is usually fine. For high-churn workloads, you trade durability for throughput consciously instead of pretending storage is free.

Engine restarts are real restarts. When the storage controller upgrades, active volumes can briefly detach and reattach. Test upgrades in a window where you can babysit.
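
The latency point can be made concrete. Assume the engine sends a write to all replicas in parallel and acknowledges after `min_acks` of them confirm; the write then waits for the `min_acks`-th fastest replica. The function name and the millisecond figures below are hypothetical, not measurements from this system.

```python
def ack_latency_ms(replica_latencies_ms, min_acks=2):
    """Latency of a parallel write that needs `min_acks` confirmations."""
    if not 1 <= min_acks <= len(replica_latencies_ms):
        raise ValueError("min_acks must be between 1 and the replica count")
    # The write completes when the min_acks-th fastest replica answers.
    return sorted(replica_latencies_ms)[min_acks - 1]


latencies = [6.0, 0.4, 1.1]  # made-up per-replica write latencies in ms

print(ack_latency_ms(latencies, min_acks=1))  # 0.4
print(ack_latency_ms(latencies, min_acks=2))  # 1.1
print(ack_latency_ms(latencies, min_acks=3))  # 6.0
```

Each extra confirmation you require moves the write one step closer to the slowest replica's pace.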

Storage is the layer where you find out whether you were lying to yourself about reliability. — After the second power outage

Takeaways

If you’re going to run anything stateful — a database, a wiki, a finance ledger, an inbox — you owe yourself a real storage layer. The single-disk shortcut isn’t a shortcut; it’s a deferred bill.

This is the most boring component of the stack and the most important. Every essay on this site — mail, social, finance — relies on it being correct. So far, fourteen months in, it has been.