Surface · Observability

The instruments that listen.

The private telemetry plane watches health, history, and alert posture. The public page should prove aliveness without publishing dashboard coordinates, query patterns, or the runtime map.

Reading time · 10 min
Domain · Observability
Status · Live
Sections
  1. Why observability matters before it doesn’t
  2. The telemetry roles
  3. Metrics: cardinality is the budget
  4. Logs: cheap to write, expensive to keep
  5. Alerts: the page that doesn’t come
  6. Dashboards: pictures that load fast
  7. Takeaways

Why observability matters before it doesn’t

A self-hosted system is quiet until it is not. When something does happen, the maintainer needs to know whether the issue is compute, storage, network, queue, DNS, or application behavior. Observability buys those minutes back.

The frame I use: every running process owes you three things — what it’s doing right now, what it just did, and whether it’s about to fall over. Metrics answer the first. Logs answer the second. Alerts answer the third. Skip any of them and you’re debugging in the dark next time prod hiccups at 2am.

You don’t build dashboards for the day everything works. You build them for the morning you wake up to seven Slack pings and one cup of coffee. — The only correct reason to keep dashboards running

The telemetry roles

The pieces are boring on purpose:

PUBLIC VIEW
signal        state       exposure
metrics       sampled     source labels hidden
logs          retained    query map hidden
dashboards    gated       SSO required
alerts        active      routing hidden

Metrics: cardinality is the budget

The first thing you learn running metrics at home is that every distinct combination of label values becomes its own series in an index you cannot casually erase. A request counter with unbounded identity labels, such as a user or request ID, looks innocent until the index becomes the outage.

The rule I keep: label values come from bounded sets. Anything that can grow without limit gets dropped before it becomes history. Cardinality discipline is the difference between a telemetry system that holds useful history and one that crashes at noon every Tuesday.
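A minimal sketch of that rule, assuming a hypothetical allowlist of label keys and values (the names `ALLOWED_LABELS`, `bound_labels`, and `max_series` are illustrative, not part of any real stack). It drops unlisted keys, folds unexpected values into "other", and shows why the series count then has a fixed ceiling.

```python
from math import prod

# Hypothetical allowlist: each permitted label key maps to its bounded value set.
ALLOWED_LABELS = {
    "method": {"GET", "POST", "PUT", "DELETE"},
    "status_class": {"2xx", "3xx", "4xx", "5xx"},
}


def bound_labels(raw: dict) -> dict:
    """Keep only allowlisted keys; fold missing or unexpected values into 'other'."""
    bounded = {}
    for key, allowed_values in ALLOWED_LABELS.items():
        value = raw.get(key, "other")
        bounded[key] = value if value in allowed_values else "other"
    return bounded


def max_series() -> int:
    """Ceiling on distinct series: product of each value set size plus one for 'other'."""
    return prod(len(values) + 1 for values in ALLOWED_LABELS.values())


# An unbounded identity label (user_id) never reaches the index.
labels = bound_labels({"method": "GET", "status_class": "2xx", "user_id": "8f3a"})
```

With two keys of four values each, the ceiling is 5 × 5 = 25 series, no matter how many users or requests pass through.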

Detail
The public page should never need live query strings. Query recipes belong in the private runbook, not in source that anyone can scrape.

Logs: cheap to write, expensive to keep

The log layer indexes labels, not every byte of content. That trade keeps the system fast when the operator already knows the class of problem being investigated.
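To illustrate the trade, here is a toy in-memory sketch, not a real log store: the class `LabelIndexedLog` and its methods are hypothetical. Only the label set is indexed; line content is scanned, and only inside the streams the labels already selected.

```python
from collections import defaultdict


class LabelIndexedLog:
    """Toy log store: the index is the label set, never the line content."""

    def __init__(self):
        # frozenset of (key, value) label pairs -> lines in arrival order
        self._streams = defaultdict(list)

    def append(self, labels: dict, line: str) -> None:
        self._streams[frozenset(labels.items())].append(line)

    def query(self, selector: dict, contains: str = "") -> list:
        """Pick streams by label first, then scan only those streams for text."""
        wanted = set(selector.items())
        hits = []
        for stream_labels, lines in self._streams.items():
            if wanted <= stream_labels:  # cheap: index lookup on labels
                hits.extend(l for l in lines if contains in l)  # costly: content scan
        return hits


log = LabelIndexedLog()
log.append({"app": "queue", "level": "error"}, "consumer timeout after 30s")
log.append({"app": "queue", "level": "info"}, "consumer started")
log.append({"app": "dns", "level": "error"}, "upstream timeout")
queue_timeouts = log.query({"app": "queue"}, contains="timeout")
```

The query stays fast when the selector is narrow, which is the case where the operator already knows the class of problem.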

Retention is the cost center. The useful public claim is that history exists long enough to answer “what happened during the outage last weekend” without publishing storage volumes, retention buckets, or raw log structure.

Alerts: the page that doesn’t come

The hardest part of alerting at home is calibrating the noise floor. Too sensitive, and the phone vibrates every time a pod restarts during a normal rollout. Too lax, and you find out about a dead disk three days after the fact, by which point the replicas have all rebuilt onto the wrong nodes.
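One common way to raise the noise floor is to page only when a condition has held continuously for some minimum time. The sketch below assumes that approach; the function name `should_page`, the sample format, and the thresholds are all illustrative.

```python
def should_page(samples, threshold, hold_for):
    """Return True if the value has stayed above threshold for at least
    hold_for seconds, measured up to the most recent sample.

    samples: list of (timestamp_seconds, value) tuples in time order.
    """
    breach_start = None
    for ts, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = ts  # breach begins here
        else:
            breach_start = None  # recovery resets the clock
    if breach_start is None:
        return False
    return samples[-1][0] - breach_start >= hold_for


# A pod restart during a rollout: one bad sample, then recovery.
rollout_blip = [(0, 0), (30, 1), (60, 0), (90, 0)]
# A dead disk: the error signal never clears.
dead_disk = [(0, 1), (60, 1), (120, 1), (180, 1), (240, 1), (300, 1)]
```

With a five-minute hold, the rollout blip stays silent and the dead disk pages once the breach has lasted 300 seconds.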

The rules I land on, after a year of tuning:

A real test of an alerting setup: how many pages did you get this week, and how many of them actually mattered? In a healthy setup the ratio is one-to-one. In an unhealthy one it’s ten-to-one and you stop reading them.
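The weekly check can be reduced to a single number. This is a small sketch under the assumption that each page is tagged after the fact as having mattered or not; `page_ratio` is a hypothetical helper.

```python
def page_ratio(pages):
    """Pages received per page that mattered.

    pages: list of booleans, True when the page required action.
    Returns None when no page mattered, since the ratio is undefined.
    """
    actionable = sum(pages)
    if actionable == 0:
        return None
    return len(pages) / actionable


healthy_week = [True, True, True]
noisy_week = [True] + [False] * 9
```

A result of 1.0 is the healthy one-to-one case; 10.0 is the week where you stop reading them. A week with no pages at all returns None rather than a ratio.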

Dashboards: pictures that load fast

Most dashboard setups die from sprawl. You build one for every service and every panicked midnight investigation, then the home screen becomes unusable. The fix is brutal: prune. The operator home view should answer health, flow, activity, and capacity first; everything else lives deeper.

The other discipline is keeping queries fast. A panel that takes six seconds to render is a panel nobody reads. Use recording rules to pre-aggregate the expensive stuff, and refuse to put a histogram quantile on the main view without one.
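The idea behind pre-aggregation can be sketched without any query language: compute the expensive quantile once per window in the background, and let the panel read the stored number. The names `quantile` and `RecordedSeries` are hypothetical, and the nearest-rank quantile over raw samples is a simplification chosen for the example.

```python
import math


def quantile(values, q):
    """Nearest-rank quantile over raw samples (the expensive part: a full sort)."""
    if not values:
        raise ValueError("no samples in window")
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


class RecordedSeries:
    """Stores one precomputed quantile per time window."""

    def __init__(self, q):
        self.q = q
        self.points = {}  # window start (seconds) -> precomputed value

    def record(self, window_start, raw_latencies_ms):
        # Runs on a schedule, away from the dashboard request path.
        self.points[window_start] = quantile(raw_latencies_ms, self.q)

    def read(self, window_start):
        # What the panel does at render time: a dictionary lookup, no sort.
        return self.points[window_start]


median_latency = RecordedSeries(q=0.5)
median_latency.record(0, [120, 80, 400, 95])
panel_value = median_latency.read(0)
```

The panel's render cost no longer depends on how many raw samples fell in the window, which is the property that keeps the main view fast.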

Signal posture     Live
Query map          Hidden
Log retention      30 d
Pages this week    0

Takeaways

The point of observability isn’t a wall of dashboards. It’s being able to answer two questions in under a minute: is anything broken right now, and was anything broken when this user said it was. If your stack can’t answer those, it’s decoration.

The cost in disk, RAM, and operator attention is real but small. The value is knowing immediately when a service is lying to you, and knowing it without turning the public site into an operator map.
