- Why observability matters before it doesn’t
- The telemetry roles
- Metrics: cardinality is the budget
- Logs: cheap to write, expensive to keep
- Alerts: the page that doesn’t come
- Dashboards: pictures that load fast
- Takeaways
Why observability matters before it doesn’t
A self-hosted system is quiet until it is not. When something does break, the maintainer needs to know whether the issue is compute, storage, network, queue, DNS, or application behavior. Narrowing that down by hand costs minutes in the middle of an outage; observability buys those minutes back.
The frame I use: every running process owes you three things — what it’s doing right now, what it just did, and whether it’s about to fall over. Metrics answer the first. Logs answer the second. Alerts answer the third. Skip any of them and you’re debugging in the dark next time prod hiccups at 2am.
“You don’t build dashboards for the day everything works. You build them for the morning you wake up to seven Slack pings and one cup of coffee.”
(The only correct reason to keep dashboards running)
The telemetry roles
The pieces are boring on purpose:
- Metrics answer what is happening right now.
- Logs answer what just happened and why.
- Dashboards turn the private data sources into fast operator decisions.
- Alerts decide what deserves a page, an email, or silence.
Public view:
- metrics: sampled, source labels hidden
- logs: retained, query map hidden
- dashboards: gated, SSO required
- alerts: active, routing hidden
Metrics: cardinality is the budget
The first thing you learn running metrics at home is that every distinct combination of label values is its own time series: a row in a table you cannot casually erase. A request counter with unbounded identity labels looks innocent until the index becomes the outage.
The rule I keep: every label draws its values from a bounded set. Anything that can grow without limit gets dropped before it becomes history. Cardinality discipline is the difference between a telemetry system that holds useful history and one that crashes itself at noon every Tuesday.
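One way to enforce that rule is to sanitize labels before they ever reach the metrics store. What follows is a minimal sketch, not the API of any particular metrics library: the allowlist `ALLOWED_LABELS`, the overflow value `OTHER`, and the helpers `sanitize_labels` and `max_series` are all hypothetical names chosen for illustration.

```python
# Hypothetical allowlist: each permitted label key maps to its bounded set of values.
ALLOWED_LABELS = {
    "method": {"GET", "POST", "PUT", "DELETE"},
    "status_class": {"2xx", "3xx", "4xx", "5xx"},
}

# Single overflow value, so unknown values add at most one extra series per key.
OTHER = "other"


def sanitize_labels(raw: dict) -> dict:
    """Drop keys that are not allowlisted and collapse unknown values into OTHER."""
    clean = {}
    for key, allowed_values in ALLOWED_LABELS.items():
        if key not in raw:
            continue
        value = raw[key]
        clean[key] = value if value in allowed_values else OTHER
    return clean


def max_series() -> int:
    """Upper bound on series per metric.

    Each key is either one of its allowed values, OTHER, or absent,
    so it contributes len(allowed) + 2 possibilities.
    """
    bound = 1
    for allowed_values in ALLOWED_LABELS.values():
        bound *= len(allowed_values) + 2
    return bound
```

With this in place, a request labeled with a user ID or a raw path loses that label before it is recorded, and the worst-case series count per metric is a number you can compute ahead of time rather than discover during an outage.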
Logs: cheap to write, expensive to keep
The log layer indexes labels, not every byte of content. That trade keeps the system fast when the operator already knows the class of problem being investigated.
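The trade can be illustrated with a toy store that indexes only the label set and scans line content on demand. This is a sketch under stated assumptions: `LabelIndexedLogs` and its methods are hypothetical and do not represent any real log system's API.

```python
from collections import defaultdict


class LabelIndexedLogs:
    """Toy log store: streams are keyed by label set, line content is never indexed."""

    def __init__(self):
        # Maps a frozen set of (key, value) label pairs to the raw lines of that stream.
        self.streams = defaultdict(list)

    def append(self, labels: dict, line: str) -> None:
        self.streams[frozenset(labels.items())].append(line)

    def query(self, selector: dict, contains: str = "") -> list:
        """Narrow by label first (one check per stream), then scan lines by substring."""
        wanted = set(selector.items())
        hits = []
        for stream_labels, lines in self.streams.items():
            if wanted <= stream_labels:  # every selector pair must be present
                hits.extend(line for line in lines if contains in line)
        return hits
```

A query such as `logs.query({"app": "mail"}, contains="queue")` only scans the lines of streams labeled `app=mail`. That is cheap when the operator already knows which class of problem to look at, and slow when the only clue is a string buried somewhere in the content.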
Retention is the cost center. The useful public claim is that history exists long enough to answer “what happened during the outage last weekend” without publishing storage volumes, retention buckets, or raw log structure.
Alerts: the page that doesn’t come
The hardest part of alerting at home is calibrating the noise floor. Too sensitive, and the phone vibrates every time a pod restarts during a normal rollout. Too lax, and you find out about a dead disk three days after the fact, by which point the replicas have all rebuilt onto the wrong nodes.
The rules I land on, after a year of tuning:
- Critical — page me anywhere. Node down for >5 minutes, persistent volume read-only, certificate expiring in <48 hours, mail queue backed up >1000 messages.
- Warning — send email. CPU sustained over 80% for 30 minutes, memory pressure on any node, ingress 5xx rate over 1%.
- Info — log only. Pod restarts within thresholds, scheduled-job duration drift, slow query log threshold.
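The tiers above can be sketched as data plus a small evaluator. This is an illustration only: the `Rule` class, the metric names, and the `evaluate` helper are hypothetical, and only three of the "greater than" rules from the list are encoded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    name: str
    metric: str       # hypothetical metric name
    threshold: float  # the rule fires when the value is strictly above this
    severity: str     # "critical", "warning", or "info"


# Where each severity goes, following the tiers above.
ROUTES = {"critical": "page", "warning": "email", "info": "log"}

# A few of the rules above expressed as data. Metric names are illustrative.
RULES = [
    Rule("node down", "node_down_minutes", 5, "critical"),
    Rule("mail queue backlog", "mail_queue_messages", 1000, "critical"),
    Rule("ingress 5xx rate", "ingress_5xx_percent", 1, "warning"),
]


def evaluate(sample: dict) -> list:
    """Return a (route, rule name) pair for every rule whose threshold is exceeded."""
    fired = []
    for rule in RULES:
        value = sample.get(rule.metric)
        if value is not None and value > rule.threshold:
            fired.append((ROUTES[rule.severity], rule.name))
    return fired
```

Keeping rules as data makes tuning a one-line change, and it keeps the routing decision (page, email, or log) separate from the condition that triggers it.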
A real test of an alerting setup: how many pages did you get this week, and how many of them actually mattered? In a healthy setup the ratio is one-to-one. In an unhealthy one it’s ten-to-one and you stop reading them.
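That test is simple enough to compute from a week of pages. A minimal sketch, assuming each page has been hand-marked as having mattered or not; the function name and the handling of the empty cases are my own choices:

```python
def pages_per_actionable(pages: list) -> float:
    """Ratio of total pages to pages that mattered; 1.0 is the healthy target.

    `pages` holds one boolean per page: True if it needed action.
    """
    actionable = sum(1 for mattered in pages if mattered)
    if actionable == 0:
        # No pages at all: nothing to judge, treat as healthy.
        # Pages but none mattered: the ratio is unbounded.
        return float("inf") if pages else 1.0
    return len(pages) / actionable
```

One page that mattered out of ten gives a ratio of ten, which is the unhealthy case described above.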
Dashboards: pictures that load fast
Most dashboard setups die from sprawl. You build one for every service and every panicked midnight investigation, then the home screen becomes unusable. The fix is brutal: prune. The operator home view should answer health, flow, activity, and capacity first; everything else lives deeper.
The other discipline is keeping queries fast. A panel that takes six seconds to render is a panel nobody reads. Use recording rules to pre-aggregate the expensive stuff, and refuse to put a histogram quantile on the main view without one.
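The idea behind pre-aggregation can be sketched in a few lines: sum the per-instance histogram buckets once, on a schedule, and let the panel estimate its quantile from that single small series. The code below is an illustration of the general technique, not the query language or API of any monitoring system; `pre_aggregate` and `quantile` are hypothetical helpers, and the quantile uses plain linear interpolation within a bucket, with the first bucket assumed to start at zero.

```python
import math
from collections import Counter


def pre_aggregate(per_instance: list) -> dict:
    """Sum cumulative bucket counts across instances into one recorded series."""
    total = Counter()
    for buckets in per_instance:
        total.update(buckets)  # adds counts bucket by bucket
    return dict(total)


def quantile(q: float, buckets: dict) -> float:
    """Estimate the q-quantile from cumulative buckets {upper_bound: count}."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    bounds = sorted(buckets)
    total = buckets[bounds[-1]]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper in bounds:
        count = buckets[upper]
        if count > lower_count and count >= rank:
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper - lower_bound) * fraction
        lower_bound, lower_count = upper, count
    return float(bounds[-1])  # defensive fallback, not reached for valid input
```

The expensive part, `pre_aggregate`, runs once per interval no matter how many people open the dashboard. The panel itself only calls `quantile` on a handful of buckets, which is the kind of work that renders well under a second.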
Takeaways
The point of observability isn’t a wall of dashboards. It’s being able to answer two questions in under a minute: is anything broken right now, and was anything broken when this user said it was. If your stack can’t answer those, it’s decoration.
The cost in disk, RAM, and operator attention is real but small. The value is knowing, immediately, when a service is lying to you, without turning the public site into an operator map.