Observability — System AI-X

Sections

Why observability matters before it doesn’t
The telemetry roles
Metrics: cardinality is the budget
Logs: cheap to write, expensive to keep
Alerts: the page that doesn’t come
Dashboards: pictures that load fast
Takeaways

Why observability matters before it doesn’t

A self-hosted system is quiet until it is not. When something does break, the maintainer needs to know whether the issue is compute, storage, network, queue, DNS, or application behavior. Without telemetry, the first minutes of an incident go to guessing which one it is; observability buys those minutes back.

The frame I use: every running process owes you three things — what it’s doing right now, what it just did, and whether it’s about to fall over. Metrics answer the first. Logs answer the second. Alerts answer the third. Skip any of them and you’re debugging in the dark next time prod hiccups at 2am.

“You don’t build dashboards for the day everything works. You build them for the morning you wake up to seven Slack pings and one cup of coffee.”
The only correct reason to keep dashboards running

The telemetry roles

The pieces are boring on purpose:

Metrics answer what is happening right now.
Logs answer what just happened and why.
Dashboards turn the private data sources into fast operator decisions.
Alerts decide what deserves a page, an email, or silence.

PUBLIC VIEW
metrics       sampled     source labels hidden
logs          retained    query map hidden
dashboards    gated       SSO required
alerts        active      routing hidden

Metrics: cardinality is the budget

The first thing you learn running metrics at home is that every distinct combination of label values is its own time series, a row in a table you cannot casually erase. A request counter with unbounded identity labels (user IDs, session tokens, full request paths) looks innocent until the index becomes the outage.

The rule I keep: every label takes its values from a bounded set. Anything that can grow without limit gets dropped before it becomes history. Cardinality discipline is the difference between a telemetry system that holds useful history and one that crashes itself at noon every Tuesday.
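A minimal sketch of that rule in Python, assuming a hand-maintained allowlist. The label names, the allowed values, and the "other" fallback are hypothetical and chosen only for illustration; a real setup would apply the same idea wherever labels are attached or relabeled.

```python
from math import prod

# Hypothetical allowlist: label name -> bounded set of permitted values.
ALLOWED_LABELS = {
    "service": {"web", "mail", "dns"},
    "status_class": {"2xx", "3xx", "4xx", "5xx"},
}


def sanitize_labels(labels: dict[str, str]) -> dict[str, str]:
    """Drop labels that are not allowlisted; fold unknown values into 'other'."""
    clean = {}
    for name, allowed in ALLOWED_LABELS.items():
        if name in labels:
            value = labels[name]
            clean[name] = value if value in allowed else "other"
    return clean


def max_series(allowed: dict[str, set[str]]) -> int:
    """Upper bound on series for one metric, assuming every label is always set.

    Each label contributes its allowed values plus the 'other' bucket.
    """
    return prod(len(values) + 1 for values in allowed.values())


raw = {"service": "web", "status_class": "5xx", "user_id": "8f3a91", "path": "/inbox/4411"}
print(sanitize_labels(raw))        # user_id and path are dropped
print(max_series(ALLOWED_LABELS))  # (3 + 1) * (4 + 1) = 20
```

The useful property is that the worst case is known before anything is written: the series count is a product of small numbers, not a function of traffic.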

Detail

The public page should never need live query strings. Query recipes belong in the private runbook, not in source that anyone can scrape.

Logs: cheap to write, expensive to keep

The log layer indexes labels, not every byte of content. That trade keeps the system fast when the operator already knows the class of problem being investigated.
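To make the trade concrete, here is a toy in-memory model, not any particular log system's implementation. The class name, label names, and messages are hypothetical. Labels go into an index; message bodies are only scanned after the labels have narrowed the candidates.

```python
from collections import defaultdict


class LabelIndexedLogs:
    """Toy log store: labels are indexed, message bodies are only scanned."""

    def __init__(self) -> None:
        self._lines: list[str] = []
        # (label name, label value) -> positions of matching lines
        self._index: dict[tuple[str, str], list[int]] = defaultdict(list)

    def append(self, labels: dict[str, str], message: str) -> None:
        position = len(self._lines)
        self._lines.append(message)
        for pair in labels.items():
            self._index[pair].append(position)

    def query(self, labels: dict[str, str], contains: str = "") -> list[str]:
        # Narrow by index first: intersect the postings of every requested label.
        postings = [set(self._index.get(pair, ())) for pair in labels.items()]
        if postings:
            candidates = set.intersection(*postings)
        else:
            candidates = set(range(len(self._lines)))  # no labels: full scan
        # Only the narrowed candidates pay for a content scan.
        return [self._lines[i] for i in sorted(candidates) if contains in self._lines[i]]


store = LabelIndexedLogs()
store.append({"app": "mail", "level": "error"}, "queue stalled")
store.append({"app": "mail", "level": "info"}, "delivered")
store.append({"app": "web", "level": "error"}, "upstream timeout")
print(store.query({"app": "mail", "level": "error"}))
```

A query with no labels falls back to scanning everything, which is exactly the slow path the operator avoids by knowing the class of problem first.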

Retention is the cost center. The useful public claim is that history exists long enough to answer “what happened during the outage last weekend” without publishing storage volumes, retention buckets, or raw log structure.

Alerts: the page that doesn’t come

The hardest part of alerting at home is calibrating the noise floor. Too sensitive, and the phone vibrates every time a pod restarts during a normal rollout. Too lax, and you find out about a dead disk three days after the fact, by which point the replicas have all rebuilt onto the wrong nodes.

The rules I land on, after a year of tuning:

Critical — page me anywhere. Node down for >5 minutes, persistent volume read-only, certificate expiring in <48 hours, mail queue backed up >1000 messages.
Warning — send email. CPU sustained over 80% for 30 minutes, memory pressure on any node, ingress 5xx rate over 1%.
Info — log only. Pod restarts within thresholds, scheduled-job duration drift, slow query log threshold.
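The tiers above can be sketched as a small routing table in Python. This is an illustration under assumptions, not the actual alerting configuration: the snapshot field names and rule names are hypothetical, and only a few of the listed thresholds are encoded.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Rule:
    name: str
    severity: str  # "critical", "warning", or "info"
    fires: Callable[[dict], bool]


# Severity decides the destination: page, email, or log only.
ROUTES = {"critical": "page", "warning": "email", "info": "log"}

# Thresholds taken from the list above; snapshot keys are made up for the sketch.
RULES = [
    Rule("node_down", "critical", lambda s: s.get("node_down_minutes", 0) > 5),
    Rule("cert_expiring", "critical", lambda s: s.get("cert_hours_left", float("inf")) < 48),
    Rule("mail_queue", "critical", lambda s: s.get("mail_queue_depth", 0) > 1000),
    Rule("cpu_sustained", "warning", lambda s: s.get("cpu_over_80_minutes", 0) >= 30),
    Rule("ingress_5xx", "warning", lambda s: s.get("ingress_5xx_ratio", 0.0) > 0.01),
]


def evaluate(snapshot: dict) -> list[tuple[str, str]]:
    """Return (rule name, destination) for every rule that fires."""
    return [(rule.name, ROUTES[rule.severity]) for rule in RULES if rule.fires(snapshot)]


print(evaluate({"node_down_minutes": 7, "ingress_5xx_ratio": 0.02}))
```

Keeping the severity-to-destination mapping in one table means retuning the noise floor is a one-line change per rule rather than a hunt through notification settings.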

A real test of an alerting setup: how many pages did you get this week, and how many of them actually mattered? In a healthy setup the ratio of pages received to pages that mattered is close to one-to-one. In an unhealthy one it’s ten-to-one and you stop reading them.
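That weekly check is easy to compute if each page is marked afterwards as actionable or not. A minimal sketch; the function name and the boolean marking are assumptions for illustration.

```python
def pages_per_actionable(pages: list[bool]) -> float:
    """Pages received per page that mattered.

    Each entry is True if that page was actionable. Returns 0.0 for a week
    with no pages, and infinity if pages arrived but none mattered.
    """
    actionable = sum(pages)
    if actionable == 0:
        return float("inf") if pages else 0.0
    return len(pages) / actionable


healthy_week = [True, True, True]
noisy_week = [True] + [False] * 9
print(pages_per_actionable(healthy_week))  # 1.0
print(pages_per_actionable(noisy_week))    # 10.0
```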

Dashboards: pictures that load fast

Most dashboard setups die from sprawl. You build one for every service and every panicked midnight investigation, then the home screen becomes unusable. The fix is brutal: prune. The operator home view should answer health, flow, activity, and capacity first; everything else lives deeper.

The other discipline is keeping queries fast. A panel that takes six seconds to render is a panel nobody reads. Use recording rules to pre-aggregate the expensive stuff, and refuse to put a histogram quantile on the main view without one.
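The idea behind a recording rule can be shown with a toy Python model: compute the expensive value on a schedule, store it under a name, and let the panel do a cheap lookup. The quantile function below is a simplified linear-interpolation estimate over cumulative buckets, not any specific system's exact algorithm, and the rule name and bucket data are hypothetical.

```python
import math


def bucket_quantile(q: float, buckets: list[tuple[float, int]]) -> float:
    """Estimate a quantile from cumulative histogram buckets.

    buckets holds (upper_bound, cumulative_count) pairs sorted by upper_bound.
    The estimate interpolates linearly inside the bucket containing the rank.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            in_bucket = count - lower_count
            if in_bucket == 0:
                return upper_bound
            return lower_bound + (upper_bound - lower_bound) * (rank - lower_count) / in_bucket
        lower_bound, lower_count = upper_bound, count
    return buckets[-1][0]


# Pre-aggregated results, keyed by a rule name.
recorded: dict[str, float] = {}


def run_recording_rule(name: str, q: float, buckets: list[tuple[float, int]]) -> None:
    """Runs on a schedule, away from the dashboard's render path."""
    recorded[name] = bucket_quantile(q, buckets)


def render_panel(name: str) -> float:
    """The panel only reads the stored value; no quantile math at render time."""
    return recorded[name]


latency_buckets = [(0.1, 50), (0.5, 90), (1.0, 100)]  # seconds, cumulative counts
run_recording_rule("request_latency_p95", 0.95, latency_buckets)
print(render_panel("request_latency_p95"))
```

The cost moves from every page load to one scheduled evaluation, which is why the main view can afford a quantile panel only once such a rule exists.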

Signal posture     Live
Query map          Hidden
Log retention      30 d
Pages this week

Takeaways

The point of observability isn’t a wall of dashboards. It’s being able to answer two questions in under a minute: is anything broken right now, and was anything broken when this user said it was. If your stack can’t answer those, it’s decoration.

The cost in disk, RAM, and operator attention is real but small. The value is knowing immediately when a service is lying to you, without turning the public site into an operator map.

The instruments that listen.
