Linux fleet baselines, explained without jargon
A baseline is "the picture of this server we agreed was fine". Everything Blackglass does — the things that look fancy in a screenshot, and the things that don't — is downstream of that idea.
If you've worked with configuration management tools you already know what a baseline is. If you haven't, the concept is simpler than the vocabulary makes it sound, and this post is for you. We'll cover what a baseline actually is, why it matters, how Blackglass uses it, and the trap most teams fall into without one.
The baseline is a frozen snapshot of "fine"
Imagine a Linux server that's just been hardened. Patches up to date. Only the right ports listening. SSH configured to refuse password auth. Two known administrators in the sudoers file. A standard set of cron jobs. You walk away. The server is, by every measure that matters today, fine.
Two weeks pass. Did anything change?
Without a baseline, the only way to answer that question is to redo the audit. Walk through every config file. Check every user. List every cron. It's tedious, error-prone, and nobody does it consistently. By the time you've finished, something has probably changed since you started.
A baseline is the answer to "did anything change?". It's a structured snapshot — SSH config, package list, listening ports, scheduled tasks, sudoers, kernel parameters, the set of things that matter — captured at a moment you agreed represented "fine". You pin it. Then, every subsequent observation compares against it.
The output is a list of differences: things that were not in the baseline that are now in the running system, things that were in the baseline but have disappeared, and things whose values have changed. Each difference is a piece of evidence. Some are expected (a patch was applied). Some are not (a previously-disabled service is now running).
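To make the three kinds of difference concrete, here is a minimal sketch in Python. It assumes a snapshot can be flattened into a dictionary of key/value pairs; the key names (`sshd.PasswordAuthentication`, `port.tcp/8080`, and so on) and the `Drift` and `diff_snapshots` names are invented for illustration and are not Blackglass's actual schema or API.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class Drift:
    """The three kinds of difference between a baseline and a later snapshot."""
    added: dict[str, Any] = field(default_factory=dict)      # in current, not in baseline
    removed: dict[str, Any] = field(default_factory=dict)    # in baseline, not in current
    changed: dict[str, tuple[Any, Any]] = field(default_factory=dict)  # (baseline, current)


def diff_snapshots(baseline: dict[str, Any], current: dict[str, Any]) -> Drift:
    """Compare a current snapshot against the pinned baseline."""
    drift = Drift()
    for key in current.keys() - baseline.keys():
        drift.added[key] = current[key]
    for key in baseline.keys() - current.keys():
        drift.removed[key] = baseline[key]
    for key in baseline.keys() & current.keys():
        if baseline[key] != current[key]:
            drift.changed[key] = (baseline[key], current[key])
    return drift


# Hypothetical flattened snapshots of one server, two weeks apart.
baseline = {
    "sshd.PasswordAuthentication": "no",
    "port.tcp/22": "sshd",
    "cron.logrotate": "daily",
}
current = {
    "sshd.PasswordAuthentication": "yes",
    "port.tcp/22": "sshd",
    "port.tcp/8080": "python3",
}

drift = diff_snapshots(baseline, current)
```

With these inputs, `drift.added` holds the new listening port, `drift.removed` holds the missing cron job, and `drift.changed` holds the SSH setting with its old and new values. Anything identical in both snapshots, such as the SSH port, does not appear at all.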
Why drift happens, even on disciplined teams
The standard story for "configuration drift" is hotfixes — the production incident at 02:00 where someone SSH'd in, ran four commands, and forgot to merge the equivalent change back into the config management repo. That's the canonical case. It's also overrepresented in tooling marketing.
The less glamorous causes are bigger:
- Package updates pull in dependencies. You update one library; it brings a new version of something else; that something else changes a default. Repeat across forty machines.
- Defaults change between distro versions. Upgrade a node from Ubuntu 22.04 to 24.04 and a dozen settings shift quietly. None of them are individually a problem. Together they're a different security posture.
- Offboarding misses things. A contractor leaves. HR closes their account. SSH keys in ~/.ssh/authorized_keys on six servers are still there.
- Internal tools assume root. The deploy script adds a user. The monitoring agent adds a sudoers line. Each one is justified. Cumulatively, the surface area of "things that can act on this server" grows.
None of these is malicious. None of these is even obviously wrong at the time. But they accumulate, and at some point an auditor or an incident asks "what's actually running here, and who can act on it?", and the honest answer is "we'd have to look".
Severity, not just a list of differences
The first version of a drift report is always a flat list. Useless. There's too much noise — package updates, log rotation, timestamp churn — for anyone to triage. So the next move is to rank by severity.
Blackglass uses a four-rung ladder:
- HARDENED — matches baseline. No action needed.
- LOW — a difference exists, but it's in a category we deliberately ignore (e.g. system uptime, log timestamps).
- MEDIUM — a difference in a meaningful category (e.g. a new package was installed) that doesn't on its own indicate exposure.
- HIGH — a difference that materially changes posture: a new listening port, an added sudoers entry, an SSH config that now permits password auth, a service running as root that wasn't before.
The point of the ladder is to make triage tractable. You don't read every drift report — you read the HIGH ones, glance at MEDIUM, ignore the rest. Over time you tune the ladder to fit your environment (we ship a sensible default that targets CIS Benchmark expectations).
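As a rough illustration of how such a ladder might be applied, the Python sketch below maps each difference to a rung by its category and reports the worst rung per host. The category names and the `classify` and `host_severity` helpers are hypothetical, and classifying by category alone is a simplification: a real ruleset would also look at what changed and in which direction.

```python
from enum import IntEnum


class Severity(IntEnum):
    """The four rungs, ordered so that a higher value means more urgent."""
    HARDENED = 0
    LOW = 1
    MEDIUM = 2
    HIGH = 3


# Hypothetical category names, chosen to mirror the examples in the ladder.
IGNORED_CATEGORIES = {"uptime", "log_timestamp"}
POSTURE_CATEGORIES = {"listening_port", "sudoers", "sshd_config", "root_service"}


def classify(category: str) -> Severity:
    """Map one difference to a rung, based only on its category."""
    if category in IGNORED_CATEGORIES:
        return Severity.LOW
    if category in POSTURE_CATEGORIES:
        return Severity.HIGH
    # Assumption: anything else (e.g. "package") is meaningful but not
    # an exposure on its own, so it lands on MEDIUM.
    return Severity.MEDIUM


def host_severity(categories: list[str]) -> Severity:
    """A host with no differences is HARDENED; otherwise take the worst rung."""
    return max((classify(c) for c in categories), default=Severity.HARDENED)


def triage(categories: list[str]) -> list[str]:
    """Return only the differences worth reading first: the HIGH ones."""
    return [c for c in categories if classify(c) is Severity.HIGH]
```

Tuning the ladder for an environment then amounts to moving categories between the sets, which keeps the triage rule itself ("read HIGH, glance at MEDIUM") unchanged.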
"We have monitoring" is not the same thing
It's worth saying explicitly: event monitoring and configuration truth are different problems. Datadog, Prometheus, your SIEM — these tell you what's happening right now (CPU, network, login attempts). They don't tell you what's configured. A server can have a perfect Datadog dashboard and still have an old contractor's SSH key in authorized_keys.
The two answer different questions. Monitoring answers "is anything happening that I should know about?". Baselines answer "is the system in the shape I expected it to be?". You need both. They don't substitute.
Why agentless SSH is enough
An obvious objection: why not run an agent on each host? Agents see everything, instantly.
Three reasons we don't:
- Agents are software you have to deploy, update, patch, and trust. Every host gets one more thing that can break. Every CVE in the agent is a CVE on every server. SSH-based read-only collection inverts the trust model: nothing is installed, nothing runs continuously, the credentials are short-lived and read-scoped.
- Snapshots are sufficient. You don't need to see drift the millisecond it happens. You need to see it in time to act — daily for most environments, hourly for high-stakes ones. SSH polling at those cadences is straightforward and cheap.
- Audit and procurement teams trust read-over-SSH more than agents. The conversation goes "you're running a script over SSH that reads config files" rather than "you've installed a closed-source binary on our production servers". The first conversation is shorter.
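For a sense of how little is involved in pull-mode collection, here is a minimal Python sketch that runs a few read-only commands over SSH and gathers their output. It is not Blackglass's collector: the command list, the function names, and the choice of `dpkg-query` (which assumes a Debian-family host) are all illustrative assumptions, and the sketch relies on the local `ssh` client already having access to a suitable key.

```python
from __future__ import annotations

import subprocess
from typing import Callable

# Hypothetical read-only commands; a real collector would cover more categories.
READ_ONLY_COMMANDS = {
    "listening_ports": "ss -tln",                  # TCP listeners, numeric
    "sshd_config": "cat /etc/ssh/sshd_config",     # SSH daemon settings
    "packages": "dpkg-query -W",                   # Debian/Ubuntu package list
}


def ssh_argv(host: str, remote_command: str) -> list[str]:
    """Build the ssh invocation. BatchMode=yes makes ssh fail rather than prompt."""
    return [
        "ssh",
        "-o", "BatchMode=yes",
        "-o", "ConnectTimeout=10",
        host,
        remote_command,
    ]


def run_ssh(argv: list[str]) -> str:
    """Run one command and return its stdout; raises if the command fails."""
    result = subprocess.run(
        argv, capture_output=True, text=True, timeout=60, check=True
    )
    return result.stdout


def collect_snapshot(
    host: str, run: Callable[[list[str]], str] = run_ssh
) -> dict[str, str]:
    """Collect raw output for each category. Nothing is installed on the host."""
    return {
        name: run(ssh_argv(host, command))
        for name, command in READ_ONLY_COMMANDS.items()
    }
```

The `run` parameter exists so the collector can be exercised without a network, for example by passing a function that returns canned output. Scheduling this daily or hourly is then a job for cron or any other scheduler.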
For teams that want push-mode collection (air-gapped fleets, networks that don't permit inbound SSH) Blackglass supports an optional one-line push agent. The default is pull-mode SSH.
The trap of "we'll do it later"
Teams without a baseline usually lack one because baselining feels like a big upfront project. It isn't. More precisely, the upfront cost is paid once, and the per-day cost after that is small. The cost of not having one is paid in audit time, incident-response time, and the slow corrosion of "we used to know how this was configured".
If you'd like to see what a baseline-driven view of your own fleet looks like — or just kick the tyres — the Blackglass page has the product details and a 14-day trial. There's a free Lab tier for homelabs.
If you think we've over- or under-stated something here, tell us. We update posts.