Windows Watchdog Basics

The three trigger types, how to pick sensible thresholds, and why most watchdogs become alert-fatigue generators within a week — plus the patterns that prevent it.

A watchdog is the most over-promised and under-implemented piece of every Windows fleet's tooling. Every monitoring product has one. Most of the ones we've inherited from previous teams were either turned off because they generated too much noise, or still on but ignored because nobody trusted them. Both are failure modes.

This post is a working operator's guide to designing watchdogs that earn trust and stay useful. We'll cover what a watchdog actually is, the three trigger types you need, how to pick thresholds, the patterns that prevent alert fatigue, and when (carefully) to let it auto-remediate.

What a watchdog actually is

Some terminology, because the industry uses these words sloppily:

A monitoring dashboard without a watchdog is a glorified screensaver. A watchdog without intelligence about its own noise is a digital boy crying wolf. The difference between a good watchdog and a bad one is almost entirely in how it handles the gap between "a rule tripped" and "an operator should know."

The three trigger types

Every watchdog rule fits into one of three categories. The categories matter because they tell you what kind of false positives you'll generate and what to do about them.

1. Threshold triggers

The easiest and the most overused. "Fire when X > N for time T." Disk > 90%. CPU > 85% sustained 5 minutes. Free RAM < 500 MB.

These work well when the underlying signal is genuinely steady and crosses the threshold for a real reason. They fail miserably when the signal is bursty — a workstation CPU at 100% for 30 seconds while a build runs is not an emergency. That's why the sustained qualifier matters. A threshold rule without a duration is fundamentally broken.
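As a minimal sketch of the sustained qualifier, the rule below fires only once a value has stayed above its limit for a full hold period. The class name, the 30-second sampling interval, and the sample values are illustrative assumptions, not part of any particular product.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SustainedThreshold:
    """Fires only after the value has stayed above `limit` for `hold_seconds`."""
    limit: float
    hold_seconds: float
    _breach_started: Optional[float] = None  # time the current breach began

    def observe(self, value: float, now: float) -> bool:
        if value <= self.limit:
            # Any sample at or below the limit resets the clock.
            self._breach_started = None
            return False
        if self._breach_started is None:
            self._breach_started = now
        return now - self._breach_started >= self.hold_seconds


# CPU > 85% sustained for 5 minutes, sampled every 30 seconds (assumed interval).
rule = SustainedThreshold(limit=85.0, hold_seconds=300)

# A 30-second build spike never fires.
burst = [rule.observe(v, t) for t, v in [(0, 100.0), (30, 100.0), (60, 12.0)]]

# Eleven consecutive samples above the limit span 300 seconds; only the last fires.
sustained = [rule.observe(90.0, 100 + 30 * i) for i in range(11)]
```

The reset on any sample at or below the limit is a deliberate simplification: a single dip restarts the timer, which keeps the rule quiet on bursty hosts at the cost of reacting later on noisy ones.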

2. State change triggers

"Fire when X transitioned from state A to state B." A service stopped that should be running. A host went offline (heartbeat stopped). An agent's version regressed. A disk got reformatted.

These have a different failure mode: they fire on every flap. A service that restarts itself every 90 seconds generates two state changes per cycle (Stopped → Running → Stopped → Running). Without dedup, that is 80 alerts an hour from one service on one host, and a dozen flapping hosts puts you near a thousand. State triggers must come with rate limiting baked in.
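One way to bake rate limiting into a state trigger is to cap how many transition alerts each key may send per window and count the rest as suppressed. This is a sketch under assumptions: the class, the one-alert-per-hour limit, and the host and service names are hypothetical.

```python
from collections import defaultdict, deque


class StateChangeTrigger:
    """Alerts on state transitions, at most `max_alerts` per key per `window` seconds."""

    def __init__(self, max_alerts: int = 1, window: float = 3600.0):
        self.max_alerts = max_alerts
        self.window = window
        self._last_state = {}                    # key -> last seen state
        self._alert_times = defaultdict(deque)   # key -> timestamps of sent alerts
        self.suppressed = defaultdict(int)       # key -> transitions swallowed

    def observe(self, key, state, now):
        previous = self._last_state.get(key)
        self._last_state[key] = state
        if previous is None or previous == state:
            return False  # first sighting or no transition
        sent = self._alert_times[key]
        while sent and now - sent[0] >= self.window:
            sent.popleft()  # forget alerts older than the window
        if len(sent) >= self.max_alerts:
            self.suppressed[key] += 1
            return False
        sent.append(now)
        return True


# A service flapping on a 90-second cycle, sampled every 45 seconds for an hour.
trigger = StateChangeTrigger(max_alerts=1, window=3600)
key = ("host-01", "Spooler")  # hypothetical host and service
alerts = 0
for i in range(81):
    state = "Running" if i % 2 == 0 else "Stopped"
    if trigger.observe(key, state, now=45 * i):
        alerts += 1
```

Here the hour of flapping produces one alert and 79 suppressed transitions. Keeping the suppressed count around is useful: it can be attached to the one alert that does go out, so the operator sees that the service is flapping rather than that it stopped once.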

3. Pattern triggers

"Fire when X matches pattern Y N times in window W." Log line matched a known-bad regex. Event log code 4625 (failed logon) appeared 20 times in 5 minutes. A specific error string appeared more than usual.

The hardest of the three to get right because patterns drift. The error message your monitoring depends on gets reworded in the next patch. The Event ID gets re-purposed in a Windows update. Pattern rules need maintenance, and you should plan for that maintenance the moment you create them.
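The "N times in window W" shape can be sketched as a sliding window of match timestamps. The log line format below (`EventID=4625 user=...`) is an assumption made for the example; real event log records look different and the regex would need to match whatever your collector emits.

```python
import re
from collections import deque


class PatternTrigger:
    """Fires when `pattern` matches at least `count` lines within `window` seconds."""

    def __init__(self, pattern: str, count: int, window: float):
        self.regex = re.compile(pattern)
        self.count = count
        self.window = window
        self._hits = deque()  # timestamps of recent matches

    def observe(self, line: str, now: float) -> bool:
        if not self.regex.search(line):
            return False
        self._hits.append(now)
        while now - self._hits[0] > self.window:
            self._hits.popleft()  # drop matches that fell out of the window
        return len(self._hits) >= self.count


# 20 failed logons in 5 minutes.
rule = PatternTrigger(r"\bEventID=4625\b", count=20, window=300)

# One failed logon a minute never accumulates 20 hits inside the window.
slow = [rule.observe("EventID=4625 user=alice", now=60 * i) for i in range(19)]

# Twenty failed logons ten seconds apart trip the rule on the twentieth.
fast = [rule.observe("EventID=4625 user=alice", now=2000 + 10 * i) for i in range(20)]
```

Keeping the regex as a single constructor argument makes the maintenance burden visible: when a patch rewords the message, there is exactly one string to update and one rule to retest.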

How to pick thresholds without generating noise

The single most common mistake: picking a threshold based on what feels reasonable in the abstract, instead of what's actually normal for the specific host.

A workstation that an accountant uses for Excel runs at 8% CPU all day. An app server compiling at night runs at 70% sustained for hours and that's fine. Apply the same 85% rule to both and the server generates noise whenever a nightly job peaks, while the workstation stays silent even when a compromise has it burning several times its normal CPU.

Three rules of thumb that work:

  1. Watch the host for a week before setting any threshold. Look at the histogram, not the maximum. Set the threshold at p99 + headroom, not "what sounds high."
  2. Tier your hosts. Workstations, app servers, DB servers, RDP hosts — different normal profiles, different thresholds. Tag and apply rules by tag.
  3. Use deltas, not absolutes, when you can. "Disk free dropped 10 GB in 1 hour" is much more diagnostic than "Disk free < 50 GB."
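The first and third rules of thumb can be sketched in a few lines: a nearest-rank p99 plus headroom, and a delta check over a trailing window. The 5-point headroom, the function names, and the sample data are assumptions chosen for illustration.

```python
def p99_threshold(samples, headroom=5.0, cap=100.0):
    """Nearest-rank 99th percentile of `samples` plus `headroom`, capped at `cap`."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n) in integer math
    return min(ordered[rank - 1] + headroom, cap)


def dropped_by(readings, amount, window):
    """True if the value fell by at least `amount` within the last `window` seconds.

    `readings` is a time-ordered list of (timestamp, value) pairs.
    """
    if not readings:
        return False
    latest_time, latest_value = readings[-1]
    recent = [v for t, v in readings if latest_time - t <= window]
    return max(recent) - latest_value >= amount


# A week of CPU samples: mostly idle, some moderate load, five spikes to 100%.
week = [8.0] * 950 + [30.0] * 45 + [100.0] * 5
cpu_limit = p99_threshold(week)

# Disk free in GB over the last hour.
leaking = dropped_by([(0, 120.0), (1800, 118.0), (3600, 108.0)], amount=10, window=3600)
steady = dropped_by([(0, 120.0), (1800, 119.0), (3600, 118.0)], amount=10, window=3600)
```

With this data the CPU limit comes out at 35.0 rather than anything near the 100% maximum, which is the point of looking at the histogram. The delta check flags the disk that lost 12 GB in an hour and ignores the one that lost 2 GB, regardless of how much free space either has in absolute terms.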

You will not get this right on the first try. Plan to tune thresholds for the first month after a fleet onboards.

Alert fatigue and how to prevent it

If your watchdog generates so many alerts that operators stop reading them, you've engineered a worse outcome than no watchdog at all. It is worse because operators still believe they'd see it if it mattered, and that belief is now wrong.

The patterns that prevent this are simple and often skipped:

Cooldowns

Every (host, rule) pair gets a cooldown after firing. Default 1 hour. Tunable per rule. When a condition trips, the alert fires once, then the rule goes quiet for the cooldown window — even if the condition keeps tripping. When the cooldown expires, if the condition is still true, it fires again. If it self-recovered in the meantime, nothing.

This single rule cuts most fleet alert volumes by 80%.
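A cooldown of this kind can be sketched as a small lookup keyed by (host, rule). The class name, the per-rule override mechanism, and the host and rule names are assumptions for the example.

```python
class Cooldown:
    """Lets each (host, rule) pair alert at most once per cooldown window."""

    def __init__(self, default_seconds: float = 3600.0, per_rule=None):
        self.default_seconds = default_seconds
        self.per_rule = dict(per_rule or {})  # rule name -> override in seconds
        self._last_fired = {}                 # (host, rule) -> time of last alert

    def should_fire(self, host: str, rule: str, tripped: bool, now: float) -> bool:
        if not tripped:
            return False  # recovered conditions stay silent
        key = (host, rule)
        last = self._last_fired.get(key)
        quiet_for = self.per_rule.get(rule, self.default_seconds)
        if last is not None and now - last < quiet_for:
            return False  # still inside the cooldown window
        self._last_fired[key] = now
        return True


# A condition that stays tripped for two hours, evaluated once a minute.
cooldown = Cooldown(default_seconds=3600, per_rule={"heartbeat-lost": 600})
fired = [t for t in range(0, 7200, 60)
         if cooldown.should_fire("host-01", "disk-full", True, t)]
```

The 120 evaluations produce two alerts, at t=0 and t=3600. Because the key includes the host, a second machine tripping the same rule during the first machine's cooldown still alerts immediately.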

Dedup by root cause, not by event

If a network outage takes 50 hosts offline, that's one incident, not 50 alerts. Group alerts by inferred root cause (same time window + same tag + same rule) and fire one summary. The dashboard still shows the 50 hosts; the notification stays at one.
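The grouping key described above (same time window, same tag, same rule) can be sketched as follows. The alert dictionary shape, the 5-minute bucket, and the host, tag, and rule names are assumptions for illustration.

```python
from collections import defaultdict


def group_incidents(alerts, bucket_seconds=300):
    """Groups alerts sharing a rule, a tag, and a time bucket into one incident.

    Each alert is a dict with "host", "tag", "rule", and "time" (seconds).
    """
    incidents = defaultdict(list)
    for alert in alerts:
        bucket = int(alert["time"] // bucket_seconds)
        incidents[(alert["rule"], alert["tag"], bucket)].append(alert["host"])
    return [
        {"rule": rule, "tag": tag, "hosts": sorted(hosts),
         "summary": f"{rule}: {len(hosts)} host(s) tagged {tag}"}
        for (rule, tag, _bucket), hosts in incidents.items()
    ]


# Fifty workstations lose their heartbeat within a minute, plus one unrelated alert.
outage = [{"host": f"ws-{i:02d}", "tag": "branch-office",
           "rule": "heartbeat-lost", "time": 1000 + i} for i in range(50)]
other = [{"host": "db-01", "tag": "db-server", "rule": "disk-low", "time": 1010}]
incidents = group_incidents(outage + other)
```

The 51 alerts collapse into two incidents, and the first carries the full host list for the dashboard. Fixed buckets are the simplest choice but can split one outage in two when it straddles a bucket boundary; a sliding window or a short settling delay before notifying would avoid that at the cost of more state.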

Severity-tiered routing

Not every alert needs to wake somebody. Use at least three tiers:

The temptation is always to set things to Critical because "we should know." Resist it. A 3 AM wakeup for a low-disk warning on a dev workstation will cost you the next real Critical alert six weeks later, after operators have learned to silence the noise.

Daily summary

For every alert that didn't merit a real-time notification, roll it into a daily summary email or report. Operators who ignore notifications during the day will still read the rollup. It's also where you'll notice slow-moving trends — disk usage creeping up over a week, a service that flaps daily without anyone noticing.

Auto-remediation: when and when not to

The most exciting thing a watchdog can do is fix the problem itself. It is also the most dangerous. Here are the rules we've converged on:

Auto-remediate when:

Do NOT auto-remediate when:

Rule of thumb: auto-remediation should be a janitor, not a surgeon. It mops up known messes. It does not do procedures.

Real examples that work

From our own watchdog rules running in production:

Notice the asymmetry: cleanup and reset actions auto-remediate freely; anything that could indicate a bigger problem (mismatched versions, lost heartbeats, security signals) goes to humans. That's the right shape.

What we'd tell our past selves

If you're setting up watchdog rules for the first time, three pieces of advice we wish we'd taken seriously sooner:

  1. Start with five rules. Not fifty. Tune them for a month. Then add more. The pull to be "comprehensive" out of the gate is the single biggest cause of fleet alert fatigue we've seen.
  2. Every rule needs an owner. If a rule generates noise and nobody owns it, nobody will tune it, and the team will silence the entire watchdog instead. Assign each rule to an operator who is on the hook for its quality.
  3. Watch your own watchdog. Track alert volume per rule per week. Rules that fire more than twice per week without a remediation are candidates for the chopping block.
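Watching your own watchdog can start as a weekly count per rule. The sketch below assumes each logged alert records whether it led to a remediation, and it reads "more than twice per week without a remediation" as more than two unremediated alerts in a week; the log format and rule names are hypothetical.

```python
from collections import Counter


def noisy_rules(alert_log, max_unremediated_per_week=2):
    """Returns {(rule, week): count} for rules over the weekly limit.

    `alert_log` is an iterable of (rule, week, remediated) tuples, where
    `remediated` is True when the alert led to an actual fix.
    """
    unremediated = Counter(
        (rule, week) for rule, week, remediated in alert_log if not remediated
    )
    return {
        key: count for key, count in unremediated.items()
        if count > max_unremediated_per_week
    }


log = (
    [("cpu-high", "2024-W10", False)] * 5      # fired five times, nobody acted
    + [("disk-low", "2024-W10", True)] * 4     # fired four times, each one fixed
    + [("svc-stopped", "2024-W10", False)] * 2  # at the limit, not over it
)
candidates = noisy_rules(log)
```

Only `cpu-high` comes back as a candidate for tuning or removal. A rule that fires often but gets acted on every time, like `disk-low` here, is doing its job and stays off the list.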

A watchdog that fires three times a week and is right every single time is more valuable than a watchdog that fires 200 times a week and is right 20 times. Signal-to-noise is the whole game.

Want a watchdog that's tuned by default?

PeopleWorks Agent ships with a watchdog that follows the patterns in this post — cooldowns, severity tiers, daily summaries, and a curated set of starter rules.
