A watchdog is the most over-promised and under-implemented piece of every Windows fleet's tooling. Every monitoring product has one. Most of the ones we've inherited from previous teams were either turned off because they generated too much noise, or still on but ignored because nobody trusted them. Both are failure modes.
This post is a working operator's guide to designing watchdogs that earn trust and stay useful. We'll cover what a watchdog actually is, the three trigger types you need, how to pick thresholds, the patterns that prevent alert fatigue, and when (carefully) to let it auto-remediate.
What a watchdog actually is
Some terminology, because the industry uses these words sloppily:
- Monitoring is collecting state over time. CPU, RAM, disk, services, errors.
- Alerting is firing a notification when a piece of state crosses a rule.
- A watchdog is alerting plus the ability to act on the alert — at minimum to dedupe and escalate, at most to attempt automatic remediation.
A monitoring dashboard without a watchdog is a glorified screensaver. A watchdog without intelligence about its own noise is a digital boy crying wolf. The difference between a good watchdog and a bad one is almost entirely in how it handles the gap between "a rule tripped" and "an operator should know."
The three trigger types
Every watchdog rule fits into one of three categories. The categories matter because they tell you what kind of false positives you'll generate and what to do about them.
1. Threshold triggers
The easiest and the most overused. "Fire when X > N for time T." Disk > 90%. CPU > 85% sustained 5 minutes. Free RAM < 500 MB.
These work well when the underlying signal is genuinely steady and crosses the threshold for a real reason. They fail miserably when the signal is bursty — a workstation CPU at 100% for 30 seconds while a build runs is not an emergency. That's why the sustained qualifier matters. A threshold rule without a duration is fundamentally broken.
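To make the sustained qualifier concrete, here is a minimal sketch of a threshold rule that only fires once the value has stayed over the limit for the full duration. The class name `SustainedThreshold` and its fields are ours for illustration, not part of any particular product.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SustainedThreshold:
    """Fires only when a value stays above `limit` for `duration_s` seconds."""
    limit: float
    duration_s: float
    _breach_start: Optional[float] = None  # when the current breach began

    def observe(self, value: float, now: float) -> bool:
        if value <= self.limit:
            self._breach_start = None  # any dip below the limit resets the clock
            return False
        if self._breach_start is None:
            self._breach_start = now
        return now - self._breach_start >= self.duration_s


rule = SustainedThreshold(limit=85.0, duration_s=300)
# A 30-second build spike at 100% CPU never fires.
print([rule.observe(100.0, t) for t in (0, 10, 20, 30)])
# prints [False, False, False, False]
```

Resetting on every dip is the strictest reading of "sustained"; a more forgiving variant might tolerate a sample or two below the limit.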
2. State change triggers
"Fire when X transitioned from state A to state B." A service stopped that should be running. A host went offline (heartbeat stopped). An agent's version regressed. A disk got reformatted.
These have a different failure mode: they fire on every flap. A service that's restarting itself every 90 seconds will generate two state changes per cycle (Stopped → Running → Stopped → Running). Without dedup, that's 80 alerts an hour from a single service, and about a thousand an hour once a dozen or so hosts share the problem. State triggers must come with rate limiting baked in.
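One way to bake that rate limiting in is a per-key sliding window: let the first few transitions through, then count the rest instead of sending them. This is a sketch under our own assumptions; `FlapLimiter`, the limit of three alerts per hour, and the `(host, service)` key are illustrative choices.

```python
from collections import deque


class FlapLimiter:
    """Allows at most `max_alerts` notifications per `window_s` seconds per key.

    Further transitions inside the window are counted but suppressed.
    """

    def __init__(self, max_alerts: int = 3, window_s: float = 3600.0):
        self.max_alerts = max_alerts
        self.window_s = window_s
        self._sent = {}       # key -> deque of timestamps of sent alerts
        self.suppressed = {}  # key -> number of suppressed transitions

    def should_notify(self, key, now: float) -> bool:
        sent = self._sent.setdefault(key, deque())
        while sent and now - sent[0] >= self.window_s:
            sent.popleft()  # forget alerts that fell out of the window
        if len(sent) < self.max_alerts:
            sent.append(now)
            return True
        self.suppressed[key] = self.suppressed.get(key, 0) + 1
        return False


limiter = FlapLimiter(max_alerts=3, window_s=3600)
key = ("reception-01", "Spooler")
# A service cycling every 90 seconds changes state every 45 seconds.
sent = sum(limiter.should_notify(key, t) for t in range(0, 3600, 45))
print(sent, limiter.suppressed[key])
```

Keeping the suppressed count matters: it is what lets the eventual notification say "and it has flapped N more times since."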
3. Pattern triggers
"Fire when X matches pattern Y N times in window W." Log line matched a known-bad regex. Event log code 4625 (failed logon) appeared 20 times in 5 minutes. A specific error string appeared more than usual.
The hardest of the three to get right because patterns drift. The error message your monitoring depends on gets reworded in the next patch. The Event ID gets re-purposed in a Windows update. Pattern rules need maintenance, and you should plan for that maintenance the moment you create them.
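A pattern rule reduces to "count regex matches in a sliding window." The sketch below assumes log lines arrive as strings with a timestamp supplied by the caller; the `EventID=4625` line format is made up for the example and is not how the Windows event log actually serializes events.

```python
import re
from collections import deque


class PatternTrigger:
    """Fires when `pattern` matches at least `count` times within `window_s`."""

    def __init__(self, pattern: str, count: int, window_s: float):
        self.regex = re.compile(pattern)
        self.count = count
        self.window_s = window_s
        self._hits = deque()  # timestamps of recent matches

    def observe(self, line: str, now: float) -> bool:
        if not self.regex.search(line):
            return False
        self._hits.append(now)
        while now - self._hits[0] > self.window_s:
            self._hits.popleft()  # drop matches older than the window
        return len(self._hits) >= self.count


# 20 failed logons in 5 minutes; the line format here is hypothetical.
failed_logons = PatternTrigger(r"EventID=4625\b", count=20, window_s=300)
```

Because the regex lives in one place, a reworded message or repurposed Event ID means editing one rule rather than hunting through handlers.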
How to pick thresholds without generating noise
The single most common mistake: picking a threshold based on what feels reasonable in the abstract, instead of what's actually normal for the specific host.
A workstation that an accountant uses for Excel runs at 8% CPU all day. An app server compiling at night runs at 70% sustained for hours and that's fine. The same 85% rule applied to both produces noise on the server and silence on the workstation when the workstation gets compromised by something burning cycles.
Three rules of thumb that work:
- Watch the host for a week before setting any threshold. Look at the histogram, not the maximum. Set the threshold at p99 + headroom, not "what sounds high."
- Tier your hosts. Workstations, app servers, DB servers, RDP hosts — different normal profiles, different thresholds. Tag and apply rules by tag.
- Use deltas, not absolutes, when you can. "Disk free dropped 10 GB in 1 hour" is much more diagnostic than "Disk free < 50 GB."
You will not get this right on the first try. Plan to tune thresholds for the first month after a fleet onboards.
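Two of those rules of thumb translate directly into code. The sketch below derives a threshold from a week of samples as p99 plus headroom, and checks a delta rule against recent disk readings. The function names, the 5-point headroom, and the `(timestamp, free_gb)` reading format are our assumptions.

```python
import statistics


def suggest_threshold(samples, headroom=5.0, cap=100.0):
    """Return the 99th percentile of `samples` plus headroom, capped at `cap`."""
    p99 = statistics.quantiles(samples, n=100)[98]  # 99 cut points; last is p99
    return min(p99 + headroom, cap)


def dropped_by(readings, amount_gb, window_s):
    """True if free space fell by at least `amount_gb` within the last `window_s`.

    `readings` is a time-ordered list of (timestamp_s, free_gb) tuples.
    """
    latest_t, latest_free = readings[-1]
    recent = [free for t, free in readings if latest_t - t <= window_s]
    return max(recent) - latest_free >= amount_gb


# "Disk free dropped 10 GB in 1 hour"
readings = [(0, 200.0), (1800, 195.0), (3600, 188.0)]
print(dropped_by(readings, amount_gb=10, window_s=3600))  # prints True
```

Running `suggest_threshold` per tier rather than per fleet is what keeps the accountant's workstation and the build server from sharing a number.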
Alert fatigue and how to prevent it
If your watchdog generates so many alerts that operators stop reading them, you've engineered a worse outcome than no watchdog at all. It's worse because operators still trust their (now wrong) belief that they'd see a problem if it mattered.
The patterns that prevent this are simple and often skipped:
Cooldowns
Every (host, rule) pair gets a cooldown after firing. Default 1 hour. Tunable per rule. When a condition trips, the alert fires once, then the rule goes quiet for the cooldown window — even if the condition keeps tripping. When the cooldown expires, if the condition is still true, it fires again. If it self-recovered in the meantime, nothing.
This single rule cuts most fleet alert volumes by 80%.
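The cooldown logic is small enough to sketch in full. The version below keys on the (host, rule) pair, defaults to one hour, and accepts per-rule overrides; the class and parameter names are illustrative.

```python
class Cooldown:
    """Suppresses repeat alerts for a (host, rule) pair during its cooldown."""

    def __init__(self, default_s=3600.0, per_rule=None):
        self.default_s = default_s
        self.per_rule = per_rule or {}  # rule name -> cooldown override in seconds
        self._last_fired = {}           # (host, rule) -> time of last alert

    def should_fire(self, host, rule, condition_true, now):
        if not condition_true:
            return False  # self-recovered: nothing to send
        key = (host, rule)
        window = self.per_rule.get(rule, self.default_s)
        last = self._last_fired.get(key)
        if last is not None and now - last < window:
            return False  # still cooling down, stay quiet
        self._last_fired[key] = now
        return True


cd = Cooldown()
print([cd.should_fire("pc-01", "disk_high", True, t) for t in (0, 600, 3599, 3600)])
# prints [True, False, False, True]
```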
Dedup by root cause, not by event
If a network outage takes 50 hosts offline, that's one incident, not 50 alerts. Group alerts by inferred root cause (same time window + same tag + same rule) and fire one summary. The dashboard still shows the 50 hosts; the notification stays at one.
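The grouping key described above (time window + tag + rule) can be sketched as a dictionary key. This version uses fixed time buckets for simplicity, which is an assumption on our part: an outage that straddles a bucket boundary will show up as two incidents, so a production version would likely want a sliding window instead. The alert dictionary fields are hypothetical.

```python
from collections import defaultdict


def group_incidents(alerts, window_s=300):
    """Collapse alerts sharing rule, tag and time bucket into one incident each."""
    incidents = defaultdict(list)
    for a in alerts:
        bucket = int(a["ts"] // window_s)
        incidents[(a["rule"], a["tag"], bucket)].append(a["host"])
    return [
        {
            "rule": rule,
            "tag": tag,
            "hosts": sorted(hosts),  # the dashboard still gets every host
            "summary": f"{rule} on {len(hosts)} host(s) tagged {tag}",
        }
        for (rule, tag, _bucket), hosts in incidents.items()
    ]


alerts = [
    {"host": f"pc-{i:02d}", "rule": "heartbeat_lost", "tag": "branch-office", "ts": 1000 + i}
    for i in range(50)
]
print(group_incidents(alerts)[0]["summary"])
# prints heartbeat_lost on 50 host(s) tagged branch-office
```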
Severity-tiered routing
Not every alert needs to wake somebody. Use at least three tiers:
- Info — log to dashboard, include in daily summary. No notification.
- Warning — channel notification (Slack/Teams/Telegram). No phone.
- Critical — phone (PagerDuty, OpsGenie, manual). Reserved for production-affecting issues.
The temptation is always to set things to Critical because "we should know." Resist it. A 3 AM wakeup for a low-disk warning on a dev workstation will cost you the next real Critical alert six weeks later, once operators have learned to silence the noise.
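One way to resist that temptation is to enforce it in the router rather than in people's heads. The sketch below maps the three tiers to channels and demotes any Critical raised on a non-production host. The channel names are placeholders, and the demotion rule is our assumption about how to encode "reserved for production-affecting issues."

```python
from enum import Enum


class Severity(Enum):
    INFO = 1
    WARNING = 2
    CRITICAL = 3


# Channel names are placeholders for whatever integrations you actually run.
ROUTES = {
    Severity.INFO: ("dashboard", "daily_summary"),
    Severity.WARNING: ("dashboard", "chat"),
    Severity.CRITICAL: ("dashboard", "pager"),
}


def route(severity: Severity, production: bool) -> tuple:
    """Return the channels for an alert, demoting non-production Criticals."""
    if severity is Severity.CRITICAL and not production:
        severity = Severity.WARNING  # Critical is reserved for production impact
    return ROUTES[severity]


print(route(Severity.CRITICAL, production=False))  # prints ('dashboard', 'chat')
```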
Daily summary
For every alert that didn't merit a real-time notification, roll it into a daily summary email or report. Operators who ignore notifications during the day will still read the rollup. It's also where you'll notice slow-moving trends — disk usage creeping up over a week, a service that flaps daily without anyone noticing.
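The rollup itself can be very plain. A sketch, assuming each alert record carries a `notified` flag (a hypothetical field) that says whether it already went out in real time:

```python
from collections import Counter


def daily_summary(alerts):
    """Build a plain-text rollup of alerts that never produced a notification."""
    counts = Counter((a["host"], a["rule"]) for a in alerts if not a["notified"])
    if not counts:
        return "No unnotified alerts today."
    # most_common() sorts by count, so repeat offenders float to the top.
    return "\n".join(
        f"{n:>4}x  {host}  {rule}" for (host, rule), n in counts.most_common()
    )
```

Sorting by count is what surfaces the service that flaps daily; comparing today's counts with last week's is the natural next step for catching slow trends.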
Auto-remediation: when and when not to
The most exciting feature of a watchdog is its ability to fix the problem itself. It's also the most dangerous. Here are the rules we've converged on:
Auto-remediate when:
- The remediation is fully reversible (cleaning temp files, restarting a stateless service, killing a runaway process).
- The cost of doing it incorrectly is low (a few seconds of downtime for a service the user never sees).
- You've manually run the remediation enough times to be confident in the failure mode.
- You log every remediation with a clear audit trail (who/watchdog, when, why, result).
Do NOT auto-remediate when:
- The action is destructive or hard to undo (deleting files, rolling back software, modifying registry).
- The action affects multiple hosts at once (mass restart of production servers).
- The action could mask a real underlying problem (auto-restarting a service that's crashing for a reason).
- You haven't seen the failure mode enough times to trust the rule.
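These two lists can be encoded as a gate that every remediation has to pass before the watchdog is allowed to run it, plus an audit record for each run. This is a sketch: the `Remediation` fields and the minimum of five manual runs are assumptions we picked for the example, not a recommended number.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Remediation:
    name: str
    reversible: bool   # can the action be undone without data loss?
    multi_host: bool   # does it touch more than one host at once?
    manual_runs: int   # times operators have already run it by hand


def may_auto_run(action: Remediation, min_manual_runs: int = 5) -> bool:
    """Allow automation only for reversible, single-host, well-rehearsed actions."""
    return (
        action.reversible
        and not action.multi_host
        and action.manual_runs >= min_manual_runs
    )


def audit_record(action: Remediation, host: str, reason: str, result: str, now: float) -> str:
    """Serialize who/when/why/result for the audit trail."""
    return json.dumps(
        {"actor": "watchdog", "action": action.name, "host": host,
         "when": now, "why": reason, "result": result},
        sort_keys=True,
    )


restart_spooler = Remediation("restart_spooler", reversible=True, multi_host=False, manual_runs=12)
print(may_auto_run(restart_spooler))  # prints True
```

The gate cannot detect an action that masks an underlying problem; that judgment still belongs to whoever owns the rule.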
Real examples that work
From our own watchdog rules running in production:
- Disk > 92% on workstation → run TempCleaner, recheck in 5 minutes, alert only if still > 90%. (Recovers quietly 9 out of 10 times.)
- Free RAM < 500 MB sustained 10 minutes on app server → snapshot top_processes, dispatch MemoryOptimizer, log result. Warning to ops, no phone.
- Print spooler stopped on reception host → restart spooler. If it stops again within an hour, escalate to Warning.
- Agent version mismatch on tag:prod → no auto-action. Daily summary only. Triggers manual rollout review.
- Heartbeat lost on tag:critical host → Critical alert with phone. No auto-action — we want a human to look.
- 4625 (failed logon) > 50 in 5 minutes from same source IP → block the IP at the firewall via webhook, page the security on-call.
Notice the asymmetry: cleanup and reset actions auto-remediate freely; anything that could indicate a bigger problem (mismatched versions, lost heartbeats, security signals) goes to humans, even when, as with the failed-logon rule, a containment step runs first. That's the right shape.
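The disk rule is a good template for the remediate-then-recheck shape, so here it is as a sketch. The three callables are stand-ins we invented so the flow can be tested without touching a real host: `read_disk_pct` would read current usage, `run_cleanup` would invoke the cleanup tool, and `wait` would sleep or schedule the recheck.

```python
def handle_disk_rule(read_disk_pct, run_cleanup, wait,
                     trigger_pct=92.0, clear_pct=90.0, recheck_s=300):
    """Clean up first, recheck after a delay, and alert only if still too full."""
    if read_disk_pct() <= trigger_pct:
        return "ok"
    run_cleanup()
    wait(recheck_s)
    # The recheck uses a lower bar than the trigger, so a disk that cleanup
    # left between 90% and 92% still raises an alert.
    if read_disk_pct() > clear_pct:
        return "alert"
    return "recovered"


readings = iter([95.0, 80.0])  # before and after cleanup
outcome = handle_disk_rule(lambda: next(readings), lambda: None, lambda s: None)
print(outcome)  # prints recovered
```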
What we'd tell our past selves
If you're setting up watchdog rules for the first time, three pieces of advice we wish we'd taken seriously sooner:
- Start with five rules. Not fifty. Tune them for a month. Then add more. The pull to be "comprehensive" out of the gate is the single biggest cause of fleet alert fatigue we've seen.
- Every rule needs an owner. If a rule generates noise and nobody owns it, nobody will tune it, and the team will silence the entire watchdog instead. Assign each rule to an operator who is on the hook for its quality.
- Watch your own watchdog. Track alert volume per rule per week. Rules that fire more than twice per week without a remediation are candidates for the chopping block.
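Watching your own watchdog can start as a weekly count. The sketch below takes one week of alert records and returns the rules that fired more than twice without a remediation; the `remediated` field is a hypothetical flag on each record.

```python
from collections import Counter


def noisy_rules(week_alerts, max_unremediated=2):
    """Return rules that fired more than `max_unremediated` times with no fix."""
    counts = Counter(a["rule"] for a in week_alerts if not a["remediated"])
    return sorted(rule for rule, n in counts.items() if n > max_unremediated)


week = (
    [{"rule": "cpu_high", "remediated": False}] * 7
    + [{"rule": "disk_high", "remediated": True}] * 9
    + [{"rule": "heartbeat_lost", "remediated": False}] * 2
)
print(noisy_rules(week))  # prints ['cpu_high']
```

Sending this list to each rule's owner once a week is usually enough to get the tuning conversation started.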
A watchdog that fires three times a week and is right every single time is more valuable than a watchdog that fires 200 times a week and is right 20 times. Signal-to-noise is the whole game.
Want a watchdog that's tuned by default?
PeopleWorks Agent ships with a watchdog that follows the patterns in this post — cooldowns, severity tiers, daily summaries, and a curated set of starter rules.