The Incident That Proved Your Monitoring Was Working, CodeGood

During acquisition due diligence in late 2023, a private equity firm evaluated two competing SaaS companies. Both operated in the same market with similar customer bases and revenue (approximately $40 million annually). Both had engineering teams of comparable size (32 and 29 engineers respectively). The due diligence team asked each company for their incident logs from the previous year.

Company A reported 127 incidents across four quarters. Their incident log was detailed: root causes, mean time to detection, mean time to resolution, customer impact, and post-mortem summaries. The average incident was detected within 2.3 minutes and resolved within 18 minutes. Most incidents affected zero customers because they were caught before causing user-visible impact.

Company B reported 12 incidents across the same period. Their incident log was sparse: occurrence dates, brief descriptions, and resolution status. The average incident was detected after 47 minutes (typically by customer support receiving complaints) and resolved within 4.2 hours. Most incidents affected hundreds or thousands of customers before detection.

The due diligence team initially flagged Company A as higher risk. Their incident rate was ten times higher, suggesting systematic quality or reliability problems. Only after deeper technical investigation did the truth emerge: Company A had mature monitoring that detected problems proactively. Company B had minimal monitoring and only discovered problems when they became severe enough for customers to notice. The company with more reported incidents was significantly more reliable than the company with fewer reported incidents.

This pattern repeats throughout the software industry. Organizations with sophisticated observability report more incidents than organizations with basic monitoring. This creates a paradox where improved detection looks like degraded reliability. Understanding this paradox matters because many organizations optimize for the wrong metric, preferring low incident counts over high reliability.

The Detection Paradox

Incident counts measure detection capability more than reliability. A system with no monitoring will report zero incidents while serving corrupt data to customers. A system with comprehensive monitoring will report dozens of incidents while maintaining perfect customer experience. The difference is visibility, not reliability.

Consider two web services with identical architecture and code. Service A has basic monitoring: ping checks every minute and error rate alerts if errors exceed 5%. Service B has comprehensive monitoring: latency percentiles at one-second granularity, error rates by endpoint and customer, database connection pool utilization, cache hit rates, queue depths, and anomaly detection on dozens of metrics.

Both services develop the same problem: a database query begins performing slowly due to a missing index after a schema migration. In Service A, the slow query gradually increases average latency from 200ms to 800ms over thirty minutes. Some requests time out, but error rates remain below 5%. The monitoring does not alert. After forty-five minutes, enough requests are timing out that error rates cross 5% and an alert fires. Engineers investigate and fix the problem after another thirty minutes. Total incident duration: seventy-five minutes. Affected customers: thousands. Reported incidents: one.

In Service B, the slow query causes 95th percentile latency to spike from 300ms to 1.2 seconds within two minutes. The anomaly detection flags this immediately. Engineers receive an alert before any requests time out, identify the missing index through automated query profiling, and add the index. Total incident duration: eight minutes. Affected customers: zero (requests were slow but completed successfully). Reported incidents: one.
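The difference between the two alerting styles can be sketched in a few lines. This is a minimal illustration, not either service's actual monitoring: the function names, the sample windows, and the "twice the baseline" anomaly rule are all assumptions made for the example. Only the 5% error threshold comes from the scenario above.

```python
from statistics import quantiles


def error_rate_alert(errors: int, total: int, threshold: float = 0.05) -> bool:
    """Service A style check: fire only when the error rate exceeds 5%."""
    return total > 0 and errors / total > threshold


def p95(latencies_ms: list[float]) -> float:
    """95th percentile of one window of request latencies (needs >= 2 samples)."""
    return quantiles(latencies_ms, n=100, method="inclusive")[94]


def latency_anomaly_alert(baseline_ms: list[float], current_ms: list[float],
                          factor: float = 2.0) -> bool:
    """Service B style check: fire when p95 exceeds `factor` times the baseline p95.

    The factor of 2.0 is an assumed rule for illustration only.
    """
    return p95(current_ms) > factor * p95(baseline_ms)


# Hypothetical windows of 100 requests each.
baseline = [200.0] * 95 + [300.0] * 5
degraded = [250.0] * 90 + [1200.0] * 10  # slow, but every request still completes

print(error_rate_alert(errors=0, total=len(degraded)))  # False: nothing has failed yet
print(latency_anomaly_alert(baseline, degraded))         # True: the tail latency moved
```

The error-rate check stays silent because slow requests are still successful requests. The percentile check reacts to the shape of the latency distribution, which changes long before errors appear.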

From a reliability perspective, Service B performed dramatically better. From an incident count perspective, both services had one incident, so a naive external evaluation would rate them as equivalent. Over time the comparison tilts further, because Service B will also log incidents for problems that never cross Service A's thresholds. This is the detection paradox: better observability reveals problems that poor observability masks, making sophisticated systems appear less reliable than primitive ones.

The Silent Failure Problem

The most dangerous failures are those that go undetected. A system returning corrupt data appears to be functioning correctly by most metrics. Request success rate is 100%. Latency is normal. CPU and memory usage are unremarkable. Yet the system is failing in ways that matter deeply to customers and might expose the company to significant liability.

A financial services company discovered this during a customer audit. The customer was reconciling transactions against their internal records and found discrepancies. Approximately 0.3% of transactions over the previous six months had incorrect amounts: a transaction that should have processed for exactly $100.00 might instead have processed for $99.99 or $100.01 because of floating-point rounding errors. The errors were small individually but affected 847 transactions, totaling $1,340 in incorrect charges.

The financial services company's monitoring had detected no problems. Success rate was 100% (all transactions completed). Latency was normal. Error logs were clean. Yet the system had been silently corrupting data for six months. The incident only became visible when a particularly diligent customer noticed the discrepancies. Other customers presumably experienced the same issue but had not noticed or had not complained.

The post-mortem revealed the root cause: a code change six months earlier had modified how decimal values were handled internally. The change was tested and deployed successfully. Tests verified that transactions processed and succeeded. Tests did not verify that processed amounts were exactly correct because this seemed obvious (of course the amount should be correct). The assumption that success meant correctness was wrong, but nothing in the monitoring challenged that assumption.
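The article does not say what the code change actually was, so the following is only a sketch of one common way binary floating point can shift a monetary amount by a cent, alongside the kind of exact-amount assertion the tests lacked. The function names are hypothetical.

```python
from decimal import Decimal


def to_cents_float(amount: float) -> int:
    """Convert dollars to integer cents using binary floats (the risky way).

    int() truncates, and 0.29 * 100 evaluates to 28.999999999999996.
    """
    return int(amount * 100)


def to_cents_decimal(amount: str) -> int:
    """Convert dollars to integer cents using exact decimal arithmetic."""
    return int(Decimal(amount) * 100)


def amount_is_exact(requested: str, processed_cents: int) -> bool:
    """Correctness check: did we process exactly the amount that was requested?"""
    return to_cents_decimal(requested) == processed_cents


print(to_cents_float(0.29))                            # 28, one cent short
print(to_cents_decimal("0.29"))                        # 29
print(amount_is_exact("0.29", to_cents_float(0.29)))   # False
```

A test that only asserts "the transaction succeeded" passes in both cases. A test that asserts `amount_is_exact(...)` fails on the float path, which is the gap the post-mortem describes.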

After the incident, the company implemented correctness monitoring: automated daily reconciliation between their transaction log and expected results, statistical analysis of transaction amounts to detect drift from expected distributions, and synthetic transactions with known amounts that were checked programmatically. Over the following year, these monitoring additions detected eight correctness issues, all of which would have been invisible to their previous monitoring. The incident count increased from approximately 40 per year to 48 per year. Actual reliability increased dramatically because problems were detected and fixed before customers were affected.
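Two of the checks described above, daily reconciliation and synthetic transactions, might look roughly like this. It is a sketch under stated assumptions: the record layout, the `reconcile` and `synthetic_check` names, and the `process` callable are hypothetical stand-ins, not the company's system.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Callable, Optional


@dataclass(frozen=True)
class Mismatch:
    txn_id: str
    expected: Decimal
    processed: Optional[Decimal]  # None means the transaction is missing entirely


def reconcile(expected: dict[str, Decimal],
              processed: dict[str, Decimal]) -> list[Mismatch]:
    """Compare every expected amount with what was actually processed."""
    mismatches = []
    for txn_id, want in expected.items():
        got = processed.get(txn_id)
        if got != want:  # also true when got is None
            mismatches.append(Mismatch(txn_id, want, got))
    return mismatches


def synthetic_check(process: Callable[[Decimal], Decimal], amount: Decimal) -> bool:
    """Send a known amount through the (hypothetical) pipeline and verify the result."""
    return process(amount) == amount


expected = {"t1": Decimal("100.00"), "t2": Decimal("100.00"), "t3": Decimal("25.50")}
processed = {"t1": Decimal("100.00"), "t2": Decimal("99.99")}

for m in reconcile(expected, processed):
    print(m.txn_id, m.expected, m.processed)
```

Each mismatch returned by `reconcile` would become an internal incident, which is why the incident count rises when this kind of monitoring is introduced.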

The Measurement Paradox

Organizations that measure reliability through incident counts create perverse incentives. Engineers learn that reporting incidents reflects poorly on their performance. The rational response is to avoid reporting incidents, either by not investigating anomalies or by handling them quietly without creating incident tickets.

A technology company implemented a policy where teams with more than three incidents per quarter would have their quarterly bonuses reduced. The policy was intended to incentivize reliability. The actual effect was that teams stopped reporting incidents. When monitoring detected anomalies, engineers would investigate quietly. If they found and fixed a problem before it caused customer impact, they would not create an incident ticket. If a problem did cause customer impact, they would fix it and document it as operational maintenance rather than an incident.

Over six months, reported incidents declined from an average of twelve per quarter to four per quarter. Management celebrated this as improved reliability. Meanwhile, customer satisfaction scores declined slightly. Support ticket volume increased. Engineers reported feeling more stressed. The company commissioned an external assessment of their reliability practices and discovered the truth: reliability had not improved; visibility had degraded.

The external assessment reviewed monitoring logs and found evidence of numerous problems that should have been classified as incidents but were not. Database connection pool exhaustion that was resolved by restarting services. Memory leaks that were resolved by deploying updated code. API rate limiting that was resolved by increasing limits. Cache invalidation failures that were resolved by manual cache flushes. All of these were operational problems that affected system reliability, but none were counted as incidents because engineers had learned that incidents were punished.

The company reversed the policy and implemented a different approach: teams were evaluated on mean time to detection and mean time to resolution, not on incident count. This created incentives to detect problems quickly (which required good monitoring) and fix them quickly (which required good operational practices). Reported incidents increased to twenty per quarter, but actual reliability improved measurably through better response times and reduced customer impact.
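Mean time to detection and mean time to resolution are simple to compute once each incident records three timestamps. The sketch below assumes a hypothetical `Incident` record and measures resolution time from detection to fix; some teams measure it from the start of the problem instead, so that choice is an assumption of this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    started: datetime    # when the problem began
    detected: datetime   # when monitoring (or a customer) noticed it
    resolved: datetime   # when the fix took effect


def mean_delta(deltas: list[timedelta]) -> timedelta:
    """Average a non-empty list of durations."""
    return sum(deltas, timedelta()) / len(deltas)


def mttd(incidents: list[Incident]) -> timedelta:
    """Mean time to detection: problem start to detection."""
    return mean_delta([i.detected - i.started for i in incidents])


def mttr(incidents: list[Incident]) -> timedelta:
    """Mean time to resolution, measured here from detection to fix."""
    return mean_delta([i.resolved - i.detected for i in incidents])


t0 = datetime(2024, 1, 1, 12, 0)
incidents = [
    Incident(t0, t0 + timedelta(minutes=2), t0 + timedelta(minutes=12)),
    Incident(t0, t0 + timedelta(minutes=4), t0 + timedelta(minutes=24)),
]
print(mttd(incidents))  # 0:03:00
print(mttr(incidents))  # 0:15:00
```

Neither number gets worse when a team reports more incidents, which is what removes the incentive to hide them.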

The Zero-Incident Fallacy

Organizations that achieve zero incidents for extended periods should be skeptical of this success. Zero incidents might indicate excellent reliability or it might indicate insufficient observability. Distinguishing between these possibilities requires examining what is being measured and what remains unmeasured.

A SaaS company went eighteen months without a reported incident. Management praised the engineering team's reliability focus. The team's monitoring dashboard showed healthy metrics: 99.99% uptime, sub-100ms latency, zero errors in logs. The team was confident their system was genuinely reliable.

Then a major customer churned. The customer's stated reason was "reliability concerns." The SaaS company was confused; their metrics showed excellent reliability. They asked for specifics. The customer provided logs showing that data synchronization had been failing intermittently for months. Some records would sync correctly, others would fail silently. The failures were not consistent enough to be obvious, but over time, data drift between systems became severe enough that the customer lost confidence in the product.

The SaaS company investigated and discovered that their synchronization system had a subtle bug in error handling. When synchronization failed for transient reasons (network timeouts, rate limits), the system would retry. When it failed for permanent reasons (invalid data format, missing required fields), the system would log an error and skip the record. The logs existed but were never monitored. The company's monitoring focused on infrastructure health (servers, databases, networks) but not on business logic correctness (were records syncing successfully?).

After losing this customer, the company implemented business logic monitoring: tracking successful synchronization rate per customer, alerting when sync success rate dropped below 99.9% for any customer, and daily reports of records that failed synchronization. Within the first week, the new monitoring detected problems for seven customers. All were subtle edge cases that customers had not reported (perhaps had not noticed, or had worked around). The incident count went from zero to approximately three per month. Customer satisfaction improved because problems were detected and resolved proactively.
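A per-customer success-rate check of the kind described above can be sketched as follows. The event format and function name are assumptions for illustration; only the 99.9% threshold comes from the text.

```python
from collections import defaultdict
from typing import Iterable

SYNC_SUCCESS_THRESHOLD = 0.999  # alert below 99.9%, as described in the text


def customers_below_threshold(
    events: Iterable[tuple[str, bool]],
    threshold: float = SYNC_SUCCESS_THRESHOLD,
) -> dict[str, float]:
    """Return {customer_id: success_rate} for customers under the threshold.

    `events` is a hypothetical stream of (customer_id, succeeded) pairs, one
    per record, where a skipped record counts as a failure.
    """
    totals: dict[str, int] = defaultdict(int)
    successes: dict[str, int] = defaultdict(int)
    for customer_id, succeeded in events:
        totals[customer_id] += 1
        if succeeded:
            successes[customer_id] += 1
    rates = {cid: successes[cid] / totals[cid] for cid in totals}
    return {cid: rate for cid, rate in rates.items() if rate < threshold}


events = [("acme", True)] * 1000 + [("globex", True)] * 995 + [("globex", False)] * 5
print(customers_below_threshold(events))  # only globex is flagged
```

Aggregated across both customers the success rate is 99.75%, and an infrastructure dashboard would show nothing at all. Breaking the rate down per customer is what makes the silent skips visible.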

The Detection Gradient

Organizations can be ranked along a detection maturity gradient. At the lowest level, problems are detected by customers. At the highest level, problems are detected before they occur through predictive analysis. The incident count increases as organizations move up the gradient, even as actual customer-facing reliability improves.

At level zero, monitoring is reactive and minimal. Problems are detected when customers complain. Incident counts are low because only the most severe problems are noticed. Mean time to detection is measured in hours because detection requires customers to notice the problem, contact support, and have support escalate to engineering. Most problems that affect small numbers of customers or cause subtle degradation are never detected.

At level one, monitoring covers basic infrastructure. Servers that crash trigger alerts. Databases that stop responding trigger alerts. Services that return errors trigger alerts. Incident counts increase because problems that previously went unnoticed are now detected. Mean time to detection drops to minutes for severe problems. Problems that do not cause complete failures (degraded performance, increased latency, elevated error rates below alerting thresholds) remain undetected.

At level two, monitoring covers application metrics. Latency percentiles, error rates by endpoint, and queue depths are tracked. Anomaly detection identifies unusual patterns. Incident counts increase further because degradation is now detected before it becomes severe. Mean time to detection drops below one minute for most problems. Problems that affect correctness without affecting performance (wrong results returned successfully, data corruption that appears successful, logic errors that complete normally) remain undetected.

At level three, monitoring covers business logic correctness. Success is defined not as "completed without error" but as "produced correct results." Synthetic transactions verify expected behavior. Reconciliation processes verify data consistency. Incident counts increase further because logic errors are now detected. Mean time to detection for all classes of problems is under one minute. The problems that remain undetected are those that affect so few requests or occur so rarely that they fall below monitoring sensitivity thresholds.

At level four, monitoring is predictive. Systems detect problems before they occur by identifying patterns that precede failures. Capacity exhaustion is predicted hours before it happens. Cascade failures are detected in early stages before they amplify. Incident counts are highest because the definition of an incident now includes "something that would have caused a problem if left unaddressed." Mean time to detection is negative: problems are resolved before they manifest.
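As one simple illustration of level-four monitoring, capacity exhaustion can be predicted by fitting a straight line to recent usage samples and extrapolating. Real predictive systems are usually more sophisticated; the linear model, the function name, and the sample data here are assumptions for the sketch.

```python
from typing import Optional


def hours_until_full(samples: list[tuple[float, float]],
                     capacity: float) -> Optional[float]:
    """Estimate hours until `capacity` is reached, from (hour, used) samples.

    Fits a least-squares line and extrapolates. Returns None when usage is
    flat or shrinking, or when the samples do not span any time.
    """
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / var_t
    if slope <= 0:
        return None
    intercept = mean_u - slope * mean_t
    latest_t = max(t for t, _ in samples)
    return (capacity - intercept) / slope - latest_t


# Hypothetical disk usage in GB, sampled hourly, on a 100 GB volume.
usage = [(0.0, 50.0), (1.0, 52.0), (2.0, 54.0), (3.0, 56.0)]
print(hours_until_full(usage, capacity=100.0))  # 22.0
```

An alert raised when this estimate drops below some lead time is an incident by the level-four definition: nothing has failed yet, but something would if left unaddressed.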

A cloud infrastructure company tracked their progression through these levels over four years. In year one, they reported an average of 8 incidents per quarter. In year two, after implementing comprehensive infrastructure monitoring, they reported 23 incidents per quarter. In year three, after adding application metrics and anomaly detection, they reported 41 incidents per quarter. In year four, after implementing business logic monitoring and predictive analysis, they reported 67 incidents per quarter. Customer-reported incidents declined from 6 per quarter in year one to zero in year four. Actual reliability improved continuously even as reported incident counts increased.

The Resolution Speed Trade-Off

Organizations with mature observability resolve incidents faster than organizations with basic monitoring, even when the underlying problems are identical. This is because time to detection and time to diagnosis are the dominant components of total incident duration.

Consider an incident where a database query begins timing out due to lock contention. In a system with basic monitoring, the timeline is: thirty minutes until customer complaints accumulate sufficiently to trigger support escalation, fifteen minutes for support to gather information and escalate to engineering, twenty minutes for engineering to identify which service is having problems, fifteen minutes to identify which database queries are timing out, thirty minutes to identify why (lock contention from a long-running analytical query), and five minutes to kill the problematic query. Total incident duration: one hundred fifteen minutes.

In a system with mature observability, the timeline is: ten seconds until query latency anomaly detection triggers an alert, thirty seconds for the on-call engineer to acknowledge and open the incident dashboard, forty-five seconds to review database query profiles that automatically show lock contention, thirty seconds to identify the long-running query, and fifteen seconds to kill it. Total incident duration: two minutes ten seconds.

The underlying technical problem was identical. The difference was observability. The basic monitoring system spent one hundred ten minutes detecting and diagnosing a problem that mature observability detected and diagnosed in one hundred fifteen seconds. Measured end to end, the incident lasted roughly fifty-three times longer (115 minutes versus 130 seconds), and that difference compounds across all incidents throughout the year. Over one hundred such incidents per year, the organization with mature observability might spend about 217 minutes total (roughly 3.6 hours) resolving incidents that would consume 11,500 minutes (roughly 192 hours) at the same organization with basic monitoring.
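The arithmetic behind this comparison can be checked by summing the phase durations listed in the two timelines. The phase labels below are shorthand for the steps in the text; the variable names are otherwise arbitrary.

```python
# Phase durations taken from the two timelines above.
basic_minutes = {
    "customer complaints accumulate": 30,
    "support escalates to engineering": 15,
    "identify the affected service": 20,
    "identify the slow queries": 15,
    "identify lock contention": 30,
    "kill the problematic query": 5,
}
mature_seconds = {
    "anomaly alert fires": 10,
    "on-call acknowledges": 30,
    "review query profiles": 45,
    "identify long-running query": 30,
    "kill the query": 15,
}

basic_total_s = sum(basic_minutes.values()) * 60
mature_total_s = sum(mature_seconds.values())
ratio = basic_total_s / mature_total_s

incidents_per_year = 100
basic_annual_min = basic_total_s * incidents_per_year / 60
mature_annual_min = mature_total_s * incidents_per_year / 60

print(basic_total_s // 60, "minutes vs", mature_total_s, "seconds")
print(f"ratio: {ratio:.1f}x")
print(f"annual: {basic_annual_min:.0f} min vs {mature_annual_min:.0f} min")
```

Almost all of the gap comes from the phases before the fix: the final step takes five minutes in one timeline and fifteen seconds in the other, while everything leading up to it takes 110 minutes versus 115 seconds.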

This efficiency difference has economic value beyond just engineering time saved. Incidents that are resolved in two minutes typically affect zero customers because the problem is fixed before customers notice. Incidents that are resolved in two hours affect thousands of customers, generate support tickets, damage customer satisfaction, and sometimes result in customer churn. The customer impact difference between a two-minute incident and a two-hour incident might be ten thousand customers versus zero customers, a difference that translates to measurable revenue impact.

The Cost Justification

Organizations sometimes resist investing in observability because the costs are visible and immediate while the benefits are diffuse and counterfactual. A comprehensive observability platform might cost $200,000 annually in tooling costs and require two engineers focused on maintaining monitoring infrastructure. Demonstrating that this investment is worthwhile requires quantifying incidents that did not occur.

One approach is to analyze the cost of historical incidents and estimate how improved observability would have changed their impact. A company reviewed their ten most expensive incidents from the previous year. Total cost (engineering time, customer impact, and opportunity cost) was approximately $2.1 million. They estimated that with better observability, seven of those ten incidents would have been detected before they caused significant customer impact, reducing total cost to approximately $400,000. The $1.7 million difference provided clear ROI justification for a $200,000 investment in observability.

Another approach is to measure the value of early detection directly. A company implemented detailed monitoring for a subset of their services while maintaining basic monitoring for others. Over six months, the services with detailed monitoring had a mean time to detection of 1.8 minutes and mean customer impact of 34 users per incident. The services with basic monitoring had a mean time to detection of 38 minutes and mean customer impact of 1,847 users per incident. The fifty-four-fold difference in customer impact provided empirical evidence of observability value.

A third approach is to compare operational costs across companies with different observability maturity. Industry benchmarks suggest that companies with mature observability spend approximately 8% of engineering time on incident response and remediation, while companies with basic monitoring spend approximately 23% of engineering time on the same activities. For a company with fifty engineers at a fully loaded cost of $150,000 per engineer annually, this difference represents $1.125 million in annual productivity savings ($7.5 million total engineering cost times 15 percentage point difference). Against this savings, even substantial observability investment shows rapid payback.
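The figures in the first and third approaches reduce to short calculations, reproduced here so the inputs can be swapped for an organization's own numbers. The function names are illustrative; the inputs are the ones quoted above.

```python
def avoided_incident_cost(historical_cost: int, estimated_cost_with_observability: int) -> int:
    """First approach: cost of past incidents minus the estimated cost with better detection."""
    return historical_cost - estimated_cost_with_observability


def productivity_savings(engineers: int, loaded_cost_per_engineer: int,
                         basic_pct: int, mature_pct: int) -> int:
    """Third approach: engineering cost times the percentage-point difference
    in time spent on incident response. Integer math keeps the dollars exact."""
    total_engineering_cost = engineers * loaded_cost_per_engineer
    return total_engineering_cost * (basic_pct - mature_pct) // 100


print(avoided_incident_cost(2_100_000, 400_000))    # 1700000
print(productivity_savings(50, 150_000, 23, 8))      # 1125000
```

Both results are estimates built on counterfactuals or benchmarks, so they are best read as orders of magnitude rather than precise forecasts.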

The Cultural Challenge

Increasing incident counts through better observability creates a cultural challenge. Engineers and executives are conditioned to see increasing incident counts as evidence of declining reliability. Convincing them that the opposite is true requires education and careful framing of metrics.

A technology company implemented comprehensive observability and saw their incident count increase from 32 per quarter to 89 per quarter. The engineering leadership understood that this represented improved detection, but the executive team saw it as evidence that engineering quality had degraded. The VP of Engineering was asked to explain why reliability appeared to be declining despite significant investment in infrastructure and tooling.

The VP prepared a presentation comparing two metrics: reported incidents (which had increased) and customer-detected incidents (which had decreased from 18 per quarter to 2 per quarter). She showed that mean time to detection had dropped from 42 minutes to 1.3 minutes. She showed that mean customer impact had dropped from 2,400 users per incident to 12 users per incident. She showed that support tickets related to reliability had decreased by 68%. The narrative shifted from "why are we having more incidents?" to "why were we not detecting incidents before?"

The company adopted new reporting that emphasized customer-facing reliability metrics: customer-detected incidents, mean customer impact per incident, and total customer-hours of degraded service. Internal operational metrics including reported incidents and mean time to detection were still tracked but were not used to evaluate engineering effectiveness. This reframing allowed the engineering team to continue improving observability without fearing that better detection would be interpreted as worse reliability.

The Audit Vulnerability

Companies with poor observability appear reliable until they undergo external audits. Due diligence for acquisitions, enterprise customer security reviews, and regulatory compliance audits often reveal that apparent reliability was actually invisibility.

During an acquisition, a buyer's technical due diligence team asked to review incident logs, monitoring systems, and operational procedures. The target company had reported an average of 6 incidents per quarter over the previous two years. The due diligence team's first observation was that mean time to detection was forty-three minutes. This is unusually long and suggests that incidents are detected by customers rather than monitoring.

The due diligence team asked to see the monitoring dashboard. The target company's dashboard showed CPU utilization, memory usage, and disk space for servers. It did not show application-level metrics, error rates, latency percentiles, or business logic health. The due diligence team asked about observability for data correctness. The target company had no such monitoring. The due diligence team asked about synthetic transactions to verify critical flows. The target company had none.

The due diligence team's assessment: "The target company's low incident count reflects undetected problems rather than high reliability. Based on industry benchmarks for similar companies with mature observability, we estimate the target company is experiencing 40-60 incidents per quarter that remain undetected. These undetected incidents represent significant technical debt and customer satisfaction risk." The buyer reduced their offer by $4 million to account for the estimated cost of implementing proper observability and remediating accumulated technical debt.

This pattern repeats frequently. Companies with poor observability look reliable in normal operations but fail to survive scrutiny during audits. Conversely, companies with excellent observability may have high reported incident counts that concern auditors until the auditors understand that comprehensive detection is the cause of high incident counts, not poor reliability.

The Incident Learning Value

Incidents detected through comprehensive monitoring provide learning opportunities that customer-detected incidents do not. When a problem is caught before customers are affected, engineers can investigate thoroughly without time pressure. When a problem is detected through customer complaints during a major outage, investigation happens under extreme stress with pressure to restore service immediately.

A streaming media company used this dynamic deliberately. When their monitoring detected anomalies that did not require immediate action (degraded performance that was still within SLA, unusual traffic patterns that might indicate problems, resource utilization trends that could become problematic), they classified these as "learning incidents" rather than operational incidents. The learning incident process required investigation and documentation but did not require immediate resolution.

Over one year, they conducted sixty-seven learning incident investigations. Forty-three revealed actual problems that would have eventually caused operational incidents. Seventeen revealed monitoring false positives that led to improved alert tuning. Seven revealed architectural issues that were addressed during planned refactoring. The learning incident program created a pipeline where problems were discovered and addressed before they became urgent.

The company estimated that the learning incident program prevented approximately $800,000 in incident response costs and customer impact. The cost of the program (engineering time spent investigating learning incidents) was approximately $180,000. The ROI was clear. More importantly, the program created a culture where engineers viewed increased incident detection as positive (more opportunities to improve the system) rather than negative (evidence of poor quality).

The Reporting Strategy

Organizations can report incidents in ways that accurately represent reliability without creating perverse incentives. The key is to separate operational incidents (detected through monitoring, often with minimal customer impact) from customer incidents (detected through customer reports, typically with significant customer impact).

A technology company implemented a three-tier incident classification. Tier 1 incidents were customer-detected: problems that customers reported before monitoring detected them. These represented monitoring gaps and were treated as high-severity reliability problems. Tier 2 incidents were monitoring-detected with customer impact: problems that monitoring detected but not quickly enough to prevent customer impact. These represented opportunities to improve detection speed. Tier 3 incidents were monitoring-detected without customer impact: problems that monitoring detected and that were resolved before customers were affected. These represented successful observability.

The company's executive reporting showed Tier 1 incidents (the number everyone wanted to minimize) and total customer impact hours (the business metric that mattered). Internal engineering reporting showed all three tiers and used them to track observability maturity. Over two years, Tier 1 incidents dropped from 14 per quarter to 1 per quarter. Tier 2 incidents dropped from 8 per quarter to 3 per quarter. Tier 3 incidents increased from 12 per quarter to 47 per quarter. Total incidents increased from 34 to 51, but reliability improved dramatically by every business metric.
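The three-tier scheme amounts to a two-question classifier: who detected the problem, and were customers affected? A minimal sketch follows, with hypothetical names and a hypothetical incident representation of (detected_by_customer, customers_affected) pairs.

```python
from collections import Counter
from enum import Enum


class Tier(Enum):
    CUSTOMER_DETECTED = 1       # Tier 1: a monitoring gap
    MONITORED_WITH_IMPACT = 2   # Tier 2: detected, but not fast enough
    MONITORED_NO_IMPACT = 3     # Tier 3: successful observability


def classify(detected_by_customer: bool, customers_affected: int) -> Tier:
    """Assign a tier based on who found the problem and whether customers felt it."""
    if detected_by_customer:
        return Tier.CUSTOMER_DETECTED
    if customers_affected > 0:
        return Tier.MONITORED_WITH_IMPACT
    return Tier.MONITORED_NO_IMPACT


def tier_counts(incidents: list[tuple[bool, int]]) -> Counter:
    """Tally incidents per tier for the quarterly report."""
    return Counter(classify(by_customer, affected) for by_customer, affected in incidents)


quarter = [(True, 500), (False, 40), (False, 0), (False, 0)]
counts = tier_counts(quarter)
print(counts[Tier.CUSTOMER_DETECTED], counts[Tier.MONITORED_WITH_IMPACT],
      counts[Tier.MONITORED_NO_IMPACT])  # 1 1 2
```

Executive reporting would surface only the first count plus customer impact hours, while engineering tracks all three.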

Conclusion

The paradox of observability is that improving detection capability often increases reported incident counts while simultaneously improving actual reliability. Organizations that understand this paradox measure reliability through customer impact rather than internal incident counts. They celebrate increased incident detection as evidence of improved observability rather than treating it as evidence of declining reliability.

Company A, with 127 incidents over the year, was acquired at a premium valuation. The buyer recognized that their incident count represented mature observability rather than poor reliability. Their mean time to detection of 2.3 minutes, and the fact that most incidents affected zero customers, demonstrated that they could detect and resolve problems before customers were meaningfully affected. The high incident count was a strength, not a weakness.

Company B, with 12 incidents over the same year, received a significantly reduced offer. The buyer's technical due diligence revealed that low incident counts reflected undetected problems rather than high reliability. Their mean time to detection of 47 minutes and customer impact of hundreds or thousands of users per incident demonstrated reactive rather than proactive operations. The buyer estimated that implementing proper observability would reveal an additional 50-80 incidents per quarter that were currently invisible.

The lesson is that zero incidents is not the goal. The goal is zero customer-detected incidents. Systems with mature observability may report dozens or hundreds of internal incidents per quarter while maintaining near-perfect customer-facing reliability. Systems with poor observability may report near-zero incidents while quietly failing in ways that customers notice but the monitoring does not.

Organizations should optimize for detection speed and customer impact, not for incident count. An incident detected in one minute and resolved in two minutes with zero customer impact represents excellent reliability, even if it counts as an incident in reporting. An undetected problem that affects customers for hours represents poor reliability, even if it never appears in incident logs.

The uncomfortable truth is that increasing your incident count is often evidence that you are doing things right. More incidents mean better detection. Better detection means faster resolution. Faster resolution means less customer impact. The incident that proves your monitoring is working is the incident that no customer ever noticed because you detected and resolved it before it affected them. Companies that understand this build better systems. Companies that do not are often surprised when external audits reveal how unreliable they actually are.
