Network Monitoring and Observability: How Engineers Know When Things Go Wrong
In the perfect world of theoretical networking, everything works all the time. Packets arrive. Connections succeed. Latency is low. Nobody calls IT. That's the dream.
In the real world, things break. Links go down. Routers get misconfigured. A new firmware update causes unexpected packet loss. A misconfigured firewall rule blocks legitimate traffic. A cable gets accidentally unplugged. Bandwidth gets saturated at noon every day because everyone streams video at lunch. Networks are complex, dynamic systems, and they fail in surprising ways.
The discipline of network monitoring and observability is how engineers stay ahead of these problems. It's the difference between finding out a service is down because users complain versus finding out thirty seconds after a metric exceeds a threshold, before most users have even noticed.
Monitoring vs. Observability: What's the Difference?
These terms are related but distinct:
Monitoring is the practice of collecting predefined metrics about your systems and alerting when they cross certain thresholds. It answers questions you already know to ask: "Is this router's CPU over 90%?" "Is the link utilization over 80%?" "Is this service responding?"
Observability is a broader property of a system — how well you can understand its internal state from its external outputs. An observable system lets you answer questions you *didn't* anticipate. When something goes wrong in an unexpected way, observability is what lets you figure out what happened and why, even if you never specifically set up monitoring for that particular failure mode.
In practice, both are important. Monitoring catches the known issues automatically. Observability provides the data you need to investigate the unknowns.
The Three Pillars of Observability
Modern observability is often described as having three pillars:
Metrics: Numerical measurements collected at regular intervals. CPU utilization percentage. Packets per second. Latency in milliseconds. Error rate per minute. Bandwidth consumed. Metrics are highly efficient to collect and store, and they are excellent for trending, alerting, and dashboards.
Logs: Text records of events. "Connection accepted from 192.168.1.100 at 14:23:05." "OSPF neighbor relationship established with 10.0.0.1." "Packet dropped: destination unreachable." Logs provide detail and context that metrics can't. They're more expensive to store and search but invaluable for diagnosis.
Traces: Records of the path of individual requests through a distributed system. As a request passes through load balancers, application servers, databases, and microservices, each component adds to the trace. The result is a complete timeline of the request's journey, with timing for each step. Traces are essential for understanding latency problems in complex distributed systems.
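To make the idea of a trace concrete, here is a minimal sketch. It assumes a simplified trace in which each component contributes one span and the spans run one after another; real tracing systems also record nested and parallel spans. The `Span` class, the component names, and the timings are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """One component's share of a request (hypothetical, simplified)."""
    name: str
    start_ms: float
    end_ms: float

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms


# Made-up timeline for a single request, in milliseconds since arrival.
trace = [
    Span("load_balancer", 0.0, 2.0),
    Span("app_server", 2.0, 12.0),
    Span("database", 12.0, 95.0),
    Span("render", 95.0, 100.0),
]

total_ms = trace[-1].end_ms - trace[0].start_ms
slowest = max(trace, key=lambda span: span.duration_ms)

for span in trace:
    print(f"{span.name:<14}{span.duration_ms:6.1f} ms")
print(f"slowest step: {slowest.name} ({slowest.duration_ms:.1f} of {total_ms:.1f} ms)")
```

With per-step timing laid out like this, the slow component stands out immediately, which is the point of collecting traces at all.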
Key Network Metrics to Monitor
Not all metrics are equally important. Here are the ones network engineers care about most:
Bandwidth/Throughput: How much data is being transferred over a link per unit of time, typically measured in bits per second (bps, Mbps, Gbps). Comparing actual throughput to the link's maximum capacity tells you how close to saturation you are. Links running above 80% utilization for sustained periods are a concern — they introduce queuing delays and risk dropping packets.
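As a rough sketch of that comparison, utilization can be derived from two readings of an interface byte counter. The function name, the counter readings, and the 1 Gbps link speed below are assumptions chosen for illustration.

```python
def link_utilization(bytes_start: int, bytes_end: int,
                     interval_s: float, capacity_bps: float) -> float:
    """Average utilization (0.0 to 1.0) over one measurement interval."""
    if interval_s <= 0 or capacity_bps <= 0:
        raise ValueError("interval and capacity must be positive")
    bits_transferred = (bytes_end - bytes_start) * 8
    throughput_bps = bits_transferred / interval_s
    return throughput_bps / capacity_bps


# Hypothetical readings taken 300 seconds apart on a 1 Gbps link.
util = link_utilization(1_000_000_000, 31_000_000_000, 300, 1_000_000_000)
print(f"utilization: {util:.0%}")
if util > 0.80:
    print("above the 80% guideline")
```

Note that this is an average over the whole interval, so short bursts inside it are smoothed away.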
Latency: How long it takes for a packet to travel from source to destination. Round-trip time (RTT) includes the return journey. Low and consistent latency is good. High latency slows applications. *Variable* latency (jitter) is particularly bad for real-time applications like VoIP and video conferencing.
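One simple way to put a number on jitter is the average change between consecutive RTT samples. This is a sketch under that assumption (other definitions exist), and the sample values are invented so that both paths have the same average latency.

```python
from statistics import mean


def mean_jitter_ms(rtts_ms):
    """Average absolute change between consecutive RTT samples."""
    if len(rtts_ms) < 2:
        raise ValueError("need at least two samples")
    return mean(abs(b - a) for a, b in zip(rtts_ms, rtts_ms[1:]))


# Two hypothetical paths with the same mean RTT.
steady = [20.0, 21.0, 20.0, 21.0, 20.0]
bursty = [5.0, 40.0, 8.0, 45.0, 4.0]

for label, samples in (("steady", steady), ("bursty", bursty)):
    print(f"{label}: mean RTT {mean(samples):.1f} ms, "
          f"jitter {mean_jitter_ms(samples):.2f} ms")
```

Both paths look identical on an average-latency dashboard, but only one of them would carry a voice call comfortably.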
Packet Loss: The percentage of transmitted packets that never arrive at the destination. On a healthy wired network, sustained packet loss is abnormal and warrants investigation. Even 0.1% packet loss can significantly degrade TCP performance, because TCP must retransmit the lost packets and also treats loss as a sign of congestion, cutting its sending rate in response.
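To get a feel for why small loss rates matter, one widely cited approximation is the Mathis et al. model, which bounds steady-state TCP throughput by roughly (MSS / RTT) * (C / sqrt(p)). It is brought in here purely as an illustration: it assumes classic Reno-style congestion control, and the MSS, RTT, and constant C = 1.22 below are assumed values, not measurements.

```python
from math import sqrt


def mathis_throughput_bps(mss_bytes: int, rtt_s: float,
                          loss_rate: float, c: float = 1.22) -> float:
    """Approximate upper bound on TCP throughput under random loss."""
    if not 0 < loss_rate <= 1:
        raise ValueError("loss_rate must be in (0, 1]")
    if rtt_s <= 0:
        raise ValueError("rtt_s must be positive")
    return (mss_bytes * 8 / rtt_s) * (c / sqrt(loss_rate))


# Assumed path: 1460-byte MSS, 50 ms round-trip time.
for loss in (0.00001, 0.0001, 0.001, 0.01):
    mbps = mathis_throughput_bps(1460, 0.050, loss) / 1e6
    print(f"loss {loss:.3%}: at most about {mbps:.1f} Mbps")
```

Under these assumptions the bound at 0.1% loss comes out near 9 Mbps regardless of how fast the link is, and every hundredfold increase in loss costs a factor of ten in throughput.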
Error Rates: The rate of various types of network errors. CRC errors indicate physical layer problems (bad cables, connectors, or interfaces). Input/output errors, giants (oversized frames), and runts (undersized frames) all indicate different types of issues.
Interface Utilization: Per-interface traffic in both directions. Asymmetric utilization (one direction much busier than the other) can indicate unusual traffic patterns.
CPU and Memory on Network Devices: High CPU on a router can cause routing protocol instability. High memory usage can lead to dropped packets.
BGP Session State: For networks that peer with other networks using BGP, monitoring BGP session state is critical — a dropped BGP session means routes are withdrawn and traffic flows change dramatically.
SNMP: The Classic Protocol
The most established protocol for collecting network device metrics is SNMP (Simple Network Management Protocol). It allows a monitoring system (the manager) to query network devices (the agents) for statistics.
Network devices expose data through a Management Information Base (MIB) — a hierarchical database of variables (like interface packet counters, CPU utilization, memory usage). Each variable has a unique OID (Object Identifier) — a dotted number sequence like `1.3.6.1.2.1.1.3.0` that uniquely identifies that variable across all SNMP-enabled devices.
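The hierarchy is easier to see in code. The sketch below parses an OID into its numeric components and looks up the longest matching prefix in a tiny hand-written table. The table holds only a few well-known standard prefixes; a real MIB browser loads full MIB files, and the helper names here are hypothetical.

```python
def parse_oid(oid: str) -> tuple[int, ...]:
    """Turn '1.3.6.1.2.1.1.3.0' into a tuple of integers."""
    return tuple(int(part) for part in oid.strip(".").split("."))


# A very small subset of the standard MIB-2 tree, written out by hand.
KNOWN_PREFIXES = {
    (1, 3, 6, 1, 2, 1): "mib-2",
    (1, 3, 6, 1, 2, 1, 1): "system",
    (1, 3, 6, 1, 2, 1, 1, 3): "sysUpTime",
    (1, 3, 6, 1, 2, 1, 2): "interfaces",
}


def longest_known_prefix(oid: str):
    """Return (name, remaining components) for the deepest known ancestor."""
    parts = parse_oid(oid)
    for length in range(len(parts), 0, -1):
        name = KNOWN_PREFIXES.get(parts[:length])
        if name is not None:
            return name, parts[length:]
    return None, parts


print(longest_known_prefix("1.3.6.1.2.1.1.3.0"))
# ifInOctets for interface index 3 lives under the interfaces subtree.
print(longest_known_prefix("1.3.6.1.2.1.2.2.1.10.3"))
```

Every OID is a path from the root of one shared tree, which is why a prefix lookup like this is enough to tell which part of the device a variable describes.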
A monitoring system can poll (query) devices every few minutes for specific OIDs. It can also receive SNMP traps — unsolicited notifications that devices send when something happens (a link goes down, a threshold is exceeded, a configuration change occurs).
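Many of the values a poller collects are counters that only ever increase, so the useful number is the difference between two polls. A minimal sketch, assuming a 32-bit counter that wraps back to zero at most once between polls; the function names and sample readings are made up.

```python
def counter_delta(previous: int, current: int, bits: int = 32) -> int:
    """Increase between two counter samples, allowing for one wrap."""
    return (current - previous) % (1 << bits)


def rate_per_second(previous: int, current: int,
                    interval_s: float, bits: int = 32) -> float:
    """Average rate over the polling interval."""
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    return counter_delta(previous, current, bits) / interval_s


# Normal case: the counter simply grew between polls.
print(counter_delta(100, 600))
# Wrap case: the counter passed 2**32 and started again from zero.
print(counter_delta(4_294_967_000, 704))
print(f"{rate_per_second(4_294_967_000, 704, 300):.2f} per second")
```

Without the modulo step, a wrapped counter would show up as a huge negative rate, which is a common source of nonsense spikes on graphs. On fast links a 32-bit counter can wrap more than once per poll, which this sketch cannot detect.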
SNMP has been around since the 1980s and is supported by virtually every network device. SNMPv3 added authentication and encryption, addressing the major security weaknesses of earlier versions. Despite being old, SNMP remains widely deployed.
Modern Alternatives to SNMP
SNMP's polling model has limitations: if you poll every 5 minutes, you miss spikes that happen between polls. The MIB structure is also complex and inconsistent across vendors.
Streaming telemetry is a modern alternative. Instead of waiting to be polled, devices push metrics continuously to a collection server, at intervals that can be as short as a few seconds or less. This gives far more granular, near-real-time visibility. gNMI (gRPC Network Management Interface) is a common protocol for streaming telemetry from modern network hardware.
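The effect of a coarse polling interval is easy to demonstrate with invented numbers. The sketch below builds one 5-minute window of per-second utilization containing a 30-second burst, then compares what a single 5-minute average reports with what actually happened.

```python
# Hypothetical per-second link utilization across one 5-minute window.
per_second = [0.20] * 300
per_second[100:130] = [0.98] * 30  # a 30-second burst near saturation

five_min_average = sum(per_second) / len(per_second)
peak = max(per_second)

print(f"what a 5-minute poll reports: {five_min_average:.1%}")
print(f"what the link actually hit:   {peak:.0%}")
```

The poll sees a comfortable link at under 30% while users spent half a minute on a saturated one.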
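NetFlow / sFlow / IPFIX: These protocols capture metadata about network *flows*: not raw packets, but summaries of conversations. For each flow (defined by source and destination IP address, source and destination port, and protocol), NetFlow and IPFIX record bytes transferred, packet count, start/end time, and other attributes. sFlow reaches a similar picture by exporting sampled packet headers and interface counters rather than per-flow records. This data is sent to a collector for analysis.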
Flow data is invaluable for understanding traffic patterns: "Where is all this bandwidth going?" "What are the top talkers on the network?" "Is there any unusual traffic to unusual destinations?" Unlike full packet capture (which is expensive and privacy-sensitive), flow data is summarized and manageable at scale.
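Answering the "top talkers" question from flow records is mostly a matter of grouping and summing. A minimal sketch, assuming already-decoded records; the `Flow` tuple, its field names, and the addresses (taken from documentation ranges) are hypothetical and do not follow any particular export format.

```python
from collections import Counter, namedtuple

Flow = namedtuple(
    "Flow", "src dst src_port dst_port proto byte_count packet_count"
)

# Made-up flow records as a collector might hold them after decoding.
flows = [
    Flow("10.0.0.5", "203.0.113.9", 51000, 443, "tcp", 7_000_000, 5200),
    Flow("10.0.0.7", "198.51.100.4", 51001, 443, "tcp", 1_200_000, 900),
    Flow("10.0.0.5", "198.51.100.4", 51002, 53, "udp", 4_000, 12),
    Flow("10.0.0.9", "203.0.113.9", 51003, 443, "tcp", 3_500_000, 2600),
]


def top_talkers(records, n=3):
    """Sum bytes per source address and return the n largest senders."""
    totals = Counter()
    for flow in records:
        totals[flow.src] += flow.byte_count
    return totals.most_common(n)


for address, total in top_talkers(flows, n=2):
    print(f"{address}: {total / 1e6:.2f} MB")
```

Grouping by destination, port, or protocol instead is a one-line change, which is what makes flow data so flexible for ad hoc questions.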
Network Monitoring Tools and Platforms
The tooling ecosystem for network monitoring is large. A few key categories and examples:
SNMP-based monitoring: Nagios, Icinga, LibreNMS, Zabbix. These are open-source platforms that poll devices via SNMP, store metrics, display dashboards, and send alerts. LibreNMS is particularly good at automatic device discovery and ships with a comprehensive default dashboard.
Time-series databases and visualization: InfluxDB, Prometheus, Grafana. Modern network monitoring often uses a time-series database to store metrics and Grafana to build dashboards. Prometheus is excellent for scraping metrics from modern systems; InfluxDB handles high-volume metric ingestion well. Grafana's visualization is beautiful and highly customizable.
Flow analysis: ntopng, ElastiFlow (processes flow data and sends it to Elasticsearch), Kentik. These tools analyze NetFlow/sFlow/IPFIX data to give you visibility into traffic patterns at scale.
Packet analysis: Wireshark (the gold standard for detailed packet inspection), tcpdump (command-line packet capture). These are for deep-dive investigation of specific problems, not continuous monitoring.
Cloud-native monitoring: AWS CloudWatch, Azure Monitor, Google Cloud Monitoring. If you're running in the cloud, these platform-native services provide deep integration with the platform's networking (VPC flow logs, load balancer metrics, CDN metrics) without requiring separate infrastructure.
Commercial NPMD (Network Performance Monitoring and Diagnostics): SolarWinds, PRTG, Datadog (has strong network monitoring capabilities), Dynatrace. These tend to be more comprehensive and easier to set up than open-source alternatives, but come with significant cost.
Setting Up Effective Alerting
Monitoring without alerting is just data collection. Alerting without careful thought generates alert fatigue — so many notifications that engineers learn to ignore them, defeating the purpose.
Principles for effective alerting:
Alert on symptoms, not causes. Alert when users are affected (service latency too high, error rate elevated) rather than on every individual component metric. Many internal problems resolve themselves quickly; what matters is whether users notice.
Set meaningful thresholds. A 90% bandwidth utilization on a link is probably concerning. 70% might be perfectly normal for that link. Thresholds should be based on knowledge of normal behavior, not arbitrary percentages.
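Avoid flapping. An alert that fires, recovers immediately, fires again, and recovers again, over and over, is more annoying than useful. Require the condition to persist: alert when a metric exceeds a threshold for 5 consecutive minutes, not on the first data point. Adding hysteresis helps too: clear the alert only once the metric has dropped below a second, lower threshold.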
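One way to sketch flap suppression, assuming one sample per minute. `DebouncedAlert` is a hypothetical class, not any monitoring tool's API, and the thresholds are invented. It fires only after the value has stayed above the threshold for a set number of consecutive samples, and it clears only when the value falls below a separate, lower threshold.

```python
class DebouncedAlert:
    """Alert state with a persistence requirement and a lower clear level."""

    def __init__(self, fire_above: float, clear_below: float, hold_samples: int):
        if clear_below > fire_above:
            raise ValueError("clear_below must not exceed fire_above")
        self.fire_above = fire_above
        self.clear_below = clear_below
        self.hold_samples = hold_samples
        self.firing = False
        self._consecutive_over = 0

    def update(self, value: float) -> bool:
        """Feed one sample and return whether the alert is firing."""
        if self.firing:
            # Stay firing until the value drops below the lower threshold.
            if value < self.clear_below:
                self.firing = False
                self._consecutive_over = 0
        else:
            if value > self.fire_above:
                self._consecutive_over += 1
            else:
                self._consecutive_over = 0
            if self._consecutive_over >= self.hold_samples:
                self.firing = True
        return self.firing


# Made-up utilization samples, one per minute.
samples = [0.85, 0.95, 0.85, 0.92, 0.93, 0.94, 0.95, 0.96, 0.88, 0.79, 0.91]
alert = DebouncedAlert(fire_above=0.90, clear_below=0.80, hold_samples=5)
states = [alert.update(value) for value in samples]
print(states)
```

The single spike to 0.95 in the second minute is ignored, the alert fires only on the fifth consecutive sample above 0.90, and the dip to 0.88 does not clear it.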
Define severity clearly. A critical alert should mean "wake someone up right now." A warning should mean "investigate during business hours." If everything is critical, nothing is.
Alert on anomalies, not just thresholds. Machine learning-based anomaly detection can identify when a metric is behaving unusually compared to its historical pattern — even if it hasn't crossed a fixed threshold. This catches gradual degradation and unusual patterns that threshold-based alerting misses.
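Production anomaly detection tends to use more sophisticated models, but the core idea can be shown with a plain statistical baseline. The sketch below is not machine learning: it is a simple z-score check, swapped in for illustration, that flags a value sitting far from the historical mean. The history values and the threshold of 3 standard deviations are assumptions.

```python
from statistics import mean, stdev


def is_anomalous(history, latest, z_threshold=3.0):
    """Flag a value more than z_threshold standard deviations from the mean."""
    if len(history) < 2:
        raise ValueError("need at least two historical values")
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return latest != mu
    return abs(latest - mu) / sigma > z_threshold


# Hypothetical utilization (%) at the same hour on previous days.
history = [40, 42, 41, 39, 40, 43, 41, 40]

print(is_anomalous(history, 42))  # within the usual range
print(is_anomalous(history, 55))  # unusual, yet far below an 80% threshold
```

A reading of 55% would never trip a fixed 80% threshold, but against this history it is clearly out of character and worth a look.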
Capacity Planning: Looking Forward
Monitoring isn't just about catching problems — it's about predicting them. Capacity planning uses historical trend data to forecast when resources will be exhausted.
If a network link is averaging 40% utilization today and growing by 5 percentage points per month, it will hit 80% in about eight months. That's when you should be ordering the upgrade, not when it's already at 95% and users are complaining.
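The arithmetic behind that forecast can be sketched as follows. The eight-month figure corresponds to linear growth of 5 percentage points per month; the compound reading, where traffic grows by 5% of its current level each month, is shown for comparison. Both function names are hypothetical, and real traffic rarely follows either curve exactly.

```python
from math import log


def months_until_linear(current_pct: float, target_pct: float,
                        points_per_month: float) -> float:
    """Months to reach target if utilization rises by a fixed number of points."""
    return (target_pct - current_pct) / points_per_month


def months_until_compound(current_pct: float, target_pct: float,
                          rate_per_month: float) -> float:
    """Months to reach target if utilization grows by a fixed fraction of itself."""
    return log(target_pct / current_pct) / log(1 + rate_per_month)


linear = months_until_linear(40, 80, 5)
compound = months_until_compound(40, 80, 0.05)
print(f"linear growth:   {linear:.1f} months")
print(f"compound growth: {compound:.1f} months")
```

The two readings differ by about half a year, so it is worth checking which growth pattern the historical data actually shows before committing to an upgrade date.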
Trend data from monitoring systems feeds directly into capacity planning. This is why long-term metric retention matters even though old data seems less valuable — those historical trends are exactly what you need for capacity forecasting.
The Culture of Observability
The best monitoring strategy in the world doesn't help if the culture doesn't support it. Teams that build observable systems from the start — instrumenting their code, logging meaningful events, defining SLOs (Service Level Objectives) and tracking them — are far more effective than teams that add monitoring as an afterthought when something breaks.
Blameless post-mortems: When incidents happen (and they will), the focus should be on understanding what happened and how to prevent it, not on assigning blame. This culture encourages honest documentation of failures, which drives real improvement.
On-call rotations: Someone needs to be responsible for responding to alerts. Well-designed on-call rotations spread this responsibility across a team and ensure the people on-call have the tools and access they need to diagnose and resolve issues quickly.
Runbooks: Documented procedures for common situations. When an alert fires at 2am, the on-call engineer shouldn't have to figure out from scratch what to do. A runbook with diagnostic steps and resolution procedures for known alert types dramatically reduces mean time to resolution.
Networks are living systems. They grow, they change, they fail, they recover. Monitoring and observability are how you maintain awareness of a system too complex to understand just by looking at it. They transform network management from a reactive, firefighting exercise into something more like a scientific discipline — where you observe, measure, understand, and proactively improve.
That's not just good engineering. That's the difference between keeping the lights on and building something you can actually be proud of.