Why Monitor?
Proactive monitoring catches problems before they cause outages. Here are the essential metrics every server administrator should track.
CPU Metrics
- Load average: `uptime` shows 1, 5, and 15-minute averages. Sustained values above your CPU core count indicate saturation.
- CPU utilization: use `mpstat -P ALL 1` to see per-core usage.
- IOWait: high iowait means processes are waiting for disk I/O.
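The load-versus-cores rule above can be scripted. This is a minimal sketch, not a definitive check: the function name `load_saturated` is made up for illustration, and the comparison is delegated to `awk` because bash arithmetic only handles integers.

```bash
# load_saturated LOAD CORES: exit 0 when LOAD is greater than CORES, 1 otherwise.
# (Hypothetical helper; awk is used for the floating-point comparison.)
load_saturated() {
    awk -v load="$1" -v cores="$2" 'BEGIN { exit !(load > cores) }'
}

# Example use on a Linux host, reading the 1-minute average from /proc/loadavg:
#   read -r load1 _ < /proc/loadavg
#   if load_saturated "$load1" "$(nproc)"; then echo "CPU saturated"; fi
```

A single reading above the core count can be a short burst, so it may be more useful to feed the function the 5 or 15-minute value.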
Memory Metrics
- Available memory: `free -h` shows total, used, and available. Watch "available", not "free" (Linux uses free RAM for caching).
- Swap usage: any swap activity on an SSD server indicates memory pressure.
- OOM kills: check `dmesg | grep -i "out of memory"`.
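To track "available" rather than "free" in a script, one option is to read `MemAvailable` from `/proc/meminfo`. The sketch below assumes a Linux kernel new enough to report that field (3.14 or later); the function name `available_pct` and its optional file argument are illustrative choices, not part of any standard tool.

```bash
# available_pct [FILE]: print available memory as a whole-number percentage of total.
# FILE defaults to /proc/meminfo; passing another path makes the function easy to test.
available_pct() {
    awk '/^MemTotal:/     { total = $2 }
         /^MemAvailable:/ { avail = $2 }
         END { if (total > 0) printf "%d\n", avail * 100 / total }' "${1:-/proc/meminfo}"
}

# Example use:
#   echo "Available memory: $(available_pct)%"
```

The percentage is truncated, not rounded, which errs on the side of reporting slightly less available memory.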
Disk Metrics
- Space usage: `df -h` shows filesystem usage. Alert at 80%.
- Inode usage: `df -i`. Running out of inodes prevents creating new files even with space available.
- I/O throughput: `iostat -x 1` shows read/write rates and queue depth.
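The 80% alert threshold can be applied automatically by filtering `df` output. This is a rough sketch under two assumptions: the input is in the POSIX format produced by `df -P` (one line per filesystem, usage in the fifth column, mount point in the sixth), and no mount point contains spaces. The name `over_threshold` is hypothetical.

```bash
# over_threshold LIMIT: read `df -P` output on stdin and print every mount point
# whose usage percentage is at or above LIMIT, followed by that percentage.
over_threshold() {
    awk -v limit="$1" 'NR > 1 {
        pct = $5
        sub(/%/, "", pct)                      # strip the trailing percent sign
        if (pct + 0 >= limit + 0) print $6 " " pct "%"
    }'
}

# Example use:
#   df -P | over_threshold 80
```

On systems with GNU `df`, piping `df -Pi` into the same function should flag inode exhaustion as well, since the inode report keeps the percentage in the fifth column.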
Network Metrics
- Bandwidth: `nload` or `iftop` for real-time traffic.
- Connection count: `ss -s` shows connection statistics.
- Packet errors: `ip -s link` shows error counters.
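For scripted checks, parsing `ip -s link` output is awkward. On Linux the kernel also exposes per-interface counters as plain files under `/sys/class/net/<interface>/statistics/`, which is what this sketch reads instead. The function name `iface_errors` and its optional base-directory argument are assumptions made for illustration.

```bash
# iface_errors IFACE [SYSFS_DIR]: print "rx_errors tx_errors" for IFACE.
# SYSFS_DIR defaults to /sys/class/net; overriding it is only meant for testing.
iface_errors() {
    local dir="${2:-/sys/class/net}/$1/statistics"
    printf '%s %s\n' "$(cat "$dir/rx_errors")" "$(cat "$dir/tx_errors")"
}

# Example use (eth0 is a placeholder interface name):
#   read -r rx tx < <(iface_errors eth0)
#   echo "eth0 errors: rx=$rx tx=$tx"
```

These counters are cumulative since the interface came up, so an alert is usually better based on the change between two readings than on the absolute value.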
Quick Health Check Script
#!/bin/bash
echo "=== Load ===" && uptime
echo "=== Memory ===" && free -h
echo "=== Disk ===" && df -h /
echo "=== Top Processes ===" && ps aux --sort=-%mem | head -6