Performance profiling is the foundation of meaningful optimization. Rather than guessing where bottlenecks exist, Linux perf combined with Brendan Gregg's FlameGraphs provides a visual, data-driven approach to understanding exactly where your application spends CPU time. This guide covers installing and using perf, generating FlameGraphs, and interpreting results to fix real performance issues.
Installing perf and FlameGraph Tools
The perf tool ships with the Linux kernel tools package. Install it alongside the FlameGraph scripts:
# Ubuntu/Debian
sudo apt install linux-tools-common linux-tools-$(uname -r) linux-tools-generic
# RHEL/AlmaLinux
sudo dnf install perf
# Clone FlameGraph repository
cd /opt
sudo git clone https://github.com/brendangregg/FlameGraph.git
sudo chmod +x /opt/FlameGraph/*.pl
Verify the installation by checking versions:
perf --version
# perf version 6.x.x
ls /opt/FlameGraph/stackcollapse-perf.pl
# Should exist
Basic CPU Profiling with perf
The most common use case is CPU profiling to find which functions consume the most processor time:
# Profile the entire system for 30 seconds
sudo perf record -ag -F 99 -- sleep 30
# Profile a specific process
sudo perf record -g -F 99 -p $(pgrep -f "your-app") -- sleep 30
# Profile a specific command
sudo perf record -g -F 99 -- ./your-application --args
Key flags explained:
- -g: Capture call graphs (stack traces) — essential for FlameGraphs
- -F 99: Sample at 99 Hz (avoids lockstep sampling artifacts at round numbers like 100)
- -a: Profile all CPUs system-wide
- -p PID: Profile a specific process
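As a rough sanity check on a capture, the expected sample count is approximately sampling rate times duration times active CPUs. A sketch (the CPU count below is a made-up example; on a real system use $(nproc)):

```shell
# Rough expected sample count for a system-wide profile:
# sampling rate (Hz) x duration (s) x number of CPUs.
RATE=99
DURATION=30
CPUS=4   # example value, not measured; substitute $(nproc)
EXPECTED=$((RATE * DURATION * CPUS))
echo "expect roughly $EXPECTED samples"
```

If perf report shows far fewer samples than this estimate, the system may have been mostly idle, or events may have been dropped (perf record warns about lost chunks).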
Reading perf report
After recording, examine results with perf report:
# Interactive TUI report
sudo perf report
# Text output sorted by overhead
sudo perf report --stdio --sort=comm,dso,symbol | head -50
Generating FlameGraphs
FlameGraphs transform perf stack traces into interactive SVG visualizations. The width of each box represents the proportion of CPU time spent in that function:
# Step 1: Record with perf
sudo perf record -ag -F 99 -- sleep 60
# Step 2: Generate the folded stack traces
sudo perf script | /opt/FlameGraph/stackcollapse-perf.pl > /tmp/out.folded
# Step 3: Generate the SVG
/opt/FlameGraph/flamegraph.pl /tmp/out.folded > /var/www/html/flamegraph.svg
# One-liner
sudo perf script | /opt/FlameGraph/stackcollapse-perf.pl | /opt/FlameGraph/flamegraph.pl > flamegraph.svg
Interpreting FlameGraphs
Understanding how to read a FlameGraph is critical:
- X-axis: represents the proportion of total samples (wider = more CPU time). It is NOT a timeline.
- Y-axis: shows stack depth — the bottom is the entry point, the top is the leaf function actually executing.
- Plateaus: wide flat tops indicate functions that directly consume CPU (optimization targets).
- Towers: narrow deep stacks suggest deep call chains but minimal CPU time per function.
- Color: random warm colors by default; no semantic meaning unless customized.
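The folded format that stackcollapse-perf.pl emits makes the width calculation concrete: each line is a semicolon-separated stack followed by a sample count, and a frame's width in the SVG is its share of total samples. A minimal sketch using made-up stacks (not real perf output):

```shell
# Hypothetical folded stacks in the "frame1;frame2;... count" format
# that stackcollapse-perf.pl produces.
cat > /tmp/demo.folded <<'EOF'
main;parse_input;read_file 30
main;parse_input;tokenize 10
main;compute;hash_block 60
EOF

# Each stack's width = its samples divided by total samples.
awk '{ total += $NF; count[$0] = $NF }
     END { for (s in count)
             printf "%5.1f%%  %s\n", 100 * count[s] / total, s }' /tmp/demo.folded | sort -rn
```

Here main;compute;hash_block would occupy 60% of the FlameGraph's width, which is exactly why wide boxes are the first places to look.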
Advanced Profiling Techniques
Off-CPU Analysis
Sometimes applications are slow not because of CPU usage but because they are waiting (I/O, locks, sleep). Off-CPU FlameGraphs reveal these waits:
# Record scheduler events
sudo perf record -e sched:sched_switch -a -g -- sleep 30
# Generate off-CPU FlameGraph
sudo perf script | /opt/FlameGraph/stackcollapse-perf.pl > /tmp/offcpu.folded
/opt/FlameGraph/flamegraph.pl --color=io --title="Off-CPU Time" /tmp/offcpu.folded > offcpu.svg
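One caveat: the sched_switch recipe above weights stacks by the number of context switches, not by how long each stack was blocked; time-weighted off-CPU analysis generally needs additional scheduler events or BPF-based tools. The shape of a time-weighted folded file can be sketched with made-up wait data (hypothetical stacks and durations, not real trace output):

```shell
# Hypothetical wait events: "stack microseconds_blocked". Off-CPU flame
# graphs should be weighted by blocked time, not by event count.
cat > /tmp/waits.txt <<'EOF'
main;db_query;read 1500
main;db_query;read 2500
main;log_flush;write 800
EOF

# Sum blocked time per unique stack to produce folded input.
awk '{ us[$1] += $2 } END { for (s in us) print s, us[s] }' \
    /tmp/waits.txt > /tmp/offcpu-demo.folded
cat /tmp/offcpu-demo.folded
# flamegraph.pl --countname=us would then label widths in microseconds.
```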
Memory Allocation Profiling
# Track memory allocations
sudo perf record -e kmem:kmalloc -a -g -- sleep 15
sudo perf script | /opt/FlameGraph/stackcollapse-perf.pl | \
/opt/FlameGraph/flamegraph.pl --title="Memory Allocations" > malloc.svg
Differential FlameGraphs
Compare performance before and after a change:
# Before the change
sudo perf record -ag -F 99 -o perf-before.data -- sleep 30
sudo perf script -i perf-before.data | /opt/FlameGraph/stackcollapse-perf.pl > before.folded
# After the change
sudo perf record -ag -F 99 -o perf-after.data -- sleep 30
sudo perf script -i perf-after.data | /opt/FlameGraph/stackcollapse-perf.pl > after.folded
# Generate diff FlameGraph (red = regression, blue = improvement)
/opt/FlameGraph/difffolded.pl before.folded after.folded | \
/opt/FlameGraph/flamegraph.pl > diff.svg
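Conceptually, difffolded.pl joins the two folded files per stack and compares counts. The core of that comparison can be sketched with awk on two tiny made-up profiles (illustration only; use difffolded.pl for real work):

```shell
# Two hypothetical folded profiles (stack count).
cat > /tmp/before.folded <<'EOF'
main;render 40
main;gc 10
EOF
cat > /tmp/after.folded <<'EOF'
main;render 25
main;gc 30
EOF

# Per-stack delta: positive means more samples after the change (a
# regression candidate), negative means fewer.
awk 'NR==FNR { before[$1] = $2; next }
     { printf "%-12s %+d\n", $1, $2 - before[$1] }' \
    /tmp/before.folded /tmp/after.folded
```

In this toy data, gc grew by 20 samples while render shrank by 15, which is the kind of shift the red/blue coloring surfaces visually.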
Profiling Specific Runtimes
Node.js with perf
# Run Node.js with perf map support
node --perf-basic-prof your-app.js &
# The above creates /tmp/perf-<PID>.map for symbol resolution

sudo perf record -g -F 99 -p $(pgrep -f "your-app") -- sleep 30
sudo perf script | /opt/FlameGraph/stackcollapse-perf.pl | /opt/FlameGraph/flamegraph.pl > node-flame.svg
Java/JVM with perf
# Use async-profiler for JVM (better than perf for Java)
# Download from https://github.com/async-profiler/async-profiler
./asprof -d 30 -f flamegraph.html $(pgrep -f "your-app")
Python with py-spy
# py-spy generates FlameGraphs directly
pip install py-spy
sudo py-spy record -o profile.svg --pid $(pgrep -f "python app.py")
Automating Performance Baselines
Create a script to capture regular performance snapshots:
#!/bin/bash
# /usr/local/bin/perf-snapshot.sh
OUTDIR="/var/log/perf-snapshots"
mkdir -p "$OUTDIR"
DATE=$(date +%Y%m%d-%H%M%S)
sudo perf record -ag -F 99 -o "$OUTDIR/perf-$DATE.data" -- sleep 60
sudo perf script -i "$OUTDIR/perf-$DATE.data" | \
/opt/FlameGraph/stackcollapse-perf.pl | \
/opt/FlameGraph/flamegraph.pl > "$OUTDIR/flamegraph-$DATE.svg"
# Clean up raw data older than 7 days
find "$OUTDIR" -name "perf-*.data" -mtime +7 -delete
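To run the snapshot script on a schedule, a cron entry is one option (the hourly cadence and path here are assumptions; adjust to your environment):

```shell
# Hypothetical crontab entry (edit with: sudo crontab -e, since
# perf record needs elevated privileges): hourly snapshot.
0 * * * * /usr/local/bin/perf-snapshot.sh
```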
Common Pitfalls
- Missing symbols: Install debug symbols (-dbgsym or -debuginfo packages) for meaningful function names.
- Kernel symbols: Run sudo sysctl kernel.kptr_restrict=0 to see kernel function names.
- Frame pointers: Compile applications with -fno-omit-frame-pointer for accurate stack traces. Many modern distros now enable this by default.
- Sampling bias: Use 99 Hz (not 100 Hz) to avoid synchronization artifacts with periodic timers.
- Short profiles: Profile for at least 30-60 seconds under realistic load to get statistically significant samples.
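Before a profiling session, it can save time to confirm the kernel will cooperate. A small check of the two sysctls that most often break perf output (standard procfs paths; prints "n/a" where unavailable):

```shell
# Read the sysctls that commonly affect perf results. Missing files
# (e.g. on non-Linux systems) just report "n/a" instead of failing.
KPTR=$(cat /proc/sys/kernel/kptr_restrict 2>/dev/null || echo "n/a")
PARANOID=$(cat /proc/sys/kernel/perf_event_paranoid 2>/dev/null || echo "n/a")
echo "kptr_restrict=$KPTR (0 needed to resolve kernel symbol names)"
echo "perf_event_paranoid=$PARANOID (lower values allow more unprivileged access)"
```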
Summary
The perf + FlameGraph combination is the gold standard for performance analysis on Linux. Start with a CPU FlameGraph to identify hot spots, use off-CPU analysis for latency issues, and leverage differential FlameGraphs to validate your optimizations. Always profile under realistic production-like workloads, and make profiling a regular part of your deployment pipeline rather than a one-time debugging exercise.