Building a Prometheus + Grafana Monitoring System
Collect metrics with Prometheus and visualize them in Grafana to monitor server health, application performance, and container resources in real time, with alerting on top.
Problem
Required Tools
Prometheus: A CNCF open-source metrics collection and storage system. Periodically scrapes metrics from targets using a pull-based model and stores them in a time-series database (TSDB).
Grafana: An open-source dashboard tool for visualizing metric data. Connects to Prometheus as a datasource to create real-time charts, graphs, and tables.
Node Exporter: The official exporter that exposes server hardware/OS metrics (CPU, memory, disk, network) in Prometheus format.
Alertmanager: Groups, routes, and delivers alerts generated by Prometheus alert rules to channels (Slack, Email, PagerDuty).
cAdvisor: A container monitoring tool that collects Docker container CPU, memory, and network metrics and exposes them to Prometheus.
Solution Steps
Deploy Monitoring Stack with Docker Compose
Deploy Prometheus, Grafana, Node Exporter, Alertmanager, and cAdvisor all at once with Docker Compose. Configure networking so each component can communicate, and mount volumes for data persistence. Access Prometheus on port 9090, Grafana on port 3000, and Alertmanager on port 9093.
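Once the stack below is up (docker compose -f docker-compose.monitoring.yml up -d), a small smoke test can confirm that every component answers on its mapped port. This is just a sketch using Node 18+'s built-in fetch; the health paths (/-/healthy for Prometheus and Alertmanager, /api/health for Grafana, /healthz for cAdvisor, plain /metrics for Node Exporter) are those components' own endpoints, and the localhost ports assume the default mappings from the compose file.
// smoke-test.ts - verify each monitoring component responds after "docker compose up -d" (sketch)
const endpoints: Record<string, string> = {
  prometheus: 'http://localhost:9090/-/healthy',
  grafana: 'http://localhost:3000/api/health',
  alertmanager: 'http://localhost:9093/-/healthy',
  'node-exporter': 'http://localhost:9100/metrics',
  cadvisor: 'http://localhost:8080/healthz',
};

async function main(): Promise<void> {
  for (const [name, url] of Object.entries(endpoints)) {
    try {
      const res = await fetch(url);
      console.log(`${name.padEnd(14)} ${res.ok ? 'OK' : `HTTP ${res.status}`} (${url})`);
    } catch (err) {
      console.log(`${name.padEnd(14)} UNREACHABLE (${url}): ${(err as Error).message}`);
    }
  }
}

main();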
# docker-compose.monitoring.yml
version: '3.8'
networks:
monitoring:
driver: bridge
volumes:
prometheus_data: {}
grafana_data: {}
services:
prometheus:
image: prom/prometheus:v2.48.0
container_name: prometheus
ports:
- "9090:9090"
volumes:
- ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
- ./prometheus/alert.rules.yml:/etc/prometheus/alert.rules.yml
- prometheus_data:/prometheus
command:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.retention.time=30d'
- '--web.enable-lifecycle'
networks:
- monitoring
grafana:
image: grafana/grafana:10.2.0
container_name: grafana
ports:
- "3000:3000"
volumes:
- grafana_data:/var/lib/grafana
- ./grafana/provisioning:/etc/grafana/provisioning
environment:
- GF_SECURITY_ADMIN_PASSWORD=admin123
- GF_USERS_ALLOW_SIGN_UP=false
networks:
- monitoring
node-exporter:
image: prom/node-exporter:v1.7.0
container_name: node-exporter
ports:
- "9100:9100"
volumes:
- /proc:/host/proc:ro
- /sys:/host/sys:ro
- /:/rootfs:ro
command:
- '--path.procfs=/host/proc'
- '--path.sysfs=/host/sys'
- '--path.rootfs=/rootfs'
networks:
- monitoring
cadvisor:
image: gcr.io/cadvisor/cadvisor:v0.47.2
container_name: cadvisor
ports:
- "8080:8080"
volumes:
- /:/rootfs:ro
- /var/run:/var/run:ro
- /sys:/sys:ro
- /var/lib/docker/:/var/lib/docker:ro
networks:
- monitoring
alertmanager:
image: prom/alertmanager:v0.26.0
container_name: alertmanager
ports:
- "9093:9093"
volumes:
- ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
networks:
- monitoring
Configure Prometheus and Scrape Targets
Set scrape targets and intervals in prometheus.yml. Use static_configs for fixed targets and file_sd_configs for dynamic targets. Set different scrape_intervals per job so important metrics are collected more frequently.
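The dynamic-services job below watches /etc/prometheus/targets/*.json, where each file holds an array of {targets, labels} groups that Prometheus re-reads on every refresh_interval. As a sketch (the host list and output path are invented for illustration, and it assumes you also mount ./prometheus/targets into the Prometheus container), a script that renders such a file from your own inventory could look like this:
// write-targets.ts - render a Prometheus file_sd target file (sketch; the inventory below is hypothetical)
import { writeFileSync, mkdirSync } from 'node:fs';

// Each group becomes one entry in the JSON file: a list of host:port targets plus shared labels.
const targetGroups = [
  { targets: ['192.168.1.10:9100', '192.168.1.11:9100'], labels: { environment: 'production', role: 'web' } },
  { targets: ['192.168.1.20:9100'], labels: { environment: 'staging', role: 'db' } },
];

// Write into the directory that is mounted at /etc/prometheus/targets inside the container.
mkdirSync('./prometheus/targets', { recursive: true });
writeFileSync('./prometheus/targets/nodes.json', JSON.stringify(targetGroups, null, 2));
console.log(`Wrote ${targetGroups.length} target groups to prometheus/targets/nodes.json`);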
# prometheus/prometheus.yml
global:
scrape_interval: 15s # Default collection interval
evaluation_interval: 15s # Alert rule evaluation interval
scrape_timeout: 10s
# Alert rule files
rule_files:
- "alert.rules.yml"
# Alertmanager connection
alerting:
alertmanagers:
- static_configs:
- targets: ['alertmanager:9093']
# Scrape target configuration
scrape_configs:
# Prometheus self metrics
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
# Server hardware metrics (Node Exporter)
- job_name: 'node-exporter'
scrape_interval: 10s
static_configs:
- targets:
- 'node-exporter:9100'
# Additional servers (specify by IP)
# - '192.168.1.10:9100'
# - '192.168.1.11:9100'
labels:
environment: 'production'
# Docker container metrics (cAdvisor)
- job_name: 'cadvisor'
scrape_interval: 10s
static_configs:
- targets: ['cadvisor:8080']
# Application metrics (Node.js, Java, etc.)
- job_name: 'api-server'
scrape_interval: 5s
metrics_path: '/metrics'
static_configs:
- targets:
- 'api-server:4000'
labels:
service: 'api'
environment: 'production'
# Dynamic service discovery (file-based)
- job_name: 'dynamic-services'
file_sd_configs:
- files:
- '/etc/prometheus/targets/*.json'
refresh_interval: 30s
Implement Custom Application Metrics (Node.js)
Define and expose business metrics directly from the Node.js application using the prom-client library. Use the four metric types appropriately: Counter (cumulative count), Gauge (current value), Histogram (distribution), and Summary (client-side percentiles). Prometheus collects the metrics periodically through the /metrics endpoint.
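The file below covers Counter, Histogram, and Gauge; for completeness, this is roughly what a Summary (client-side percentiles) looks like with prom-client. It is a standalone sketch, and the metric name and label are made up:
// summary-example.ts - Summary metric sketch (client-side percentiles); names are illustrative
import { Registry, Summary } from 'prom-client';

const summaryRegister = new Registry();

const dbQueryDuration = new Summary({
  name: 'db_query_duration_seconds',   // hypothetical metric name
  help: 'DB query duration in seconds',
  labelNames: ['operation'],
  percentiles: [0.5, 0.9, 0.99],       // quantiles are computed in the client, not by Prometheus
  registers: [summaryRegister],
});

// Observe a value directly...
dbQueryDuration.observe({ operation: 'select' }, 0.042);

// ...or time a block: startTimer() returns a function that records the elapsed seconds.
const end = dbQueryDuration.startTimer({ operation: 'insert' });
// ... run the query ...
end();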
// metrics.ts - Node.js application metrics
import express from 'express';
import { Registry, Counter, Histogram, Gauge, collectDefaultMetrics } from 'prom-client';
const app = express();
const register = new Registry();
// Collect default Node.js metrics (CPU, memory, event loop, etc.)
collectDefaultMetrics({ register, prefix: 'app_' });
// HTTP request count (Counter)
const httpRequestsTotal = new Counter({
name: 'http_requests_total',
help: 'Total number of HTTP requests',
labelNames: ['method', 'route', 'status_code'],
registers: [register],
});
// HTTP response time (Histogram)
const httpRequestDuration = new Histogram({
name: 'http_request_duration_seconds',
help: 'HTTP request duration in seconds',
labelNames: ['method', 'route', 'status_code'],
buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
registers: [register],
});
// Active connections (Gauge)
const activeConnections = new Gauge({
name: 'active_connections',
help: 'Number of active connections',
registers: [register],
});
// Business metrics example
const ordersProcessed = new Counter({
name: 'orders_processed_total',
help: 'Total orders processed',
labelNames: ['status', 'payment_method'],
registers: [register],
});
// Express middleware (automatic metric collection)
app.use((req, res, next) => {
const start = Date.now();
activeConnections.inc();
res.on('finish', () => {
const duration = (Date.now() - start) / 1000;
const route = req.route?.path || req.path;
const labels = {
method: req.method,
route,
status_code: String(res.statusCode),
};
httpRequestsTotal.inc(labels);
httpRequestDuration.observe(labels, duration);
activeConnections.dec();
});
next();
});
// /metrics endpoint
app.get('/metrics', async (req, res) => {
res.set('Content-Type', register.contentType);
res.end(await register.metrics());
});
Analyze Key Metrics with PromQL Queries
Query and analyze collected metrics using PromQL (Prometheus Query Language). Use rate() for per-second change rate, increase() for interval totals, and histogram_quantile() for percentiles. Both alert rules and Grafana dashboards use PromQL, so learning key functions is essential.
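Each query below can be run in the Prometheus web UI (port 9090) or against its HTTP API. As a small sketch (Node 18+, with the URL assuming the compose port mapping above), this is what an instant query against /api/v1/query looks like:
// promql-query.ts - run an instant PromQL query against the Prometheus HTTP API (sketch, Node 18+)
const PROMETHEUS_URL = 'http://localhost:9090'; // assumes the compose port mapping above

async function instantQuery(query: string): Promise<void> {
  const res = await fetch(`${PROMETHEUS_URL}/api/v1/query?query=${encodeURIComponent(query)}`);
  const body = await res.json();
  if (body.status !== 'success') {
    throw new Error(`Query failed: ${JSON.stringify(body)}`);
  }
  // Each result carries a label set ("metric") and a [timestamp, value] sample.
  for (const series of body.data.result) {
    console.log(JSON.stringify(series.metric), '=>', series.value[1]);
  }
}

// Example: current memory usage (%) per instance.
instantQuery('(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100');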
# Essential PromQL queries
# CPU usage (%)
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory usage (%)
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
# Disk usage (%)
(1 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})) * 100
# HTTP request rate (requests per second)
rate(http_requests_total[5m])
# HTTP error rate (%)
sum(rate(http_requests_total{status_code=~"5.."}[5m]))
/ sum(rate(http_requests_total[5m])) * 100
# API response time 95th percentile
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
# API response time 99th percentile (per service)
histogram_quantile(0.99,
sum by(le, service) (rate(http_request_duration_seconds_bucket[5m]))
)
# Container memory usage (cAdvisor)
container_memory_usage_bytes{container_label_com_docker_compose_service!=""}
/ 1024 / 1024 # Convert to MB
# Container CPU usage
rate(container_cpu_usage_seconds_total{
container_label_com_docker_compose_service!=""
}[5m]) * 100
# Network receive/transmit traffic (bytes/sec)
rate(node_network_receive_bytes_total{device="eth0"}[5m])
rate(node_network_transmit_bytes_total{device="eth0"}[5m])
# Order throughput (per minute)
increase(orders_processed_total[1m])
# Spike detection: error rate exceeds 10% over 5 minutes
sum(rate(http_requests_total{status_code=~"5.."}[5m]))
/ sum(rate(http_requests_total[5m])) > 0.10
Configure Alert Rules with Alertmanager
Define Prometheus alert rules and route alerts through Alertmanager to Slack, Email, etc. Use the for clause to only trigger alerts for sustained issues (not transient spikes), and use severity labels to prioritize alerts. Alertmanager's route/receiver configuration routes alerts to different channels based on severity.
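Before relying on the rules below, it is worth confirming that Alertmanager's routing and Slack webhooks actually work. One way is to push a synthetic alert straight to Alertmanager's v2 API; a sketch follows (the labels and annotations are made up, and the URL assumes the compose port mapping above):
// send-test-alert.ts - push a synthetic alert to Alertmanager to verify routing (sketch, Node 18+)
const ALERTMANAGER_URL = 'http://localhost:9093'; // assumes the compose port mapping above

async function sendTestAlert(): Promise<void> {
  const now = new Date();
  const alerts = [
    {
      labels: {
        alertname: 'RoutingTest',   // made-up alert, not one of the rules below
        severity: 'critical',       // should be routed to the slack-critical receiver
        instance: 'test-host:9100',
      },
      annotations: {
        summary: 'Synthetic alert to verify Alertmanager routing',
        description: 'Safe to ignore; sent by send-test-alert.ts',
      },
      startsAt: now.toISOString(),
      endsAt: new Date(now.getTime() + 5 * 60 * 1000).toISOString(), // auto-resolves in 5 minutes
    },
  ];

  const res = await fetch(`${ALERTMANAGER_URL}/api/v2/alerts`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(alerts),
  });
  console.log(`Alertmanager responded with HTTP ${res.status}`);
}

sendTestAlert();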
# prometheus/alert.rules.yml - Alert rules
groups:
- name: server_alerts
rules:
# CPU usage exceeds 85% (sustained for 5 min)
- alert: HighCpuUsage
expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
for: 5m
labels:
severity: warning
annotations:
summary: "High CPU usage on {{ $labels.instance }}"
description: 'CPU usage is {{ $value | printf "%.1f" }}%'
# Memory usage exceeds 90%
- alert: HighMemoryUsage
expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
for: 5m
labels:
severity: critical
annotations:
summary: "High memory usage on {{ $labels.instance }}"
description: 'Memory usage is {{ $value | printf "%.1f" }}%'
# Disk usage exceeds 85%
- alert: DiskSpaceLow
expr: (1 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})) * 100 > 85
for: 10m
labels:
severity: warning
annotations:
summary: "Low disk space on {{ $labels.instance }}"
- name: application_alerts
rules:
# API error rate exceeds 5%
- alert: HighErrorRate
expr: sum by(service) (rate(http_requests_total{status_code=~"5.."}[5m])) / sum by(service) (rate(http_requests_total[5m])) > 0.05
for: 3m
labels:
severity: critical
annotations:
summary: "High error rate on {{ $labels.service }}"
description: 'Error rate is {{ $value | humanizePercentage }}'
# API response time P95 > 2s
- alert: SlowApiResponse
expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 2
for: 5m
labels:
severity: warning
# alertmanager/alertmanager.yml - Alertmanager configuration
global:
resolve_timeout: 5m
route:
receiver: 'slack-default'
group_by: ['alertname', 'severity']
group_wait: 30s
group_interval: 5m
repeat_interval: 4h
routes:
- match:
severity: critical
receiver: 'slack-critical'
repeat_interval: 1h
receivers:
- name: 'slack-default'
slack_configs:
- api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
channel: '#monitoring'
title: '{{ .GroupLabels.alertname }}'
text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
- name: 'slack-critical'
slack_configs:
- api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
channel: '#alerts-critical'
title: '[CRITICAL] {{ .GroupLabels.alertname }}'
text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
Build Grafana Dashboards and Visualizations
Add Prometheus as a datasource in Grafana and create dashboards that visualize key metrics. Import community dashboards (Node Exporter Full: ID 1860, Docker: ID 193) to get started quickly. Dashboard variables add a dropdown for selecting servers or services, so a single dashboard can switch between them dynamically.
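Datasource provisioning (the YAML below) is only read at startup, so it is worth verifying that Grafana actually registered it. A small sketch using Grafana's HTTP API, with the admin credentials taken from the compose file above:
// check-grafana.ts - confirm Grafana is up and the Prometheus datasource was provisioned (sketch, Node 18+)
const GRAFANA_URL = 'http://localhost:3000';
const AUTH = 'Basic ' + Buffer.from('admin:admin123').toString('base64'); // from GF_SECURITY_ADMIN_PASSWORD

async function main(): Promise<void> {
  // /api/health reports overall status without authentication.
  const health = await (await fetch(`${GRAFANA_URL}/api/health`)).json();
  console.log('health:', health);

  // /api/datasources lists configured datasources (requires admin credentials).
  const res = await fetch(`${GRAFANA_URL}/api/datasources`, { headers: { Authorization: AUTH } });
  const datasources: Array<{ name: string; type: string; url: string }> = await res.json();
  for (const ds of datasources) {
    console.log(`datasource: ${ds.name} (${ds.type}) -> ${ds.url}`);
  }
}

main();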
# Grafana datasource auto-provisioning
# grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
- name: Prometheus
type: prometheus
access: proxy
url: http://prometheus:9090
isDefault: true
editable: false
# Grafana dashboard panels (key panel configuration)
# Create a dashboard with these panels:
# Panel 1: CPU Usage Graph
# PromQL: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[$__rate_interval])) * 100)
# Visualization: Time series, threshold lines (80%, 90%)
# Panel 2: Memory Usage Gauge
# PromQL: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
# Visualization: Gauge, color thresholds (Green/Yellow/Red)
# Panel 3: API Response Time Heatmap
# PromQL: sum(increase(http_request_duration_seconds_bucket[$__rate_interval])) by (le)
# Visualization: Heatmap
# Panel 4: HTTP Request Rate and Error Rate
# PromQL 1: sum(rate(http_requests_total[$__rate_interval]))
# PromQL 2: sum(rate(http_requests_total{status_code=~"5.."}[$__rate_interval]))
# Visualization: Time series (two queries overlaid)
# Import community dashboards via the Grafana HTTP API
# Node Exporter Full (ID: 1860): download the dashboard JSON from grafana.com, wrap it in an import payload (requires jq), then POST it
curl -s https://grafana.com/api/dashboards/1860/revisions/latest/download -o node-exporter-full.json
jq '{dashboard: ., overwrite: true, inputs: [{name: "DS_PROMETHEUS", type: "datasource", pluginId: "prometheus", value: "Prometheus"}], folderId: 0}' \
  node-exporter-full.json > import-payload.json
curl -X POST http://admin:admin123@localhost:3000/api/dashboards/import \
  -H 'Content-Type: application/json' \
  -d @import-payload.json
# Add variables to dashboard (dynamic server selection)
# Settings > Variables > Add variable
# Name: instance
# Type: Query
# Query: label_values(node_cpu_seconds_total, instance)
# Use in panel queries, e.g. node_cpu_seconds_total{instance="$instance"}
Core Code
Core Prometheus monitoring configuration: collect metrics from three layers, server (Node Exporter), application (/metrics), and containers (cAdvisor), then analyze CPU, memory, error rate, and response time with PromQL.
# Prometheus + Grafana Monitoring Core Configuration
# prometheus.yml - minimal setup
global:
scrape_interval: 15s
rule_files:
- "alert.rules.yml"
alerting:
alertmanagers:
- static_configs:
- targets: ['alertmanager:9093']
scrape_configs:
- job_name: 'node'
static_configs:
- targets: ['node-exporter:9100']
- job_name: 'app'
metrics_path: '/metrics'
static_configs:
- targets: ['api-server:4000']
- job_name: 'containers'
static_configs:
- targets: ['cadvisor:8080']
# Essential PromQL
# CPU: 100 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100
# Memory: (1 - node_memory_MemAvailable_bytes/node_memory_MemTotal_bytes) * 100
# Error Rate: sum(rate(http_requests_total{status_code=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
# P95 Latency: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
Common Mistakes
Setting scrape_interval too short (1-2 seconds), overloading both Prometheus and target servers
Most metrics are fine at 15-second intervals. Set 5-10 seconds only for important metrics (API response time), and 15-30 seconds for hardware metrics. The range in rate() functions ([5m]) should be at least 4x the scrape_interval.
Configuring alerts without the for clause, causing alerts on transient spikes
Always set the for clause so alerts only fire when conditions persist for a duration (3-5 minutes). CPU momentarily hitting 90% is normal, but sustained for 5 minutes is a problem. Use severity labels to distinguish warning and critical, reducing alert fatigue.
Not configuring Prometheus storage retention, causing disk to fill up
Use --storage.tsdb.retention.time=30d or --storage.tsdb.retention.size=10GB flags to limit data retention period/size. For long-term storage, consider using Thanos or Cortex for remote storage.