
Building a Prometheus + Grafana Monitoring System

Collect metrics with Prometheus and visualize them with Grafana to monitor server health, application performance, and container resources in real time, with threshold-based alerting

Prometheus monitoring, Grafana dashboard, PromQL queries, server monitoring, Prometheus Alertmanager, Node Exporter, metrics collection, Grafana alerting

Problem

A system running on a microservices architecture cannot detect issues such as server failures, performance degradation, and resource exhaustion in advance, which delays incident response. Fifteen servers and 40+ Docker containers are in operation, but monitoring relies on manually SSHing into each server and running top/htop. CPU/memory usage, disk I/O, API response times, and error rates need to be visible in real time on a unified dashboard, with automatic Slack alerts when thresholds are exceeded. A Prometheus + Grafana based monitoring system needs to be built for full observability.

Required Tools

Prometheus

A CNCF open-source metric collection/storage system. Periodically scrapes metrics from targets using a pull-based model and stores them in a time-series database (TSDB).

Grafana

An open-source dashboard tool for visualizing metric data. Connects to Prometheus as a datasource to create real-time charts, graphs, and tables.

Node Exporter

The official exporter that exposes server hardware/OS metrics (CPU, memory, disk, network) in Prometheus format.

Alertmanager

An alert manager that groups, routes, and delivers alerts generated by Prometheus alert rules to channels (Slack, Email, PagerDuty).

cAdvisor

A container monitoring tool that collects Docker container CPU, memory, and network metrics and exposes them to Prometheus.

Solution Steps

1

Deploy Monitoring Stack with Docker Compose

Deploy Prometheus, Grafana, Node Exporter, Alertmanager, and cAdvisor all at once with Docker Compose. Configure networking so each component can communicate, and mount volumes for data persistence. Access Prometheus on port 9090, Grafana on port 3000, and Alertmanager on port 9093.

# docker-compose.monitoring.yml
version: '3.8'

networks:
  monitoring:
    driver: bridge

volumes:
  prometheus_data: {}
  grafana_data: {}

services:
  prometheus:
    image: prom/prometheus:v2.48.0
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus/alert.rules.yml:/etc/prometheus/alert.rules.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'
      - '--web.enable-lifecycle'
    networks:
      - monitoring

  grafana:
    image: grafana/grafana:10.2.0
    container_name: grafana
    ports:
      - "3000:3000"
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin123
      - GF_USERS_ALLOW_SIGN_UP=false
    networks:
      - monitoring

  node-exporter:
    image: prom/node-exporter:v1.7.0
    container_name: node-exporter
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'
    networks:
      - monitoring

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.47.2
    container_name: cadvisor
    ports:
      - "8080:8080"
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
    networks:
      - monitoring

  alertmanager:
    image: prom/alertmanager:v0.26.0
    container_name: alertmanager
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
    networks:
      - monitoring

2

Configure Prometheus and Scrape Targets

Set scrape targets and intervals in prometheus.yml. Use static_configs for fixed targets and file_sd_configs for dynamic targets. Set different scrape_intervals per job so important metrics are collected more frequently.

# prometheus/prometheus.yml
global:
  scrape_interval: 15s      # Default collection interval
  evaluation_interval: 15s   # Alert rule evaluation interval
  scrape_timeout: 10s

# Alert rule files
rule_files:
  - "alert.rules.yml"

# Alertmanager connection
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

# Scrape target configuration
scrape_configs:
  # Prometheus self metrics
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Server hardware metrics (Node Exporter)
  - job_name: 'node-exporter'
    scrape_interval: 10s
    static_configs:
      - targets:
          - 'node-exporter:9100'
          # Additional servers (specify by IP)
          # - '192.168.1.10:9100'
          # - '192.168.1.11:9100'
        labels:
          environment: 'production'

  # Docker container metrics (cAdvisor)
  - job_name: 'cadvisor'
    scrape_interval: 10s
    static_configs:
      - targets: ['cadvisor:8080']

  # Application metrics (Node.js, Java, etc.)
  - job_name: 'api-server'
    scrape_interval: 5s
    metrics_path: '/metrics'
    static_configs:
      - targets:
          - 'api-server:4000'
        labels:
          service: 'api'
          environment: 'production'

  # Dynamic service discovery (file-based)
  - job_name: 'dynamic-services'
    file_sd_configs:
      - files:
          - '/etc/prometheus/targets/*.json'
        refresh_interval: 30s
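
The file_sd_configs job above reads its target list from JSON (or YAML) files on disk, so new services can be registered without restarting Prometheus; the files are re-read every refresh_interval. A minimal target file might look like this (the service hostnames are placeholders):

```json
[
  {
    "targets": ["payment-service:4001", "inventory-service:4002"],
    "labels": {
      "service": "backend",
      "environment": "production"
    }
  }
]
```

Drop a file like this into /etc/prometheus/targets/ and the new targets appear on the Prometheus Targets page within about 30 seconds, no reload required.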

3

Implement Custom Application Metrics (Node.js)

Define and expose business metrics directly from the Node.js application using the prom-client library. Use 4 metric types appropriately: Counter (cumulative), Gauge (current value), Histogram (distribution), Summary (percentiles). Prometheus periodically collects metrics through the /metrics endpoint.

// metrics.ts - Node.js application metrics
import express from 'express';
import { Registry, Counter, Histogram, Gauge, collectDefaultMetrics } from 'prom-client';

const app = express();
const register = new Registry();

// Collect default Node.js metrics (CPU, memory, event loop, etc.)
collectDefaultMetrics({ register, prefix: 'app_' });

// HTTP request count (Counter)
const httpRequestsTotal = new Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'route', 'status_code'],
  registers: [register],
});

// HTTP response time (Histogram)
const httpRequestDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
  registers: [register],
});

// Active connections (Gauge)
const activeConnections = new Gauge({
  name: 'active_connections',
  help: 'Number of active connections',
  registers: [register],
});

// Business metrics example
const ordersProcessed = new Counter({
  name: 'orders_processed_total',
  help: 'Total orders processed',
  labelNames: ['status', 'payment_method'],
  registers: [register],
});

// Express middleware (automatic metric collection)
app.use((req, res, next) => {
  const start = Date.now();
  activeConnections.inc();

  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;
    const route = req.route?.path || req.path;
    const labels = {
      method: req.method,
      route,
      status_code: String(res.statusCode),
    };
    httpRequestsTotal.inc(labels);
    httpRequestDuration.observe(labels, duration);
    activeConnections.dec();
  });

  next();
});

// /metrics endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

4

Analyze Key Metrics with PromQL Queries

Query and analyze collected metrics using PromQL (Prometheus Query Language). Use rate() for per-second change rate, increase() for interval totals, and histogram_quantile() for percentiles. Both alert rules and Grafana dashboards use PromQL, so learning key functions is essential.

# Essential PromQL queries

# CPU usage (%)
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage (%)
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

# Disk usage (%)
(1 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})) * 100

# HTTP request rate (requests per second)
rate(http_requests_total[5m])

# HTTP error rate (%)
rate(http_requests_total{status_code=~"5.."}[5m])
/ rate(http_requests_total[5m]) * 100

# API response time 95th percentile
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# API response time 99th percentile (per service)
histogram_quantile(0.99,
  sum by(le, service) (rate(http_request_duration_seconds_bucket[5m]))
)

# Container memory usage (cAdvisor)
container_memory_usage_bytes{container_label_com_docker_compose_service!=""}
/ 1024 / 1024  # Convert to MB

# Container CPU usage
rate(container_cpu_usage_seconds_total{
  container_label_com_docker_compose_service!=""
}[5m]) * 100

# Network receive/transmit traffic (bytes/sec)
rate(node_network_receive_bytes_total{device="eth0"}[5m])
rate(node_network_transmit_bytes_total{device="eth0"}[5m])

# Order throughput (per minute)
increase(orders_processed_total[1m])

# Spike detection: error rate exceeds 10% in 5 minutes
rate(http_requests_total{status_code=~"5.."}[5m])
/ rate(http_requests_total[5m]) > 0.10
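
histogram_quantile() estimates a percentile by locating the bucket that contains the target rank and interpolating linearly inside it. The sketch below (a simplified, hypothetical helper, not Prometheus code; it works on raw cumulative bucket counts and ignores rate()) shows the core of that estimation:

```typescript
type Bucket = { le: number; count: number }; // cumulative count of observations <= le

// Simplified histogram_quantile: find the bucket containing the target rank,
// then interpolate linearly between the bucket's bounds.
function histogramQuantile(q: number, buckets: Bucket[]): number {
  const sorted = [...buckets].sort((a, b) => a.le - b.le);
  const total = sorted[sorted.length - 1].count;
  const rank = q * total;

  let prevLe = 0;
  let prevCount = 0;
  for (const b of sorted) {
    if (b.count >= rank) {
      // Rank falls into the +Inf bucket: return the last finite upper bound
      if (b.le === Infinity) return prevLe;
      const inBucket = b.count - prevCount;
      const fraction = inBucket === 0 ? 0 : (rank - prevCount) / inBucket;
      return prevLe + fraction * (b.le - prevLe);
    }
    prevLe = b.le;
    prevCount = b.count;
  }
  return prevLe;
}

// Example: 50 requests under 0.1s, 40 more under 0.5s, 10 slower than that.
const exampleBuckets: Bucket[] = [
  { le: 0.1, count: 50 },
  { le: 0.5, count: 90 },
  { le: Infinity, count: 100 },
];
console.log(histogramQuantile(0.8, exampleBuckets)); // ~0.4s
```

This also explains a common surprise: when the target rank lands in the +Inf bucket, histogram_quantile() returns the upper bound of the largest finite bucket, so percentile values are capped by your bucket layout.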

5

Configure Alert Rules with Alertmanager

Define Prometheus alert rules and route alerts through Alertmanager to Slack, Email, etc. Use the for clause to only trigger alerts for sustained issues (not transient spikes), and use severity labels to prioritize alerts. Alertmanager's route/receiver configuration routes alerts to different channels based on severity.

# prometheus/alert.rules.yml - Alert rules
groups:
  - name: server_alerts
    rules:
      # CPU usage exceeds 85% (sustained for 5 min)
      - alert: HighCpuUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value | printf \"%.1f\" }}%"

      # Memory usage exceeds 90%
      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ $value | printf \"%.1f\" }}%"

      # Disk usage exceeds 85%
      - alert: DiskSpaceLow
        expr: (1 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})) * 100 > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"

  - name: application_alerts
    rules:
      # API error rate exceeds 5%
      - alert: HighErrorRate
        expr: rate(http_requests_total{status_code=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.05
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.service }}"
          description: "Error rate is {{ $value | humanizePercentage }}"

      # API response time P95 > 2s
      - alert: SlowApiResponse
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 2
        for: 5m
        labels:
          severity: warning

# alertmanager/alertmanager.yml - Alertmanager configuration
global:
  resolve_timeout: 5m

route:
  receiver: 'slack-default'
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: 'slack-critical'
      repeat_interval: 1h

receivers:
  - name: 'slack-default'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
        channel: '#monitoring'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

  - name: 'slack-critical'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
        channel: '#alerts-critical'
        title: '[CRITICAL] {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

6

Build Grafana Dashboards and Visualizations

Add Prometheus as a datasource in Grafana and create dashboards that visualize key metrics. Import community dashboards (Node Exporter Full: ID 1860, Docker: ID 193) to get started quickly. Setting up variables enables dynamic dashboard switching via dropdown to select servers/services.

# Grafana datasource auto-provisioning
# grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: false

# Grafana dashboard panels (key panel configuration)
# Create a dashboard with these panels:

# Panel 1: CPU Usage Graph
# PromQL: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[$__rate_interval])) * 100)
# Visualization: Time series, threshold lines (80%, 90%)

# Panel 2: Memory Usage Gauge
# PromQL: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
# Visualization: Gauge, color thresholds (Green/Yellow/Red)

# Panel 3: API Response Time Heatmap
# PromQL: sum(increase(http_request_duration_seconds_bucket[$__rate_interval])) by (le)
# Visualization: Heatmap

# Panel 4: HTTP Request Rate and Error Rate
# PromQL 1: sum(rate(http_requests_total[$__rate_interval]))
# PromQL 2: sum(rate(http_requests_total{status_code=~"5.."}[$__rate_interval]))
# Visualization: Time series (two queries overlaid)

# Import community dashboards via the API
# Node Exporter Full (ID: 1860): download the dashboard JSON first, then import
curl -s https://grafana.com/api/dashboards/1860/revisions/latest/download \
  -o node-exporter-full.json
curl -X POST http://admin:admin123@localhost:3000/api/dashboards/import \
  -H 'Content-Type: application/json' \
  -d "{
    \"dashboard\": $(cat node-exporter-full.json),
    \"overwrite\": true,
    \"inputs\": [{ \"name\": \"DS_PROMETHEUS\", \"type\": \"datasource\", \"pluginId\": \"prometheus\", \"value\": \"Prometheus\" }],
    \"folderId\": 0
  }"
# (Or simply use the UI: Dashboards > Import > enter dashboard ID 1860)

# Add variables to dashboard (dynamic server selection)
# Settings > Variables > Add variable
# Name: instance
# Type: Query
# Query: label_values(node_cpu_seconds_total, instance)
# Use in panels: {instance="$instance"}

Core Code

Core Prometheus monitoring configuration: collect metrics from three layers (servers via Node Exporter, applications via /metrics, containers via cAdvisor), then analyze CPU, memory, error rate, and response time with PromQL.

# Prometheus + Grafana Monitoring Core Configuration
# prometheus.yml - minimal setup
global:
  scrape_interval: 15s

rule_files:
  - "alert.rules.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'app'
    metrics_path: '/metrics'
    static_configs:
      - targets: ['api-server:4000']

  - job_name: 'containers'
    static_configs:
      - targets: ['cadvisor:8080']

# Essential PromQL
# CPU: 100 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100
# Memory: (1 - node_memory_MemAvailable_bytes/node_memory_MemTotal_bytes) * 100
# Error Rate: rate(http_requests_total{status_code=~"5.."}[5m]) / rate(http_requests_total[5m])
# P95 Latency: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

Common Mistakes

Setting scrape_interval too short (1-2 seconds), overloading both Prometheus and target servers

Most metrics are fine at 15-second intervals. Set 5-10 seconds only for important metrics (API response time), and 15-30 seconds for hardware metrics. The range in rate() functions ([5m]) should be at least 4x the scrape_interval.
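
Following that guidance, a per-job configuration might look like this sketch (job names are illustrative; the comments note a matching rate() window for each interval):

```yaml
scrape_configs:
  - job_name: 'api-server'        # latency-sensitive: scrape every 5s
    scrape_interval: 5s           # -> use rate(...[20s]) or wider in queries
    static_configs:
      - targets: ['api-server:4000']

  - job_name: 'node-exporter'     # hardware metrics: 15s is plenty
    scrape_interval: 15s          # -> use rate(...[1m]) or wider
    static_configs:
      - targets: ['node-exporter:9100']
```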

Configuring alerts without the for clause, causing alerts on transient spikes

Always set the for clause so alerts only fire when conditions persist for a duration (3-5 minutes). CPU momentarily hitting 90% is normal, but sustained for 5 minutes is a problem. Use severity labels to distinguish warning and critical, reducing alert fatigue.

Not configuring Prometheus storage retention, causing disk to fill up

Use --storage.tsdb.retention.time=30d or --storage.tsdb.retention.size=10GB flags to limit data retention period/size. For long-term storage, consider using Thanos or Cortex for remote storage.
