Observability

The nxthdr platform implements a comprehensive observability strategy to monitor infrastructure health, service performance, and system availability. The stack combines metrics, logs, and external uptime monitoring to provide visibility from host resources up to application behavior.

Architecture Overview

The observability infrastructure consists of multiple components working together:

Services ── /metrics ─────────────→ Prometheus ──→ Grafana (visualization)
Servers  ── logs ──→ Alloy ───────→ Loki ────────→ Grafana
Alloy ── remote write (metrics) ──→ Prometheus

External: Upptime (HTTP) + UptimeRobot (ICMP)

Components

Prometheus

Prometheus is the core metrics collection and storage system. It scrapes metrics from all services and stores them as time-series data.

Configuration:

  • Scrape interval: 30 seconds
  • Retention: 15 days
  • Storage: Local filesystem
  • Authentication: Basic auth for remote write and queries

Scrape Targets:

  • Infrastructure services: Proxy (Caddy), ClickHouse, Redpanda, PostgreSQL
  • Data collectors: Risotto (BMP), Pesto (sFlow), Saimiris (probing)
  • Observability stack: Prometheus itself, Loki, Grafana, Alertmanager
  • System metrics: Node Exporter, cAdvisor
  • Application services: PeerLab Gateway, Saimiris Gateway, nxthdr.dev
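
A minimal prometheus.yml sketch reflecting the configuration and scrape targets above; job names, ports, and paths are illustrative rather than the exact deployment:

  global:
    scrape_interval: 30s

  scrape_configs:
    - job_name: caddy
      static_configs:
        - targets: ["proxy:2019"]
    - job_name: risotto
      static_configs:
        - targets: ["risotto:3000"]
    # ... one scrape job per service listed above

  # Retention is a server flag, not a config key:
  #   --storage.tsdb.retention.time=15d
  # Basic auth for queries and remote write lives in the web config file.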

Metrics Exposed: All services expose Prometheus metrics on /metrics endpoints:

  • Request rates and latencies
  • Error rates and types
  • Resource utilization (CPU, memory, disk)
  • Queue depths and processing rates
  • Custom business metrics
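
For illustration, the Prometheus text exposition format returned by these endpoints looks like the following (metric names are hypothetical):

  # HELP requests_total Total requests handled
  # TYPE requests_total counter
  requests_total{method="GET",status="200"} 10273
  # HELP queue_depth Current number of queued messages
  # TYPE queue_depth gauge
  queue_depth 42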

Loki

Loki aggregates logs from all services and makes them queryable alongside metrics in Grafana.

Architecture:

  • Single-instance deployment on core server
  • Filesystem storage for chunks and indexes
  • TSDB schema for efficient querying
  • 7-day retention with automatic compaction
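
A Loki configuration sketch matching the points above; the schema date is illustrative and the real file contains more settings:

  schema_config:
    configs:
      - from: 2024-01-01          # illustrative start date
        store: tsdb
        object_store: filesystem
        schema: v13
  limits_config:
    retention_period: 168h        # 7 days
  compactor:
    retention_enabled: true
    retention_delete_delay: 2h    # see Retention and Storage below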

Log Sources:

  • Container logs: Docker container stdout/stderr via Alloy
  • System logs: Syslog messages via Alloy (TCP port 601, UDP port 514)
  • Application logs: Structured logging from services

Features:

  • Label-based indexing (no full-text indexing)
  • LogQL query language
  • Integration with Alertmanager for log-based alerts
  • Correlation with metrics in Grafana
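
For example, two LogQL sketches: a filtered log stream and a log-derived metric (label values are illustrative):

  {job="risotto", level="error"} |= "session"
  count_over_time({job="pesto", level="error"}[5m])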

Grafana

Grafana provides visualization and dashboards for metrics and logs.

Data Sources:

  • Prometheus: Default data source for metrics
  • Loki: Log aggregation and search
  • ClickHouse: Direct queries for BGP, flows, and probing data
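
A sketch of how these data sources could be provisioned (standard Grafana provisioning format; URLs match the access section below, credentials elided, and the ClickHouse plugin is omitted for brevity):

  apiVersion: 1
  datasources:
    - name: Prometheus
      type: prometheus
      url: https://prometheus.nxthdr.dev
      isDefault: true
      basicAuth: true
    - name: Loki
      type: loki
      url: https://loki.nxthdr.dev
      basicAuth: true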

Dashboards: Pre-configured dashboards for:

  • Infrastructure overview (system resources, network)
  • Service-specific metrics (Risotto, Pesto, Saimiris)
  • ClickHouse performance and query statistics
  • Redpanda broker and topic metrics
  • Container and host metrics

Access:

  • Public URL: https://grafana.nxthdr.dev
  • Authentication: Auth0 integration
  • Role-based access control

Alloy

Grafana Alloy is the telemetry collector that gathers logs and metrics from distributed servers and forwards them to the central observability stack.

Deployment: Alloy runs on all servers (core, IXP, probing) as Docker containers.

Log Collection:

  • Container logs: Reads Docker container logs from /var/lib/docker/containers
  • Syslog: Listens on TCP/UDP for RFC5424 syslog messages
  • Processing: Parses JSON logs, extracts labels, timestamps
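
A sketch of the log side of an Alloy pipeline (River syntax); the endpoint is illustrative and the Docker log source is omitted for brevity:

  loki.source.syslog "system" {
    listener {
      address  = "0.0.0.0:601"
      protocol = "tcp"
    }
    listener {
      address  = "0.0.0.0:514"
      protocol = "udp"
    }
    forward_to = [loki.write.central.receiver]
  }

  loki.write "central" {
    endpoint {
      url = "https://loki.nxthdr.dev/loki/api/v1/push"
    }
  }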

Metrics Collection:

  • Node Exporter: System metrics (CPU, memory, disk, network)
  • cAdvisor: Container metrics (resource usage per container)
  • Service metrics: Application-specific metrics from local services
  • Remote write: Forwards all metrics to central Prometheus
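
And the metrics side, again as a hedged sketch; credentials are placeholders:

  prometheus.exporter.unix "node" { }

  prometheus.scrape "node" {
    targets    = prometheus.exporter.unix.node.targets
    forward_to = [prometheus.remote_write.central.receiver]
  }

  prometheus.remote_write "central" {
    endpoint {
      url = "https://prometheus.nxthdr.dev/api/v1/write"
      basic_auth {
        username = "alloy"   // illustrative
        password = "..."     // elided
      }
    }
  }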

Configuration:

  • Pipeline-based configuration with stages
  • Label relabeling for consistent naming
  • Instance identification for multi-server deployments

Node Exporter

Node Exporter exposes hardware and OS metrics for Unix systems.

Metrics Provided:

  • CPU usage and load averages
  • Memory and swap utilization
  • Disk I/O and space usage
  • Network interface statistics
  • Filesystem metrics
  • System uptime

Deployment: Runs as a Docker container on all servers, scraped by the local Alloy instance.

Alertmanager

Alertmanager handles alerts from Prometheus and routes them to notification channels.

Configuration:

  • Grouping: Alerts grouped by name and job
  • Deduplication: Prevents duplicate notifications
  • Routing: All alerts sent to Discord webhook
  • Repeat interval: 3 hours for ongoing alerts
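
An alertmanager.yml sketch of this routing; the webhook URL is a placeholder:

  route:
    receiver: discord
    group_by: [alertname, job]
    repeat_interval: 3h
  receivers:
    - name: discord
      discord_configs:
        - webhook_url: https://discord.com/api/webhooks/...   # elided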

Alert Rules: Prometheus evaluates alert rules defined in alerts.yml:

  • Service availability: Detects when core services go down
  • Resource exhaustion: High CPU, memory, or disk usage
  • Data pipeline health: Kafka lag, ClickHouse errors
  • Network issues: BGP session failures, high packet loss
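
A representative rule from the availability category; the threshold and wording are illustrative, not copied from alerts.yml:

  groups:
    - name: availability
      rules:
        - alert: ServiceDown
          expr: up == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "{{ $labels.job }} on {{ $labels.instance }} is down"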

Notification Channels:

  • Discord webhook for real-time alerts
  • Future: Email, PagerDuty, Slack integration

Upptime

Upptime provides HTTP endpoint monitoring with a public status page.

Monitoring:

  • HTTP/HTTPS endpoint availability checks
  • Response time measurements
  • SSL certificate expiration tracking
  • Historical uptime statistics

Status Page:

  • Public dashboard showing service status
  • Incident history and response times
  • Automated via GitHub Actions
  • Static site hosted on GitHub Pages

Endpoints Monitored:

  • nxthdr.dev (main website)
  • Grafana, Prometheus, Loki (observability stack)
  • API endpoints (PeerLab Gateway, Saimiris Gateway)
  • Public services (Geofeed, CHProxy)
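
Upptime is configured through .upptimerc.yml in its repository; a sketch with two of the endpoints above (the real list is longer):

  sites:
    - name: nxthdr.dev
      url: https://nxthdr.dev
    - name: Grafana
      url: https://grafana.nxthdr.dev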

UptimeRobot

UptimeRobot provides ICMP ping monitoring for network-level availability.

Monitoring:

  • ICMP ping checks to core infrastructure
  • Network reachability validation
  • Complementary to HTTP monitoring
  • Independent external monitoring

Targets:

  • Core server IPv6 address
  • IXP server addresses
  • Probing server addresses
  • Critical network endpoints

Alerts:

  • Email notifications for downtime
  • SMS alerts for critical failures
  • Integration with status page

Observability Patterns

Metrics Collection

Push vs Pull:

  • Pull model: Prometheus scrapes metrics from services (preferred)
  • Push model: Alloy remote write for distributed servers

Metric Types:

  • Counters: Total requests, errors, bytes processed
  • Gauges: Current values (queue depth, active connections)
  • Histograms: Request latencies, response sizes
  • Summaries: Quantiles for performance analysis
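
One PromQL idiom per type, over hypothetical metric names:

  rate(requests_total[5m])                    # counter: per-second rate
  queue_depth                                 # gauge: read directly
  histogram_quantile(0.99, sum by (le) (rate(request_duration_seconds_bucket[5m])))   # histogram: p99
  request_duration_seconds{quantile="0.99"}   # summary: precomputed quantile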

Log Aggregation

Structured Logging: Services use structured logging with consistent fields:

  • Timestamp, level, message
  • Service name, instance, job
  • Request ID for tracing
  • Custom context fields
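
A hypothetical log line in this shape:

  {"timestamp":"2024-05-01T12:00:00Z","level":"error","message":"BMP session closed","service":"risotto","instance":"core","request_id":"abc123"}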

Label Strategy: Loki uses labels for indexing (not full-text):

  • job: Service type (risotto, pesto, saimiris)
  • instance: Server hostname
  • container_name: Docker container
  • level: Log level (info, warn, error)
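
These labels keep queries selective before any line filtering, e.g. (values illustrative):

  sum by (container_name) (count_over_time({instance="core", level="error"}[1h]))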

Alerting Strategy

Alert Severity Levels:

  • Critical: Service down, data loss risk
  • Warning: Degraded performance, approaching limits
  • Info: Notable events, configuration changes

Alert Grouping: Related alerts are grouped to reduce notification noise:

  • By service (all Redpanda alerts together)
  • By instance (all alerts from one server)
  • Time-based grouping (5-minute window)
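
In Alertmanager terms this maps onto the route's grouping knobs; the values here are a sketch consistent with the description above:

  route:
    group_by: [alertname, job, instance]
    group_wait: 30s       # initial buffering before the first notification
    group_interval: 5m    # the time-based grouping window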

Monitoring Coverage

Infrastructure Layer

System Metrics:

  • CPU, memory, disk, network utilization
  • Process counts and system load
  • Filesystem space and inode usage
  • Network interface errors and drops

Container Metrics:

  • Per-container resource usage
  • Container restart counts
  • Image pull statistics
  • Docker daemon health

Application Layer

Service Health:

  • HTTP endpoint availability (Upptime)
  • Service process uptime
  • Health check endpoints
  • Dependency availability

Performance Metrics:

  • Request rates and latencies
  • Error rates by type
  • Queue depths and processing times
  • Database query performance

Data Pipeline Layer

Kafka/Redpanda:

  • Topic lag and throughput
  • Producer and consumer rates
  • Broker health and replication
  • Partition distribution

ClickHouse:

  • Query execution times
  • Insert rates and batch sizes
  • Table sizes and compression ratios
  • Replication lag (if applicable)

Collectors:

  • Messages received and processed
  • Parse errors and invalid data
  • Buffer utilization
  • Backpressure indicators

Access and Usage

Grafana Access:

  • URL: https://grafana.nxthdr.dev
  • Authentication via Auth0
  • Pre-built dashboards for all services
  • Custom dashboard creation supported

Prometheus Access:

  • URL: https://prometheus.nxthdr.dev
  • Basic authentication required
  • PromQL query interface
  • API for programmatic access
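
For programmatic access, the standard Prometheus HTTP API works with basic auth (credentials illustrative):

  curl -u user:pass 'https://prometheus.nxthdr.dev/api/v1/query?query=up'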

Loki Access:

  • URL: https://loki.nxthdr.dev
  • Basic authentication required
  • LogQL query interface
  • Integrated in Grafana for log exploration

Status Pages:

  • Upptime: Public HTTP monitoring dashboard
  • UptimeRobot: ICMP monitoring status

Retention and Storage

Prometheus:

  • Retention: 15 days
  • Storage: Local filesystem on core server
  • Compaction: Automatic block compaction
  • Backup: Not currently implemented

Loki:

  • Retention: 7 days
  • Storage: Filesystem chunks and indexes
  • Compaction: Automatic with 2-hour delay
  • Cleanup: TTL-based deletion

Grafana:

  • Dashboard persistence: PostgreSQL
  • User sessions: In-memory
  • Snapshots: Stored indefinitely