Building an Intelligent Observability Stack: Auto-Healing Debugging with Grafana, OpenTelemetry, and a Custom Auto-Logging Service

observability · Dec 28, 2025

When something breaks in production at 3 AM, you need two things: visibility into what's happening and enough detail to understand why. The problem is that running with verbose logging all the time is expensive and noisy, and can itself degrade performance. What if your observability stack could detect errors and automatically flip on verbose logging, only when and where it matters?

This post walks through how I built a production-ready observability platform that does exactly that. It combines the Grafana Stack (Prometheus, Loki & Tempo) with OpenTelemetry and a custom Go service that orchestrates intelligent auto-logging sessions triggered by real-time error detection.

In a Nutshell

Before we dive into the architecture, here's what this stack delivers:

  • Auto-logging: DEBUG logging turns on automatically when errors spike and switches itself off after 30 minutes
  • Correlated signals: Jump from logs -> traces -> metrics in Grafana without copying IDs
  • Smart alerting: Three-tier alert system with automatic inhibition to prevent noise
  • Self-Hosted & Secure: Everything runs in a private VPC, with Caddy handling automatic HTTPS
  • Zero manual config expansion: Sed-based entrypoints solve YAML's lack of env var support

The Architecture

The stack follows a hub-and-spoke topology designed for multi-droplet deployment. A dedicated observability droplet receives telemetry from application droplets over private VPC networking, keeping sensitive telemetry data internal and secure.

Application Droplets (via VPC)
|
+-> OTLP gRPC (4317) / HTTP (4318) telemetry
+-> Auto-log status queries (5000)
|
Observability Droplet
+-> OTEL Collector (ingests & routes)
+-> Loki (logs) Tempo (traces) Prometheus (metrics)
+-> AlertManager (routing & notification)
+-> Auto-Log Service + Redis (intelligent verbose logging)
+-> Caddy (reverse proxy with auto-HTTPS)
+-> Grafana (visualization)

The key design decision: all telemetry flows through a single OpenTelemetry Collector that acts as both a gatekeeper and router. Applications never talk directly to Loki, Tempo or Prometheus. This centralizes processing, enrichment and routing in one place.

The Data Flow

Every request in an instrumented application generates three types of telemetry:

  1. Traces capture the full lifecycle of a request across service boundaries
  2. Logs capture structured events with severity levels
  3. Metrics capture aggregated counters, gauges and histograms

All three flow through the OTEL Collector, which enriches them with resource attributes (cluster name, environment), batches them for efficiency and routes them to the right backend.

The OTEL Collector configuration defines three parallel pipelines:

pipelines:
  traces:
    receivers: [otlp]
    processors: [memory_limiter, filter/known_services, batch, resource, attributes]
    exporters: [otlp/tempo]

  metrics:
    receivers: [otlp, prometheus]
    processors: [memory_limiter, batch, resource]
    exporters: [prometheusremotewrite]

  logs:
    receivers: [otlp]
    processors: [memory_limiter, filter/known_services, batch, resource/logs]
    exporters: [otlphttp/loki]

A filter/known_services processor acts as a gate. It drops any telemetry that lacks a service.name resource attribute, preventing unregistered or rogue services from flooding the pipeline. This filter is intentionally not applied to the metrics pipeline because Prometheus self-scrape metrics (from the collector itself) don't carry that attribute.
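
For reference, the gate itself is a filter processor along these lines (a sketch; the repo's exact conditions may differ, but the idea is to drop any span or log record whose resource carries no service.name):

processors:
  filter/known_services:
    error_mode: ignore
    traces:
      span:
        - 'resource.attributes["service.name"] == nil'
    logs:
      log_record:
        - 'resource.attributes["service.name"] == nil'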

The memory limiter caps the collector at 512 MiB with a 128 MiB spike buffer, preventing OOM kills under sudden load.
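
In collector config terms, that's a memory_limiter processor roughly like this (the 512/128 values come from the setup above; the check interval is an assumption):

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
    spike_limit_mib: 128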

Cross-Signal Correlation

One of the most valuable properties of this stack is the ability to jump between signals. A single Grafana datasource configuration wires everything together:

Logs to Traces: Loki extracts TraceID from structured logs using a derived field regex. Click a trace ID in any log line and you're taken directly to the full distributed trace in Tempo.

Traces to Logs: Tempo links back to Loki, filtering logs by the trace's time window (with a 1-minute buffer on each side) and service name. This means from any span in a trace, you can see every log line that was emitted during that operation.

Traces to Metrics: Tempo links to Prometheus queries for request rate, error rate, and P95 latency, scoped to the specific service and environment of the trace you're viewing.

This bidirectional linking eliminates the context-switching problem. You don't need to copy-paste trace IDs between tools or manually correlate timestamps. The datasource configuration handles it:

tracesToLogsV2:
  datasourceUid: loki
  spanStartTimeShift: '-1m'
  spanEndTimeShift: '1m'
  filterByTraceID: true
  tags:
    - key: 'service.name'
      value: 'service_name'

tracesToMetrics:
  datasourceUid: prometheus
  queries:
    - name: 'Request Rate'
      query: 'sum(rate(traces_spanmetrics_calls_total{service="${__tags.service}"}[5m]))'
    - name: 'Error Rate'
      query: '...'
    - name: 'Latency (p95)'
      query: 'histogram_quantile(0.95, ...)'
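
The logs-to-traces direction lives on the Loki datasource as a derived field. A sketch of the provisioning (the matcherRegex is an assumption and depends on how your structured logs carry the trace ID):

apiVersion: 1
datasources:
  - name: Loki
    type: loki
    uid: loki
    url: http://loki:3100
    jsonData:
      derivedFields:
        - name: TraceID
          datasourceUid: tempo
          # Adjust the regex to match your log field names
          matcherRegex: '"TraceId":\s*"(\w+)"'
          url: '${__value.raw}'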

The Auto-Logging System

This is the component that makes the stack "intelligent." The idea is simple: applications normally run with minimal logging (INFO level or above). When the system detects an error pattern, it automatically enables verbose (DEBUG) logging for the affected service, environment, and optionally specific endpoint. After a configurable TTL (default 30 minutes), verbose logging turns itself off.

How It Works

Four components make up the auto-logging system:

  1. Prometheus/Loki alert rules detect error conditions
  2. AlertManager routes alerts tagged with autolog: "true" to the auto-log service
  3. Auto-Log Service (custom Go) manages logging state in Redis with TTL
  4. Applications poll the auto-log API to check if they should enable verbose logging

Here's how the auto-logging flow works visually:

sequenceDiagram
    participant App
    participant Loki
    participant Prometheus
    participant AlertManager
    participant AutoLogService
    participant Redis
    
    App->>Loki: Error log
    Loki->>Prometheus: Metrics ingested
    Prometheus->>AlertManager: Rule triggered
    AlertManager->>AutoLogService: Webhook with autolog=true
    AutoLogService->>Redis: Store session with TTL
    App->>AutoLogService: Poll /check
    AutoLogService->>App: { enabled: true }
    App->>Loki: DEBUG logs

The Trigger Conditions

Three alert rules carry the autolog: "true" label, meaning they automatically trigger verbose logging:

Fatal errors (any FATAL log in a 2-minute window):

- alert: FatalErrorDetected
  expr: |
    count_over_time(
      {service_name=~".+"} | json | severity_text="FATAL" [2m]
    ) > 0
  labels:
    autolog: "true"

Sustained high error counts (100+ errors in 15 minutes):

- alert: SustainedHighErrorCount
  expr: |
    sum by (service_name, deployment_environment) (
      count_over_time(
        {service_name=~".+"} | json | severity_text=~"ERROR|FATAL" [15m]
      )
    ) > 100
  labels:
    autolog: "true"

Repeated HTTP 5xx errors (10+ server errors in 5 minutes):

- alert: RepeatedErrors
  expr: increase(http_server_requests_total{status=~"5.."}[5m]) > 10
  labels:
    autolog: "true"

The Auto-Log Service

The service is a lightweight Go binary (builds to a scratch container image) that manages state through Redis. Key design decisions:

Redis with TTL for session management: Auto-log sessions are stored as JSON in Redis with a TTL matching the configured duration (default 30 minutes). When the TTL expires, verbose logging disables itself automatically. No cleanup jobs, no stale state.

Error counting with time windows: The service tracks error counts per app/environment/endpoint using Redis INCR with TTL. When the count exceeds the threshold within the window, it triggers a session independently of AlertManager.

Webhook secret validation: All AlertManager webhooks are authenticated with an X-Webhook-Secret header. The auto-log service validates this against a shared secret, preventing unauthorized triggering of verbose logging.

Prometheus metrics: The service exports its own metrics (webhook request rates, active session counts, trigger counts, processing duration) which are scraped by Prometheus and visualized in a dedicated Grafana dashboard.
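
Wiring the service into the stack is a couple of Compose entries. A sketch (the build path, variable names, and Redis image tag here are illustrative, not the exact ones from the repo):

autolog:
  build: ./autolog-service
  environment:
    - REDIS_ADDR=redis:6379
    - AUTOLOG_DEFAULT_TTL=30m
    - AUTOLOG_WEBHOOK_SECRET=${AUTOLOG_WEBHOOK_SECRET:-}
  ports:
    - "5000:5000"
  depends_on:
    - redis

redis:
  image: redis:7-alpine
  volumes:
    - redis-data:/data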

The API is simple:

Endpoint                                  Method   Purpose
/api/autolog/check?app=X&environment=Y    GET      Check if verbose logging is enabled
/api/autolog/enable                       POST     Manually enable verbose logging
/api/autolog/disable                      POST     Manually disable verbose logging
/api/autolog/list                         GET      List all active sessions
/webhook/{type}                           POST     AlertManager webhook receiver

Applications integrate by adding a simple check to their logging configuration middleware:

// In your .NET middleware or similar. `levelSwitch` is the Serilog
// LoggingLevelSwitch the logger was built with via
// .MinimumLevel.ControlledBy(levelSwitch):
var response = await httpClient.GetAsync(
    $"{autoLogUrl}/api/autolog/check?app={appName}&environment={env}");
var result = await response.Content.ReadFromJsonAsync<AutoLogResponse>();

if (result?.Enabled == true)
{
    // Switch to DEBUG while the auto-log session is active
    levelSwitch.MinimumLevel = LogEventLevel.Debug;
}
else
{
    // Session expired (or never started): back to normal verbosity
    levelSwitch.MinimumLevel = LogEventLevel.Information;
}

Alert Architecture

The alerting system operates on three layers, each targeting different failure modes:

Layer 1: Log-Based Error Detection (Loki Rules)

Twelve rules that analyze log content in real time using LogQL. These catch application-level errors that only appear in log messages:

  • Error rate monitoring: Fires when error-severity logs exceed 0.1/sec
  • Error spike detection: Compares the current error rate to the rate one hour ago and fires on a 2x increase (rule sketched after this list)
  • Cross-environment errors: Detects when the same service is erroring in 2+ environments simultaneously (indicates a platform-wide issue, not a single-instance problem)
  • New error patterns: Uses the LogQL unless operator to find errors that weren't present an hour ago, catching regressions introduced by deployments
  • Domain-specific rules: Separate rules for database errors, authentication errors, and external service failures using regex matching
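
To make the spike rule concrete, it looks something like this (a sketch; the 15-minute window and the 2x factor are the tunable parts, and the label names follow the Loki rules shown earlier):

- alert: ErrorSpikeDetected
  expr: |
    sum by (service_name) (
      count_over_time({service_name=~".+"} | json | severity_text=~"ERROR|FATAL" [15m])
    )
    > 2 * sum by (service_name) (
      count_over_time({service_name=~".+"} | json | severity_text=~"ERROR|FATAL" [15m] offset 1h)
    )
  labels:
    severity: warning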

Layer 2: Metric-Based Application Alerts (Prometheus Rules)

Seven rules that analyze HTTP metrics and resource utilization:

  • 5xx error rate at 5% (warning) and 10% (critical) thresholds (example rule after this list)
  • P95 latency above 2 seconds
  • Application down (the up metric goes to 0)
  • Memory usage above 85% of available
  • Database connection pool above 80% capacity
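
The warning-level version of the first rule is roughly this (a sketch; the metric and status label come from the RepeatedErrors rule above, while the service grouping label is an assumption):

- alert: HighErrorRate
  expr: |
    sum by (service) (rate(http_server_requests_total{status=~"5.."}[5m]))
      /
    sum by (service) (rate(http_server_requests_total[5m]))
      > 0.05
  for: 5m
  labels:
    severity: warning
    category: app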

Layer 3: Infrastructure Health (Prometheus Rules)

Sixteen rules that monitor the observability stack itself:

  • Component health: Alert when any stack component (Grafana, Loki, Tempo, Prometheus, OTEL Collector) becomes unreachable
  • OTEL Collector pipeline health: Separate alerts for dropped spans, dropped logs, dropped metrics, export failures, and high memory usage approaching the 512 MiB limit
  • Storage health: Prometheus storage utilization, Loki ingestion rate
  • Disk space: Warning at 80%, critical at 90%
  • Notification health: A Watchdog alert (always-firing vector(1), sketched below) for dead man's switch integration, plus an alert on alertmanager_notifications_failed_total to detect broken SMTP
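
The Watchdog rule itself is tiny (the labels here are illustrative):

- alert: Watchdog
  expr: vector(1)
  labels:
    severity: none
    category: meta
  annotations:
    summary: "Heartbeat alert; page externally if this ever stops arriving"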

Alert Routing

AlertManager routes alerts through a tiered system:

Critical severity  -> critical-alerts (email + webhook, 1h repeat)
Warning severity   -> warning-alerts (email, 12h repeat)
autolog=true       -> autolog-webhook (immediate, 30m repeat)
Watchdog           -> watchdog (5m repeat, dead man's switch)
category=app       -> application-alerts (webhook)
category=infra     -> infrastructure-alerts (webhook)

Inhibition rules prevent alert storms: if a critical infrastructure alert is firing, warning-level application alerts for the same cluster are suppressed.
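
In alertmanager.yml, that inhibition looks roughly like this (the severity and category labels match the routing table above; cluster is the label both alerts must share):

inhibit_rules:
  - source_matchers:
      - severity = critical
      - category = infra
    target_matchers:
      - severity = warning
      - category = app
    equal: ['cluster']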

Environment Variable Substitution

Most observability tools don't support environment variables in their YAML configs, which makes dynamic configuration (like secrets, retention periods) painful. The solution is a sed-based entrypoint pattern. Config files use __VAR__ placeholders (double-underscore convention), and the Docker Compose entrypoint runs sed to substitute them from environment variables before starting the actual process:

alertmanager:
  environment:
    - SMTP_USERNAME=${SMTP_USERNAME:-}
    - AUTOLOG_WEBHOOK_SECRET=${AUTOLOG_WEBHOOK_SECRET:-}
  entrypoint: ["/bin/sh", "-c"]
  command:
    - |
      sed -e "s|__SMTP_USERNAME__|$$SMTP_USERNAME|g" \
          -e "s|__AUTOLOG_WEBHOOK_SECRET__|$$AUTOLOG_WEBHOOK_SECRET|g" \
          /etc/alertmanager/alertmanager.yml > /tmp/alertmanager.yml
      exec /bin/alertmanager --config.file=/tmp/alertmanager.yml

The $$ is Docker Compose syntax for a literal $ in the command string, which then references the container's environment variable at runtime. The original config file is mounted read-only, and the substituted version is written to /tmp.

This same pattern is used for Loki (retention period) and Tempo (block retention), making all retention values configurable from a single .env file.
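
On the Loki side, the placeholder ends up in the retention settings, something like this (the placeholder name is illustrative; the entrypoint's sed command substitutes it the same way as the AlertManager example above):

# loki-config.yml (sketch)
limits_config:
  retention_period: __LOKI_RETENTION_PERIOD__

compactor:
  retention_enabled: true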

Reverse Proxy and Network Security

Caddy handles HTTPS termination with automatic Let's Encrypt certificates. Only two services are exposed to the internet: Grafana and the Auto-Log API.

grafana.yourdomain.com  -> grafana:3000  (dashboard UI)
autolog.yourdomain.com  -> autolog:5000  (auto-log API)

Both routes include security headers (HSTS with preload, content-type nosniff, frame denial, strict referrer policy) and JSON access logging to files collected by Promtail.

The network security model uses UFW firewall rules:

Port               Protocol   Access          Purpose
80, 443            TCP/UDP    Public          Caddy (HTTPS + HTTP/3)
4317, 4318         TCP        VPC CIDR only   OTEL Collector
3100               TCP        VPC CIDR only   Loki (direct log shipping)
5000               TCP        VPC CIDR only   Auto-Log API
9090, 3200, 9093   TCP        Internal only   Prometheus, Tempo, AlertManager

Telemetry endpoints are scoped to the VPC CIDR range, meaning only droplets within the same private network can send telemetry. This prevents accidental exposure of the ingestion endpoints to the public internet.

Log Collection Strategy

The stack uses two parallel log collection strategies:

OTEL-Native Ingestion (Primary)

Applications instrumented with the OpenTelemetry SDK send structured logs directly through the OTEL Collector to Loki's native OTLP endpoint (/otlp). This is the preferred path because:

  • Log records carry trace context (trace ID, span ID) automatically
  • Resource attributes (service name, environment, SDK version) are preserved as structured metadata
  • No additional agent needed on the application droplet

Loki is configured to index key resource attributes as labels and store others as structured metadata:

otlp_config:
  resource_attributes:
    attributes_config:
      - action: index_label
        attributes:
          - service.name
          - environment
          - log.source
      - action: structured_metadata
        attributes:
          - deployment.environment
          - service.instance.id
          - telemetry.sdk.language

Promtail (Stack Logs + Caddy)

Promtail runs on the observability droplet to collect container logs from the stack itself (Grafana, Prometheus, Loki, Tempo, etc.) and Caddy reverse proxy access logs.

The Caddy log job demonstrates collecting from a shared Docker volume. Caddy writes JSON logs to a named volume and Promtail mounts the same volume read-only:

# docker-compose.yml
caddy:
  volumes:
    - caddy-logs:/var/log/caddy

promtail:
  volumes:
    - caddy-logs:/var/log/caddy:ro

The Promtail pipeline parses Caddy's JSON format (which uses ts for Unix epoch timestamps) and extracts HTTP status, method, and host as labels for querying in Grafana.
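
A sketch of that scrape job (field paths assume Caddy's default JSON encoder; adjust if you've customized the log format):

- job_name: caddy
  static_configs:
    - targets: [localhost]
      labels:
        job: caddy
        __path__: /var/log/caddy/*.log
  pipeline_stages:
    - json:
        expressions:
          ts: ts
          status: status
          method: request.method
          host: request.host
    - timestamp:
        source: ts
        format: Unix
    - labels:
        status:
        method:
        host: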

Smoke Testing

A smoke test script validates end-to-end telemetry flow by sending a test trace, log, and metric through the OTEL Collector, waiting for propagation, then querying each backend to verify arrival:

./scripts/smoke-test.sh [otel-http-endpoint]

The script generates a random trace ID, sends OTLP JSON payloads via HTTP, waits 10 seconds for batch processing and storage, then checks:

  1. Tempo for the trace: GET /api/traces/{traceId}
  2. Loki for the log: GET /loki/api/v1/query?query={service_name="smoke-test"}
  3. Prometheus for the metric: GET /api/v1/query?query=smoke_test_counter

This catches configuration drift, broken pipelines, and backend availability issues before they affect production telemetry.

Application Integration

On the application side, integration is minimal. Here's what a .NET service needs in its Docker Compose configuration:

jaiye-api:
  environment:
    - Observability__Endpoint=http://10.0.1.10:4317
    - AutoLog__ServiceUrl=http://10.0.1.10:5000
    - AutoLog__AppName=jaiye
  labels:
    - "app=jaiye"
    - "environment=production"
    - "service=jaiye-api"

The OTEL SDK (configured via environment variables) handles trace, metric, and log export. Docker labels enable Promtail service discovery if you choose to run Promtail on the application droplet as well.
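
If your SDK setup reads the standard OpenTelemetry environment variables rather than custom configuration keys, the equivalent settings look roughly like this (variable names are the SDK-standard ones; values are illustrative):

environment:
  - OTEL_EXPORTER_OTLP_ENDPOINT=http://10.0.1.10:4317
  - OTEL_EXPORTER_OTLP_PROTOCOL=grpc
  - OTEL_SERVICE_NAME=jaiye-api
  - OTEL_RESOURCE_ATTRIBUTES=deployment.environment=production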

The auto-log integration is a simple HTTP check that can be added to middleware, a background service, or a logging configuration reload hook.

Backup and Disaster Recovery

All persistent data lives in Docker named volumes. A recovery strategy identifies three tiers:

Critical (must back up): Configuration files (git), .env (secure storage), Grafana storage (UI dashboards), Prometheus storage (metrics)

Recommended: Loki storage (logs), Tempo storage (traces), AlertManager storage (silences)

Ephemeral (can be rebuilt): Redis (auto-log sessions expire anyway), Caddy (certificates auto-renew)

Volume backup is straightforward with docker run --rm alpine tar:

for vol in grafana-storage prometheus-storage loki-storage tempo-storage; do
  docker run --rm \
    -v observability_${vol}:/source:ro \
    -v $(pwd)/backups:/backup \
    alpine tar czf /backup/${vol}-$(date +%Y%m%d).tar.gz -C /source .
done

For full disaster recovery: provision a new droplet, clone the repo, restore .env, run the setup script, restore volume backups, update DNS.

Lessons Learned

AlertManager does not support environment variable expansion. The documentation doesn't make this obvious. If you put ${SMTP_PASSWORD} in your alertmanager.yml, it's sent as a literal string. Use sed-based entrypoints with placeholder substitution.

The OTEL filter processor will break Prometheus self-scrape metrics. The Prometheus receiver in the OTEL Collector produces metrics without service.name resource attributes. Applying a filter that drops metrics with nil service.name will silently kill your collector's self-monitoring. Only apply service-name filters to traces and logs pipelines.

Promtail can't read Caddy logs without a shared volume. Caddy writes to its own filesystem, and Promtail has no way to reach it unless you explicitly share a Docker volume between them. This applies to any file-based log collection across containers.

Loki and Tempo retention are not configurable via CLI flags. Unlike Prometheus (which accepts --storage.tsdb.retention.time), Loki and Tempo require config file changes for retention settings. The sed entrypoint pattern solves this cleanly.

What's Next

The stack is production-ready as described, but there are natural extensions:

  • Multi-tenancy: Enable Loki and Tempo tenant headers for data isolation between teams
  • Service mesh integration: Add Envoy or Istio proxy metrics to the OTEL Collector
  • External secret management: Replace the .env file with HashiCorp Vault or similar for production secrets
  • Grafana OIDC: Replace password auth with SSO for team access

The core insight is that observability isn't just about collecting data. It's about connecting the right data at the right time, and sometimes, generating more data when the system tells you something is wrong. The auto-logging pattern turns your observability stack from a passive data store into an active debugging partner.

This stack is running in production today, but observability is never "done." If you've built something similar or have ideas to extend it, I'd love to hear from you. Feel free to reach out or leave a comment.


The complete source code for this observability stack, including all configurations, dashboards, alert rules, and the auto-logging service, is available on GitHub.
