Observability
Configure backend traces and metrics for SurfSense
SurfSense uses OpenTelemetry for backend traces and metrics. Application logs include trace and span IDs so you can correlate logs with traces, but logs stay on the normal container stderr path.
Enable Locally
The development compose file reads backend settings from
surfsense_backend/.env. Add these values there:
SURFSENSE_ENABLE_OTEL=true
SURFSENSE_ENV=dev
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-lgtm:4317
OTEL_EXPORTER_OTLP_PROTOCOL=grpc
OTEL_RESOURCE_ATTRIBUTES=service.namespace=surfsense
OTEL_METRIC_EXPORT_INTERVAL=300000
Then start the development stack with the local LGTM backend:
docker compose -f docker/docker-compose.dev.yml up --build
Grafana is exposed on http://localhost:3001 by default.
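As a quick sanity check on the values above, the following sketch parses dotenv-style lines and decodes the two settings that are easiest to misread. OTEL_METRIC_EXPORT_INTERVAL is expressed in milliseconds, so 300000 means one metric export every five minutes, and OTEL_RESOURCE_ATTRIBUTES is a comma-separated list of key=value pairs. The helper names are hypothetical and are not part of SurfSense.

```python
def parse_env_lines(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    settings: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first "=" only, so values may contain "=" themselves.
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def parse_resource_attributes(value: str) -> dict[str, str]:
    """Decode the comma-separated key=value pairs of OTEL_RESOURCE_ATTRIBUTES."""
    attributes: dict[str, str] = {}
    for pair in value.split(","):
        if "=" in pair:
            key, _, attr_value = pair.partition("=")
            attributes[key.strip()] = attr_value.strip()
    return attributes


def export_interval_seconds(settings: dict[str, str]) -> float:
    """Convert OTEL_METRIC_EXPORT_INTERVAL from milliseconds to seconds."""
    return int(settings["OTEL_METRIC_EXPORT_INTERVAL"]) / 1000


# The same values listed above, inlined so the sketch is self-contained.
EXAMPLE_ENV = """\
SURFSENSE_ENABLE_OTEL=true
SURFSENSE_ENV=dev
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-lgtm:4317
OTEL_EXPORTER_OTLP_PROTOCOL=grpc
OTEL_RESOURCE_ATTRIBUTES=service.namespace=surfsense
OTEL_METRIC_EXPORT_INTERVAL=300000
"""

settings = parse_env_lines(EXAMPLE_ENV)
resource = parse_resource_attributes(settings["OTEL_RESOURCE_ATTRIBUTES"])
interval = export_interval_seconds(settings)
```

With a five-minute interval, a freshly started stack may show traces in Grafana well before the first metric points arrive.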
Enable in Production Docker Compose
Production Docker Compose reads backend and collector settings from
docker/.env. The API and Celery worker export telemetry to the bundled
collector at otel-collector:4317; the collector is the only service that uses
the Grafana Cloud credentials.
Add these values to docker/.env:
SURFSENSE_ENV=production
SURFSENSE_ENABLE_OTEL=true
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
OTEL_EXPORTER_OTLP_PROTOCOL=grpc
OTEL_RESOURCE_ATTRIBUTES=service.namespace=surfsense
OTEL_METRIC_EXPORT_INTERVAL=300000
GRAFANA_CLOUD_OTLP_ENDPOINT=https://otlp-gateway-<region>.grafana.net/otlp
GRAFANA_CLOUD_INSTANCE_ID=<stack instance id>
GRAFANA_CLOUD_API_KEY=<cloud access policy token>
Then start the stack:
docker compose -f docker/docker-compose.yml --profile observability up -d
The collector receives OTLP on otel-collector:4317, scrubs sensitive span
attributes, applies the configured tail-sampling policy, batches exports,
retries failures, and forwards traces and metrics to Grafana Cloud over OTLP
HTTP.
When deploying surfsense_backend/Dockerfile directly instead of production
compose, use the same split: SurfSense containers export to a collector, and
the collector owns the Grafana Cloud credentials.
Automatic Traces
When OpenTelemetry is enabled, the backend instruments:
- FastAPI inbound requests.
- SQLAlchemy queries from the main async engine and Celery task engine.
- Raw psycopg calls used by the LangGraph checkpointer.
- Redis commands.
- HTTPX outbound requests.
- Celery producer and worker execution.
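For orientation, the list above maps onto the standard OpenTelemetry Python instrumentor packages roughly as follows. This is a sketch, not SurfSense's actual wiring: the function names, the accepted truthy values for SURFSENSE_ENABLE_OTEL, and the order of calls are assumptions. The instrumentor imports are deferred so the module still loads when the OpenTelemetry packages are not installed and telemetry is disabled.

```python
import os
from collections.abc import Mapping


def otel_enabled(env: Mapping[str, str] = os.environ) -> bool:
    # Assumption: "1", "true", and "yes" enable telemetry; the real parsing may differ.
    return env.get("SURFSENSE_ENABLE_OTEL", "").strip().lower() in {"1", "true", "yes"}


def instrument_backend(app=None, async_engine=None, env: Mapping[str, str] = os.environ) -> bool:
    """Enable the automatic instrumentations; return False when telemetry is off."""
    if not otel_enabled(env):
        return False

    # Lazy imports: only needed (and only required to be installed) when enabled.
    from opentelemetry.instrumentation.celery import CeleryInstrumentor
    from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
    from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor
    from opentelemetry.instrumentation.psycopg import PsycopgInstrumentor
    from opentelemetry.instrumentation.redis import RedisInstrumentor
    from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor

    if app is not None:
        # Inbound FastAPI requests.
        FastAPIInstrumentor.instrument_app(app)
    if async_engine is not None:
        # Async SQLAlchemy engines are instrumented through their sync_engine.
        SQLAlchemyInstrumentor().instrument(engine=async_engine.sync_engine)
    PsycopgInstrumentor().instrument()      # raw psycopg calls
    RedisInstrumentor().instrument()        # Redis commands
    HTTPXClientInstrumentor().instrument()  # outbound HTTPX requests
    CeleryInstrumentor().instrument()       # Celery producer and worker spans
    return True
```

In Celery workers the instrumentor is usually invoked from a worker_process_init signal handler, so that each forked worker process sets up its own instrumentation.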
Manual Spans
SurfSense keeps project-specific spans behind app.observability.otel:
- model.call
- tool.call
- chat.request
- kb.search
- kb.persist
- connector.sync
- subagent.invoke
- etl.extract
- etl.parse
- etl.ocr
- etl.picture.describe
- etl.picture.ocr
- compaction.run
- permission.asked
- interrupt.raised
Keep span names and attributes low-cardinality. Do not attach user content, prompts, document titles, file paths, user-specific URLs, secrets, or raw queries as span attributes.
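One way to enforce the rule above is an allowlist applied before attributes reach a span. The sketch below is illustrative only: the attribute keys, the length cap, and the helper name are hypothetical and are not taken from app.observability.otel.

```python
# Hypothetical allowlist; the real attribute keys used by SurfSense may differ.
ALLOWED_ATTRIBUTES = frozenset({"model.provider", "tool.name", "connector.type", "outcome"})

# Long strings tend to be user content or identifiers, so cap the accepted length.
MAX_VALUE_LENGTH = 64


def safe_span_attributes(candidate: dict) -> dict:
    """Keep only allowlisted keys whose values are booleans or short strings."""
    safe = {}
    for key, value in candidate.items():
        if key not in ALLOWED_ATTRIBUTES:
            continue  # drops prompts, titles, paths, and anything else unlisted
        if isinstance(value, bool):
            safe[key] = value
        elif isinstance(value, str) and len(value) <= MAX_VALUE_LENGTH:
            safe[key] = value
    return safe


attributes = safe_span_attributes(
    {
        "tool.name": "web_search",
        "outcome": "ok",
        "prompt": "summarize my private document",  # not allowlisted
        "connector.type": "x" * 200,                # allowlisted but too long
    }
)
```

An allowlist fails closed: a new attribute is dropped until someone deliberately adds its key, which is the safer default for data that must never carry user content.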
Metrics
The OpenTelemetry instrumentors provide HTTP, HTTPX, and Celery runtime
metrics. SurfSense adds these project metrics from app.observability.metrics:
- surfsense.model.call.duration
- gen_ai.client.token.usage
- surfsense.tool.call.duration
- surfsense.tool.call.errors
- surfsense.chat.request.duration
- surfsense.chat.request.outcome
- surfsense.kb.search.duration
- surfsense.compaction.runs
- surfsense.permission.asks
- surfsense.interrupt.raised
- surfsense.indexing.document.duration
- surfsense.indexing.document.outcome
- surfsense.connector.sync.duration
- surfsense.connector.sync.outcome
- surfsense.subagent.invoke.duration
- surfsense.subagent.invoke.outcome
- surfsense.etl.extract.duration
- surfsense.etl.extract.outcome
- surfsense.celery.heartbeat.refreshes
- surfsense.celery.heartbeat.failures
- surfsense.celery.queue.latency
- surfsense.auth.failures
- surfsense.rate_limit.rejections
- surfsense.perf.elapsed_ms
Runtime gauges include process RSS, CPU utilization, threads, open file descriptors, asyncio tasks, and CPython GC counters.
Logs
LoggingInstrumentor().instrument() injects otelTraceID and otelSpanID into
standard Python LogRecords. The root log format writes them as
trace_id=... span_id=....
SurfSense intentionally does not create an OpenTelemetry LoggerProvider,
LoggingHandler, or OTLPLogExporter. Container stderr remains the log
transport.
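The effect on log lines can be reproduced with the standard library alone. In this sketch a small filter stands in for LoggingInstrumentor by filling otelTraceID and otelSpanID with 0 when no span is active; the format string, logger name, and sample IDs are assumptions, not the exact SurfSense root format.

```python
import io
import logging

# Assumed format; only the trace_id=... span_id=... part comes from the text above.
LOG_FORMAT = "%(levelname)s %(name)s trace_id=%(otelTraceID)s span_id=%(otelSpanID)s %(message)s"


class DefaultTraceFields(logging.Filter):
    """Stand-in for the instrumentor: supply zero IDs when a record has none."""

    def filter(self, record: logging.LogRecord) -> bool:
        if not hasattr(record, "otelTraceID"):
            record.otelTraceID = "0"
        if not hasattr(record, "otelSpanID"):
            record.otelSpanID = "0"
        return True


def build_logger(stream) -> logging.Logger:
    handler = logging.StreamHandler(stream)  # stderr in a real container
    handler.setFormatter(logging.Formatter(LOG_FORMAT))
    handler.addFilter(DefaultTraceFields())
    logger = logging.getLogger("surfsense.example")  # hypothetical logger name
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


buffer = io.StringIO()
log = build_logger(buffer)
log.info("outside any span")
# The extra dict simulates the fields the instrumentor would inject inside a span.
log.info(
    "inside a span",
    extra={"otelTraceID": "4bf92f3577b34da6a3ce929d0e0e4736", "otelSpanID": "00f067aa0ba902b7"},
)
lines = buffer.getvalue().splitlines()
```

Because the IDs travel as plain text in the log line, any log pipeline that already collects container stderr can be searched by trace ID without a separate OpenTelemetry log exporter.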
Verification
- Hit a FastAPI endpoint and confirm an inbound server span appears in Grafana.
- Run a chat request and confirm model.call and tool.call child spans.
- Run a knowledge-base search and confirm kb.search spans and SQL child spans.
- Run connector indexing and confirm Celery producer/worker spans share a trace ID and connector sync metrics increment.
- Confirm gen_ai.client.token.usage, model/tool durations, request duration, Celery runtime, and runtime gauges appear within one export interval.
- Confirm logs emitted inside a traced request show non-zero trace and span IDs.
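The log check in the last item can be scripted. The sketch below assumes log lines contain hexadecimal trace_id=... span_id=... fields separated by a single space, as in the root log format described under Logs; the function name and sample lines are hypothetical.

```python
import re

# Matches the trace_id=... span_id=... pair written by the root log format.
TRACE_FIELDS = re.compile(r"trace_id=(?P<trace>[0-9a-f]+) span_id=(?P<span>[0-9a-f]+)")


def has_trace_context(line: str) -> bool:
    """Return True when a log line carries non-zero trace and span IDs."""
    match = TRACE_FIELDS.search(line)
    if match is None:
        return False
    # All-zero IDs mean the record was emitted outside any active span.
    return int(match["trace"], 16) != 0 and int(match["span"], 16) != 0


sample_lines = [
    "INFO app trace_id=4bf92f3577b34da6a3ce929d0e0e4736 span_id=00f067aa0ba902b7 chat request done",
    "INFO app trace_id=0 span_id=0 startup complete",
    "INFO app message without trace fields",
]
results = [has_trace_context(line) for line in sample_lines]
```

Lines logged at startup or from background threads outside a request are expected to fail this check; only lines emitted inside a traced request need to pass.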
Out Of Scope
- Frontend/browser OpenTelemetry.
- OpenTelemetry log export.
- Profiling.
- Production backend selection.
- Tail-sampling collector configuration.
- Replacing LangSmith.
- Vendor SDKs.
