nanochat/contracts/logging-standard.md
Manmohan Sharma aa0818aae2
feat(observability): Prometheus + Grafana + Loki stack for samosaChaat (#9)
Replaces the helm/observability scaffold with a real monitoring stack
wired into the samosaChaat platform.

Helm chart (helm/observability/)
- Chart.yaml declares kube-prometheus-stack (~62.0) and loki-stack
  (~2.10) as subchart dependencies.
- values.yaml configures Prometheus (15d retention, 50Gi PVC,
  ServiceMonitor + rule selector on app.kubernetes.io/part-of:
  samosachaat), Alertmanager (10Gi PVC), Grafana (OAuth-only via
  GitHub + Google, local login disabled, Prometheus + Loki datasources,
  dashboards auto-provisioned from a ConfigMap, email + Slack contact
  points with a critical route to Slack), Loki (50Gi, 30d retention,
  tsdb schema), and Promtail (JSON pipeline that lifts level / service
  / trace_id / user_id into labels, scrape config with pod labels).
- Alert rules: HighCPU, HighMemory, DiskSpaceLow, High5xxRate,
  InferenceServiceDown, HighP99Latency.
- templates/grafana-dashboards-configmap.yaml renders every file under
  dashboards/ into a single grafana_dashboard=1 ConfigMap.
- dashboards/node-health.json, app-performance.json, inference.json -
  fully-formed Grafana dashboards with Prometheus datasource variable,
  templated app selector, thresholded gauges, and LogQL-ready labels.

Scraping (helm/samosachaat/templates/servicemonitor.yaml)
- ServiceMonitor CRs for auth / chat-api / inference that Prometheus
  picks up via the part-of=samosachaat selector; scrapes /metrics
  every 15s and replaces the app label so dashboards line up.

Application instrumentation
- services/{auth,chat-api,inference} each depend on
  prometheus-fastapi-instrumentator and expose /metrics (request count,
  latency histograms, in-progress gauges).
- services/auth/src/logging_setup.py and
  services/inference/src/logging_setup.py mirror the canonical
  chat-api implementation - structlog JSON with service, trace_id,
  user_id context injection.
- configure_logging() is called at create_app() in auth and inference;
  inference's main.py now uses structlog via get_logger() instead of
  logging.getLogger.
- log_level setting added to auth + inference config (LOG_LEVEL env).

Docs
- contracts/logging-standard.md defines the required JSON fields,
  Python (structlog) + Node.js (pino) implementations, LogQL examples
  for cross-service queries, and the x-trace-id propagation contract.

Closes #9

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-16 12:29:16 -07:00


# samosaChaat Logging Standard

All services in the samosaChaat platform emit logs as single-line JSON on stdout. Promtail ships them to Loki, where Grafana queries them by label and by JSON field. Because every service shares the same schema, a single trace_id lets you follow a request from the frontend through auth → chat-api → inference.

## Required fields

Every log line MUST include:

| Field | Type | Source |
| --- | --- | --- |
| `timestamp` | ISO8601 | structlog `TimeStamper(fmt="iso")` |
| `level` | string | `debug` / `info` / `warning` / `error` |
| `service` | string | hard-coded per service (`auth`, `chat-api`, `inference`, `frontend`) |
| `message` | string | the human-readable event (the `event` key in structlog) |

Conditionally included (when present in the request context):

| Field | When to include |
| --- | --- |
| `trace_id` | every request served by a backend service — propagated via the `x-trace-id` header |
| `user_id` | every request authenticated as a user |
| `inference_time_ms` | emitted by chat-api and inference around model calls |
| `error` | on exceptions — the stringified cause |

Anything else is free-form structured context (method, path, status_code, model_tag, …). Keep keys snake_case.
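
Putting the required and conditional fields together, a compliant log line might look like the following. This is a hand-built sketch using only the standard library; the field values (trace id, user id, path, timings) are illustrative, not taken from the codebase — only the keys are the contract:

```python
import json
from datetime import datetime, timezone

# Example values throughout; the schema (keys) is what the contract fixes.
record = {
    # Required fields:
    "timestamp": datetime(2026, 4, 16, 19, 29, 16, tzinfo=timezone.utc).isoformat(),
    "level": "info",
    "service": "chat-api",
    "message": "request_complete",
    # Conditional fields, present because this request had them:
    "trace_id": "3f9c2d1e8ab04c7f",
    "user_id": "u_1042",
    "inference_time_ms": 412,
    # Free-form structured context, snake_case keys:
    "method": "POST",
    "path": "/v1/chat",
    "status_code": 200,
}

line = json.dumps(record)  # one single-line JSON record on stdout
print(line)
```

Because the record is single-line JSON, Promtail's JSON pipeline can lift `level`, `service`, `trace_id`, and `user_id` into labels without any per-service parsing rules.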

## Python implementation — structlog

The canonical setup lives at services/chat-api/src/logging_setup.py. services/auth/src/logging_setup.py and services/inference/src/logging_setup.py mirror it, differing only in the hard-coded service value.

Key pieces:

```python
structlog.configure(
    processors=[
        structlog.contextvars.merge_contextvars,   # trace_id / user_id from context
        structlog.processors.add_log_level,        # -> level field
        structlog.processors.TimeStamper(fmt="iso", utc=True),
        _inject_context,                           # service + trace_id + user_id defaults
        structlog.processors.JSONRenderer(),       # final JSON line
    ],
    ...
)
```

Each service calls configure_logging() once at startup (inside create_app()), and then uses logger = get_logger(__name__) everywhere. trace_id is set by a FastAPI middleware that reads the incoming x-trace-id header (or mints a new one via new_trace_id()) and propagates it to downstream calls.

## Node.js implementation — pino (frontend)

The Next.js frontend should log JSON with the same schema. Reference config:

```ts
// services/frontend/lib/logger.ts
import pino from "pino";

export const logger = pino({
  base: { service: "frontend" },
  timestamp: pino.stdTimeFunctions.isoTime,
  formatters: {
    level: (label) => ({ level: label }),   // keep the string level
  },
  messageKey: "message",
});
```

When emitting, always include trace_id and user_id when known:

```ts
logger.info({ trace_id, user_id, path: req.url }, "request_start");
```

In API routes, read the incoming x-trace-id header and echo it back on the response so client-side traces can join up.

## Cross-service querying (LogQL / Grafana Explore)

Labels Promtail applies: namespace, app, pod, level, service, container. Everything else is a JSON field — use | json to extract it.

| Goal | Query |
| --- | --- |
| All errors in prod | `{namespace="samosachaat-prod", level="error"}` |
| Trace a request across tiers | `{namespace="samosachaat-prod"} \| json \| trace_id="<trace-id>"` |
| Auth failures | `{app="auth", level="error"}` |
| Slow inference calls | `{app="inference"} \| json \| inference_time_ms > 1000` |
| 5xx by service | `sum by (service) (count_over_time({namespace="samosachaat-prod"} \| json \| status_code >= 500 [5m]))` |
| Rate-limited OAuth logins | `{app="auth"} \| json \| status_code = 429` |

## Trace propagation contract

  1. Frontend — mint trace_id on navigation (or reuse an existing one from the current session), send it as x-trace-id on every fetch to the API.
  2. chat-api — read x-trace-id, store in a context var, re-emit on the response, and forward it on every httpx call to auth or inference.
  3. auth / inference — read x-trace-id and bind it to the logger context for the duration of the request.
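
Steps 2 and 3 reduce to the same rule at every hop: read the header, keep the value for logging, and forward it unchanged. A minimal, dependency-free sketch (the function name and return shape are illustrative, not the repo's API):

```python
import uuid

TRACE_HEADER = "x-trace-id"

def propagate_trace(incoming_headers: dict[str, str]) -> tuple[str, dict[str, str]]:
    """Return (trace_id, headers-to-forward) for a downstream call.

    Reuses the caller's x-trace-id if present, otherwise mints a new
    one, so every hop in frontend -> chat-api -> auth/inference logs
    the same trace_id.
    """
    trace_id = incoming_headers.get(TRACE_HEADER) or uuid.uuid4().hex
    return trace_id, {TRACE_HEADER: trace_id}
```

The first element goes into the logger context; the second is merged into the headers of every outgoing httpx (or fetch) call.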

Services MUST NOT log raw secrets, JWTs, OAuth client secrets, API keys, or full user message text. Log IDs, lengths, and booleans — not contents.