nanochat/contracts/logging-standard.md
Manmohan Sharma aa0818aae2
feat(observability): Prometheus + Grafana + Loki stack for samosaChaat (#9)
Replaces the helm/observability scaffold with a real monitoring stack
wired into the samosaChaat platform.

Helm chart (helm/observability/)
- Chart.yaml declares kube-prometheus-stack (~62.0) and loki-stack
  (~2.10) as subchart dependencies.
- values.yaml configures Prometheus (15d retention, 50Gi PVC,
  ServiceMonitor + rule selector on app.kubernetes.io/part-of:
  samosachaat), Alertmanager (10Gi PVC), Grafana (OAuth-only via
  GitHub + Google, local login disabled, Prometheus + Loki datasources,
  dashboards auto-provisioned from a ConfigMap, email + Slack contact
  points with a critical route to Slack), Loki (50Gi, 30d retention,
  tsdb schema), and Promtail (JSON pipeline that lifts level / service
  / trace_id / user_id into labels, scrape config with pod labels).
- Alert rules: HighCPU, HighMemory, DiskSpaceLow, High5xxRate,
  InferenceServiceDown, HighP99Latency.
- templates/grafana-dashboards-configmap.yaml renders every file under
  dashboards/ into a single grafana_dashboard=1 ConfigMap.
- dashboards/node-health.json, app-performance.json, inference.json -
  fully-formed Grafana dashboards with Prometheus datasource variable,
  templated app selector, thresholded gauges, and LogQL-ready labels.

Scraping (helm/samosachaat/templates/servicemonitor.yaml)
- ServiceMonitor CRs for auth / chat-api / inference that Prometheus
  picks up via the part-of=samosachaat selector; scrapes /metrics
  every 15s and replaces the app label so dashboards line up.

Application instrumentation
- services/{auth,chat-api,inference} each depend on
  prometheus-fastapi-instrumentator and expose /metrics (request count,
  latency histograms, in-progress gauges).
- services/auth/src/logging_setup.py and
  services/inference/src/logging_setup.py mirror the canonical
  chat-api implementation - structlog JSON with service, trace_id,
  user_id context injection.
- configure_logging() is called at create_app() in auth and inference;
  inference's main.py now uses structlog via get_logger() instead of
  logging.getLogger.
- log_level setting added to auth + inference config (LOG_LEVEL env).

Docs
- contracts/logging-standard.md defines the required JSON fields,
  Python (structlog) + Node.js (pino) implementations, LogQL examples
  for cross-service queries, and the x-trace-id propagation contract.

Closes #9

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-16 12:29:16 -07:00


# samosaChaat Logging Standard
All services in the samosaChaat platform emit logs as **single-line JSON**
on stdout. Promtail ships them to Loki, where Grafana queries them by label
and by JSON field. Because every service shares the same schema, a single
trace_id lets you follow a request from the frontend through auth → chat-api
→ inference.
## Required fields
Every log line MUST include:

| Field | Type | Source |
|-------------|---------|------------------------------------------|
| `timestamp` | ISO8601 | structlog `TimeStamper(fmt="iso")` |
| `level` | string | `debug` / `info` / `warning` / `error` |
| `service` | string | hard-coded per service (`auth`, `chat-api`, `inference`, `frontend`) |
| `message` | string | the human-readable event (`event` key in structlog) |
Conditionally included (when present in the request context):

| Field | When to include |
|--------------|------------------------------------------|
| `trace_id` | every request served by a backend service — propagated via the `x-trace-id` header |
| `user_id` | every request authenticated as a user |
| `inference_time_ms` | emitted by chat-api and inference around model calls |
| `error` | on exceptions — the stringified cause |
Anything else is free-form structured context (`method`, `path`,
`status_code`, `model_tag`, …). Keep keys `snake_case`.
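Putting the schema together, here is a dependency-free sketch of a compliant log line. The real services emit this via structlog (see below); `make_log_line` is purely illustrative:

```python
import json
from datetime import datetime, timezone

def make_log_line(level: str, service: str, message: str, **context) -> str:
    """Build a schema-compliant single-line JSON log entry.

    Illustrative only: the services produce equivalent output through
    structlog's JSONRenderer, not this helper.
    """
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "level": level,
        "service": service,
        "message": message,
    }
    # Conditional fields (trace_id, user_id, ...) ride along as free-form context.
    entry.update(context)
    return json.dumps(entry)

line = make_log_line("info", "chat-api", "request_start",
                     trace_id="abc123", method="GET", path="/chat")
```

Note that `json.dumps` guarantees the single-line property; multi-line payloads would break Promtail's per-line JSON pipeline.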
## Python implementation — `structlog`
The canonical setup lives at `services/chat-api/src/logging_setup.py`.
`services/auth/src/logging_setup.py` and `services/inference/src/logging_setup.py`
mirror it, differing only in the hard-coded `service` value.
Key pieces:
```python
structlog.configure(
    processors=[
        structlog.contextvars.merge_contextvars,   # trace_id / user_id from context
        structlog.processors.add_log_level,        # -> level field
        structlog.processors.TimeStamper(fmt="iso", utc=True),
        _inject_context,                           # service + trace_id + user_id defaults
        structlog.processors.JSONRenderer(),       # final JSON line
    ],
    ...
)
```
Each service calls `configure_logging()` once at startup (inside
`create_app()`), and then uses `logger = get_logger(__name__)` everywhere.
`trace_id` is set by a FastAPI middleware that reads the incoming
`x-trace-id` header (or mints a new one via `new_trace_id()`) and propagates
it to downstream calls.
## Node.js implementation — `pino` (frontend)
The Next.js frontend should log JSON with the same schema. Reference config:
```ts
// services/frontend/lib/logger.ts
import pino from "pino";
export const logger = pino({
  base: { service: "frontend" },
  timestamp: pino.stdTimeFunctions.isoTime,
  formatters: {
    level: (label) => ({ level: label }), // keep the string level
  },
  messageKey: "message",
});
```
When emitting, always include `trace_id` and `user_id` when known:
```ts
logger.info({ trace_id, user_id, path: req.url }, "request_start");
```
In API routes, read the incoming `x-trace-id` header and echo it back on the
response so client-side traces can join up.
## Cross-service querying (LogQL / Grafana Explore)
Labels Promtail applies: `namespace`, `app`, `pod`, `level`, `service`,
`container`. Everything else is a JSON field — use `| json` to extract it.

| Goal | Query |
|-------------------------------|-----------------------------------------------------------------------|
| All errors in prod | `{namespace="samosachaat-prod"} \| json \| level="error"` |
| Trace a request across tiers | `{namespace="samosachaat-prod"} \| json \| trace_id="<trace>"` |
| Auth failures | `{app="auth"} \| json \| level="error"` |
| Slow inference calls | `{app="inference"} \| json \| inference_time_ms > 5000` |
| 5xx by service | `{namespace="samosachaat-prod"} \| json \| status_code >= 500` |
| Rate-limited OAuth logins | `{app="auth"} \| json \| path=~"/auth/oauth/.*" \| status_code=429` |
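The same pipelines also feed metric queries. A hedged example (label names as above), counting errors per service over five-minute windows — useful as a Grafana panel rather than an Explore lookup:

```logql
sum by (service) (
  count_over_time({namespace="samosachaat-prod"} | json | level="error" [5m])
)
```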
## Trace propagation contract
1. **Frontend** — mint `trace_id` on navigation (or reuse an existing one from
the current session), send it as `x-trace-id` on every `fetch` to the API.
2. **chat-api** — read `x-trace-id`, store in a context var, re-emit on the
response, and forward it on every httpx call to auth or inference.
3. **auth / inference** — read `x-trace-id` and bind it to the logger context
for the duration of the request.
Services MUST NOT log raw secrets, JWTs, OAuth client secrets, API keys, or
full user message text. Log IDs, lengths, and booleans — not contents.
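One way to enforce this at the logging call site is a small scrubbing helper. The names here (`scrub`, `SENSITIVE_KEYS`) are hypothetical, not part of the services — a sketch of the rule, not its implementation:

```python
# Keys whose values must never reach the log stream verbatim
# (illustrative list; each service would maintain its own).
SENSITIVE_KEYS = {"password", "token", "jwt", "api_key", "client_secret", "message_text"}

def scrub(context: dict) -> dict:
    """Replace sensitive values with safe summaries before logging.

    Strings become their length, everything else a redaction marker,
    so log lines keep their shape without leaking contents.
    """
    safe = {}
    for key, value in context.items():
        if key in SENSITIVE_KEYS:
            safe[key] = len(value) if isinstance(value, str) else "[redacted]"
        else:
            safe[key] = value
    return safe
```

Called as `logger.info("chat_message", **scrub(ctx))`, this preserves the IDs, lengths, and booleans the standard allows while dropping the contents it forbids.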