Replaces the helm/observability scaffold with a real monitoring stack
wired into the samosaChaat platform.
Helm chart (helm/observability/)
- Chart.yaml declares kube-prometheus-stack (~62.0) and loki-stack
(~2.10) as subchart dependencies.
- values.yaml configures Prometheus (15d retention, 50Gi PVC,
ServiceMonitor + rule selector on app.kubernetes.io/part-of:
samosachaat), Alertmanager (10Gi PVC), Grafana (OAuth-only via
GitHub + Google, local login disabled, Prometheus + Loki datasources,
dashboards auto-provisioned from a ConfigMap, email + Slack contact
points with a critical route to Slack), Loki (50Gi, 30d retention,
tsdb schema), and Promtail (JSON pipeline that lifts level / service
/ trace_id / user_id into labels, scrape config with pod labels).
- Alert rules: HighCPU, HighMemory, DiskSpaceLow, High5xxRate,
InferenceServiceDown, HighP99Latency.
- templates/grafana-dashboards-configmap.yaml renders every file under
dashboards/ into a single grafana_dashboard=1 ConfigMap.
- dashboards/node-health.json, app-performance.json, inference.json -
fully-formed Grafana dashboards with Prometheus datasource variable,
templated app selector, thresholded gauges, and LogQL-ready labels.
Scraping (helm/samosachaat/templates/servicemonitor.yaml)
- ServiceMonitor CRs for auth / chat-api / inference that Prometheus
picks up via the part-of=samosachaat selector; scrapes /metrics
every 15s and replaces the app label so dashboards line up.
Application instrumentation
- services/{auth,chat-api,inference} each depend on
prometheus-fastapi-instrumentator and expose /metrics (request count,
latency histograms, in-progress gauges).
- services/auth/src/logging_setup.py and
services/inference/src/logging_setup.py mirror the canonical
chat-api implementation - structlog JSON with service, trace_id,
user_id context injection.
- configure_logging() is called at create_app() in auth and inference;
inference's main.py now uses structlog via get_logger() instead of
logging.getLogger.
- log_level setting added to auth + inference config (LOG_LEVEL env).
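For reference, the per-service wiring looks roughly like this (a sketch; import paths and helper names are assumptions, not a verbatim copy of the services' code):

```python
# Illustrative create_app() wiring; module paths and names are assumptions.
from fastapi import FastAPI
from prometheus_fastapi_instrumentator import Instrumentator

from .logging_setup import configure_logging, get_logger


def create_app() -> FastAPI:
    configure_logging()  # structlog JSON logging, configured once per process
    app = FastAPI()

    # Expose request metrics (counts, latency histograms) on /metrics for Prometheus.
    Instrumentator().instrument(app).expose(app)

    get_logger(__name__).info("service_started")
    return app
```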
Docs
- contracts/logging-standard.md defines the required JSON fields,
Python (structlog) + Node.js (pino) implementations, LogQL examples
for cross-service queries, and the x-trace-id propagation contract.
Closes #9
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
samosaChaat Logging Standard
All services in the samosaChaat platform emit logs as single-line JSON on stdout. Promtail ships them to Loki, where Grafana queries them by label and by JSON field. Because every service shares the same schema, a single trace_id lets you follow a request from the frontend through auth → chat-api → inference.
Required fields
Every log line MUST include:
| Field | Type | Source |
|---|---|---|
| `timestamp` | ISO8601 | structlog `TimeStamper(fmt="iso")` |
| `level` | string | debug / info / warning / error |
| `service` | string | hard-coded per service (auth, chat-api, inference, frontend) |
| `message` | string | the human-readable event (`event` key in structlog) |
Conditionally included (when present in the request context):
| Field | When to include |
|---|---|
| `trace_id` | every request served by a backend service — propagated via the `x-trace-id` header |
| `user_id` | every request authenticated as a user |
| `inference_time_ms` | emitted by chat-api and inference around model calls |
| `error` | on exceptions — the stringified cause |
Anything else is free-form structured context (method, path,
status_code, model_tag, …). Keep keys snake_case.
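For example, with the structlog setup described below, a completed chat-api request might be logged like this (field values are illustrative):

```python
logger.info(
    "request_complete",          # becomes the message field
    method="POST",
    path="/api/chat",
    status_code=200,
    inference_time_ms=412,
)
# Resulting line (roughly):
# {"timestamp": "...", "level": "info", "service": "chat-api",
#  "message": "request_complete", "trace_id": "...", "user_id": "...",
#  "method": "POST", "path": "/api/chat", "status_code": 200,
#  "inference_time_ms": 412}
```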
Python implementation — structlog
The canonical setup lives at services/chat-api/src/logging_setup.py.
services/auth/src/logging_setup.py and services/inference/src/logging_setup.py
mirror it, differing only in the hard-coded service value.
Key pieces:
structlog.configure(
    processors=[
        structlog.contextvars.merge_contextvars,   # trace_id / user_id from context
        structlog.processors.add_log_level,        # -> level field
        structlog.processors.TimeStamper(fmt="iso", utc=True),
        _inject_context,                           # service + trace_id + user_id defaults
        structlog.processors.JSONRenderer(),       # final JSON line
    ],
    ...
)
Each service calls configure_logging() once at startup (inside
create_app()), and then uses logger = get_logger(__name__) everywhere.
trace_id is set by a FastAPI middleware that reads the incoming
x-trace-id header (or mints a new one via new_trace_id()) and propagates
it to downstream calls.
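A minimal sketch of that middleware, assuming structlog's contextvars integration (the helper and function names below are illustrative, not the exact chat-api code):

```python
import uuid

import structlog
from fastapi import FastAPI, Request


def new_trace_id() -> str:
    """Mint a fresh trace id (illustrative implementation)."""
    return uuid.uuid4().hex


def install_trace_middleware(app: FastAPI) -> None:
    @app.middleware("http")
    async def bind_trace_id(request: Request, call_next):
        # Reuse the caller's trace id if one was sent, otherwise mint one.
        trace_id = request.headers.get("x-trace-id") or new_trace_id()

        # Make trace_id visible to every log call in this request via contextvars.
        structlog.contextvars.clear_contextvars()
        structlog.contextvars.bind_contextvars(trace_id=trace_id)

        response = await call_next(request)
        # Echo the trace id back so clients and upstream services can join traces.
        response.headers["x-trace-id"] = trace_id
        return response
```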
Node.js implementation — pino (frontend)
The Next.js frontend should log JSON with the same schema. Reference config:
// services/frontend/lib/logger.ts
import pino from "pino";
export const logger = pino({
  base: { service: "frontend" },
  timestamp: pino.stdTimeFunctions.isoTime,
  formatters: {
    level: (label) => ({ level: label }), // keep the string level
  },
  messageKey: "message",
});
When emitting, always include trace_id and user_id when known:
logger.info({ trace_id, user_id, path: req.url }, "request_start");
In API routes, read the incoming x-trace-id header and echo it back on the
response so client-side traces can join up.
Cross-service querying (LogQL / Grafana Explore)
Labels Promtail applies: namespace, app, pod, level, service,
container. Everything else is a JSON field — use | json to extract it.
| Goal | Query |
|---|---|
| All errors in prod | `{namespace="samosachaat-prod", level="error"}` |
| Trace a request across tiers | `{namespace="samosachaat-prod"} \| json \| trace_id="<trace-id>"` |
| Auth failures | `{app="auth", level="error"}` |
| Slow inference calls | `{app="inference"} \| json \| inference_time_ms > 1000` |
| 5xx by service | `sum by (service) (rate({namespace="samosachaat-prod"} \| json \| status_code >= 500 [5m]))` |
| Rate-limited OAuth logins | `{app="auth"} \|= "rate_limit"` |

These queries are starting points; adjust the filter strings and thresholds (for example the 1000 ms cut-off for slow inference calls) to your deployment.
Trace propagation contract
- Frontend — mint `trace_id` on navigation (or reuse an existing one from the current session), send it as `x-trace-id` on every `fetch` to the API.
- chat-api — read `x-trace-id`, store it in a context var, re-emit it on the response, and forward it on every httpx call to auth or inference (see the sketch after this list).
- auth / inference — read `x-trace-id` and bind it to the logger context for the duration of the request.
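A sketch of the chat-api forwarding step, assuming the trace id was bound into structlog's contextvars by the request middleware (the service URL and function name are illustrative):

```python
import httpx
import structlog


async def call_inference(payload: dict) -> dict:
    # Pull the trace_id bound by the request middleware; fall back to empty.
    ctx = structlog.contextvars.get_contextvars()
    headers = {"x-trace-id": ctx.get("trace_id", "")}

    async with httpx.AsyncClient() as client:
        resp = await client.post(
            "http://inference:8000/generate",  # illustrative service URL
            json=payload,
            headers=headers,
        )
    resp.raise_for_status()
    return resp.json()
```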
Services MUST NOT log raw secrets, JWTs, OAuth client secrets, API keys, or full user message text. Log IDs, lengths, and booleans — not contents.