mirror of
https://github.com/karpathy/nanochat.git
synced 2026-05-15 04:07:32 +00:00
Replaces the helm/observability scaffold with a real monitoring stack
wired into the samosaChaat platform.
Helm chart (helm/observability/)
- Chart.yaml declares kube-prometheus-stack (~62.0) and loki-stack
(~2.10) as subchart dependencies.
- values.yaml configures Prometheus (15d retention, 50Gi PVC,
ServiceMonitor + rule selector on app.kubernetes.io/part-of:
samosachaat), Alertmanager (10Gi PVC), Grafana (OAuth-only via
GitHub + Google, local login disabled, Prometheus + Loki datasources,
dashboards auto-provisioned from a ConfigMap, email + Slack contact
points with a critical route to Slack), Loki (50Gi, 30d retention,
tsdb schema), and Promtail (JSON pipeline that lifts level / service
/ trace_id / user_id into labels, scrape config with pod labels).
- Alert rules: HighCPU, HighMemory, DiskSpaceLow, High5xxRate,
InferenceServiceDown, HighP99Latency.
- templates/grafana-dashboards-configmap.yaml renders every file under
dashboards/ into a single grafana_dashboard=1 ConfigMap.
- dashboards/node-health.json, app-performance.json, inference.json -
fully-formed Grafana dashboards with Prometheus datasource variable,
templated app selector, thresholded gauges, and LogQL-ready labels.
Scraping (helm/samosachaat/templates/servicemonitor.yaml)
- ServiceMonitor CRs for auth / chat-api / inference that Prometheus
picks up via the part-of=samosachaat selector; scrapes /metrics
every 15s and replaces the app label so dashboards line up.
Application instrumentation
- services/{auth,chat-api,inference} each depend on
prometheus-fastapi-instrumentator and expose /metrics (request count,
latency histograms, in-progress gauges).
- services/auth/src/logging_setup.py and
services/inference/src/logging_setup.py mirror the canonical
chat-api implementation - structlog JSON with service, trace_id,
user_id context injection.
- configure_logging() is called at create_app() in auth and inference;
inference's main.py now uses structlog via get_logger() instead of
logging.getLogger.
- log_level setting added to auth + inference config (LOG_LEVEL env).
Docs
- contracts/logging-standard.md defines the required JSON fields,
Python (structlog) + Node.js (pino) implementations, LogQL examples
for cross-service queries, and the x-trace-id propagation contract.
Closes #9
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
111 lines
4.7 KiB
Markdown
# samosaChaat Logging Standard

All services in the samosaChaat platform emit logs as **single-line JSON**
on stdout. Promtail ships them to Loki, where Grafana queries them by label
and by JSON field. Because every service shares the same schema, a single
`trace_id` lets you follow a request from the frontend through auth → chat-api
→ inference.

## Required fields

Every log line MUST include:

| Field       | Type    | Source                                   |
|-------------|---------|------------------------------------------|
| `timestamp` | ISO8601 | structlog `TimeStamper(fmt="iso")` |
| `level`     | string  | `debug` / `info` / `warning` / `error` |
| `service`   | string  | hard-coded per service (`auth`, `chat-api`, `inference`, `frontend`) |
| `message`   | string  | the human-readable event (`event` key in structlog) |

Conditionally included (when present in the request context):

| Field        | When to include                          |
|--------------|------------------------------------------|
| `trace_id`   | every request served by a backend service; propagated via the `x-trace-id` header |
| `user_id`    | every request authenticated as a user |
| `inference_time_ms` | emitted by chat-api and inference around model calls |
| `error`      | on exceptions: the stringified cause |

Anything else is free-form structured context (`method`, `path`,
`status_code`, `model_tag`, …). Keep keys `snake_case`.

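
As a rough illustration of the required-field table, the sketch below checks one log line against it. The function name `check_log_line` and the sample values are hypothetical and not part of the platform code; the check only covers the four mandatory fields and the listed levels.

```python
import json

# Field names and levels are taken from the "Required fields" table.
REQUIRED_FIELDS = ("timestamp", "level", "service", "message")
ALLOWED_LEVELS = {"debug", "info", "warning", "error"}


def check_log_line(line: str) -> list[str]:
    """Return the problems found in one log line (empty list if it conforms)."""
    if "\n" in line.strip():
        return ["log entry spans multiple lines"]
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["top-level JSON value is not an object"]

    problems = [
        f"missing required field: {name}"
        for name in REQUIRED_FIELDS
        if name not in record
    ]
    level = record.get("level")
    if level is not None and level not in ALLOWED_LEVELS:
        problems.append(f"unexpected level: {level!r}")
    return problems


# Hypothetical sample record, for illustration only.
example = json.dumps({
    "timestamp": "2026-01-01T12:00:00Z",
    "level": "info",
    "service": "chat-api",
    "message": "request_start",
    "trace_id": "hypothetical-trace-id",
})
print(check_log_line(example))
print(check_log_line('{"level": "info"}'))
```

A check like this could run in a unit test that captures a service's stdout, so schema drift shows up before the logs reach Loki.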
## Python implementation — `structlog`

The canonical setup lives at `services/chat-api/src/logging_setup.py`.
`services/auth/src/logging_setup.py` and `services/inference/src/logging_setup.py`
mirror it, differing only in the hard-coded `service` value.

Key pieces:

```python
import structlog

structlog.configure(
    processors=[
        structlog.contextvars.merge_contextvars,   # trace_id / user_id from context
        structlog.processors.add_log_level,        # -> level field
        structlog.processors.TimeStamper(fmt="iso", utc=True),
        _inject_context,                           # service + trace_id + user_id defaults
        structlog.processors.JSONRenderer(),       # final JSON line
    ],
    # ... (remaining configure() arguments elided)
)
```
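
The body of `_inject_context` is not shown in this document. A structlog processor is a plain callable taking `(logger, method_name, event_dict)` and returning the event dict, so a processor of that kind might look like the sketch below. The names `inject_context`, `SERVICE_NAME`, `trace_id_var`, and `user_id_var` are assumptions for illustration, and the real helper may behave differently (for example, by emitting explicit defaults).

```python
import contextvars

# Hypothetical stand-ins: each service would hard-code its own name.
SERVICE_NAME = "chat-api"
trace_id_var = contextvars.ContextVar("trace_id", default=None)
user_id_var = contextvars.ContextVar("user_id", default=None)


def inject_context(logger, method_name, event_dict):
    """structlog-style processor: add service, plus trace_id/user_id when set."""
    event_dict.setdefault("service", SERVICE_NAME)
    for key, var in (("trace_id", trace_id_var), ("user_id", user_id_var)):
        value = var.get()
        if value is not None:
            # setdefault keeps a value the caller passed explicitly.
            event_dict.setdefault(key, value)
    return event_dict


# Simulate one request: bind a trace id, log an event, then unbind.
token = trace_id_var.set("hypothetical-trace-id")
print(inject_context(None, "info", {"event": "request_start"}))
trace_id_var.reset(token)
```

Omitting unset keys, rather than writing `null`, is one way to match the "conditionally included" rule in the schema.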

Each service calls `configure_logging()` once at startup (inside
`create_app()`), and then uses `logger = get_logger(__name__)` everywhere.
`trace_id` is set by a FastAPI middleware that reads the incoming
`x-trace-id` header (or mints a new one via `new_trace_id()`) and propagates
it to downstream calls.

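
The read-or-mint step described above can be sketched without any web framework. This is a minimal illustration, not the project's middleware: `bind_trace_id`, `outgoing_headers`, and `current_trace_id` are hypothetical names, and the `new_trace_id()` shown here is a stand-in that assumes a uuid4 hex string, which may not match the real ID format.

```python
import contextvars
import uuid
from typing import Mapping

TRACE_HEADER = "x-trace-id"
current_trace_id = contextvars.ContextVar("current_trace_id", default=None)


def new_trace_id() -> str:
    """Mint a fresh trace id (uuid4 hex is an assumption about the format)."""
    return uuid.uuid4().hex


def bind_trace_id(headers: Mapping[str, str]) -> str:
    """Reuse the caller's trace id if present, otherwise mint one, and store it."""
    # HTTP header names are case-insensitive, so normalise before the lookup.
    normalised = {name.lower(): value for name, value in headers.items()}
    trace_id = normalised.get(TRACE_HEADER, "").strip() or new_trace_id()
    current_trace_id.set(trace_id)
    return trace_id


def outgoing_headers() -> dict[str, str]:
    """Headers to attach to downstream calls so the trace id travels along."""
    trace_id = current_trace_id.get()
    return {TRACE_HEADER: trace_id} if trace_id else {}


print(bind_trace_id({"X-Trace-Id": "abc123"}))
print(outgoing_headers())
```

In a real service the middleware would call something like `bind_trace_id(request.headers)` at the start of each request, and HTTP client calls would merge in `outgoing_headers()`.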
## Node.js implementation — `pino` (frontend)

The Next.js frontend should log JSON with the same schema. Reference config:

```ts
// services/frontend/lib/logger.ts
import pino from "pino";

export const logger = pino({
  base: { service: "frontend" },
  // pino's built-in isoTime writes the key "time"; emit "timestamp" to match the schema.
  timestamp: () => `,"timestamp":"${new Date().toISOString()}"`,
  formatters: {
    level: (label) => ({ level: label }), // keep the string level
  },
  messageKey: "message",
});
```

When emitting, include `trace_id` and `user_id` whenever they are known:

```ts
logger.info({ trace_id, user_id, path: req.url }, "request_start");
```

In API routes, read the incoming `x-trace-id` header and echo it back on the
response so client-side traces can join up.

## Cross-service querying (LogQL / Grafana Explore)

Labels Promtail applies: `namespace`, `app`, `pod`, `level`, `service`,
`container`. Everything else is a JSON field; use `| json` to extract it.

| Goal                          | Query                                                                 |
|-------------------------------|-----------------------------------------------------------------------|
| All errors in prod            | `{namespace="samosachaat-prod"} \| json \| level="error"` |
| Trace a request across tiers  | `{namespace="samosachaat-prod"} \| json \| trace_id="<trace>"` |
| Auth failures                 | `{app="auth"} \| json \| level="error"` |
| Slow inference calls          | `{app="inference"} \| json \| inference_time_ms > 5000` |
| 5xx by service                | `{namespace="samosachaat-prod"} \| json \| status_code >= 500` |
| Rate-limited OAuth logins     | `{app="auth"} \| json \| path=~"/auth/oauth/.*" \| status_code=429` |

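
When Loki is not available (for example, while running a service locally), a similar field filter can be applied to captured stdout. The sketch below is only a rough local analogue of the `| json | field="value"` pattern, not a LogQL implementation; `filter_logs` and the sample lines are hypothetical.

```python
import json
from typing import Iterable, Iterator


def filter_logs(lines: Iterable[str], **expected: object) -> Iterator[dict]:
    """Yield parsed records whose fields equal every expected key/value."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore anything that is not a JSON log line
        if isinstance(record, dict) and all(
            record.get(key) == value for key, value in expected.items()
        ):
            yield record


# Hypothetical captured output from two services.
captured = [
    '{"service": "auth", "level": "info", "trace_id": "t-1", "message": "token_ok"}',
    '{"service": "inference", "level": "error", "trace_id": "t-1", "message": "model_failed"}',
    '{"service": "auth", "level": "info", "trace_id": "t-2", "message": "token_ok"}',
    "plain text that is not JSON",
]

for record in filter_logs(captured, trace_id="t-1"):
    print(record["service"], record["message"])
```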
## Trace propagation contract

1. **Frontend**: mint `trace_id` on navigation (or reuse an existing one from
   the current session), send it as `x-trace-id` on every `fetch` to the API.
2. **chat-api**: read `x-trace-id`, store in a context var, re-emit on the
   response, and forward it on every httpx call to auth or inference.
3. **auth / inference**: read `x-trace-id` and bind it to the logger context
   for the duration of the request.

Services MUST NOT log raw secrets, JWTs, OAuth client secrets, API keys, or
full user message text. Log IDs, lengths, and booleans, not contents.
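
One way to follow the "IDs, lengths, and booleans" rule is to derive log fields from sensitive values before they reach the logger. The helper below is a hypothetical sketch; the function name and the field names `message_length`, `message_empty`, and `has_token` are illustrative and not defined by this standard.

```python
from typing import Optional


def safe_message_fields(user_message: str, bearer_token: Optional[str]) -> dict:
    """Describe sensitive inputs by length and presence, never by content."""
    return {
        "message_length": len(user_message),
        "message_empty": not user_message.strip(),
        "has_token": bool(bearer_token),
    }


# Hypothetical inputs; neither string appears in the resulting fields.
fields = safe_message_fields("what is the capital of France?", "hypothetical.jwt.value")
print(fields)
```

The resulting dict can be passed as keyword context to a log call, for example `logger.info("chat_request", **fields)`, so the raw message and token never enter the event.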