- FastAPI service that manages conversations and messages in PostgreSQL
(SQLAlchemy 2.0 async + asyncpg) and streams assistant responses back
to the client via sse-starlette, forwarding the inference service SSE
contract unchanged.
- Auth guard validates every request against the auth service
/auth/validate endpoint (X-Internal-API-Key) and caches results in an
in-process TTL cache (5 min, 1024 entries) to absorb request bursts.
- Every query filters by authenticated user_id; cross-user access
returns 404. Message send flow auto-titles the first message,
persists the streamed assistant response after the client disconnects,
and records token_count + inference_time_ms.
- /api/models{,/swap} proxies the inference admin surface; swap
requires is_admin on the validated user.
- Structured JSON logging via structlog with trace_id + user_id
ContextVars attached to every log line.
- Test suite (pytest + aiosqlite + respx) covers CRUD, user scoping,
streaming SSE persistence, regenerate, model proxy admin gate,
and the stream proxy error path. 16/16 passing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add reusable Terraform modules and per-environment configs (dev/uat/prod)
in us-west-2 covering: VPC (3 AZ public/private), EKS 1.29 with IRSA and
ALB/EBS/EFS CSI add-ons, RDS PostgreSQL 15, four ECR repos, IAM roles
(EKS node, ALB controller IRSA, GitHub Actions OIDC), Route53 + ACM for
samosachaat.art, and EFS for model weights. State backend on S3
(samosachaat-terraform-state) with DynamoDB lock table.
terraform validate passes for dev, uat, and prod.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Merged landing + chat into single page (samosa/chai slide out on first message)
- Positioned input bar between samosa and chai illustrations
- Footer at very bottom with Karpathy credit
- Removed cart icon, fixed "Aachaat" → "Chaat" everywhere
- Improved lemon SVG with stem/nub
- "Explore" → "Samosa" label
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Landing page with desi street-food aesthetic: lemon-mirchi toran with
pendulum animation, dual-script hero (Devanagari + English cursive),
samosa illustration with floating animation, brass chai kettle with
steam wisps, ambient chilli/lemon doodles.
Chat page carries the warm samosa-chaat palette with cream/gold user
bubbles, steam-wisp typing indicator, and WebGPU integration hooks
(window.samosaChaat API for local inference mode switching).
Added scripts/export_onnx.py for ONNX model export with KV cache
support, targeting WebGPU browser inference.
Credit to Andrej Karpathy's nanochat in footer.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The KV cache was hardcoded to float32 on non-CUDA devices, but the model
weights are loaded in bfloat16 via NANOCHAT_DTYPE env var. This caused a
RuntimeError in scaled_dot_product_attention. Now uses COMPUTE_DTYPE from
common.py which respects the env var.
Also broadened CI/CD path triggers to nanochat/**.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Deploys to EC2 on push to master when UI/server files change.
Uses appleboy/ssh-action with stored secrets.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New architectural features:
- Smear: mix previous token embedding into current position via learned
gate, providing cheap bigram-like info (works in training + KV cache)
- Backout: subtract learned fraction of mid-layer residual before logit
projection to remove low-level features
Hyperparameter tuning:
- Muon momentum warmdown 0.97→0.90 during LR warmdown phase
- Non-uniform per-layer init: resid_lambdas 1.15→1.05, x0_lambdas 0.20→0.05
- c_fc init scale 0.4x, QK norm scale 1.2, sliding window seq_len/4
- Speedrun data:params ratio reduced to 8
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* printing steps count
* adding reply only loss for chat
* using the mask by render_conversation function of tokeniser
* undoing some changes
* putting back the comment which got removed accidently, no functionality change
Store quantized input/weight and their inverse scales in _Float8Matmul ctx to avoid re-quantization in backward and reduce saved-activation memory without changing numerics.