The previous text-stream approach lost markers: BPE partial-byte tokens decode to
empty strings, so assistant_text never accumulated the full marker. Switch to
matching the ordinary-token id sequence directly
(python_start = [60, 124, 25145, 95, 17104, 124, 62]).
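The token-id match reduces to a subsequence scan over the sampled ids. A minimal sketch (the helper name and streaming context are illustrative; the id list comes from the commit above):

```python
# Ordinary-token encoding of "<|python_start|>", as listed in the commit.
PYTHON_START = [60, 124, 25145, 95, 17104, 124, 62]

def find_marker(token_ids, marker=PYTHON_START):
    """Return the index where `marker` begins in `token_ids`, or -1 if absent."""
    n = len(marker)
    for i in range(len(token_ids) - n + 1):
        if token_ids[i:i + n] == marker:
            return i
    return -1
```

Because the match is on raw ids, it fires even when individual tokens decode to empty strings.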
The SFT loader tokenizes assistant content with .encode() (ordinary), not
.encode_special(), so the model was trained to emit <|python_start|> /
<|python_end|> as the 7-token ordinary sequence [60, 124, 25145, 95, 17104,
124, 62] rather than as special token id 32764. My prior state machine matched
token_id == python_start_id, which never fired, so tool calls were never
executed and the model hallucinated fake tool results ("Official leadership
page", etc.). Fix: detect the markers in the decoded text stream, parse the
payload between <|python_start|> and <|python_end|>, execute the tool, and
inject the real <|output_start|>…<|output_end|> tokens into both the SSE
stream and the model's input_ids. Next-token prediction is then grounded on
real Tavily output.
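The decoded-text detection described above can be sketched with a regex over the accumulated assistant text (the helper name is illustrative; the marker strings come from the commit):

```python
import re

# Payload between the markers is the tool call to parse and execute.
TOOL_CALL_RE = re.compile(r"<\|python_start\|>(.*?)<\|python_end\|>", re.DOTALL)

def extract_tool_calls(assistant_text):
    """Return every payload found between the start/end markers."""
    return TOOL_CALL_RE.findall(assistant_text)
```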
Model R6 (97% pass rate on 33-probe eval, val_bpb 0.2635):
- modal/serve.py + modal/_tools.py: tool-aware streaming with
TavilySearchBackend auto-detect, python_start/end state machine,
output_start/end forcing; mount tavily secret
- modal/serve.py: MODEL_TAG=d24-sft-r6, model path points at new SFT r6
- services/chat-api/routes/messages.py: accept thinking_mode flag,
inject samosaChaat system prompt (direct or <think> variant) into
first user message before streaming to Modal
- services/frontend/components/chat/ChatInput.tsx: Brain toggle
'Think' button next to send; when active, model uses think mode
- services/frontend/components/chat/ChatWindow.tsx: track
thinkingMode state, pass through to API body as thinking_mode
- services/frontend/components/chat/MessageBubble.tsx: parse and
render <think>...</think> as collapsible italic blocks;
<|python_start|>...<|python_end|> as tool-call cards with icons
per tool name; <|output_start|>...<|output_end|> as result cards
with expandable JSON
- nanochat/tools.py: TavilySearchBackend class + auto-detect
- nanochat/ui.html: legacy UI reasoning toggle (kept for parity)
Tool execution verified live: query -> web_search via Tavily -> grounded
answer about Macron.
When docker compose recreates a service, it gets a new internal IP.
nginx was resolving upstream hostnames once at startup and serving 502
until someone manually restarted it — which is what broke /api/auth
after the last deploy.
Fix: point nginx at Docker Compose's embedded DNS (resolver 127.0.0.11) and
move each proxy_pass target into a variable so nginx re-resolves it on every
request. Explicit rewrite rules restore the path stripping that variable-form
proxy_pass doesn't provide out of the box.
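A minimal sketch of the pattern (the upstream name and port are assumptions based on this repo's compose services):

```nginx
# Re-resolve container names on every request via Docker's embedded DNS.
resolver 127.0.0.11 valid=10s;

location /api/auth/ {
    # Variable-form proxy_pass forces per-request DNS resolution,
    # but loses the automatic prefix stripping, hence the rewrite.
    set $auth_upstream http://auth:8001;
    rewrite ^/api/auth/(.*)$ /$1 break;
    proxy_pass $auth_upstream;
}
```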
Also adds a `nginx -t && nginx -s reload` step in the deploy workflow
so future nginx.conf edits land without manual ssh.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Assistant responses were capped at max-w-[75%] of the column, so long
replies broke into a narrow block with dead space on the right. Cap
only applies to user bubbles now; assistant messages use w-full of the
max-w-3xl content column, matching how ChatGPT/Claude render replies.
Also bumps message vertical spacing from mb-3 to mb-5.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
LandingNav was max-w-3xl which forced "How it works" and "Try
samosaChaat" to wrap on two lines. Bumps the pill to 1100px,
tightens the link padding, demotes the @ handle to lg+, and adds
whitespace-nowrap to every chip so nothing wraps again. Default
theme is now dark — the no-flash init script adds .dark unless the
user has explicitly stored 'light', and the useTheme hook seeds
from the same logic.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
aws-actions/configure-aws-credentials needs id-token: write to mint the
OIDC JWT and assume AWS_ROLE_ARN. Without it the deploy-ec2 workflow
fails at the credentials step. Add the permission at workflow scope.
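The permission block looks like this at workflow scope (a sketch; only `id-token: write` is named by the commit, `contents: read` restores the default read access that setting `permissions` otherwise removes):

```yaml
# .github/workflows/deploy-ec2.yml
permissions:
  id-token: write   # required for the OIDC token exchange
  contents: read    # keep checkout working once defaults are overridden
```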
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(inference): deploy d24 SFT weights to Modal
Repoint Modal inference app from the broken d20 checkpoint to our own
ManmohanSharma/nanochat-d24 SFT step 484. Rewrites the standalone model
as an inference-only port of nanochat/gpt.py so the modern architecture
(smear gate, per-layer value embeddings, ve_gate, backout, sliding
window attention via SDPA, rotary base 100000, padded vocab, logit
softcap) loads cleanly from the checkpoint. Tokenizer loads the pickled
tiktoken encoding directly so special tokens end up at their true IDs
(32759-32767), and the stop check uses that set instead of hardcoded
0-8. GPU bumped to L4 for headroom. HF token sourced from the
'huggingface' Modal secret.
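The stop check reduces to membership in the special-token id range (names are illustrative; the 32759-32767 range comes from the commit above):

```python
# Special tokens occupy ids 32759-32767 in this tokenizer's padded vocab.
SPECIAL_TOKEN_IDS = frozenset(range(32759, 32768))

def should_stop(token_id):
    """True if the sampled token is any special token, not a hardcoded 0-8."""
    return token_id in SPECIAL_TOKEN_IDS
```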
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(frontend): polished redesign with serif display + dark mode
Lifts the craft level of the landing and chat UI without changing the
desi identity. Adds Fraunces for display headlines, a floating pill
LandingNav, a saffron-glow hero with a large serif headline and black
pill CTAs, and three gradient-tiled feature cards with inline SVG
glyphs replacing the emoji cards. The chat empty state is now a serif
greeting with pill-chip prompt starters, and ChatInput is a single
rounded pod so the send button sits inside the input (fixes the
misaligned floating button). Adds a class-based dark mode across the
chat surfaces with a sun/moon toggle in the sidebar footer, powered by
a small useTheme hook and a no-flash init script in the root layout.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* chore(frontend): add ESLint config so CI lint step passes
`next lint` was failing with an interactive prompt because the repo had
no ESLint config. Adds a minimal next/core-web-vitals extends and
drops the now-unloadable @typescript-eslint/no-explicit-any disable
directive in the stream proxy by narrowing the body type to unknown.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaced the double-proxy (browser→Next.js→chat-api→Modal) with
direct streaming (browser→nginx→chat-api→Modal). Added nginx route
for /api/conversations → chat-api. Inlined SSE parsing in ChatWindow
instead of useSSE hook going through /api/chat/stream.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Chat-api expects {content, temperature, max_tokens, top_k} but frontend
was sending {messages: [...]}. Now extracts last user message as content
when proxying to /api/conversations/:id/messages.
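The conversion can be sketched as follows (function name and default sampling values are assumptions, not the service's real values):

```python
def to_chat_api_body(messages, temperature=0.7, max_tokens=512, top_k=50):
    """Convert a {messages: [...]} body into the shape chat-api expects."""
    # Walk the history backwards to find the most recent user turn.
    content = next(
        (m["content"] for m in reversed(messages) if m.get("role") == "user"),
        "",
    )
    return {
        "content": content,
        "temperature": temperature,
        "max_tokens": max_tokens,
        "top_k": top_k,
    }
```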
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Nginx was catching /api/chat/stream and /api/conversations and sending
them to chat-api:8002, bypassing the frontend's Next.js API routes.
Now only /api/auth/* goes directly to auth service. Everything else
goes to frontend, which proxies internally to backend services.
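Sketched routing after the change (ports follow the surrounding commits; the frontend port and exact directives are assumptions):

```nginx
# Only auth bypasses Next.js; everything else hits the frontend,
# which proxies internally to backend services.
location /api/auth/ {
    proxy_pass http://auth:8001;
}
location / {
    proxy_pass http://frontend:3000;
}
```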
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Chat-api doesn't define the users model (owned by auth service), so
SQLAlchemy can't resolve the FK. use_alter=True defers the constraint
to ALTER TABLE, avoiding the NoReferencedTableError at startup.
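The pattern, sketched with a hypothetical conversations table (column names are assumptions):

```python
from sqlalchemy import Column, ForeignKey, Integer, MetaData, String, Table

metadata = MetaData()

# `users` is owned by the auth service and never defined in this metadata,
# so the FK must be deferred to an ALTER TABLE instead of emitted inline.
conversations = Table(
    "conversations",
    metadata,
    Column("id", Integer, primary_key=True),
    Column("title", String),
    Column(
        "user_id",
        Integer,
        ForeignKey("users.id", use_alter=True, name="fk_conversations_user"),
    ),
)
```

use_alter constraints need an explicit name so SQLAlchemy can emit and drop them separately.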
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The inference client now auto-detects if the URL already ends with
/generate (Modal's endpoint URL pattern) and skips appending the path.
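The check can be sketched as (function name is illustrative):

```python
def resolve_generate_url(base_url):
    """Append /generate unless the URL already ends with it (Modal's pattern)."""
    trimmed = base_url.rstrip("/")
    if trimmed.endswith("/generate"):
        return trimmed
    return trimmed + "/generate"
```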
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Landing page: warm gradient background, illustrations flanking hero text
(180-220px), new tagline, features section with 3 cards, footer updated
to "Built by Manmohan", gold CTA and nav buttons, toran moved to hero.
Chat page: removed "Chat Completions" header, added samosa logo and
bigger suggestion cards to empty state, sidebar empty state message,
input area top border/shadow, more prominent new chat button.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
AUTH_BASE_URL, FRONTEND_URL, INTERNAL_API_KEY, SESSION_SECRET,
COOKIE_SECURE, COOKIE_DOMAIN, REFRESH_COOKIE_NAME were in .env
but not passed to auth container. OAuth callbacks were using
localhost:8001 instead of the public URL.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Root .gitignore had `models/` which matched both ML weights AND
SQLAlchemy model dirs. Changed to `/models/` (root only).
Added auth/src/models/ (User) and chat-api/src/models/ (Conversation, Message).
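The one-character fix (a leading slash anchors the pattern to the repo root):

```gitignore
# Before: models/  — matched every models/ dir, including src/models/ packages
# After: only the top-level ML weights directory is ignored
/models/
```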
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Auth service was crash-looping with a ModuleNotFoundError for
prometheus_fastapi_instrumentator; chat-api was missing it too. Add the
dependency to both services.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove NextAuth and replace with token-based auth against the backend
auth service (OAuth + JWT). The frontend now redirects login to
/api/auth/google and /api/auth/github (proxied by nginx to the auth
service), captures the JWT from the redirect query param, and uses it
for all API calls.
Key changes:
- Remove next-auth dependency and all NextAuth config/routes
- Add lib/auth-client.ts (JWT token storage + auth headers)
- Add hooks/useAuth.ts (client-side auth state + token capture)
- Rewrite middleware.ts to pass-through (client-side auth only)
- Login page uses plain <a> links to /api/auth/{provider}
- Chat page captures access_token from OAuth redirect
- Zustand store fetches conversations from real chat-api via JWT
- API routes proxy /api/conversations/* to chat-api with auth
- chat/stream route supports conversationId + auth header forwarding
- useSSE hook accepts auth headers for authenticated streaming
- Sidebar loads conversations from API, supports delete
- Landing page (Hero, LandingNav) uses useAuth instead of useSession
- Add .env.production.example and scripts/generate-jwt-keys.sh
Mock echo fallback preserved when CHAT_API_URL is not set.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- deploy.sh: single script to switch between EC2 and EKS modes
- ec2: docker-compose with ECR images + nginx SSL reverse proxy
- eks: terraform apply + helm install (for demos/grading)
- eks-down: terraform destroy (stop costs)
- docker-compose.prod.yml: ECR image overrides + nginx service
- nginx/nginx.conf: reverse proxy with SSL, SSE streaming support
- deploy-ec2.yml: auto-deploy to EC2 after images are built
- Remove old single-server deploy.yml
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The observability PR added structlog and prometheus-fastapi-instrumentator
to the inference pyproject.toml but did not regenerate uv.lock, so the
Docker build failed under the --locked flag. Regenerate the lockfile to match.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Root pyproject.toml uses uv features (extra in sources, conflicts)
that caused uv sync to fail in CI. Fix by:
1. Replace pip install uv==0.4.30 with astral-sh/setup-uv@v4 (latest)
2. Add --no-workspace flag so services don't inherit root config
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
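Sketch of the resulting CI step (the --no-workspace flag is taken from the commit above; the step name and working directory are assumptions):

```yaml
- uses: astral-sh/setup-uv@v4
- name: Sync service deps
  run: uv sync --no-workspace
  working-directory: services/chat-api   # repeated per service
```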