nanochat/modal/_tokenizer.py
Manmohan 1d2a76eec4
feat: deploy d24 SFT + polished UI redesign with dark mode (#39)
* feat(inference): deploy d24 SFT weights to Modal

Repoint Modal inference app from the broken d20 checkpoint to our own
ManmohanSharma/nanochat-d24 SFT step 484. Rewrites the standalone model
as an inference-only port of nanochat/gpt.py so the modern architecture
(smear gate, per-layer value embeddings, ve_gate, backout, sliding
window attention via SDPA, rotary base 100000, padded vocab, logit
softcap) loads cleanly from the checkpoint. Tokenizer loads the pickled
tiktoken encoding directly so special tokens end up at their true IDs
(32759-32767), and the stop check uses that set instead of hardcoded
0-8. GPU bumped to L4 for headroom. HF token sourced from the
'huggingface' Modal secret.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(frontend): polished redesign with serif display + dark mode

Lifts the craft level of the landing and chat UI without changing the
desi identity. Adds Fraunces for display headlines, a floating pill
LandingNav, a saffron-glow hero with a large serif headline and black
pill CTAs, and three gradient-tiled feature cards with inline SVG
glyphs replacing the emoji cards. The chat empty state is now a serif
greeting with pill-chip prompt starters, and ChatInput is a single
rounded pod so the send button sits inside the input (fixes the
misaligned floating button). Adds a class-based dark mode across the
chat surfaces with a sun/moon toggle in the sidebar footer, powered by
a small useTheme hook and a no-flash init script in the root layout.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(frontend): add ESLint config so CI lint step passes

next lint was failing with an interactive prompt because the repo had
no ESLint config. Adds a minimal next/core-web-vitals extends and
drops the now-unloadable @typescript-eslint/no-explicit-any disable
directive in the stream proxy by narrowing the body type to unknown.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-16 19:55:16 -04:00
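
The special-token numbering described in the first commit bullet can be sketched in plain Python. This is a minimal illustration rather than the deployed code: `N_MERGEABLE`, `STOP_IDS`, and `truncate_at_stop` are hypothetical names, and the value 32759 is an assumption chosen so that nine appended specials land on the quoted range 32759-32767.

```python
# Sketch only: names and sizes below are assumptions for illustration.
SPECIAL_NAMES = [
    "<|bos|>",
    "<|user_start|>",
    "<|user_end|>",
    "<|assistant_start|>",
    "<|assistant_end|>",
    "<|python_start|>",
    "<|python_end|>",
    "<|output_start|>",
    "<|output_end|>",
]

# Assumed number of ordinary (mergeable) tokens in the vocabulary.
N_MERGEABLE = 32759

# Specials are appended after the ordinary tokens, in declaration order.
special_ids = {name: N_MERGEABLE + i for i, name in enumerate(SPECIAL_NAMES)}

# Stop set built from the real special IDs instead of a hardcoded 0-8.
STOP_IDS = set(special_ids.values())


def truncate_at_stop(token_ids, stop_ids):
    """Return the tokens that precede the first stop token."""
    kept = []
    for token_id in token_ids:
        if token_id in stop_ids:
            break
        kept.append(token_id)
    return kept
```

Under these assumptions IDs 0-8 are ordinary tokens, so a stop check hardcoded to that range would cut generation at the wrong places and never fire on `<|assistant_end|>`.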


"""
Minimal standalone tokenizer for Modal inference.
Loads the pickled tiktoken Encoding from a nanochat tokenizer/ directory and
exposes encode / decode / encode_special methods used by serve.py.
"""
import os
import pickle

import tiktoken

# Only the key order is used when the encoding is rebuilt below; the 0-8
# values are placeholders, not the token IDs the model sees.
SPECIAL_TOKENS = {
    "<|bos|>": 0,
    "<|user_start|>": 1,
    "<|user_end|>": 2,
    "<|assistant_start|>": 3,
    "<|assistant_end|>": 4,
    "<|python_start|>": 5,
    "<|python_end|>": 6,
    "<|output_start|>": 7,
    "<|output_end|>": 8,
}

# nanochat split pattern (matches nanochat/tokenizer.py)
SPLIT_PATTERN = r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,2}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+"""


class NanochatTokenizer:
    def __init__(self, model_dir: str):
        pkl_path = os.path.join(model_dir, "tokenizer.pkl")
        token_bytes_path = os.path.join(model_dir, "token_bytes.pt")

        if os.path.exists(pkl_path):
            with open(pkl_path, "rb") as f:
                loaded = pickle.load(f)
            if isinstance(loaded, tiktoken.Encoding):
                # Preferred path: the pickle already carries the true special-token IDs.
                self._enc = loaded
                return
            if isinstance(loaded, dict):
                mergeable_ranks = loaded
            elif hasattr(loaded, "_mergeable_ranks"):
                mergeable_ranks = loaded._mergeable_ranks
            else:
                self._enc = loaded
                return
        elif os.path.exists(token_bytes_path):
            import torch

            token_bytes = torch.load(token_bytes_path, weights_only=True)
            mergeable_ranks = {bytes(token_bytes[i].tolist()): i for i in range(len(token_bytes))}
        else:
            raise FileNotFoundError(f"No tokenizer found in {model_dir}")

        # nanochat appends specials at the end of the merge table
        offset = len(mergeable_ranks)
        special_tokens = {name: offset + i for i, name in enumerate(SPECIAL_TOKENS)}
        self._enc = tiktoken.Encoding(
            name="nanochat",
            pat_str=SPLIT_PATTERN,
            mergeable_ranks=mergeable_ranks,
            special_tokens=special_tokens,
        )

    def encode(self, text: str) -> list[int]:
        return self._enc.encode_ordinary(text)

    def decode(self, tokens: list[int]) -> str:
        return self._enc.decode(tokens)

    def encode_special(self, token_name: str) -> list[int]:
        return [self._enc.encode_single_token(token_name)]

    def get_vocab_size(self) -> int:
        return self._enc.n_vocab


def get_tokenizer(model_dir: str) -> NanochatTokenizer:
    return NanochatTokenizer(model_dir)