perf: optimize prepend operation in tokenizer encode method

Replace list.insert(0, ...) with list concatenation when prepending a token id. Both are O(n), but concatenation copies in a single pass with a lower constant factor, and the batch case drops a Python-level loop of per-row insert calls. This improves performance when encoding large batches of text.

Before:
- Single string: ids.insert(0, prepend_id) - O(n)
- Batch: for ids_row in ids: ids_row.insert(0, prepend_id) - O(n*m)

After:
- Single string: ids = [prepend_id] + ids - still O(n), but a single copy with a lower constant factor
- Batch: ids = [[prepend_id] + row for row in ids] - still O(n*m), but one comprehension instead of a Python-level loop of insert calls (see the micro-benchmark sketch below)
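
A rough micro-benchmark sketch of the batch path (illustrative, not part of the commit; the batch shape and token id are made up, and timings will vary by machine):

import timeit

prepend_id = 50256                   # hypothetical special-token id
batch = [list(range(128)) for _ in range(1000)]

def insert_loop():
    ids = [row[:] for row in batch]  # fresh rows, like a new encode result
    for row in ids:
        row.insert(0, prepend_id)    # shifts every element right, per row
    return ids

def concat_comprehension():
    ids = [row[:] for row in batch]
    return [[prepend_id] + row for row in ids]

print("insert loop  :", min(timeit.repeat(insert_loop, number=50, repeat=5)))
print("comprehension:", min(timeit.repeat(concat_comprehension, number=50, repeat=5)))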

The code comments already flagged this inefficiency ("TODO: slightly inefficient here?" and "TODO: same"), but it went unaddressed until now.
sandog 2026-03-07 11:38:17 +08:00
parent 1076f97059
commit 4cfa58829e

@@ -232,15 +232,16 @@ class RustBPETokenizer:
         if isinstance(text, str):
             ids = self.enc.encode_ordinary(text)
+            # Use list concatenation instead of insert(0, ...): same O(n), lower constant factor
             if prepend is not None:
-                ids.insert(0, prepend_id) # TODO: slightly inefficient here? :( hmm
+                ids = [prepend_id] + ids
             if append is not None:
                 ids.append(append_id)
         elif isinstance(text, list):
             ids = self.enc.encode_ordinary_batch(text, num_threads=num_threads)
+            # Use list concatenation in a comprehension instead of a Python-level insert(0, ...) loop per row
             if prepend is not None:
-                for ids_row in ids:
-                    ids_row.insert(0, prepend_id) # TODO: same
+                ids = [[prepend_id] + row for row in ids]
             if append is not None:
                 for ids_row in ids:
                     ids_row.append(append_id)
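
A quick sanity check one could run (illustrative, not from the commit; the token ids are made up). Note that append stays a plain loop in the new code, since list.append is amortized O(1) and needs no change:

prepend_id, append_id = 50256, 50257  # hypothetical special-token ids

def old_style(batch):
    ids = [row[:] for row in batch]
    for row in ids:
        row.insert(0, prepend_id)   # old prepend path
        row.append(append_id)
    return ids

def new_style(batch):
    ids = [[prepend_id] + row for row in batch]  # new prepend path
    for row in ids:
        row.append(append_id)
    return ids

batch = [[1, 2, 3], [4], []]
assert old_style(batch) == new_style(batch)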