perf: optimize prepend operation in tokenizer encode method

Replace list.insert(0, ...) with list concatenation when prepending a token id. Both are O(n), but concatenation copies in a single pass with a lower constant factor, and the batch case drops a Python-level loop of per-row insert calls. This improves performance when encoding large batches of text.

Before:
- Single string: ids.insert(0, prepend_id) - O(n)
- Batch: for ids_row in ids: ids_row.insert(0, prepend_id) - O(n*m)

After:
- Single string: ids = [prepend_id] + ids - still O(n), but a single copy with a lower constant factor
- Batch: ids = [[prepend_id] + row for row in ids] - still O(n*m), but one comprehension instead of a Python-level loop of insert calls (see the micro-benchmark sketch below)
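
A rough micro-benchmark sketch of the batch path (illustrative, not part of the commit; the batch shape and token id are made up, and timings will vary by machine):

import timeit

prepend_id = 50256                   # hypothetical special-token id
batch = [list(range(128)) for _ in range(1000)]

def insert_loop():
    ids = [row[:] for row in batch]  # fresh rows, like a new encode result
    for row in ids:
        row.insert(0, prepend_id)    # shifts every element right, per row
    return ids

def concat_comprehension():
    ids = [row[:] for row in batch]
    return [[prepend_id] + row for row in ids]

print("insert loop  :", min(timeit.repeat(insert_loop, number=50, repeat=5)))
print("comprehension:", min(timeit.repeat(concat_comprehension, number=50, repeat=5)))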

The code comments already flagged this inefficiency ("TODO: slightly inefficient here?" and "TODO: same"), but it went unaddressed until now.
sandog 2026-03-07 11:38:17 +08:00
parent 1076f97059
commit 4cfa58829e

@@ -232,15 +232,16 @@ class RustBPETokenizer:
         if isinstance(text, str):
             ids = self.enc.encode_ordinary(text)
+            # Use list concatenation instead of insert(0, ...): same O(n), lower constant factor
             if prepend is not None:
-                ids.insert(0, prepend_id) # TODO: slightly inefficient here? :( hmm
+                ids = [prepend_id] + ids
             if append is not None:
                 ids.append(append_id)
         elif isinstance(text, list):
             ids = self.enc.encode_ordinary_batch(text, num_threads=num_threads)
+            # Use list concatenation in a comprehension instead of a Python-level insert(0, ...) loop per row
             if prepend is not None:
-                for ids_row in ids:
-                    ids_row.insert(0, prepend_id) # TODO: same
+                ids = [[prepend_id] + row for row in ids]
             if append is not None:
                 for ids_row in ids:
                     ids_row.append(append_id)
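
A quick sanity check one could run (illustrative, not from the commit; the token ids are made up). Note that append stays a plain loop in the new code, since list.append is amortized O(1) and needs no change:

prepend_id, append_id = 50256, 50257  # hypothetical special-token ids

def old_style(batch):
    ids = [row[:] for row in batch]
    for row in ids:
        row.insert(0, prepend_id)   # old prepend path
        row.append(append_id)
    return ids

def new_style(batch):
    ids = [[prepend_id] + row for row in batch]  # new prepend path
    for row in ids:
        row.append(append_id)
    return ids

batch = [[1, 2, 3], [4], []]
assert old_style(batch) == new_style(batch)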