# Multilingual Support Proposal for nanochat
## Original Issue
### Summary
Add first-class multilingual support across tokenizer, data, training, and evaluation so models trained with nanochat learn robust semantic embeddings for multiple languages, not just English.
### Motivation
The current setup is English-centric (dataset and SFT tasks), which limits performance and semantic fidelity for non-English scripts. Many users want to analyze or serve content in languages such as Spanish, French, German, Arabic, Hindi, Chinese, Japanese, and Korean.
### Current Behavior
**Tokenizer**: Unicode-aware byte-level BPE with a GPT-style regex, but tuned for English; it works on any script, though segmentation is suboptimal for CJK, Thai, and other scripts. See `nanochat/tokenizer.py:30` and usage at `nanochat/tokenizer.py:75`.
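To make the segmentation gap concrete, here is a quick comparison using tiktoken's `gpt2` encoding as a stand-in for an English-tuned byte-level BPE (nanochat trains its own tokenizer, so this illustrates the effect rather than nanochat's exact counts):
```python
# Illustrative only: tiktoken's "gpt2" encoding as a stand-in for an
# English-tuned byte-level BPE; nanochat's own tokenizer will differ in detail.
import tiktoken

enc = tiktoken.get_encoding("gpt2")

english = "Language models learn from text."
chinese = "语言模型从文本中学习。"  # roughly the same sentence in Chinese

# English gets a handful of word-level tokens; the CJK text falls back to
# multi-byte pieces, inflating the token count per character.
print(len(enc.encode(english)))
print(len(enc.encode(chinese)))
```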
**Data**: Pretraining pulls from FineWeb-Edu shards, predominantly English. See `nanochat/dataset.py:19`.
**Training**: No multilingual-specific settings, losses, or evaluations; the SFT and midtraining tasks are English-only.
### Alternatives Considered
- Relying on byte-level fallback only: functional, but yields poor segmentation for CJK/Thai and weaker downstream performance.
- Training the English-style tokenizer on multilingual data: merges are still biased toward English, inflating non-English token lengths.
---
## Proposed Changes
### Option 1: Multilingual Data Task (Minimal)
Add a generic task class that allows loading any HuggingFace dataset for multilingual training.
**Changes Required:**
1. **Create new file**: `tasks/multilingual.py` (~50 lines)
- Generic wrapper around `load_dataset()`
- Supports any HF dataset with conversation format
- Follows existing `Task` interface (sketched after this list)
2. **Documentation**: Update `README.md` with usage example
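A minimal sketch of what item 1 could look like, assuming the `Task` base class lives in `tasks/common.py` (adjust the import to match the repo) and exposes the `num_examples()`/`get_example()` hooks named in the checklist below:
```python
# tasks/multilingual.py -- sketch only; the import path and the return format
# of get_example() are assumptions to be matched against the actual tasks/ code.
from datasets import load_dataset

from tasks.common import Task  # assumed location of the Task base class


class MultilingualTask(Task):
    """Generic wrapper exposing any HF chat-format dataset as a Task.

    Expects each row to carry a "messages" list of {"role", "content"} dicts.
    """

    def __init__(self, dataset_name: str, split: str = "train"):
        self.ds = load_dataset(dataset_name, split=split)
        if "messages" not in self.ds.column_names:
            raise ValueError(f"{dataset_name} has no 'messages' column")

    def num_examples(self) -> int:
        return len(self.ds)

    def get_example(self, idx: int) -> dict:
        return {"messages": self.ds[idx]["messages"]}
```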
**Benefits:**
- Minimal code changes (~50 lines)
- Flexible: works with any HF dataset
- No breaking changes
- Users can add any language/corpus
**Limitations:**
- Doesn't address tokenizer segmentation quality
- No built-in evaluation for multilingual tasks
---
## Pull Request Checklist
### Core Implementation
- [x] Create `tasks/multilingual.py` with `MultilingualTask` class
- Inherit from `Task` base class
- Implement `num_examples()` method
- Implement `get_example()` method
- Handle dataset loading via `load_dataset()`
- [x] Add error handling for dataset format validation (validation sketch after this checklist)
- Check for required "messages" field
- Validate message structure (roles, content)
- [x] Test with sample dataset
- Load a simple HF dataset successfully
- Verify task integration with `TaskMixture`
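One possible shape for the format validation called out above; the accepted role set, function name, and error types are illustrative:
```python
# Illustrative validation helper; the role set and error-handling policy
# would need to match how the other tasks in tasks/ treat conversations.
VALID_ROLES = {"system", "user", "assistant"}


def validate_messages(messages: list) -> None:
    """Raise ValueError when a conversation doesn't match the expected format."""
    if not messages:
        raise ValueError("empty conversation")
    for msg in messages:
        if "role" not in msg or "content" not in msg:
            raise ValueError(f"message missing role/content: {msg!r}")
        if msg["role"] not in VALID_ROLES:
            raise ValueError(f"unexpected role: {msg['role']!r}")
        if not isinstance(msg["content"], str):
            raise ValueError("message content must be a string")
```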
### Documentation
- [x] Add multilingual example to `README.md`
- Show how to add `MultilingualTask` to training pipeline
- Include example HF dataset reference
- Add brief explanation of use case
- [x] Add docstring to `MultilingualTask` class
- Document constructor parameters
- Explain expected dataset format
- Provide usage example
### Testing
- [x] Test with at least one multilingual dataset (smoke-test sketch after this section)
- Suggested: `HuggingFaceTB/smol-talk-lt` (Lithuanian)
- Verify data loads correctly
- Check conversation format compatibility
- [x] Verify backward compatibility
- Existing training scripts still work
- No regressions in existing tasks
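A pytest-style smoke test covering both items, assuming the `MultilingualTask` sketch above and that `TaskMixture` lives in `tasks/common.py` and supports `len()` (both assumptions):
```python
# Smoke-test sketch; the TaskMixture import path and its __len__ support are
# assumptions, and the dataset name is the one suggested in the checklist.
from tasks.common import TaskMixture
from tasks.multilingual import MultilingualTask


def test_multilingual_task_loads():
    task = MultilingualTask("HuggingFaceTB/smol-talk-lt", split="train")
    assert task.num_examples() > 0
    example = task.get_example(0)
    assert "messages" in example
    assert example["messages"][0]["role"] in {"system", "user", "assistant"}


def test_task_mixture_integration():
    mixture = TaskMixture([MultilingualTask("HuggingFaceTB/smol-talk-lt", split="train")])
    assert len(mixture) > 0
```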
### Code Quality
- [x] Follow existing code style
- Match formatting in `tasks/` directory
- Use similar naming conventions
- [x] Add type hints where appropriate
- Use `typing` module for return types
- Document parameter types
- [x] Handle edge cases
- Empty datasets
- Missing fields in data
- Invalid dataset names
---
## Minimal Documentation Changes
**Add to `README.md`:**
````markdown
## Multilingual Support (Experimental)
Add multilingual data to training using any HuggingFace dataset:
```python
from tasks.multilingual import MultilingualTask

train_ds = TaskMixture([
    # ... existing tasks ...
    MultilingualTask("HuggingFaceTB/smol-talk-lt", split="train"),  # Lithuanian
    MultilingualTask("tatsu-lab/alpaca", split="train"),  # Example
])
```
See `docs/multilingual_proposal.md` for full details.
````
---
## Backward Compatibility
✅ All changes are **opt-in**. Default behavior remains English-only.
- Default `vocab_size=65536` unchanged
- New `MultilingualTask` is additive
- The `get_embeddings()` method is new and doesn't affect existing code
- No changes to default training pipeline
---
## Implementation Summary
**Effort**: ~1 day, ~80 lines of code
- **New file**: `tasks/multilingual.py` (~50 lines)
- **Documentation**: ~30 lines in README
- **Testing**: 1 multilingual dataset
**Files Changed:**
1. `tasks/multilingual.py` (new)
2. `README.md` (add section)
---
## Future Enhancements (Out of Scope)
- SentencePiece tokenizer option for better CJK segmentation
- Configurable tokenizer vocab size (128k-200k)
- Sentence embedding API
- Custom per-language regex patterns
- Contrastive training script
- Multilingual eval bundle
- Per-language weighting in TaskMixture
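For the last item, per-language weighting could look something like this purely hypothetical sketch (it assumes the current `TaskMixture` samples without weights):
```python
# Hypothetical only: sample tasks by user-supplied weight instead of uniformly.
import random


class WeightedTaskMixture:
    def __init__(self, tasks: list, weights: list[float]):
        assert len(tasks) == len(weights), "one weight per task"
        self.tasks = tasks
        self.weights = weights

    def sample_example(self, rng: random.Random) -> dict:
        # pick a task proportionally to its weight, then a uniform example from it
        task = rng.choices(self.tasks, weights=self.weights, k=1)[0]
        return task.get_example(rng.randrange(task.num_examples()))
```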