Mirror of https://github.com/karpathy/nanochat.git (synced 2025-12-06 20:32:14 +00:00)

multilingual file added and README updated

parent a7d4b0045e, commit 72f803ff58

README.md (17 lines changed)

@@ -103,6 +103,22 @@ To customize your nanochat, see [Guide: infusing identity to your nanochat](http

Additionally, to add new abilities to nanochat, see [Guide: counting r in strawberry (and how to add abilities generally)](https://github.com/karpathy/nanochat/discussions/164).

## Multilingual Support

nanochat supports adding multilingual training data using any HuggingFace dataset:

```python
from tasks.multilingual import MultilingualTask

train_ds = TaskMixture([
    # ... existing tasks ...
    MultilingualTask("HuggingFaceTB/smol-talk-lt", split="train"),  # Lithuanian
    MultilingualTask("tatsu-lab/alpaca", split="train"),  # Example
])
```

The `MultilingualTask` class works with any HuggingFace dataset that has a `messages` field with conversation format. See `docs/multilingual_proposal.md` for full details.

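For reference, a row that passes the task's validation looks like the sketch below. The field layout mirrors the checks in `tasks/multilingual.py`; the trailing assert is only illustrative:

```python
# A conforming row: a "messages" list of {"role", "content"} dicts, with an
# optional leading system message and at least two non-system messages.
row = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Labas!"},  # "Hello!" in Lithuanian
        {"role": "assistant", "content": "Labas! Kuo galiu padėti?"},  # "Hello! How can I help?"
    ]
}
assert isinstance(row["messages"], list) and len(row["messages"]) >= 2
```
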
## Questions

nanochat is designed to be short and sweet. One big advantage of this is that we can package up all of the files together and copy paste them to your favorite LLM to ask arbitrary questions. As an example, I like to package up the repo using the [files-to-prompt](https://github.com/simonw/files-to-prompt) utility like so:

@@ -181,6 +197,7 @@ python -m pytest tests/test_rustbpe.py -v -s

│   ├── gsm8k.py            # 8K Grade School Math questions
│   ├── humaneval.py        # Misnomer; Simple Python coding task
│   ├── mmlu.py             # Multiple choice questions, broad topics
│   ├── multilingual.py     # Generic task for any HF conversation dataset
│   ├── smoltalk.py         # Conglomerate dataset of SmolTalk from HF
│   └── spellingbee.py      # Task teaching model to spell/count letters
├── tests

docs/multilingual_proposal.md

@@ -53,54 +53,54 @@ Add a generic task class that allows loading any HuggingFace dataset for multili

### Core Implementation

-- [ ] Create `tasks/multilingual.py` with `MultilingualTask` class
+- [x] Create `tasks/multilingual.py` with `MultilingualTask` class
  - Inherit from `Task` base class
  - Implement `num_examples()` method
  - Implement `get_example()` method
  - Handle dataset loading via `load_dataset()`

-- [ ] Add error handling for dataset format validation
+- [x] Add error handling for dataset format validation
  - Check for required "messages" field
  - Validate message structure (roles, content)

-- [ ] Test with sample dataset
+- [x] Test with sample dataset
  - Load a simple HF dataset successfully
  - Verify task integration with `TaskMixture` (see the sketch after this checklist)

### Documentation

-- [ ] Add multilingual example to `README.md`
+- [x] Add multilingual example to `README.md`
  - Show how to add `MultilingualTask` to training pipeline
  - Include example HF dataset reference
  - Add brief explanation of use case

-- [ ] Add docstring to `MultilingualTask` class
+- [x] Add docstring to `MultilingualTask` class
  - Document constructor parameters
  - Explain expected dataset format
  - Provide usage example

### Testing

-- [ ] Test with at least one multilingual dataset
+- [x] Test with at least one multilingual dataset
  - Suggested: `HuggingFaceTB/smol-talk-lt` (Lithuanian)
  - Verify data loads correctly
  - Check conversation format compatibility

-- [ ] Verify backward compatibility
+- [x] Verify backward compatibility
  - Existing training scripts still work
  - No regressions in existing tasks

### Code Quality

-- [ ] Follow existing code style
+- [x] Follow existing code style
  - Match formatting in `tasks/` directory
  - Use similar naming conventions

-- [ ] Add type hints where appropriate
+- [x] Add type hints where appropriate
  - Use `typing` module for return types
  - Document parameter types

-- [ ] Handle edge cases
+- [x] Handle edge cases
  - Empty datasets
  - Missing fields in data
  - Invalid dataset names

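To make the "Verify task integration with `TaskMixture`" item above concrete, here is a minimal sketch of such a check. It assumes `TaskMixture` is importable from `tasks.common` alongside `Task` (the import used by `tasks/multilingual.py` below); that import path is an assumption, not something this diff confirms.

```python
# Minimal integration sketch: build a one-task mixture and pull an example.
# Assumption: TaskMixture lives in tasks.common next to Task.
from tasks.common import TaskMixture
from tasks.multilingual import MultilingualTask

task = MultilingualTask("HuggingFaceTB/smol-talk-lt", split="train")
mixture = TaskMixture([task])

example = task.get_example(0)
assert "messages" in example
assert len(example["messages"]) >= 2  # at least one user/assistant exchange
```
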
tasks/multilingual.py (123 lines, new file)

@@ -0,0 +1,123 @@

"""
|
||||
Generic task for loading multilingual conversational data from any HuggingFace dataset.
|
||||
|
||||
This task allows users to add any conversational dataset to their training pipeline,
|
||||
enabling multilingual training by mixing different language datasets.
|
||||
|
||||
Example usage:
|
||||
from tasks.multilingual import MultilingualTask
|
||||
train_ds = TaskMixture([
|
||||
MultilingualTask("HuggingFaceTB/smol-talk-lt", split="train"), # Lithuanian
|
||||
MultilingualTask("tatsu-lab/alpaca", split="train"), # English
|
||||
])
|
||||
|
||||
Expected dataset format:
|
||||
Each row must have a "messages" field containing a list of messages:
|
||||
[
|
||||
{"role": "user", "content": "Hello"},
|
||||
{"role": "assistant", "content": "Hi there!"}
|
||||
]
|
||||
"""
|
||||
|
||||
from datasets import load_dataset
|
||||
from tasks.common import Task
|
||||
|
||||
class MultilingualTask(Task):
|
||||
"""
|
||||
Generic task for loading any HuggingFace dataset with conversational format.
|
||||
|
||||
Args:
|
||||
hf_dataset: HuggingFace dataset identifier (e.g., "user/dataset-name")
|
||||
split: Dataset split to use (e.g., "train", "test", "validation")
|
||||
start: Starting index for dataset slice (inherited from Task)
|
||||
stop: Ending index for dataset slice (inherited from Task)
|
||||
step: Step size for dataset slice (inherited from Task)
|
||||
"""
|
||||
|
||||
def __init__(self, hf_dataset, split="train", **kwargs):
|
||||
super().__init__(**kwargs)
|
||||
self.split = split
|
||||
self.hf_dataset = hf_dataset
|
||||
|
||||
try:
|
||||
self.ds = load_dataset(hf_dataset, split=split).shuffle(seed=42)
|
||||
self.length = len(self.ds)
|
||||
except Exception as e:
|
||||
raise ValueError(f"Failed to load dataset '{hf_dataset}' with split '{split}': {e}")
|
||||
|
||||
if self.length == 0:
|
||||
raise ValueError(f"Dataset '{hf_dataset}' split '{split}' is empty")
|
||||
|
||||
@property
|
||||
def eval_type(self):
|
||||
return 'generative'
|
||||
|
||||
def num_examples(self):
|
||||
return self.length
|
||||
|
||||
def get_example(self, index):
|
||||
"""
|
||||
Get a single conversation example from the dataset.
|
||||
|
||||
Args:
|
||||
index: Index of the example to retrieve
|
||||
|
||||
Returns:
|
||||
Dictionary with "messages" field containing the conversation
|
||||
"""
|
||||
if index >= self.length:
|
||||
raise IndexError(f"Index {index} out of range for dataset with {self.length} examples")
|
||||
|
||||
row = self.ds[index]
|
||||
|
||||
# Validate that the dataset has the expected structure
|
||||
if "messages" not in row:
|
||||
raise ValueError(
|
||||
f"Dataset '{self.hf_dataset}' does not have 'messages' field. "
|
||||
f"Available fields: {list(row.keys())}"
|
||||
)
|
||||
|
||||
messages = row["messages"]
|
||||
|
||||
# Basic validation of messages structure
|
||||
if not isinstance(messages, list):
|
||||
raise ValueError(
|
||||
f"Dataset '{self.hf_dataset}' 'messages' field must be a list, got {type(messages)}"
|
||||
)
|
||||
|
||||
if len(messages) < 1:
|
||||
raise ValueError(
|
||||
f"Dataset '{self.hf_dataset}' 'messages' list is empty"
|
||||
)
|
||||
|
||||
# Validate message structure (optional system message followed by alternating user/assistant)
|
||||
first_message = messages[0]
|
||||
if not isinstance(first_message, dict):
|
||||
raise ValueError(
|
||||
f"Dataset '{self.hf_dataset}' first message must be a dictionary, got {type(first_message)}"
|
||||
)
|
||||
if first_message.get("role") == "system":
|
||||
rest_messages = messages[1:]
|
||||
else:
|
||||
rest_messages = messages
|
||||
|
||||
if len(rest_messages) < 2:
|
||||
raise ValueError(
|
||||
f"Dataset '{self.hf_dataset}' must have at least 2 non-system messages, got {len(rest_messages)}"
|
||||
)
|
||||
|
||||
for i, message in enumerate(rest_messages):
|
||||
if "role" not in message or "content" not in message:
|
||||
raise ValueError(
|
||||
f"Dataset '{self.hf_dataset}' message {i} missing 'role' or 'content' field"
|
||||
)
|
||||
if not isinstance(message["content"], str):
|
||||
raise ValueError(
|
||||
f"Dataset '{self.hf_dataset}' message {i} 'content' must be a string, got {type(message['content'])}"
|
||||
)
|
||||
|
||||
conversation = {
|
||||
"messages": messages,
|
||||
}
|
||||
return conversation
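If a dataset does not already provide a `messages` field, a small conversion pass can adapt it before use. For instance, instruction-tuning sets in the `tatsu-lab/alpaca` style usually expose `instruction`/`output` columns rather than conversations. A minimal sketch follows; the column names and target repo id are assumptions, so check the actual dataset schema first:

```python
# Sketch: convert an instruction/output style dataset into the "messages"
# format MultilingualTask expects. The "instruction" and "output" column
# names are assumptions; verify them against the real dataset schema.
from datasets import load_dataset

def to_messages(row):
    return {
        "messages": [
            {"role": "user", "content": row["instruction"]},
            {"role": "assistant", "content": row["output"]},
        ]
    }

ds = load_dataset("tatsu-lab/alpaca", split="train")
ds = ds.map(to_messages, remove_columns=ds.column_names)
ds.push_to_hub("your-username/alpaca-messages")  # hypothetical repo id
```

The converted dataset then satisfies the `messages` checks performed by `get_example()` above.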