Running nanochat on the Cloud with SkyPilot
This directory contains SkyPilot configurations for easily launching nanochat on major cloud providers (AWS, GCP, Azure), GPU clouds (Lambda, Nebius, RunPod, etc.), and Kubernetes clusters.
Prerequisites
- Install SkyPilot and configure it with your cloud provider(s) or Kubernetes cluster:
  - Follow the SkyPilot installation guide
  - Configure your cloud credentials (AWS, GCP, Azure, Lambda, Nebius, etc.), OR
  - Configure Kubernetes access via SkyPilot's Kubernetes support
Training: Running the Speedrun Pipeline
Launch the speedrun training pipeline on any cloud provider with a single command:
sky launch -c nanochat-speedrun cloud/speedrun.sky.yaml --infra <k8s|aws|gcp|nebius|lambda|etc>
This will:
- Provision an 8xH100 GPU node
- Set up the environment
- Run the complete training pipeline via speedrun.sh
- Save trained model checkpoints to s3://nanochat-data (change this to your own bucket)
- Complete in approximately 4 hours (~$100 on most providers)
Monitoring Training Progress
After launching, you can SSH into the cluster and monitor progress:
# SSH into the cluster
ssh nanochat-speedrun
# View the speedrun logs
sky logs nanochat-speedrun
Serving: Deploy Your Trained Model
Once training is complete, serve your trained model with the web UI:
sky launch -c nanochat-serve cloud/serve.sky.yaml --infra <k8s|aws|gcp|nebius|lambda|etc>
This will:
- Provision a 1xH100 GPU node (much cheaper than the 8xH100 node used for training)
- Load model weights from the same s3://nanochat-data bucket used during training
- Serve the web chat interface on port 8000
- Cost ~$2-3/hour on most providers
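A serving task like this can be sketched as the YAML below. This is illustrative rather than the repo's actual cloud/serve.sky.yaml; the mount path, bucket name, and serving entry point are assumptions.

```yaml
# Sketch of a serving-style SkyPilot task (values are assumptions).
resources:
  accelerators: H100:1        # a single H100 suffices for inference
  ports: 8000                 # expose the web UI port

file_mounts:
  # Same bucket the training run wrote its checkpoints to.
  /checkpoints:
    source: s3://nanochat-data
    mode: MOUNT

run: |
  # Launch the chat web server on port 8000 (entry point is an assumption)
  python -m scripts.chat_web
```

Declaring the port under resources is what lets sky status --endpoint return a reachable URL later.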
Accessing the Web UI
Get the endpoint URL to access the chat interface:
sky status --endpoint 8000 nanochat-serve
Open the displayed URL in your browser to chat with your trained model!
Shared Storage
Both training and serving tasks use SkyPilot's bucket mounting functionality to preserve and share model weights. This allows you to:
- Train once, serve multiple times without re-downloading weights
- Share trained models across different serving instances
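The sharing described above comes from both tasks declaring the same bucket mount; a minimal sketch (mount path and bucket name assumed, as before):

```yaml
# Shared storage sketch: training and serving mount the same bucket
# at a common path, so weights written by one are visible to the other.
file_mounts:
  /checkpoints:
    source: s3://nanochat-data  # replace with your own bucket
    mode: MOUNT                 # mounted as a filesystem backed by the bucket
```

Because the mount is backed by object storage rather than node-local disk, the weights survive cluster teardown and any later serving cluster can reuse them.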