javasoup
2adcc95c4e
Merge branch 'master' into refactor-vertex-ai-pipelines
2025-12-01 20:07:43 -05:00
Nuno Pereira
13001597c2
Success on Vertex Pipelines
2025-12-01 19:59:58 -05:00
Andrej
4a87a0d19f
Merge pull request #299 from samjabrahams/rotary_embedding_head_dim_comment_cleanup
...
Fix comment: rotary embeddings final dimension size
2025-11-17 13:29:21 -08:00
Sam Abrahams
11e68bf442
Fix comment: rotary embeddings final dimension size
2025-11-17 11:32:56 -05:00
Andrej Karpathy
bc1fca39f3
mqa -> gqa to reduce confusion
2025-11-15 15:43:37 +00:00
Andrej
f66a780f68
Fix torch.dtype mismatching when running engine inline test.
2025-11-14 07:28:29 -08:00
Andrej
4763ce612a
Small fixes to typos
2025-11-14 07:25:59 -08:00
Sofie Van Landeghem
c6f5bd67db
revert change of base to sft for quick inline test
2025-11-14 12:20:03 +01:00
svlandeg
a2fb3c83a6
fix typos
2025-11-14 11:20:25 +01:00
svlandeg
e5efb4b471
add test_engine.py to file structure
2025-11-14 11:13:42 +01:00
Andrej Karpathy
9a71d13688
typo oops
2025-11-13 16:08:30 +00:00
Andrej Karpathy
7b7fd0fe71
thank you Sofie for your help with nanochat
2025-11-13 16:07:54 +00:00
Andrej Karpathy
c6abcdfe3a
big change: add pretraining resumption logic so that checkpoints can now be approximately resumed and training can continue. this is useful for very long runs, when you don't want the anxiety of your run crashing for some reason; alternatively, it's a way to recover training in the event of loss spikes. i mean, this should have been there in v0, but it's ok. the resumption is approximate to control complexity and bloat, but it's possible we'll want to change that in the future. to use it, set --save_every to a step interval at which to write checkpoints, then use --resume_from_step to resume optimization from a given step. only base model training (pretraining) supports this atm, but that's ok because midtraining is comparably quite a bit faster.
2025-11-13 15:34:40 +00:00
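A minimal sketch of how such approximate resumption can be wired up. The flag names --save_every and --resume_from_step come from the commit message above; the helper names and the checkpoint-file naming scheme here are hypothetical, not taken from the repo.

```python
import os

def should_save(step: int, save_every: int) -> bool:
    # Write a checkpoint every save_every steps (disabled when save_every <= 0).
    return save_every > 0 and step > 0 and step % save_every == 0

def checkpoint_path(out_dir: str, step: int) -> str:
    # Hypothetical naming scheme: one checkpoint file per saved step.
    return os.path.join(out_dir, f"ckpt_{step:06d}.pt")

def resume_step(resume_from_step: int) -> int:
    # Resume optimization from the requested step, or start fresh at step 0.
    return resume_from_step if resume_from_step > 0 else 0
```

Keeping the resumption "approximate" (e.g. not restoring every last piece of dataloader state) is what keeps this logic small.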
Andrej Karpathy
91f09ccd0d
minor fix comment in engine
2025-11-13 15:28:18 +00:00
Andrej Karpathy
adb5d4a16c
uv lock has to change since we removed numpy in the other commit
2025-11-13 15:16:27 +00:00
howardgao@outlook.com
b399e43168
fix engine test bug
2025-11-06 08:56:45 +08:00
Andrej Karpathy
c6b7ab7440
grad clip logging and printing and cosmetics
2025-11-05 21:08:30 +00:00
Andrej
885a4f25e7
Replace fcntl with filelock for Windows compatibility
2025-11-04 16:35:39 -08:00
Andrej
3a2ae631c4
Merge branch 'master' into master
2025-11-04 16:35:02 -08:00
Andrej
12d995f58c
Add NPROC_PER_NODE var to speedrun.sh and run1000.sh
2025-11-04 16:26:33 -08:00
svlandeg
f1683c5b16
set nproc_per_node as var in speedrun and run1000 scripts
2025-11-04 21:36:10 +01:00
Andrej
d1558c7873
handle bf16 on MPS by casting to fp32 during load checkpoint
2025-11-04 09:42:50 -08:00
Andrej
df25293087
Add explicit UTF-8 encoding on open
2025-11-04 09:38:18 -08:00
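The UTF-8 commits above all amount to the same fix: passing an explicit encoding to open() so reads and writes don't depend on the platform's locale default, which is not UTF-8 on many Windows setups. A minimal illustration (file names here are made up):

```python
import os
import tempfile

def write_text(path: str, text: str) -> None:
    # Without encoding=, open() falls back to locale.getpreferredencoding(),
    # which can silently mangle non-ASCII text on Windows.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

def read_text(path: str) -> str:
    with open(path, "r", encoding="utf-8") as f:
        return f.read()

path = os.path.join(tempfile.mkdtemp(), "sample.txt")
write_text(path, "café ✓")
print(read_text(path))  # -> café ✓
```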
Yasser Makram
1e89af9862
Replace fcntl with filelock for Windows compatibility
2025-11-04 07:22:34 +00:00
google-labs-jules[bot]
a88e7ec21f
fix: Correct Docker build for rustbpe tokenizer
...
This commit fixes a build failure in the Docker image by implementing a more robust build process for the `rustbpe` tokenizer.
The `Dockerfile` now explicitly creates a `uv` virtual environment, adds its `bin` directory to the `PATH`, installs `maturin` into the environment, and then runs the `maturin develop` command. This ensures that the build command executes within a fully configured environment with all necessary tools available on the `PATH`, resolving the "No such file or directory" error.
2025-11-04 02:24:08 +00:00
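The build steps described above might look roughly like the following Dockerfile fragment; the venv location, base image context, and manifest path are assumptions for illustration, not the repository's actual Dockerfile.

```dockerfile
# Create a uv virtual environment and put its bin/ on the PATH, so that
# maturin (and the tools it shells out to) are always found.
RUN uv venv /opt/venv
ENV VIRTUAL_ENV="/opt/venv"
ENV PATH="/opt/venv/bin:$PATH"

# Install maturin into that environment, then build the rustbpe tokenizer
# in develop mode inside the fully configured environment.
RUN uv pip install maturin
RUN maturin develop --manifest-path rustbpe/Cargo.toml
```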
google-labs-jules[bot]
fa04262889
fix: Correct Docker build for rustbpe tokenizer
...
This commit fixes a build failure in the Docker image by adding the `--uv` flag to the `maturin develop` command.
The `maturin` build process was failing because it could not find `pip` within the `uv` environment. The `--uv` flag ensures that `maturin` correctly uses the `uv` environment to build the `rustbpe` tokenizer.
2025-11-04 02:05:34 +00:00
google-labs-jules[bot]
a2189d20d0
feat: Use Cloud Build for Vertex AI pipeline image creation
...
This commit streamlines the process of running the nanochat pipeline on Vertex AI by using Cloud Build to automate the Docker image creation process.
A `cloudbuild.yaml` file has been added to define the build steps, and a `run_pipeline.sh` script has been created to orchestrate the build and pipeline submission.
The `README.md` has been updated to reflect the new, simplified workflow.
2025-11-04 01:47:20 +00:00
google-labs-jules[bot]
2781d216c6
feat: Refactor nanochat to run on Vertex AI Pipelines
...
This refactoring enables the nanochat project to be executed as a scalable and robust pipeline on Vertex AI.
The monolithic `speedrun.sh` script has been decomposed into a series of containerized components orchestrated by a Kubeflow pipeline.
The codebase has been updated to use Google Cloud Storage for artifact management, allowing for seamless data sharing between pipeline steps.
A `Dockerfile` and Python wrappers for each pipeline step have been added to the `vertex_pipelines` directory.
The `README.md` has been updated with instructions on how to build the Docker image and run the Vertex AI pipeline.
2025-11-04 01:26:51 +00:00
Dipesh Babu
7a40ee77b4
fix: cast bf16 to fp32 on MPS (like CPU) to avoid dtype issues
2025-11-03 16:00:56 -05:00
svlandeg
2ce62ec076
ensure consistency of quotes within each statement
2025-11-03 21:52:02 +01:00
svlandeg
e22fc6f2fa
few more explicit UTF-8 encodings
2025-11-03 21:46:39 +01:00
svlandeg
c72b8b2309
add explicit UTF-8 encoding
2025-11-03 21:27:12 +01:00
Andrej
a83646e098
fix(eval): use UTF-8 when reading CORE JSONL and writing CSV
2025-11-03 06:38:33 -08:00
Andrej
8681922328
fix lstrip bug, make it removeprefix, TIL.
2025-11-03 06:37:48 -08:00
Dipesh Babu
226953b841
fix: open JSONL and results CSV with UTF-8 encoding for portability
2025-11-03 01:20:56 -05:00
Josh Odom
f1e15f5f4d
Fixing subtle bug: lstrip removes all matching characters, including potentially required ones. Use removeprefix instead.
2025-11-02 23:40:37 -06:00
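The subtlety being fixed: str.lstrip treats its argument as a *set of characters* to strip, not a prefix string, so it can keep eating characters past the intended prefix; str.removeprefix (Python 3.9+) removes exactly one literal prefix. A contrived example string (not from the repo) makes the difference visible:

```python
s = "Assistant: Ass kicked"

# lstrip strips ALL leading characters found in the set
# {'A', 's', 'i', 't', 'a', 'n', ':', ' '} -- so "Ass " is eaten too.
print(s.lstrip("Assistant: "))      # -> kicked

# removeprefix removes exactly the literal prefix, once.
print(s.removeprefix("Assistant: "))  # -> Ass kicked
```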
Andrej
b6da6982f6
fix nanochat logo: the t was placed too far to the right
2025-11-02 08:17:00 -08:00
Andrej
c2c4f77e22
oops small bugfix to run1000.sh missing kwarg
2025-11-02 08:14:41 -08:00
Andrej
d1ac0b2d07
when loading models on CPU, convert tensors from bfloat16 to float
2025-11-02 07:58:56 -08:00
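The idea behind this and the related MPS fix, sketched framework-free with dtype names as plain strings (in the real code this would upcast torch tensors while loading the checkpoint):

```python
def load_dtype(saved_dtype: str, device: str) -> str:
    # bfloat16 kernels are missing or poorly supported on CPU and MPS,
    # so checkpoints saved in bf16 are upcast to float32 on those devices;
    # on CUDA the saved dtype is kept as-is.
    if saved_dtype == "bfloat16" and device in ("cpu", "mps"):
        return "float32"
    return saved_dtype
```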
svlandeg
5bfcd31b73
revert more formatting changes
2025-11-02 14:17:10 +01:00
svlandeg
036a3c5881
revert formatting changes to facilitate review
2025-11-02 14:16:43 +01:00
svlandeg
52e85aaf80
Merge branch 'master' into fix/typo
2025-11-02 13:41:13 +01:00
Jing Zhang
ba4f40bf58
Update run1000.sh to add missing --run=$WANDB_RUN
2025-11-01 21:27:00 -07:00
Manuel Saelices
d54c9cbf8c
CPU support, as bfloat16 params break inference
2025-11-01 23:38:50 +01:00
Andrej Karpathy
cf587acb1a
move eval bundle download to be lazy and inside the python code so that we can substantially simplify the run bash scripts
2025-11-01 16:04:38 +00:00
Andrej Karpathy
7d2c4a3d95
delete pandas dep in base_eval use csv instead
2025-11-01 15:28:30 +00:00
Andrej
ad39db5a23
tiny fix to comment
...
Update engine.py with correct error message on assert
2025-11-01 07:43:57 -07:00
Andrej
630f54ae5a
use empty locals and globals in call to eval() in engine tool use
...
harden eval: prevent the calc tool from accessing globals and locals
2025-11-01 07:22:59 -07:00
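The hardening pattern referenced here: give eval() an explicit globals dict with __builtins__ emptied out (CPython otherwise injects the real builtins into a bare {} globals) plus an empty locals dict, so the evaluated expression cannot reach the caller's namespace or builtin functions. A sketch, not a full sandbox; the function name is hypothetical:

```python
def safe_calc(expr: str):
    # Empty locals, and globals with builtins explicitly stripped out.
    # This is hardening against accidental access, not a security boundary.
    return eval(expr, {"__builtins__": {}}, {})

print(safe_calc("2 * (3 + 4)"))  # -> 14

try:
    safe_calc("open('/etc/passwd')")  # builtins like open() are unreachable
except NameError:
    print("blocked")
```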
Andrej Karpathy
f15732524a
make deepwiki link better
2025-11-01 14:13:29 +00:00
Andrej
dfc88334b6
fix tok/sec calculation bug when grad accum steps > 1
...
Fix tok/sec metrics for base_train and mid_train when gradient accumulation is not 1
2025-10-30 08:36:32 -07:00
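The bug class being fixed: with gradient accumulation, each optimizer step processes grad_accum_steps micro-batches, so throughput must count the tokens from all of them. A sketch of the corrected arithmetic (variable names assumed, not taken from the repo):

```python
def tokens_per_second(device_batch_size: int, seq_len: int,
                      grad_accum_steps: int, step_time_s: float,
                      world_size: int = 1) -> float:
    # One optimizer step consumes grad_accum_steps micro-batches on each of
    # world_size ranks; forgetting the grad_accum_steps factor undercounts
    # tokens whenever gradient accumulation is > 1.
    tokens_per_step = device_batch_size * seq_len * grad_accum_steps * world_size
    return tokens_per_step / step_time_s
```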