google-labs-jules[bot]
bbc816dc77
Reduce base_train batch size and set PYTORCH_HIP_ALLOC_CONF
To address "HIP out of memory" errors on some AMD ROCm configurations (potentially due to memory fragmentation or limited per-device VRAM), this change:
1. Reduces the default `device_batch_size` from 32 to 16.
2. Explicitly sets `PYTORCH_HIP_ALLOC_CONF=expandable_segments:True` when ROCm is detected, which helps the allocator manage fragmented memory better than the default behavior.
2025-11-23 16:03:02 +00:00
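A minimal sketch of the described change, assuming a PyTorch ROCm build (where torch.version.hip is non-None); nanochat's actual wiring may differ. PyTorch parses the allocator config lazily, so setting the env var after import but before the first GPU allocation should still take effect:

```python
import os
import torch

# Sketch only: ROCm builds expose torch.version.hip (None on CUDA builds).
if torch.cuda.is_available() and torch.version.hip is not None:
    # Opt into expandable segments to reduce fragmentation-driven
    # "HIP out of memory" failures; must run before the first allocation.
    os.environ.setdefault("PYTORCH_HIP_ALLOC_CONF", "expandable_segments:True")

device_batch_size = 16  # halved from 32, per item 1 above
```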
google-labs-jules[bot]
48e632245e
Fix ROCm/APU detection and CPU DDP OOM crash
2025-11-22 09:18:40 +00:00
google-labs-jules[bot]
a35621e726
Fix CPU DDP crashes: Init Gloo backend, prevent OOM by reducing NPROC, add script safety
2025-11-22 05:31:47 +00:00
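A sketch of the backend selection this fix implies, assuming a torchrun launch (which sets the rendezvous env vars RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT); the repo's actual logic and the NPROC knob may differ:

```python
import torch
import torch.distributed as dist

# NCCL requires CUDA devices; on CPU-only hosts it crashes at init,
# so fall back to the CPU-capable Gloo backend instead.
backend = "nccl" if torch.cuda.is_available() else "gloo"
dist.init_process_group(backend=backend)

# On CPU, every rank holds a full model replica, so capping the number
# of launched processes (the NPROC the commit mentions) bounds peak memory.
```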
Andrej Karpathy
c6abcdfe3a
big change: add pretraining resumption logic so that checkpoints can now be approximately resumed and training can continue. This is useful for very long runs where you don't want the anxiety of your run crashing for some reason; alternatively, it's a way to recover training in the event of loss spikes. This should have been there in v0, but it's ok. The resumption is approximate to control complexity and bloat, though it's possible we'll want to change that in the future. To use, set --save_every to a step interval at which to write checkpoints, then use --resume_from_step to resume optimization from a given step. Only base model training (pretraining) supports this atm, but that's ok because midtraining is comparatively quite a bit faster.
2025-11-13 15:34:40 +00:00
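A minimal sketch of the save/resume flow described above, with a toy model standing in for nanochat's; the two flag names match the commit message, everything else is illustrative:

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for nanochat's real model/optimizer/step count.
model = nn.Linear(8, 8)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
num_steps = 5000
save_every = 1000        # --save_every: checkpoint interval in steps
resume_from_step = 0     # --resume_from_step: 0 means start fresh

start_step = 0
if resume_from_step > 0:
    # "Approximate" resume: restores model/optimizer state but not, e.g.,
    # the exact dataloader position, trading fidelity for simplicity.
    ckpt = torch.load(f"ckpt_{resume_from_step:06d}.pt", map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    start_step = resume_from_step

for step in range(start_step, num_steps):
    loss = model(torch.randn(4, 8)).square().mean()  # dummy training step
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    if save_every > 0 and step > 0 and step % save_every == 0:
        torch.save({"model": model.state_dict(),
                    "optimizer": optimizer.state_dict()},
                   f"ckpt_{step:06d}.pt")
```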
Andrej Karpathy
c6b7ab7440
grad clip logging and printing and cosmetics
2025-11-05 21:08:30 +00:00
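For reference, torch.nn.utils.clip_grad_norm_ returns the total gradient norm before clipping, which is the natural quantity to log and print; a minimal sketch:

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 8)
model(torch.randn(4, 8)).sum().backward()

# Returns the pre-clipping total norm as a tensor; clips in place.
grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
print(f"grad norm: {grad_norm.item():.4f}")
```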
Andrej
dfc88334b6
fix tok/sec calculation bug when grad accum steps > 1
Fix tok/sec metrics for base_train and mid_train when gradient accumulation is not 1
2025-10-30 08:36:32 -07:00
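A sketch of the corrected accounting: with gradient accumulation, one optimizer step consumes grad_accum_steps micro-batches, so the token count must include that factor. All numbers and names below are illustrative:

```python
# Tokens processed per optimizer step across all ranks.
device_batch_size = 16   # sequences per micro-batch per device
seq_len = 2048
grad_accum_steps = 4     # the factor the buggy version dropped
world_size = 8
step_time_s = 1.25       # measured wall-clock per optimizer step

tokens_per_step = device_batch_size * seq_len * grad_accum_steps * world_size
tok_per_sec = tokens_per_step / step_time_s
print(f"{tok_per_sec:,.0f} tok/sec")
```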
svlandeg
8c9b004c99
typo fixes in scripts
2025-10-28 20:17:31 +01:00
water-vapor
a9de4b1038
Fix tok/sec metrics for base_train and mid_train when gradient accumulation is not 1
2025-10-26 01:43:49 -05:00
Andrej Karpathy
81597cd616
move the lr schedule args up in base_train so they are tunable in configurator
2025-10-24 13:27:31 +00:00
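A hedged sketch of the pattern implied here, assuming an exec-style configurator that overrides module-level globals defined before it runs (as in Karpathy's earlier repos); names and values are illustrative, not nanochat's exact ones:

```python
import math

# LR-schedule knobs hoisted to module level so a configurator/CLI can
# override them; a global defined *after* the configurator runs could not be.
warmup_steps = 700
num_steps = 10000
max_lr = 6e-4
final_lr_frac = 0.1

def get_lr(step: int) -> float:
    # Linear warmup, then cosine decay to final_lr_frac * max_lr.
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    t = (step - warmup_steps) / max(1, num_steps - warmup_steps)
    coeff = 0.5 * (1 + math.cos(math.pi * t))
    return max_lr * (final_lr_frac + (1 - final_lr_frac) * coeff)
```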
Andrej Karpathy
a088b7a6ec
use enable_gqa of PyTorch SDPA, which lets us delete some code; didn't realize it was available
2025-10-21 18:07:33 +00:00
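enable_gqa is a flag on torch.nn.functional.scaled_dot_product_attention (PyTorch 2.5+): K/V may carry fewer heads than Q, so the manual repeat_interleave of K/V heads can be deleted. A minimal sketch with assumed head counts:

```python
import torch
import torch.nn.functional as F

B, T, hd = 2, 128, 64
n_head, n_kv_head = 8, 2  # illustrative GQA config; n_head % n_kv_head == 0

q = torch.randn(B, n_head, T, hd)
k = torch.randn(B, n_kv_head, T, hd)
v = torch.randn(B, n_kv_head, T, hd)

# SDPA broadcasts the KV heads across query-head groups internally.
y = F.scaled_dot_product_attention(q, k, v, is_causal=True, enable_gqa=True)
print(y.shape)  # (2, 8, 128, 64)
```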
Andrej Karpathy
5bdc99abfb
merge and resolve conflict
2025-10-21 17:19:10 +00:00
Andrej Karpathy
dfcb1c16f1
Merge branch 'master' into cpu-mps-dev
2025-10-21 17:15:53 +00:00
Andrej Karpathy
c1d2ed1c13
use orig_model in sampling, silly of me to miss this
2025-10-20 00:05:09 +00:00
Andrej Karpathy
2bc521a6de
use orig_model in sampling, silly of me to miss this
2025-10-20 00:04:15 +00:00
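A sketch of the orig_model pattern these two commits refer to, under the assumption that the model is wrapped by torch.compile and sampling should go through the raw module; names are illustrative:

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 8)
orig_model = model              # keep a handle to the raw nn.Module
model = torch.compile(model)    # compiled wrapper used for training

# Sampling through the uncompiled module avoids recompiles on the
# varying shapes that autoregressive generation produces.
with torch.no_grad():
    y = orig_model(torch.randn(1, 8))
```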
karpathy
df600b6ed5
many small tweaks; base, eval, core work now, I think
2025-10-16 15:46:18 -07:00
karpathy
786119d593
add autodetect of device and related stuff. still getting weird warnings/errors, so WIP
2025-10-16 10:26:19 -07:00
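A minimal sketch of the device autodetection described, using standard PyTorch availability checks (prefer CUDA, then Apple MPS, then CPU):

```python
import torch

if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"
print(f"using device: {device}")
```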
karpathy
279b74312c
adjust comment/guidance on device type
2025-10-16 10:06:39 -07:00
karpathy
306bc380ab
add support for CPU and for MPS. I had to change a few cosmetic things. I also discovered what I think is a bit of a bug: I was casting wte to bfloat16 in the wrong place (the model init) instead of in init_weights
2025-10-16 10:04:43 -07:00
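A hedged guess at the shape of the fix, with an illustrative toy module (not nanochat's actual code): cast wte inside init_weights, after the weights are initialized, rather than in the constructor where a later init pass can interact badly with the dtype change:

```python
import torch
import torch.nn as nn

class Toy(nn.Module):
    def __init__(self, vocab_size=256, n_embd=32):
        super().__init__()
        self.wte = nn.Embedding(vocab_size, n_embd)
        self.apply(self._init_weights)  # no dtype cast here

    def _init_weights(self, module):
        if isinstance(module, nn.Embedding):
            nn.init.normal_(module.weight, mean=0.0, std=0.02)
            module.to(dtype=torch.bfloat16)  # cast after init, per the fix

m = Toy()
print(m.wte.weight.dtype)  # torch.bfloat16
```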
Andrej Karpathy
722da4f543
trying to add basic CPU support, will try MPS too
2025-10-16 16:14:38 +00:00
karpathy
3a5e0bc50b
initial commit
2025-10-13 06:49:24 -07:00