mirror of https://github.com/karpathy/nanochat.git
synced 2026-02-04 09:39:50 +00:00
fix typo in scripts/chat_rl.py
typo in comments: change "GAPO" to "DAPO"
parent 962b6bfba3
commit db5e62fc2a
@@ -6,7 +6,7 @@ simpler and more similar to just REINFORCE:
 
 1) Delete trust region, so there is no KL regularization to a reference model
 2) We are on policy, so there's no need for PPO ratio+clip.
-3) We use GAPO style normalization that is token-level, not sequence-level.
+3) We use DAPO style normalization that is token-level, not sequence-level.
 4) Instead of z-score normalization (r - mu)/sigma, only use (r - mu) as the advantage.
 
 1 GPU:
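The four simplifications listed in the comment above can be illustrated with a small sketch. This is not the code in scripts/chat_rl.py; it is a minimal pure-Python illustration under stated assumptions: the function names `mean_centered_advantages` and `token_level_pg_loss` are hypothetical, log-probabilities are given as plain lists (one list per sampled completion), and "token-level" is taken to mean dividing the summed objective by the total number of tokens across all samples rather than averaging per sequence first.

```python
def mean_centered_advantages(rewards):
    """Point 4: use (r - mu) as the advantage, with no division by sigma."""
    mu = sum(rewards) / len(rewards)
    return [r - mu for r in rewards]


def token_level_pg_loss(token_logprobs, rewards):
    """REINFORCE-style loss with token-level normalization (point 3).

    token_logprobs: list of lists; token_logprobs[i][t] is the log-probability
        the current policy assigns to token t of sampled completion i.
    rewards: list of floats, one scalar reward per completion.

    There is no KL term to a reference model (point 1) and no PPO
    ratio or clipping (point 2), since samples are assumed on policy.
    """
    advantages = mean_centered_advantages(rewards)
    weighted_sum = 0.0
    num_tokens = 0
    for logps, adv in zip(token_logprobs, advantages):
        # Every token in a completion shares that completion's advantage.
        weighted_sum += adv * sum(logps)
        num_tokens += len(logps)
    # Normalize by the total token count over all samples, not per sequence.
    objective = weighted_sum / max(num_tokens, 1)
    # Minimizing the negative objective maximizes expected reward.
    return -objective


if __name__ == "__main__":
    logps = [[-1.0, -2.0], [-0.5, -0.5, -1.0]]
    rewards = [1.0, 0.0]
    print(mean_centered_advantages(rewards))
    print(token_level_pg_loss(logps, rewards))
```

With token-level normalization, longer completions contribute more terms to the sum, whereas a sequence-level scheme would first average within each completion and then average across completions.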