# nanochat-omni TODO

Positioning: **texture-aware speech input** (audio-first, text-only output, vision deferred to a later phase).
Reference: the W1–W8 timeline in [research_feasibility.md](research_feasibility.md) (mochi, 2026-05-05).

---

## Near term: repo structure / engineering infrastructure

- [ ] **Flatten the submodule into a monorepo fork**
  - `git remote add upstream https://github.com/karpathy/nanochat.git`
  - Rewrite `main`: `git reset --hard upstream/main`, then cherry-pick our 7 commits (CI / smoke / wandb / gitignore / README)
  - Force push (nobody has forked the repo, so this is safe), then delete `.gitmodules` and `upstream/nanochat/`
  - To pull upstream updates: `git fetch upstream && git merge upstream/main`
- [ ] Apply the CN mirror patch directly to `pyproject.toml` / `nanochat/dataset.py` (no more sed after the fork)
- [ ] Re-path the CI smoke job to follow the fork (`upstream/nanochat/` → repo root)

## W1: Whisper encoder + Projector forward smoke

See the module diagram in research §1.2.

- [x] `nanochat/audio.py`: WhisperEncoder wrapper (frozen; ModelScope preferred via `WHISPER_MS_ID`, HF fallback defaulting to `openai/whisper-base`) + Projector (MLP whose output dimension matches nanochat's `n_embd`)
- [x] `nanochat/gpt.py`: `GPT.forward()` takes an optional `audio_features` argument, prepended as soft tokens in front of the text embeddings (the kv_cache path is not supported yet; targets at audio positions are automatically masked to -1)
- [x] Mini dataset: 5 synthetic 5 s sine clips with captions, stored in `data/audio_smoke/` (wavs are generated by `scripts/audio_smoke_data.py` and excluded via gitignore)
- [x] `scripts/audio_align_smoke.py`: 50 steps, d6 randomly initialized GPT, byte-level tokenizer; passes if the loss drops (measured on a 4090: ~1 s, 5.55 → 0.17)
- [ ] Add an audio smoke job to CI (install ffmpeg on the ailab runner; loading Whisper through transformers is enough)
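
The `audio_features` item above prepends projected audio frames as soft tokens and masks their target positions with -1. Below is a minimal pure-Python sketch of just that bookkeeping, assuming the loss skips targets equal to -1. `build_targets`, `masked_mean`, and `IGNORE_INDEX` are hypothetical names for illustration, not the actual `GPT.forward()` code.

```python
IGNORE_INDEX = -1  # assumption: the loss skips positions whose target is -1


def build_targets(num_audio_frames: int, text_targets: list[int]) -> list[int]:
    """Return targets for the sequence [audio soft tokens..., text tokens...].

    Audio positions have no ground-truth token, so they are masked out;
    text positions keep their original next-token targets.
    """
    if num_audio_frames < 0:
        raise ValueError("num_audio_frames must be non-negative")
    return [IGNORE_INDEX] * num_audio_frames + list(text_targets)


def masked_mean(per_position_loss: list[float], targets: list[int]) -> float:
    """Average the loss over unmasked (text) positions only."""
    kept = [loss for loss, tgt in zip(per_position_loss, targets) if tgt != IGNORE_INDEX]
    if not kept:
        raise ValueError("no unmasked positions to average over")
    return sum(kept) / len(kept)


# Three audio frames followed by three text tokens (byte ids chosen arbitrarily).
targets = build_targets(num_audio_frames=3, text_targets=[72, 105, 33])
print(targets)
```

Because only text positions contribute to the loss, the audio frames act purely as conditioning context.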

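Possible W1 follow-ups (shelved for now, left to the W3+/W5+ texture tasks):

- Currently uses `last_hidden_state` (the layer most biased toward text semantics); for texture awareness, switch to an intermediate layer / a multi-layer weighted sum / w2v-bert
- The d6 GPT is randomly initialized, so the alignment signal is really training the LM rather than the projector; true weak alignment only happens in W2, once a real base is in place, the LM is frozen, and only the projector is trained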
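
The first follow-up mentions a multi-layer weighted sum as an alternative to `last_hidden_state`. As a rough illustration of that idea only, the sketch below softmax-normalizes one logit per layer and uses the result to mix the hidden states of every encoder layer. It uses plain Python lists instead of tensors, and `layer_weighted_sum` and its arguments are hypothetical, not part of `nanochat/audio.py`.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over a short list of layer logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def layer_weighted_sum(hidden_states, layer_logits):
    """Mix per-layer features with softmax weights.

    hidden_states: list over layers of [frames][dim] nested lists.
    layer_logits:  one scalar per layer (these would be learnable in a real model).
    """
    if len(hidden_states) != len(layer_logits):
        raise ValueError("need exactly one logit per layer")
    weights = softmax(layer_logits)
    frames, dim = len(hidden_states[0]), len(hidden_states[0][0])
    mixed = [[0.0] * dim for _ in range(frames)]
    for w, layer in zip(weights, hidden_states):
        for t in range(frames):
            for d in range(dim):
                mixed[t][d] += w * layer[t][d]
    return mixed


# Two layers, one frame, two dims; equal logits reduce to a plain average.
layers = [[[0.0, 2.0]], [[4.0, 6.0]]]
mixed = layer_weighted_sum(layers, [0.0, 0.0])
print(mixed)
```

Training the logits lets the model decide how much weight to give lower layers relative to the final one.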

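## W2: S1 weak-alignment training

- [ ] Pull LibriSpeech 100h (HF mirror), pre-extract Whisper-base encoder features, and write them to disk as webdataset
- [ ] `scripts/audio_align_train.py`: freeze the LM and Whisper, train only the Projector
- [ ] PCA visualization of alignment quality (do the projected features cluster in the text embedding space?)
- [ ] wandb project: `nanochat-omni-audio` (kept separate from `nanochat`, the project for the nanochat text base)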

## W3: S2 instructions + LoRA

- [ ] Add LoRA to nanochat `Linear` layers (rank=16, attention/MLP only)
- [ ] A mix of 50k audio instruction samples (AudioBench + self-synthesized)
- [ ] eval: self-built 200-question AudioBench-mini
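
For a sense of scale on the LoRA item above: a rank-r adapter on a Linear layer with `in_features` inputs and `out_features` outputs adds two low-rank matrices, A (r × in) and B (out × r), so r · (in + out) trainable parameters. The sketch below only does that arithmetic. The 768-wide layer shapes are illustrative assumptions, not measured nanochat dimensions.

```python
def lora_extra_params(in_features: int, out_features: int, rank: int = 16) -> int:
    """Trainable parameters added by one LoRA adapter: A (rank x in) + B (out x rank)."""
    return rank * (in_features + out_features)


def lora_total(shapes: list[tuple[int, int]], rank: int = 16) -> int:
    """Sum adapter parameters over a list of (in_features, out_features) Linear shapes."""
    return sum(lora_extra_params(i, o, rank) for i, o in shapes)


# Hypothetical block of width 768: four attention projections plus a 4x MLP.
width = 768
block = [(width, width)] * 4 + [(width, 4 * width), (4 * width, width)]

adapter_params = lora_total(block)             # trainable LoRA parameters per block
base_params = sum(i * o for i, o in block)     # frozen weights in the same Linears
print(adapter_params, base_params, f"{adapter_params / base_params:.2%}")
```

Under these assumed shapes the adapters add only a few percent on top of the frozen weights.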

## W4: MVP demo

- [ ] Reuse `scripts/chat_web.py` and add recording upload
- [ ] AudioBench-mini accuracy ≥40% (baseline 25%)
- [ ] End-to-end time to first token <2 s on a 4090
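
The first-token budget above can be checked with a small timing helper. This sketch assumes generation is exposed as a Python iterator that yields tokens. `time_to_first_token` and `fake_stream` are hypothetical stand-ins. For a true end-to-end number, the timer would need to start when the audio upload is received.

```python
import time
from typing import Iterable, Iterator, Tuple, TypeVar

T = TypeVar("T")


def time_to_first_token(stream: Iterable[T]) -> Tuple[T, float, Iterator[T]]:
    """Return (first_token, seconds_elapsed, rest_of_stream).

    The clock starts just before the first item is requested, so lazy
    work done by the generator (encoding, prefill) is included.
    """
    start = time.perf_counter()
    iterator = iter(stream)
    first = next(iterator)  # raises StopIteration on an empty stream
    elapsed = time.perf_counter() - start
    return first, elapsed, iterator


def fake_stream():
    """Stand-in for a model: pretend prefill takes 10 ms, then stream tokens."""
    time.sleep(0.01)
    yield "hello"
    yield "world"


first, seconds, rest = time_to_first_token(fake_stream())
print(first, f"{seconds:.3f}s", "within budget:", seconds < 2.0)
```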

## W5+: scale-up / texture data / vision

See research §4.1; to be fleshed out in W5–W8.

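## Decisions

- **backbone**: self-trained nanochat d12 → d20 → d26 (no borrowing an off-the-shelf gemma/qwen; keep the hackable spirit)
- **Order**: audio first, vision from W7+; multimodal output (TTS/imagegen) is out of scope
- **infra**: training and smoke CI both run on ailab (5090, 32G); CN mirrors are sjtu/aliyun (pip), modelscope (model weights, first choice), and hf-mirror (HF datasets / weights fallback)
- **monorepo fork pattern**: upstream nanochat's code is our code; omni changes go directly into the `nanochat/` package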

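## Shelved / undecided

- [ ] Vision path: starts in W7+, following the LLaVA recipe and reusing the Projector abstraction from audio
- [ ] Self-synthesized texture data: use ailab's CosyVoice or IndexTTS to generate emotional variants (servers already exist on s1/i7; the cross-machine data production pipeline is undecided)
- [ ] B40 / GB10 real-hardware tests: the MVP does not depend on them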

omni: CI smoke + docs + README preamble (2026-05-05 22:21:31 +01:00)

- .gitea/workflows/smoke.yml: gitea CI on ailab gpu runner (manual git
  clone since actions/checkout@v4 mis-resolves subpath gitea); injects
  WANDB_API_KEY + CI_RUN_TAG=smoke-$run_number
- scripts/smoke.sh: in-place smoke (uv sync + 1 shard + tokenizer +
  d=6 50-step base_train); idempotent cache at /data/nanochat-smoke/
- doc/research_feasibility.md: voice-first multimodal feasibility study (mochi)
- doc/todo.md: phase-by-phase roadmap (W1 Whisper smoke → W4 MVP)
- README.md: omni preamble pointing at upstream nanochat README
- .gitignore: exclude .claude/ runtime files

omni: W1 audio align smoke - synthetic dataset + 50-step script (2026-05-05 22:39:20 +01:00)

End-to-end smoke proving the audio path:

    wav -> WhisperEncoder (frozen) -> Projector -> prepend to text embeddings
        -> tiny d6 GPT (random init) -> CE loss on text only

Pass criterion is a plain "loss drops by at least 0.5". On a 4090 the run
finishes in ~1 s and goes 5.55 -> 0.17 over 50 steps, so the threshold has
plenty of headroom against false positives.

Two design calls worth keeping in mind:

1. Synthetic sine clips, not LibriSpeech. W1 is forward-path proof, not
   alignment quality, and a deterministic offline dataset means no network
   on the smoke path. data/audio_smoke/manifest.jsonl is the only thing
   committed; wavs are regenerated by audio_smoke_data.py and gitignored.
   W2 swaps in real LibriSpeech.

2. Standalone byte-level tokenizer (UTF-8 bytes + a single BOS, vocab=257).
   Avoids depending on a trained nanochat BPE; the d6 GPT is random
   anyway, so vocab choice doesn't matter for a "does the gradient flow"
   smoke. W2 onwards uses the real BPE on a real base.

Caveat documented in doc/todo.md: because the LM is also random and being
trained, the loss-down here mostly reflects the LM memorising 5 short
strings, not Whisper-Projector alignment. That's fine for proving
plumbing; W2 freezes the LM so projector-only gradient is the only path
to lower loss.
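
The two mechanisms described in this commit message, the byte-level tokenizer and the loss-drop pass criterion, can be sketched as follows. This is a minimal illustration, assuming BOS takes id 256 (the message only states vocab=257) and that the drop is measured between the first and last logged loss. `ByteTokenizer` and `smoke_passed` are hypothetical names, not the contents of `scripts/audio_align_smoke.py`.

```python
class ByteTokenizer:
    """UTF-8 bytes (ids 0..255) plus a single BOS token."""

    BOS = 256          # assumption: BOS sits right after the 256 byte values
    vocab_size = 257

    def encode(self, text: str) -> list[int]:
        """Prefix BOS, then emit one id per UTF-8 byte."""
        return [self.BOS] + list(text.encode("utf-8"))

    def decode(self, ids: list[int]) -> str:
        """Drop BOS markers and decode the remaining bytes as UTF-8."""
        return bytes(i for i in ids if i != self.BOS).decode("utf-8")


def smoke_passed(losses: list[float], min_drop: float = 0.5) -> bool:
    """Pass when the loss fell by at least `min_drop` from first to last step."""
    if len(losses) < 2:
        raise ValueError("need at least two loss values")
    return losses[0] - losses[-1] >= min_drop


tok = ByteTokenizer()
ids = tok.encode("hi")
print(ids, tok.decode(ids))
print(smoke_passed([5.55, 0.17]))  # the 4090 run reported in the message
```

A fixed 257-entry vocabulary needs no training step, which keeps the smoke path free of any dependency on a trained BPE.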
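
doc: prefer ModelScope for Whisper encoder weights (closes #4) (2026-05-05 22:25:38 +01:00)

The W1 todo entry for the WhisperEncoder in audio.py previously said to pull
weights from the HF mirror, but pulling from HF inside China (even via
hf-mirror) frequently stalls. Switch to ModelScope as the first choice (e.g.
iic/Whisper-large-v3 / iic/Whisper-small) and keep the HF mirror as a fallback.
The infra decision entry is updated as well, so the mirror list lines up along
three tracks (pip / model weights / HF datasets) and states clearly that
modelscope is the first choice for model weights.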
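
The ModelScope-first, HF-fallback order described here could be resolved along these lines. It is a sketch only: it decides which source and id to use without downloading anything. It assumes `WHISPER_MS_ID` is an environment variable and that an unset or blank value means "fall back to HF". `resolve_whisper_source` is a hypothetical helper, not the actual `nanochat/audio.py` logic.

```python
import os
from typing import Mapping, Optional, Tuple

HF_DEFAULT_ID = "openai/whisper-base"  # HF fallback default named in the todo


def resolve_whisper_source(env: Optional[Mapping[str, str]] = None) -> Tuple[str, str]:
    """Return (source, model_id), preferring ModelScope when an id is configured.

    Assumption: WHISPER_MS_ID is an environment variable holding a ModelScope
    model id; when it is missing or blank we fall back to the HF default.
    """
    env = os.environ if env is None else env
    ms_id = env.get("WHISPER_MS_ID", "").strip()
    if ms_id:
        return "modelscope", ms_id
    return "hf", HF_DEFAULT_ID


print(resolve_whisper_source({"WHISPER_MS_ID": "iic/Whisper-small"}))
print(resolve_whisper_source({}))
```

Passing the environment in as a mapping keeps the decision testable without touching the real process environment.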