
# Audio training (audio + Qwen2-Audio)

Sprint 35.2 adds audio sections to `.dlm` files. This recipe walks a spoken-corpus workflow end-to-end: scaffold → drop clips + transcripts → train → query the adapter against new audio.

## Prerequisites

- Apple Silicon with ≥ 32 GB unified memory, or CUDA ≥ SM 8.0 with ≥ 24 GB VRAM. Qwen2-Audio-7B-Instruct fp16 weighs ~15 GB; 16 GB consumer GPUs can't hold this base without quantization (4-bit audio training is deferred).
- Qwen2-Audio cached locally (`huggingface-cli download Qwen/Qwen2-Audio-7B-Instruct`). If the model isn't cached, the first train triggers the download automatically.
- The `audio` extra installed: `uv sync --extra audio` (pulls `soundfile` for decoding `.wav` / `.flac` / `.ogg`).

## Step 1 — Scaffold an audio `.dlm`

```bash
dlm init my-audio.dlm --audio
```

`--audio` pins the base to `qwen2-audio-7b-instruct` and emits a schema-v11 scaffold with a sample `::audio::` fence. The initial body references `clips/your-clip.wav` (non-existent by default — drop a real clip at that path before the first train).
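For orientation, the scaffold body looks roughly like this. It is reconstructed from the fence and frontmatter shapes used later in this recipe, not a verbatim dump of what `dlm init` writes:

```dlm
---
dlm_id: 01JZ...
dlm_version: 11
base_model: qwen2-audio-7b-instruct
---

::audio path="clips/your-clip.wav" transcript="Replace with the spoken text."::
```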

## Step 2 — Author audio sections

Two ways to supply audio. Inline each fence with the transcript:

```dlm
::audio path="clips/intro.wav" transcript="Welcome to the podcast."::

::instruction::
### Q
What did the speaker say?

### A
"Welcome to the podcast."
```

Or ingest a directory through a source directive. Audio files need a matching `<stem>.txt` sidecar with the transcript:

```
corpus/
├── intro.wav
├── intro.txt         ← transcript for intro.wav
├── outro.flac
└── outro.txt
```

The frontmatter directive that ingests it:

```dlm
---
dlm_id: 01JZ...
dlm_version: 11
base_model: qwen2-audio-7b-instruct
training:
  sources:
    - path: ./corpus
      include: ["**/*.wav", "**/*.flac"]
---
```

Each `.wav`/`.flac`/`.ogg` with a sibling `.txt` becomes an `::audio::` section. Files without a sidecar are silently skipped + counted in provenance (`dlm show --json` surfaces the skip count under `source_directives[].skipped_audio_no_transcript`).
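A minimal sketch of that pairing rule, as a hypothetical helper rather than `dlm`'s actual ingest code:

```python
from pathlib import Path

def pair_audio_with_transcripts(root: Path, patterns=("**/*.wav", "**/*.flac")):
    """Pair each audio file with its <stem>.txt sidecar transcript."""
    pairs, skipped = [], 0
    for pattern in patterns:
        for audio in sorted(root.glob(pattern)):
            sidecar = audio.with_suffix(".txt")
            if not sidecar.is_file():
                skipped += 1  # surfaced as skipped_audio_no_transcript
                continue
            transcript = sidecar.read_text().strip()
            if not transcript:
                # whitespace-only transcripts are refused, not skipped
                raise ValueError(f"AUDIO section has empty transcript: {audio}")
            pairs.append((audio, transcript))
    return pairs, skipped
```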

## Step 3 — Train

```bash
dlm train my-audio.dlm
```

The trainer:

1. Loads Qwen2-Audio via `Qwen2AudioForConditionalGeneration` + its matching `AutoProcessor` (feature extractor + tokenizer).
2. Walks `training.sources` directives, copies each audio file's bytes into the content-addressed blob store at `~/.dlm/store/<dlm_id>/blobs/`.
3. Emits training rows shaped `{audio_blob_sha, audio_path, text: "<|AUDIO|>\n<transcript>"}`.
4. Runs our `AudioLmCollator` (custom — TRL 1.2 has no audio auto-dispatch). The collator decodes each waveform via `soundfile`, truncates to 30 s, hands the batch to the processor, and emits `{input_ids, attention_mask, input_features, labels}`.
5. Commits the adapter under `adapter/versions/v0001/`.
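For illustration, the core of such a collator might look like this. It's a sketch of the shape, not `AudioLmCollator`'s source; note the raw-waveform keyword (`audio` vs. `audios`) varies across `transformers` releases:

```python
import soundfile as sf

MAX_SECONDS, SAMPLE_RATE = 30, 16_000

class AudioLmCollatorSketch:
    """Illustrative stand-in for the custom audio collator."""

    def __init__(self, processor):
        self.processor = processor  # Qwen2-Audio AutoProcessor

    def __call__(self, rows):
        waveforms, texts = [], []
        for row in rows:
            wave, rate = sf.read(row["audio_path"], dtype="float32")
            waveforms.append(wave[: MAX_SECONDS * rate])  # truncate to 30 s
            texts.append(row["text"])  # "<|AUDIO|>\n<transcript>"
        batch = self.processor(
            text=texts,
            audio=waveforms,  # older transformers releases spell this `audios=`
            sampling_rate=SAMPLE_RATE,
            padding=True,
            return_tensors="pt",
        )
        labels = batch["input_ids"].clone()
        labels[batch["attention_mask"] == 0] = -100  # ignore padding in the loss
        batch["labels"] = labels
        return batch
```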

**Sample-rate policy.** By default the trainer refuses audio whose native rate doesn't match the base's pinned `sample_rate` (Qwen2-Audio: 16 kHz). Two ways to reconcile:

**(1) Manual re-encode** — preferred for archive-stable corpora:

```bash
ffmpeg -i in.mp3 -ar 16000 out.wav
```

**(2) Opt into automatic resampling** — flip the frontmatter knob:

```yaml
training:
  audio:
    auto_resample: true
```

With `auto_resample: true`, any clip whose native rate disagrees is resampled on the fly via `dlm.data.audio_resample` (`soxr` if installed, else `scipy.signal.resample_poly`). Resampled waveforms are cached separately from native-rate ones — toggling the flag on an existing corpus doesn't serve stale entries. Install `soxr` for best quality + speed (`pip install dlm[audio]` pulls it in), or `pip install scipy` as a fallback. Without either, the trainer raises `AudioResampleUnavailable` at the first mismatched decode rather than training on the wrong rate.
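For intuition, a sketch of that preference order using the standalone `soxr` and `scipy` APIs (not the actual `dlm.data.audio_resample` source):

```python
from fractions import Fraction

import numpy as np

def resample(wave: np.ndarray, native_rate: int, target_rate: int) -> np.ndarray:
    """Resample to target_rate, preferring soxr, falling back to scipy."""
    if native_rate == target_rate:
        return wave
    try:
        import soxr
        return soxr.resample(wave, native_rate, target_rate)
    except ImportError:
        pass
    try:
        from scipy.signal import resample_poly
        ratio = Fraction(target_rate, native_rate)  # 16000/44100 → 160/441
        return resample_poly(wave, ratio.numerator, ratio.denominator)
    except ImportError:
        # stands in for AudioResampleUnavailable
        raise RuntimeError("auto_resample requires either soxr or scipy") from None
```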

**Wall-clock expectations.** A 5-clip (30 s each) corpus + 3 epochs on an RTX 4090 at `micro_batch_size=1` + `grad_accum=4` takes about 20 minutes. Apple Silicon is ~4× slower.

## Step 4 — Prompt the trained adapter

```bash
dlm prompt my-audio.dlm --audio clips/new-clip.wav \
  "What did the speaker say?"
```

`--audio` is required for audio bases. Repeat the flag for multi-clip prompts; each occurrence expands to one `<|AUDIO|>` placeholder that the processor replaces with 750 audio tokens (30 s × 25 tokens/s).

`--image` and `--audio` cannot be combined — each targets a different modality.

## Step 5 — Export

Audio bases take the HF-snapshot path (audio architectures aren't on `llama.cpp`'s roadmap, so GGUF isn't available):

```bash
dlm export my-audio.dlm
```

Writes to `~/.dlm/store/<dlm_id>/exports/hf-audio-snapshot/`:

```
hf-audio-snapshot/
  adapter/                  # PEFT LoRA weights
  processor/                # AutoProcessor config + feature extractor
  snapshot_manifest.json    # export_target=hf_snapshot + sha256s
  README.md                 # how to load downstream
```
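Recipients can cross-check the snapshot against the recorded digests. A sketch assuming the manifest maps relative paths to sha256 hex digests under a hypothetical `files` key; the actual schema may differ:

```python
import hashlib
import json
from pathlib import Path

def verify_snapshot(root: Path) -> None:
    """Recompute sha256s and compare against snapshot_manifest.json."""
    manifest = json.loads((root / "snapshot_manifest.json").read_text())
    for rel_path, expected in manifest["files"].items():  # hypothetical key
        digest = hashlib.sha256((root / rel_path).read_bytes()).hexdigest()
        if digest != expected:
            raise ValueError(f"sha256 mismatch for {rel_path}")
```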

Load on the other side:

```python
from transformers import Qwen2AudioForConditionalGeneration, AutoProcessor
from peft import PeftModel

base = Qwen2AudioForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-Audio-7B-Instruct",
)
model = PeftModel.from_pretrained(base, "./adapter")
processor = AutoProcessor.from_pretrained("./processor")
```

The base isn't bundled — recipients download it on first use.
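From there, a transcription-style query looks roughly like this. A hedged sketch, not the `dlm prompt` internals: the raw-waveform keyword is version-dependent as noted above, and the prompt string reuses the `<|AUDIO|>\n<question>` shape from the training rows:

```python
import soundfile as sf

wave, rate = sf.read("clips/new-clip.wav", dtype="float32")
inputs = processor(
    text="<|AUDIO|>\nWhat did the speaker say?",
    audio=[wave],  # older transformers releases spell this `audios=`
    sampling_rate=rate,
    return_tensors="pt",
)
output_ids = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```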

## Troubleshooting

### "audio not found: clips/your-clip.wav"

The `--audio` scaffold points at a placeholder; drop a real clip at that path or edit the `::audio path="..."::` fence.

"native sample_rate=44100 Hz does not match pinned 16000 Hz"

Your clip is at 44.1 kHz (CD rate) but Qwen2-Audio expects 16 kHz. Either re-encode manually:

ffmpeg -i in.wav -ar 16000 out.wav

Or opt into on-the-fly resampling by setting training.audio.auto_resample: true in the frontmatter. The error message from the trainer now names this knob directly.

"AudioResampleUnavailable: requires either soxr or scipy"

You set auto_resample: true but neither resampler is importable. Install one: pip install soxr (recommended, ships with the dlm[audio] extra) or pip install scipy as a pure-Python fallback.

"audio-language base requires at least one --audio PATH"

You ran dlm prompt on an audio .dlm without attaching a clip. Audio bases always expect a waveform — even a throwaway question about transcript content needs an audio input to anchor the placeholder token.

"AUDIO section has empty transcript"

Both the inline transcript="..." form and the sibling <stem>.txt form must produce a non-empty transcript. Whitespace-only transcripts are refused (the trainer has no target text to predict).

### Disk / memory issues

Qwen2-Audio-7B is ~15 GB on disk and another ~15 GB in memory at fp16. Close other GPU consumers, use `--max-steps 1` to dry-run, or wait for the audio-QLoRA path (deferred).

## What's not yet in Sprint 35.2

- **Resampling.** ~~v1 refuses sample-rate mismatches.~~ Opt-in automatic resampling via `training.audio.auto_resample: true` landed as a deferred-item follow-up (`soxr` preferred, `scipy.signal.resample_poly` fallback). Defaults to off so the refuse-on-mismatch contract stays backward-compatible.
- **MP3 support.** `soundfile` needs libsndfile ≥ 1.1 for MP3; we lock to `.wav` / `.flac` / `.ogg` in v1 to avoid shipping a libsndfile hard-pin.
- **Audio feature caching in training.** `AudioCache` is wired for the standalone inference path and the slow integration test; the training hot path doesn't re-use the cache yet (each epoch re-extracts features). Meaningful speed-up lands alongside multi-epoch audio corpora where re-extraction dominates.
- **QLoRA for audio.** 4-bit audio training needs extra safety testing for the audio encoder weights; deferred.
- **Multiple audio clips per section.** Each `::audio::` fence carries one clip; prompts can stack multiple `<|AUDIO|>` tokens by repeating `--audio` on the CLI.