# Audio training (audio + Qwen2-Audio)

Sprint 35.2 adds audio sections to `.dlm` files. This recipe walks a
spoken-corpus workflow end-to-end: scaffold → drop clips + transcripts
→ train → query the adapter against new audio.

## Prerequisites

- Apple Silicon with ≥ 32 GB unified memory, or CUDA ≥ SM 8.0 with ≥ 24 GB VRAM. Qwen2-Audio-7B-Instruct at fp16 weighs ~15 GB, so it doesn't fit on 16 GB consumer GPUs without quantization (4-bit audio training is deferred).
- Qwen2-Audio cached locally (`huggingface-cli download Qwen/Qwen2-Audio-7B-Instruct`). Without the cached weights, the first train triggers the download automatically.
- The `audio` extra installed: `uv sync --extra audio` (pulls `soundfile` for decoding `.wav` / `.flac` / `.ogg`).

## Step 1 — Scaffold an audio `.dlm`

```bash
dlm init my-audio.dlm --audio
```

`--audio` pins the base to `qwen2-audio-7b-instruct` and emits a
schema-v11 scaffold with a sample `::audio::` fence. The initial
body references `clips/your-clip.wav` (non-existent by default —
drop a real clip at that path before the first train).

## Step 2 — Author audio sections

Two ways to supply audio. Inline each fence with the transcript:

```dlm
::audio path="clips/intro.wav" transcript="Welcome to the podcast."::

::instruction::
### Q
What did the speaker say?

### A
"Welcome to the podcast."
```

Or ingest a directory through a source directive. Audio files need
a matching `<stem>.txt` sidecar with the transcript:

```
corpus/
├── intro.wav
├── intro.txt    ← transcript for intro.wav
├── outro.flac
└── outro.txt
```

```dlm
---
dlm_id: 01JZ...
dlm_version: 11
base_model: qwen2-audio-7b-instruct
training:
  sources:
    - path: ./corpus
      include: ["**/*.wav", "**/*.flac"]
---
```

Each `.wav`/`.flac`/`.ogg` with a sibling `.txt` becomes an
`::audio::` section. Files without a sidecar are silently skipped +
counted in provenance (`dlm show --json` surfaces the skip count
under `source_directives[].skipped_audio_no_transcript`).

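For intuition, here is a minimal sketch of that pairing rule in Python. The function name `pair_sidecars` and its return shape are illustrative, not dlm's actual internals:

```python
from pathlib import Path

AUDIO_EXTS = {".wav", ".flac", ".ogg"}

def pair_sidecars(corpus: Path) -> tuple[list[tuple[Path, str]], int]:
    """Pair each audio file with its <stem>.txt transcript; count the rest."""
    paired, skipped = [], 0
    for audio in sorted(corpus.rglob("*")):
        if audio.suffix.lower() not in AUDIO_EXTS:
            continue
        sidecar = audio.with_suffix(".txt")
        if not sidecar.exists():
            skipped += 1  # surfaced as skipped_audio_no_transcript in provenance
            continue
        paired.append((audio, sidecar.read_text().strip()))
    return paired, skipped
```
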
## Step 3 — Train

```bash
dlm train my-audio.dlm
```

The trainer:

1. Loads Qwen2-Audio via `Qwen2AudioForConditionalGeneration` + its matching `AutoProcessor` (feature extractor + tokenizer).
2. Walks `training.sources` directives, copies each audio file's bytes into the content-addressed blob store at `~/.dlm/store/<dlm_id>/blobs/`.
3. Emits training rows shaped `{audio_blob_sha, audio_path, text: "<|AUDIO|>\n<transcript>"}`.
4. Runs our `AudioLmCollator` (custom — TRL 1.2 has no audio auto-dispatch). The collator decodes each waveform via `soundfile`, truncates to 30 s, hands the batch to the processor, and emits `{input_ids, attention_mask, input_features, labels}` (see the sketch after this list).
5. Commits the adapter under `adapter/versions/v0001/`.

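For orientation, a minimal sketch of what a collate step in that style does. This is not the real `AudioLmCollator`: the label masking and the `audios=` / `sampling_rate=` processor keywords are assumptions (newer transformers releases may spell the keyword `audio=`):

```python
import soundfile as sf

MAX_SECONDS = 30  # clips are truncated to this length

def collate(rows, processor, sample_rate=16_000):
    waveforms, texts = [], []
    for row in rows:
        wav, sr = sf.read(row["audio_path"], dtype="float32")  # decode via soundfile
        waveforms.append(wav[: MAX_SECONDS * sr])               # truncate to 30 s
        texts.append(row["text"])                               # "<|AUDIO|>\n<transcript>"
    batch = processor(
        text=texts,
        audios=waveforms,
        sampling_rate=sample_rate,
        return_tensors="pt",
        padding=True,
    )
    labels = batch["input_ids"].clone()
    labels[batch["attention_mask"] == 0] = -100  # ignore padding in the loss
    batch["labels"] = labels
    return batch  # {input_ids, attention_mask, input_features, labels}
```
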
**Sample-rate policy.** By default the trainer refuses audio whose
native rate doesn't match the base's pinned `sample_rate`
(Qwen2-Audio: 16 kHz). Two ways to reconcile:

**(1) Manual re-encode** — preferred for archive-stable corpora:

```bash
ffmpeg -i in.mp3 -ar 16000 out.wav
```

**(2) Opt into automatic resampling** — flip the frontmatter knob:

```yaml
training:
  audio:
    auto_resample: true
```

With `auto_resample: true`, any clip whose native rate disagrees is
resampled on the fly via `dlm.data.audio_resample` (soxr if
installed, else `scipy.signal.resample_poly`). Resampled waveforms
are cached separately from native-rate ones — toggling the flag on an
existing corpus doesn't serve stale entries. Install soxr for best
quality + speed (`pip install dlm[audio]` pulls it in), or
`pip install scipy` as a fallback. Without either, the trainer
raises `AudioResampleUnavailable` at the first mismatched decode
rather than training on the wrong rate.

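A minimal sketch of that fallback order, assuming the helper takes a waveform plus source and target rates. The real implementation lives in `dlm.data.audio_resample` and raises the package's own `AudioResampleUnavailable`; this sketch uses a generic `RuntimeError`:

```python
from math import gcd
import numpy as np

def audio_resample(wav: np.ndarray, src_rate: int, dst_rate: int) -> np.ndarray:
    if src_rate == dst_rate:
        return wav
    try:
        import soxr  # preferred: best quality + speed
        return soxr.resample(wav, src_rate, dst_rate)
    except ImportError:
        pass
    try:
        from scipy.signal import resample_poly  # fallback
        g = gcd(src_rate, dst_rate)
        return resample_poly(wav, dst_rate // g, src_rate // g)
    except ImportError as exc:
        raise RuntimeError("requires either soxr or scipy") from exc
```
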
**Wall-clock expectations.** A 5-clip (30 s each) corpus + 3 epochs
on an RTX 4090 at `micro_batch_size=1` + `grad_accum=4` takes about
20 minutes. Apple Silicon is ~4× slower.

## Step 4 — Prompt the trained adapter

```bash
dlm prompt my-audio.dlm --audio clips/new-clip.wav \
  "What did the speaker say?"
```

`--audio` is required for audio bases. Repeat the flag for multi-clip
prompts; each occurrence expands to one `<|AUDIO|>` placeholder that
the processor replaces with 750 audio tokens (30 s × 25 tokens/s).

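For example, a two-clip prompt (hypothetical clip paths and question) repeats the flag once per clip:

```bash
dlm prompt my-audio.dlm \
  --audio clips/intro.wav \
  --audio clips/outro.flac \
  "Which clip mentions the sponsor?"
```
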
`--image` and `--audio` cannot be combined — each targets a different
modality.

## Step 5 — Export

Audio bases take the HF-snapshot path (audio architectures aren't on
`llama.cpp`'s roadmap, so GGUF isn't available):

```bash
dlm export my-audio.dlm
```

Writes to `~/.dlm/store/<dlm_id>/exports/hf-audio-snapshot/`:

```
hf-audio-snapshot/
  adapter/                  # PEFT LoRA weights
  processor/                # AutoProcessor config + feature extractor
  snapshot_manifest.json    # export_target=hf_snapshot + sha256s
  README.md                 # how to load downstream
```

Load on the other side:

```python
from transformers import Qwen2AudioForConditionalGeneration, AutoProcessor
from peft import PeftModel

base = Qwen2AudioForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-Audio-7B-Instruct",
)
model = PeftModel.from_pretrained(base, "./adapter")
processor = AutoProcessor.from_pretrained("./processor")
```

The base isn't bundled — recipients download it on first use.
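
Continuing that snippet, a hedged sketch of one inference call. The prompt string mirrors the training-row format above; the `audios=` / `sampling_rate=` keywords follow the published Qwen2-Audio processor usage and may be spelled `audio=` in newer transformers releases:

```python
import soundfile as sf

wav, sr = sf.read("clips/new-clip.wav", dtype="float32")  # expects 16 kHz audio

inputs = processor(
    text="<|AUDIO|>\nWhat did the speaker say?",
    audios=[wav],
    sampling_rate=sr,
    return_tensors="pt",
    padding=True,
)
output_ids = model.generate(**inputs, max_new_tokens=64)
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```
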
## Troubleshooting

"audio not found: clips/your-clip.wav"
The `--audio` scaffold points at a placeholder; drop a real clip at
that path or edit the `::audio path="..."::` fence.

"native sample_rate=44100 Hz does not match pinned 16000 Hz"
Your clip is at 44.1 kHz (CD rate) but Qwen2-Audio expects 16 kHz. Either re-encode manually:

```bash
ffmpeg -i in.wav -ar 16000 out.wav
```

Or opt into on-the-fly resampling by setting
`training.audio.auto_resample: true` in the frontmatter. The error
message from the trainer now names this knob directly.

"AudioResampleUnavailable: requires either soxr or scipy"
You set `auto_resample: true` but neither resampler is importable.
Install one: `pip install soxr` (recommended, ships with the
`dlm[audio]` extra) or `pip install scipy` as a fallback.

"audio-language base requires at least one --audio PATH"
You ran `dlm prompt` on an audio `.dlm` without attaching a clip.
Audio bases always expect a waveform — even a throwaway question
about transcript content needs an audio input to anchor the
placeholder token.

"AUDIO section has empty transcript"
Both the inline `transcript="..."` form and the sibling `<stem>.txt`
form must produce a non-empty transcript. Whitespace-only transcripts
are refused (the trainer has no target text to predict).

### Disk / memory issues

Qwen2-Audio-7B is ~15 GB on disk and another ~15 GB in memory at
fp16. Close other GPU consumers, use `--max-steps 1` to dry-run, or
wait for the audio-QLoRA path (deferred).

## What's not yet in Sprint 35.2

- **Resampling.** ~~v1 refuses sample-rate mismatches.~~ Opt-in automatic resampling via `training.audio.auto_resample: true` landed as a deferred-item follow-up (soxr preferred, `scipy.signal.resample_poly` fallback). Defaults to off so the refuse-on-mismatch contract stays backward-compatible.
- **MP3 support.** `soundfile` needs libsndfile ≥ 1.1 for MP3; we lock to `.wav` / `.flac` / `.ogg` in v1 to avoid shipping a libsndfile hard-pin.
- **Audio feature caching in training.** `AudioCache` is wired for the standalone inference path and the slow integration test; the training hot path doesn't re-use the cache yet (each epoch re-extracts features). Meaningful speed-up lands alongside multi-epoch audio corpora where re-extraction dominates.
- **QLoRA for audio.** 4-bit audio training needs extra safety testing for the audio encoder weights; deferred.
- **Multiple audio clips per section.** Each `::audio::` fence carries one clip; prompts can stack multiple `<|AUDIO|>` tokens by repeating `--audio` on the CLI.