# Audio training (audio + Qwen2-Audio)

Sprint 35.2 adds audio sections to `.dlm` files. This recipe walks a
spoken-corpus workflow end-to-end: scaffold → drop clips + transcripts
→ train → query the adapter against new audio.

## Prerequisites

- Apple Silicon with ≥ 32 GB unified memory, or CUDA ≥ SM 8.0 with ≥ 24 GB VRAM. Qwen2-Audio-7B-Instruct at fp16 weighs ~15 GB, so it doesn't fit on 16 GB consumer GPUs without quantization (4-bit audio training is deferred).
- Qwen2-Audio cached locally (`huggingface-cli download Qwen/Qwen2-Audio-7B-Instruct`). Without the cached weights, the first train triggers the download automatically.
- The `audio` extra installed: `uv sync --extra audio` (pulls `soundfile` for decoding `.wav` / `.flac` / `.ogg`).

## Step 1 — Scaffold an audio `.dlm`

```bash
dlm init my-audio.dlm --audio
```

`--audio` pins the base to `qwen2-audio-7b-instruct` and emits a
schema-v11 scaffold with a sample `::audio::` fence. The initial
body references `clips/your-clip.wav` (non-existent by default —
drop a real clip at that path before the first train).

## Step 2 — Author audio sections

Two ways to supply audio. Inline each fence with the transcript:

```dlm
::audio path="clips/intro.wav" transcript="Welcome to the podcast."::

::instruction::
### Q
What did the speaker say?

### A
"Welcome to the podcast."
```

Or ingest a directory through a source directive. Audio files need
a matching `<stem>.txt` sidecar with the transcript:

```
corpus/
├── intro.wav
├── intro.txt    ← transcript for intro.wav
├── outro.flac
└── outro.txt
```

```dlm
---
dlm_id: 01JZ...
dlm_version: 11
base_model: qwen2-audio-7b-instruct
training:
  sources:
    - path: ./corpus
      include: ["**/*.wav", "**/*.flac"]
---
```

Each `.wav`/`.flac`/`.ogg` with a sibling `.txt` becomes an
`::audio::` section. Files without a sidecar are silently skipped +
counted in provenance (`dlm show --json` surfaces the skip count
under `source_directives[].skipped_audio_no_transcript`).

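For intuition, here is a minimal sketch of that pairing rule in Python. The function name `pair_sidecars` and its return shape are illustrative, not dlm's actual internals:

```python
from pathlib import Path

AUDIO_EXTS = {".wav", ".flac", ".ogg"}

def pair_sidecars(corpus: Path) -> tuple[list[tuple[Path, str]], int]:
    """Pair each audio file with its <stem>.txt transcript; count the rest."""
    paired, skipped = [], 0
    for audio in sorted(corpus.rglob("*")):
        if audio.suffix.lower() not in AUDIO_EXTS:
            continue
        sidecar = audio.with_suffix(".txt")
        if not sidecar.exists():
            skipped += 1  # surfaced as skipped_audio_no_transcript in provenance
            continue
        paired.append((audio, sidecar.read_text().strip()))
    return paired, skipped
```
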
## Step 3 — Train

```bash
dlm train my-audio.dlm
```

The trainer:

1. Loads Qwen2-Audio via `Qwen2AudioForConditionalGeneration` + its matching `AutoProcessor` (feature extractor + tokenizer).
2. Walks `training.sources` directives, copies each audio file's bytes into the content-addressed blob store at `~/.dlm/store/<dlm_id>/blobs/`.
3. Emits training rows shaped `{audio_blob_sha, audio_path, text: "<|AUDIO|>\n<transcript>"}`.
4. Runs our `AudioLmCollator` (custom — TRL 1.2 has no audio auto-dispatch). The collator decodes each waveform via `soundfile`, truncates to 30 s, hands the batch to the processor, and emits `{input_ids, attention_mask, input_features, labels}` (see the sketch after this list).
5. Commits the adapter under `adapter/versions/v0001/`.

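For orientation, a minimal sketch of what a collate step in that style does. This is not the real `AudioLmCollator`: the label masking and the `audios=` / `sampling_rate=` processor keywords are assumptions (newer transformers releases may spell the keyword `audio=`):

```python
import soundfile as sf

MAX_SECONDS = 30  # clips are truncated to this length

def collate(rows, processor, sample_rate=16_000):
    waveforms, texts = [], []
    for row in rows:
        wav, sr = sf.read(row["audio_path"], dtype="float32")  # decode via soundfile
        waveforms.append(wav[: MAX_SECONDS * sr])               # truncate to 30 s
        texts.append(row["text"])                               # "<|AUDIO|>\n<transcript>"
    batch = processor(
        text=texts,
        audios=waveforms,
        sampling_rate=sample_rate,
        return_tensors="pt",
        padding=True,
    )
    labels = batch["input_ids"].clone()
    labels[batch["attention_mask"] == 0] = -100  # ignore padding in the loss
    batch["labels"] = labels
    return batch  # {input_ids, attention_mask, input_features, labels}
```
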
**Sample-rate policy.** By default the trainer refuses audio whose
native rate doesn't match the base's pinned `sample_rate`
(Qwen2-Audio: 16 kHz). Two ways to reconcile:

**(1) Manual re-encode** — preferred for archive-stable corpora:

```bash
ffmpeg -i in.mp3 -ar 16000 out.wav
```

**(2) Opt into automatic resampling** — flip the frontmatter knob:

```yaml
training:
  audio:
    auto_resample: true
```

With `auto_resample: true`, any clip whose native rate disagrees is
resampled on the fly via `dlm.data.audio_resample` (soxr if
installed, else `scipy.signal.resample_poly`). Resampled waveforms
are cached separately from native-rate ones — toggling the flag on an
existing corpus doesn't serve stale entries. Install soxr for best
quality + speed (`pip install dlm[audio]` pulls it in), or
`pip install scipy` as a fallback. Without either, the trainer
raises `AudioResampleUnavailable` at the first mismatched decode
rather than training on the wrong rate.

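A minimal sketch of that fallback order, assuming the helper takes a waveform plus source and target rates. The real implementation lives in `dlm.data.audio_resample` and raises the package's own `AudioResampleUnavailable`; this sketch uses a generic `RuntimeError`:

```python
from math import gcd
import numpy as np

def audio_resample(wav: np.ndarray, src_rate: int, dst_rate: int) -> np.ndarray:
    if src_rate == dst_rate:
        return wav
    try:
        import soxr  # preferred: best quality + speed
        return soxr.resample(wav, src_rate, dst_rate)
    except ImportError:
        pass
    try:
        from scipy.signal import resample_poly  # fallback
        g = gcd(src_rate, dst_rate)
        return resample_poly(wav, dst_rate // g, src_rate // g)
    except ImportError as exc:
        raise RuntimeError("requires either soxr or scipy") from exc
```
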
**Wall-clock expectations.** A 5-clip (30 s each) corpus + 3 epochs
on an RTX 4090 at `micro_batch_size=1` + `grad_accum=4` takes about
20 minutes. Apple Silicon is ~4× slower.

## Step 4 — Prompt the trained adapter

```bash
dlm prompt my-audio.dlm --audio clips/new-clip.wav \
  "What did the speaker say?"
```

`--audio` is required for audio bases. Repeat the flag for multi-clip
prompts; each occurrence expands to one `<|AUDIO|>` placeholder that
the processor replaces with 750 audio tokens (30 s × 25 tokens/s).

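For example, a two-clip prompt (hypothetical clip paths and question) repeats the flag once per clip:

```bash
dlm prompt my-audio.dlm \
  --audio clips/intro.wav \
  --audio clips/outro.flac \
  "Which clip mentions the sponsor?"
```
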
`--image` and `--audio` cannot be combined — each targets a different
modality.

## Step 5 — Export

Audio bases take the HF-snapshot path (audio architectures aren't on
`llama.cpp`'s roadmap, so GGUF isn't available):

```bash
dlm export my-audio.dlm
```

Writes to `~/.dlm/store/<dlm_id>/exports/hf-audio-snapshot/`:

```
hf-audio-snapshot/
  adapter/                  # PEFT LoRA weights
  processor/                # AutoProcessor config + feature extractor
  snapshot_manifest.json    # export_target=hf_snapshot + sha256s
  README.md                 # how to load downstream
```

Load on the other side:

```python
from transformers import Qwen2AudioForConditionalGeneration, AutoProcessor
from peft import PeftModel

base = Qwen2AudioForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-Audio-7B-Instruct",
)
model = PeftModel.from_pretrained(base, "./adapter")
processor = AutoProcessor.from_pretrained("./processor")
```

The base isn't bundled — recipients download it on first use.
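
Continuing that snippet, a hedged sketch of one inference call. The prompt string mirrors the training-row format above; the `audios=` / `sampling_rate=` keywords follow the published Qwen2-Audio processor usage and may be spelled `audio=` in newer transformers releases:

```python
import soundfile as sf

wav, sr = sf.read("clips/new-clip.wav", dtype="float32")  # expects 16 kHz audio

inputs = processor(
    text="<|AUDIO|>\nWhat did the speaker say?",
    audios=[wav],
    sampling_rate=sr,
    return_tensors="pt",
    padding=True,
)
output_ids = model.generate(**inputs, max_new_tokens=64)
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```
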
## Troubleshooting

"audio not found: clips/your-clip.wav"
The `--audio` scaffold points at a placeholder; drop a real clip at
that path or edit the `::audio path="..."::` fence.

"native sample_rate=44100 Hz does not match pinned 16000 Hz"
Your clip is at 44.1 kHz (CD rate) but Qwen2-Audio expects 16 kHz. Either re-encode manually:

```bash
ffmpeg -i in.wav -ar 16000 out.wav
```

Or opt into on-the-fly resampling by setting
`training.audio.auto_resample: true` in the frontmatter. The error
message from the trainer now names this knob directly.

"AudioResampleUnavailable: requires either soxr or scipy"
You set `auto_resample: true` but neither resampler is importable.
Install one: `pip install soxr` (recommended, ships with the
`dlm[audio]` extra) or `pip install scipy` as a fallback.

"audio-language base requires at least one --audio PATH"
You ran `dlm prompt` on an audio `.dlm` without attaching a clip.
Audio bases always expect a waveform — even a throwaway question
about transcript content needs an audio input to anchor the
placeholder token.

"AUDIO section has empty transcript"
Both the inline `transcript="..."` form and the sibling `<stem>.txt`
form must produce a non-empty transcript. Whitespace-only transcripts
are refused (the trainer has no target text to predict).

### Disk / memory issues

Qwen2-Audio-7B is ~15 GB on disk and another ~15 GB in memory at
fp16. Close other GPU consumers, use `--max-steps 1` to dry-run, or
wait for the audio-QLoRA path (deferred).

## What's not yet in Sprint 35.2

- **Resampling.** ~~v1 refuses sample-rate mismatches.~~ Opt-in automatic resampling via `training.audio.auto_resample: true` landed as a deferred-item follow-up (soxr preferred, `scipy.signal.resample_poly` fallback). Defaults to off so the refuse-on-mismatch contract stays backward-compatible.
- **MP3 support.** `soundfile` needs libsndfile ≥ 1.1 for MP3; we lock to `.wav` / `.flac` / `.ogg` in v1 to avoid shipping a libsndfile hard-pin.
- **Audio feature caching in training.** `AudioCache` is wired for the standalone inference path and the slow integration test; the training hot path doesn't re-use the cache yet (each epoch re-extracts features). Meaningful speed-up lands alongside multi-epoch audio corpora where re-extraction dominates.
- **QLoRA for audio.** 4-bit audio training needs extra safety testing for the audio encoder weights; deferred.
- **Multiple audio clips per section.** Each `::audio::` fence carries one clip; prompts can stack multiple `<|AUDIO|>` tokens by repeating `--audio` on the CLI.