# Audio training (audio + Qwen2-Audio)

Sprint 35.2 adds audio sections to `.dlm` files. This recipe walks a spoken-corpus workflow end-to-end: scaffold → drop clips + transcripts → train → query the adapter against new audio.

## Prerequisites

- Apple Silicon with ≥ 32 GB unified memory, or CUDA ≥ SM 8.0 with ≥ 24 GB VRAM. Qwen2-Audio-7B-Instruct fp16 weighs ~15 GB; 16 GB consumer GPUs can't fit this base without quantization (4-bit audio training is deferred).
- Qwen2-Audio cached locally (`huggingface-cli download Qwen/Qwen2-Audio-7B-Instruct`). If it isn't cached, the first train triggers the download automatically.
- The `audio` extra installed: `uv sync --extra audio` (pulls in `soundfile` for decoding `.wav` / `.flac` / `.ogg`).

## Step 1 — Scaffold an audio `.dlm`

```bash
dlm init my-audio.dlm --audio
```

`--audio` pins the base to `qwen2-audio-7b-instruct` and emits a schema-v11 scaffold with a sample `::audio::` fence. The initial body references `clips/your-clip.wav` (non-existent by default — drop a real clip at that path before the first train).

## Step 2 — Author audio sections

There are two ways to supply audio. Inline each fence with its transcript:

```dlm
::audio path="clips/intro.wav" transcript="Welcome to the podcast."::

::instruction::
### Q
What did the speaker say?

### A
"Welcome to the podcast."
```

Or ingest a directory through a source directive. Each audio file needs a matching `.txt` sidecar with the transcript:

```
corpus/
├── intro.wav
├── intro.txt    ← transcript for intro.wav
├── outro.flac
└── outro.txt
```

```dlm
---
dlm_id: 01JZ...
dlm_version: 11
base_model: qwen2-audio-7b-instruct
training:
  sources:
    - path: ./corpus
      include: ["**/*.wav", "**/*.flac"]
---
```

Each `.wav` / `.flac` / `.ogg` with a sibling `.txt` becomes an `::audio::` section. Files without a sidecar are silently skipped and counted in provenance (`dlm show --json` surfaces the skip count under `source_directives[].skipped_audio_no_transcript`).

## Step 3 — Train

```bash
dlm train my-audio.dlm
```

The trainer:

1. Loads Qwen2-Audio via `Qwen2AudioForConditionalGeneration` plus its matching `AutoProcessor` (feature extractor + tokenizer).
2. Walks the `training.sources` directives and copies each audio file's bytes into the content-addressed blob store at `~/.dlm/store//blobs/`.
3. Emits training rows shaped `{audio_blob_sha, audio_path, text: "<|AUDIO|>\n"}`.
4. Runs our `AudioLmCollator` (custom — TRL 1.2 has no audio auto-dispatch). The collator decodes each waveform via `soundfile`, truncates to 30 s, hands the batch to the processor, and emits `{input_ids, attention_mask, input_features, labels}`.
5. Commits the adapter under `adapter/versions/v0001/`.

**Sample-rate policy.** By default the trainer refuses audio whose native rate doesn't match the base's pinned `sample_rate` (Qwen2-Audio: 16 kHz). There are two ways to reconcile a mismatch.

**(1) Manual re-encode** — preferred for archive-stable corpora:

```bash
ffmpeg -i in.mp3 -ar 16000 out.wav
```

**(2) Opt into automatic resampling** — flip the frontmatter knob:

```yaml
training:
  audio:
    auto_resample: true
```

With `auto_resample: true`, any clip whose native rate disagrees is resampled on the fly via `dlm.data.audio_resample` (soxr if installed, otherwise `scipy.signal.resample_poly`). Resampled waveforms are cached separately from native-rate ones, so toggling the flag on an existing corpus doesn't serve stale entries. Install soxr for the best quality and speed (`pip install dlm[audio]` pulls it in), or `pip install scipy` as a fallback. Without either, the trainer raises `AudioResampleUnavailable` at the first mismatched decode rather than training on the wrong rate.
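For orientation, here is a minimal sketch of that fallback order, assuming a mono waveform already decoded by `soundfile`. `resample_to_pinned` is a hypothetical helper name, not the actual `dlm.data.audio_resample` API, and the final error is simplified to a bare `RuntimeError` where the trainer would raise `AudioResampleUnavailable`:

```python
from fractions import Fraction

import numpy as np


def resample_to_pinned(
    waveform: np.ndarray, native_rate: int, pinned_rate: int = 16_000
) -> np.ndarray:
    """Sketch of the resample fallback order: soxr if importable, else SciPy."""
    if native_rate == pinned_rate:
        return waveform  # already at the pinned rate, nothing to do
    try:
        import soxr  # preferred: fast, high-quality resampler

        return soxr.resample(waveform, native_rate, pinned_rate)
    except ImportError:
        pass
    try:
        from scipy.signal import resample_poly  # SciPy fallback

        ratio = Fraction(pinned_rate, native_rate)  # e.g. 16000/44100 -> 160/441
        return resample_poly(waveform, ratio.numerator, ratio.denominator)
    except ImportError as exc:
        # The real trainer raises AudioResampleUnavailable at this point.
        raise RuntimeError("auto_resample needs either soxr or scipy installed") from exc
```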
**Wall-clock expectations.** A 5-clip corpus (30 s per clip) trained for 3 epochs on an RTX 4090 with `micro_batch_size=1` and `grad_accum=4` takes about 20 minutes. Apple Silicon is roughly 4× slower.

## Step 4 — Prompt the trained adapter

```bash
dlm prompt my-audio.dlm --audio clips/new-clip.wav \
  "What did the speaker say?"
```

`--audio` is required for audio bases. Repeat the flag for multi-clip prompts; each occurrence expands to one `<|AUDIO|>` placeholder that the processor replaces with 750 audio tokens (30 s × 25 tokens/s). `--image` and `--audio` cannot be combined — each targets a different modality.

## Step 5 — Export

Audio bases take the HF-snapshot path (audio architectures aren't on `llama.cpp`'s roadmap, so GGUF isn't available):

```bash
dlm export my-audio.dlm
```

Writes to `~/.dlm/store//exports/hf-audio-snapshot/`:

```
hf-audio-snapshot/
├── adapter/                  # PEFT LoRA weights
├── processor/                # AutoProcessor config + feature extractor
├── snapshot_manifest.json    # export_target=hf_snapshot + sha256s
└── README.md                 # how to load downstream
```

Load on the other side:

```python
from transformers import Qwen2AudioForConditionalGeneration, AutoProcessor
from peft import PeftModel

base = Qwen2AudioForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-Audio-7B-Instruct",
)
model = PeftModel.from_pretrained(base, "./adapter")
processor = AutoProcessor.from_pretrained("./processor")
```

The base isn't bundled — recipients download it on first use.

## Troubleshooting

### "audio not found: clips/your-clip.wav"

The `--audio` scaffold points at a placeholder; drop a real clip at that path or edit the `::audio path="..."::` fence.

### "native sample_rate=44100 Hz does not match pinned 16000 Hz"

Your clip is at 44.1 kHz (CD rate) but Qwen2-Audio expects 16 kHz. Either re-encode manually:

```bash
ffmpeg -i in.wav -ar 16000 out.wav
```

Or opt into on-the-fly resampling by setting `training.audio.auto_resample: true` in the frontmatter. The trainer's error message now names this knob directly. (A quick way to audit a whole corpus for off-rate clips is sketched at the end of this section.)

### "AudioResampleUnavailable: requires either soxr or scipy"

You set `auto_resample: true` but neither resampler is importable. Install one: `pip install soxr` (recommended, ships with the `dlm[audio]` extra) or `pip install scipy` as a fallback.

### "audio-language base requires at least one --audio PATH"

You ran `dlm prompt` on an audio `.dlm` without attaching a clip. Audio bases always expect a waveform — even a throwaway question about transcript content needs an audio input to anchor the placeholder token.

### "AUDIO section has empty transcript"

Both the inline `transcript="..."` form and the sibling `.txt` form must produce a non-empty transcript. Whitespace-only transcripts are refused (the trainer has no target text to predict).

### Disk / memory issues

Qwen2-Audio-7B is ~15 GB on disk and another ~15 GB in memory at fp16. Close other GPU consumers, use `--max-steps 1` to dry-run, or wait for the audio-QLoRA path (deferred).
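### Auditing clip sample rates up front

As noted under the sample-rate entry above, auditing every clip's native rate before training can save repeated failed runs. A minimal sketch using `soundfile` (already pulled in by the `audio` extra); the `corpus/` directory and the 16 kHz target are just the values used earlier in this recipe:

```python
from pathlib import Path

import soundfile as sf

PINNED_RATE = 16_000  # Qwen2-Audio's pinned sample rate

# Walk the corpus and report every clip whose native rate disagrees.
for clip in sorted(Path("corpus").rglob("*")):
    if clip.suffix.lower() not in {".wav", ".flac", ".ogg"}:
        continue
    rate = sf.info(str(clip)).samplerate
    if rate != PINNED_RATE:
        print(f"{clip}: {rate} Hz (expected {PINNED_RATE} Hz)")
```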
## What's not yet in Sprint 35.2

- **Resampling.** ~~v1 refuses sample-rate mismatches.~~ Opt-in automatic resampling via `training.audio.auto_resample: true` landed as a deferred-item follow-up (soxr preferred, `scipy.signal.resample_poly` fallback). It defaults to off so the refuse-on-mismatch contract stays backward-compatible.
- **MP3 support.** `soundfile` needs libsndfile ≥ 1.1 for MP3; we lock to `.wav` / `.flac` / `.ogg` in v1 to avoid shipping a libsndfile hard-pin.
- **Audio feature caching in training.** `AudioCache` is wired for the standalone inference path and the slow integration test; the training hot path doesn't re-use the cache yet (each epoch re-extracts features). The meaningful speed-up lands alongside multi-epoch audio corpora, where re-extraction dominates.
- **QLoRA for audio.** 4-bit audio training needs extra safety testing for the audio encoder weights; deferred.
- **Multiple audio clips per section.** Each `::audio::` fence carries one clip; prompts can stack multiple `<|AUDIO|>` tokens by repeating `--audio` on the CLI.