# AMD ROCm support

DLM supports AMD GPUs via ROCm as a **Tier 2** backend: LoRA training
and inference work, but QLoRA is refused and CI coverage is thinner
than on the CUDA path.

## What works

- **Training**: LoRA on bf16-capable AMD GPUs.
- **Inference**: `dlm prompt` uses the standard PyTorch path; MLX and
  bitsandbytes are not involved.
- **Export**: llama.cpp produces GGUF quantized weights. For
  ROCm-accelerated quantization, rebuild llama.cpp with HIP (see
  below); otherwise the default CPU build is used.

## Supported GPUs

| Generation | Arch codes              | Example SKUs               | bf16 | FA2 (via `flash_attn`) |
|------------|-------------------------|----------------------------|------|------------------------|
| CDNA2      | `gfx90a`                | Instinct MI200/MI210/MI250 | yes  | yes                    |
| CDNA3      | `gfx942`                | Instinct MI300             | yes  | yes                    |
| RDNA3      | `gfx1100`/`1101`/`1102` | RX 7900 XTX/XT/7800/7700   | yes  | yes (experimental)     |
| RDNA4      | `gfx1200`/`1201`        | RX 9000-series             | yes  | varies                 |
| RDNA2      | `gfx1030`/`1031`        | RX 6000-series             | no   | no                     |
| CDNA1      | `gfx908`                | MI100                      | no   | no                     |
| Vega20     | `gfx906`                | Radeon VII / MI50          | no   | no                     |

The bf16-capable allowlist is enforced in `dlm.hardware.capabilities`
based on `torch.cuda.get_device_properties(0).gcnArchName`.
Unsupported arches fall back to fp16 (still functional, just slower
per token on weight-heavy layers).
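
A minimal sketch of that check, with an illustrative allowlist and
helper name (the authoritative logic lives in
`dlm.hardware.capabilities`):

```python
import torch

# Illustrative bf16 allowlist mirroring the table above; the
# authoritative list lives in dlm.hardware.capabilities.
_BF16_ARCHES = {"gfx90a", "gfx942", "gfx1100", "gfx1101", "gfx1102",
                "gfx1200", "gfx1201"}

def supports_bf16() -> bool:
    # ROCm builds report strings like "gfx90a:sramecc+:xnack-", so
    # match on the arch prefix rather than the full string.
    arch = torch.cuda.get_device_properties(0).gcnArchName
    return arch.split(":")[0] in _BF16_ARCHES
```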

## What doesn't work

**QLoRA is refused on ROCm.** The ROCm builds of `bitsandbytes` are
unstable upstream: 4-bit quantized matmuls silently return wrong
values on several arch/driver combinations. We refuse the combination
rather than risk corrupt gradients. Use `adapter: lora` in your
`.dlm` frontmatter.
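
A sketch of the shape of that guard (the function name is
hypothetical; the real check sits inside DLM's adapter validation):

```python
import torch

def reject_qlora_on_rocm(adapter: str) -> None:
    # Hypothetical guard illustrating the policy: torch.version.hip is
    # non-None on HIP builds of PyTorch, and 4-bit bitsandbytes paths
    # are refused there outright.
    if adapter == "qlora" and torch.version.hip is not None:
        raise RuntimeError(
            "QLoRA is not supported on ROCm: bitsandbytes 4-bit matmuls "
            "are unreliable on this platform. Use adapter: lora instead."
        )
```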

**Multi-GPU ROCm** is out of scope for this sprint. Sprint 23's
multi-GPU work targets CUDA first; ROCm multi-GPU is a follow-on.

## Software prerequisites

- **ROCm** ≥ 5.7; 6.0+ preferred. We test against 6.0 and 6.2.
- **PyTorch** with a HIP build; install the ROCm wheels from
  pytorch.org. The `torch.version.hip` attribute must be non-`None`.
- **FlashAttention 2 (optional)**: AMD's ROCm fork keeps the
  `flash_attn` package name. Install it for CDNA (MI200/MI300); RDNA3
  support is experimental. If `flash_attn` is not importable or the
  arch is not on the allowlist, SDPA is used instead (see the sketch
  after this list).
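
A sketch of that fallback, with an illustrative allowlist (CDNA only
here; RDNA3 would sit behind an experimental gate):

```python
def pick_attention_impl(arch: str) -> str:
    # Illustrative selection: use FlashAttention 2 only when the package
    # imports cleanly and the arch is allowlisted; otherwise fall back to
    # PyTorch's scaled_dot_product_attention (SDPA).
    fa2_allowlist = {"gfx90a", "gfx942"}  # CDNA2 / CDNA3
    try:
        import flash_attn  # noqa: F401  (AMD's ROCm fork keeps this name)
    except ImportError:
        return "sdpa"
    return "flash_attn" if arch.split(":")[0] in fa2_allowlist else "sdpa"
```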

## Determinism posture

The doctor reports `determinism_class: best-effort` on ROCm. ROCm's
deterministic kernels exist but are less thorough than CUDA's;
floating-point results may drift across PyTorch/ROCm upgrades even
with a pinned seed.
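
In practice, best-effort means pinning what PyTorch lets you pin and
tolerating the rest; a minimal sketch:

```python
import torch

# Pin the seed and request deterministic kernels, but only warn
# (rather than raise) when an op has no deterministic HIP variant.
torch.manual_seed(1234)
torch.use_deterministic_algorithms(True, warn_only=True)
```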

## Rebuilding llama.cpp with ROCm

The default vendored llama.cpp binary is CPU-only. Build a ROCm
version once for faster quantization:

```bash
# Set your GPU arch
export AMDGPU_TARGETS="gfx1100"   # RDNA3
# export AMDGPU_TARGETS="gfx90a"  # MI200
# export AMDGPU_TARGETS="gfx942"  # MI300

scripts/build-llama-cpp-rocm.sh
```

The script writes to `vendor/llama.cpp/build-rocm/`. To make
`dlm export` prefer this build, point the runner at it:

```bash
export DLM_LLAMA_CPP_BUILD=vendor/llama.cpp/build-rocm
```

(The environment-variable plumbing in `dlm.export.vendoring` lands as
part of the next ROCm polish pass; until then, invoke the ROCm
binaries manually if you need them.)
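
When that plumbing lands, the resolution will likely look something
like this sketch (the default CPU build path shown here is a
placeholder, not the actual vendored location):

```python
import os
from pathlib import Path

def llama_cpp_build_dir() -> Path:
    # Prefer an explicitly-pointed build (e.g. build-rocm); otherwise
    # fall back to the vendored default. "vendor/llama.cpp/build" is a
    # hypothetical stand-in for wherever the CPU build lives.
    return Path(os.environ.get("DLM_LLAMA_CPP_BUILD",
                               "vendor/llama.cpp/build"))
```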

## CI / testing

No default ROCm CI runner exists. Contributors with ROCm hardware can
run the gated smoke test:

```bash
DLM_ENABLE_ROCM_SMOKE=1 uv run pytest tests/integration/hardware/test_rocm_train_smoke.py -v
```

The intended long-term setup is a scheduled self-hosted runner;
contact the maintainers if you'd like to host one.