# AMD ROCm support

DLM supports AMD GPUs via ROCm as a **Tier 2** backend: LoRA training
and inference work, but QLoRA is refused and CI coverage is thinner
than on the CUDA path.

## What works

- **Training**: LoRA on bf16-capable AMD GPUs.
- **Inference**: `dlm prompt` uses the standard PyTorch path; MLX and
  bitsandbytes are not involved.
- **Export**: llama.cpp produces GGUF quantized weights. For
  ROCm-accelerated quantization, rebuild llama.cpp with HIP (see
  below); otherwise the default CPU build is used.

## Supported GPUs

| Generation | Arch codes              | Example SKUs               | bf16 | FA2 (via `flash_attn`) |
|------------|-------------------------|----------------------------|------|------------------------|
| CDNA2      | `gfx90a`                | Instinct MI200/MI210/MI250 | yes  | yes                    |
| CDNA3      | `gfx942`                | Instinct MI300             | yes  | yes                    |
| RDNA3      | `gfx1100`/`1101`/`1102` | RX 7900 XTX/XT/7800/7700   | yes  | yes (experimental)     |
| RDNA4      | `gfx1200`/`1201`        | RX 9000-series             | yes  | varies                 |
| RDNA2      | `gfx1030`/`1031`        | RX 6000-series             | no   | no                     |
| CDNA1      | `gfx908`                | MI100                      | no   | no                     |
| Vega20     | `gfx906`                | Radeon VII / MI50          | no   | no                     |

The bf16-capable allowlist is enforced in `dlm.hardware.capabilities`
based on `torch.cuda.get_device_properties(0).gcnArchName`.
Unsupported arches fall back to fp16 (still functional, just slower
per token on weight-heavy layers).
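
As an illustration of how such a gate can work (the function and
allowlist names below are hypothetical, not the actual contents of
`dlm.hardware.capabilities`):

```python
import torch

# Hypothetical allowlist mirroring the table above; not the real
# dlm.hardware.capabilities data.
BF16_ARCHES = {"gfx90a", "gfx942", "gfx1100", "gfx1101", "gfx1102",
               "gfx1200", "gfx1201"}

def preferred_dtype() -> torch.dtype:
    # gcnArchName can carry feature suffixes on some SKUs
    # (e.g. "gfx90a:sramecc+:xnack-"), so compare the base arch only.
    arch = torch.cuda.get_device_properties(0).gcnArchName
    base = arch.split(":")[0]
    return torch.bfloat16 if base in BF16_ARCHES else torch.float16
```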

## What doesn't work

**QLoRA is refused on ROCm.** `bitsandbytes` ROCm builds are
unstable upstream: 4-bit quantized matmuls silently return wrong
values on several arch/driver combinations. We refuse the
combination rather than risk corrupt gradients. Use `adapter: lora`
in your `.dlm` frontmatter.
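
The refusal itself amounts to a simple guard. A minimal sketch of the
idea (the function name is illustrative, not DLM's actual API):

```python
import torch

def check_adapter(adapter: str) -> None:
    # torch.version.hip is None on CUDA builds and a version string
    # on ROCm/HIP builds, so it doubles as a backend probe.
    if adapter == "qlora" and torch.version.hip is not None:
        raise RuntimeError(
            "QLoRA is refused on ROCm: bitsandbytes 4-bit matmuls are "
            "unreliable here. Set `adapter: lora` in your .dlm frontmatter."
        )
```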

**Multi-GPU ROCm** is out of scope for this sprint. Sprint 23's
multi-GPU work targets CUDA first; ROCm multi-GPU is a follow-on.

## Software prerequisites

- **ROCm** ≥ 5.7; 6.0+ preferred. We test against 6.0 and 6.2.
- **PyTorch** with a HIP build, installed via the ROCm wheels from
  pytorch.org. The `torch.version.hip` attribute must be non-`None`
  (a quick check is shown after this list).
- **FlashAttention 2 (optional)**: on ROCm this comes from AMD's
  fork, which installs under the same `flash_attn` package name.
  Install it for CDNA (MI200/MI300); RDNA3 support is experimental.
  If `flash_attn` is not importable or the arch is not on the
  allowlist, SDPA is used instead.
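
A quick way to verify the PyTorch and FlashAttention prerequisites
from a Python shell (a standalone sketch, independent of DLM):

```python
import importlib.util
import torch

# Must be a HIP version string, not None, on a ROCm build.
assert torch.version.hip is not None, "not a ROCm (HIP) build of PyTorch"
print("HIP runtime:", torch.version.hip)
print("GPU arch:", torch.cuda.get_device_properties(0).gcnArchName)

# flash_attn is optional; when it is missing, SDPA is used instead.
if importlib.util.find_spec("flash_attn") is None:
    print("flash_attn not importable; attention falls back to SDPA")
```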

## Determinism posture

The doctor reports `determinism_class: best-effort` on ROCm. ROCm's
deterministic kernels exist but are less complete than CUDA's;
floating-point results may drift across PyTorch/ROCm upgrades even
with a pinned seed.
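
If you want as much reproducibility as ROCm allows, the standard
PyTorch knobs apply; they narrow the drift but, per the above, do not
guarantee bitwise stability:

```python
import torch

torch.manual_seed(1234)
# warn_only=True logs ops that lack a deterministic kernel instead of
# raising, which matches the best-effort posture.
torch.use_deterministic_algorithms(True, warn_only=True)
```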

## Rebuilding llama.cpp with ROCm

The default vendored llama.cpp binary is CPU-only. Build a ROCm
version once for faster quantization:

```bash
# Set your GPU arch
export AMDGPU_TARGETS="gfx1100"   # RDNA3
# export AMDGPU_TARGETS="gfx90a"  # MI200
# export AMDGPU_TARGETS="gfx942"  # MI300

scripts/build-llama-cpp-rocm.sh
```

The script writes to `vendor/llama.cpp/build-rocm/`. To make
`dlm export` prefer this build, point the runner at it:

```bash
export DLM_LLAMA_CPP_BUILD=vendor/llama.cpp/build-rocm
```

(Environment-variable plumbing in `dlm.export.vendoring` lands as
part of the next ROCm polish pass; for now, manually invoke the
ROCm binaries if you need them.)

## CI / testing

There is no ROCm runner in default CI. Contributors with ROCm
hardware can run the gated smoke test:

```bash
DLM_ENABLE_ROCM_SMOKE=1 uv run pytest tests/integration/hardware/test_rocm_train_smoke.py -v
```
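
The gate follows the usual pytest pattern of skipping unless the
opt-in variable is set. A sketch of that pattern (not the verbatim
contents of the test file):

```python
import os

import pytest

# Skip the whole module unless the opt-in flag is set.
pytestmark = pytest.mark.skipif(
    os.environ.get("DLM_ENABLE_ROCM_SMOKE") != "1",
    reason="set DLM_ENABLE_ROCM_SMOKE=1 to run ROCm smoke tests",
)
```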

The intended long-term setup is a scheduled self-hosted runner;
contact the maintainers if you'd like to host one.