# AMD ROCm support

DLM supports AMD GPUs via ROCm as a **Tier 2** backend: LoRA training
and inference work, but QLoRA is refused and CI coverage is thinner
than on the CUDA path.

## What works

- **Training**: LoRA on bf16-capable AMD GPUs.
- **Inference**: `dlm prompt` uses the standard PyTorch path; MLX and
  bitsandbytes are not involved.
- **Export**: llama.cpp produces GGUF quantized weights. For
  ROCm-accelerated quantization, rebuild llama.cpp with HIP (see
  below); otherwise the default CPU build is used.

## Supported GPUs

| Generation | Arch codes              | Example SKUs               | bf16 | FA2 (via `flash_attn`) |
|------------|-------------------------|----------------------------|------|------------------------|
| CDNA2      | `gfx90a`                | Instinct MI200/MI210/MI250 | yes  | yes                    |
| CDNA3      | `gfx942`                | Instinct MI300             | yes  | yes                    |
| RDNA3      | `gfx1100`/`1101`/`1102` | RX 7900 XTX/XT/7800/7700   | yes  | yes (experimental)     |
| RDNA4      | `gfx1200`/`1201`        | RX 9000-series             | yes  | varies                 |
| RDNA2      | `gfx1030`/`1031`        | RX 6000-series             | no   | no                     |
| CDNA1      | `gfx908`                | MI100                      | no   | no                     |
| Vega20     | `gfx906`                | Radeon VII / MI50          | no   | no                     |

The bf16-capable allowlist is enforced in `dlm.hardware.capabilities`
based on `torch.cuda.get_device_properties(0).gcnArchName`.
Unsupported arches fall back to fp16 (still functional, just slower
per token on weight-heavy layers).
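
A minimal sketch of that check, with an illustrative allowlist and
helper name (the authoritative logic lives in
`dlm.hardware.capabilities`):

```python
import torch

# Illustrative bf16 allowlist mirroring the table above; the
# authoritative list lives in dlm.hardware.capabilities.
_BF16_ARCHES = {"gfx90a", "gfx942", "gfx1100", "gfx1101", "gfx1102",
                "gfx1200", "gfx1201"}

def supports_bf16() -> bool:
    # ROCm builds report strings like "gfx90a:sramecc+:xnack-", so
    # match on the arch prefix rather than the full string.
    arch = torch.cuda.get_device_properties(0).gcnArchName
    return arch.split(":")[0] in _BF16_ARCHES
```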

## What doesn't work

**QLoRA is refused on ROCm.** The ROCm builds of `bitsandbytes` are
unstable upstream: 4-bit quantized matmuls silently return wrong
values on several arch/driver combinations. We refuse the combination
rather than risk corrupt gradients. Use `adapter: lora` in your
`.dlm` frontmatter.
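
A sketch of the shape of that guard (the function name is
hypothetical; the real check sits inside DLM's adapter validation):

```python
import torch

def reject_qlora_on_rocm(adapter: str) -> None:
    # Hypothetical guard illustrating the policy: torch.version.hip is
    # non-None on HIP builds of PyTorch, and 4-bit bitsandbytes paths
    # are refused there outright.
    if adapter == "qlora" and torch.version.hip is not None:
        raise RuntimeError(
            "QLoRA is not supported on ROCm: bitsandbytes 4-bit matmuls "
            "are unreliable on this platform. Use adapter: lora instead."
        )
```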

**Multi-GPU ROCm** is out of scope for this sprint. Sprint 23's
multi-GPU work targets CUDA first; ROCm multi-GPU is a follow-on.

## Software prerequisites

- **ROCm** ≥ 5.7; 6.0+ preferred. We test against 6.0 and 6.2.
- **PyTorch** with a HIP build; install the ROCm wheels from
  pytorch.org. The `torch.version.hip` attribute must be non-`None`.
- **FlashAttention 2 (optional)**: AMD's ROCm fork keeps the
  `flash_attn` package name. Install it for CDNA (MI200/MI300); RDNA3
  support is experimental. If `flash_attn` is not importable or the
  arch is not on the allowlist, SDPA is used instead (see the sketch
  after this list).
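
A sketch of that fallback, with an illustrative allowlist (CDNA only
here; RDNA3 would sit behind an experimental gate):

```python
def pick_attention_impl(arch: str) -> str:
    # Illustrative selection: use FlashAttention 2 only when the package
    # imports cleanly and the arch is allowlisted; otherwise fall back to
    # PyTorch's scaled_dot_product_attention (SDPA).
    fa2_allowlist = {"gfx90a", "gfx942"}  # CDNA2 / CDNA3
    try:
        import flash_attn  # noqa: F401  (AMD's ROCm fork keeps this name)
    except ImportError:
        return "sdpa"
    return "flash_attn" if arch.split(":")[0] in fa2_allowlist else "sdpa"
```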

## Determinism posture

The doctor reports `determinism_class: best-effort` on ROCm. ROCm's
deterministic kernels exist but are less thorough than CUDA's;
floating-point results may drift across PyTorch/ROCm upgrades even
with a pinned seed.
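
In practice, best-effort means pinning what PyTorch lets you pin and
tolerating the rest; a minimal sketch:

```python
import torch

# Pin the seed and request deterministic kernels, but only warn
# (rather than raise) when an op has no deterministic HIP variant.
torch.manual_seed(1234)
torch.use_deterministic_algorithms(True, warn_only=True)
```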

## Rebuilding llama.cpp with ROCm

The default vendored llama.cpp binary is CPU-only. Build a ROCm
version once for faster quantization:

```bash
# Set your GPU arch
export AMDGPU_TARGETS="gfx1100"   # RDNA3
# export AMDGPU_TARGETS="gfx90a"  # MI200
# export AMDGPU_TARGETS="gfx942"  # MI300

scripts/build-llama-cpp-rocm.sh
```

The script writes to `vendor/llama.cpp/build-rocm/`. To make
`dlm export` prefer this build, point the runner at it:

```bash
export DLM_LLAMA_CPP_BUILD=vendor/llama.cpp/build-rocm
```

(The environment-variable plumbing in `dlm.export.vendoring` lands as
part of the next ROCm polish pass; until then, invoke the ROCm
binaries manually if you need them.)
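
When that plumbing lands, the resolution will likely look something
like this sketch (the default CPU build path shown here is a
placeholder, not the actual vendored location):

```python
import os
from pathlib import Path

def llama_cpp_build_dir() -> Path:
    # Prefer an explicitly-pointed build (e.g. build-rocm); otherwise
    # fall back to the vendored default. "vendor/llama.cpp/build" is a
    # hypothetical stand-in for wherever the CPU build lives.
    return Path(os.environ.get("DLM_LLAMA_CPP_BUILD",
                               "vendor/llama.cpp/build"))
```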

## CI / testing

No default ROCm CI runner exists. Contributors with ROCm hardware can
run the gated smoke test:

```bash
DLM_ENABLE_ROCM_SMOKE=1 uv run pytest tests/integration/hardware/test_rocm_train_smoke.py -v
```

The intended long-term setup is a scheduled self-hosted runner;
contact the maintainers if you'd like to host one.