# AMD ROCm support
DLM supports AMD GPUs via ROCm as a **Tier 2** backend: LoRA training and inference work, but QLoRA is refused and the CI coverage is weaker than the CUDA path.
## What works
- **Training**: LoRA on bf16-capable AMD GPUs.
- **Inference**: `dlm prompt` uses the standard PyTorch path; MLX and bitsandbytes are not involved.
- **Export**: llama.cpp produces GGUF quantized weights. For ROCm-accelerated quantization, rebuild llama.cpp with HIP (see below); otherwise the default CPU build is used.
## Supported GPUs
| Generation | Arch codes | Example SKUs | bf16 | FA2 (via `flash_attn`) |
|---|---|---|---|---|
| CDNA2 | `gfx90a` | Instinct MI200/MI210/MI250 | yes | yes |
| CDNA3 | `gfx942` | Instinct MI300 | yes | yes |
| RDNA3 | `gfx1100`/`1101`/`1102` | RX 7900 XTX/XT/7800/7700 | yes | yes (experimental) |
| RDNA4 | `gfx1200`/`1201` | RX 9000-series | yes | varies |
| RDNA2 | `gfx1030`/`1031` | RX 6000-series | no | no |
| CDNA1 | `gfx908` | MI100 | no | no |
| Vega20 | `gfx906` | Radeon VII / MI50 | no | no |
The bf16-capable allowlist is enforced in `dlm.hardware.capabilities`
based on `torch.cuda.get_device_properties(0).gcnArchName`.
Unsupported arches fall back to fp16 (still functional, just slower
per token on weight-heavy layers).
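In practice the check is conceptually close to the sketch below. The arch set is transcribed from the table above and the helper is illustrative; the canonical allowlist lives in `dlm.hardware.capabilities`.

```python
import torch

# bf16-capable arches from the table above (illustrative copy, not the
# canonical allowlist in dlm.hardware.capabilities).
BF16_ARCHES = {"gfx90a", "gfx942", "gfx1100", "gfx1101", "gfx1102",
               "gfx1200", "gfx1201"}

def preferred_dtype() -> torch.dtype:
    """Pick bf16 on allowlisted ROCm arches, otherwise fall back to fp16."""
    props = torch.cuda.get_device_properties(0)
    # On ROCm, gcnArchName may carry feature suffixes such as
    # "gfx90a:sramecc+:xnack-"; keep only the base arch code.
    arch = getattr(props, "gcnArchName", "").split(":")[0]
    return torch.bfloat16 if arch in BF16_ARCHES else torch.float16
```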
## What doesn't work
**QLoRA is refused on ROCm.** `bitsandbytes` ROCm builds are
upstream-unstable: 4-bit quantized matmuls silently return wrong
values on several arch/driver combinations. We refuse the
combination rather than risk corrupt gradients. Use `adapter: lora`
in your `.dlm` frontmatter.
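The refusal is essentially a guard like the following. This is an illustrative sketch, not the actual DLM code; the function name and error text are made up.

```python
import torch

def check_adapter_supported(adapter: str) -> None:
    """Hypothetical guard: refuse QLoRA when running on a ROCm build of PyTorch."""
    on_rocm = torch.version.hip is not None
    if on_rocm and adapter == "qlora":
        raise RuntimeError(
            "QLoRA is not supported on ROCm: bitsandbytes 4-bit matmuls are "
            "unreliable on this backend. Use `adapter: lora` instead."
        )
```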
**Multi-GPU ROCm** is out of scope for this sprint. Sprint 23's multi-GPU work targets CUDA first; ROCm multi-GPU is a follow-on.
## Software prerequisites
- **ROCm** ≥ 5.7; 6.0+ preferred. We test against 6.0 and 6.2.
- **PyTorch** with a HIP build, installed via the ROCm wheels from
  pytorch.org. The `torch.version.hip` attribute must be non-None
  (a quick self-check is sketched after this list).
- **FlashAttention 2 (optional)**: AMD's ROCm fork installs under the
  same `flash_attn` package name. Install it for CDNA (MI200/MI300);
  RDNA3 support is experimental. If `flash_attn` is not importable or
  the arch is not on the allowlist, SDPA is used instead.
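A quick self-check for the PyTorch and FlashAttention items, runnable from any Python shell (generic PyTorch, not a DLM command):

```python
import torch

# A HIP build reports a version string here; CUDA/CPU builds report None.
print("hip:", torch.version.hip)
print("device:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "none")

# The ROCm fork of FlashAttention 2 installs under the same package name,
# so a plain import is enough to test for it.
try:
    import flash_attn
    print("flash_attn:", flash_attn.__version__)
except ImportError:
    print("flash_attn not installed; SDPA will be used")
```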
## Determinism posture
The doctor reports `determinism_class: best-effort` on ROCm. ROCm's
deterministic kernels exist but are not as thorough as CUDA's, so
bit-exact floating-point reproducibility may drift across
PyTorch/ROCm upgrades even with a pinned seed.
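For context, the kind of determinism setup this classification refers to is sketched below; even with all of these controls set, bit-exact results on ROCm can still change across upgrades. This is generic PyTorch seeding code, not DLM internals.

```python
import random

import numpy as np
import torch

def seed_everything(seed: int = 0) -> None:
    """Pin the usual RNGs; on ROCm this remains best-effort, not bit-exact."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # warn_only=True: ops without a deterministic kernel emit a warning
    # instead of raising, which matches a best-effort posture.
    torch.use_deterministic_algorithms(True, warn_only=True)
```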
## Rebuilding llama.cpp with ROCm
The default vendored llama.cpp binary is CPU-only. Build a ROCm version once for faster quantization:
```bash
# Set your GPU arch
export AMDGPU_TARGETS="gfx1100"   # RDNA3
# export AMDGPU_TARGETS="gfx90a"  # MI200
# export AMDGPU_TARGETS="gfx942"  # MI300

scripts/build-llama-cpp-rocm.sh
```
The script writes to `vendor/llama.cpp/build-rocm/`. To make
`dlm export` prefer this build, point the runner at it:

```bash
export DLM_LLAMA_CPP_BUILD=vendor/llama.cpp/build-rocm
```
`DLM_LLAMA_CPP_BUILD` is honored by `dlm.export.vendoring`: when set,
that directory is checked before the default vendor dir for each
binary, so the ROCm-accelerated `llama-quantize` / `llama-imatrix`
win over any CPU-only build left behind by `scripts/bump-llama-cpp.sh
build`.
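The resolution order amounts to something like the sketch below. The function name and the default vendor path are assumptions for illustration; the real logic lives in `dlm.export.vendoring`.

```python
import os
from pathlib import Path

# Assumed default vendor location; used only for illustration.
DEFAULT_VENDOR_DIR = Path("vendor/llama.cpp/build")

def find_binary(name: str) -> Path:
    """Resolve a llama.cpp binary, preferring DLM_LLAMA_CPP_BUILD when set."""
    override = os.environ.get("DLM_LLAMA_CPP_BUILD")
    search_dirs = ([Path(override)] if override else []) + [DEFAULT_VENDOR_DIR]
    for d in search_dirs:
        candidate = d / name
        if candidate.is_file() and os.access(candidate, os.X_OK):
            return candidate
    raise FileNotFoundError(f"{name} not found in {search_dirs}")

# With DLM_LLAMA_CPP_BUILD=vendor/llama.cpp/build-rocm exported,
# find_binary("llama-quantize") picks the ROCm build first.
```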
## CI / testing
No default ROCm CI runner exists. Contributors with ROCm hardware can
run the gated smoke test:

```bash
DLM_ENABLE_ROCM_SMOKE=1 uv run pytest tests/integration/hardware/test_rocm_train_smoke.py -v
```
A scheduled self-hosted runner is the intended deployment; contact the maintainers if you'd like to host one.