
# AMD ROCm support

DLM supports AMD GPUs via ROCm as a **Tier 2** backend: LoRA training and inference work, but QLoRA is refused and the CI coverage is weaker than the CUDA path.

## What works

- **Training**: LoRA on bf16-capable AMD GPUs.
- **Inference**: `dlm prompt` uses the standard PyTorch path; MLX and bitsandbytes are not involved.
- **Export**: llama.cpp produces GGUF quantized weights. For ROCm-accelerated quantization, rebuild llama.cpp with HIP (see below); otherwise the default CPU build is used.

## Supported GPUs

| Generation | Arch codes | Example SKUs | bf16 | FA2 (via `flash_attn`) |
|------------|------------|--------------|------|------------------------|
| CDNA2 | `gfx90a` | Instinct MI200/MI210/MI250 | yes | yes |
| CDNA3 | `gfx942` | Instinct MI300 | yes | yes |
| RDNA3 | `gfx1100`/`1101`/`1102` | RX 7900 XTX/XT/7800/7700 | yes | yes (experimental) |
| RDNA4 | `gfx1200`/`1201` | RX 9000-series | yes | varies |
| RDNA2 | `gfx1030`/`1031` | RX 6000-series | no | no |
| CDNA1 | `gfx908` | MI100 | no | no |
| Vega20 | `gfx906` | Radeon VII / MI50 | no | no |

The bf16-capable allowlist is enforced in `dlm.hardware.capabilities` based on `torch.cuda.get_device_properties(0).gcnArchName`. Unsupported arches fall back to fp16 (still functional, just slower per token on weight-heavy layers).
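
For reference, a minimal sketch of what such an allowlist check can look like. The helper name is ours and the arch set simply mirrors the table above; the real logic lives in `dlm.hardware.capabilities`:

```python
# Sketch only: helper name is illustrative, arch set mirrors the table above.
import torch

BF16_ARCHES = {"gfx90a", "gfx942", "gfx1100", "gfx1101", "gfx1102",
               "gfx1200", "gfx1201"}

def rocm_bf16_supported() -> bool:
    if torch.version.hip is None or not torch.cuda.is_available():
        return False
    arch = torch.cuda.get_device_properties(0).gcnArchName
    # gcnArchName can carry feature flags, e.g. "gfx90a:sramecc+:xnack-"
    return arch.split(":")[0] in BF16_ARCHES
```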

## What doesn't work

**QLoRA is refused on ROCm.** `bitsandbytes` ROCm builds are upstream-unstable — 4-bit quantized matmuls silently return wrong values on several arch/driver combinations. We refuse the combination rather than risk corrupt gradients. Use `adapter: lora` in your `.dlm` frontmatter.
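
As a rough illustration of what "refused" means in practice, a guard along these lines runs before training starts; the exception type and message here are hypothetical, not DLM's actual wording:

```python
# Illustrative only: the exception type and message are hypothetical.
import torch

def check_adapter(adapter: str) -> None:
    if adapter == "qlora" and torch.version.hip is not None:
        raise RuntimeError(
            "QLoRA needs bitsandbytes 4-bit kernels, which are unreliable on "
            "ROCm; set `adapter: lora` in the .dlm frontmatter instead."
        )
```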

**Multi-GPU ROCm** is out of scope for this sprint. Sprint 23's multi-GPU work targets CUDA first; ROCm multi-GPU is a follow-on.

## Software prerequisites

- **ROCm** ≥ 5.7; 6.0+ preferred. We test against 6.0 and 6.2.
- **PyTorch** with a HIP build — install via the ROCm wheels from pytorch.org. The `torch.version.hip` attribute must be non-None (a quick sanity check follows this list).
- **FlashAttention 2 (optional)**: on ROCm, AMD's fork keeps the `flash_attn` package name. Install it for CDNA (MI200/MI300); RDNA3 support is experimental. If `flash_attn` is not importable or the arch is not on the allowlist, SDPA is used instead.
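
A quick sanity check for these prerequisites (plain PyTorch, not a DLM command); run it in the environment you plan to train in:

```python
# Verify the HIP build, the GPU arch, and flash_attn availability.
import torch

assert torch.version.hip is not None, "this PyTorch is not a ROCm/HIP build"
print("HIP runtime:", torch.version.hip)
print("GPU arch:", torch.cuda.get_device_properties(0).gcnArchName)

try:
    import flash_attn
    print("flash_attn available:", flash_attn.__version__)
except ImportError:
    print("flash_attn not installed; attention falls back to SDPA")
```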

## Determinism posture

The doctor reports `determinism_class: best-effort` on ROCm. ROCm's deterministic kernels exist but are not as thorough as CUDA's; exact floating-point matches may drift across PyTorch/ROCm upgrades even with a pinned seed.
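
For context, the standard PyTorch knobs behind a best-effort posture look roughly like this; this is a sketch of the general mechanism, not DLM configuration:

```python
# warn_only=True because some ROCm kernels have no deterministic
# implementation and would otherwise raise at runtime.
import torch

torch.manual_seed(1234)
torch.use_deterministic_algorithms(True, warn_only=True)
```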

## Rebuilding llama.cpp with ROCm

The default vendored llama.cpp binary is CPU-only. Build a ROCm version once for faster quantization:

```bash
# Set your GPU arch
export AMDGPU_TARGETS="gfx1100"   # RDNA3
# export AMDGPU_TARGETS="gfx90a"  # MI200
# export AMDGPU_TARGETS="gfx942"  # MI300

scripts/build-llama-cpp-rocm.sh
```

The script writes to `vendor/llama.cpp/build-rocm/`. To make `dlm export` prefer this build, point the runner at it:

```bash
export DLM_LLAMA_CPP_BUILD=vendor/llama.cpp/build-rocm
```

`DLM_LLAMA_CPP_BUILD` is honored by `dlm.export.vendoring` — when set, it's checked before the default vendor dir for each binary, so the ROCm-accelerated `llama-quantize` / `llama-imatrix` win over any CPU-only build left behind from `scripts/bump-llama-cpp.sh build`.
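
In pseudocode, the lookup order reads roughly like this; the function name, the default vendor path, and the `bin/` layout are assumptions, not the actual `dlm.export.vendoring` internals:

```python
# Sketch of the override-then-default binary lookup described above.
import os
from pathlib import Path

DEFAULT_VENDOR_DIR = Path("vendor/llama.cpp/build")  # assumed default location

def resolve_binary(name: str) -> Path:
    candidates = []
    override = os.environ.get("DLM_LLAMA_CPP_BUILD")
    if override:
        candidates.append(Path(override) / "bin" / name)  # checked first
    candidates.append(DEFAULT_VENDOR_DIR / "bin" / name)  # then the default
    for path in candidates:
        if path.exists():
            return path
    raise FileNotFoundError(f"no vendored {name} found; run the build script")
```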

## CI / testing

No default ROCm CI runner exists. Contributors with ROCm hardware can run the gated smoke test:

```bash
DLM_ENABLE_ROCM_SMOKE=1 uv run pytest tests/integration/hardware/test_rocm_train_smoke.py -v
```
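
The gate is the usual pytest pattern; a minimal sketch of how such a skip marker typically looks (the test file itself is authoritative):

```python
# Skip the smoke test unless the opt-in environment variable is set.
import os
import pytest

pytestmark = pytest.mark.skipif(
    os.environ.get("DLM_ENABLE_ROCM_SMOKE") != "1",
    reason="set DLM_ENABLE_ROCM_SMOKE=1 on a ROCm machine to run the smoke test",
)
```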

A scheduled self-hosted runner is the intended deployment; contact the maintainers if you'd like to host one.
