tenseleyflow/documentlanguagemodel / 336bedd

docs(hardware): ROCm support tier + arch matrix + llama.cpp HIP rebuild

Authored by espadonne
SHA: 336beddcf6c09b853236375cd6ba52d49dff5cfb
Parents: 12ccdcc
Tree: 5b83d6e

2 changed files

| Status | File                  | +  | - |
|--------|-----------------------|----|---|
| A      | docs/hardware/rocm.md | 96 | 0 |
| M      | mkdocs.yml            | 2  | 0 |

docs/hardware/rocm.md (added)
@@ -0,0 +1,96 @@
+# AMD ROCm support
+
+DLM supports AMD GPUs via ROCm as a **Tier 2** backend: LoRA training
+and inference work, but QLoRA is refused and CI coverage is weaker
+than on the CUDA path.
+
+## What works
+
+- **Training**: LoRA on bf16-capable AMD GPUs.
+- **Inference**: `dlm prompt` uses the standard PyTorch path; MLX and
+  bitsandbytes are not involved.
+- **Export**: llama.cpp produces GGUF quantized weights. For
+  ROCm-accelerated quantization, rebuild llama.cpp with HIP (see
+  below); otherwise the default CPU build is used.
+
+## Supported GPUs
+
+| Generation | Arch codes              | Example SKUs                     | bf16 | FA2 (via `flash_attn`) |
+|------------|-------------------------|----------------------------------|------|------------------------|
+| CDNA2      | `gfx90a`                | Instinct MI200/MI210/MI250       | yes  | yes                    |
+| CDNA3      | `gfx942`                | Instinct MI300                   | yes  | yes                    |
+| RDNA3      | `gfx1100`/`1101`/`1102` | RX 7900 XTX/XT, 7800 XT, 7700 XT | yes  | yes (experimental)     |
+| RDNA4      | `gfx1200`/`1201`        | RX 9000-series                   | yes  | varies                 |
+| RDNA2      | `gfx1030`/`1031`        | RX 6000-series                   | no   | no                     |
+| CDNA1      | `gfx908`                | MI100                            | no   | no                     |
+| Vega20     | `gfx906`                | Radeon VII / MI50                | no   | no                     |
+
+The bf16-capable allowlist is enforced in `dlm.hardware.capabilities`
+based on `torch.cuda.get_device_properties(0).gcnArchName`.
+Unsupported arches fall back to fp16 (still functional, just slower
+per token on weight-heavy layers).
+
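+To see which arch string your GPU reports (the value the allowlist
+lookup is based on), the one-liners below can help. This is an
+illustrative sketch; it assumes ROCm's `rocminfo` is on PATH and a
+HIP build of PyTorch is installed:
+
+```bash
+# Arch as reported by the ROCm runtime (e.g. "gfx1100")
+rocminfo | grep -m1 "gfx"
+
+# Arch as reported by PyTorch; this is the value dlm.hardware.capabilities inspects
+python -c "import torch; print(torch.cuda.get_device_properties(0).gcnArchName)"
+```
+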
+## What doesn't work
+
+**QLoRA is refused on ROCm.** `bitsandbytes` ROCm builds are
+upstream-unstable: 4-bit quantized matmuls silently return wrong
+values on several arch/driver combinations. We refuse the
+combination rather than risk corrupt gradients. Use `adapter: lora`
+in your `.dlm` frontmatter.
+
+**Multi-GPU ROCm** is out of scope for this sprint. Sprint 23's
+multi-GPU work targets CUDA first; ROCm multi-GPU is a follow-on.
+
+## Software prerequisites
+
+- **ROCm** ≥ 5.7; 6.0+ preferred. We test against 6.0 and 6.2.
+- **PyTorch** with a HIP build, installed via the ROCm wheels from
+  pytorch.org. The `torch.version.hip` attribute must be non-None
+  (see the sanity check below).
+- **FlashAttention 2 (optional)**: on ROCm, AMD's fork keeps the
+  `flash_attn` package name. Install it for CDNA (MI200/MI300); RDNA3
+  support is experimental. If `flash_attn` is not importable or the
+  arch is not on the allowlist, SDPA is used instead.
+
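+A quick way to confirm the PyTorch and FlashAttention items above
+(plain Python one-liners, not dlm commands):
+
+```bash
+# Must print a HIP version string, not "None"
+python -c "import torch; print(torch.version.hip)"
+
+# Optional: FlashAttention 2; a failed import just means SDPA will be used
+python -c "import flash_attn" && echo "flash_attn available" || echo "flash_attn missing (SDPA fallback)"
+```
+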
+## Determinism posture
+
+The doctor reports `determinism_class: best-effort` on ROCm. ROCm's
+deterministic kernels exist but are not as thorough as CUDA's;
+bit-exact results may drift across PyTorch/ROCm upgrades even with a
+pinned seed.
+
+## Rebuilding llama.cpp with ROCm
+
+The default vendored llama.cpp binary is CPU-only. Build a ROCm
+version once for faster quantization:
+
+```bash
+# Set your GPU arch
+export AMDGPU_TARGETS="gfx1100"   # RDNA3
+# export AMDGPU_TARGETS="gfx90a"  # MI200
+# export AMDGPU_TARGETS="gfx942"  # MI300
+
+scripts/build-llama-cpp-rocm.sh
+```
+
+The script writes to `vendor/llama.cpp/build-rocm/`. To make
+`dlm export` prefer this build, point the runner at it:
+
+```bash
+export DLM_LLAMA_CPP_BUILD=vendor/llama.cpp/build-rocm
+```
+
+(Environment-variable plumbing in `dlm.export.vendoring` lands as
+part of the next ROCm polish pass; for now, manually invoke the
+ROCm binaries if you need them.)
+
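+Until that plumbing lands, a manual call might look like the sketch
+below. The `bin/` layout and the `llama-quantize` name follow upstream
+llama.cpp CMake builds and may differ locally; the GGUF filenames are
+placeholders:
+
+```bash
+# Quantize an exported fp16 GGUF with the HIP-enabled build
+vendor/llama.cpp/build-rocm/bin/llama-quantize \
+  model-f16.gguf model-q4_k_m.gguf Q4_K_M
+```
+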
+## CI / testing
+
+No default ROCm CI runner exists. Contributors with ROCm hardware can
+run the gated smoke test:
+
+```bash
+DLM_ENABLE_ROCM_SMOKE=1 uv run pytest tests/integration/hardware/test_rocm_train_smoke.py -v
+```
+
+A scheduled self-hosted runner is the intended deployment; contact
+the maintainers if you'd like to host one.

mkdocs.yml (modified)
@@ -70,4 +70,6 @@ nav:
       - Multi-adapter composition: cookbook/multi-adapter.md
   - Architecture: architecture.md
   - Determinism: determinism.md
+  - Hardware:
+      - AMD ROCm: hardware/rocm.md
   - Troubleshooting: troubleshooting.md