# AMD ROCm support
DLM supports AMD GPUs via ROCm as a **Tier 2** backend: LoRA training and inference work, but QLoRA is refused and the CI coverage is weaker than the CUDA path.
## What works
- **Training**: LoRA on bf16-capable AMD GPUs.
- **Inference**: `dlm prompt` uses the standard PyTorch path; MLX and bitsandbytes are not involved.
- **Export**: llama.cpp produces GGUF quantized weights. For ROCm-accelerated quantization, rebuild llama.cpp with HIP (see below); otherwise the default CPU build is used.
## Supported GPUs
| Generation | Arch codes | Example SKUs | bf16 | FA2 (via `flash_attn`) |
|---|---|---|---|---|
| CDNA2 | `gfx90a` | Instinct MI200/MI210/MI250 | yes | yes |
| CDNA3 | `gfx942` | Instinct MI300 | yes | yes |
| RDNA3 | `gfx1100`/`1101`/`1102` | RX 7900 XTX/XT/7800/7700 | yes | yes (experimental) |
| RDNA4 | `gfx1200`/`1201` | RX 9000-series | yes | varies |
| RDNA2 | `gfx1030`/`1031` | RX 6000-series | no | no |
| CDNA1 | `gfx908` | MI100 | no | no |
| Vega20 | `gfx906` | Radeon VII / MI50 | no | no |
The bf16-capable allowlist is enforced in `dlm.hardware.capabilities`
based on `torch.cuda.get_device_properties(0).gcnArchName`.
Unsupported arches fall back to fp16 (still functional, just slower
per token on weight-heavy layers).
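In practice the check is conceptually close to the sketch below. The arch set is transcribed from the table above and the helper is illustrative; the canonical allowlist lives in `dlm.hardware.capabilities`.

```python
import torch

# bf16-capable arches from the table above (illustrative copy, not the
# canonical allowlist in dlm.hardware.capabilities).
BF16_ARCHES = {"gfx90a", "gfx942", "gfx1100", "gfx1101", "gfx1102",
               "gfx1200", "gfx1201"}

def preferred_dtype() -> torch.dtype:
    """Pick bf16 on allowlisted ROCm arches, otherwise fall back to fp16."""
    props = torch.cuda.get_device_properties(0)
    # On ROCm, gcnArchName may carry feature suffixes such as
    # "gfx90a:sramecc+:xnack-"; keep only the base arch code.
    arch = getattr(props, "gcnArchName", "").split(":")[0]
    return torch.bfloat16 if arch in BF16_ARCHES else torch.float16
```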
## What doesn't work
**QLoRA is refused on ROCm.** `bitsandbytes` ROCm builds are
upstream-unstable: 4-bit quantized matmuls silently return wrong
values on several arch/driver combinations. We refuse the
combination rather than risk corrupt gradients. Use `adapter: lora`
in your `.dlm` frontmatter.
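The refusal is essentially a guard like the following. This is an illustrative sketch, not the actual DLM code; the function name and error text are made up.

```python
import torch

def check_adapter_supported(adapter: str) -> None:
    """Hypothetical guard: refuse QLoRA when running on a ROCm build of PyTorch."""
    on_rocm = torch.version.hip is not None
    if on_rocm and adapter == "qlora":
        raise RuntimeError(
            "QLoRA is not supported on ROCm: bitsandbytes 4-bit matmuls are "
            "unreliable on this backend. Use `adapter: lora` instead."
        )
```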
**Multi-GPU ROCm** is out of scope for this sprint. Sprint 23's multi-GPU work targets CUDA first; ROCm multi-GPU is a follow-on.
## Software prerequisites
- **ROCm** ≥ 5.7; 6.0+ preferred. We test against 6.0 and 6.2.
- **PyTorch** with a HIP build, installed via the ROCm wheels from
  pytorch.org. The `torch.version.hip` attribute must be non-None
  (a quick self-check is sketched after this list).
- **FlashAttention 2 (optional)**: AMD's ROCm fork installs under the
  same `flash_attn` package name. Install it for CDNA (MI200/MI300);
  RDNA3 support is experimental. If `flash_attn` is not importable or
  the arch is not on the allowlist, SDPA is used instead.
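A quick self-check for the PyTorch and FlashAttention items, runnable from any Python shell (generic PyTorch, not a DLM command):

```python
import torch

# A HIP build reports a version string here; CUDA/CPU builds report None.
print("hip:", torch.version.hip)
print("device:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "none")

# The ROCm fork of FlashAttention 2 installs under the same package name,
# so a plain import is enough to test for it.
try:
    import flash_attn
    print("flash_attn:", flash_attn.__version__)
except ImportError:
    print("flash_attn not installed; SDPA will be used")
```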
## Determinism posture
The doctor reports `determinism_class: best-effort` on ROCm. ROCm's
deterministic kernels exist but are not as thorough as CUDA's, so
bit-exact floating-point reproducibility may drift across
PyTorch/ROCm upgrades even with a pinned seed.
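For context, the kind of determinism setup this classification refers to is sketched below; even with all of these controls set, bit-exact results on ROCm can still change across upgrades. This is generic PyTorch seeding code, not DLM internals.

```python
import random

import numpy as np
import torch

def seed_everything(seed: int = 0) -> None:
    """Pin the usual RNGs; on ROCm this remains best-effort, not bit-exact."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # warn_only=True: ops without a deterministic kernel emit a warning
    # instead of raising, which matches a best-effort posture.
    torch.use_deterministic_algorithms(True, warn_only=True)
```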
## Rebuilding llama.cpp with ROCm
The default vendored llama.cpp binary is CPU-only. Build a ROCm version once for faster quantization:
```bash
# Set your GPU arch
export AMDGPU_TARGETS="gfx1100"   # RDNA3
# export AMDGPU_TARGETS="gfx90a"  # MI200
# export AMDGPU_TARGETS="gfx942"  # MI300

scripts/build-llama-cpp-rocm.sh
```
The script writes to `vendor/llama.cpp/build-rocm/`. To make
`dlm export` prefer this build, point the runner at it:

```bash
export DLM_LLAMA_CPP_BUILD=vendor/llama.cpp/build-rocm
```
`DLM_LLAMA_CPP_BUILD` is honored by `dlm.export.vendoring`: when set,
that directory is checked before the default vendor dir for each
binary, so the ROCm-accelerated `llama-quantize` / `llama-imatrix`
win over any CPU-only build left behind by `scripts/bump-llama-cpp.sh
build`.
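The resolution order amounts to something like the sketch below. The function name and the default vendor path are assumptions for illustration; the real logic lives in `dlm.export.vendoring`.

```python
import os
from pathlib import Path

# Assumed default vendor location; used only for illustration.
DEFAULT_VENDOR_DIR = Path("vendor/llama.cpp/build")

def find_binary(name: str) -> Path:
    """Resolve a llama.cpp binary, preferring DLM_LLAMA_CPP_BUILD when set."""
    override = os.environ.get("DLM_LLAMA_CPP_BUILD")
    search_dirs = ([Path(override)] if override else []) + [DEFAULT_VENDOR_DIR]
    for d in search_dirs:
        candidate = d / name
        if candidate.is_file() and os.access(candidate, os.X_OK):
            return candidate
    raise FileNotFoundError(f"{name} not found in {search_dirs}")

# With DLM_LLAMA_CPP_BUILD=vendor/llama.cpp/build-rocm exported,
# find_binary("llama-quantize") picks the ROCm build first.
```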
## CI / testing
No default ROCm CI runner exists. Contributors with ROCm hardware can
run the gated smoke test:

```bash
DLM_ENABLE_ROCM_SMOKE=1 uv run pytest tests/integration/hardware/test_rocm_train_smoke.py -v
```
A scheduled self-hosted runner is the intended deployment; contact the maintainers if you'd like to host one.