AMD Radeon™ RX 9070 XT qwen_3.6 MTP

by pdxjohnny 20 days ago

GNU/Linux • bash

Qwen3.6-35B-A3B-MTP local on RX 9070 XT

What worked

docker/model-runner:mtp (image 7b6f81c6dc4b) has the MTP-patched llama.cpp baked in (FROM llama-rocm:full). Retag it as :latest, because docker model status and docker model run auto-pull the runner image and clobber :latest:
```
docker tag docker/model-runner:mtp docker/model-runner:latest
```
Pull the GGUF into the runner volume (one-time, via the managed runner):
```
docker model pull hf.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-IQ2_XXS
```
The model is stored at /models/bundles/sha256/60b929136fc442800ef3cc2b200e026419c6b30b704c2ae7bf4b4a31957dde72/model/model.gguf in the docker-model-runner-models volume.
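The stored path appears to follow the pattern /models/bundles/sha256/<digest>/model/model.gguf, so it can be derived from the digest instead of being pasted by hand. A minimal sketch: the helper name bundle_path is hypothetical, and the layout is assumed from the single path shown above, not from any documented guarantee.

```bash
# Hypothetical helper: build the in-container GGUF path from a bundle digest.
# Assumes the layout seen above: /models/bundles/sha256/<digest>/model/model.gguf
bundle_path() {
  local digest="${1#sha256:}"   # accept "sha256:<hex>" or a bare "<hex>"
  printf '/models/bundles/sha256/%s/model/model.gguf\n' "$digest"
}

MODEL_PATH="$(bundle_path 60b929136fc442800ef3cc2b200e026419c6b30b704c2ae7bf4b4a31957dde72)"
echo "$MODEL_PATH"
```

The resulting $MODEL_PATH can then be passed to llama-server's -m flag.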
The managed docker model run crashes with SIGSEGV on this model. dmesg traces the crashes to a general protection fault (GPF) in libamdhip64.so when llama.cpp enumerates both ROCm0 (the GPU) and ROCm1 (the CPU exposed as a ROCm device). It also crashes during warmup with n_parallel=4.
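To check whether a crash matches this pattern, the kernel log can be filtered for fault lines that mention the HIP runtime. This is a rough sketch: the function name hip_gpf_lines is hypothetical, and it assumes the kernel message contains the words "general protection" and the library name on the same line, which may differ between kernel versions.

```bash
# Print kernel log lines that report a general protection fault in libamdhip64.so.
# Reads stdin, so it works on live dmesg output or on a saved log file.
hip_gpf_lines() {
  grep -i 'general protection' | grep -F 'libamdhip64.so'
}

# Usage on the host (may need root):
#   dmesg | hip_gpf_lines
```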

Bypass the docker model CLI entirely and run /app/llama-server directly from the patched image:

```
docker rm -f docker-model-runner llama-mtp 2>/dev/null
docker run -d --name llama-mtp \
  --device /dev/dri --device /dev/kfd \
  -e HIP_VISIBLE_DEVICES=0 -e ROCR_VISIBLE_DEVICES=0 \
  -v docker-model-runner-models:/models \
  -p 127.0.0.1:12434:12434 \
  --entrypoint /app/llama-server \
  docker/model-runner:mtp \
  -m /models/bundles/sha256/60b929136fc442800ef3cc2b200e026419c6b30b704c2ae7bf4b4a31957dde72/model/model.gguf \
  --host 0.0.0.0 --port 12434 \
  -c 131072 \
  -np 1 \
  -ngl 999 \
  --device ROCm0 \
  -fa on \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --spec-type draft-mtp \
  --spec-draft-n-max 3 \
  --reasoning-budget 0 \
  --no-mmproj
```
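Because the container starts detached, the server may still be loading the model when the first request arrives. The sketch below polls llama-server's /health endpoint until it answers successfully. The function name wait_for_server and its default attempt count and delay are assumptions chosen for illustration.

```bash
# Poll llama-server's /health endpoint until it returns an HTTP success status.
# Args: base URL, max attempts, seconds between attempts (defaults are assumptions).
wait_for_server() {
  local url="${1:-http://127.0.0.1:12434}"
  local tries="${2:-60}"
  local delay="${3:-2}"
  local i=1
  while [ "$i" -le "$tries" ]; do
    # -f makes curl exit non-zero on HTTP errors, e.g. 503 while the model loads
    if curl -sf -o /dev/null "$url/health"; then
      echo "ready after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "server not ready after $tries attempts; check: docker logs llama-mtp" >&2
  return 1
}
```

A typical call would be wait_for_server followed by the first request. If it times out, docker logs llama-mtp usually shows where loading stopped.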

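Key flags: HIP_VISIBLE_DEVICES=0 plus --device ROCm0 (expose only the GPU, avoiding the CPU-as-ROCm GPF), -np 1 (the default of 4 runs out of memory during slot init), and -ngl 999 (offload all layers to the GPU). Two more flags matter but do not appear in the command above: --no-warmup (skip the warmup pass, which crashed with n_parallel=4) and --jinja (enable the tool-call chat template). Append them if warmup crashes or tool calls are not parsed.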

Example: chat completion with tool call

For fast mode (no thinking), also add "chat_template_kwargs": {"enable_thinking": false} at the top level of the request object.

```
curl -s http://127.0.0.1:12434/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "qwen3.6-mtp",
    "messages": [
      {"role": "user", "content": "What is the weather in Portland, OR right now?"}
    ],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {
            "city":  {"type": "string", "description": "City name"},
            "units": {"type": "string", "enum": ["celsius","fahrenheit"]}
          },
          "required": ["city"]
        }
      }
    }],
    "tool_choice": "auto",
    "max_tokens": 256
  }'
```
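For the fast-mode variant, the only change is the extra top-level chat_template_kwargs key. A minimal sketch of a request body builder follows. The helper name chat_body is hypothetical, and it inserts the prompt verbatim, so the prompt must not contain double quotes or backslashes.

```bash
# Build a minimal chat request body.
# $1: user prompt (no double quotes or backslashes), $2: true/false for thinking.
chat_body() {
  local prompt="$1"
  local thinking="${2:-false}"
  printf '{"model": "qwen3.6-mtp", "messages": [{"role": "user", "content": "%s"}], "chat_template_kwargs": {"enable_thinking": %s}, "max_tokens": 256}\n' \
    "$prompt" "$thinking"
}

chat_body "What is the weather in Portland, OR right now?" false
```

The body can be sent with the same curl invocation as above by passing -d "$(chat_body 'Hello' false)".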

The model returns choices[0].message.tool_calls[*], each with function.name and function.arguments (a JSON-encoded string). When thinking is enabled, reasoning_content holds the chain-of-thought; to suppress it, put /no_think in the user message or send "chat_template_kwargs": {"enable_thinking": false}.
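Since function.arguments arrives as a string containing JSON, it needs a second decode before use. The sketch below does that with jq, which is assumed to be installed. The function name tool_calls is hypothetical, and the sample response is a hand-written illustration trimmed to the fields discussed here, not captured server output.

```bash
# Print "name<TAB>arguments" for each tool call in a chat completion read from stdin.
# fromjson decodes the JSON-encoded arguments string; tojson re-emits it compactly.
tool_calls() {
  jq -r '.choices[0].message.tool_calls[]?
         | "\(.function.name)\t\(.function.arguments | fromjson | tojson)"'
}

# Hypothetical response, reduced to the relevant fields.
sample='{"choices":[{"message":{"role":"assistant","tool_calls":[{"type":"function","function":{"name":"get_weather","arguments":"{\"city\":\"Portland\"}"}}]}}]}'

if command -v jq >/dev/null 2>&1; then
  printf '%s\n' "$sample" | tool_calls
fi
```

In practice the curl output would be piped straight into tool_calls. A response without tool calls prints nothing.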

Stats and Logs

```
$ rocm-smi --showmeminfo vram -f -t -p -u | python -u ~/Downloads/parse_gpu_mem.py
WARNING: AMD GPU device(s) is/are in a low-power state. Check power control/runtime_status

GPU VRAM Usage (GB)
===================================
GPU[0]: 13.96884 GB
GPU[1]: 0.01552 GB
```
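The parse_gpu_mem.py script is not included in the recording notes. As a rough stand-in, the awk sketch below extracts used VRAM per GPU from rocm-smi output. It assumes lines shaped like "GPU[0] : VRAM Total Used Memory (B): <bytes>" and that GB means 10^9 bytes. Both are assumptions, and the function name vram_used_gb is hypothetical.

```bash
# Stand-in for the unshown parse_gpu_mem.py: report used VRAM per GPU in GB.
# Assumes rocm-smi lines like "GPU[0]   : VRAM Total Used Memory (B): <bytes>".
vram_used_gb() {
  awk -F': ' '/VRAM Total Used Memory \(B\)/ {
    gpu = $1
    gsub(/[ \t]+/, "", gpu)                        # strip padding after "GPU[n]"
    printf "%s: %.5f GB\n", gpu, $NF / 1000000000  # bytes to GB (10^9, assumed)
  }'
}

# Usage:
#   rocm-smi --showmeminfo vram | vram_used_gb
```

Lines that do not match, such as the low-power warning, are skipped.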

```
prompt eval time =     139.10 ms /    34 tokens (    4.09 ms per token,   244.43 tokens per second)
       eval time =   25613.34 ms /  2644 tokens (    9.69 ms per token,   103.23 tokens per second)
      total time =   25752.43 ms /  2678 tokens
draft acceptance rate = 0.61607 ( 1717 accepted /  2787 generated)
0.52.569.131 I statistics draft-mtp: #calls(b,g,a) = 2 934 934, #gen drafts = 934, #acc drafts = 726, #gen tokens = 2802, #acc tokens = 1726, dur(b,g,a) = 0.002, 5476.322, 0.454 ms
0.52.569.142 I slot      release: id  0 | task 8 | stop processing: n_tokens = 2680, truncated = 0
```
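The headline numbers can be re-derived from the raw counts in the log, which is a handy sanity check when comparing runs. The sketch below hard-codes the values from the lines above, and the function name mtp_stats is hypothetical. The draft-mtp line is consistent with --spec-draft-n-max 3, since 934 drafts of 3 tokens give 2802 generated tokens. The two acceptance counts (1717 of 2787 versus 1726 of 2802) differ slightly, and the log does not say why.

```bash
# Recompute throughput and draft acceptance from the counts in the log above.
mtp_stats() {
  awk 'BEGIN {
    printf "prompt eval: %.2f tok/s\n", 34   / (139.10   / 1000)
    printf "generation:  %.2f tok/s\n", 2644 / (25613.34 / 1000)
    printf "acceptance:  %.5f\n",       1717 / 2787
    printf "draft-mtp:   %d tokens per draft, %.2f accepted on average\n", 2802 / 934, 1726 / 934
  }'
}

mtp_stats
```

This reproduces the logged 244.43 and 103.23 tokens per second and the 0.61607 acceptance rate, and shows that each 3-token draft contributed about 1.85 accepted tokens.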

https://asciinema.org/a/1067473
