Qwen3.6-35B-A3B-MTP local on RX 9070 XT

What worked

  1. docker/model-runner:mtp (image 7b6f81c6dc4b) has the MTP-patched llama.cpp baked in (FROM llama-rocm:full). Retag it as :latest, since docker model status and docker model run auto-pull and overwrite :latest:

    docker tag docker/model-runner:mtp docker/model-runner:latest

  2. Pulled GGUF into the runner volume (one-time, via the managed runner):

    docker model pull hf.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-IQ2_XXS

    Stored at /models/bundles/sha256/60b929136fc442800ef3cc2b200e026419c6b30b704c2ae7bf4b4a31957dde72/model/model.gguf in the docker-model-runner-models volume.

  3. The managed docker model run segfaults (SIGSEGV) on this model. dmesg traces the crash to a general protection fault (GPF) in libamdhip64.so when llama.cpp enumerates both ROCm0 (GPU) and ROCm1 (CPU-as-ROCm). It also crashes during warmup with n_parallel=4.

  4. Bypass the docker model CLI entirely and run /app/llama-server directly from the patched image:

    docker rm -f docker-model-runner llama-mtp 2>/dev/null
    docker run -d --name llama-mtp \
      --device /dev/dri --device /dev/kfd \
      -e HIP_VISIBLE_DEVICES=0 -e ROCR_VISIBLE_DEVICES=0 \
      -v docker-model-runner-models:/models \
      -p 127.0.0.1:12434:12434 \
      --entrypoint /app/llama-server \
      docker/model-runner:mtp \
      -m /models/bundles/sha256/60b929136fc442800ef3cc2b200e026419c6b30b704c2ae7bf4b4a31957dde72/model/model.gguf \
      --host 0.0.0.0 --port 12434 \
      -c 131072 \
      -np 1 \
      -ngl 999 \
      --device ROCm0 \
      -fa on \
      --cache-type-k q8_0 \
      --cache-type-v q8_0 \
      --spec-type draft-mtp \
      --spec-draft-n-max 3 \
      --reasoning-budget 0 \
      --no-mmproj

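Key flags: HIP_VISIBLE_DEVICES=0 + --device ROCm0 (avoid the CPU-as-ROCm GPF), -np 1 (the default of 4 OOMs during slot init), -ngl 999 (all layers on GPU). Two more flags are not in the command above: --no-warmup (skip the warmup pass that crashed) and --jinja (enable the tool-call template); append them to the command if needed.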
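
The container starts detached (-d), so the model may still be loading when the first request arrives. A minimal Python sketch for waiting on it, assuming llama-server's GET /health answers 200 once the model is ready and an error status before that. The helper names health_ok and wait_until_ready are made up for these notes:

```python
import time
import urllib.error
import urllib.request


def health_ok(base_url, timeout_s=2.0):
    """Return True if GET <base_url>/health answers HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status (e.g. still loading).
        return False


def wait_until_ready(probe, timeout_s=300.0, interval_s=2.0, sleep=time.sleep):
    """Call probe() until it returns True or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

Usage against the port published above: wait_until_ready(lambda: health_ok("http://127.0.0.1:12434")). The probe is passed in as a callable so the loop can be exercised without a running server.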

Example: chat completion with tool call

For fast mode, "chat_template_kwargs": {"enable_thinking": false} can also be added at the top level of the request object.

curl -s http://127.0.0.1:12434/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "qwen3.6-mtp",
    "messages": [
      {"role": "user", "content": "Whats the weather in Portland, OR right now?"}
    ],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {
            "city":  {"type": "string", "description": "City name"},
            "units": {"type": "string", "enum": ["celsius","fahrenheit"]}
          },
          "required": ["city"]
        }
      }
    }],
    "tool_choice": "auto",
    "max_tokens": 256
  }'

The model returns choices[0].message.tool_calls[*], each with function.name and function.arguments (a JSON-encoded string). When thinking is enabled, reasoning_content holds the chain-of-thought; to suppress it, put /no_think in the user message or set "chat_template_kwargs": {"enable_thinking": false}.
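
The same request and the tool-call extraction can be sketched in Python with only the standard library. This assumes the response follows the OpenAI chat-completions shape described above; build_request, extract_tool_calls and post_chat are illustrative names, not part of llama.cpp:

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:12434"  # port published by the docker run command

WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
                "units": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
}


def build_request(prompt, tools, thinking=True, max_tokens=256):
    """Build the JSON body; thinking=False adds the enable_thinking switch."""
    body = {
        "model": "qwen3.6-mtp",
        "messages": [{"role": "user", "content": prompt}],
        "tools": tools,
        "tool_choice": "auto",
        "max_tokens": max_tokens,
    }
    if not thinking:
        body["chat_template_kwargs"] = {"enable_thinking": False}
    return body


def extract_tool_calls(response):
    """Return [(function_name, arguments_dict), ...] from a response dict."""
    message = response["choices"][0]["message"]
    calls = []
    for call in message.get("tool_calls") or []:
        fn = call["function"]
        args = fn["arguments"]
        # Arguments are expected as a JSON-encoded string; tolerate a dict too.
        if isinstance(args, str):
            args = json.loads(args)
        calls.append((fn["name"], args))
    return calls


def post_chat(body, base_url=BASE_URL):
    """POST the body to /v1/chat/completions and return the decoded JSON."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

With the server up: extract_tool_calls(post_chat(build_request("What is the weather in Portland, OR right now?", [WEATHER_TOOL], thinking=False))). An empty list means the model answered in plain text instead of calling the tool.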

Stats and Logs

$ rocm-smi --showmeminfo vram -f -t -p -u | python -u ~/Downloads/parse_gpu_mem.py
WARNING: AMD GPU device(s) is/are in a low-power state. Check power control/runtime_status

GPU VRAM Usage (GB)
===================================
GPU[0]: 13.96884 GB
GPU[1]: 0.01552 GB
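
parse_gpu_mem.py itself is not reproduced in these notes. A hypothetical stand-in is sketched below. It assumes rocm-smi prints per-GPU lines of the form "GPU[0] : VRAM Total Used Memory (B): <bytes>"; whether the real script divides by 1e9 or 2**30 is not known, so the divisor is a parameter:

```python
import re

# Assumed rocm-smi line shape: "GPU[0]   : VRAM Total Used Memory (B): 123456"
USED_RE = re.compile(r"GPU\[(\d+)\]\s*:\s*VRAM Total Used Memory \(B\):\s*(\d+)")


def parse_used_vram(text, bytes_per_gb=1e9):
    """Map GPU index -> used VRAM in GB, taken from rocm-smi text output."""
    usage = {}
    for match in USED_RE.finditer(text):
        usage[int(match.group(1))] = int(match.group(2)) / bytes_per_gb
    return usage


def format_report(usage):
    """Render a small report with one line per GPU, sorted by index."""
    lines = ["GPU VRAM Usage (GB)", "=" * 35]
    lines += [f"GPU[{idx}]: {gb:.5f} GB" for idx, gb in sorted(usage.items())]
    return "\n".join(lines)
```

To use it as a filter like the command above, add import sys and end the file with print(format_report(parse_used_vram(sys.stdin.read()))).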

llama-server log:
prompt eval time =     139.10 ms /    34 tokens (    4.09 ms per token,   244.43 tokens per second)
       eval time =   25613.34 ms /  2644 tokens (    9.69 ms per token,   103.23 tokens per second)
      total time =   25752.43 ms /  2678 tokens
draft acceptance rate = 0.61607 ( 1717 accepted /  2787 generated)
0.52.569.131 I statistics draft-mtp: #calls(b,g,a) = 2 934 934, #gen drafts = 934, #acc drafts = 726, #gen tokens = 2802, #acc tokens = 1726, dur(b,g,a) = 0.002, 5476.322, 0.454 ms
0.52.569.142 I slot      release: id  0 | task 8 | stop processing: n_tokens = 2680, truncated = 0
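
To compare runs (for example different --spec-draft-n-max values), the llama-server log lines above can be reduced to a few numbers. A small Python sketch, assuming the log format stays exactly as pasted here; summarize and its dictionary keys are made-up names:

```python
import re

TIMING_RE = re.compile(
    r"^[ \t]*(prompt eval|eval|total) time\s*=\s*([\d.]+) ms\s*/\s*(\d+) tokens",
    re.MULTILINE,
)
ACCEPT_RE = re.compile(
    r"draft acceptance rate = [\d.]+ \(\s*(\d+) accepted\s*/\s*(\d+) generated\)"
)
DRAFT_RE = re.compile(r"#gen drafts = (\d+).*?#acc tokens = (\d+)")


def summarize(log_text):
    """Recompute throughput and draft acceptance from llama-server log text."""
    out = {}
    for name, ms, tokens in TIMING_RE.findall(log_text):
        out[f"{name} tok/s"] = int(tokens) / (float(ms) / 1000.0)
    match = ACCEPT_RE.search(log_text)
    if match:
        accepted, generated = map(int, match.groups())
        out["acceptance"] = accepted / generated
    match = DRAFT_RE.search(log_text)
    if match:
        drafts, accepted_tokens = map(int, match.groups())
        # Mean number of draft tokens accepted per draft call.
        out["accepted per draft"] = accepted_tokens / drafts
    return out
```

On the run above this reproduces the logged 103.23 tokens per second for eval and the 0.61607 acceptance rate, and gives about 1.85 accepted tokens per draft call out of 3 proposed (2802 / 934), which matches --spec-draft-n-max 3.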