Qwen3.6-35B-A3B-MTP local on RX 9070 XT
What worked
- docker/model-runner:mtp (image 7b6f81c6dc4b) has the MTP-patched llama.cpp baked in (FROM llama-rocm:full). Retag it as :latest, because docker model status and docker model run auto-pull and clobber :latest:
docker tag docker/model-runner:mtp docker/model-runner:latest
- Pull the GGUF into the runner volume (one-time, via the managed runner):
docker model pull hf.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-IQ2_XXS
Stored at /models/bundles/sha256/60b929136fc442800ef3cc2b200e026419c6b30b704c2ae7bf4b4a31957dde72/model/model.gguf in the docker-model-runner-models volume.
- The managed docker model run SIGSEGVs on this model. dmesg traces the crashes to a general protection fault (GPF) in libamdhip64.so when llama.cpp enumerates both ROCm0 (GPU) and ROCm1 (CPU-as-ROCm). It also crashes during warmup with n_parallel=4.
- Bypass the docker model CLI entirely and run /app/llama-server directly from the patched image:
docker rm -f docker-model-runner llama-mtp 2>/dev/null
docker run -d --name llama-mtp \
--device /dev/dri --device /dev/kfd \
-e HIP_VISIBLE_DEVICES=0 -e ROCR_VISIBLE_DEVICES=0 \
-v docker-model-runner-models:/models \
-p 127.0.0.1:12434:12434 \
--entrypoint /app/llama-server \
docker/model-runner:mtp \
-m /models/bundles/sha256/60b929136fc442800ef3cc2b200e026419c6b30b704c2ae7bf4b4a31957dde72/model/model.gguf \
--host 0.0.0.0 --port 12434 \
-c 131072 \
-np 1 \
-ngl 999 \
--device ROCm0 \
-fa on \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--spec-type draft-mtp \
--spec-draft-n-max 3 \
--reasoning-budget 0 \
--no-mmproj \
--no-warmup \
--jinja
Key flags: HIP_VISIBLE_DEVICES=0 plus --device ROCm0 (avoids the CPU-as-ROCm GPF), --no-warmup (skips the warmup pass that crashed), -np 1 (the default of 4 OOMs during slot init), -ngl 999 (all layers on GPU), --jinja (enables the tool-call chat template).
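Since the container is started detached, it helps to wait until the server is actually ready before sending requests. The following is a minimal sketch, assuming the patched build still exposes the stock llama-server /health endpoint (200 once the model is loaded, an error status while it is loading). wait_for_llama is a hypothetical helper name, and the default URL simply mirrors the -p mapping above.

```shell
# Poll /health until it answers successfully or the attempts run out.
# wait_for_llama is a hypothetical helper, not part of docker or llama.cpp.
wait_for_llama() {
  base_url=${1:-http://127.0.0.1:12434}
  attempts=${2:-60}
  delay=${3:-2}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # -f makes curl exit non-zero on HTTP error statuses (e.g. still loading).
    if curl -sf --max-time 5 "$base_url/health" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}
```

Typical use: wait_for_llama || docker logs --tail 50 llama-mtp, so that a failed start shows the last log lines instead of hanging silently.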
Example: chat completion with tool call
For fast (non-thinking) mode, you can also add "chat_template_kwargs": {"enable_thinking": false} at the top level of the request object.
curl -s http://127.0.0.1:12434/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "qwen3.6-mtp",
"messages": [
{"role": "user", "content": "What is the weather in Portland, OR right now?"}
],
"tools": [{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get current weather for a city",
"parameters": {
"type": "object",
"properties": {
"city": {"type": "string", "description": "City name"},
"units": {"type": "string", "enum": ["celsius","fahrenheit"]}
},
"required": ["city"]
}
}
}],
"tool_choice": "auto",
"max_tokens": 256
}'
The model returns choices[0].message.tool_calls[*], each with function.name and JSON-encoded function.arguments. reasoning_content holds the chain of thought when thinking is enabled; to suppress it, put /no_think in the user message or set "chat_template_kwargs": {"enable_thinking": false}.
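To pull the tool calls out of the response on the command line, something like the following should work. It is a sketch that assumes jq is installed and that function.arguments arrives as a JSON-encoded string (the OpenAI-style convention); extract_tool_calls is a hypothetical helper name.

```shell
# Print one "name<TAB>arguments" line per tool call found in a
# chat-completion response read from stdin. Prints nothing if the
# model answered in plain text (no tool_calls field).
extract_tool_calls() {
  jq -r '.choices[0].message.tool_calls[]?
         | "\(.function.name)\t\(.function.arguments)"'
}
```

Pipe the curl command above into extract_tool_calls to see the call the model chose. To read a single argument, decode the string first, e.g. jq -r '.choices[0].message.tool_calls[0].function.arguments | fromjson | .city'.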
Stats and Logs
$ rocm-smi --showmeminfo vram -f -t -p -u | python -u ~/Downloads/parse_gpu_mem.py
WARNING: AMD GPU device(s) is/are in a low-power state. Check power control/runtime_status
GPU VRAM Usage (GB)
===================================
GPU[0]: 13.96884 GB
GPU[1]: 0.01552 GB
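parse_gpu_mem.py is a local script and is not reproduced here. A rough awk equivalent is sketched below, under two assumptions: that rocm-smi --showmeminfo vram prints lines of the form "GPU[0] : VRAM Total Used Memory (B): <bytes>", and that GB means 10^9 bytes (the original script may divide by 1024^3 instead). vram_used_gb is a hypothetical name.

```shell
# Convert rocm-smi "VRAM Total Used Memory (B)" lines on stdin
# into one "GPU[n]: x.xxxxx GB" line per device.
# Assumes the GPU id is the first field and the byte count the last.
vram_used_gb() {
  awk '/VRAM Total Used Memory/ {
    printf "%s: %.5f GB\n", $1, $NF / 1000000000
  }'
}
```

Usage: rocm-smi --showmeminfo vram | vram_used_gb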
prompt eval time = 139.10 ms / 34 tokens ( 4.09 ms per token, 244.43 tokens per second)
eval time = 25613.34 ms / 2644 tokens ( 9.69 ms per token, 103.23 tokens per second)
total time = 25752.43 ms / 2678 tokens
draft acceptance rate = 0.61607 ( 1717 accepted / 2787 generated)
0.52.569.131 I statistics draft-mtp: #calls(b,g,a) = 2 934 934, #gen drafts = 934, #acc drafts = 726, #gen tokens = 2802, #acc tokens = 1726, dur(b,g,a) = 0.002, 5476.322, 0.454 ms
0.52.569.142 I slot release: id 0 | task 8 | stop processing: n_tokens = 2680, truncated = 0
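The draft numbers above can be recomputed from the log text. The sketch below assumes the two line formats shown here ("draft acceptance rate = ..." and "statistics draft-mtp: ..."); other llama.cpp builds may log differently. mtp_summary is a hypothetical helper name.

```shell
# Summarize MTP drafting from llama-server log text on stdin:
#   acceptance         = accepted / generated draft tokens
#   accepted_per_draft = accepted draft tokens per draft call
mtp_summary() {
  awk '
    /draft acceptance rate/ {
      for (i = 2; i <= NF; i++) {
        if ($i == "accepted")   acc = $(i - 1)
        if ($i == "generated)") gen = $(i - 1)
      }
      if (gen > 0) printf "acceptance=%.5f\n", acc / gen
    }
    /statistics draft-mtp/ {
      # Fields look like "#gen drafts = 934," so the value sits 3 fields on;
      # adding 0 drops the trailing comma.
      for (i = 1; i + 3 <= NF; i++) {
        if ($i == "#gen" && $(i + 1) == "drafts") drafts = $(i + 3) + 0
        if ($i == "#acc" && $(i + 1) == "tokens") acctok = $(i + 3) + 0
      }
      if (drafts > 0) printf "accepted_per_draft=%.2f\n", acctok / drafts
    }
  '
}
```

Run as docker logs llama-mtp 2>&1 | mtp_summary. For the lines above this gives 1717 / 2787 for the acceptance rate and 1726 / 934 accepted tokens per draft call, out of the 3 drafted per call with --spec-draft-n-max 3.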