Round 2 · Speed + DFlash

Qwen3.6-27B speed analysis

by Kyle Hessling

Tokens-per-second, time-to-first-token, and long-context prefill cost measured across 10 streaming runs on a single RTX 5090.

View the DFlash epic page — built by the model itself
52.1 avg tok/s (generation)
0.17 s median TTFT
15,183 total tokens generated
32K context window
~20 GB VRAM used

Plan vs. reality — the spec-decoding detour

The original plan was FP8 + MTP speculative decoding via vLLM, targeting ~85–120 tok/s. Every speculative-decoding path on Blackwell + CUDA 12.8 hit a concrete blocker. The honest finding: spec decoding for Qwen3.6-27B on a single 5090 is not shippable today. The fallback is llama.cpp with the GGUF Q5_K_XL quant, which is proven and stable on the same hardware.

Setup

| Item | Value |
| --- | --- |
| Model | unsloth/Qwen3.6-27B-GGUF (UD-Q5_K_XL) |
| Runtime | llama.cpp (CUDA 12.8 build), --flash-attn on, --jinja |
| Context | 32,768 tokens; q8_0 K and V cache; single slot |
| GPU | RTX 5090 (32 GB), all 65 layers offloaded |
| VRAM in use | ~20 GB of 32 GB (~12 GB headroom) |
| Spec decoding | none (MTP path was blocked upstream) |
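A launch command consistent with the table above. This is a sketch, not the authors' exact invocation: the binary path and model filename are assumptions, and the flags mirror the setup table (32k context, q8_0 KV cache, all 65 layers offloaded, flash attention, jinja templating, single slot).

```shell
# Hypothetical llama-server invocation matching the setup table
# (model filename and path are placeholders)
./llama-server \
  -m Qwen3.6-27B-UD-Q5_K_XL.gguf \
  -c 32768 \
  -ngl 65 \
  --flash-attn on \
  --jinja \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --parallel 1
```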

Per-run measurements

| prompt | gen tokens | TTFT (s) | total (s) | gen tok/s |
| --- | --- | --- | --- | --- |
| short_1_haiku | 21 | 0.15 | 0.52 | 58.9 |
| short_2_explain_mtp | 78 | 0.15 | 1.60 | 54.7 |
| short_3_math | 117 | 0.19 | 2.37 | 54.2 |
| short_4_code | 42 | 0.17 | 0.95 | 55.6 |
| short_5_json | 103 | 0.17 | 2.08 | 54.2 |
| medium_1_tutorial (thinking) | 2200 | n/a | 41.22 | 53.4 * |
| medium_2_code_review | 1654 | 0.25 | 31.14 | 53.6 |
| long_ctx_1_summary (~8k prompt) | 91 | 1.49 | 3.48 | 46.3 |
| long_ctx_2_extraction (same ctx) | 102 | 0.33 | 2.56 | 46.2 |
| epic_dflash_page (~11k gen) | 10775 | 1.00 | 241.21 | 44.9 |

* medium_1 ran with thinking enabled; the entire 2200-token budget was spent inside <think>, so no content chunks were emitted. TTFT and generation rate were not measurable separately, but the end-to-end rate was still ~53 tok/s.
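As a sanity check, the headline stats can be cross-checked against the per-run rows. This sketch sums the gen-token column and averages the gen rates (all numbers copied from the table; the unweighted mean lands at 52.2, so the headline 52.1 presumably comes from unrounded per-run rates):

```python
# Per-run (gen tokens, gen tok/s) pairs copied from the table above
runs = [
    (21, 58.9), (78, 54.7), (117, 54.2), (42, 55.6), (103, 54.2),
    (2200, 53.4), (1654, 53.6), (91, 46.3), (102, 46.2), (10775, 44.9),
]

total_tokens = sum(t for t, _ in runs)
avg_rate = sum(r for _, r in runs) / len(runs)

print(total_tokens)        # 15183 -- matches the headline "15,183 total tokens"
print(round(avg_rate, 1))  # 52.2  -- headline reports 52.1 (rounding of raw rates)
```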

Gen tok/s — visual

[Bar chart: gen tok/s per run, from 58.9 (short_1_haiku) down to 44.9 (epic_dflash_page); values match the table above.]

What the numbers say
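Two figures can be derived from the table, assuming the "~8k prompt" estimate for long_ctx_1 is accurate: prefill throughput from its TTFT, and the generation slowdown at depth versus the short-prompt runs.

```python
# Prefill throughput estimate: ~8,000 prompt tokens processed during the 1.49 s TTFT
prompt_tokens = 8000          # assumption: "~8k prompt" from the table
ttft_s = 1.49
prefill_rate = prompt_tokens / ttft_s
print(round(prefill_rate))    # ~5369 tok/s prefill

# Generation slowdown at ~8k context vs. a typical short-prompt run
short_rate = 54.2             # typical short-prompt gen tok/s
deep_rate = 46.3              # long_ctx_1_summary
print(round(100 * (1 - deep_rate / short_rate), 1))  # ~14.6 % slower
```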

Extrapolation: what spec decoding would have given

Qwen3.6-27B ships with MTP weights in its FP8 release. Published vLLM benchmarks on other 27-30B dense targets with MTP at num_speculative_tokens=2 show ~1.8-2.2× speedup over baseline autoregressive decoding. On this hardware that projects to 95-115 tok/s. It is not deliverable today, but the gate is purely CUDA toolchain versioning and a larger VRAM budget (a ~40 GB card, or a 48 GB Pro part, would fit FP8 + KV cache + MTP comfortably).
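The projection is straightforward arithmetic on the measured baseline. A sketch, using the headline 52.1 tok/s average (which lands at ~94-115; the 95-115 quoted above presumably starts from the ~53 tok/s steady short-run rate):

```python
baseline = 52.1                  # measured avg gen tok/s from this round
low_mult, high_mult = 1.8, 2.2   # published MTP speedup range at num_speculative_tokens=2

projected = (baseline * low_mult, baseline * high_mult)
print(round(projected[0]), round(projected[1]))  # 94 115
```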

DFlash with a Qwen3.6-27B-specific drafter (not yet released) would stack on top — the paper reports 6× lossless acceleration and 2.5× over EAGLE-3. Revisit when z-lab/Qwen3.6-27B-DFlash ships.

Open the DFlash epic page (generated by the model itself)