Tokens-per-second, time-to-first-token, and long-context prefill cost measured across 10 streaming runs on a single RTX 5090.
View the DFlash epic page — built by the model itself.

Why llama.cpp + GGUF: every faster sm_120 path was blocked.

- The quantized checkpoint uses `group_size=32`, but every Blackwell-capable kernel (Conch, Marlin, AllSpark) requires `group_size=128` or does not support zero-points. No kernel match.
- NVFP4 on sm_120 needs CUDA ≥ 12.9 and this system has 12.8, so the build stops with `No supported CUDA architectures found for major versions [12]`. Cannot run.
- `sgl_kernel` ships no sm_120 binaries and ABI-mismatches against torch 2.10. Import-time fail.

| Item | Value |
|---|---|
| Model | unsloth/Qwen3.6-27B-GGUF — UD-Q5_K_XL |
| Runtime | llama.cpp cuda-12.8, --flash-attn on, --jinja |
| Context | 32,768 tokens, q8_0 K and V cache, single slot |
| GPU | RTX 5090 (32 GB), all 65 layers offloaded |
| VRAM in use | ~20 GB of 32 GB (12 GB headroom) |
| Spec decoding | none (MTP path was blocked upstream) |
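All rows below come from streamed responses, so TTFT and generation rate can be computed purely from client-side chunk timestamps. A minimal sketch of that computation (the function name and the `(arrival_time, n_new_tokens)` event format are illustrative assumptions, not the original harness):

```python
def stream_metrics(t_start, chunk_events):
    """Derive TTFT and generation tok/s from a streamed response.

    t_start:      wall-clock time the request was sent
    chunk_events: [(arrival_time, n_new_tokens), ...] recorded per chunk
                  (illustrative format, not llama.cpp's wire format)
    """
    t_first = chunk_events[0][0]
    t_last = chunk_events[-1][0]
    n_tokens = sum(n for _, n in chunk_events)
    ttft = t_first - t_start               # time-to-first-token
    gen_window = t_last - t_first          # pure generation time
    tok_per_s = n_tokens / gen_window if gen_window > 0 else float("nan")
    return ttft, tok_per_s
```

Plugging in short_1_haiku's numbers (21 tokens, 0.15 s TTFT, 0.52 s total) gives roughly 21 / 0.37 ≈ 57 tok/s, close to the table's 58.9; chunk granularity accounts for the gap.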

| prompt | gen tokens | TTFT (s) | total (s) | gen tok/s |
|---|---|---|---|---|
| short_1_haiku | 21 | 0.15 | 0.52 | 58.9 |
| short_2_explain_mtp | 78 | 0.15 | 1.60 | 54.7 |
| short_3_math | 117 | 0.19 | 2.37 | 54.2 |
| short_4_code | 42 | 0.17 | 0.95 | 55.6 |
| short_5_json | 103 | 0.17 | 2.08 | 54.2 |
| medium_1_tutorial (thinking) | 2200 | — | 41.22 | 53.4 * |
| medium_2_code_review | 1654 | 0.25 | 31.14 | 53.6 |
| long_ctx_1_summary (~8k prompt) | 91 | 1.49 | 3.48 | 46.3 |
| long_ctx_2_extraction (same ctx) | 102 | 0.33 | 2.56 | 46.2 |
| epic_dflash_page (~11k gen) | 10775 | 1.00 | 241.21 | 44.9 |
* medium_1 ran with thinking enabled; the entire 2200-token budget was spent inside `<think>`, so no content chunks were emitted — TTFT/gen not measurable separately, but the end-to-end rate was still ~53 tok/s.
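That failure mode is visible in the raw stream: the output is a single unterminated `<think>` block. A small helper (the tag format is Qwen's; the function itself is an illustrative sketch, not part of llama.cpp) that separates reasoning from answer content:

```python
import re

def split_thinking(text):
    """Split Qwen-style output into (thinking, content).

    Assumes reasoning is wrapped in <think>...</think>. If the closing
    tag never arrives (budget exhausted mid-think), everything after
    <think> counts as reasoning and content is empty.
    """
    m = re.match(r"\s*<think>(.*?)(?:</think>(.*))?$", text, re.DOTALL)
    if not m:
        return "", text.strip()  # no think block: all content
    thinking = m.group(1)
    content = m.group(2) if m.group(2) is not None else ""
    return thinking.strip(), content.strip()
```

For medium_1, the second branch fires: 2200 tokens of reasoning, empty content.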
long_ctx_2 ran against the same ~8k document as long_ctx_1 and saw TTFT drop from 1.49 s to 0.33 s — 4.5× faster prefill for the reused prefix.

medium_1 emitted no content at all: the thinking budget consumed the full 2200 tokens. Disable thinking or allocate ≥ 4K tokens for tasks that combine reasoning + output.

Qwen3.6-27B ships with MTP weights in its FP8 release. Published vLLM benchmarks on other 27-30B dense targets with MTP @ num_speculative_tokens=2 show ~1.8-2.2× speedup over baseline autoregressive. On this hardware that projects to 95-115 tok/s. It is not delivered today, gated purely on CUDA toolchain versioning and a looser VRAM budget (a ~40 GB card or 48 GB Pro would fit FP8 + KV + MTP comfortably).
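The 1.8-2.2× band is consistent with the standard speculative-decoding expectation: with per-token acceptance rate α and k drafted tokens, each target forward pass yields (1 − α^(k+1)) / (1 − α) tokens on average. A sketch of the projection (α and the drafter-cost ratio are illustrative assumptions, not measured values):

```python
def expected_tokens_per_pass(alpha, k):
    """Expected tokens accepted per target forward pass (including the
    bonus token) with k speculative tokens and acceptance rate alpha."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def projected_rate(base_tok_s, alpha, k, draft_cost=0.0):
    """Project throughput: tokens per pass divided by the relative cost
    of one step (1 target pass plus k draft steps at draft_cost each)."""
    return base_tok_s * expected_tokens_per_pass(alpha, k) / (1 + k * draft_cost)
```

With the table's ~53 tok/s baseline and k=2, acceptance rates in the 0.6-0.75 range and a cheap drafter land squarely in the projected 95-115 tok/s band.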
DFlash with a Qwen3.6-27B-specific drafter (not yet released) would stack on top — the paper reports 6× lossless acceleration and 2.5× over EAGLE-3. Revisit when z-lab/Qwen3.6-27B-DFlash ships.