Tokens-per-second, time-to-first-token, and long-context prefill cost measured across 10 streaming runs on a single RTX 5090.
View the DFlash epic page — built by the model itself.

Why llama.cpp + GGUF: every faster sm_120 path was blocked.

- The quantized checkpoint uses `group_size=32`, but every Blackwell-capable kernel (Conch, Marlin, AllSpark) requires `group_size=128` or does not support zero-points. No kernel match.
- NVFP4 on sm_120 needs CUDA ≥ 12.9 and this system has 12.8, so the build stops with `No supported CUDA architectures found for major versions [12]`. Cannot run.
- `sgl_kernel` ships no sm_120 binaries and ABI-mismatches against torch 2.10. Import-time fail.

| Item | Value |
|---|---|
| Model | unsloth/Qwen3.6-27B-GGUF — UD-Q5_K_XL |
| Runtime | llama.cpp cuda-12.8, --flash-attn on, --jinja |
| Context | 32,768 tokens, q8_0 K and V cache, single slot |
| GPU | RTX 5090 (32 GB), all 65 layers offloaded |
| VRAM in use | ~20 GB of 32 GB (12 GB headroom) |
| Spec decoding | none (MTP path was blocked upstream) |
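All rows below come from streamed responses, so TTFT and generation rate can be computed purely from client-side chunk timestamps. A minimal sketch of that computation (the function name and the `(arrival_time, n_new_tokens)` event format are illustrative assumptions, not the original harness):

```python
def stream_metrics(t_start, chunk_events):
    """Derive TTFT and generation tok/s from a streamed response.

    t_start:      wall-clock time the request was sent
    chunk_events: [(arrival_time, n_new_tokens), ...] recorded per chunk
                  (illustrative format, not llama.cpp's wire format)
    """
    t_first = chunk_events[0][0]
    t_last = chunk_events[-1][0]
    n_tokens = sum(n for _, n in chunk_events)
    ttft = t_first - t_start               # time-to-first-token
    gen_window = t_last - t_first          # pure generation time
    tok_per_s = n_tokens / gen_window if gen_window > 0 else float("nan")
    return ttft, tok_per_s
```

Plugging in short_1_haiku's numbers (21 tokens, 0.15 s TTFT, 0.52 s total) gives roughly 21 / 0.37 ≈ 57 tok/s, close to the table's 58.9; chunk granularity accounts for the gap.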

| prompt | gen tokens | TTFT (s) | total (s) | gen tok/s |
|---|---|---|---|---|
| short_1_haiku | 21 | 0.15 | 0.52 | 58.9 |
| short_2_explain_mtp | 78 | 0.15 | 1.60 | 54.7 |
| short_3_math | 117 | 0.19 | 2.37 | 54.2 |
| short_4_code | 42 | 0.17 | 0.95 | 55.6 |
| short_5_json | 103 | 0.17 | 2.08 | 54.2 |
| medium_1_tutorial (thinking) | 2200 | — | 41.22 | 53.4 * |
| medium_2_code_review | 1654 | 0.25 | 31.14 | 53.6 |
| long_ctx_1_summary (~8k prompt) | 91 | 1.49 | 3.48 | 46.3 |
| long_ctx_2_extraction (same ctx) | 102 | 0.33 | 2.56 | 46.2 |
| epic_dflash_page (~11k gen) | 10775 | 1.00 | 241.21 | 44.9 |
* medium_1 ran with thinking enabled; the entire 2200-token budget was spent inside `<think>`, so no content chunks were emitted — TTFT/gen not measurable separately, but the end-to-end rate was still ~53 tok/s.
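That failure mode is visible in the raw stream: the output is a single unterminated `<think>` block. A small helper (the tag format is Qwen's; the function itself is an illustrative sketch, not part of llama.cpp) that separates reasoning from answer content:

```python
import re

def split_thinking(text):
    """Split Qwen-style output into (thinking, content).

    Assumes reasoning is wrapped in <think>...</think>. If the closing
    tag never arrives (budget exhausted mid-think), everything after
    <think> counts as reasoning and content is empty.
    """
    m = re.match(r"\s*<think>(.*?)(?:</think>(.*))?$", text, re.DOTALL)
    if not m:
        return "", text.strip()  # no think block: all content
    thinking = m.group(1)
    content = m.group(2) if m.group(2) is not None else ""
    return thinking.strip(), content.strip()
```

For medium_1, the second branch fires: 2200 tokens of reasoning, empty content.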
long_ctx_2 ran against the same ~8k document as long_ctx_1 and saw TTFT drop from 1.49 s to 0.33 s — 4.5× faster prefill for the reused prefix.

medium_1 emitted no content at all: the thinking budget consumed the full 2200 tokens. Disable thinking or allocate ≥ 4K tokens for tasks that combine reasoning + output.

Qwen3.6-27B ships with MTP weights in its FP8 release. Published vLLM benchmarks on other 27-30B dense targets with MTP @ num_speculative_tokens=2 show ~1.8-2.2× speedup over baseline autoregressive. On this hardware that projects to 95-115 tok/s. It is not delivered today, gated purely on CUDA toolchain versioning and a looser VRAM budget (a ~40 GB card or 48 GB Pro would fit FP8 + KV + MTP comfortably).
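The 1.8-2.2× band is consistent with the standard speculative-decoding expectation: with per-token acceptance rate α and k drafted tokens, each target forward pass yields (1 − α^(k+1)) / (1 − α) tokens on average. A sketch of the projection (α and the drafter-cost ratio are illustrative assumptions, not measured values):

```python
def expected_tokens_per_pass(alpha, k):
    """Expected tokens accepted per target forward pass (including the
    bonus token) with k speculative tokens and acceptance rate alpha."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def projected_rate(base_tok_s, alpha, k, draft_cost=0.0):
    """Project throughput: tokens per pass divided by the relative cost
    of one step (1 target pass plus k draft steps at draft_cost each)."""
    return base_tok_s * expected_tokens_per_pass(alpha, k) / (1 + k * draft_cost)
```

With the table's ~53 tok/s baseline and k=2, acceptance rates in the 0.6-0.75 range and a cheap drafter land squarely in the projected 95-115 tok/s band.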
DFlash with a Qwen3.6-27B-specific drafter (not yet released) would stack on top — the paper reports 6× lossless acceleration and 2.5× over EAGLE-3. Revisit when z-lab/Qwen3.6-27B-DFlash ships.