Block Diffusion for
Flash Speculative Decoding

Replace the autoregressive drafter with a lightweight block-diffusion model that produces an entire block of tokens in a single parallel pass.

[Hero diagram: drafter produces T1, T2, T3 in parallel; target verifies in a single forward pass]

The Autoregressive Bottleneck

Classical speculative decoding uses a small autoregressive drafter. While faster than the target model, the drafter itself must generate tokens sequentially. This creates a memory-bound serial bottleneck that limits acceleration.
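The serial dependency can be seen in a toy sketch of the drafting loop. `draft_next` below is a stand-in for a small autoregressive drafter, not the real model: each call consumes the token produced by the previous call, so the K draft steps cannot run in parallel.

```python
# Toy illustration of the serial drafting loop in classical
# speculative decoding. Each drafter step depends on the output of
# the step before it, so K draft tokens cost K dependent passes.

def draft_next(context: list[int]) -> int:
    """Hypothetical drafter step: next token from the current context."""
    return (context[-1] * 31 + 7) % 1000  # stand-in for a model forward pass

def draft_block_serial(prefix: list[int], k: int) -> list[int]:
    """K sequential drafter calls -> K dependent 'forward passes'."""
    context = list(prefix)
    for _ in range(k):
        context.append(draft_next(context))  # step t needs step t-1's output
    return context[len(prefix):]

block = draft_block_serial([1, 2, 3], k=4)
```

Even with a tiny drafter, the loop body cannot be batched across steps, which is exactly the memory-bound bottleneck described above.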

[Diagram: drafter emits T-1, T0, T1, T2, T3, ... one token at a time — serial and memory-bound]

The DFlash Idea

DFlash replaces the drafter with a lightweight block-diffusion model. It produces an entire block of draft tokens in a single parallel forward pass, conditioned on hidden-state features from the target model.

Standard Speculative

Drafter generates token by token. Each step depends on the previous one.

[Diagram: target T, then drafts D1 → D2 → D3 generated sequentially]

DFlash Block Diffusion

Drafter generates the whole block at once. Target verifies in one pass.

[Diagram: one parallel drafting pass T, one verification pass V]
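The contrast can be sketched end to end with toy stand-ins for both models. `draft_block_parallel` emits the whole K-token block in one call (the role of the block-diffusion drafter), and the target then accepts the longest matching prefix, appending one corrected token on the first mismatch — the standard greedy verification rule. All names here are illustrative, not the DFlash API, and in a real system the target checks every position with a single batched forward pass rather than a Python loop.

```python
# Minimal sketch: parallel block drafting + greedy speculative
# verification, with toy deterministic models standing in for the
# block-diffusion drafter and the target.

def draft_block_parallel(prefix: list[int], k: int) -> list[int]:
    """One parallel pass: all K positions produced from the prefix at once."""
    return [(prefix[-1] + i) % 1000 for i in range(1, k + 1)]

def target_greedy(context: list[int]) -> int:
    """Toy target model: deterministic next token for a given context."""
    return (context[-1] + 1) % 1000

def verify(prefix: list[int], draft: list[int]) -> list[int]:
    """Accept the longest draft prefix matching the target's greedy
    choices; on the first mismatch, keep the target's token instead."""
    accepted, context = [], list(prefix)
    for d in draft:
        t = target_greedy(context)
        if d != t:
            accepted.append(t)  # target's token replaces the rejected draft
            break
        accepted.append(d)
        context.append(d)
    return accepted

out = verify([5], draft_block_parallel([5], k=4))
```

Because the drafter's block here happens to match the target's greedy choices, all four tokens are accepted — four tokens emitted for one verification pass.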

Reported Results

Benchmarked across multiple model families and tasks.

Lossless acceleration vs. baseline AR decoding
Higher speedup vs. EAGLE-3 (SOTA)
1 forward pass for entire-block drafting

How to Run It

Installation and usage examples for vLLM and SGLang.

Installation
git clone https://github.com/z-lab/dflash.git
cd dflash
uv pip install -e ".[vllm]"
uv pip install -U vllm --torch-backend=auto \
    --extra-index-url https://wheels.vllm.ai/nightly
vLLM Serve
vllm serve Qwen/Qwen3.5-27B \
  --speculative-config '{"method":"dflash",
    "model":"z-lab/Qwen3.5-27B-DFlash",
    "num_speculative_tokens":15}' \
  --attention-backend flash_attn \
  --max-num-batched-tokens 32768
SGLang Serve
python -m sglang.launch_server \
  --model-path Qwen/Qwen3.5-35B-A3B \
  --speculative-algorithm DFLASH \
  --speculative-draft-model-path z-lab/Qwen3.5-35B-A3B-DFlash \
  --speculative-num-draft-tokens 16 \
  --tp-size 1 \
  --attention-backend trtllm_mha \
  --speculative-draft-attention-backend fa4 \
  --mem-fraction-static 0.75 \
  --mamba-scheduler-strategy extra_buffer \
  --trust-remote-code
Benchmark
python -m dflash.benchmark --backend vllm \
  --base-url http://127.0.0.1:8000 \
  --model Qwen/Qwen3.5-27B \
  --dataset gsm8k --num-prompts 128 --concurrency 1
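Once a server is up, speculative decoding is transparent to clients: requests are identical to a non-speculative deployment. A minimal sketch, assuming the vLLM server from the command above is running locally and exposes the standard OpenAI-compatible `/v1/chat/completions` endpoint:

```python
# Build a chat-completions request against the local vLLM server.
# The DFlash drafter is invisible to the client; only latency changes.
import json
import urllib.request

payload = {
    "model": "Qwen/Qwen3.5-27B",
    "messages": [
        {"role": "user", "content": "Explain speculative decoding in one sentence."}
    ],
    "max_tokens": 128,
}

req = urllib.request.Request(
    "http://127.0.0.1:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# With the server running, uncomment to send the request:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```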

Drafter Collection

Available drafters as of 2026-04-22. Select the right drafter for your target model.

Qwen3.6-35B-A3B-DFlash
0.5B Parameters
Target: Qwen3.6-35B-A3B
Qwen3.5-27B-DFlash
2B Parameters
Target: Qwen3.5-27B
Qwen3.5-4B-DFlash
Lightweight
Target: Qwen3.5-4B
Qwen3.5-9B-DFlash
Lightweight
Target: Qwen3.5-9B
Qwen3.5-35B-A3B-DFlash
Balanced
Target: Qwen3.5-35B-A3B
gpt-oss-20b-DFlash
Open Source
Target: gpt-oss-20b
gpt-oss-120b-DFlash
High Capacity
Target: gpt-oss-120b
Kimi-K2.5-DFlash
3B Parameters
Target: Kimi-K2.5